Merge tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

+16 -6

Documentation/admin-guide/cgroup-v2.rst

··· 533 533 Because the resource control interface files in a given directory 534 534 control the distribution of the parent's resources, the delegatee 535 535 shouldn't be allowed to write to them. For the first method, this is 536 - achieved by not granting access to these files. For the second, the 537 - kernel rejects writes to all files other than "cgroup.procs" and 538 - "cgroup.subtree_control" on a namespace root from inside the 539 - namespace. 536 + achieved by not granting access to these files. For the second, files 537 + outside the namespace should be hidden from the delegatee by the means 538 + of at least mount namespacing, and the kernel rejects writes to all 539 + files on a namespace root from inside the cgroup namespace, except for 540 + those files listed in "/sys/kernel/cgroup/delegate" (including 541 + "cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.). 540 542 541 543 The end results are equivalent for both delegation types. Once 542 544 delegated, the user can build sub-hierarchy under the directory, ··· 982 980 983 981 A dying cgroup can consume system resources not exceeding 984 982 limits, which were active at the moment of cgroup deletion. 983 + 984 + nr_subsys_<cgroup_subsys> 985 + Total number of live cgroup subsystems (e.g memory 986 + cgroup) at and beneath the current cgroup. 987 + 988 + nr_dying_subsys_<cgroup_subsys> 989 + Total number of dying cgroup subsystems (e.g. memory 990 + cgroup) at and beneath the current cgroup. 985 991 986 992 cgroup.freeze 987 993 A read-write single value file which exists on non-root cgroups. ··· 2950 2940 2951 2941 - "cgroup.clone_children" is removed. 2952 2942 2953 - - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file 2954 - at the root instead. 2943 + - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or 2944 + "cgroup.stat" files at the root instead. 2955 2945 2956 2946 2957 2947 Issues with v1 and Rationales for v2
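The per-subsystem counters documented above show up in "cgroup.stat" as "nr_subsys_<name> <count>" and "nr_dying_subsys_<name> <count>" lines. Below is a minimal userspace sketch of looking one up, assuming the file's text has already been read into a NUL-terminated buffer. The helper name cgroup_stat_lookup() is hypothetical and not part of this merge.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan "key value" lines of cgroup.stat text and
 * return the integer stored under @key, or -1 if the key is absent.
 * The key must match a whole field name, so "nr_subsys" does not match
 * a "nr_subsys_memory" line.
 */
static int cgroup_stat_lookup(const char *stat_text, const char *key)
{
	const char *line = stat_text;
	size_t klen = strlen(key);

	while (line && *line) {
		int val;

		if (strncmp(line, key, klen) == 0 && line[klen] == ' ' &&
		    sscanf(line + klen, "%d", &val) == 1)
			return val;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

With a buffer holding the (made-up) lines "nr_subsys_memory 3" and "nr_dying_subsys_memory 1", looking up "nr_subsys_memory" yields 3, and a subsystem that is not listed yields -1.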

+1

Documentation/core-api/index.rst

···
    wrappers/atomic_t
    wrappers/atomic_bitops
    floating-point
+   union_find

 Low level entry and exit
 ========================

+106

Documentation/core-api/union_find.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ==================== 4 + Union-Find in Linux 5 + ==================== 6 + 7 + 8 + :Date: June 21, 2024 9 + :Author: Xavier <xavier_qy@163.com> 10 + 11 + What is union-find, and what is it used for? 12 + ------------------------------------------------ 13 + 14 + Union-find is a data structure used to handle the merging and querying 15 + of disjoint sets. The primary operations supported by union-find are: 16 + 17 + Initialization: Resetting each element as an individual set, with 18 + each set's initial parent node pointing to itself. 19 + 20 + Find: Determine which set a particular element belongs to, usually by 21 + returning a “representative element” of that set. This operation 22 + is used to check if two elements are in the same set. 23 + 24 + Union: Merge two sets into one. 25 + 26 + As a data structure used to maintain sets (groups), union-find is commonly 27 + utilized to solve problems related to offline queries, dynamic connectivity, 28 + and graph theory. It is also a key component in Kruskal's algorithm for 29 + computing the minimum spanning tree, which is crucial in scenarios like 30 + network routing. Consequently, union-find is widely referenced. Additionally, 31 + union-find has applications in symbolic computation, register allocation, 32 + and more. 33 + 34 + Space Complexity: O(n), where n is the number of nodes. 35 + 36 + Time Complexity: Using path compression can reduce the time complexity of 37 + the find operation, and using union by rank can reduce the time complexity 38 + of the union operation. These optimizations reduce the average time 39 + complexity of each find and union operation to O(α(n)), where α(n) is the 40 + inverse Ackermann function. This can be roughly considered a constant time 41 + complexity for practical purposes. 42 + 43 + This document covers use of the Linux union-find implementation. 
For more 44 + information on the nature and implementation of union-find, see: 45 + 46 + Wikipedia entry on union-find 47 + https://en.wikipedia.org/wiki/Disjoint-set_data_structure 48 + 49 + Linux implementation of union-find 50 + ----------------------------------- 51 + 52 + Linux's union-find implementation resides in the file "lib/union_find.c". 53 + To use it, "#include <linux/union_find.h>". 54 + 55 + The union-find data structure is defined as follows:: 56 + 57 + struct uf_node { 58 + struct uf_node *parent; 59 + unsigned int rank; 60 + }; 61 + 62 + In this structure, parent points to the parent node of the current node. 63 + The rank field represents the height of the current tree. During a union 64 + operation, the tree with the smaller rank is attached under the tree with the 65 + larger rank to maintain balance. 66 + 67 + Initializing union-find 68 + ----------------------- 69 + 70 + You can complete the initialization using either static or initialization 71 + interface. Initialize the parent pointer to point to itself and set the rank 72 + to 0. 73 + Example:: 74 + 75 + struct uf_node my_node = UF_INIT_NODE(my_node); 76 + 77 + or 78 + 79 + uf_node_init(&my_node); 80 + 81 + Find the Root Node of union-find 82 + -------------------------------- 83 + 84 + This operation is mainly used to determine whether two nodes belong to the same 85 + set in the union-find. If they have the same root, they are in the same set. 86 + During the find operation, path compression is performed to improve the 87 + efficiency of subsequent find operations. 
88 + Example:: 89 + 90 + int connected; 91 + struct uf_node *root1 = uf_find(&node_1); 92 + struct uf_node *root2 = uf_find(&node_2); 93 + if (root1 == root2) 94 + connected = 1; 95 + else 96 + connected = 0; 97 + 98 + Union Two Sets in union-find 99 + ---------------------------- 100 + 101 + To union two sets in the union-find, you first find their respective root nodes 102 + and then link the smaller node to the larger node based on the rank of the root 103 + nodes. 104 + Example:: 105 + 106 + uf_union(&node_1, &node_2);
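The document above describes find with path compression and union by rank but does not show their bodies. The following is a userspace sketch of those two operations on the same struct layout. The function bodies are assumptions written for illustration, not the contents of lib/union_find.c, and uf_connected() is a hypothetical convenience helper.

```c
#include <stddef.h>

struct uf_node {
	struct uf_node *parent;
	unsigned int rank;
};

/* Make @node a singleton set: it is its own parent, with rank 0. */
static void uf_node_init(struct uf_node *node)
{
	node->parent = node;
	node->rank = 0;
}

/* Return the root of @node's set, compressing the path on the way. */
static struct uf_node *uf_find(struct uf_node *node)
{
	struct uf_node *root = node;

	/* First pass: walk up to the root. */
	while (root->parent != root)
		root = root->parent;

	/* Second pass: point every node on the path directly at the root. */
	while (node->parent != root) {
		struct uf_node *next = node->parent;

		node->parent = root;
		node = next;
	}
	return root;
}

/* Merge the sets containing @a and @b, attaching the lower-rank root. */
static void uf_union(struct uf_node *a, struct uf_node *b)
{
	struct uf_node *ra = uf_find(a);
	struct uf_node *rb = uf_find(b);

	if (ra == rb)
		return;

	if (ra->rank < rb->rank) {
		ra->parent = rb;
	} else if (ra->rank > rb->rank) {
		rb->parent = ra;
	} else {
		/* Equal ranks: pick one root and grow its rank by one. */
		rb->parent = ra;
		ra->rank++;
	}
}

/* Hypothetical helper: nonzero if @a and @b are in the same set. */
static int uf_connected(struct uf_node *a, struct uf_node *b)
{
	return uf_find(a) == uf_find(b);
}
```

Rank only changes when two roots of equal rank are merged, which is what keeps the trees shallow. Path compression then flattens whatever depth remains on each lookup.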

+1

Documentation/translations/zh_CN/core-api/index.rst

···
    generic-radix-tree
    packing
    this_cpu_ops
+   union_find

 =======

+92

Documentation/translations/zh_CN/core-api/union_find.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + .. include:: ../disclaimer-zh_CN.rst 3 + 4 + :Original: Documentation/core-api/union_find.rst 5 + 6 + ============================= 7 + Linux中的并查集（Union-Find） 8 + ============================= 9 + 10 + 11 + :日期: 2024年6月21日 12 + :作者: Xavier <xavier_qy@163.com> 13 + 14 + 何为并查集，它有什么用？ 15 + ------------------------ 16 + 17 + 并查集是一种数据结构，用于处理一些不交集的合并及查询问题。并查集支持的主要操作： 18 + 初始化：将每个元素初始化为单独的集合，每个集合的初始父节点指向自身。 19 + 20 + 查询：查询某个元素属于哪个集合，通常是返回集合中的一个“代表元素”。这个操作是为 21 + 了判断两个元素是否在同一个集合之中。 22 + 23 + 合并：将两个集合合并为一个。 24 + 25 + 并查集作为一种用于维护集合（组）的数据结构，它通常用于解决一些离线查询、动态连通性和 26 + 图论等相关问题，同时也是用于计算最小生成树的克鲁斯克尔算法中的关键，由于最小生成树在 27 + 网络路由等场景下十分重要，并查集也得到了广泛的引用。此外，并查集在符号计算，寄存器分 28 + 配等方面也有应用。 29 + 30 + 空间复杂度: O(n)，n为节点数。 31 + 32 + 时间复杂度：使用路径压缩可以减少查找操作的时间复杂度，使用按秩合并可以减少合并操作的 33 + 时间复杂度，使得并查集每个查询和合并操作的平均时间复杂度仅为O(α(n))，其中α(n)是反阿 34 + 克曼函数，可以粗略地认为并查集的操作有常数的时间复杂度。 35 + 36 + 本文档涵盖了对Linux并查集实现的使用方法。更多关于并查集的性质和实现的信息，参见： 37 + 38 + 维基百科并查集词条 39 + https://en.wikipedia.org/wiki/Disjoint-set_data_structure 40 + 41 + 并查集的Linux实现 42 + ------------------ 43 + 44 + Linux的并查集实现在文件“lib/union_find.c”中。要使用它，需要 45 + “#include <linux/union_find.h>”。 46 + 47 + 并查集的数据结构定义如下:: 48 + 49 + struct uf_node { 50 + struct uf_node *parent; 51 + unsigned int rank; 52 + }; 53 + 54 + 其中parent为当前节点的父节点，rank为当前树的高度，在合并时将rank小的节点接到rank大 55 + 的节点下面以增加平衡性。 56 + 57 + 初始化并查集 58 + ------------- 59 + 60 + 可以采用静态或初始化接口完成初始化操作。初始化时，parent 指针指向自身，rank 设置 61 + 为 0。 62 + 示例:: 63 + 64 + struct uf_node my_node = UF_INIT_NODE(my_node); 65 + 66 + 或 67 + 68 + uf_node_init(&my_node); 69 + 70 + 查找并查集的根节点 71 + ------------------ 72 + 73 + 主要用于判断两个并查集是否属于一个集合，如果根相同，那么他们就是一个集合。在查找过程中 74 + 会对路径进行压缩，提高后续查找效率。 75 + 示例:: 76 + 77 + int connected; 78 + struct uf_node *root1 = uf_find(&node_1); 79 + struct uf_node *root2 = uf_find(&node_2); 80 + if (root1 == root2) 81 + connected = 1; 82 + else 83 + connected = 0; 84 + 85 + 合并两个并查集 86 + -------------- 87 + 88 + 对于两个相交的并查集进行合并，会首先查找它们各自的根节点，然后根据根节点秩大小，将小的 89 + 
节点连接到大的节点下面。 90 + 示例:: 91 + 92 + uf_union(&node_1, &node_2);

+12

MAINTAINERS

···
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
 F: Documentation/admin-guide/cgroup-v1/cpusets.rst
 F: include/linux/cpuset.h
+F: kernel/cgroup/cpuset-internal.h
+F: kernel/cgroup/cpuset-v1.c
 F: kernel/cgroup/cpuset.c
 F: tools/testing/selftests/cgroup/test_cpuset.c
 F: tools/testing/selftests/cgroup/test_cpuset_prs.sh
+F: tools/testing/selftests/cgroup/test_cpuset_v1_base.sh

 CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
 M: Johannes Weiner <hannes@cmpxchg.org>
···
 F: drivers/cdrom/cdrom.c
 F: include/linux/cdrom.h
 F: include/uapi/linux/cdrom.h
+
+UNION-FIND
+M: Xavier <xavier_qy@163.com>
+L: linux-kernel@vger.kernel.org
+S: Maintained
+F: Documentation/core-api/union_find.rst
+F: Documentation/translations/zh_CN/core-api/union_find.rst
+F: include/linux/union_find.h
+F: lib/union_find.c

 UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER
 R: Alim Akhtar <alim.akhtar@samsung.com>

+14

include/linux/cgroup-defs.h

···
 	 * fields of the containing structure.
 	 */
 	struct cgroup_subsys_state *parent;
+
+	/*
+	 * Keep track of total numbers of visible descendant CSSes.
+	 * The total number of dying CSSes is tracked in
+	 * css->cgroup->nr_dying_subsys[ssid].
+	 * Protected by cgroup_mutex.
+	 */
+	int nr_descendants;
 };

 /*
···

 	/* Private pointers for each registered subsystem */
 	struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
+
+	/*
+	 * Keep track of total number of dying CSSes at and below this cgroup.
+	 * Protected by cgroup_mutex.
+	 */
+	int nr_dying_subsys[CGROUP_SUBSYS_COUNT];

 	struct cgroup_root *root;

+4 -6

include/linux/cpuset.h

···
 extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
 					  const struct task_struct *tsk2);

+#ifdef CONFIG_CPUSETS_V1
 #define cpuset_memory_pressure_bump() \
 	do { \
 		if (cpuset_memory_pressure_enabled) \
···
 	} while (0)
 extern int cpuset_memory_pressure_enabled;
 extern void __cpuset_memory_pressure_bump(void);
+#else
+static inline void cpuset_memory_pressure_bump(void) { }
+#endif

 extern void cpuset_task_status_allowed(struct seq_file *m,
 					struct task_struct *task);
···
 			    struct pid *pid, struct task_struct *tsk);

 extern int cpuset_mem_spread_node(void);
-extern int cpuset_slab_spread_node(void);

 static inline int cpuset_do_page_mem_spread(void)
 {
···
 }

 static inline int cpuset_mem_spread_node(void)
-{
-	return 0;
-}
-
-static inline int cpuset_slab_spread_node(void)
 {
 	return 0;
 }

-1

include/linux/sched.h

···
 	/* Sequence number to catch updates: */
 	seqcount_spinlock_t mems_allowed_seq;
 	int cpuset_mem_spread_rotor;
-	int cpuset_slab_spread_rotor;
 #endif
 #ifdef CONFIG_CGROUPS
 	/* Control Group info protected by css_set_lock: */

+41

include/linux/union_find.h

···
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_UNION_FIND_H
+#define __LINUX_UNION_FIND_H
+/**
+ * union_find.h - union-find data structure implementation
+ *
+ * This header provides functions and structures to implement the union-find
+ * data structure. The union-find data structure is used to manage disjoint
+ * sets and supports efficient union and find operations.
+ *
+ * See Documentation/core-api/union_find.rst for documentation and samples.
+ */
+
+struct uf_node {
+	struct uf_node *parent;
+	unsigned int rank;
+};
+
+/* This macro is used for static initialization of a union-find node. */
+#define UF_INIT_NODE(node) {.parent = &node, .rank = 0}
+
+/**
+ * uf_node_init - Initialize a union-find node
+ * @node: pointer to the union-find node to be initialized
+ *
+ * This function sets the parent of the node to itself and
+ * initializes its rank to 0.
+ */
+static inline void uf_node_init(struct uf_node *node)
+{
+	node->parent = node;
+	node->rank = 0;
+}
+
+/* find the root of a node */
+struct uf_node *uf_find(struct uf_node *node);
+
+/* Merge two intersecting nodes */
+void uf_union(struct uf_node *node1, struct uf_node *node2);
+
+#endif /* __LINUX_UNION_FIND_H */

+13

init/Kconfig

···

 	  Say N if unsure.

+config CPUSETS_V1
+	bool "Legacy cgroup v1 cpusets controller"
+	depends on CPUSETS
+	default n
+	help
+	  Legacy cgroup v1 cpusets controller which has been deprecated by
+	  cgroup v2 implementation. The v1 is there for legacy applications
+	  which haven't migrated to the new cgroup v2 interface yet. If you
+	  do not have any such application then you are completely fine leaving
+	  this option disabled.
+
+	  Say N if unsure.
+
 config PROC_PID_CPUSET
 	bool "Include legacy /proc/<pid>/cpuset file"
 	depends on CPUSETS

+1

kernel/cgroup/Makefile

···
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CPUSETS_V1) += cpuset-v1.o
 obj-$(CONFIG_CGROUP_MISC) += misc.o
 obj-$(CONFIG_CGROUP_DEBUG) += debug.o

+14 -3

kernel/cgroup/cgroup-v1.c

··· 46 46 return cgroup_no_v1_mask & (1 << ssid); 47 47 } 48 48 49 + static bool cgroup1_subsys_absent(struct cgroup_subsys *ss) 50 + { 51 + /* Check also dfl_cftypes for file-less controllers, i.e. perf_event */ 52 + return ss->legacy_cftypes == NULL && ss->dfl_cftypes; 53 + } 54 + 49 55 /** 50 56 * cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from' 51 57 * @from: attach to all cgroups of a given task ··· 681 675 * cgroup_mutex contention. 682 676 */ 683 677 684 - for_each_subsys(ss, i) 678 + for_each_subsys(ss, i) { 679 + if (cgroup1_subsys_absent(ss)) 680 + continue; 685 681 seq_printf(m, "%s\t%d\t%d\t%d\n", 686 682 ss->legacy_name, ss->root->hierarchy_id, 687 683 atomic_read(&ss->root->nr_cgrps), 688 684 cgroup_ssid_enabled(i)); 685 + } 689 686 690 687 return 0; 691 688 } ··· 941 932 if (ret != -ENOPARAM) 942 933 return ret; 943 934 for_each_subsys(ss, i) { 944 - if (strcmp(param->key, ss->legacy_name)) 935 + if (strcmp(param->key, ss->legacy_name) || 936 + cgroup1_subsys_absent(ss)) 945 937 continue; 946 938 if (!cgroup_ssid_enabled(i) || cgroup1_ssid_disabled(i)) 947 939 return invalfc(fc, "Disabled controller '%s'", ··· 1034 1024 mask = ~((u16)1 << cpuset_cgrp_id); 1035 1025 #endif 1036 1026 for_each_subsys(ss, i) 1037 - if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i)) 1027 + if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i) && 1028 + !cgroup1_subsys_absent(ss)) 1038 1029 enabled |= 1 << i; 1039 1030 1040 1031 ctx->subsys_mask &= enabled;

+63 -5

kernel/cgroup/cgroup.c

··· 2331 2331 .fs_flags = FS_USERNS_MOUNT, 2332 2332 }; 2333 2333 2334 - #ifdef CONFIG_CPUSETS 2334 + #ifdef CONFIG_CPUSETS_V1 2335 2335 static const struct fs_context_operations cpuset_fs_context_ops = { 2336 2336 .get_tree = cgroup1_get_tree, 2337 2337 .free = cgroup_fs_context_free, ··· 3669 3669 static int cgroup_stat_show(struct seq_file *seq, void *v) 3670 3670 { 3671 3671 struct cgroup *cgroup = seq_css(seq)->cgroup; 3672 + struct cgroup_subsys_state *css; 3673 + int dying_cnt[CGROUP_SUBSYS_COUNT]; 3674 + int ssid; 3672 3675 3673 3676 seq_printf(seq, "nr_descendants %d\n", 3674 3677 cgroup->nr_descendants); 3678 + 3679 + /* 3680 + * Show the number of live and dying csses associated with each of 3681 + * non-inhibited cgroup subsystems that is bound to cgroup v2. 3682 + * 3683 + * Without proper lock protection, racing is possible. So the 3684 + * numbers may not be consistent when that happens. 3685 + */ 3686 + rcu_read_lock(); 3687 + for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) { 3688 + dying_cnt[ssid] = -1; 3689 + if ((BIT(ssid) & cgrp_dfl_inhibit_ss_mask) || 3690 + (cgroup_subsys[ssid]->root != &cgrp_dfl_root)) 3691 + continue; 3692 + css = rcu_dereference_raw(cgroup->subsys[ssid]); 3693 + dying_cnt[ssid] = cgroup->nr_dying_subsys[ssid]; 3694 + seq_printf(seq, "nr_subsys_%s %d\n", cgroup_subsys[ssid]->name, 3695 + css ? 
(css->nr_descendants + 1) : 0); 3696 + } 3697 + 3675 3698 seq_printf(seq, "nr_dying_descendants %d\n", 3676 3699 cgroup->nr_dying_descendants); 3677 - 3700 + for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) { 3701 + if (dying_cnt[ssid] >= 0) 3702 + seq_printf(seq, "nr_dying_subsys_%s %d\n", 3703 + cgroup_subsys[ssid]->name, dying_cnt[ssid]); 3704 + } 3705 + rcu_read_unlock(); 3678 3706 return 0; 3679 3707 } 3680 3708 ··· 4124 4096 * If namespaces are delegation boundaries, disallow writes to 4125 4097 * files in an non-init namespace root from inside the namespace 4126 4098 * except for the files explicitly marked delegatable - 4127 - * cgroup.procs and cgroup.subtree_control. 4099 + * eg. cgroup.procs, cgroup.threads and cgroup.subtree_control. 4128 4100 */ 4129 4101 if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) && 4130 4102 !(cft->flags & CFTYPE_NS_DELEGATABLE) && ··· 5452 5424 list_del_rcu(&css->sibling); 5453 5425 5454 5426 if (ss) { 5427 + struct cgroup *parent_cgrp; 5428 + 5455 5429 /* css release path */ 5456 5430 if (!list_empty(&css->rstat_css_node)) { 5457 5431 cgroup_rstat_flush(cgrp); ··· 5463 5433 cgroup_idr_replace(&ss->css_idr, NULL, css->id); 5464 5434 if (ss->css_released) 5465 5435 ss->css_released(css); 5436 + 5437 + cgrp->nr_dying_subsys[ss->id]--; 5438 + /* 5439 + * When a css is released and ready to be freed, its 5440 + * nr_descendants must be zero. However, the corresponding 5441 + * cgrp->nr_dying_subsys[ss->id] may not be 0 if a subsystem 5442 + * is activated and deactivated multiple times with one or 5443 + * more of its previous activation leaving behind dying csses. 
5444 + */ 5445 + WARN_ON_ONCE(css->nr_descendants); 5446 + parent_cgrp = cgroup_parent(cgrp); 5447 + while (parent_cgrp) { 5448 + parent_cgrp->nr_dying_subsys[ss->id]--; 5449 + parent_cgrp = cgroup_parent(parent_cgrp); 5450 + } 5466 5451 } else { 5467 5452 struct cgroup *tcgrp; 5468 5453 ··· 5562 5517 rcu_assign_pointer(css->cgroup->subsys[ss->id], css); 5563 5518 5564 5519 atomic_inc(&css->online_cnt); 5565 - if (css->parent) 5520 + if (css->parent) { 5566 5521 atomic_inc(&css->parent->online_cnt); 5522 + while ((css = css->parent)) 5523 + css->nr_descendants++; 5524 + } 5567 5525 } 5568 5526 return ret; 5569 5527 } ··· 5588 5540 RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL); 5589 5541 5590 5542 wake_up_all(&css->cgroup->offline_waitq); 5543 + 5544 + css->cgroup->nr_dying_subsys[ss->id]++; 5545 + /* 5546 + * Parent css and cgroup cannot be freed until after the freeing 5547 + * of child css, see css_free_rwork_fn(). 5548 + */ 5549 + while ((css = css->parent)) { 5550 + css->nr_descendants--; 5551 + css->cgroup->nr_dying_subsys[ss->id]++; 5552 + } 5591 5553 } 5592 5554 5593 5555 /** ··· 6236 6178 WARN_ON(register_filesystem(&cgroup_fs_type)); 6237 6179 WARN_ON(register_filesystem(&cgroup2_fs_type)); 6238 6180 WARN_ON(!proc_create_single("cgroups", 0, NULL, proc_cgroupstats_show)); 6239 - #ifdef CONFIG_CPUSETS 6181 + #ifdef CONFIG_CPUSETS_V1 6240 6182 WARN_ON(register_filesystem(&cpuset_fs_type)); 6241 6183 #endif 6242 6184
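The counter updates in the online, offline and release hunks above are easier to follow in isolation. The sketch below is a simplified userspace model of that bookkeeping for a single subsystem, assuming one css per cgroup so that the css parent chain and the cgroup parent chain coincide. All model_* names are hypothetical, and locking and RCU are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one subsystem's css together with its cgroup. */
struct model_css {
	struct model_css *parent;
	int nr_descendants;	/* live descendant csses, self excluded */
	int nr_dying;		/* models cgroup->nr_dying_subsys[ssid] */
};

/* Online: every ancestor gains one live descendant. */
static void model_online(struct model_css *css)
{
	while ((css = css->parent))
		css->nr_descendants++;
}

/*
 * Offline: the css becomes dying. It is counted as dying at its own
 * level and at every ancestor, and stops counting as a live descendant.
 */
static void model_offline(struct model_css *css)
{
	css->nr_dying++;
	while ((css = css->parent)) {
		css->nr_descendants--;
		css->nr_dying++;
	}
}

/* Release: the dying css is gone, so undo the dying counts. */
static void model_release(struct model_css *css)
{
	/* Mirrors the WARN_ON_ONCE(): no live descendants may remain. */
	assert(css->nr_descendants == 0);
	for (; css; css = css->parent)
		css->nr_dying--;
}

/* Value reported as nr_subsys_<name>: self plus live descendants. */
static int model_nr_subsys(const struct model_css *css)
{
	return css ? css->nr_descendants + 1 : 0;
}
```

For a chain root -> a -> b with all three online, the model reports 3 at the root. Taking b offline drops that to 2 and raises the dying count to 1 at b, a and the root until b is released.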

+305

kernel/cgroup/cpuset-internal.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0-or-later */ 2 + 3 + #ifndef __CPUSET_INTERNAL_H 4 + #define __CPUSET_INTERNAL_H 5 + 6 + #include <linux/cgroup.h> 7 + #include <linux/cpu.h> 8 + #include <linux/cpumask.h> 9 + #include <linux/cpuset.h> 10 + #include <linux/spinlock.h> 11 + #include <linux/union_find.h> 12 + 13 + /* See "Frequency meter" comments, below. */ 14 + 15 + struct fmeter { 16 + int cnt; /* unprocessed events count */ 17 + int val; /* most recent output value */ 18 + time64_t time; /* clock (secs) when val computed */ 19 + spinlock_t lock; /* guards read or write of above */ 20 + }; 21 + 22 + /* 23 + * Invalid partition error code 24 + */ 25 + enum prs_errcode { 26 + PERR_NONE = 0, 27 + PERR_INVCPUS, 28 + PERR_INVPARENT, 29 + PERR_NOTPART, 30 + PERR_NOTEXCL, 31 + PERR_NOCPUS, 32 + PERR_HOTPLUG, 33 + PERR_CPUSEMPTY, 34 + PERR_HKEEPING, 35 + PERR_ACCESS, 36 + }; 37 + 38 + /* bits in struct cpuset flags field */ 39 + typedef enum { 40 + CS_ONLINE, 41 + CS_CPU_EXCLUSIVE, 42 + CS_MEM_EXCLUSIVE, 43 + CS_MEM_HARDWALL, 44 + CS_MEMORY_MIGRATE, 45 + CS_SCHED_LOAD_BALANCE, 46 + CS_SPREAD_PAGE, 47 + CS_SPREAD_SLAB, 48 + } cpuset_flagbits_t; 49 + 50 + /* The various types of files and directories in a cpuset file system */ 51 + 52 + typedef enum { 53 + FILE_MEMORY_MIGRATE, 54 + FILE_CPULIST, 55 + FILE_MEMLIST, 56 + FILE_EFFECTIVE_CPULIST, 57 + FILE_EFFECTIVE_MEMLIST, 58 + FILE_SUBPARTS_CPULIST, 59 + FILE_EXCLUSIVE_CPULIST, 60 + FILE_EFFECTIVE_XCPULIST, 61 + FILE_ISOLATED_CPULIST, 62 + FILE_CPU_EXCLUSIVE, 63 + FILE_MEM_EXCLUSIVE, 64 + FILE_MEM_HARDWALL, 65 + FILE_SCHED_LOAD_BALANCE, 66 + FILE_PARTITION_ROOT, 67 + FILE_SCHED_RELAX_DOMAIN_LEVEL, 68 + FILE_MEMORY_PRESSURE_ENABLED, 69 + FILE_MEMORY_PRESSURE, 70 + FILE_SPREAD_PAGE, 71 + FILE_SPREAD_SLAB, 72 + } cpuset_filetype_t; 73 + 74 + struct cpuset { 75 + struct cgroup_subsys_state css; 76 + 77 + unsigned long flags; /* "unsigned long" so bitops work */ 78 + 79 + /* 80 + * On default hierarchy: 81 + * 82 + * 
The user-configured masks can only be changed by writing to 83 + * cpuset.cpus and cpuset.mems, and won't be limited by the 84 + * parent masks. 85 + * 86 + * The effective masks is the real masks that apply to the tasks 87 + * in the cpuset. They may be changed if the configured masks are 88 + * changed or hotplug happens. 89 + * 90 + * effective_mask == configured_mask & parent's effective_mask, 91 + * and if it ends up empty, it will inherit the parent's mask. 92 + * 93 + * 94 + * On legacy hierarchy: 95 + * 96 + * The user-configured masks are always the same with effective masks. 97 + */ 98 + 99 + /* user-configured CPUs and Memory Nodes allow to tasks */ 100 + cpumask_var_t cpus_allowed; 101 + nodemask_t mems_allowed; 102 + 103 + /* effective CPUs and Memory Nodes allow to tasks */ 104 + cpumask_var_t effective_cpus; 105 + nodemask_t effective_mems; 106 + 107 + /* 108 + * Exclusive CPUs dedicated to current cgroup (default hierarchy only) 109 + * 110 + * The effective_cpus of a valid partition root comes solely from its 111 + * effective_xcpus and some of the effective_xcpus may be distributed 112 + * to sub-partitions below & hence excluded from its effective_cpus. 113 + * For a valid partition root, its effective_cpus have no relationship 114 + * with cpus_allowed unless its exclusive_cpus isn't set. 115 + * 116 + * This value will only be set if either exclusive_cpus is set or 117 + * when this cpuset becomes a local partition root. 118 + */ 119 + cpumask_var_t effective_xcpus; 120 + 121 + /* 122 + * Exclusive CPUs as requested by the user (default hierarchy only) 123 + * 124 + * Its value is independent of cpus_allowed and designates the set of 125 + * CPUs that can be granted to the current cpuset or its children when 126 + * it becomes a valid partition root. 
The effective set of exclusive 127 + * CPUs granted (effective_xcpus) depends on whether those exclusive 128 + * CPUs are passed down by its ancestors and not yet taken up by 129 + * another sibling partition root along the way. 130 + * 131 + * If its value isn't set, it defaults to cpus_allowed. 132 + */ 133 + cpumask_var_t exclusive_cpus; 134 + 135 + /* 136 + * This is old Memory Nodes tasks took on. 137 + * 138 + * - top_cpuset.old_mems_allowed is initialized to mems_allowed. 139 + * - A new cpuset's old_mems_allowed is initialized when some 140 + * task is moved into it. 141 + * - old_mems_allowed is used in cpuset_migrate_mm() when we change 142 + * cpuset.mems_allowed and have tasks' nodemask updated, and 143 + * then old_mems_allowed is updated to mems_allowed. 144 + */ 145 + nodemask_t old_mems_allowed; 146 + 147 + struct fmeter fmeter; /* memory_pressure filter */ 148 + 149 + /* 150 + * Tasks are being attached to this cpuset. Used to prevent 151 + * zeroing cpus/mems_allowed between ->can_attach() and ->attach(). 152 + */ 153 + int attach_in_progress; 154 + 155 + /* for custom sched domain */ 156 + int relax_domain_level; 157 + 158 + /* number of valid local child partitions */ 159 + int nr_subparts; 160 + 161 + /* partition root state */ 162 + int partition_root_state; 163 + 164 + /* 165 + * number of SCHED_DEADLINE tasks attached to this cpuset, so that we 166 + * know when to rebuild associated root domain bandwidth information. 
167 + */ 168 + int nr_deadline_tasks; 169 + int nr_migrate_dl_tasks; 170 + u64 sum_migrate_dl_bw; 171 + 172 + /* Invalid partition error code, not lock protected */ 173 + enum prs_errcode prs_err; 174 + 175 + /* Handle for cpuset.cpus.partition */ 176 + struct cgroup_file partition_file; 177 + 178 + /* Remote partition silbling list anchored at remote_children */ 179 + struct list_head remote_sibling; 180 + 181 + /* Used to merge intersecting subsets for generate_sched_domains */ 182 + struct uf_node node; 183 + }; 184 + 185 + static inline struct cpuset *css_cs(struct cgroup_subsys_state *css) 186 + { 187 + return css ? container_of(css, struct cpuset, css) : NULL; 188 + } 189 + 190 + /* Retrieve the cpuset for a task */ 191 + static inline struct cpuset *task_cs(struct task_struct *task) 192 + { 193 + return css_cs(task_css(task, cpuset_cgrp_id)); 194 + } 195 + 196 + static inline struct cpuset *parent_cs(struct cpuset *cs) 197 + { 198 + return css_cs(cs->css.parent); 199 + } 200 + 201 + /* convenient tests for these bits */ 202 + static inline bool is_cpuset_online(struct cpuset *cs) 203 + { 204 + return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css); 205 + } 206 + 207 + static inline int is_cpu_exclusive(const struct cpuset *cs) 208 + { 209 + return test_bit(CS_CPU_EXCLUSIVE, &cs->flags); 210 + } 211 + 212 + static inline int is_mem_exclusive(const struct cpuset *cs) 213 + { 214 + return test_bit(CS_MEM_EXCLUSIVE, &cs->flags); 215 + } 216 + 217 + static inline int is_mem_hardwall(const struct cpuset *cs) 218 + { 219 + return test_bit(CS_MEM_HARDWALL, &cs->flags); 220 + } 221 + 222 + static inline int is_sched_load_balance(const struct cpuset *cs) 223 + { 224 + return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags); 225 + } 226 + 227 + static inline int is_memory_migrate(const struct cpuset *cs) 228 + { 229 + return test_bit(CS_MEMORY_MIGRATE, &cs->flags); 230 + } 231 + 232 + static inline int is_spread_page(const struct cpuset *cs) 233 + { 234 + return 
test_bit(CS_SPREAD_PAGE, &cs->flags); 235 + } 236 + 237 + static inline int is_spread_slab(const struct cpuset *cs) 238 + { 239 + return test_bit(CS_SPREAD_SLAB, &cs->flags); 240 + } 241 + 242 + /** 243 + * cpuset_for_each_child - traverse online children of a cpuset 244 + * @child_cs: loop cursor pointing to the current child 245 + * @pos_css: used for iteration 246 + * @parent_cs: target cpuset to walk children of 247 + * 248 + * Walk @child_cs through the online children of @parent_cs. Must be used 249 + * with RCU read locked. 250 + */ 251 + #define cpuset_for_each_child(child_cs, pos_css, parent_cs) \ 252 + css_for_each_child((pos_css), &(parent_cs)->css) \ 253 + if (is_cpuset_online(((child_cs) = css_cs((pos_css))))) 254 + 255 + /** 256 + * cpuset_for_each_descendant_pre - pre-order walk of a cpuset's descendants 257 + * @des_cs: loop cursor pointing to the current descendant 258 + * @pos_css: used for iteration 259 + * @root_cs: target cpuset to walk ancestor of 260 + * 261 + * Walk @des_cs through the online descendants of @root_cs. Must be used 262 + * with RCU read locked. The caller may modify @pos_css by calling 263 + * css_rightmost_descendant() to skip subtree. @root_cs is included in the 264 + * iteration and the first node to be visited. 
265 + */ 266 + #define cpuset_for_each_descendant_pre(des_cs, pos_css, root_cs) \ 267 + css_for_each_descendant_pre((pos_css), &(root_cs)->css) \ 268 + if (is_cpuset_online(((des_cs) = css_cs((pos_css))))) 269 + 270 + void rebuild_sched_domains_locked(void); 271 + void cpuset_callback_lock_irq(void); 272 + void cpuset_callback_unlock_irq(void); 273 + void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus); 274 + void cpuset_update_tasks_nodemask(struct cpuset *cs); 275 + int cpuset_update_flag(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on); 276 + ssize_t cpuset_write_resmask(struct kernfs_open_file *of, 277 + char *buf, size_t nbytes, loff_t off); 278 + int cpuset_common_seq_show(struct seq_file *sf, void *v); 279 + 280 + /* 281 + * cpuset-v1.c 282 + */ 283 + #ifdef CONFIG_CPUSETS_V1 284 + extern struct cftype cpuset1_files[]; 285 + void fmeter_init(struct fmeter *fmp); 286 + void cpuset1_update_task_spread_flags(struct cpuset *cs, 287 + struct task_struct *tsk); 288 + void cpuset1_update_tasks_flags(struct cpuset *cs); 289 + void cpuset1_hotplug_update_tasks(struct cpuset *cs, 290 + struct cpumask *new_cpus, nodemask_t *new_mems, 291 + bool cpus_updated, bool mems_updated); 292 + int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial); 293 + #else 294 + static inline void fmeter_init(struct fmeter *fmp) {} 295 + static inline void cpuset1_update_task_spread_flags(struct cpuset *cs, 296 + struct task_struct *tsk) {} 297 + static inline void cpuset1_update_tasks_flags(struct cpuset *cs) {} 298 + static inline void cpuset1_hotplug_update_tasks(struct cpuset *cs, 299 + struct cpumask *new_cpus, nodemask_t *new_mems, 300 + bool cpus_updated, bool mems_updated) {} 301 + static inline int cpuset1_validate_change(struct cpuset *cur, 302 + struct cpuset *trial) { return 0; } 303 + #endif /* CONFIG_CPUSETS_V1 */ 304 + 305 + #endif /* __CPUSET_INTERNAL_H */
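struct cpuset above embeds a struct uf_node "to merge intersecting subsets". The following is a userspace sketch of how such a merge could look, assuming each set is described by a plain CPU bitmask. struct demo_set and the demo_* functions are hypothetical, and the find variant used here is path halving, which may differ from what the kernel code does.

```c
#include <stddef.h>

struct uf_node {
	struct uf_node *parent;
	unsigned int rank;
};

/* Hypothetical stand-in for a cpuset: a CPU bitmask plus an embedded node. */
struct demo_set {
	unsigned long cpus;
	struct uf_node node;
};

/* Find with path halving: each visited node is re-pointed at its grandparent. */
static struct uf_node *demo_find(struct uf_node *node)
{
	while (node->parent != node) {
		node->parent = node->parent->parent;
		node = node->parent;
	}
	return node;
}

/* Union by rank. */
static void demo_union(struct uf_node *a, struct uf_node *b)
{
	struct uf_node *ra = demo_find(a);
	struct uf_node *rb = demo_find(b);

	if (ra == rb)
		return;
	if (ra->rank < rb->rank) {
		ra->parent = rb;
	} else {
		rb->parent = ra;
		if (ra->rank == rb->rank)
			ra->rank++;
	}
}

/*
 * Merge every pair of sets whose CPU masks overlap, then count the
 * resulting groups: a set leads a group iff its node is its own root.
 */
static size_t demo_count_domains(struct demo_set *sets, size_t n)
{
	size_t i, j, ndoms = 0;

	for (i = 0; i < n; i++) {
		sets[i].node.parent = &sets[i].node;
		sets[i].node.rank = 0;
	}

	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (sets[i].cpus & sets[j].cpus)
				demo_union(&sets[i].node, &sets[j].node);

	for (i = 0; i < n; i++)
		if (demo_find(&sets[i].node) == &sets[i].node)
			ndoms++;

	return ndoms;
}
```

Overlap is not transitive on its own (A may overlap B, and B overlap C, while A and C share no CPU), which is why a disjoint-set structure fits here: the pairwise unions take care of the transitive closure.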

+562

kernel/cgroup/cpuset-v1.c

··· 1 + // SPDX-License-Identifier: GPL-2.0-or-later 2 + 3 + #include "cpuset-internal.h" 4 + 5 + /* 6 + * Legacy hierarchy call to cgroup_transfer_tasks() is handled asynchrously 7 + */ 8 + struct cpuset_remove_tasks_struct { 9 + struct work_struct work; 10 + struct cpuset *cs; 11 + }; 12 + 13 + /* 14 + * Frequency meter - How fast is some event occurring? 15 + * 16 + * These routines manage a digitally filtered, constant time based, 17 + * event frequency meter. There are four routines: 18 + * fmeter_init() - initialize a frequency meter. 19 + * fmeter_markevent() - called each time the event happens. 20 + * fmeter_getrate() - returns the recent rate of such events. 21 + * fmeter_update() - internal routine used to update fmeter. 22 + * 23 + * A common data structure is passed to each of these routines, 24 + * which is used to keep track of the state required to manage the 25 + * frequency meter and its digital filter. 26 + * 27 + * The filter works on the number of events marked per unit time. 28 + * The filter is single-pole low-pass recursive (IIR). The time unit 29 + * is 1 second. Arithmetic is done using 32-bit integers scaled to 30 + * simulate 3 decimal digits of precision (multiplied by 1000). 31 + * 32 + * With an FM_COEF of 933, and a time base of 1 second, the filter 33 + * has a half-life of 10 seconds, meaning that if the events quit 34 + * happening, then the rate returned from the fmeter_getrate() 35 + * will be cut in half each 10 seconds, until it converges to zero. 36 + * 37 + * It is not worth doing a real infinitely recursive filter. If more 38 + * than FM_MAXTICKS ticks have elapsed since the last filter event, 39 + * just compute FM_MAXTICKS ticks worth, by which point the level 40 + * will be stable. 41 + * 42 + * Limit the count of unprocessed events to FM_MAXCNT, so as to avoid 43 + * arithmetic overflow in the fmeter_update() routine. 
44 + * 45 + * Given the simple 32 bit integer arithmetic used, this meter works 46 + * best for reporting rates between one per millisecond (msec) and 47 + * one per 32 (approx) seconds. At constant rates faster than one 48 + * per msec it maxes out at values just under 1,000,000. At constant 49 + * rates between one per msec, and one per second it will stabilize 50 + * to a value N*1000, where N is the rate of events per second. 51 + * At constant rates between one per second and one per 32 seconds, 52 + * it will be choppy, moving up on the seconds that have an event, 53 + * and then decaying until the next event. At rates slower than 54 + * about one in 32 seconds, it decays all the way back to zero between 55 + * each event. 56 + */ 57 + 58 + #define FM_COEF 933 /* coefficient for half-life of 10 secs */ 59 + #define FM_MAXTICKS ((u32)99) /* useless computing more ticks than this */ 60 + #define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */ 61 + #define FM_SCALE 1000 /* faux fixed point scale */ 62 + 63 + /* Initialize a frequency meter */ 64 + void fmeter_init(struct fmeter *fmp) 65 + { 66 + fmp->cnt = 0; 67 + fmp->val = 0; 68 + fmp->time = 0; 69 + spin_lock_init(&fmp->lock); 70 + } 71 + 72 + /* Internal meter update - process cnt events and update value */ 73 + static void fmeter_update(struct fmeter *fmp) 74 + { 75 + time64_t now; 76 + u32 ticks; 77 + 78 + now = ktime_get_seconds(); 79 + ticks = now - fmp->time; 80 + 81 + if (ticks == 0) 82 + return; 83 + 84 + ticks = min(FM_MAXTICKS, ticks); 85 + while (ticks-- > 0) 86 + fmp->val = (FM_COEF * fmp->val) / FM_SCALE; 87 + fmp->time = now; 88 + 89 + fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE; 90 + fmp->cnt = 0; 91 + } 92 + 93 + /* Process any previous ticks, then bump cnt by one (times scale). 
*/ 94 + static void fmeter_markevent(struct fmeter *fmp) 95 + { 96 + spin_lock(&fmp->lock); 97 + fmeter_update(fmp); 98 + fmp->cnt = min(FM_MAXCNT, fmp->cnt + FM_SCALE); 99 + spin_unlock(&fmp->lock); 100 + } 101 + 102 + /* Process any previous ticks, then return current value. */ 103 + static int fmeter_getrate(struct fmeter *fmp) 104 + { 105 + int val; 106 + 107 + spin_lock(&fmp->lock); 108 + fmeter_update(fmp); 109 + val = fmp->val; 110 + spin_unlock(&fmp->lock); 111 + return val; 112 + } 113 + 114 + /* 115 + * Collection of memory_pressure is suppressed unless 116 + * this flag is enabled by writing "1" to the special 117 + * cpuset file 'memory_pressure_enabled' in the root cpuset. 118 + */ 119 + 120 + int cpuset_memory_pressure_enabled __read_mostly; 121 + 122 + /* 123 + * __cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims. 124 + * 125 + * Keep a running average of the rate of synchronous (direct) 126 + * page reclaim efforts initiated by tasks in each cpuset. 127 + * 128 + * This represents the rate at which some task in the cpuset 129 + * ran low on memory on all nodes it was allowed to use, and 130 + * had to enter the kernels page reclaim code in an effort to 131 + * create more free memory by tossing clean pages or swapping 132 + * or writing dirty pages. 133 + * 134 + * Display to user space in the per-cpuset read-only file 135 + * "memory_pressure". Value displayed is an integer 136 + * representing the recent rate of entry into the synchronous 137 + * (direct) page reclaim by any task attached to the cpuset. 
138 + */ 139 + 140 + void __cpuset_memory_pressure_bump(void) 141 + { 142 + rcu_read_lock(); 143 + fmeter_markevent(&task_cs(current)->fmeter); 144 + rcu_read_unlock(); 145 + } 146 + 147 + static int update_relax_domain_level(struct cpuset *cs, s64 val) 148 + { 149 + #ifdef CONFIG_SMP 150 + if (val < -1 || val > sched_domain_level_max + 1) 151 + return -EINVAL; 152 + #endif 153 + 154 + if (val != cs->relax_domain_level) { 155 + cs->relax_domain_level = val; 156 + if (!cpumask_empty(cs->cpus_allowed) && 157 + is_sched_load_balance(cs)) 158 + rebuild_sched_domains_locked(); 159 + } 160 + 161 + return 0; 162 + } 163 + 164 + static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, 165 + s64 val) 166 + { 167 + struct cpuset *cs = css_cs(css); 168 + cpuset_filetype_t type = cft->private; 169 + int retval = -ENODEV; 170 + 171 + cpus_read_lock(); 172 + cpuset_lock(); 173 + if (!is_cpuset_online(cs)) 174 + goto out_unlock; 175 + 176 + switch (type) { 177 + case FILE_SCHED_RELAX_DOMAIN_LEVEL: 178 + retval = update_relax_domain_level(cs, val); 179 + break; 180 + default: 181 + retval = -EINVAL; 182 + break; 183 + } 184 + out_unlock: 185 + cpuset_unlock(); 186 + cpus_read_unlock(); 187 + return retval; 188 + } 189 + 190 + static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) 191 + { 192 + struct cpuset *cs = css_cs(css); 193 + cpuset_filetype_t type = cft->private; 194 + 195 + switch (type) { 196 + case FILE_SCHED_RELAX_DOMAIN_LEVEL: 197 + return cs->relax_domain_level; 198 + default: 199 + BUG(); 200 + } 201 + 202 + /* Unreachable but makes gcc happy */ 203 + return 0; 204 + } 205 + 206 + /* 207 + * update task's spread flag if cpuset's page/slab spread flag is set 208 + * 209 + * Call with callback_lock or cpuset_mutex held. The check can be skipped 210 + * if on default hierarchy. 
211 + */ 212 + void cpuset1_update_task_spread_flags(struct cpuset *cs, 213 + struct task_struct *tsk) 214 + { 215 + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) 216 + return; 217 + 218 + if (is_spread_page(cs)) 219 + task_set_spread_page(tsk); 220 + else 221 + task_clear_spread_page(tsk); 222 + 223 + if (is_spread_slab(cs)) 224 + task_set_spread_slab(tsk); 225 + else 226 + task_clear_spread_slab(tsk); 227 + } 228 + 229 + /** 230 + * cpuset1_update_tasks_flags - update the spread flags of tasks in the cpuset. 231 + * @cs: the cpuset in which each task's spread flags needs to be changed 232 + * 233 + * Iterate through each task of @cs updating its spread flags. As this 234 + * function is called with cpuset_mutex held, cpuset membership stays 235 + * stable. 236 + */ 237 + void cpuset1_update_tasks_flags(struct cpuset *cs) 238 + { 239 + struct css_task_iter it; 240 + struct task_struct *task; 241 + 242 + css_task_iter_start(&cs->css, 0, &it); 243 + while ((task = css_task_iter_next(&it))) 244 + cpuset1_update_task_spread_flags(cs, task); 245 + css_task_iter_end(&it); 246 + } 247 + 248 + /* 249 + * If CPU and/or memory hotplug handlers, below, unplug any CPUs 250 + * or memory nodes, we need to walk over the cpuset hierarchy, 251 + * removing that CPU or node from all cpusets. If this removes the 252 + * last CPU or node from a cpuset, then move the tasks in the empty 253 + * cpuset to its next-highest non-empty parent. 254 + */ 255 + static void remove_tasks_in_empty_cpuset(struct cpuset *cs) 256 + { 257 + struct cpuset *parent; 258 + 259 + /* 260 + * Find its next-highest non-empty parent, (top cpuset 261 + * has online cpus, so can't be empty). 
262 + */ 263 + parent = parent_cs(cs); 264 + while (cpumask_empty(parent->cpus_allowed) || 265 + nodes_empty(parent->mems_allowed)) 266 + parent = parent_cs(parent); 267 + 268 + if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) { 269 + pr_err("cpuset: failed to transfer tasks out of empty cpuset "); 270 + pr_cont_cgroup_name(cs->css.cgroup); 271 + pr_cont("\n"); 272 + } 273 + } 274 + 275 + static void cpuset_migrate_tasks_workfn(struct work_struct *work) 276 + { 277 + struct cpuset_remove_tasks_struct *s; 278 + 279 + s = container_of(work, struct cpuset_remove_tasks_struct, work); 280 + remove_tasks_in_empty_cpuset(s->cs); 281 + css_put(&s->cs->css); 282 + kfree(s); 283 + } 284 + 285 + void cpuset1_hotplug_update_tasks(struct cpuset *cs, 286 + struct cpumask *new_cpus, nodemask_t *new_mems, 287 + bool cpus_updated, bool mems_updated) 288 + { 289 + bool is_empty; 290 + 291 + cpuset_callback_lock_irq(); 292 + cpumask_copy(cs->cpus_allowed, new_cpus); 293 + cpumask_copy(cs->effective_cpus, new_cpus); 294 + cs->mems_allowed = *new_mems; 295 + cs->effective_mems = *new_mems; 296 + cpuset_callback_unlock_irq(); 297 + 298 + /* 299 + * Don't call cpuset_update_tasks_cpumask() if the cpuset becomes empty, 300 + * as the tasks will be migrated to an ancestor. 301 + */ 302 + if (cpus_updated && !cpumask_empty(cs->cpus_allowed)) 303 + cpuset_update_tasks_cpumask(cs, new_cpus); 304 + if (mems_updated && !nodes_empty(cs->mems_allowed)) 305 + cpuset_update_tasks_nodemask(cs); 306 + 307 + is_empty = cpumask_empty(cs->cpus_allowed) || 308 + nodes_empty(cs->mems_allowed); 309 + 310 + /* 311 + * Move tasks to the nearest ancestor with execution resources, 312 + * This is full cgroup operation which will also call back into 313 + * cpuset. Execute it asynchronously using workqueue. 
314 + */ 315 + if (is_empty && cs->css.cgroup->nr_populated_csets && 316 + css_tryget_online(&cs->css)) { 317 + struct cpuset_remove_tasks_struct *s; 318 + 319 + s = kzalloc(sizeof(*s), GFP_KERNEL); 320 + if (WARN_ON_ONCE(!s)) { 321 + css_put(&cs->css); 322 + return; 323 + } 324 + 325 + s->cs = cs; 326 + INIT_WORK(&s->work, cpuset_migrate_tasks_workfn); 327 + schedule_work(&s->work); 328 + } 329 + } 330 + 331 + /* 332 + * is_cpuset_subset(p, q) - Is cpuset p a subset of cpuset q? 333 + * 334 + * One cpuset is a subset of another if all its allowed CPUs and 335 + * Memory Nodes are a subset of the other, and its exclusive flags 336 + * are only set if the other's are set. Call holding cpuset_mutex. 337 + */ 338 + 339 + static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q) 340 + { 341 + return cpumask_subset(p->cpus_allowed, q->cpus_allowed) && 342 + nodes_subset(p->mems_allowed, q->mems_allowed) && 343 + is_cpu_exclusive(p) <= is_cpu_exclusive(q) && 344 + is_mem_exclusive(p) <= is_mem_exclusive(q); 345 + } 346 + 347 + /* 348 + * cpuset1_validate_change() - Validate conditions specific to legacy (v1) 349 + * behavior. 350 + */ 351 + int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial) 352 + { 353 + struct cgroup_subsys_state *css; 354 + struct cpuset *c, *par; 355 + int ret; 356 + 357 + WARN_ON_ONCE(!rcu_read_lock_held()); 358 + 359 + /* Each of our child cpusets must be a subset of us */ 360 + ret = -EBUSY; 361 + cpuset_for_each_child(c, css, cur) 362 + if (!is_cpuset_subset(c, trial)) 363 + goto out; 364 + 365 + /* On legacy hierarchy, we must be a subset of our parent cpuset. 
*/ 366 + ret = -EACCES; 367 + par = parent_cs(cur); 368 + if (par && !is_cpuset_subset(trial, par)) 369 + goto out; 370 + 371 + ret = 0; 372 + out: 373 + return ret; 374 + } 375 + 376 + static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) 377 + { 378 + struct cpuset *cs = css_cs(css); 379 + cpuset_filetype_t type = cft->private; 380 + 381 + switch (type) { 382 + case FILE_CPU_EXCLUSIVE: 383 + return is_cpu_exclusive(cs); 384 + case FILE_MEM_EXCLUSIVE: 385 + return is_mem_exclusive(cs); 386 + case FILE_MEM_HARDWALL: 387 + return is_mem_hardwall(cs); 388 + case FILE_SCHED_LOAD_BALANCE: 389 + return is_sched_load_balance(cs); 390 + case FILE_MEMORY_MIGRATE: 391 + return is_memory_migrate(cs); 392 + case FILE_MEMORY_PRESSURE_ENABLED: 393 + return cpuset_memory_pressure_enabled; 394 + case FILE_MEMORY_PRESSURE: 395 + return fmeter_getrate(&cs->fmeter); 396 + case FILE_SPREAD_PAGE: 397 + return is_spread_page(cs); 398 + case FILE_SPREAD_SLAB: 399 + return is_spread_slab(cs); 400 + default: 401 + BUG(); 402 + } 403 + 404 + /* Unreachable but makes gcc happy */ 405 + return 0; 406 + } 407 + 408 + static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, 409 + u64 val) 410 + { 411 + struct cpuset *cs = css_cs(css); 412 + cpuset_filetype_t type = cft->private; 413 + int retval = 0; 414 + 415 + cpus_read_lock(); 416 + cpuset_lock(); 417 + if (!is_cpuset_online(cs)) { 418 + retval = -ENODEV; 419 + goto out_unlock; 420 + } 421 + 422 + switch (type) { 423 + case FILE_CPU_EXCLUSIVE: 424 + retval = cpuset_update_flag(CS_CPU_EXCLUSIVE, cs, val); 425 + break; 426 + case FILE_MEM_EXCLUSIVE: 427 + retval = cpuset_update_flag(CS_MEM_EXCLUSIVE, cs, val); 428 + break; 429 + case FILE_MEM_HARDWALL: 430 + retval = cpuset_update_flag(CS_MEM_HARDWALL, cs, val); 431 + break; 432 + case FILE_SCHED_LOAD_BALANCE: 433 + retval = cpuset_update_flag(CS_SCHED_LOAD_BALANCE, cs, val); 434 + break; 435 + case FILE_MEMORY_MIGRATE: 436 + retval = 
cpuset_update_flag(CS_MEMORY_MIGRATE, cs, val); 437 + break; 438 + case FILE_MEMORY_PRESSURE_ENABLED: 439 + cpuset_memory_pressure_enabled = !!val; 440 + break; 441 + case FILE_SPREAD_PAGE: 442 + retval = cpuset_update_flag(CS_SPREAD_PAGE, cs, val); 443 + break; 444 + case FILE_SPREAD_SLAB: 445 + retval = cpuset_update_flag(CS_SPREAD_SLAB, cs, val); 446 + break; 447 + default: 448 + retval = -EINVAL; 449 + break; 450 + } 451 + out_unlock: 452 + cpuset_unlock(); 453 + cpus_read_unlock(); 454 + return retval; 455 + } 456 + 457 + /* 458 + * for the common functions, 'private' gives the type of file 459 + */ 460 + 461 + struct cftype cpuset1_files[] = { 462 + { 463 + .name = "cpus", 464 + .seq_show = cpuset_common_seq_show, 465 + .write = cpuset_write_resmask, 466 + .max_write_len = (100U + 6 * NR_CPUS), 467 + .private = FILE_CPULIST, 468 + }, 469 + 470 + { 471 + .name = "mems", 472 + .seq_show = cpuset_common_seq_show, 473 + .write = cpuset_write_resmask, 474 + .max_write_len = (100U + 6 * MAX_NUMNODES), 475 + .private = FILE_MEMLIST, 476 + }, 477 + 478 + { 479 + .name = "effective_cpus", 480 + .seq_show = cpuset_common_seq_show, 481 + .private = FILE_EFFECTIVE_CPULIST, 482 + }, 483 + 484 + { 485 + .name = "effective_mems", 486 + .seq_show = cpuset_common_seq_show, 487 + .private = FILE_EFFECTIVE_MEMLIST, 488 + }, 489 + 490 + { 491 + .name = "cpu_exclusive", 492 + .read_u64 = cpuset_read_u64, 493 + .write_u64 = cpuset_write_u64, 494 + .private = FILE_CPU_EXCLUSIVE, 495 + }, 496 + 497 + { 498 + .name = "mem_exclusive", 499 + .read_u64 = cpuset_read_u64, 500 + .write_u64 = cpuset_write_u64, 501 + .private = FILE_MEM_EXCLUSIVE, 502 + }, 503 + 504 + { 505 + .name = "mem_hardwall", 506 + .read_u64 = cpuset_read_u64, 507 + .write_u64 = cpuset_write_u64, 508 + .private = FILE_MEM_HARDWALL, 509 + }, 510 + 511 + { 512 + .name = "sched_load_balance", 513 + .read_u64 = cpuset_read_u64, 514 + .write_u64 = cpuset_write_u64, 515 + .private = FILE_SCHED_LOAD_BALANCE, 516 + }, 517 + 
518 + { 519 + .name = "sched_relax_domain_level", 520 + .read_s64 = cpuset_read_s64, 521 + .write_s64 = cpuset_write_s64, 522 + .private = FILE_SCHED_RELAX_DOMAIN_LEVEL, 523 + }, 524 + 525 + { 526 + .name = "memory_migrate", 527 + .read_u64 = cpuset_read_u64, 528 + .write_u64 = cpuset_write_u64, 529 + .private = FILE_MEMORY_MIGRATE, 530 + }, 531 + 532 + { 533 + .name = "memory_pressure", 534 + .read_u64 = cpuset_read_u64, 535 + .private = FILE_MEMORY_PRESSURE, 536 + }, 537 + 538 + { 539 + .name = "memory_spread_page", 540 + .read_u64 = cpuset_read_u64, 541 + .write_u64 = cpuset_write_u64, 542 + .private = FILE_SPREAD_PAGE, 543 + }, 544 + 545 + { 546 + /* obsolete, may be removed in the future */ 547 + .name = "memory_spread_slab", 548 + .read_u64 = cpuset_read_u64, 549 + .write_u64 = cpuset_write_u64, 550 + .private = FILE_SPREAD_SLAB, 551 + }, 552 + 553 + { 554 + .name = "memory_pressure_enabled", 555 + .flags = CFTYPE_ONLY_ON_ROOT, 556 + .read_u64 = cpuset_read_u64, 557 + .write_u64 = cpuset_write_u64, 558 + .private = FILE_MEMORY_PRESSURE_ENABLED, 559 + }, 560 + 561 + { } /* terminate */ 562 + };

+144 -1011

kernel/cgroup/cpuset.c

··· 22 22 * distribution for more details. 23 23 */ 24 24 #include "cgroup-internal.h" 25 + #include "cpuset-internal.h" 25 26 26 - #include <linux/cpu.h> 27 - #include <linux/cpumask.h> 28 - #include <linux/cpuset.h> 29 - #include <linux/delay.h> 30 27 #include <linux/init.h> 31 28 #include <linux/interrupt.h> 32 29 #include <linux/kernel.h> ··· 37 40 #include <linux/sched/mm.h> 38 41 #include <linux/sched/task.h> 39 42 #include <linux/security.h> 40 - #include <linux/spinlock.h> 41 43 #include <linux/oom.h> 42 44 #include <linux/sched/isolation.h> 43 - #include <linux/cgroup.h> 44 45 #include <linux/wait.h> 45 46 #include <linux/workqueue.h> 46 47 ··· 52 57 */ 53 58 DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key); 54 59 55 - /* See "Frequency meter" comments, below. */ 56 - 57 - struct fmeter { 58 - int cnt; /* unprocessed events count */ 59 - int val; /* most recent output value */ 60 - time64_t time; /* clock (secs) when val computed */ 61 - spinlock_t lock; /* guards read or write of above */ 62 - }; 63 - 64 - /* 65 - * Invalid partition error code 66 - */ 67 - enum prs_errcode { 68 - PERR_NONE = 0, 69 - PERR_INVCPUS, 70 - PERR_INVPARENT, 71 - PERR_NOTPART, 72 - PERR_NOTEXCL, 73 - PERR_NOCPUS, 74 - PERR_HOTPLUG, 75 - PERR_CPUSEMPTY, 76 - PERR_HKEEPING, 77 - }; 78 - 79 60 static const char * const perr_strings[] = { 80 61 [PERR_INVCPUS] = "Invalid cpu list in cpuset.cpus.exclusive", 81 62 [PERR_INVPARENT] = "Parent is an invalid partition root", ··· 61 90 [PERR_HOTPLUG] = "No cpu available due to hotplug", 62 91 [PERR_CPUSEMPTY] = "cpuset.cpus and cpuset.cpus.exclusive are empty", 63 92 [PERR_HKEEPING] = "partition config conflicts with housekeeping setup", 64 - }; 65 - 66 - struct cpuset { 67 - struct cgroup_subsys_state css; 68 - 69 - unsigned long flags; /* "unsigned long" so bitops work */ 70 - 71 - /* 72 - * On default hierarchy: 73 - * 74 - * The user-configured masks can only be changed by writing to 75 - * cpuset.cpus and cpuset.mems, and won't be 
limited by the 76 - * parent masks. 77 - * 78 - * The effective masks is the real masks that apply to the tasks 79 - * in the cpuset. They may be changed if the configured masks are 80 - * changed or hotplug happens. 81 - * 82 - * effective_mask == configured_mask & parent's effective_mask, 83 - * and if it ends up empty, it will inherit the parent's mask. 84 - * 85 - * 86 - * On legacy hierarchy: 87 - * 88 - * The user-configured masks are always the same with effective masks. 89 - */ 90 - 91 - /* user-configured CPUs and Memory Nodes allow to tasks */ 92 - cpumask_var_t cpus_allowed; 93 - nodemask_t mems_allowed; 94 - 95 - /* effective CPUs and Memory Nodes allow to tasks */ 96 - cpumask_var_t effective_cpus; 97 - nodemask_t effective_mems; 98 - 99 - /* 100 - * Exclusive CPUs dedicated to current cgroup (default hierarchy only) 101 - * 102 - * The effective_cpus of a valid partition root comes solely from its 103 - * effective_xcpus and some of the effective_xcpus may be distributed 104 - * to sub-partitions below & hence excluded from its effective_cpus. 105 - * For a valid partition root, its effective_cpus have no relationship 106 - * with cpus_allowed unless its exclusive_cpus isn't set. 107 - * 108 - * This value will only be set if either exclusive_cpus is set or 109 - * when this cpuset becomes a local partition root. 110 - */ 111 - cpumask_var_t effective_xcpus; 112 - 113 - /* 114 - * Exclusive CPUs as requested by the user (default hierarchy only) 115 - * 116 - * Its value is independent of cpus_allowed and designates the set of 117 - * CPUs that can be granted to the current cpuset or its children when 118 - * it becomes a valid partition root. The effective set of exclusive 119 - * CPUs granted (effective_xcpus) depends on whether those exclusive 120 - * CPUs are passed down by its ancestors and not yet taken up by 121 - * another sibling partition root along the way. 122 - * 123 - * If its value isn't set, it defaults to cpus_allowed. 
124 - */ 125 - cpumask_var_t exclusive_cpus; 126 - 127 - /* 128 - * This is old Memory Nodes tasks took on. 129 - * 130 - * - top_cpuset.old_mems_allowed is initialized to mems_allowed. 131 - * - A new cpuset's old_mems_allowed is initialized when some 132 - * task is moved into it. 133 - * - old_mems_allowed is used in cpuset_migrate_mm() when we change 134 - * cpuset.mems_allowed and have tasks' nodemask updated, and 135 - * then old_mems_allowed is updated to mems_allowed. 136 - */ 137 - nodemask_t old_mems_allowed; 138 - 139 - struct fmeter fmeter; /* memory_pressure filter */ 140 - 141 - /* 142 - * Tasks are being attached to this cpuset. Used to prevent 143 - * zeroing cpus/mems_allowed between ->can_attach() and ->attach(). 144 - */ 145 - int attach_in_progress; 146 - 147 - /* partition number for rebuild_sched_domains() */ 148 - int pn; 149 - 150 - /* for custom sched domain */ 151 - int relax_domain_level; 152 - 153 - /* number of valid local child partitions */ 154 - int nr_subparts; 155 - 156 - /* partition root state */ 157 - int partition_root_state; 158 - 159 - /* 160 - * Default hierarchy only: 161 - * use_parent_ecpus - set if using parent's effective_cpus 162 - * child_ecpus_count - # of children with use_parent_ecpus set 163 - */ 164 - int use_parent_ecpus; 165 - int child_ecpus_count; 166 - 167 - /* 168 - * number of SCHED_DEADLINE tasks attached to this cpuset, so that we 169 - * know when to rebuild associated root domain bandwidth information. 
170 - */ 171 - int nr_deadline_tasks; 172 - int nr_migrate_dl_tasks; 173 - u64 sum_migrate_dl_bw; 174 - 175 - /* Invalid partition error code, not lock protected */ 176 - enum prs_errcode prs_err; 177 - 178 - /* Handle for cpuset.cpus.partition */ 179 - struct cgroup_file partition_file; 180 - 181 - /* Remote partition silbling list anchored at remote_children */ 182 - struct list_head remote_sibling; 183 - }; 184 - 185 - /* 186 - * Legacy hierarchy call to cgroup_transfer_tasks() is handled asynchrously 187 - */ 188 - struct cpuset_remove_tasks_struct { 189 - struct work_struct work; 190 - struct cpuset *cs; 93 + [PERR_ACCESS] = "Enable partition not permitted", 191 94 }; 192 95 193 96 /* ··· 73 228 * Exclusive CPUs in isolated partitions 74 229 */ 75 230 static cpumask_var_t isolated_cpus; 231 + 232 + /* 233 + * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot 234 + */ 235 + static cpumask_var_t boot_hk_cpus; 236 + static bool have_boot_isolcpus; 76 237 77 238 /* List of remote partition root children */ 78 239 static struct list_head remote_children; ··· 130 279 cpumask_var_t new_cpus; /* For update_cpumasks_hier() */ 131 280 }; 132 281 133 - static inline struct cpuset *css_cs(struct cgroup_subsys_state *css) 134 - { 135 - return css ? 
container_of(css, struct cpuset, css) : NULL; 136 - } 137 - 138 - /* Retrieve the cpuset for a task */ 139 - static inline struct cpuset *task_cs(struct task_struct *task) 140 - { 141 - return css_cs(task_css(task, cpuset_cgrp_id)); 142 - } 143 - 144 - static inline struct cpuset *parent_cs(struct cpuset *cs) 145 - { 146 - return css_cs(cs->css.parent); 147 - } 148 - 149 282 void inc_dl_tasks_cs(struct task_struct *p) 150 283 { 151 284 struct cpuset *cs = task_cs(p); ··· 142 307 struct cpuset *cs = task_cs(p); 143 308 144 309 cs->nr_deadline_tasks--; 145 - } 146 - 147 - /* bits in struct cpuset flags field */ 148 - typedef enum { 149 - CS_ONLINE, 150 - CS_CPU_EXCLUSIVE, 151 - CS_MEM_EXCLUSIVE, 152 - CS_MEM_HARDWALL, 153 - CS_MEMORY_MIGRATE, 154 - CS_SCHED_LOAD_BALANCE, 155 - CS_SPREAD_PAGE, 156 - CS_SPREAD_SLAB, 157 - } cpuset_flagbits_t; 158 - 159 - /* convenient tests for these bits */ 160 - static inline bool is_cpuset_online(struct cpuset *cs) 161 - { 162 - return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css); 163 - } 164 - 165 - static inline int is_cpu_exclusive(const struct cpuset *cs) 166 - { 167 - return test_bit(CS_CPU_EXCLUSIVE, &cs->flags); 168 - } 169 - 170 - static inline int is_mem_exclusive(const struct cpuset *cs) 171 - { 172 - return test_bit(CS_MEM_EXCLUSIVE, &cs->flags); 173 - } 174 - 175 - static inline int is_mem_hardwall(const struct cpuset *cs) 176 - { 177 - return test_bit(CS_MEM_HARDWALL, &cs->flags); 178 - } 179 - 180 - static inline int is_sched_load_balance(const struct cpuset *cs) 181 - { 182 - return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags); 183 - } 184 - 185 - static inline int is_memory_migrate(const struct cpuset *cs) 186 - { 187 - return test_bit(CS_MEMORY_MIGRATE, &cs->flags); 188 - } 189 - 190 - static inline int is_spread_page(const struct cpuset *cs) 191 - { 192 - return test_bit(CS_SPREAD_PAGE, &cs->flags); 193 - } 194 - 195 - static inline int is_spread_slab(const struct cpuset *cs) 196 - { 197 - return 
test_bit(CS_SPREAD_SLAB, &cs->flags); 198 310 } 199 311 200 312 static inline int is_partition_valid(const struct cpuset *cs) ··· 184 402 .relax_domain_level = -1, 185 403 .remote_sibling = LIST_HEAD_INIT(top_cpuset.remote_sibling), 186 404 }; 187 - 188 - /** 189 - * cpuset_for_each_child - traverse online children of a cpuset 190 - * @child_cs: loop cursor pointing to the current child 191 - * @pos_css: used for iteration 192 - * @parent_cs: target cpuset to walk children of 193 - * 194 - * Walk @child_cs through the online children of @parent_cs. Must be used 195 - * with RCU read locked. 196 - */ 197 - #define cpuset_for_each_child(child_cs, pos_css, parent_cs) \ 198 - css_for_each_child((pos_css), &(parent_cs)->css) \ 199 - if (is_cpuset_online(((child_cs) = css_cs((pos_css))))) 200 - 201 - /** 202 - * cpuset_for_each_descendant_pre - pre-order walk of a cpuset's descendants 203 - * @des_cs: loop cursor pointing to the current descendant 204 - * @pos_css: used for iteration 205 - * @root_cs: target cpuset to walk ancestor of 206 - * 207 - * Walk @des_cs through the online descendants of @root_cs. Must be used 208 - * with RCU read locked. The caller may modify @pos_css by calling 209 - * css_rightmost_descendant() to skip subtree. @root_cs is included in the 210 - * iteration and the first node to be visited. 
211 - */ 212 - #define cpuset_for_each_descendant_pre(des_cs, pos_css, root_cs) \ 213 - css_for_each_descendant_pre((pos_css), &(root_cs)->css) \ 214 - if (is_cpuset_online(((des_cs) = css_cs((pos_css))))) 215 405 216 406 /* 217 407 * There are two global locks guarding cpuset structures - cpuset_mutex and ··· 238 484 239 485 static DEFINE_SPINLOCK(callback_lock); 240 486 487 + void cpuset_callback_lock_irq(void) 488 + { 489 + spin_lock_irq(&callback_lock); 490 + } 491 + 492 + void cpuset_callback_unlock_irq(void) 493 + { 494 + spin_unlock_irq(&callback_lock); 495 + } 496 + 241 497 static struct workqueue_struct *cpuset_migrate_mm_wq; 242 498 243 499 static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq); ··· 261 497 "Cpuset allocations might fail even with a lot of memory available.\n", 262 498 nodemask_pr_args(nodes)); 263 499 } 500 + } 501 + 502 + /* 503 + * decrease cs->attach_in_progress. 504 + * wake_up cpuset_attach_wq if cs->attach_in_progress==0. 505 + */ 506 + static inline void dec_attach_in_progress_locked(struct cpuset *cs) 507 + { 508 + lockdep_assert_held(&cpuset_mutex); 509 + 510 + cs->attach_in_progress--; 511 + if (!cs->attach_in_progress) 512 + wake_up(&cpuset_attach_wq); 513 + } 514 + 515 + static inline void dec_attach_in_progress(struct cpuset *cs) 516 + { 517 + mutex_lock(&cpuset_mutex); 518 + dec_attach_in_progress_locked(cs); 519 + mutex_unlock(&cpuset_mutex); 264 520 } 265 521 266 522 /* ··· 378 594 while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY])) 379 595 cs = parent_cs(cs); 380 596 nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]); 381 - } 382 - 383 - /* 384 - * update task's spread flag if cpuset's page/slab spread flag is set 385 - * 386 - * Call with callback_lock or cpuset_mutex held. The check can be skipped 387 - * if on default hierarchy. 
388 - */ 389 - static void cpuset_update_task_spread_flags(struct cpuset *cs, 390 - struct task_struct *tsk) 391 - { 392 - if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) 393 - return; 394 - 395 - if (is_spread_page(cs)) 396 - task_set_spread_page(tsk); 397 - else 398 - task_clear_spread_page(tsk); 399 - 400 - if (is_spread_slab(cs)) 401 - task_set_spread_slab(tsk); 402 - else 403 - task_clear_spread_slab(tsk); 404 - } 405 - 406 - /* 407 - * is_cpuset_subset(p, q) - Is cpuset p a subset of cpuset q? 408 - * 409 - * One cpuset is a subset of another if all its allowed CPUs and 410 - * Memory Nodes are a subset of the other, and its exclusive flags 411 - * are only set if the other's are set. Call holding cpuset_mutex. 412 - */ 413 - 414 - static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q) 415 - { 416 - return cpumask_subset(p->cpus_allowed, q->cpus_allowed) && 417 - nodes_subset(p->mems_allowed, q->mems_allowed) && 418 - is_cpu_exclusive(p) <= is_cpu_exclusive(q) && 419 - is_mem_exclusive(p) <= is_mem_exclusive(q); 420 597 } 421 598 422 599 /** ··· 495 750 cpumask_empty(cs->exclusive_cpus); 496 751 } 497 752 498 - static inline struct cpumask *fetch_xcpus(struct cpuset *cs) 499 - { 500 - return !cpumask_empty(cs->exclusive_cpus) ? cs->exclusive_cpus : 501 - cpumask_empty(cs->effective_xcpus) ? 
cs->cpus_allowed 502 - : cs->effective_xcpus; 503 - } 504 - 505 753 /* 506 754 * cpusets_are_exclusive() - check if two cpusets are exclusive 507 755 * ··· 502 764 */ 503 765 static inline bool cpusets_are_exclusive(struct cpuset *cs1, struct cpuset *cs2) 504 766 { 505 - struct cpumask *xcpus1 = fetch_xcpus(cs1); 506 - struct cpumask *xcpus2 = fetch_xcpus(cs2); 767 + struct cpumask *xcpus1 = user_xcpus(cs1); 768 + struct cpumask *xcpus2 = user_xcpus(cs2); 507 769 508 770 if (cpumask_intersects(xcpus1, xcpus2)) 509 771 return false; 510 772 return true; 511 - } 512 - 513 - /* 514 - * validate_change_legacy() - Validate conditions specific to legacy (v1) 515 - * behavior. 516 - */ 517 - static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial) 518 - { 519 - struct cgroup_subsys_state *css; 520 - struct cpuset *c, *par; 521 - int ret; 522 - 523 - WARN_ON_ONCE(!rcu_read_lock_held()); 524 - 525 - /* Each of our child cpusets must be a subset of us */ 526 - ret = -EBUSY; 527 - cpuset_for_each_child(c, css, cur) 528 - if (!is_cpuset_subset(c, trial)) 529 - goto out; 530 - 531 - /* On legacy hierarchy, we must be a subset of our parent cpuset. */ 532 - ret = -EACCES; 533 - par = parent_cs(cur); 534 - if (par && !is_cpuset_subset(trial, par)) 535 - goto out; 536 - 537 - ret = 0; 538 - out: 539 - return ret; 540 773 } 541 774 542 775 /* ··· 539 830 rcu_read_lock(); 540 831 541 832 if (!is_in_v2_mode()) 542 - ret = validate_change_legacy(cur, trial); 833 + ret = cpuset1_validate_change(cur, trial); 543 834 if (ret) 544 835 goto out; 545 836 ··· 705 996 * were changed (added or removed.) 
706 997 * 707 998 * Finding the best partition (set of domains): 708 - * The triple nested loops below over i, j, k scan over the 709 - * load balanced cpusets (using the array of cpuset pointers in 710 - * csa[]) looking for pairs of cpusets that have overlapping 711 - * cpus_allowed, but which don't have the same 'pn' partition 712 - * number and gives them in the same partition number. It keeps 713 - * looping on the 'restart' label until it can no longer find 714 - * any such pairs. 999 + * The double nested loops below over i, j scan over the load 1000 + * balanced cpusets (using the array of cpuset pointers in csa[]) 1001 + * looking for pairs of cpusets that have overlapping cpus_allowed 1002 + * and merging them using a union-find algorithm. 715 1003 * 716 - * The union of the cpus_allowed masks from the set of 717 - * all cpusets having the same 'pn' value then form the one 718 - * element of the partition (one sched domain) to be passed to 719 - * partition_sched_domains(). 1004 + * The union of the cpus_allowed masks from the set of all cpusets 1005 + * having the same root then form the one element of the partition 1006 + * (one sched domain) to be passed to partition_sched_domains(). 1007 + * 720 1008 */ 721 1009 static int generate_sched_domains(cpumask_var_t **domains, 722 1010 struct sched_domain_attr **attributes) ··· 721 1015 struct cpuset *cp; /* top-down scan of cpusets */ 722 1016 struct cpuset **csa; /* array of all cpuset ptrs */ 723 1017 int csn; /* how many cpuset ptrs in csa so far */ 724 - int i, j, k; /* indices for partition finding loops */ 1018 + int i, j; /* indices for partition finding loops */ 725 1019 cpumask_var_t *doms; /* resulting partition; i.e. 
sched domains */ 726 1020 struct sched_domain_attr *dattr; /* attributes for custom domains */ 727 1021 int ndoms = 0; /* number of sched domains in result */ ··· 729 1023 struct cgroup_subsys_state *pos_css; 730 1024 bool root_load_balance = is_sched_load_balance(&top_cpuset); 731 1025 bool cgrpv2 = cgroup_subsys_on_dfl(cpuset_cgrp_subsys); 1026 + int nslot_update; 732 1027 733 1028 doms = NULL; 734 1029 dattr = NULL; ··· 818 1111 goto single_root_domain; 819 1112 820 1113 for (i = 0; i < csn; i++) 821 - csa[i]->pn = i; 822 - ndoms = csn; 1114 + uf_node_init(&csa[i]->node); 823 1115 824 - restart: 825 - /* Find the best partition (set of sched domains) */ 1116 + /* Merge overlapping cpusets */ 826 1117 for (i = 0; i < csn; i++) { 827 - struct cpuset *a = csa[i]; 828 - int apn = a->pn; 829 - 830 - for (j = 0; j < csn; j++) { 831 - struct cpuset *b = csa[j]; 832 - int bpn = b->pn; 833 - 834 - if (apn != bpn && cpusets_overlap(a, b)) { 835 - for (k = 0; k < csn; k++) { 836 - struct cpuset *c = csa[k]; 837 - 838 - if (c->pn == bpn) 839 - c->pn = apn; 840 - } 841 - ndoms--; /* one less element */ 842 - goto restart; 1118 + for (j = i + 1; j < csn; j++) { 1119 + if (cpusets_overlap(csa[i], csa[j])) { 1120 + /* 1121 + * Cgroup v2 shouldn't pass down overlapping 1122 + * partition root cpusets. 
1123 + */ 1124 + WARN_ON_ONCE(cgrpv2); 1125 + uf_union(&csa[i]->node, &csa[j]->node); 843 1126 } 844 1127 } 1128 + } 1129 + 1130 + /* Count the total number of domains */ 1131 + for (i = 0; i < csn; i++) { 1132 + if (uf_find(&csa[i]->node) == &csa[i]->node) 1133 + ndoms++; 845 1134 } 846 1135 847 1136 /* ··· 870 1167 } 871 1168 872 1169 for (nslot = 0, i = 0; i < csn; i++) { 873 - struct cpuset *a = csa[i]; 874 - struct cpumask *dp; 875 - int apn = a->pn; 876 - 877 - if (apn < 0) { 878 - /* Skip completed partitions */ 879 - continue; 880 - } 881 - 882 - dp = doms[nslot]; 883 - 884 - if (nslot == ndoms) { 885 - static int warnings = 10; 886 - if (warnings) { 887 - pr_warn("rebuild_sched_domains confused: nslot %d, ndoms %d, csn %d, i %d, apn %d\n", 888 - nslot, ndoms, csn, i, apn); 889 - warnings--; 890 - } 891 - continue; 892 - } 893 - 894 - cpumask_clear(dp); 895 - if (dattr) 896 - *(dattr + nslot) = SD_ATTR_INIT; 1170 + nslot_update = 0; 897 1171 for (j = i; j < csn; j++) { 898 - struct cpuset *b = csa[j]; 1172 + if (uf_find(&csa[j]->node) == &csa[i]->node) { 1173 + struct cpumask *dp = doms[nslot]; 899 1174 900 - if (apn == b->pn) { 901 - cpumask_or(dp, dp, b->effective_cpus); 1175 + if (i == j) { 1176 + nslot_update = 1; 1177 + cpumask_clear(dp); 1178 + if (dattr) 1179 + *(dattr + nslot) = SD_ATTR_INIT; 1180 + } 1181 + cpumask_or(dp, dp, csa[j]->effective_cpus); 902 1182 cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN)); 903 1183 if (dattr) 904 - update_domain_attr_tree(dattr + nslot, b); 905 - 906 - /* Done with this partition */ 907 - b->pn = -1; 1184 + update_domain_attr_tree(dattr + nslot, csa[j]); 908 1185 } 909 1186 } 910 - nslot++; 1187 + if (nslot_update) 1188 + nslot++; 911 1189 } 912 1190 BUG_ON(nslot != ndoms); 913 1191 ··· 980 1296 * 981 1297 * Call with cpuset_mutex held. Takes cpus_read_lock(). 
982 1298 */ 983 - static void rebuild_sched_domains_locked(void) 1299 + void rebuild_sched_domains_locked(void) 984 1300 { 985 1301 struct cgroup_subsys_state *pos_css; 986 1302 struct sched_domain_attr *attr; ··· 1032 1348 partition_and_rebuild_sched_domains(ndoms, doms, attr); 1033 1349 } 1034 1350 #else /* !CONFIG_SMP */ 1035 - static void rebuild_sched_domains_locked(void) 1351 + void rebuild_sched_domains_locked(void) 1036 1352 { 1037 1353 } 1038 1354 #endif /* CONFIG_SMP */ ··· 1052 1368 } 1053 1369 1054 1370 /** 1055 - * update_tasks_cpumask - Update the cpumasks of tasks in the cpuset. 1371 + * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset. 1056 1372 * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed 1057 1373 * @new_cpus: the temp variable for the new effective_cpus mask 1058 1374 * ··· 1062 1378 * is used instead of effective_cpus to make sure all offline CPUs are also 1063 1379 * included as hotplug code won't update cpumasks for tasks in top_cpuset. 
1064 1380 */ 1065 - static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus) 1381 + void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus) 1066 1382 { 1067 1383 struct css_task_iter it; 1068 1384 struct task_struct *task; ··· 1112 1428 partcmd_invalidate, /* Make partition invalid */ 1113 1429 }; 1114 1430 1115 - static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, 1116 - int turning_on); 1117 1431 static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs, 1118 1432 struct tmpmasks *tmp); 1119 1433 ··· 1125 1443 bool exclusive = (new_prs > PRS_MEMBER); 1126 1444 1127 1445 if (exclusive && !is_cpu_exclusive(cs)) { 1128 - if (update_flag(CS_CPU_EXCLUSIVE, cs, 1)) 1446 + if (cpuset_update_flag(CS_CPU_EXCLUSIVE, cs, 1)) 1129 1447 return PERR_NOTEXCL; 1130 1448 } else if (!exclusive && is_cpu_exclusive(cs)) { 1131 1449 /* Turning off CS_CPU_EXCLUSIVE will not return error */ 1132 - update_flag(CS_CPU_EXCLUSIVE, cs, 0); 1450 + cpuset_update_flag(CS_CPU_EXCLUSIVE, cs, 0); 1133 1451 } 1134 1452 return 0; 1135 1453 } ··· 1198 1516 if (is_cpu_exclusive(cs)) 1199 1517 clear_bit(CS_CPU_EXCLUSIVE, &cs->flags); 1200 1518 } 1201 - if (!cpumask_and(cs->effective_cpus, 1202 - parent->effective_cpus, cs->cpus_allowed)) { 1203 - cs->use_parent_ecpus = true; 1204 - parent->child_ecpus_count++; 1519 + if (!cpumask_and(cs->effective_cpus, parent->effective_cpus, cs->cpus_allowed)) 1205 1520 cpumask_copy(cs->effective_cpus, parent->effective_cpus); 1206 - } 1207 1521 } 1208 1522 1209 1523 /* ··· 1340 1662 * @cs: the cpuset to update 1341 1663 * @new_prs: new partition_root_state 1342 1664 * @tmp: temparary masks 1343 - * Return: 1 if successful, 0 if error 1665 + * Return: 0 if successful, errcode if error 1344 1666 * 1345 1667 * Enable the current cpuset to become a remote partition root taking CPUs 1346 1668 * directly from the top cpuset. cpuset_mutex must be held by the caller. 
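Note on the generate_sched_domains() hunks above: the old triple nested loop relabelled 'pn' values and jumped back to 'restart' after every merge, while the new code merges overlapping cpusets with a union-find structure and then counts the roots. The following is a minimal userspace sketch of that idea, not the kernel implementation. It assumes cpumasks fit in a single 64-bit word; struct toy_cpuset and build_domains() are hypothetical names, and uf_node_init(), uf_find() and uf_union() are simplified stand-ins written for this sketch rather than copies of the kernel helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's union-find node. */
struct uf_node {
	struct uf_node *parent;
	unsigned int rank;
};

/* Toy cpuset: bit n of 'cpus' set means CPU n belongs to the set. */
struct toy_cpuset {
	uint64_t cpus;
	struct uf_node node;
};

static void uf_node_init(struct uf_node *node)
{
	node->parent = node;	/* every node starts as its own root */
	node->rank = 0;
}

static struct uf_node *uf_find(struct uf_node *node)
{
	/* Path halving: point each visited node at its grandparent. */
	while (node->parent != node) {
		node->parent = node->parent->parent;
		node = node->parent;
	}
	return node;
}

static void uf_union(struct uf_node *a, struct uf_node *b)
{
	struct uf_node *ra = uf_find(a);
	struct uf_node *rb = uf_find(b);

	if (ra == rb)
		return;
	/* Union by rank: attach the shallower tree under the deeper one. */
	if (ra->rank < rb->rank) {
		struct uf_node *tmp = ra;

		ra = rb;
		rb = tmp;
	}
	rb->parent = ra;
	if (ra->rank == rb->rank)
		ra->rank++;
}

/*
 * Merge overlapping sets and write one combined mask per group into
 * doms[], which must have room for n entries. Returns the group count.
 */
size_t build_domains(struct toy_cpuset *sets, size_t n, uint64_t *doms)
{
	size_t i, j, ndoms = 0;

	assert(n == 0 || (sets != NULL && doms != NULL));

	for (i = 0; i < n; i++)
		uf_node_init(&sets[i].node);

	/* Each unordered pair is examined once; no restart is needed. */
	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (sets[i].cpus & sets[j].cpus)
				uf_union(&sets[i].node, &sets[j].node);

	/* One domain per root: OR together every member of its group. */
	for (i = 0; i < n; i++) {
		if (uf_find(&sets[i].node) != &sets[i].node)
			continue;
		doms[ndoms] = 0;
		for (j = 0; j < n; j++)
			if (uf_find(&sets[j].node) == &sets[i].node)
				doms[ndoms] |= sets[j].cpus;
		ndoms++;
	}
	return ndoms;
}
```

In this sketch the collection loop scans every index for each root, so the result does not depend on which member of a group ends up as the root after union by rank.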
··· 1354 1676 * The user must have sysadmin privilege. 1355 1677 */ 1356 1678 if (!capable(CAP_SYS_ADMIN)) 1357 - return 0; 1679 + return PERR_ACCESS; 1358 1680 1359 1681 /* 1360 1682 * The requested exclusive_cpus must not be allocated to other ··· 1368 1690 if (cpumask_empty(tmp->new_cpus) || 1369 1691 cpumask_intersects(tmp->new_cpus, subpartitions_cpus) || 1370 1692 cpumask_subset(top_cpuset.effective_cpus, tmp->new_cpus)) 1371 - return 0; 1693 + return PERR_INVCPUS; 1372 1694 1373 1695 spin_lock_irq(&callback_lock); 1374 1696 isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus); 1375 1697 list_add(&cs->remote_sibling, &remote_children); 1376 - if (cs->use_parent_ecpus) { 1377 - struct cpuset *parent = parent_cs(cs); 1378 - 1379 - cs->use_parent_ecpus = false; 1380 - parent->child_ecpus_count--; 1381 - } 1382 1698 spin_unlock_irq(&callback_lock); 1383 1699 update_unbound_workqueue_cpumask(isolcpus_updated); 1384 1700 1385 1701 /* 1386 1702 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy. 1387 1703 */ 1388 - update_tasks_cpumask(&top_cpuset, tmp->new_cpus); 1704 + cpuset_update_tasks_cpumask(&top_cpuset, tmp->new_cpus); 1389 1705 update_sibling_cpumasks(&top_cpuset, NULL, tmp); 1390 - return 1; 1706 + return 0; 1391 1707 } 1392 1708 1393 1709 /* ··· 1415 1743 /* 1416 1744 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy. 1417 1745 */ 1418 - update_tasks_cpumask(&top_cpuset, tmp->new_cpus); 1746 + cpuset_update_tasks_cpumask(&top_cpuset, tmp->new_cpus); 1419 1747 update_sibling_cpumasks(&top_cpuset, NULL, tmp); 1420 1748 } 1421 1749 ··· 1467 1795 /* 1468 1796 * Proprogate changes in top_cpuset's effective_cpus down the hierarchy. 
1469 1797 */ 1470 - update_tasks_cpumask(&top_cpuset, tmp->new_cpus); 1798 + cpuset_update_tasks_cpumask(&top_cpuset, tmp->new_cpus); 1471 1799 update_sibling_cpumasks(&top_cpuset, NULL, tmp); 1472 1800 return; 1473 1801 ··· 1522 1850 * @new_cpus: cpu mask 1523 1851 * Return: true if there is conflict, false otherwise 1524 1852 * 1525 - * CPUs outside of housekeeping_cpumask(HK_TYPE_DOMAIN) can only be used in 1526 - * an isolated partition. 1853 + * CPUs outside of boot_hk_cpus, if defined, can only be used in an 1854 + * isolated partition. 1527 1855 */ 1528 1856 static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus) 1529 1857 { 1530 - const struct cpumask *hk_domain = housekeeping_cpumask(HK_TYPE_DOMAIN); 1531 - bool all_in_hk = cpumask_subset(new_cpus, hk_domain); 1858 + if (!have_boot_isolcpus) 1859 + return false; 1532 1860 1533 - if (!all_in_hk && (prstate != PRS_ISOLATED)) 1861 + if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus)) 1534 1862 return true; 1535 1863 1536 1864 return false; ··· 1839 2167 update_partition_exclusive(cs, new_prs); 1840 2168 1841 2169 if (adding || deleting) { 1842 - update_tasks_cpumask(parent, tmp->addmask); 2170 + cpuset_update_tasks_cpumask(parent, tmp->addmask); 1843 2171 update_sibling_cpumasks(parent, cs, tmp); 1844 2172 } 1845 2173 ··· 1997 2325 * it is a partition root that has explicitly distributed 1998 2326 * out all its CPUs. 
1999 2327 */ 2000 - if (is_in_v2_mode() && !remote && cpumask_empty(tmp->new_cpus)) { 2328 + if (is_in_v2_mode() && !remote && cpumask_empty(tmp->new_cpus)) 2001 2329 cpumask_copy(tmp->new_cpus, parent->effective_cpus); 2002 - if (!cp->use_parent_ecpus) { 2003 - cp->use_parent_ecpus = true; 2004 - parent->child_ecpus_count++; 2005 - } 2006 - } else if (cp->use_parent_ecpus) { 2007 - cp->use_parent_ecpus = false; 2008 - WARN_ON_ONCE(!parent->child_ecpus_count); 2009 - parent->child_ecpus_count--; 2010 - } 2011 2330 2012 2331 if (remote) 2013 2332 goto get_css; ··· 2022 2359 /* 2023 2360 * update_parent_effective_cpumask() should have been called 2024 2361 * for cs already in update_cpumask(). We should also call 2025 - * update_tasks_cpumask() again for tasks in the parent 2362 + * cpuset_update_tasks_cpumask() again for tasks in the parent 2026 2363 * cpuset if the parent's effective_cpus changes. 2027 2364 */ 2028 2365 if ((cp != cs) && old_prs) { ··· 2079 2416 WARN_ON(!is_in_v2_mode() && 2080 2417 !cpumask_equal(cp->cpus_allowed, cp->effective_cpus)); 2081 2418 2082 - update_tasks_cpumask(cp, cp->effective_cpus); 2419 + cpuset_update_tasks_cpumask(cp, cp->effective_cpus); 2083 2420 2084 2421 /* 2085 2422 * On default hierarchy, inherit the CS_SCHED_LOAD_BALANCE ··· 2135 2472 * Check all its siblings and call update_cpumasks_hier() 2136 2473 * if their effective_cpus will need to be changed. 2137 2474 * 2138 - * With the addition of effective_xcpus which is a subset of 2139 - * cpus_allowed. It is possible a change in parent's effective_cpus 2475 + * It is possible a change in parent's effective_cpus 2140 2476 * due to a change in a child partition's effective_xcpus will impact 2141 2477 * its siblings even if they do not inherit parent's effective_cpus 2142 2478 * directly. 
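Note on the prstate_housekeeping_conflict() hunk above: the check no longer compares against the live housekeeping_cpumask(HK_TYPE_DOMAIN). It now returns false outright when no boot-time isolated CPUs were defined, and otherwise compares against the boot_hk_cpus snapshot that the cpuset_init() hunk later in this diff records. The sketch below models that decision in userspace, assuming cpumasks fit in a 64-bit word. struct toy_boot_state, enum toy_prs and toy_housekeeping_conflict() are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy partition states; the names are illustrative, not kernel symbols. */
enum toy_prs {
	TOY_PRS_MEMBER = 0,
	TOY_PRS_ROOT = 1,
	TOY_PRS_ISOLATED = 2,
};

/* Snapshot taken once at init time in this model. */
struct toy_boot_state {
	bool have_boot_isolcpus;	/* were isolated CPUs set at boot? */
	uint64_t boot_hk_cpus;		/* housekeeping CPUs recorded at boot */
};

/* True when every bit set in 'a' is also set in 'b'. */
static bool mask_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}

/*
 * Conflict rule modelled on the hunk: CPUs outside the boot-time
 * housekeeping set may only be used by an isolated partition, and the
 * rule applies only when boot-time isolation was configured at all.
 */
bool toy_housekeeping_conflict(const struct toy_boot_state *boot,
			       enum toy_prs prstate, uint64_t new_cpus)
{
	assert(boot != NULL);

	if (!boot->have_boot_isolcpus)
		return false;

	return prstate != TOY_PRS_ISOLATED &&
	       !mask_subset(new_cpus, boot->boot_hk_cpus);
}
```

With housekeeping CPUs 0-3 recorded at boot, a non-isolated partition asking for CPUs 4-5 is flagged in this model, an isolated partition asking for the same CPUs is not, and nothing is flagged when no boot-time isolation exists.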
··· 2149 2487 cpuset_for_each_child(sibling, pos_css, parent) { 2150 2488 if (sibling == cs) 2151 2489 continue; 2152 - if (!sibling->use_parent_ecpus && 2153 - !is_partition_valid(sibling)) { 2490 + if (!is_partition_valid(sibling)) { 2154 2491 compute_effective_cpumask(tmp->new_cpus, sibling, 2155 2492 parent); 2156 2493 if (cpumask_equal(tmp->new_cpus, sibling->effective_cpus)) ··· 2259 2598 invalidate = true; 2260 2599 rcu_read_lock(); 2261 2600 cpuset_for_each_child(cp, css, parent) { 2262 - struct cpumask *xcpus = fetch_xcpus(trialcs); 2601 + struct cpumask *xcpus = user_xcpus(trialcs); 2263 2602 2264 2603 if (is_partition_valid(cp) && 2265 2604 cpumask_intersects(xcpus, cp->effective_xcpus)) { ··· 2506 2845 static void *cpuset_being_rebound; 2507 2846 2508 2847 /** 2509 - * update_tasks_nodemask - Update the nodemasks of tasks in the cpuset. 2848 + * cpuset_update_tasks_nodemask - Update the nodemasks of tasks in the cpuset. 2510 2849 * @cs: the cpuset in which each task's mems_allowed mask needs to be changed 2511 2850 * 2512 2851 * Iterate through each task of @cs updating its mems_allowed to the 2513 2852 * effective cpuset's. As this function is called with cpuset_mutex held, 2514 2853 * cpuset membership stays stable. 
2515 2854 */ 2516 - static void update_tasks_nodemask(struct cpuset *cs) 2855 + void cpuset_update_tasks_nodemask(struct cpuset *cs) 2517 2856 { 2518 2857 static nodemask_t newmems; /* protected by cpuset_mutex */ 2519 2858 struct css_task_iter it; ··· 2611 2950 WARN_ON(!is_in_v2_mode() && 2612 2951 !nodes_equal(cp->mems_allowed, cp->effective_mems)); 2613 2952 2614 - update_tasks_nodemask(cp); 2953 + cpuset_update_tasks_nodemask(cp); 2615 2954 2616 2955 rcu_read_lock(); 2617 2956 css_put(&cp->css); ··· 2697 3036 return ret; 2698 3037 } 2699 3038 2700 - static int update_relax_domain_level(struct cpuset *cs, s64 val) 2701 - { 2702 - #ifdef CONFIG_SMP 2703 - if (val < -1 || val > sched_domain_level_max + 1) 2704 - return -EINVAL; 2705 - #endif 2706 - 2707 - if (val != cs->relax_domain_level) { 2708 - cs->relax_domain_level = val; 2709 - if (!cpumask_empty(cs->cpus_allowed) && 2710 - is_sched_load_balance(cs)) 2711 - rebuild_sched_domains_locked(); 2712 - } 2713 - 2714 - return 0; 2715 - } 2716 - 2717 - /** 2718 - * update_tasks_flags - update the spread flags of tasks in the cpuset. 2719 - * @cs: the cpuset in which each task's spread flags needs to be changed 2720 - * 2721 - * Iterate through each task of @cs updating its spread flags. As this 2722 - * function is called with cpuset_mutex held, cpuset membership stays 2723 - * stable. 
2724 - */ 2725 - static void update_tasks_flags(struct cpuset *cs) 2726 - { 2727 - struct css_task_iter it; 2728 - struct task_struct *task; 2729 - 2730 - css_task_iter_start(&cs->css, 0, &it); 2731 - while ((task = css_task_iter_next(&it))) 2732 - cpuset_update_task_spread_flags(cs, task); 2733 - css_task_iter_end(&it); 2734 - } 2735 - 2736 3039 /* 2737 - * update_flag - read a 0 or a 1 in a file and update associated flag 3040 + * cpuset_update_flag - read a 0 or a 1 in a file and update associated flag 2738 3041 * bit: the bit to update (see cpuset_flagbits_t) 2739 3042 * cs: the cpuset to update 2740 3043 * turning_on: whether the flag is being set or cleared ··· 2706 3081 * Call with cpuset_mutex held. 2707 3082 */ 2708 3083 2709 - static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, 3084 + int cpuset_update_flag(cpuset_flagbits_t bit, struct cpuset *cs, 2710 3085 int turning_on) 2711 3086 { 2712 3087 struct cpuset *trialcs; ··· 2742 3117 rebuild_sched_domains_locked(); 2743 3118 2744 3119 if (spread_flag_changed) 2745 - update_tasks_flags(cs); 3120 + cpuset1_update_tasks_flags(cs); 2746 3121 out: 2747 3122 free_cpuset(trialcs); 2748 3123 return err; ··· 2791 3166 goto out; 2792 3167 2793 3168 if (!old_prs) { 2794 - enum partition_cmd cmd = (new_prs == PRS_ROOT) 2795 - ? partcmd_enable : partcmd_enablei; 2796 - 2797 3169 /* 2798 3170 * cpus_allowed and exclusive_cpus cannot be both empty. 2799 3171 */ ··· 2799 3177 goto out; 2800 3178 } 2801 3179 2802 - err = update_parent_effective_cpumask(cs, cmd, NULL, &tmpmask); 2803 3180 /* 2804 - * If an attempt to become local partition root fails, 2805 - * try to become a remote partition root instead. 3181 + * If parent is valid partition, enable local partiion. 3182 + * Otherwise, enable a remote partition. 2806 3183 */ 2807 - if (err && remote_partition_enable(cs, new_prs, &tmpmask)) 2808 - err = 0; 3184 + if (is_partition_valid(parent)) { 3185 + enum partition_cmd cmd = (new_prs == PRS_ROOT) 3186 + ? 
partcmd_enable : partcmd_enablei; 3187 + 3188 + err = update_parent_effective_cpumask(cs, cmd, NULL, &tmpmask); 3189 + } else { 3190 + err = remote_partition_enable(cs, new_prs, &tmpmask); 3191 + } 2809 3192 } else if (old_prs && new_prs) { 2810 3193 /* 2811 3194 * A change in load balance state only, no change in cpumasks. ··· 2861 3234 notify_partition_change(cs, old_prs); 2862 3235 free_cpumasks(NULL, &tmpmask); 2863 3236 return 0; 2864 - } 2865 - 2866 - /* 2867 - * Frequency meter - How fast is some event occurring? 2868 - * 2869 - * These routines manage a digitally filtered, constant time based, 2870 - * event frequency meter. There are four routines: 2871 - * fmeter_init() - initialize a frequency meter. 2872 - * fmeter_markevent() - called each time the event happens. 2873 - * fmeter_getrate() - returns the recent rate of such events. 2874 - * fmeter_update() - internal routine used to update fmeter. 2875 - * 2876 - * A common data structure is passed to each of these routines, 2877 - * which is used to keep track of the state required to manage the 2878 - * frequency meter and its digital filter. 2879 - * 2880 - * The filter works on the number of events marked per unit time. 2881 - * The filter is single-pole low-pass recursive (IIR). The time unit 2882 - * is 1 second. Arithmetic is done using 32-bit integers scaled to 2883 - * simulate 3 decimal digits of precision (multiplied by 1000). 2884 - * 2885 - * With an FM_COEF of 933, and a time base of 1 second, the filter 2886 - * has a half-life of 10 seconds, meaning that if the events quit 2887 - * happening, then the rate returned from the fmeter_getrate() 2888 - * will be cut in half each 10 seconds, until it converges to zero. 2889 - * 2890 - * It is not worth doing a real infinitely recursive filter. If more 2891 - * than FM_MAXTICKS ticks have elapsed since the last filter event, 2892 - * just compute FM_MAXTICKS ticks worth, by which point the level 2893 - * will be stable. 
2894 - * 2895 - * Limit the count of unprocessed events to FM_MAXCNT, so as to avoid 2896 - * arithmetic overflow in the fmeter_update() routine. 2897 - * 2898 - * Given the simple 32 bit integer arithmetic used, this meter works 2899 - * best for reporting rates between one per millisecond (msec) and 2900 - * one per 32 (approx) seconds. At constant rates faster than one 2901 - * per msec it maxes out at values just under 1,000,000. At constant 2902 - * rates between one per msec, and one per second it will stabilize 2903 - * to a value N*1000, where N is the rate of events per second. 2904 - * At constant rates between one per second and one per 32 seconds, 2905 - * it will be choppy, moving up on the seconds that have an event, 2906 - * and then decaying until the next event. At rates slower than 2907 - * about one in 32 seconds, it decays all the way back to zero between 2908 - * each event. 2909 - */ 2910 - 2911 - #define FM_COEF 933 /* coefficient for half-life of 10 secs */ 2912 - #define FM_MAXTICKS ((u32)99) /* useless computing more ticks than this */ 2913 - #define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */ 2914 - #define FM_SCALE 1000 /* faux fixed point scale */ 2915 - 2916 - /* Initialize a frequency meter */ 2917 - static void fmeter_init(struct fmeter *fmp) 2918 - { 2919 - fmp->cnt = 0; 2920 - fmp->val = 0; 2921 - fmp->time = 0; 2922 - spin_lock_init(&fmp->lock); 2923 - } 2924 - 2925 - /* Internal meter update - process cnt events and update value */ 2926 - static void fmeter_update(struct fmeter *fmp) 2927 - { 2928 - time64_t now; 2929 - u32 ticks; 2930 - 2931 - now = ktime_get_seconds(); 2932 - ticks = now - fmp->time; 2933 - 2934 - if (ticks == 0) 2935 - return; 2936 - 2937 - ticks = min(FM_MAXTICKS, ticks); 2938 - while (ticks-- > 0) 2939 - fmp->val = (FM_COEF * fmp->val) / FM_SCALE; 2940 - fmp->time = now; 2941 - 2942 - fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE; 2943 - fmp->cnt = 0; 2944 - } 2945 - 2946 - /* Process any 
previous ticks, then bump cnt by one (times scale). */ 2947 - static void fmeter_markevent(struct fmeter *fmp) 2948 - { 2949 - spin_lock(&fmp->lock); 2950 - fmeter_update(fmp); 2951 - fmp->cnt = min(FM_MAXCNT, fmp->cnt + FM_SCALE); 2952 - spin_unlock(&fmp->lock); 2953 - } 2954 - 2955 - /* Process any previous ticks, then return current value. */ 2956 - static int fmeter_getrate(struct fmeter *fmp) 2957 - { 2958 - int val; 2959 - 2960 - spin_lock(&fmp->lock); 2961 - fmeter_update(fmp); 2962 - val = fmp->val; 2963 - spin_unlock(&fmp->lock); 2964 - return val; 2965 3237 } 2966 3238 2967 3239 static struct cpuset *cpuset_attach_old_cs; ··· 2971 3445 cs = css_cs(css); 2972 3446 2973 3447 mutex_lock(&cpuset_mutex); 2974 - cs->attach_in_progress--; 2975 - if (!cs->attach_in_progress) 2976 - wake_up(&cpuset_attach_wq); 3448 + dec_attach_in_progress_locked(cs); 2977 3449 2978 3450 if (cs->nr_migrate_dl_tasks) { 2979 3451 int cpu = cpumask_any(cs->effective_cpus); ··· 3007 3483 WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach)); 3008 3484 3009 3485 cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to); 3010 - cpuset_update_task_spread_flags(cs, task); 3486 + cpuset1_update_task_spread_flags(cs, task); 3011 3487 } 3012 3488 3013 3489 static void cpuset_attach(struct cgroup_taskset *tset) ··· 3086 3562 reset_migrate_dl_data(cs); 3087 3563 } 3088 3564 3089 - cs->attach_in_progress--; 3090 - if (!cs->attach_in_progress) 3091 - wake_up(&cpuset_attach_wq); 3565 + dec_attach_in_progress_locked(cs); 3092 3566 3093 3567 mutex_unlock(&cpuset_mutex); 3094 - } 3095 - 3096 - /* The various types of files and directories in a cpuset file system */ 3097 - 3098 - typedef enum { 3099 - FILE_MEMORY_MIGRATE, 3100 - FILE_CPULIST, 3101 - FILE_MEMLIST, 3102 - FILE_EFFECTIVE_CPULIST, 3103 - FILE_EFFECTIVE_MEMLIST, 3104 - FILE_SUBPARTS_CPULIST, 3105 - FILE_EXCLUSIVE_CPULIST, 3106 - FILE_EFFECTIVE_XCPULIST, 3107 - FILE_ISOLATED_CPULIST, 3108 - FILE_CPU_EXCLUSIVE, 3109 - 
FILE_MEM_EXCLUSIVE, 3110 - FILE_MEM_HARDWALL, 3111 - FILE_SCHED_LOAD_BALANCE, 3112 - FILE_PARTITION_ROOT, 3113 - FILE_SCHED_RELAX_DOMAIN_LEVEL, 3114 - FILE_MEMORY_PRESSURE_ENABLED, 3115 - FILE_MEMORY_PRESSURE, 3116 - FILE_SPREAD_PAGE, 3117 - FILE_SPREAD_SLAB, 3118 - } cpuset_filetype_t; 3119 - 3120 - static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, 3121 - u64 val) 3122 - { 3123 - struct cpuset *cs = css_cs(css); 3124 - cpuset_filetype_t type = cft->private; 3125 - int retval = 0; 3126 - 3127 - cpus_read_lock(); 3128 - mutex_lock(&cpuset_mutex); 3129 - if (!is_cpuset_online(cs)) { 3130 - retval = -ENODEV; 3131 - goto out_unlock; 3132 - } 3133 - 3134 - switch (type) { 3135 - case FILE_CPU_EXCLUSIVE: 3136 - retval = update_flag(CS_CPU_EXCLUSIVE, cs, val); 3137 - break; 3138 - case FILE_MEM_EXCLUSIVE: 3139 - retval = update_flag(CS_MEM_EXCLUSIVE, cs, val); 3140 - break; 3141 - case FILE_MEM_HARDWALL: 3142 - retval = update_flag(CS_MEM_HARDWALL, cs, val); 3143 - break; 3144 - case FILE_SCHED_LOAD_BALANCE: 3145 - retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val); 3146 - break; 3147 - case FILE_MEMORY_MIGRATE: 3148 - retval = update_flag(CS_MEMORY_MIGRATE, cs, val); 3149 - break; 3150 - case FILE_MEMORY_PRESSURE_ENABLED: 3151 - cpuset_memory_pressure_enabled = !!val; 3152 - break; 3153 - case FILE_SPREAD_PAGE: 3154 - retval = update_flag(CS_SPREAD_PAGE, cs, val); 3155 - break; 3156 - case FILE_SPREAD_SLAB: 3157 - retval = update_flag(CS_SPREAD_SLAB, cs, val); 3158 - break; 3159 - default: 3160 - retval = -EINVAL; 3161 - break; 3162 - } 3163 - out_unlock: 3164 - mutex_unlock(&cpuset_mutex); 3165 - cpus_read_unlock(); 3166 - return retval; 3167 - } 3168 - 3169 - static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, 3170 - s64 val) 3171 - { 3172 - struct cpuset *cs = css_cs(css); 3173 - cpuset_filetype_t type = cft->private; 3174 - int retval = -ENODEV; 3175 - 3176 - cpus_read_lock(); 3177 - 
mutex_lock(&cpuset_mutex); 3178 - if (!is_cpuset_online(cs)) 3179 - goto out_unlock; 3180 - 3181 - switch (type) { 3182 - case FILE_SCHED_RELAX_DOMAIN_LEVEL: 3183 - retval = update_relax_domain_level(cs, val); 3184 - break; 3185 - default: 3186 - retval = -EINVAL; 3187 - break; 3188 - } 3189 - out_unlock: 3190 - mutex_unlock(&cpuset_mutex); 3191 - cpus_read_unlock(); 3192 - return retval; 3193 3568 } 3194 3569 3195 3570 /* 3196 3571 * Common handling for a write to a "cpus" or "mems" file. 3197 3572 */ 3198 - static ssize_t cpuset_write_resmask(struct kernfs_open_file *of, 3573 + ssize_t cpuset_write_resmask(struct kernfs_open_file *of, 3199 3574 char *buf, size_t nbytes, loff_t off) 3200 3575 { 3201 3576 struct cpuset *cs = css_cs(of_css(of)); ··· 3169 3746 * and since these maps can change value dynamically, one could read 3170 3747 * gibberish by doing partial reads while a list was changing. 3171 3748 */ 3172 - static int cpuset_common_seq_show(struct seq_file *sf, void *v) 3749 + int cpuset_common_seq_show(struct seq_file *sf, void *v) 3173 3750 { 3174 3751 struct cpuset *cs = css_cs(seq_css(sf)); 3175 3752 cpuset_filetype_t type = seq_cft(sf)->private; ··· 3208 3785 3209 3786 spin_unlock_irq(&callback_lock); 3210 3787 return ret; 3211 - } 3212 - 3213 - static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) 3214 - { 3215 - struct cpuset *cs = css_cs(css); 3216 - cpuset_filetype_t type = cft->private; 3217 - switch (type) { 3218 - case FILE_CPU_EXCLUSIVE: 3219 - return is_cpu_exclusive(cs); 3220 - case FILE_MEM_EXCLUSIVE: 3221 - return is_mem_exclusive(cs); 3222 - case FILE_MEM_HARDWALL: 3223 - return is_mem_hardwall(cs); 3224 - case FILE_SCHED_LOAD_BALANCE: 3225 - return is_sched_load_balance(cs); 3226 - case FILE_MEMORY_MIGRATE: 3227 - return is_memory_migrate(cs); 3228 - case FILE_MEMORY_PRESSURE_ENABLED: 3229 - return cpuset_memory_pressure_enabled; 3230 - case FILE_MEMORY_PRESSURE: 3231 - return fmeter_getrate(&cs->fmeter); 3232 - 
case FILE_SPREAD_PAGE: 3233 - return is_spread_page(cs); 3234 - case FILE_SPREAD_SLAB: 3235 - return is_spread_slab(cs); 3236 - default: 3237 - BUG(); 3238 - } 3239 - 3240 - /* Unreachable but makes gcc happy */ 3241 - return 0; 3242 - } 3243 - 3244 - static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) 3245 - { 3246 - struct cpuset *cs = css_cs(css); 3247 - cpuset_filetype_t type = cft->private; 3248 - switch (type) { 3249 - case FILE_SCHED_RELAX_DOMAIN_LEVEL: 3250 - return cs->relax_domain_level; 3251 - default: 3252 - BUG(); 3253 - } 3254 - 3255 - /* Unreachable but makes gcc happy */ 3256 - return 0; 3257 3788 } 3258 3789 3259 3790 static int sched_partition_show(struct seq_file *seq, void *v) ··· 3272 3895 css_put(&cs->css); 3273 3896 return retval ?: nbytes; 3274 3897 } 3275 - 3276 - /* 3277 - * for the common functions, 'private' gives the type of file 3278 - */ 3279 - 3280 - static struct cftype legacy_files[] = { 3281 - { 3282 - .name = "cpus", 3283 - .seq_show = cpuset_common_seq_show, 3284 - .write = cpuset_write_resmask, 3285 - .max_write_len = (100U + 6 * NR_CPUS), 3286 - .private = FILE_CPULIST, 3287 - }, 3288 - 3289 - { 3290 - .name = "mems", 3291 - .seq_show = cpuset_common_seq_show, 3292 - .write = cpuset_write_resmask, 3293 - .max_write_len = (100U + 6 * MAX_NUMNODES), 3294 - .private = FILE_MEMLIST, 3295 - }, 3296 - 3297 - { 3298 - .name = "effective_cpus", 3299 - .seq_show = cpuset_common_seq_show, 3300 - .private = FILE_EFFECTIVE_CPULIST, 3301 - }, 3302 - 3303 - { 3304 - .name = "effective_mems", 3305 - .seq_show = cpuset_common_seq_show, 3306 - .private = FILE_EFFECTIVE_MEMLIST, 3307 - }, 3308 - 3309 - { 3310 - .name = "cpu_exclusive", 3311 - .read_u64 = cpuset_read_u64, 3312 - .write_u64 = cpuset_write_u64, 3313 - .private = FILE_CPU_EXCLUSIVE, 3314 - }, 3315 - 3316 - { 3317 - .name = "mem_exclusive", 3318 - .read_u64 = cpuset_read_u64, 3319 - .write_u64 = cpuset_write_u64, 3320 - .private = FILE_MEM_EXCLUSIVE, 3321 
- }, 3322 - 3323 - { 3324 - .name = "mem_hardwall", 3325 - .read_u64 = cpuset_read_u64, 3326 - .write_u64 = cpuset_write_u64, 3327 - .private = FILE_MEM_HARDWALL, 3328 - }, 3329 - 3330 - { 3331 - .name = "sched_load_balance", 3332 - .read_u64 = cpuset_read_u64, 3333 - .write_u64 = cpuset_write_u64, 3334 - .private = FILE_SCHED_LOAD_BALANCE, 3335 - }, 3336 - 3337 - { 3338 - .name = "sched_relax_domain_level", 3339 - .read_s64 = cpuset_read_s64, 3340 - .write_s64 = cpuset_write_s64, 3341 - .private = FILE_SCHED_RELAX_DOMAIN_LEVEL, 3342 - }, 3343 - 3344 - { 3345 - .name = "memory_migrate", 3346 - .read_u64 = cpuset_read_u64, 3347 - .write_u64 = cpuset_write_u64, 3348 - .private = FILE_MEMORY_MIGRATE, 3349 - }, 3350 - 3351 - { 3352 - .name = "memory_pressure", 3353 - .read_u64 = cpuset_read_u64, 3354 - .private = FILE_MEMORY_PRESSURE, 3355 - }, 3356 - 3357 - { 3358 - .name = "memory_spread_page", 3359 - .read_u64 = cpuset_read_u64, 3360 - .write_u64 = cpuset_write_u64, 3361 - .private = FILE_SPREAD_PAGE, 3362 - }, 3363 - 3364 - { 3365 - /* obsolete, may be removed in the future */ 3366 - .name = "memory_spread_slab", 3367 - .read_u64 = cpuset_read_u64, 3368 - .write_u64 = cpuset_write_u64, 3369 - .private = FILE_SPREAD_SLAB, 3370 - }, 3371 - 3372 - { 3373 - .name = "memory_pressure_enabled", 3374 - .flags = CFTYPE_ONLY_ON_ROOT, 3375 - .read_u64 = cpuset_read_u64, 3376 - .write_u64 = cpuset_write_u64, 3377 - .private = FILE_MEMORY_PRESSURE_ENABLED, 3378 - }, 3379 - 3380 - { } /* terminate */ 3381 - }; 3382 3898 3383 3899 /* 3384 3900 * This is currently a minimal set for the default hierarchy. 
It can be ··· 3420 4150 if (is_in_v2_mode()) { 3421 4151 cpumask_copy(cs->effective_cpus, parent->effective_cpus); 3422 4152 cs->effective_mems = parent->effective_mems; 3423 - cs->use_parent_ecpus = true; 3424 - parent->child_ecpus_count++; 3425 4153 } 3426 4154 spin_unlock_irq(&callback_lock); 3427 4155 ··· 3483 4215 3484 4216 if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && 3485 4217 is_sched_load_balance(cs)) 3486 - update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); 3487 - 3488 - if (cs->use_parent_ecpus) { 3489 - struct cpuset *parent = parent_cs(cs); 3490 - 3491 - cs->use_parent_ecpus = false; 3492 - parent->child_ecpus_count--; 3493 - } 4218 + cpuset_update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); 3494 4219 3495 4220 cpuset_dec(); 3496 4221 clear_bit(CS_ONLINE, &cs->flags); ··· 3573 4312 if (same_cs) 3574 4313 return; 3575 4314 3576 - mutex_lock(&cpuset_mutex); 3577 - cs->attach_in_progress--; 3578 - if (!cs->attach_in_progress) 3579 - wake_up(&cpuset_attach_wq); 3580 - mutex_unlock(&cpuset_mutex); 4315 + dec_attach_in_progress(cs); 3581 4316 } 3582 4317 3583 4318 /* ··· 3605 4348 guarantee_online_mems(cs, &cpuset_attach_nodemask_to); 3606 4349 cpuset_attach_task(cs, task); 3607 4350 3608 - cs->attach_in_progress--; 3609 - if (!cs->attach_in_progress) 3610 - wake_up(&cpuset_attach_wq); 3611 - 4351 + dec_attach_in_progress_locked(cs); 3612 4352 mutex_unlock(&cpuset_mutex); 3613 4353 } 3614 4354 ··· 3622 4368 .can_fork = cpuset_can_fork, 3623 4369 .cancel_fork = cpuset_cancel_fork, 3624 4370 .fork = cpuset_fork, 3625 - .legacy_cftypes = legacy_files, 4371 + #ifdef CONFIG_CPUSETS_V1 4372 + .legacy_cftypes = cpuset1_files, 4373 + #endif 3626 4374 .dfl_cftypes = dfl_files, 3627 4375 .early_init = true, 3628 4376 .threaded = true, ··· 3657 4401 3658 4402 BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL)); 3659 4403 4404 + have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN); 4405 + if (have_boot_isolcpus) { 4406 + BUG_ON(!alloc_cpumask_var(&boot_hk_cpus, GFP_KERNEL)); 
+		cpumask_copy(boot_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
+		cpumask_andnot(isolated_cpus, cpu_possible_mask, boot_hk_cpus);
+	}
+
 	return 0;
-}
-
-/*
- * If CPU and/or memory hotplug handlers, below, unplug any CPUs
- * or memory nodes, we need to walk over the cpuset hierarchy,
- * removing that CPU or node from all cpusets. If this removes the
- * last CPU or node from a cpuset, then move the tasks in the empty
- * cpuset to its next-highest non-empty parent.
- */
-static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
-{
-	struct cpuset *parent;
-
-	/*
-	 * Find its next-highest non-empty parent, (top cpuset
-	 * has online cpus, so can't be empty).
-	 */
-	parent = parent_cs(cs);
-	while (cpumask_empty(parent->cpus_allowed) ||
-			nodes_empty(parent->mems_allowed))
-		parent = parent_cs(parent);
-
-	if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) {
-		pr_err("cpuset: failed to transfer tasks out of empty cpuset ");
-		pr_cont_cgroup_name(cs->css.cgroup);
-		pr_cont("\n");
-	}
-}
-
-static void cpuset_migrate_tasks_workfn(struct work_struct *work)
-{
-	struct cpuset_remove_tasks_struct *s;
-
-	s = container_of(work, struct cpuset_remove_tasks_struct, work);
-	remove_tasks_in_empty_cpuset(s->cs);
-	css_put(&s->cs->css);
-	kfree(s);
-}
-
-static void
-hotplug_update_tasks_legacy(struct cpuset *cs,
-			    struct cpumask *new_cpus, nodemask_t *new_mems,
-			    bool cpus_updated, bool mems_updated)
-{
-	bool is_empty;
-
-	spin_lock_irq(&callback_lock);
-	cpumask_copy(cs->cpus_allowed, new_cpus);
-	cpumask_copy(cs->effective_cpus, new_cpus);
-	cs->mems_allowed = *new_mems;
-	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
-
-	/*
-	 * Don't call update_tasks_cpumask() if the cpuset becomes empty,
-	 * as the tasks will be migrated to an ancestor.
-	 */
-	if (cpus_updated && !cpumask_empty(cs->cpus_allowed))
-		update_tasks_cpumask(cs, new_cpus);
-	if (mems_updated && !nodes_empty(cs->mems_allowed))
-		update_tasks_nodemask(cs);
-
-	is_empty = cpumask_empty(cs->cpus_allowed) ||
-		   nodes_empty(cs->mems_allowed);
-
-	/*
-	 * Move tasks to the nearest ancestor with execution resources,
-	 * This is full cgroup operation which will also call back into
-	 * cpuset. Execute it asynchronously using workqueue.
-	 */
-	if (is_empty && cs->css.cgroup->nr_populated_csets &&
-	    css_tryget_online(&cs->css)) {
-		struct cpuset_remove_tasks_struct *s;
-
-		s = kzalloc(sizeof(*s), GFP_KERNEL);
-		if (WARN_ON_ONCE(!s)) {
-			css_put(&cs->css);
-			return;
-		}
-
-		s->cs = cs;
-		INIT_WORK(&s->work, cpuset_migrate_tasks_workfn);
-		schedule_work(&s->work);
-	}
 }

 static void
··· 3684 4505
 	spin_unlock_irq(&callback_lock);

 	if (cpus_updated)
-		update_tasks_cpumask(cs, new_cpus);
+		cpuset_update_tasks_cpumask(cs, new_cpus);
 	if (mems_updated)
-		update_tasks_nodemask(cs);
+		cpuset_update_tasks_nodemask(cs);
 }

 void cpuset_force_rebuild(void)
··· 3787 4608
 		hotplug_update_tasks(cs, &new_cpus, &new_mems,
 				     cpus_updated, mems_updated);
 	else
-		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
+		cpuset1_hotplug_update_tasks(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);

 unlock:
··· 3872 4693
 		top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
 		spin_unlock_irq(&callback_lock);
-		update_tasks_nodemask(&top_cpuset);
+		cpuset_update_tasks_nodemask(&top_cpuset);
 	}

 	mutex_unlock(&cpuset_mutex);
··· 4212 5033
 }

 /**
- * cpuset_slab_spread_node() - On which node to begin search for a slab page
- */
-int cpuset_slab_spread_node(void)
-{
-	if (current->cpuset_slab_spread_rotor == NUMA_NO_NODE)
-		current->cpuset_slab_spread_rotor =
-			node_random(&current->mems_allowed);
-
-	return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
-}
-EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
-
-/**
  * cpuset_mems_allowed_intersects - Does @tsk1's mems_allowed intersect @tsk2's?
  * @tsk1: pointer to task_struct of some task.
  * @tsk2: pointer to task_struct of some other task.
··· 4246 5080
 	pr_cont(",mems_allowed=%*pbl",
 		nodemask_pr_args(&current->mems_allowed));

-	rcu_read_unlock();
-}
-
-/*
- * Collection of memory_pressure is suppressed unless
- * this flag is enabled by writing "1" to the special
- * cpuset file 'memory_pressure_enabled' in the root cpuset.
- */
-
-int cpuset_memory_pressure_enabled __read_mostly;
-
-/*
- * __cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims.
- *
- * Keep a running average of the rate of synchronous (direct)
- * page reclaim efforts initiated by tasks in each cpuset.
- *
- * This represents the rate at which some task in the cpuset
- * ran low on memory on all nodes it was allowed to use, and
- * had to enter the kernels page reclaim code in an effort to
- * create more free memory by tossing clean pages or swapping
- * or writing dirty pages.
- *
- * Display to user space in the per-cpuset read-only file
- * "memory_pressure". Value displayed is an integer
- * representing the recent rate of entry into the synchronous
- * (direct) page reclaim by any task attached to the cpuset.
- */
-
-void __cpuset_memory_pressure_bump(void)
-{
-	rcu_read_lock();
-	fmeter_markevent(&task_cs(current)->fmeter);
 	rcu_read_unlock();
 }

+9 -23

kernel/cgroup/pids.c

··· 244 244
 				struct pids_cgroup *pids_over_limit)
 {
 	struct pids_cgroup *p = pids_forking;
-	bool limit = false;

 	/* Only log the first time limit is hit. */
 	if (atomic64_inc_return(&p->events_local[PIDCG_FORKFAIL]) == 1) {
··· 251 252
 		pr_cont_cgroup_path(p->css.cgroup);
 		pr_cont("\n");
 	}
-	cgroup_file_notify(&p->events_local_file);
 	if (!cgroup_subsys_on_dfl(pids_cgrp_subsys) ||
-	    cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
+	    cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS) {
+		cgroup_file_notify(&p->events_local_file);
 		return;
+	}

-	for (; parent_pids(p); p = parent_pids(p)) {
-		if (p == pids_over_limit) {
-			limit = true;
-			atomic64_inc(&p->events_local[PIDCG_MAX]);
-			cgroup_file_notify(&p->events_local_file);
-		}
-		if (limit)
-			atomic64_inc(&p->events[PIDCG_MAX]);
+	atomic64_inc(&pids_over_limit->events_local[PIDCG_MAX]);
+	cgroup_file_notify(&pids_over_limit->events_local_file);

+	for (p = pids_over_limit; parent_pids(p); p = parent_pids(p)) {
+		atomic64_inc(&p->events[PIDCG_MAX]);
 		cgroup_file_notify(&p->events_file);
 	}
 }
··· 272 276
  */
 static int pids_can_fork(struct task_struct *task, struct css_set *cset)
 {
-	struct cgroup_subsys_state *css;
 	struct pids_cgroup *pids, *pids_over_limit;
 	int err;

-	if (cset)
-		css = cset->subsys[pids_cgrp_id];
-	else
-		css = task_css_check(current, pids_cgrp_id, true);
-	pids = css_pids(css);
+	pids = css_pids(cset->subsys[pids_cgrp_id]);
 	err = pids_try_charge(pids, 1, &pids_over_limit);
 	if (err)
 		pids_event(pids, pids_over_limit);
··· 285 294

 static void pids_cancel_fork(struct task_struct *task, struct css_set *cset)
 {
-	struct cgroup_subsys_state *css;
 	struct pids_cgroup *pids;

-	if (cset)
-		css = cset->subsys[pids_cgrp_id];
-	else
-		css = task_css_check(current, pids_cgrp_id, true);
-	pids = css_pids(css);
+	pids = css_pids(cset->subsys[pids_cgrp_id]);
 	pids_uncharge(pids, 1);
 }
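The reworked pids_event() path in the hunk above can be read as: count one local "max" event on the cgroup whose limit was hit, then count one hierarchical "max" event on that cgroup and on each ancestor below the root. The following is a minimal userspace sketch of that walk only, not the kernel code. `struct pids_node`, `pids_max_event()` and `pids_demo()` are hypothetical names introduced for illustration, plain `long` counters stand in for the atomic64 counters, and the early-return branch for the legacy hierarchy and the local-events mount option is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pids_cgroup; only the fields used here. */
struct pids_node {
	struct pids_node *parent;	/* NULL for the root */
	long events_max;		/* models events[PIDCG_MAX] */
	long events_local_max;		/* models events_local[PIDCG_MAX] */
};

/*
 * Model of the hierarchical branch: one local event on the cgroup that
 * hit its limit, then one hierarchical event on it and on every ancestor
 * that itself has a parent (so the root is never counted).
 */
static void pids_max_event(struct pids_node *over_limit)
{
	struct pids_node *p;

	over_limit->events_local_max++;
	for (p = over_limit; p->parent; p = p->parent)
		p->events_max++;
}

/* Hypothetical demo: root <- a <- b <- c, with the limit hit in "b". */
static int pids_demo(void)
{
	struct pids_node root = { NULL, 0, 0 };
	struct pids_node a = { &root, 0, 0 };
	struct pids_node b = { &a, 0, 0 };
	struct pids_node c = { &b, 0, 0 };

	pids_max_event(&b);

	assert(b.events_local_max == 1 && b.events_max == 1);
	assert(a.events_local_max == 0 && a.events_max == 1);
	assert(c.events_max == 0);	/* descendants are not touched */
	assert(root.events_max == 0);	/* the walk stops below the root */
	return 0;
}
```

In this model the walk starts at the over-limit cgroup rather than at the forking cgroup, which mirrors the `for (p = pids_over_limit; ...)` loop in the new code.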

-1

kernel/fork.c

··· 2311 2311
 #endif
 #ifdef CONFIG_CPUSETS
 	p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
-	p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
 	seqcount_spinlock_init(&p->mems_allowed_seq, &p->alloc_lock);
 #endif
 #ifdef CONFIG_TRACE_IRQFLAGS

+1 -1

lib/Makefile

··· 34 34
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
 	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
 	 nmi_backtrace.o win_minmax.o memcat_p.o \
-	 buildid.o objpool.o
+	 buildid.o objpool.o union_find.o

 lib-$(CONFIG_PRINTK) += dump_stack.o
 lib-$(CONFIG_SMP) += cpumask.o

+49

lib/union_find.c

··· 1
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/union_find.h>
+
+/**
+ * uf_find - Find the root of a node and perform path compression
+ * @node: the node to find the root of
+ *
+ * This function returns the root of the node by following the parent
+ * pointers. It also performs path compression, making the tree shallower.
+ *
+ * Returns the root node of the set containing node.
+ */
+struct uf_node *uf_find(struct uf_node *node)
+{
+	struct uf_node *parent;
+
+	while (node->parent != node) {
+		parent = node->parent;
+		node->parent = parent->parent;
+		node = parent;
+	}
+	return node;
+}
+
+/**
+ * uf_union - Merge two sets, using union by rank
+ * @node1: the first node
+ * @node2: the second node
+ *
+ * This function merges the sets containing node1 and node2, by comparing
+ * the ranks to keep the tree balanced.
+ */
+void uf_union(struct uf_node *node1, struct uf_node *node2)
+{
+	struct uf_node *root1 = uf_find(node1);
+	struct uf_node *root2 = uf_find(node2);
+
+	if (root1 == root2)
+		return;
+
+	if (root1->rank < root2->rank) {
+		root1->parent = root2;
+	} else if (root1->rank > root2->rank) {
+		root2->parent = root1;
+	} else {
+		root2->parent = root1;
+		root1->rank++;
+	}
+}
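The two functions added in lib/union_find.c can be tried outside the kernel with a local stand-in for `struct uf_node`. The sketch below is a userspace model under stated assumptions: the struct layout (a `parent` pointer and a `rank`) and `uf_node_init()` are re-declared here because `<linux/union_find.h>` is not part of this diff, and `uf_same_set()` and `uf_demo()` are hypothetical helpers added only for illustration. The bodies of `uf_find()` and `uf_union()` follow the diff above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout; the real definition lives in <linux/union_find.h>. */
struct uf_node {
	struct uf_node *parent;
	unsigned int rank;
};

/* Assumed initializer: every node starts as the root of its own set. */
static void uf_node_init(struct uf_node *node)
{
	node->parent = node;
	node->rank = 0;
}

/* Find the root, halving the path on the way up (as in the diff). */
static struct uf_node *uf_find(struct uf_node *node)
{
	struct uf_node *parent;

	while (node->parent != node) {
		parent = node->parent;
		node->parent = parent->parent;
		node = parent;
	}
	return node;
}

/* Union by rank (as in the diff). */
static void uf_union(struct uf_node *node1, struct uf_node *node2)
{
	struct uf_node *root1 = uf_find(node1);
	struct uf_node *root2 = uf_find(node2);

	if (root1 == root2)
		return;

	if (root1->rank < root2->rank) {
		root1->parent = root2;
	} else if (root1->rank > root2->rank) {
		root2->parent = root1;
	} else {
		root2->parent = root1;
		root1->rank++;
	}
}

/* Hypothetical helper: nonzero if both nodes are in the same set. */
static int uf_same_set(struct uf_node *a, struct uf_node *b)
{
	return uf_find(a) == uf_find(b);
}

/* Hypothetical demo: build {0,1} and {2,3}, then merge them. */
static int uf_demo(void)
{
	struct uf_node n[4];
	int i;

	for (i = 0; i < 4; i++)
		uf_node_init(&n[i]);

	uf_union(&n[0], &n[1]);
	uf_union(&n[2], &n[3]);
	assert(uf_same_set(&n[0], &n[1]));
	assert(!uf_same_set(&n[1], &n[2]));

	uf_union(&n[1], &n[3]);
	assert(uf_same_set(&n[0], &n[2]));
	return 0;
}
```

Each element that should be grouped embeds (or, as here, simply is) a node; callers never walk parent pointers themselves and compare `uf_find()` results instead.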

+44 -12

tools/testing/selftests/cgroup/test_cpuset_prs.sh

··· 84 84
 echo "" > test/cpuset.cpus
 [[ $RESULT -eq 0 ]] && skip_test "Child cgroups are using cpuset!"

+#
+# If isolated CPUs have been reserved at boot time (as shown in
+# cpuset.cpus.isolated), these isolated CPUs should be outside of CPUs 0-7
+# that will be used by this script for testing purpose. If not, some of
+# the tests may fail incorrectly. These isolated CPUs will also be removed
+# before being compared with the expected results.
+#
+BOOT_ISOLCPUS=$(cat $CGROUP2/cpuset.cpus.isolated)
+if [[ -n "$BOOT_ISOLCPUS" ]]
+then
+	[[ $(echo $BOOT_ISOLCPUS | sed -e "s/[,-].*//") -le 7 ]] &&
+		skip_test "Pre-isolated CPUs ($BOOT_ISOLCPUS) overlap CPUs to be tested"
+	echo "Pre-isolated CPUs: $BOOT_ISOLCPUS"
+fi
 cleanup()
 {
 	online_cpus
··· 335 321
 	# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
 	# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
 	#
-	# Incorrect change to cpuset.cpus invalidates partition root
+	# Incorrect change to cpuset.cpus[.exclusive] invalidates partition root
 	#
 	# Adding CPUs to partition root that are not in parent's
 	# cpuset.cpus is allowed, but those extra CPUs are ignored.
··· 378 364

 	# cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it
 	" C0-3 . . C4-5 X5 . . . 0 A1:0-3,B1:4-5"
+
+	# Child partition root that try to take all CPUs from parent partition
+	# with tasks will remain invalid.
+	" C1-4:P1:S+ P1 . . . . . . 0 A1:1-4,A2:1-4 A1:P1,A2:P-1"
+	" C1-4:P1:S+ P1 . . . C1-4 . . 0 A1,A2:1-4 A1:P1,A2:P1"
+	" C1-4:P1:S+ P1 . . T C1-4 . . 0 A1:1-4,A2:1-4 A1:P1,A2:P-1"
+
+	# Clearing of cpuset.cpus with a preset cpuset.cpus.exclusive shouldn't
+	# affect cpuset.cpus.exclusive.effective.
+	" C1-4:X3:S+ C1:X3 . . . C . . 0 A2:1-4,XA2:3"

 	# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
 	# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
··· 656 632
 # Note that isolated CPUs from the sched/domains context include offline
 # CPUs as well as CPUs in non-isolated 1-CPU partition. Those CPUs may
 # not be included in the cpuset.cpus.isolated control file which contains
-# only CPUs in isolated partitions.
+# only CPUs in isolated partitions as well as those that are isolated at
+# boot time.
 #
 # $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
 # <isolcpus1> - expected sched/domains value
··· 684 659
 	fi

 	#
-	# Check the debug isolated cpumask, if present
+	# Check cpuset.cpus.isolated cpumask
 	#
-	[[ -f $ISCPUS ]] && {
+	if [[ -z "$BOOT_ISOLCPUS" ]]
+	then
 		ISOLCPUS=$(cat $ISCPUS)
-		[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && {
-			# Take a 50ms pause and try again
-			pause 0.05
-			ISOLCPUS=$(cat $ISCPUS)
-		}
-		[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && return 1
-		ISOLCPUS=
+	else
+		ISOLCPUS=$(cat $ISCPUS | sed -e "s/,*$BOOT_ISOLCPUS//")
+	fi
+	[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && {
+		# Take a 50ms pause and try again
+		pause 0.05
+		ISOLCPUS=$(cat $ISCPUS)
 	}
+	[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && return 1
+	ISOLCPUS=

 	#
 	# Use the sched domain in debugfs to check isolated CPUs, if available
··· 731 703
 		fi
 	done
 	[[ "$ISOLCPUS" = *- ]] && ISOLCPUS=${ISOLCPUS}$LASTISOLCPU
+	[[ -n "BOOT_ISOLCPUS" ]] &&
+		ISOLCPUS=$(echo $ISOLCPUS | sed -e "s/,*$BOOT_ISOLCPUS//")
+
 	[[ "$EXPECT_VAL" = "$ISOLCPUS" ]]
 }

··· 751 720
 }

 #
-# Check to see if there are unexpected isolated CPUs left
+# Check to see if there are unexpected isolated CPUs left beyond the boot
+# time isolated ones.
 #
 null_isolcpus_check()
 {

+77

tools/testing/selftests/cgroup/test_cpuset_v1_base.sh

··· 1
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Basc test for cpuset v1 interfaces write/read
+#
+
+skip_test() {
+	echo "$1"
+	echo "Test SKIPPED"
+	exit 4 # ksft_skip
+}
+
+write_test() {
+	dir=$1
+	interface=$2
+	value=$3
+	original=$(cat $dir/$interface)
+	echo "testing $interface $value"
+	echo $value > $dir/$interface
+	new=$(cat $dir/$interface)
+	[[ $value -ne $(cat $dir/$interface) ]] && {
+		echo "$interface write $value failed: new:$new"
+		exit 1
+	}
+}
+
+[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
+
+# Find cpuset v1 mount point
+CPUSET=$(mount -t cgroup | grep cpuset | head -1 | awk '{print $3}')
+[[ -n "$CPUSET" ]] || skip_test "cpuset v1 mount point not found!"
+
+#
+# Create a test cpuset, read write test
+#
+TDIR=test$$
+[[ -d $CPUSET/$TDIR ]] || mkdir $CPUSET/$TDIR
+
+ITF_MATRIX=(
+	#interface value expect root_only
+	'cpuset.cpus 0-1 0-1 0'
+	'cpuset.mem_exclusive 1 1 0'
+	'cpuset.mem_exclusive 0 0 0'
+	'cpuset.mem_hardwall 1 1 0'
+	'cpuset.mem_hardwall 0 0 0'
+	'cpuset.memory_migrate 1 1 0'
+	'cpuset.memory_migrate 0 0 0'
+	'cpuset.memory_spread_page 1 1 0'
+	'cpuset.memory_spread_page 0 0 0'
+	'cpuset.memory_spread_slab 1 1 0'
+	'cpuset.memory_spread_slab 0 0 0'
+	'cpuset.mems 0 0 0'
+	'cpuset.sched_load_balance 1 1 0'
+	'cpuset.sched_load_balance 0 0 0'
+	'cpuset.sched_relax_domain_level 2 2 0'
+	'cpuset.memory_pressure_enabled 1 1 1'
+	'cpuset.memory_pressure_enabled 0 0 1'
+)
+
+run_test()
+{
+	cnt="${ITF_MATRIX[@]}"
+	for i in "${ITF_MATRIX[@]}" ; do
+		args=($i)
+		root_only=${args[3]}
+		[[ $root_only -eq 1 ]] && {
+			write_test "$CPUSET" "${args[0]}" "${args[1]}" "${args[2]}"
+			continue
+		}
+		write_test "$CPUSET/$TDIR" "${args[0]}" "${args[1]}" "${args[2]}"
+	done
+}
+
+run_test
+rmdir $CPUSET/$TDIR
+echo "Test PASSED"
+exit 0
