=======================================================
Memory Resource Controller (Memcg) Implementation Memo
=======================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 2.6.34).

Because the VM is getting more complex (one of the reasons being memcg itself),
memcg's behavior is complex too. This is a document about memcg's internal
behavior. Please note that implementation details can change.

(*) Topics on the user API are covered in
    Documentation/admin-guide/cgroup-v1/memory.rst.

0. How to record usage?
========================

Two objects are used:

    page_cgroup ... an object per page.

        Allocated at boot or memory hotplug. Freed at memory hot removal.

    swap_cgroup ... an entry per swp_entry.

        Allocated at swapon(). Freed at swapoff().

The page_cgroup has a USED bit, so a page_cgroup is never double-counted.
swap_cgroup is used only when a charged page is swapped out.

1. Charge
=========

A page/swp_entry may be charged (usage += PAGE_SIZE) at

    mem_cgroup_try_charge()

2. Uncharge
===========

A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

    mem_cgroup_uncharge()
        Called when a page's refcount goes down to 0.

    mem_cgroup_uncharge_swap()
        Called when a swp_entry's refcnt goes down to 0. The charge against
        swap disappears.

3. charge-commit
================

Memcg pages are charged in two steps:

    - mem_cgroup_try_charge()
    - commit_charge()

At try_charge(), there is no flag saying "this page is charged"; at this
point, usage += PAGE_SIZE.

At commit(), the page is associated with the memcg.

In the explanation below, we assume CONFIG_SWAP=y.

4. Anonymous
============

An anonymous page is newly allocated at

    - a page fault into a MAP_ANONYMOUS mapping.
    - Copy-On-Write.

4.1 Swap-in.
    At swap-in, the page is taken from the swap cache. There are two cases.

    (a) If the SwapCache is newly allocated and read, it has no charge.
    (b) If the SwapCache has been mapped by processes, it has been
        charged already.

4.2 Swap-out.
    At swap-out, the typical state transition is as follows.

    (a) add to swap cache. (marked as SwapCache)
        swp_entry's refcnt += 1.
    (b) fully unmapped.
        swp_entry's refcnt += # of ptes.
    (c) write back to swap.
    (d) delete from swap cache. (remove from SwapCache)
        swp_entry's refcnt -= 1.

    Finally, at task exit,

    (e) zap_pte() is called and the swp_entry's refcnt goes -= 1 -> 0.

5. Page Cache
=============

Page Cache is charged at

    - filemap_add_folio().

The logic is very clear. (About migration, see below.)

Note:
    __filemap_remove_folio() is called by filemap_remove_folio()
    and __remove_mapping().

6. Shmem(tmpfs) Page Cache
==========================

The best way to understand shmem's page state transitions is to read
mm/shmem.c.

But a brief explanation of memcg's behavior around shmem will be helpful
for understanding the logic.

A shmem page (just the leaf page, not a direct/indirect block) can be on

    - the radix-tree of shmem's inode.
    - SwapCache.
    - both the radix-tree and SwapCache. This happens at swap-in
      and swap-out.

It is charged when...

    - a new page is added to shmem's radix-tree.
    - a swapped-out page is read back in (the charge moves from
      swap_cgroup to page_cgroup).

7. Page Migration
=================

    mem_cgroup_migrate()
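During migration the charge is moved to the new page by mem_cgroup_migrate(),
so the memcg's total usage should stay unchanged while the pages' node
placement changes. Besides the cpuset-based script in section 9.3 below, a
quick way to exercise this from userspace is a minimal sketch like the
following. It assumes the numactl tools (migratepages(8)) are installed, a v1
memory controller is mounted at /cgroup, a test task is already running in a
hypothetical group /cgroup/A, the kernel provides memory.numa_stat, and nodes
0 and 1 both exist::

    # PID=<pid of the test task>
    # cat /cgroup/A/memory.numa_stat         # per-node page counts before
    # migratepages $PID 0 1                  # migrate its pages from node 0 to node 1
    # cat /cgroup/A/memory.numa_stat         # per-node counts have changed...
    # cat /cgroup/A/memory.usage_in_bytes    # ...while the total charge should not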
8. LRU
======

Each memcg has its own vector of LRUs (inactive anon, active anon,
inactive file, active file, unevictable) of pages from each node,
each LRU handled under a single lru_lock for that memcg and node.

9. Typical Tests.
=================

Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

When you test racy cases, it is better to set the memcg's limit very small
rather than in the GB range. Many races have been found with limits of a few
KB or a few tens of MB.

(Memory behavior under a GB-sized limit and under an MB-sized limit is very
different.)

9.2 Shmem
---------

Historically, memcg's shmem handling was poor and we saw a number of problems
here. This is because shmem is page cache but can also be SwapCache. Testing
with shmem/tmpfs is always a good test.

9.3 Migration
-------------

For NUMA, migration is another special case. cpuset is useful for an easy
test. The following is a sample script to set up migration::

    mount -t cgroup -o cpuset none /opt/cpuset

    mkdir /opt/cpuset/01
    echo 1 > /opt/cpuset/01/cpuset.cpus
    echo 0 > /opt/cpuset/01/cpuset.mems
    echo 1 > /opt/cpuset/01/cpuset.memory_migrate
    mkdir /opt/cpuset/02
    echo 1 > /opt/cpuset/02/cpuset.cpus
    echo 1 > /opt/cpuset/02/cpuset.mems
    echo 1 > /opt/cpuset/02/cpuset.memory_migrate

With the above setup, when you move a task from 01 to 02, its pages are
migrated from node 0 to node 1. The following is a script to migrate all
tasks under one cpuset to another::

    --
    move_task()
    {
    for pid in $1
    do
        /bin/echo $pid >$2/tasks 2>/dev/null
        echo -n $pid
        echo -n " "
    done
    echo END
    }

    G1_TASK=`cat ${G1}/tasks`
    G2_TASK=`cat ${G2}/tasks`
    move_task "${G1_TASK}" ${G2} &
    --

9.4 Memory hotplug
------------------

Memory hotplug is another good test.

To offline memory, do the following::

    # echo offline > /sys/devices/system/memory/memoryXXX/state

(XXX is the number of the memory block.)

This is an easy way to test page migration, too.

9.5 Nested cgroups
------------------

Use tests like the following for testing nested cgroups::

    mkdir /opt/cgroup/01/child_a
    mkdir /opt/cgroup/01/child_b

    set limit to 01.
    add limit to 01/child_b
    run jobs under child_a and child_b

Create/delete the following groups at random while the jobs are running::

    /opt/cgroup/01/child_a/child_aa
    /opt/cgroup/01/child_b/child_bb
    /opt/cgroup/01/child_c

Running new jobs in a new group is also good.

9.6 Mount with other subsystems
-------------------------------

Mounting with other subsystems is a good test because there are races and
lock dependencies with the other cgroup subsystems.

Example::

    # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

and do task moves, mkdir, rmdir, etc. under this hierarchy, as in the
stress-loop sketch below.
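A minimal sketch of such a stress loop, assuming the combined hierarchy is
mounted at /cgroup as in the example above and using an arbitrary child group
name ``stress``; run it from one shell while other jobs run in the hierarchy::

    while true; do
        mkdir /cgroup/stress
        echo 0 > /cgroup/stress/cpuset.cpus    # cpuset needs cpus/mems set
        echo 0 > /cgroup/stress/cpuset.mems    # before tasks can be attached
        echo $$ > /cgroup/stress/tasks         # move this shell into the group
        echo $$ > /cgroup/tasks                # and back to the root group
        rmdir /cgroup/stress
    done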
9.7 swapoff
-----------

Besides swap management being one of the complicated parts of memcg, the
swap-in call path at swapoff is not the same as the usual swap-in path, so
it is worth testing explicitly.

For example, a test like the following is good:

(Shell-A)::

    # mount -t cgroup none /cgroup -o memory
    # mkdir /cgroup/test
    # echo 40M > /cgroup/test/memory.limit_in_bytes
    # echo 0 > /cgroup/test/tasks

Run a program that mallocs 100M under this group. You will see about 60M of
swap in use.

(Shell-B)::

    # move all tasks in /cgroup/test to /cgroup
    # /sbin/swapoff -a
    # rmdir /cgroup/test
    # kill malloc task.

Of course, tmpfs vs. swapoff should be tested, too.

9.8 OOM-Killer
--------------

An out-of-memory condition caused by a memcg's limit will kill tasks under
that memcg. When a hierarchy is used, a task under the hierarchy will be
killed by the kernel.

In this case, panic_on_oom should not be invoked and tasks in other groups
should not be killed.

It is not difficult to cause OOM under memcg, as follows.

Case A) when you can swapoff::

    # swapoff -a
    # echo 50M > /memory.limit_in_bytes

run a program that mallocs 51M.

Case B) when you use mem+swap limitation::

    # echo 50M > memory.limit_in_bytes
    # echo 50M > memory.memsw.limit_in_bytes

run a program that mallocs 51M.

9.9 Move charges at task migration
----------------------------------

Charges associated with a task can be moved along with task migration.

(Shell-A)::

    # mkdir /cgroup/A
    # echo $$ > /cgroup/A/tasks

Run some programs which use some amount of memory in /cgroup/A.

(Shell-B)::

    # mkdir /cgroup/B
    # echo 1 > /cgroup/B/memory.move_charge_at_immigrate
    # echo "pid of the program running in group A" > /cgroup/B/tasks

You can see that the charges have been moved by reading ``*.usage_in_bytes``
or memory.stat of both A and B.

See section 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst for the
values that can be written to move_charge_at_immigrate.

9.10 Memory thresholds
----------------------

The memory controller implements memory thresholds using the cgroups
notification API. You can use tools/cgroup/cgroup_event_listener.c to test it.

(Shell-A) Create a cgroup and run the event listener::

    # mkdir /cgroup/A
    # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

(Shell-B) Add a task to the cgroup and try to allocate and free memory::

    # echo $$ > /cgroup/A/tasks
    # a="$(dd if=/dev/zero bs=1M count=10)"
    # a=

You will see a message from cgroup_event_listener every time you cross a
threshold.

Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

It is a good idea to test the root cgroup as well, as in the sketch below.
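A minimal sketch for the root cgroup, assuming the hierarchy is mounted at
/cgroup as above and cgroup_event_listener has been built from tools/cgroup;
the (Shell-B) steps apply unchanged with /cgroup in place of /cgroup/A
(use /cgroup/memory.memsw.usage_in_bytes for memsw thresholds)::

    # ./cgroup_event_listener /cgroup/memory.usage_in_bytes 50M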