=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex too. This is a document about memcg's internal behavior.
Please note that implementation details can be changed.

(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst.

0. How to record usage ?
========================

 2 objects are used.

 page_cgroup ....an object per page.

    Allocated at boot or memory hotplug. Freed at memory hot removal.

 swap_cgroup ... an entry per swp_entry.

    Allocated at swapon(). Freed at swapoff().

 The page_cgroup has a USED bit, so double counting against a page_cgroup
 never occurs. swap_cgroup is used only when a charged page is swapped out.

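 From userspace, the net effect of these records is visible through the
 cgroup's usage files. A minimal check, assuming the v1 memory controller is
 mounted at /cgroup, a group /cgroup/test exists, and swap accounting is
 enabled (paths are just examples)::

    # cat /cgroup/test/memory.usage_in_bytes        (charged pages)
    # cat /cgroup/test/memory.memsw.usage_in_bytes  (charged pages + swap entries)
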
1. Charge
=========

 a page/swp_entry may be charged (usage += PAGE_SIZE) at

    mem_cgroup_try_charge()

2. Uncharge
===========

 a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

    mem_cgroup_uncharge()
      Called when a page's refcount goes down to 0.

    mem_cgroup_uncharge_swap()
      Called when swp_entry's refcnt goes down to 0. A charge against swap
      disappears.

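 Charges and uncharges can be observed from userspace via memory.usage_in_bytes.
 A minimal sketch, reusing the allocate-and-free trick from section 9.10 and
 assuming a v1 memcg at /cgroup/test (the path is an example)::

    # echo $$ > /cgroup/test/tasks
    # a="$(dd if=/dev/zero bs=1M count=10)"
    # cat /cgroup/test/memory.usage_in_bytes
    # a=
    # cat /cgroup/test/memory.usage_in_bytes

 Usage rises while the data is held and drops again once the pages are freed
 and uncharged.
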
3. charge-commit
================

 Memcg pages are charged in two steps:

    - mem_cgroup_try_charge()
    - commit_charge()

 At try_charge(), there are no flags to say "this page is charged".
 At this point, usage += PAGE_SIZE.

 At commit(), the page is associated with the memcg.

In the explanation below, we assume CONFIG_SWAP=y.

4. Anonymous
============

 Anonymous pages are newly allocated at:

    - page fault into a MAP_ANONYMOUS mapping.
    - Copy-On-Write.

 4.1 Swap-in.
 At swap-in, the page is taken from swap-cache. There are 2 cases.

    (a) If the SwapCache is newly allocated and read, it has no charges.
    (b) If the SwapCache has been mapped by processes, it has been
        charged already.

 4.2 Swap-out.
 At swap-out, the typical state transition is as follows.

    (a) add to swap cache. (marked as SwapCache)
        swp_entry's refcnt += 1.
    (b) fully unmapped.
        swp_entry's refcnt += # of ptes.
    (c) write back to swap.
    (d) delete from swap cache. (remove from SwapCache)
        swp_entry's refcnt -= 1.

 Finally, at task exit,

    (e) zap_pte() is called and swp_entry's refcnt -= 1 -> 0.

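 To exercise the swap-out path from userspace, one sketch (assuming a v1 memcg
 at /cgroup/test with swap enabled and swap accounting on; the 20M/40M values
 are arbitrary) is to cap the limit below the working set so anonymous pages
 are pushed out to swap::

    # echo 20M > /cgroup/test/memory.limit_in_bytes
    # echo $$ > /cgroup/test/tasks
    # a="$(dd if=/dev/zero bs=1M count=40)"
    # cat /cgroup/test/memory.usage_in_bytes        (stays near the limit)
    # cat /cgroup/test/memory.memsw.usage_in_bytes  (includes the swapped-out part)
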
5. Page Cache
=============

 Page Cache is charged at

    - filemap_add_folio().

 The logic is very clear. (About migration, see below)

 Note:
    __filemap_remove_folio() is called by filemap_remove_folio()
    and __remove_mapping().

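 A quick way to see page-cache charges from userspace (a sketch, assuming a v1
 memcg at /cgroup/test; the input file is arbitrary) is to read a file from
 inside the group and watch the "cache" counter in memory.stat::

    # echo $$ > /cgroup/test/tasks
    # dd if=/path/to/some/file of=/dev/null bs=1M count=50
    # grep ^cache /cgroup/test/memory.stat
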
6. Shmem(tmpfs) Page Cache
==========================

 The best way to understand shmem's page state transition is to read
 mm/shmem.c.

 But a brief explanation of the behavior of memcg around shmem will be
 helpful to understand the logic.

 Shmem's page (just leaf page, not direct/indirect block) can be on

    - radix-tree of shmem's inode.
    - SwapCache.
    - Both on radix-tree and SwapCache. This happens at swap-in
      and swap-out.

 It's charged when...

    - A new page is added to shmem's radix-tree.
    - A swp page is read. (move a charge from swap_cgroup to page_cgroup)

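 A small userspace sketch of this (assuming a v1 memcg at /cgroup/test; mount
 point and sizes are arbitrary): populating a tmpfs file from inside the group
 charges the pages as they are added to shmem's radix-tree::

    # echo $$ > /cgroup/test/tasks
    # mount -t tmpfs -o size=64M none /mnt/tmp
    # dd if=/dev/zero of=/mnt/tmp/file bs=1M count=32
    # cat /cgroup/test/memory.usage_in_bytes
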
7. Page Migration
=================

    mem_cgroup_migrate()

8. LRU
======

 Each memcg has its own vector of LRUs (inactive anon, active anon,
 inactive file, active file, unevictable) of pages from each node,
 each LRU handled under a single lru_lock for that memcg and node.

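 The per-LRU sizes of a memcg are exported via memory.stat; a quick check
 (assuming a v1 memcg at /cgroup/test) is::

    # grep -E '^(inactive|active)_(anon|file)|^unevictable' /cgroup/test/memory.stat
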
9. Typical Tests.
=================

 Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

 When testing a racy case, it's a good idea to set the memcg's limit
 very low rather than in GB. Many races were found in tests under
 xKB or xxMB limits.

 (Memory behavior under a GB limit and under an MB limit shows very
 different situations.)

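 For example (a sketch; the group name and the 2M value are arbitrary)::

    # mkdir /cgroup/small
    # echo 2M > /cgroup/small/memory.limit_in_bytes
    # echo $$ > /cgroup/small/tasks

 and then run the usual workloads (kernel build, dd, malloc programs, ...)
 from that shell.
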
9.2 Shmem
---------

 Historically, memcg's shmem handling was poor and we saw some amount
 of trouble here. This is because shmem is page cache but can also be
 SwapCache. Testing with shmem/tmpfs is always a good test.

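 One way to exercise both states at once (a sketch; paths, limit and sizes are
 arbitrary, and swap must be enabled) is to write more into a tmpfs than the
 memcg limit allows, so shmem pages get pushed to swap and faulted back in::

    # mkdir /cgroup/shm
    # echo 16M > /cgroup/shm/memory.limit_in_bytes
    # echo $$ > /cgroup/shm/tasks
    # mount -t tmpfs -o size=64M none /mnt/tmp
    # dd if=/dev/zero of=/mnt/tmp/file bs=1M count=48
    # md5sum /mnt/tmp/file
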
9.3 Migration
-------------

 For NUMA, migration is another special case. To do an easy test, cpuset
 is useful. Following is a sample script to do migration::

    mount -t cgroup -o cpuset none /opt/cpuset

    mkdir /opt/cpuset/01
    echo 1 > /opt/cpuset/01/cpuset.cpus
    echo 0 > /opt/cpuset/01/cpuset.mems
    echo 1 > /opt/cpuset/01/cpuset.memory_migrate
    mkdir /opt/cpuset/02
    echo 1 > /opt/cpuset/02/cpuset.cpus
    echo 1 > /opt/cpuset/02/cpuset.mems
    echo 1 > /opt/cpuset/02/cpuset.memory_migrate

 In the above setup, when you move a task from 01 to 02, page migration
 from node 0 to node 1 will occur. Following is a script to migrate all
 tasks under a cpuset::

    --
    move_task()
    {
        for pid in $1
        do
            /bin/echo $pid >$2/tasks 2>/dev/null
            echo -n $pid
            echo -n " "
        done
        echo END
    }

    G1_TASK=`cat ${G1}/tasks`
    G2_TASK=`cat ${G2}/tasks`
    move_task "${G1_TASK}" ${G2} &
    --

9.4 Memory hotplug
------------------

 Memory hotplug test is one of the good tests.

 To offline memory, do the following::

    # echo offline > /sys/devices/system/memory/memoryXXX/state

 (XXX is the number of the memory block)

 This is an easy way to test page migration, too.

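 After the test, the block can usually be brought back online::

    # echo online > /sys/devices/system/memory/memoryXXX/state
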
9.5 nested cgroups
------------------

 Use tests like the following for testing nested cgroups::

    mkdir /opt/cgroup/01/child_a
    mkdir /opt/cgroup/01/child_b

    set limit to 01.
    add limit to 01/child_b
    run jobs under child_a and child_b

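 Concretely, the limit-setting steps above might look like this (a sketch,
 assuming the memory controller is mounted at /opt/cgroup; the 100M/50M values
 are arbitrary)::

    echo 100M > /opt/cgroup/01/memory.limit_in_bytes
    echo 50M  > /opt/cgroup/01/child_b/memory.limit_in_bytes
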
 Create/delete the following groups at random while jobs are running::

    /opt/cgroup/01/child_a/child_aa
    /opt/cgroup/01/child_b/child_bb
    /opt/cgroup/01/child_c

 Running new jobs in a new group is also good.

9.6 Mount with other subsystems
-------------------------------

 Mounting with other subsystems is a good test because there are
 races and lock dependencies with other cgroup subsystems.

 example::

    # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

 and do task move, mkdir, rmdir etc... under this.

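 A small stress loop along those lines (a sketch; group names and iteration
 count are arbitrary) could be::

    for i in $(seq 1 100); do
        mkdir /cgroup/t$i
        echo $$ > /cgroup/t$i/tasks
        echo $$ > /cgroup/tasks
        rmdir /cgroup/t$i
    done
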
9.7 swapoff
-----------

 Besides the fact that swap management is one of the most complicated parts
 of memcg, the call path of swap-in at swapoff is not the same as the usual
 swap-in path. It's worth testing explicitly.

 For example, a test like the following is good:

 (Shell-A)::

    # mount -t cgroup none /cgroup -o memory
    # mkdir /cgroup/test
    # echo 40M > /cgroup/test/memory.limit_in_bytes
    # echo 0 > /cgroup/test/tasks

 Run a malloc(100M) program under this. You'll see 60M of swap.

 (Shell-B)::

    # move all tasks in /cgroup/test to /cgroup
    # /sbin/swapoff -a
    # rmdir /cgroup/test
    # kill malloc task.

 Of course, tmpfs v.s. swapoff should be tested, too.

9.8 OOM-Killer
--------------

 Out-of-memory caused by a memcg's limit will kill tasks under
 the memcg. When hierarchy is used, a task under the hierarchy
 will be killed by the kernel.

 In this case, panic_on_oom shouldn't be invoked and tasks
 in other groups shouldn't be killed.

 It's not difficult to cause OOM under memcg as follows.

 Case A) when you can use swapoff::

    #swapoff -a
    #echo 50M > /memory.limit_in_bytes

 run 51M of malloc

 Case B) when you use mem+swap limitation::

    #echo 50M > memory.limit_in_bytes
    #echo 50M > memory.memsw.limit_in_bytes

 run 51M of malloc

9.9 Move charges at task migration
----------------------------------

 Charges associated with a task can be moved along with task migration.

 (Shell-A)::

    #mkdir /cgroup/A
    #echo $$ >/cgroup/A/tasks

 Run some programs which use some amount of memory in /cgroup/A.

 (Shell-B)::

    #mkdir /cgroup/B
    #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
    #echo "pid of the program running in group A" >/cgroup/B/tasks

 You can see that charges have been moved by reading ``*.usage_in_bytes`` or
 memory.stat of both A and B.

 See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what value
 should be written to move_charge_at_immigrate.

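 For instance, a quick before/after check (a sketch) is::

    # cat /cgroup/A/memory.usage_in_bytes /cgroup/B/memory.usage_in_bytes
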
9.10 Memory thresholds
----------------------

 The memory controller implements memory thresholds using the cgroups
 notification API. You can use tools/cgroup/cgroup_event_listener.c to test it.

 (Shell-A) Create cgroup and run event listener::

    # mkdir /cgroup/A
    # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

 (Shell-B) Add task to cgroup and try to allocate and free memory::

    # echo $$ >/cgroup/A/tasks
    # a="$(dd if=/dev/zero bs=1M count=10)"
    # a=

 You will see a message from cgroup_event_listener every time you cross
 the thresholds.

 Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

 It's a good idea to test the root cgroup as well.