memcg: introduce non-blocking limit setting option

Setting the max and high limits can trigger synchronous reclaim and/or
oom-kill if the usage is higher than the given limit. This behavior is
fine for newly created cgroups but it can cause issues for the node
controller while setting limits for existing cgroups.

In our production multi-tenant and overcommitted environment, we are
seeing priority inversion when the node controller dynamically adjusts the
limits of running jobs of different priorities. Based on the system
situation, the node controller may reduce the limits of lower priority
jobs and increase the limits of higher priority jobs. However, we are
seeing the node controller get stuck for long periods of time
reclaiming from lower priority jobs while setting their limits, and it
also spends a lot of its own CPU doing so.

One of the workarounds we are trying is to fork a new process which sets
the limit of the lower priority job and also sets an alarm to get itself
killed if it gets stuck in reclaim for the lower priority job. However,
we are finding it very unreliable and costly. Either we need a large
enough time buffer so that the alarm is delivered only after the limit
has been set, potentially spending a lot of CPU in reclaim, or we use a
much shorter, cheaper (less reclaim) alarm and set the limit unreliably.
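
For illustration only, the fork-plus-alarm workaround described above could
look roughly like the sketch below. The function name, its return codes and
the timeout parameter are hypothetical; this is not the node controller's
actual code, just a minimal model of the approach under those assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `value` to the limit file at `path` from a
 * forked child that arms an alarm before the (possibly blocking) write.
 * Returns 0 if the child wrote the limit, 1 if the child was killed by
 * SIGALRM, 2 if the child failed otherwise, or -errno if fork()/waitpid()
 * failed in the parent. A timeout of 0 arms no alarm at all.
 */
int set_limit_with_alarm(const char *path, const char *value,
			 unsigned int timeout_sec)
{
	assert(path != NULL && value != NULL);

	pid_t pid = fork();
	if (pid < 0)
		return -errno;

	if (pid == 0) {
		/* Child: make sure SIGALRM terminates us, then arm it. */
		signal(SIGALRM, SIG_DFL);
		alarm(timeout_sec);

		int fd = open(path, O_WRONLY);
		if (fd < 0)
			_exit(2);

		/* Without O_NONBLOCK this write may sit in reclaim. */
		size_t len = strlen(value);
		ssize_t n = write(fd, value, len);
		_exit(n == (ssize_t)len ? 0 : 2);
	}

	/* Parent: wait for the child, retrying if interrupted. */
	int status;
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -errno;
	}

	if (WIFSIGNALED(status))
		return WTERMSIG(status) == SIGALRM ? 1 : 2;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 2;
}
```

The tradeoff from the paragraph above shows up in `timeout_sec`: a large
value lets the child burn CPU in reclaim, while a small value makes a
return of 1 (limit possibly not set) more likely.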

Let's introduce a new limit setting option (opening memory.max or
memory.high with O_NONBLOCK) which does not trigger reclaim and/or
oom-kill and instead lets the processes in the target cgroup trigger
reclaim and/or throttling and/or oom-kill in their next charge request.
This will make the node controller in a multi-tenant overcommitted
environment much more reliable.
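
As a rough userspace sketch of how a controller might use this option: open
the limit file with O_NONBLOCK and write the new value. The helper name and
its error convention are hypothetical, and the cgroup path in the usage
note is only an example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `value` (a byte count such as "1073741824",
 * or "max") to a cgroup v2 limit file such as memory.max or memory.high.
 * The file is opened with O_NONBLOCK so that, with this patch applied,
 * the write only updates the limit and skips synchronous reclaim.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_limit_nonblock(const char *path, const char *value)
{
	assert(path != NULL && value != NULL);

	int fd = open(path, O_WRONLY | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	size_t len = strlen(value);
	ssize_t n = write(fd, value, len);
	int err = 0;

	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;	/* short write: treat as an error */

	close(fd);
	return err;
}
```

A call might look like
`set_limit_nonblock("/sys/fs/cgroup/job-low/memory.max", "1073741824")`,
where `job-low` is a made-up cgroup name. On kernels without this patch
the O_NONBLOCK flag is not checked by the limit-setting code, so the same
call would presumably still reclaim synchronously.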

Explanation from Johannes on side-effects of O_NONBLOCK limit change:
It's usually the allocating tasks inside the group bearing the cost of
limit enforcement and reclaim. This allows a (privileged) updater from
outside the group to keep that cost in there - instead of having to
help, from a context that doesn't necessarily make sense.

I suppose the tradeoff with that - and the reason why this was doing
sync reclaim in the first place - is that, if the group is idle and
not trying to allocate more, it can take indefinitely for the new
limit to actually be met.

It should be okay in most scenarios in practice. As the capacity is
reallocated from group A to B, B will exert pressure on A once it
tries to claim it and thereby shrink it down. If A is idle, that
shouldn't be hard. If A is running, it's likely to fault/allocate
soon-ish and then join the effort.

It does leave a (malicious) corner case where A is just busy-hitting
its memory to interfere with the clawback. This is comparable to
reclaiming memory.low overage from the outside, though, which is an
acceptable risk. Users of O_NONBLOCK just need to be aware.
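
Since the new limit may take arbitrarily long to be met when the group is
idle, an updater that cares could check usage itself. The sketch below
assumes it is enough to compare the single byte value in memory.current
against the limit just written. Both function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a single unsigned byte count from a cgroup
 * file such as memory.current. Returns 0 on success, -errno if the file
 * cannot be opened, or -EINVAL if it does not start with a number.
 */
int read_cgroup_u64(const char *path, uint64_t *out)
{
	assert(path != NULL && out != NULL);

	FILE *f = fopen(path, "r");
	if (!f)
		return -errno;

	int matched = fscanf(f, "%" SCNu64, out);
	fclose(f);
	return matched == 1 ? 0 : -EINVAL;
}

/*
 * Returns 1 if current usage is at or below `limit_bytes`, 0 if usage is
 * still above it, or a negative errno value on error.
 */
int limit_is_met(const char *current_path, uint64_t limit_bytes)
{
	uint64_t usage;
	int err = read_cgroup_u64(current_path, &usage);

	if (err)
		return err;
	return usage <= limit_bytes ? 1 : 0;
}
```

A result of 0 right after a non-blocking limit change is expected and is
not an error: it only means that no task in the group has charged memory
(and so triggered enforcement) yet.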

Link: https://lkml.kernel.org/r/20250419183545.1982187-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Shakeel Butt and committed by Andrew Morton 1 year ago
c8e6002b 8d88b076
2 changed files, +22 -2

Documentation/admin-guide/cgroup-v2.rst (+14)

···
 	monitors the limited cgroup to alleviate heavy reclaim
 	pressure.

+	If memory.high is opened with O_NONBLOCK then the synchronous
+	reclaim is bypassed. This is useful for admin processes that
+	need to dynamically adjust the job's memory limits without
+	expending their own CPU resources on memory reclamation. The
+	job will trigger the reclaim and/or get throttled on its
+	next charge request.
+
   memory.max
 	A read-write single value file which exists on non-root
 	cgroups. The default is "max".
···
 	Some kinds of allocations don't invoke the OOM killer.
 	Caller could retry them differently, return into userspace
 	as -ENOMEM or silently ignore in cases like disk readahead.
+
+	If memory.max is opened with O_NONBLOCK, then the synchronous
+	reclaim and oom-kill are bypassed. This is useful for admin
+	processes that need to dynamically adjust the job's memory limits
+	without expending their own CPU resources on memory reclamation.
+	The job will trigger the reclaim and/or oom-kill on its next
+	charge request.

   memory.reclaim
 	A write-only nested-keyed file which exists for all cgroups.

mm/memcontrol.c (+8 -2)

···

 	page_counter_set_high(&memcg->memory, high);

+	if (of->file->f_flags & O_NONBLOCK)
+		goto out;
+
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
 		unsigned long reclaimed;
···
 		if (!reclaimed && !nr_retries--)
 			break;
 	}
-
+out:
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
···
 		return err;

 	xchg(&memcg->memory.max, max);
+
+	if (of->file->f_flags & O_NONBLOCK)
+		goto out;

 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
···
 			break;
 		cond_resched();
 	}
-
+out:
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
