Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

mm: slowly shrink slabs with a relatively small number of objects

9092c71bb724 ("mm: use sc->priority for slab shrink targets") changed the
way that the target slab pressure is calculated and made it
priority-based:

delta = freeable >> priority;
delta *= 4;
do_div(delta, shrinker->seeks);

The problem is that on a default priority (which is 12) no pressure is
applied at all, if the number of potentially reclaimable objects is less
than 4096 (1<<12).

This causes the last objects on slab caches of no longer used cgroups to
(almost) never get reclaimed. It's obviously a waste of memory.

It can be especially painful, if these stale objects are holding a
reference to a dying cgroup. Slab LRU lists are reparented on memcg
offlining, but corresponding objects are still holding a reference to the
dying cgroup. If we don't scan these objects, the dying cgroup can't go
away. Most likely, the parent cgroup hasn't any directly charged objects,
only remaining objects from dying children cgroups. So it can easily hold
a reference to hundreds of dying cgroups.

If there are no big spikes in memory pressure, and new memory cgroups are
created and destroyed periodically, this causes the number of dying
cgroups grow steadily, causing a slow-ish and hard-to-detect memory
"leak". It's not a real leak, as the memory can be eventually reclaimed,
but it could not happen in a real life at all. I've seen hosts with a
steadily climbing number of dying cgroups, which doesn't show any signs of
a decline in months, despite the host is loaded with a production
workload.

It is an obvious waste of memory, and to prevent it, let's apply a minimal
pressure even on small shrinker lists. E.g. if there are freeable
objects, let's scan at least min(freeable, scan_batch) objects.

This fix significantly improves a chance of a dying cgroup to be
reclaimed, and together with some previous patches stops the steady growth
of the dying cgroups number on some of our hosts.

Link: http://lkml.kernel.org/r/20180905230759.12236-1-guro@fb.com
Fixes: 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

authored by

Roman Gushchin and committed by
Greg Kroah-Hartman
172b06c3 3bf181bc

+11
+11
mm/vmscan.c
··· 476 476 delta = freeable >> priority; 477 477 delta *= 4; 478 478 do_div(delta, shrinker->seeks); 479 + 480 + /* 481 + * Make sure we apply some minimal pressure on default priority 482 + * even on small cgroups. Stale objects are not only consuming memory 483 + * by themselves, but can also hold a reference to a dying cgroup, 484 + * preventing it from being reclaimed. A dying cgroup with all 485 + * corresponding structures like per-cpu stats and kmem caches 486 + * can be really big, so it may lead to a significant waste of memory. 487 + */ 488 + delta = max_t(unsigned long long, delta, min(freeable, batch_size)); 489 + 479 490 total_scan += delta; 480 491 if (total_scan < 0) { 481 492 pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",