mm: re-enable kswapd when memory pressure subsides or demotion is toggled

If kswapd fails to reclaim pages from a node MAX_RECLAIM_RETRIES times in
a row, kswapd on that node gets disabled. That is, the system won't wake
up kswapd for that node until page reclamation is observed at least once.
That reclamation is mostly done by direct reclaim, which in turn
re-enables kswapd.
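
The disable/re-enable behavior described above can be sketched as a small
user-space model. This is only an illustration, not kernel code: struct
node_model and the helper names are hypothetical, and the threshold value
of 16 is an assumption about the current definition of
MAX_RECLAIM_RETRIES.

```c
#include <stdbool.h>

/* Assumed value of the kernel constant; used here for illustration only. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for the per-node state kept in struct pglist_data. */
struct node_model {
	int kswapd_failures;	/* consecutive kswapd runs that reclaimed nothing */
};

/* A node counts as "hopeless" once the failure count reaches the threshold. */
static bool node_is_hopeless(const struct node_model *n)
{
	return n->kswapd_failures >= MAX_RECLAIM_RETRIES;
}

/* Model of one kswapd run: a run without progress bumps the counter. */
static void kswapd_run(struct node_model *n, unsigned long nr_reclaimed)
{
	if (nr_reclaimed)
		n->kswapd_failures = 0;
	else
		n->kswapd_failures++;
}

/* Model of successful direct reclaim, which revives a dormant kswapd. */
static void direct_reclaim_progress(struct node_model *n)
{
	n->kswapd_failures = 0;
}

/* The wakeup path skips hopeless nodes. */
static bool should_wake_kswapd(const struct node_model *n)
{
	return !node_is_hopeless(n);
}
```

In this model, the only way out of the hopeless state is a call to
direct_reclaim_progress(), which is exactly the dependency that breaks when
direct reclaim never runs on the node.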

However, on systems with CXL memory nodes, workloads with high anon page
usage can disable kswapd indefinitely without ever triggering direct
reclaim. This can be reproduced on the following system with the steps
below:

numa node 0 (32GB memory, 48 CPUs)
numa node 2~5 (512GB CXL memory, 128GB each)
(numa node 1 is disabled)
swap space 8GB

1) Set /sys/kernel/mm/numa/demotion_enabled to 0.
2) Set /proc/sys/kernel/numa_balancing to 0.
3) Run a process that allocates and randomly accesses 500GB of anon
pages.
4) Let the process exit normally.
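
A minimal sketch of the kind of workload meant in step 3) follows. It is
not the program used for the original report; the function name, the
xorshift generator, and the number of touches are assumptions made for
illustration. It maps anonymous memory and writes to randomly chosen pages.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory and write one byte to `touches` randomly
 * chosen pages. Returns the number of pages in the mapping, or 0 on failure.
 * Hypothetical helper, not part of the patch.
 */
static size_t touch_anon_randomly(size_t bytes, size_t touches, uint64_t seed)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t page_size, nr_pages;
	unsigned char *buf;
	uint64_t x = seed ? seed : 1;	/* xorshift64 state must be non-zero */

	if (page <= 0 || bytes < (size_t)page)
		return 0;
	page_size = (size_t)page;
	nr_pages = bytes / page_size;

	buf = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 0;

	for (size_t i = 0; i < touches; i++) {
		/* xorshift64 step to pick the next page */
		x ^= x << 13;
		x ^= x >> 7;
		x ^= x << 17;
		buf[(x % nr_pages) * page_size] = (unsigned char)x;
	}

	munmap(buf, nr_pages * page_size);
	return nr_pages;
}
```

Calling this from main() with a size of 500GB and a touch count large enough
to fault in most pages, then returning normally, would correspond to steps
3) and 4).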

During 3), free memory on node 0 drops below the low watermark, so
kswapd runs and depletes swap space. After that, kswapd fails
consecutively and gets disabled. Subsequent allocations are served from
CXL memory, so node 0 never comes under enough memory pressure to
trigger direct reclaim.

After 4), kswapd on node 0 remains disabled, and tasks running on that
node are unable to swap. Turning on NUMA_BALANCING_MEMORY_TIERING and
demotion at this point does not work properly either, since kswapd is
still disabled.

To mitigate this problem, reset kswapd_failures to 0 under the following
conditions:

a) The ZONE_BELOW_HIGH bit of a zone gets cleared, and the zone belongs
to a hopeless node that has a fallback memory node.
b) demotion_enabled is changed from false to true.

Rationale for a):
The ZONE_BELOW_HIGH bit being cleared suggests that the node may have
become reclaimable again. This won't help much while the memory-hungry
process keeps running without freeing anything, but at least the node
returns to a reclaimable state once the process exits.

Rationale for b):
When demotion_enabled is false, kswapd can only reclaim anon pages by
swapping them out to swap space. Once demotion_enabled is turned on,
kswapd can also reclaim anon pages by demoting them to another node.
The failure count accumulated under the old policy therefore no longer
reflects whether the node is reclaimable.
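
The two reset conditions can be sketched as a user-space model. The struct
and function names below are hypothetical; the boolean fields stand in for
the ZONE_BELOW_HIGH flag, the watermark check, and the fallback-node check,
and the threshold value of 16 is an assumption.

```c
#include <stdbool.h>

/* Assumed value of the kernel constant; used here for illustration only. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical per-node state for this model. */
struct node_state {
	int kswapd_failures;
	bool has_fallback_node;	/* models "a fallback memory node exists" */
	bool zone_below_high;	/* models the ZONE_BELOW_HIGH zone flag */
};

/*
 * Condition a): pages were freed back to the zone. If the zone was below
 * the high watermark and now passes the check, clear the flag and revive
 * a hopeless node that has a fallback node.
 */
static void on_pages_freed(struct node_state *n, bool high_watermark_ok)
{
	if (!n->zone_below_high || !high_watermark_ok)
		return;

	n->zone_below_high = false;
	if (n->kswapd_failures >= MAX_RECLAIM_RETRIES && n->has_fallback_node)
		n->kswapd_failures = 0;
}

/*
 * Condition b): only a false -> true transition of the demotion switch
 * resets the counters, and it does so for every node.
 */
static void set_demotion_enabled(bool *enabled, bool new_value,
				 struct node_state *nodes, int nr_nodes)
{
	bool before = *enabled;

	*enabled = new_value;
	if (!before && new_value) {
		for (int i = 0; i < nr_nodes; i++)
			nodes[i].kswapd_failures = 0;
	}
}
```

Note that in this model condition a) leaves nodes without a fallback node
alone, since those are expected to reach direct reclaim on their own.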

Since a reset of kswapd_failures can be lost when it races with a
concurrent non-atomic ++, the field is changed from int to atomic_t.
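
The lost-reset problem can be illustrated deterministically in user space
by spelling out the interleaving by hand. This sketch uses C11 atomics
rather than the kernel's atomic_t API, and the function names are
hypothetical.

```c
#include <stdatomic.h>

/*
 * A plain ++ is a load followed by a store. If a reset lands between the
 * two halves, the store overwrites it and the reset is lost.
 */
static int racy_increment_with_reset(int start)
{
	int counter = start;
	int loaded;

	loaded = counter;	/* CPU A: load half of counter++ */
	counter = 0;		/* CPU B: reset lands in between */
	counter = loaded + 1;	/* CPU A: store half overwrites the reset */
	return counter;
}

/*
 * With an atomic read-modify-write, the reset is ordered either after or
 * before the increment, so it can no longer be overwritten by a stale value.
 */
static int atomic_increment_then_reset(int start)
{
	atomic_int counter;

	atomic_init(&counter, start);
	atomic_fetch_add(&counter, 1);
	atomic_store(&counter, 0);
	return atomic_load(&counter);
}

static int atomic_reset_then_increment(int start)
{
	atomic_int counter;

	atomic_init(&counter, start);
	atomic_store(&counter, 0);
	atomic_fetch_add(&counter, 1);
	return atomic_load(&counter);
}
```

Starting from 16, the racy interleaving ends at 17 and the node stays
hopeless, while both atomic orderings end at 0 or 1.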

[akpm@linux-foundation.org: tweak whitespace]
Link: https://lkml.kernel.org/r/aL6qGi69jWXfPc4D@pcw-MS-7D22
Signed-off-by: Chanwon Park <flyinrm@gmail.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Chanwon Park and committed by Andrew Morton 8 months ago
e7a5f249 c5632525

+45 -17

6 changed files

+1 -1

include/linux/mmzone.h

@@ -1440,7 +1440,7 @@
 	int kswapd_order;
 	enum zone_type kswapd_highest_zoneidx;
 
-	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
+	atomic_t kswapd_failures;	/* Number of 'reclaimed == 0' runs */
 
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;

+12

mm/memory-tiers.c

@@ -942,10 +942,22 @@
 			     const char *buf, size_t count)
 {
 	ssize_t ret;
+	bool before = numa_demotion_enabled;
 
 	ret = kstrtobool(buf, &numa_demotion_enabled);
 	if (ret)
 		return ret;
+
+	/*
+	 * Reset kswapd_failures statistics. They may no longer be
+	 * valid since the policy for kswapd has changed.
+	 */
+	if (before == false && numa_demotion_enabled == true) {
+		struct pglist_data *pgdat;
+
+		for_each_online_pgdat(pgdat)
+			atomic_set(&pgdat->kswapd_failures, 0);
+	}
 
 	return count;
 }

+22 -7

mm/page_alloc.c

@@ -2860,14 +2860,29 @@
 		 */
 		return;
 	}
+
 	high = nr_pcp_high(pcp, zone, batch, free_high);
-	if (pcp->count >= high) {
-		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
-				   pcp, pindex);
-		if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
-		    zone_watermark_ok(zone, 0, high_wmark_pages(zone),
-				      ZONE_MOVABLE, 0))
-			clear_bit(ZONE_BELOW_HIGH, &zone->flags);
+	if (pcp->count < high)
+		return;
+
+	free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
+			   pcp, pindex);
+	if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
+	    zone_watermark_ok(zone, 0, high_wmark_pages(zone),
+			      ZONE_MOVABLE, 0)) {
+		struct pglist_data *pgdat = zone->zone_pgdat;
+		clear_bit(ZONE_BELOW_HIGH, &zone->flags);
+
+		/*
+		 * Assume that memory pressure on this node is gone and may be
+		 * in a reclaimable state. If a memory fallback node exists,
+		 * direct reclaim may not have been triggered, causing a
+		 * 'hopeless node' to stay in that state for a while. Let
+		 * kswapd work again by resetting kswapd_failures.
+		 */
+		if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES &&
+		    next_memory_node(pgdat->node_id) < MAX_NUMNODES)
+			atomic_set(&pgdat->kswapd_failures, 0);
 	}
 }
 

+2 -1

mm/show_mem.c

@@ -278,7 +278,8 @@
 #endif
 			K(node_page_state(pgdat, NR_PAGETABLE)),
 			K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
-			str_yes_no(pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES),
+			str_yes_no(atomic_read(&pgdat->kswapd_failures) >=
+				   MAX_RECLAIM_RETRIES),
 			K(node_page_state(pgdat, NR_BALLOON_PAGES)));
 	}
 

+7 -7

mm/vmscan.c

@@ -518,7 +518,7 @@
 	 * If kswapd is disabled, reschedule if necessary but do not
 	 * throttle as the system is likely near OOM.
 	 */
-	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+	if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES)
 		return true;
 
 	/*
@@ -5101,7 +5101,7 @@
 	blk_finish_plug(&plug);
 done:
 	if (sc->nr_reclaimed > reclaimed)
-		pgdat->kswapd_failures = 0;
+		atomic_set(&pgdat->kswapd_failures, 0);
 }
 
 /******************************************************************************
@@ -6180,7 +6180,7 @@
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
 	if (reclaimable)
-		pgdat->kswapd_failures = 0;
+		atomic_set(&pgdat->kswapd_failures, 0);
 	else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
@@ -6492,7 +6492,7 @@
 	int i;
 	bool wmark_ok;
 
-	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+	if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES)
 		return true;
 
 	for_each_managed_zone_pgdat(zone, pgdat, i, ZONE_NORMAL) {
@@ -6902,7 +6902,7 @@
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
 	/* Hopeless node, leave it to direct reclaim */
-	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+	if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES)
 		return true;
 
 	if (pgdat_balanced(pgdat, order, highest_zoneidx)) {
@@ -7170,7 +7170,7 @@
 	}
 
 	if (!sc.nr_reclaimed)
-		pgdat->kswapd_failures++;
+		atomic_inc(&pgdat->kswapd_failures);
 
 out:
 	clear_reclaim_active(pgdat, highest_zoneidx);
@@ -7429,7 +7429,7 @@
 		return;
 
 	/* Hopeless node, leave it to direct reclaim if possible */
-	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
+	if (atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES ||
 	    (pgdat_balanced(pgdat, order, highest_zoneidx) &&
 	     !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
 		/*

+1 -1

mm/vmstat.c

@@ -1848,7 +1848,7 @@
 	seq_printf(m,
 		   "\n node_unreclaimable: %u"
 		   "\n start_pfn: %lu",
-		   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
+		   atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES,
 		   zone->zone_start_pfn);
 	seq_putc(m, '\n');
 }
