Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
mm/vmscan: fix demotion targets checks in reclaim/demotion

Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion",
v9.

This patch series addresses two issues in demote_folio_list(),
can_demote(), and next_demotion_node() in reclaim/demotion.

1. demote_folio_list() and can_demote() do not correctly check
demotion targets against cpuset.mems_effective, which causes (a) pages
to be demoted to disallowed nodes and (b) pages to fail demotion even
though the system still has allowed demotion nodes.

Patch 1 fixes this bug by updating cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing a direct
logical-AND operation against demotion targets.

2. next_demotion_node() returns a preferred demotion target, but it
does not check the node against allowed nodes.

Patch 2 ensures that next_demotion_node() filters against the allowed
node mask and selects the closest demotion target to the source node.


This patch (of 2):

Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks against cpuset.mems_effective in reclaim/demotion.

Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to can_demote().
However:

1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.

2. In can_demote(), it checks only the nodes in the immediate next
demotion hierarchy rather than all allowed demotion targets. This can
cause pages to never be demoted if the nodes in the next demotion
hierarchy are not set in mems_effective.

These bugs break the resource isolation provided by cpuset.mems. They
are visible from userspace because pages can either fail to be demoted
entirely or be demoted to disallowed nodes on multi-tier memory
systems.

To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing a direct
logical-AND operation against demotion targets. Also update
can_demote() and demote_folio_list() accordingly.

Bug 1 reproduction:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.

Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1

Bug 2 reproduction:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.

Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to nodes 4-5.
# Observation: No pages are demoted before OOM.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2

Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com
Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

5 files changed, 78 insertions(+), 38 deletions(-)
include/linux/cpuset.h (+3 -3)

···
 	task_unlock(current);
 }
 
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
···
 	return false;
 }
 
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
-	return true;
+	nodes_copy(*mask, node_states[N_MEMORY]);
 }
 #endif /* !CONFIG_CPUSETS */
include/linux/memcontrol.h (+3 -3)

···
 	rcu_read_unlock();
 }
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
···
 	return 0;
 }
 
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg,
+						  nodemask_t *mask)
 {
-	return true;
 }
 
 static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
kernel/cgroup/cpuset.c (+36 -18)

···
 	return allowed;
 }
 
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+/**
+ * cpuset_nodes_allowed - return effective_mems mask from a cgroup cpuset.
+ * @cgroup: pointer to struct cgroup.
+ * @mask: pointer to struct nodemask_t to be returned.
+ *
+ * Returns effective_mems mask from a cgroup cpuset if it is cgroup v2 and
+ * has cpuset subsys. Otherwise, returns node_states[N_MEMORY].
+ *
+ * This function intentionally avoids taking the cpuset_mutex or callback_lock
+ * when accessing effective_mems. This is because the obtained effective_mems
+ * is stale immediately after the query anyway (e.g., effective_mems is updated
+ * immediately after releasing the lock but before returning).
+ *
+ * As a result, returned @mask may be empty because cs->effective_mems can be
+ * rebound during this call. Besides, nodes in @mask are not guaranteed to be
+ * online due to hot plugins. Callers should check the mask for validity on
+ * return based on its subsequent use.
+ **/
+void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	bool allowed;
 
 	/*
 	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
 	 * and mems_allowed is likely to be empty even if we could get to it,
-	 * so return true to avoid taking a global lock on the empty check.
+	 * so return directly to avoid taking a global lock on the empty check.
 	 */
-	if (!cpuset_v2())
-		return true;
+	if (!cgroup || !cpuset_v2()) {
+		nodes_copy(*mask, node_states[N_MEMORY]);
+		return;
+	}
 
 	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
-	if (!css)
-		return true;
+	if (!css) {
+		nodes_copy(*mask, node_states[N_MEMORY]);
+		return;
+	}
 
 	/*
+	 * The reference taken via cgroup_get_e_css is sufficient to
+	 * protect css, but it does not imply safe accesses to effective_mems.
+	 *
 	 * Normally, accessing effective_mems would require the cpuset_mutex
-	 * or callback_lock - but node_isset is atomic and the reference
-	 * taken via cgroup_get_e_css is sufficient to protect css.
-	 *
-	 * Since this interface is intended for use by migration paths, we
-	 * relax locking here to avoid taking global locks - while accepting
-	 * there may be rare scenarios where the result may be innaccurate.
-	 *
-	 * Reclaim and migration are subject to these same race conditions, and
-	 * cannot make strong isolation guarantees, so this is acceptable.
+	 * or callback_lock - but the correctness of this information is stale
+	 * immediately after the query anyway. We do not acquire the lock
+	 * during this process to save lock contention in exchange for racing
+	 * against mems_allowed rebinds.
 	 */
 	cs = container_of(css, struct cpuset, css);
-	allowed = node_isset(nid, cs->effective_mems);
+	nodes_copy(*mask, cs->effective_mems);
 	css_put(css);
-	return allowed;
 }
 
 /**
mm/memcontrol.c (+14 -2)

···
 
 #endif /* CONFIG_SWAP */
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
 {
-	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+	nodemask_t allowed;
+
+	if (!memcg)
+		return;
+
+	/*
+	 * Since this interface is intended for use by migration paths, and
+	 * reclaim and migration are subject to race conditions such as changes
+	 * in effective_mems and hot-unpluging of nodes, inaccurate allowed
+	 * mask is acceptable.
+	 */
+	cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+	nodes_and(*mask, *mask, allowed);
 }
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
mm/vmscan.c (+22 -12)

···
 static bool can_demote(int nid, struct scan_control *sc,
 		       struct mem_cgroup *memcg)
 {
-	int demotion_nid;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	nodemask_t allowed_mask;
 
-	if (!numa_demotion_enabled)
+	if (!pgdat || !numa_demotion_enabled)
 		return false;
 	if (sc && sc->no_demotion)
 		return false;
 
-	demotion_nid = next_demotion_node(nid);
-	if (demotion_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return false;
 
-	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	/* Filter out nodes that are not in cgroup's mems_allowed. */
+	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+	return !nodes_empty(allowed_mask);
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
···
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      struct mem_cgroup *memcg)
 {
-	int target_nid = next_demotion_node(pgdat->node_id);
+	int target_nid;
 	unsigned int nr_succeeded;
 	nodemask_t allowed_mask;
···
 	 */
 	.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
 		__GFP_NOMEMALLOC | GFP_NOWAIT,
-	.nid = target_nid,
 	.nmask = &allowed_mask,
 	.reason = MR_DEMOTION,
 };
···
 	if (list_empty(demote_folios))
 		return 0;
 
-	if (target_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return 0;
 
-	node_get_allowed_targets(pgdat, &allowed_mask);
+	target_nid = next_demotion_node(pgdat->node_id);
+	if (target_nid == NUMA_NO_NODE)
+		/* No lower-tier nodes or nodes were hot-unplugged. */
+		return 0;
+	if (!node_isset(target_nid, allowed_mask))
+		target_nid = node_random(&allowed_mask);
+	mtc.nid = target_nid;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_folio, NULL,
···
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
 	nr_reclaimed += nr_demoted;
 	stat->nr_demoted += nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */