Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: memory-tiering: fix PGPROMOTE_CANDIDATE counting

Goto-san reported confusing pgpromote statistics where the
pgpromote_success count significantly exceeded pgpromote_candidate.

On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid # terminate the 1st memhog
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing

After a few seconds, we observed `pgpromote_candidate < pgpromote_success`:
$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 0
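
The comparison above can be automated. The following is a minimal userspace sketch, not part of the patch: `vmstat_lookup()` and `pgpromote_consistent()` are hypothetical helper names, and the check assumes that every successfully promoted page is first counted by one of the candidate counters, which is the expectation this report is based on rather than a documented guarantee. The helpers operate on a text buffer in the `name value` layout of /proc/vmstat, so the caller is expected to read the file first.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" line in a /proc/vmstat-style text buffer.
 * Returns 0 and stores the value on success, -1 if the counter is absent
 * or its value cannot be parsed.
 */
int vmstat_lookup(const char *text, const char *name,
		  unsigned long long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Require a space after the name so that a longer counter
		 * name sharing the same prefix does not match. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return sscanf(line + len, "%llu", value) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical consistency check: promoted pages should not exceed the
 * sum of the candidate counters.
 * Returns 1 if consistent, 0 if not, -1 if a required counter is missing.
 */
int pgpromote_consistent(const char *text)
{
	unsigned long long success, cand, nrl = 0;

	if (vmstat_lookup(text, "pgpromote_success", &success) ||
	    vmstat_lookup(text, "pgpromote_candidate", &cand))
		return -1;
	/* Kernels without the NRL counter: treat it as 0. */
	(void)vmstat_lookup(text, "pgpromote_candidate_nrl", &nrl);
	return success <= cand + nrl;
}
```

Fed the two lines shown above, `pgpromote_consistent()` returns 0, which flags the inconsistency that was reported.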

In this scenario, after the first memhog is terminated, the conditions for
pgdat_free_space_enough() are quickly met, which triggers promotion.
However, the migrated pages are counted only in PGPROMOTE_SUCCESS,
not in PGPROMOTE_CANDIDATE.

To fix these confusing statistics, introduce PGPROMOTE_CANDIDATE_NRL to
count the promotion candidates that were previously missed. These pages
are deliberately not added to PGPROMOTE_CANDIDATE, so that the existing
algorithm and performance of the promotion rate limit remain unchanged.
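
The resulting accounting can be illustrated with a simplified userspace model. This is a sketch under stated assumptions, not the kernel code: `struct node_stats`, `should_promote()` and the `budget_left` parameter are hypothetical, the real rate limiter works on a per-second window rather than a fixed budget, and the model assumes PGPROMOTE_CANDIDATE is incremented inside the rate limiter before the limit is applied.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two per-node candidate counters. */
struct node_stats {
	long pgpromote_candidate;	/* subject to the rate limit */
	long pgpromote_candidate_nrl;	/* not rate-limited (NRL) */
};

/*
 * Simplified model of the promotion decision after the patch.
 * free_space_enough stands in for pgdat_free_space_enough(),
 * latency/threshold for the hint fault latency check, and
 * budget_left for the pages still allowed by the rate limit.
 */
bool should_promote(struct node_stats *st, bool free_space_enough,
		    unsigned int latency, unsigned int threshold,
		    long *budget_left, long nr_pages)
{
	if (free_space_enough) {
		/* Fast path: no hotness check, no rate limit, but counted. */
		st->pgpromote_candidate_nrl += nr_pages;
		return true;
	}

	if (latency >= threshold)
		return false;	/* not hot enough, so not a candidate */

	/* Hot page: counted as a candidate before the limit is applied. */
	st->pgpromote_candidate += nr_pages;
	if (*budget_left < nr_pages)
		return false;	/* rate limited */
	*budget_left -= nr_pages;
	return true;
}
```

In this model only the fast path touches `pgpromote_candidate_nrl`, so whatever consumes `pgpromote_candidate` for rate limiting or threshold adjustment sees the same values as before the change.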

Link: https://lkml.kernel.org/r/20250901090122.124262-1-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/20250729035101.1601407-1-ruansy.fnst@fujitsu.com
Fixes: c6833e10008f ("memory tiering: rate limit NUMA migration throughput")
Co-developed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com>
Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Ruan Shiyang and committed by Andrew Morton
337135e6 915a4022

+19 -3
+15 -1
include/linux/mmzone.h
@@ -234,7 +234,21 @@
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,	/* promote successfully */
-	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
+	/**
+	 * Candidate pages for promotion based on hint fault latency. This
+	 * counter is used to control the promotion rate and adjust the hot
+	 * threshold.
+	 */
+	PGPROMOTE_CANDIDATE,
+	/**
+	 * Not rate-limited (NRL) candidate pages for those can be promoted
+	 * without considering hot threshold because of enough free pages in
+	 * fast-tier node. These promotions bypass the regular hotness checks
+	 * and do NOT influence the promotion rate-limiter or
+	 * threshold-adjustment logic.
+	 * This is for statistics/monitoring purposes.
+	 */
+	PGPROMOTE_CANDIDATE_NRL,
 #endif
 	/* PGDEMOTE_*: pages demoted */
 	PGDEMOTE_KSWAPD,
+3 -2
kernel/sched/fair.c
@@ -1923,11 +1923,13 @@
 		struct pglist_data *pgdat;
 		unsigned long rate_limit;
 		unsigned int latency, th, def_th;
+		long nr = folio_nr_pages(folio);
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat)) {
 			/* workload changed, reset hot threshold */
 			pgdat->nbp_threshold = 0;
+			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr);
 			return true;
 		}
 
@@ -1943,8 +1941,7 @@
 		if (latency >= th)
 			return false;
 
-		return !numa_promotion_rate_limit(pgdat, rate_limit,
-						  folio_nr_pages(folio));
+		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
 	}
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
+1
mm/vmstat.c
@@ -1280,6 +1280,7 @@
 #ifdef CONFIG_NUMA_BALANCING
 	[I(PGPROMOTE_SUCCESS)] = "pgpromote_success",
 	[I(PGPROMOTE_CANDIDATE)] = "pgpromote_candidate",
+	[I(PGPROMOTE_CANDIDATE_NRL)] = "pgpromote_candidate_nrl",
 #endif
 	[I(PGDEMOTE_KSWAPD)] = "pgdemote_kswapd",
 	[I(PGDEMOTE_DIRECT)] = "pgdemote_direct",