
mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection

Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v5.

Motivation & Approach
=====================

While testing workloads with high sustained memory pressure on large
machines in the Meta fleet (1 TB of memory, 316 CPUs), we saw an
unexpectedly high number of softlockups. Further investigation showed
that the zone lock in free_pcppages_bulk was being held for a long time,
and that the function was called to free 2k+ pages over 100 times just
during boot.

This starves other processes contending for the zone lock, which can lead
to the system stalling as multiple threads cannot make progress without
the lock. These issues manifest as warnings:

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:              hardirqs   softirqs   csw/system
[ 4512.638793] rcu:      number:        0        145          0
[ 4512.651177] rcu:     cputime:       30      10410        174   ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)

While these warnings don't indicate a crash or a kernel panic, they do
point to the underlying issue of lock contention. To prevent starvation
on both locks, batch the freeing of pages using pcp->batch.

Because free_pcppages_bulk is called with the pcp lock held and acquires
the zone lock itself, relinquishing and reacquiring the locks is only
effective when both of them are broken together (unless the system was
built with queued spinlocks). Thus, instead of modifying
free_pcppages_bulk to break both locks, batch the freeing from its
callers.
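
To make the intended shape of that change concrete, the following is a
rough sketch only: the helper name free_pcppages_batched is hypothetical
and this is not the code added by this series. It shows a caller freeing
pages in pcp->batch sized chunks, so that the pcp lock (and the zone lock
taken inside free_pcppages_bulk) is dropped and reacquired between chunks:

/*
 * Illustrative sketch only -- not the code added by this series.
 * Freeing in pcp->batch sized chunks means both the pcp lock and the
 * zone lock (taken inside free_pcppages_bulk()) are released and
 * reacquired between chunks, giving other CPUs a chance to make
 * progress.
 */
static void free_pcppages_batched(struct zone *zone,
                                  struct per_cpu_pages *pcp, int count)
{
        int batch = READ_ONCE(pcp->batch);

        while (count > 0) {
                int to_free = min(count, batch);

                spin_lock(&pcp->lock);
                free_pcppages_bulk(zone, to_free, pcp, 0);
                spin_unlock(&pcp->lock);
                count -= to_free;
        }
}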

A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.

Testing
=======
The following are a few synthetic benchmarks, run on three machines. The
first is a large machine with 754GiB of memory and 316 processors. The
second is a smaller machine with 251GiB of memory and 176 processors. The
third is the smallest of the three, with 62GiB of memory and 36
processors.

On all machines, I kick off a kernel build with -j$(nproc).
Negative delta is better (faster compilation).

Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta (%) |
+------------+---------------+-----------+
| real       |        0.8070 |  - 1.4865 |
| user       |        0.2823 |  + 0.4081 |
| sys        |        5.0267 |  -11.8737 |
+------------+---------------+-----------+

Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta (%) |
+------------+---------------+-----------+
| real       |        0.2806 |   +0.0351 |
| user       |        0.0994 |   +0.3170 |
| sys        |        0.6229 |   -0.6277 |
+------------+---------------+-----------+

Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta (%) |
+------------+---------------+-----------+
| real       |        0.1503 |   -2.6585 |
| user       |        0.0431 |   -2.2984 |
| sys        |        0.1870 |   -3.2013 |
+------------+---------------+-----------+

Here, variation is the coefficient of variation, i.e. standard deviation
/ mean.

Based on these results, the amount of lock contention this reduces seems
to vary by machine. For the largest and smallest machines that I ran the
tests on, there is quite a significant reduction, most visible in system
time. There are also some performance increases visible from userspace.

Interestingly, the performance gains don't scale with the size of the
machine; instead, there is a dip in the gains for the medium-sized
machine. One possible theory is that because the high watermark depends
on both memory and the number of local CPUs, what impacts zone lock
contention the most is not these individual values, but rather the
memory-to-processor ratio.
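
As a rough sanity check of that theory using only the machine sizes
above, the memory-to-processor ratio is indeed lowest on the medium
machine, which is also where the gains dip:

    large:  754 GiB / 316 CPUs ~= 2.4 GiB per CPU
    medium: 251 GiB / 176 CPUs ~= 1.4 GiB per CPU
    small:   62 GiB /  36 CPUs ~= 1.7 GiB per CPU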


This patch (of 5):

Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.

However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.

Simplify the code by instead returning a bool, indicating whether any
updates were made.

In addition, simplify fold_diff and decay_pcp_high to return a bool
for the same reason.
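
For reference, the kind of caller this targets looks roughly like the
following. This is a simplified sketch of the vmstat_update() pattern,
not a verbatim copy of the upstream function; the return value only
decides whether the deferred per-CPU work keeps getting rescheduled:

/*
 * Simplified sketch of the vmstat_update() pattern -- not a verbatim
 * copy of the upstream function. The caller only needs to know
 * *whether* anything changed.
 */
static void vmstat_update(struct work_struct *w)
{
        if (refresh_cpu_vm_stats(true)) {
                /*
                 * Counters were updated, so more updates are likely in
                 * the near future: keep the per-CPU vmstat work
                 * scheduled.
                 */
                queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
                                this_cpu_ptr(&vmstat_work),
                                round_jiffies_relative(sysctl_stat_interval));
        }
}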

Link: https://lkml.kernel.org/r/20251014145011.3427205-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251014145011.3427205-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Joshua Hahn, committed by Andrew Morton (0acc67c4, f66e2727)

20 insertions(+), 18 deletions(-)

include/linux/gfp.h (+1 -1):

 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init_cpuhp(void);
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);

mm/page_alloc.c (+4 -4):

  * Called from the vmstat counter updater to decay the PCP high.
  * Return whether there are addition works to do.
  */
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
         int high_min, to_drain, batch;
-        int todo = 0;
+        bool todo = false;
 
         high_min = READ_ONCE(pcp->high_min);
         batch = READ_ONCE(pcp->batch);
...
                 pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
                                  pcp->high - (pcp->high >> 3), high_min);
                 if (pcp->high > high_min)
-                        todo++;
+                        todo = true;
         }
 
         to_drain = pcp->count - pcp->high;
...
                 spin_lock(&pcp->lock);
                 free_pcppages_bulk(zone, to_drain, pcp, 0);
                 spin_unlock(&pcp->lock);
-                todo++;
+                todo = true;
         }
 
         return todo;

mm/vmstat.c (+15 -13):

 
 /*
  * Fold a differential into the global counters.
- * Returns the number of counters updated.
+ * Returns whether counters were updated.
  */
 static int fold_diff(int *zone_diff, int *node_diff)
 {
         int i;
-        int changes = 0;
+        bool changed = false;
 
         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                 if (zone_diff[i]) {
                         atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
-                        changes++;
+                        changed = true;
                 }
 
         for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
                 if (node_diff[i]) {
                         atomic_long_add(node_diff[i], &vm_node_stat[i]);
-                        changes++;
+                        changed = true;
                 }
-        return changes;
+        return changed;
 }
 
 /*
...
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
  *
- * The function returns the number of global counters updated.
+ * The function returns whether global counters were updated.
  */
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static bool refresh_cpu_vm_stats(bool do_pagesets)
 {
         struct pglist_data *pgdat;
         struct zone *zone;
         int i;
         int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
         int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
-        int changes = 0;
+        bool changed = false;
 
         for_each_populated_zone(zone) {
                 struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
...
                 if (do_pagesets) {
                         cond_resched();
 
-                        changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+                        if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
+                                changed = true;
 #ifdef CONFIG_NUMA
                         /*
                          * Deal with draining the remote pageset of this
...
                         }
 
                         if (__this_cpu_dec_return(pcp->expire)) {
-                                changes++;
+                                changed = true;
                                 continue;
                         }
 
                         if (__this_cpu_read(pcp->count)) {
                                 drain_zone_pages(zone, this_cpu_ptr(pcp));
-                                changes++;
+                                changed = true;
                         }
 #endif
                 }
...
                 }
         }
 
-        changes += fold_diff(global_zone_diff, global_node_diff);
-        return changes;
+        if (fold_diff(global_zone_diff, global_node_diff))
+                changed = true;
+        return changed;
 }
 
 /*