mm/page_alloc: effectively disable pcp with CONFIG_SMP=n

Patch series "mm/page_alloc: pcp locking cleanup".

This is a followup to the hotfix 038a102535eb ("mm/page_alloc: prevent pcp
corruption with SMP=n"), to simplify the code and deal with the original
issue properly. The previous RFC attempt [1] argued for changing the UP
spinlock implementation, which was discouraged, but thanks to David's
off-list suggestion, we can achieve the goal without changing the spinlock
implementation.

The main change in Patch 1 relies on the fact that on UP we don't need the
pcp lists for scalability, so just make them always bypassed during
alloc/free by making the pcp trylock an unconditional failure.

The various drain paths that use pcp_spin_lock_maybe_irqsave() continue to
exist but will never do any work in practice. In Patch 2 we can again
remove the irq saving from them that commit 038a102535eb added.

Besides simpler code with all the ugly UP_flags removed, we get less bloat
with CONFIG_SMP=n for mm/page_alloc.o as a result:

add/remove: 25/28 grow/shrink: 4/5 up/down: 2105/-6665 (-4560)
Function old new delta
get_page_from_freelist 5689 7248 +1559
free_unref_folios 2006 2324 +318
make_alloc_exact 270 286 +16
__zone_watermark_ok 306 322 +16
drain_pages_zone.isra 119 109 -10
decay_pcp_high 181 149 -32
setup_pcp_cacheinfo 193 147 -46
__free_frozen_pages 1339 1089 -250
alloc_pages_bulk_noprof 1054 419 -635
free_frozen_page_commit 907 - -907
try_to_claim_block 1975 - -1975
__rmqueue_pcplist 2614 - -2614
Total: Before=54624, After=50064, chg -8.35%

This patch (of 3):

The page allocator has been using a locking scheme for its percpu page
caches (pcp) based on spin_trylock() with no _irqsave() part. The trick
is that if we interrupt the locked section, we fail the trylock and just
fallback to the slowpath taking the zone lock. That's more expensive, but
rare, so we don't need to pay the irqsave/restore cost all the time in the
fastpaths.

It's similar to but not exactly local_trylock_t (which is also newer
anyway) because in some cases we do lock the pcp of a non-local cpu to
drain it, in a way that's cheaper than using IPI or queue_work_on().

The complication of this scheme has been UP non-debug spinlock
implementation which assumes spin_trylock() can't fail on UP and has no
state to track whether it's locked. It just doesn't anticipate this usage
scenario. So to work around that we disable IRQs only on UP, complicating
the implementation. Also recently we found years old bug in where we
didn't disable IRQs in related paths - see 038a102535eb ("mm/page_alloc:
prevent pcp corruption with SMP=n").

We can avoid this UP complication by realizing that we do not need the pcp
caching for scalability on UP in the first place. Removing it completely
with #ifdefs is not worth the trouble either. Just make
pcp_spin_trylock() return NULL unconditionally on CONFIG_SMP=n. This
makes the slowpaths unconditional, and we can remove the IRQ save/restore
handling in pcp_spin_trylock()/unlock() completely.

Link: https://lkml.kernel.org/r/20260227-b4-pcp-locking-cleanup-v1-0-f7e22e603447@kernel.org
Link: https://lkml.kernel.org/r/20260227-b4-pcp-locking-cleanup-v1-1-f7e22e603447@kernel.org
Link: https://lore.kernel.org/all/d762c46b-36f0-471a-b5b4-23c8cf5628ae@suse.cz/ [1]
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by

Vlastimil Babka and committed by

Andrew Morton 2 months ago a373f371 ca6969e0

+34 -58

1 changed file

expand all

page_alloc.c

+34 -58

mm/page_alloc.c

··· 94 94 static DEFINE_MUTEX(pcp_batch_high_lock); 95 95 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) 96 96 97 - #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) 98 - /* 99 - * On SMP, spin_trylock is sufficient protection. 100 - * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP. 101 - * Pass flags to a no-op inline function to typecheck and silence the unused 102 - * variable warning. 103 - */ 104 - static inline void __pcp_trylock_noop(unsigned long *flags) { } 105 - #define pcp_trylock_prepare(flags) __pcp_trylock_noop(&(flags)) 106 - #define pcp_trylock_finish(flags) __pcp_trylock_noop(&(flags)) 107 - #else 108 - 109 - /* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */ 110 - #define pcp_trylock_prepare(flags) local_irq_save(flags) 111 - #define pcp_trylock_finish(flags) local_irq_restore(flags) 112 - #endif 113 - 114 97 /* 115 98 * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid 116 99 * a migration causing the wrong PCP to be locked and remote memory being ··· 132 149 pcpu_task_unpin(); \ 133 150 }) 134 151 135 - /* struct per_cpu_pages specific helpers. */ 136 - #define pcp_spin_trylock(ptr, UP_flags) \ 137 - ({ \ 138 - struct per_cpu_pages *__ret; \ 139 - pcp_trylock_prepare(UP_flags); \ 140 - __ret = pcpu_spin_trylock(struct per_cpu_pages, lock, ptr); \ 141 - if (!__ret) \ 142 - pcp_trylock_finish(UP_flags); \ 143 - __ret; \ 144 - }) 152 + /* struct per_cpu_pages specific helpers.*/ 153 + #ifdef CONFIG_SMP 154 + #define pcp_spin_trylock(ptr) \ 155 + pcpu_spin_trylock(struct per_cpu_pages, lock, ptr) 145 156 146 - #define pcp_spin_unlock(ptr, UP_flags) \ 147 - ({ \ 148 - pcpu_spin_unlock(lock, ptr); \ 149 - pcp_trylock_finish(UP_flags); \ 150 - }) 157 + #define pcp_spin_unlock(ptr) \ 158 + pcpu_spin_unlock(lock, ptr) 151 159 152 160 /* 153 - * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (for i.e. 154 - * a potentially remote cpu drain) and get interrupted by an operation that 155 - * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP 156 - * spinlock assumptions making the trylock a no-op. So we have to turn that 157 - * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no 158 - * remote cpu's so we can only be locking the only existing local one. 161 + * On CONFIG_SMP=n the UP implementation of spin_trylock() never fails and thus 162 + * is not compatible with our locking scheme. However we do not need pcp for 163 + * scalability in the first place, so just make all the trylocks fail and take 164 + * the slow path unconditionally. 159 165 */ 166 + #else 167 + #define pcp_spin_trylock(ptr) \ 168 + NULL 169 + 170 + #define pcp_spin_unlock(ptr) \ 171 + BUG_ON(1) 172 + #endif 173 + 160 174 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) 161 175 static inline void __flags_noop(unsigned long *flags) { } 162 176 #define pcp_spin_lock_maybe_irqsave(ptr, flags) \ ··· 2838 2858 */ 2839 2859 static bool free_frozen_page_commit(struct zone *zone, 2840 2860 struct per_cpu_pages *pcp, struct page *page, int migratetype, 2841 - unsigned int order, fpi_t fpi_flags, unsigned long *UP_flags) 2861 + unsigned int order, fpi_t fpi_flags) 2842 2862 { 2843 2863 int high, batch; 2844 2864 int to_free, to_free_batched; ··· 2898 2918 if (to_free == 0 || pcp->count == 0) 2899 2919 break; 2900 2920 2901 - pcp_spin_unlock(pcp, *UP_flags); 2921 + pcp_spin_unlock(pcp); 2902 2922 2903 - pcp = pcp_spin_trylock(zone->per_cpu_pageset, *UP_flags); 2923 + pcp = pcp_spin_trylock(zone->per_cpu_pageset); 2904 2924 if (!pcp) { 2905 2925 ret = false; 2906 2926 break; ··· 2912 2932 * returned in an unlocked state. 2913 2933 */ 2914 2934 if (smp_processor_id() != cpu) { 2915 - pcp_spin_unlock(pcp, *UP_flags); 2935 + pcp_spin_unlock(pcp); 2916 2936 ret = false; 2917 2937 break; 2918 2938 } ··· 2944 2964 static void __free_frozen_pages(struct page *page, unsigned int order, 2945 2965 fpi_t fpi_flags) 2946 2966 { 2947 - unsigned long UP_flags; 2948 2967 struct per_cpu_pages *pcp; 2949 2968 struct zone *zone; 2950 2969 unsigned long pfn = page_to_pfn(page); ··· 2979 3000 add_page_to_zone_llist(zone, page, order); 2980 3001 return; 2981 3002 } 2982 - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags); 3003 + pcp = pcp_spin_trylock(zone->per_cpu_pageset); 2983 3004 if (pcp) { 2984 3005 if (!free_frozen_page_commit(zone, pcp, page, migratetype, 2985 - order, fpi_flags, &UP_flags)) 3006 + order, fpi_flags)) 2986 3007 return; 2987 - pcp_spin_unlock(pcp, UP_flags); 3008 + pcp_spin_unlock(pcp); 2988 3009 } else { 2989 3010 free_one_page(zone, page, pfn, order, fpi_flags); 2990 3011 } ··· 3005 3026 */ 3006 3027 void free_unref_folios(struct folio_batch *folios) 3007 3028 { 3008 - unsigned long UP_flags; 3009 3029 struct per_cpu_pages *pcp = NULL; 3010 3030 struct zone *locked_zone = NULL; 3011 3031 int i, j; ··· 3047 3069 if (zone != locked_zone || 3048 3070 is_migrate_isolate(migratetype)) { 3049 3071 if (pcp) { 3050 - pcp_spin_unlock(pcp, UP_flags); 3072 + pcp_spin_unlock(pcp); 3051 3073 locked_zone = NULL; 3052 3074 pcp = NULL; 3053 3075 } ··· 3066 3088 * trylock is necessary as folios may be getting freed 3067 3089 * from IRQ or SoftIRQ context after an IO completion. 3068 3090 */ 3069 - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags); 3091 + pcp = pcp_spin_trylock(zone->per_cpu_pageset); 3070 3092 if (unlikely(!pcp)) { 3071 3093 free_one_page(zone, &folio->page, pfn, 3072 3094 order, FPI_NONE); ··· 3084 3106 3085 3107 trace_mm_page_free_batched(&folio->page); 3086 3108 if (!free_frozen_page_commit(zone, pcp, &folio->page, 3087 - migratetype, order, FPI_NONE, &UP_flags)) { 3109 + migratetype, order, FPI_NONE)) { 3088 3110 pcp = NULL; 3089 3111 locked_zone = NULL; 3090 3112 } 3091 3113 } 3092 3114 3093 3115 if (pcp) 3094 - pcp_spin_unlock(pcp, UP_flags); 3116 + pcp_spin_unlock(pcp); 3095 3117 folio_batch_reinit(folios); 3096 3118 } 3097 3119 ··· 3349 3371 struct per_cpu_pages *pcp; 3350 3372 struct list_head *list; 3351 3373 struct page *page; 3352 - unsigned long UP_flags; 3353 3374 3354 3375 /* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */ 3355 - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags); 3376 + pcp = pcp_spin_trylock(zone->per_cpu_pageset); 3356 3377 if (!pcp) 3357 3378 return NULL; 3358 3379 ··· 3363 3386 pcp->free_count >>= 1; 3364 3387 list = &pcp->lists[order_to_pindex(migratetype, order)]; 3365 3388 page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list); 3366 - pcp_spin_unlock(pcp, UP_flags); 3389 + pcp_spin_unlock(pcp); 3367 3390 if (page) { 3368 3391 __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); 3369 3392 zone_statistics(preferred_zone, zone, 1); ··· 5044 5067 struct page **page_array) 5045 5068 { 5046 5069 struct page *page; 5047 - unsigned long UP_flags; 5048 5070 struct zone *zone; 5049 5071 struct zoneref *z; 5050 5072 struct per_cpu_pages *pcp; ··· 5137 5161 goto failed; 5138 5162 5139 5163 /* spin_trylock may fail due to a parallel drain or IRQ reentrancy. */ 5140 - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags); 5164 + pcp = pcp_spin_trylock(zone->per_cpu_pageset); 5141 5165 if (!pcp) 5142 5166 goto failed; 5143 5167 ··· 5156 5180 if (unlikely(!page)) { 5157 5181 /* Try and allocate at least one page */ 5158 5182 if (!nr_account) { 5159 - pcp_spin_unlock(pcp, UP_flags); 5183 + pcp_spin_unlock(pcp); 5160 5184 goto failed; 5161 5185 } 5162 5186 break; ··· 5168 5192 page_array[nr_populated++] = page; 5169 5193 } 5170 5194 5171 - pcp_spin_unlock(pcp, UP_flags); 5195 + pcp_spin_unlock(pcp); 5172 5196 5173 5197 __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account); 5174 5198 zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);

Configure Feed

Configure Feed