Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

mm: always call rmap_walk() on locked folios

Patch series "Improve UFFDIO_MOVE scalability by removing anon_vma lock", v2.

Userfaultfd has a scalability issue in its UFFDIO_MOVE ioctl, which is
heavily used in Android as its java garbage collector uses it for
concurrent heap compaction.

The issue arises because UFFDIO_MOVE updates folio->mapping to an anon_vma
with a different root, in order to move the folio from a src VMA to dst
VMA. It performs the operation with the folio locked, but this is
insufficient, because rmap_walk() can be performed on non-KSM anonymous
folios without folio lock.

This means that UFFDIO_MOVE has to acquire the anon_vma write lock of the
root anon_vma belonging to the folio it wishes to move.

This causes scalability bottleneck when multiple threads perform
UFFDIO_MOVE simultanously on distinct pages of the same src VMA. In field
traces of arm64 android devices, we have observed janky user interactions
due to long (sometimes over ~50ms) uninterruptible sleeps on main UI
thread caused by anon_vma lock contention in UFFDIO_MOVE. This is
particularly severe during the beginning of GC's compaction phase when it
is likely to have multiple threads involved.

This patch resolves the issue by removing the exception in rmap_walk() for
non-KSM anon folios by ensuring that all folios are locked during rmap
walk. This is less problematic than it might seem, as the only major
caller which utilises this mode is shrink_active_list(), which is covered
in detail in the first patch of this series.

As a result of changing our approach to locking, we can remove all the
code that took steps to acquire an anon_vma write lock instead of a folio
lock. This results in a significant simplification and scalability
improvement of the code (currently only in UFFDIO_MOVE). Furthermore, as
a side-effect, folio_lock_anon_vma_read() gets simpler as we don't need to
worry that folio->mapping may have changed under us.


This patch (of 2):

Guarantee that rmap_walk() is called on locked folios so that threads
changing folio->mapping and folio->index for non-KSM anon folios can
serialize on fine-grained folio lock rather than anon_vma lock. Other
folio types are already always locked before rmap_walk(). With this, we
are going from 'not necessarily' locking the non-KSM anon folio to
'definitely' locking it during rmap walks.

This patch is in preparation for removing anon_vma write-lock from
UFFDIO_MOVE.

With this patch, three functions are now expected to be called with a
locked folio. To be careful of not missing any case, here is the
exhaustive list of all their callers.

1) rmap_walk() is called from:

a) folio_referenced()
b) damon_folio_mkold()
c) damon_folio_young()
d) page_idle_clear_pte_refs()
e) try_to_unmap()
f) try_to_migrate()
g) folio_mkclean()
h) remove_migration_ptes()

In the above list, first 4 are changed in this patch to try-lock non-KSM
anon folios, similar to other types of folios. The remaining functions in
the list already hold folio lock when calling rmap_walk().

2) folio_lock_anon_vma_read() is called from following functions:

a) collect_procs_anon()
b) page_idle_clear_pte_refs()
c) damon_folio_mkold()
d) damon_folio_young()
e) folio_referenced()
f) try_to_unmap()
g) try_to_migrate()

All the functions in above list, except collect_procs_anon(), are covered
by the rmap_walk() list above. For collect_procs_anon(), with
kill_procs_now() changed to take folio lock in this patch ensures that all
callers of folio_lock_anon_vma_read() now hold the lock.

3) folio_get_anon_vma() is called from following functions, all of which
already hold the folio lock:

a) move_pages_huge_pmd()
b) __folio_split()
c) move_pages_ptes()
d) migrate_folio_unmap()
e) unmap_and_move_huge_page()

Functionally, this patch doesn't break the logic because rmap walkers
generally do some other check to see if what is expected to mapped did
happen so it's fine, or otherwise treat things as best-effort.

Among the 4 functions changed in this patch, folio_referenced() is the
only core-mm function, and is also frequently accessed. To assess the
impact of locking non-KSM anon folios in
shrink_active_list()->folio_referenced() path, we performed an app cycle
test on an arm64 android device. During the whole duration of the test
there were over 140k invocations of shrink_active_list(), out of which
over 29k had at least one non-KSM anon folio on which folio_referenced()
was called. In none of these invocations folio_trylock() failed.

Of course, we now take a lock where we wouldn't previously have. In the
past it would have had a major impact in causing a CoW write fault to copy
a page in do_wp_page(), as commit 09854ba94c6a ("mm: do_wp_page()
simplification") caused a failure to obtain folio lock to result in a page
copy even if one wasn't necessary.

However, since commit 6c287605fd56 ("mm: remember exclusively mapped
anonymous pages with PG_anon_exclusive"), and the introduction of the
folio anon exclusive flag, this issue is significantly mitigated.

The only case remaining that we might worry about from this perspective is
that of read-only folios immediately after fork where the anon exclusive
bit will not have been set yet.

We note however in the case of read-only just-forked folios that
wp_can_reuse_anon_folio() will notice the raised reference count
established by shrink_active_list() via isolate_lru_folios() and refuse to
reuse in any case, so this will in fact have no impact - the folio lock is
ultimately immaterial here.

All-in-all it appears that there is little opportunity for meaningful
negative impact from this change.

Link: https://lkml.kernel.org/r/20250923071019.775806-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20250923071019.775806-2-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by

Lokesh Gidra and committed by
Andrew Morton
95b34d66 eb02f14c

+21 -48
+4 -12
mm/damon/ops-common.c
··· 162 162 .rmap_one = damon_folio_mkold_one, 163 163 .anon_lock = folio_lock_anon_vma_read, 164 164 }; 165 - bool need_lock; 166 165 167 166 if (!folio_mapped(folio) || !folio_raw_mapping(folio)) { 168 167 folio_set_idle(folio); 169 168 return; 170 169 } 171 170 172 - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio); 173 - if (need_lock && !folio_trylock(folio)) 171 + if (!folio_trylock(folio)) 174 172 return; 175 173 176 174 rmap_walk(folio, &rwc); 177 - 178 - if (need_lock) 179 - folio_unlock(folio); 175 + folio_unlock(folio); 180 176 181 177 } 182 178 ··· 224 228 .rmap_one = damon_folio_young_one, 225 229 .anon_lock = folio_lock_anon_vma_read, 226 230 }; 227 - bool need_lock; 228 231 229 232 if (!folio_mapped(folio) || !folio_raw_mapping(folio)) { 230 233 if (folio_test_idle(folio)) ··· 232 237 return true; 233 238 } 234 239 235 - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio); 236 - if (need_lock && !folio_trylock(folio)) 240 + if (!folio_trylock(folio)) 237 241 return false; 238 242 239 243 rmap_walk(folio, &rwc); 240 - 241 - if (need_lock) 242 - folio_unlock(folio); 244 + folio_unlock(folio); 243 245 244 246 return accessed; 245 247 }
+3
mm/memory-failure.c
··· 2140 2140 { 2141 2141 LIST_HEAD(tokill); 2142 2142 2143 + folio_lock(folio); 2143 2144 collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED); 2145 + folio_unlock(folio); 2146 + 2144 2147 kill_procs(&tokill, true, pfn, flags); 2145 2148 } 2146 2149
+2 -6
mm/page_idle.c
··· 101 101 .rmap_one = page_idle_clear_pte_refs_one, 102 102 .anon_lock = folio_lock_anon_vma_read, 103 103 }; 104 - bool need_lock; 105 104 106 105 if (!folio_mapped(folio) || !folio_raw_mapping(folio)) 107 106 return; 108 107 109 - need_lock = !folio_test_anon(folio) || folio_test_ksm(folio); 110 - if (need_lock && !folio_trylock(folio)) 108 + if (!folio_trylock(folio)) 111 109 return; 112 110 113 111 rmap_walk(folio, &rwc); 114 - 115 - if (need_lock) 116 - folio_unlock(folio); 112 + folio_unlock(folio); 117 113 } 118 114 119 115 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
+12 -30
mm/rmap.c
··· 489 489 * if there is a mapcount, we can dereference the anon_vma after observing 490 490 * those. 491 491 * 492 - * NOTE: the caller should normally hold folio lock when calling this. If 493 - * not, the caller needs to double check the anon_vma didn't change after 494 - * taking the anon_vma lock for either read or write (UFFDIO_MOVE can modify it 495 - * concurrently without folio lock protection). See folio_lock_anon_vma_read() 496 - * which has already covered that, and comment above remap_pages(). 492 + * NOTE: the caller should hold folio lock when calling this. 497 493 */ 498 494 struct anon_vma *folio_get_anon_vma(const struct folio *folio) 499 495 { 500 496 struct anon_vma *anon_vma = NULL; 501 497 unsigned long anon_mapping; 498 + 499 + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); 502 500 503 501 rcu_read_lock(); 504 502 anon_mapping = (unsigned long)READ_ONCE(folio->mapping); ··· 544 546 struct anon_vma *root_anon_vma; 545 547 unsigned long anon_mapping; 546 548 547 - retry: 549 + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); 550 + 548 551 rcu_read_lock(); 549 552 anon_mapping = (unsigned long)READ_ONCE(folio->mapping); 550 553 if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON) ··· 556 557 anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON); 557 558 root_anon_vma = READ_ONCE(anon_vma->root); 558 559 if (down_read_trylock(&root_anon_vma->rwsem)) { 559 - /* 560 - * folio_move_anon_rmap() might have changed the anon_vma as we 561 - * might not hold the folio lock here. 562 - */ 563 - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != 564 - anon_mapping)) { 565 - up_read(&root_anon_vma->rwsem); 566 - rcu_read_unlock(); 567 - goto retry; 568 - } 569 - 570 560 /* 571 561 * If the folio is still mapped, then this anon_vma is still 572 562 * its anon_vma, and holding the mutex ensures that it will ··· 589 601 /* we pinned the anon_vma, its safe to sleep */ 590 602 rcu_read_unlock(); 591 603 anon_vma_lock_read(anon_vma); 592 - 593 - /* 594 - * folio_move_anon_rmap() might have changed the anon_vma as we might 595 - * not hold the folio lock here. 596 - */ 597 - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != 598 - anon_mapping)) { 599 - anon_vma_unlock_read(anon_vma); 600 - put_anon_vma(anon_vma); 601 - anon_vma = NULL; 602 - goto retry; 603 - } 604 604 605 605 if (atomic_dec_and_test(&anon_vma->refcount)) { 606 606 /* ··· 964 988 if (!folio_raw_mapping(folio)) 965 989 return 0; 966 990 967 - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { 991 + if (!is_locked) { 968 992 we_locked = folio_trylock(folio); 969 993 if (!we_locked) 970 994 return 1; ··· 2803 2827 struct anon_vma *anon_vma; 2804 2828 pgoff_t pgoff_start, pgoff_end; 2805 2829 struct anon_vma_chain *avc; 2830 + 2831 + /* 2832 + * The folio lock ensures that folio->mapping can't be changed under us 2833 + * to an anon_vma with different root. 2834 + */ 2835 + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); 2806 2836 2807 2837 if (locked) { 2808 2838 anon_vma = folio_anon_vma(folio);