mm/shmem, swap: improve cached mTHP handling and fix potential hang

The current swap-in code assumes that, when a swap entry in a shmem mapping
is order 0, its cached folio (if present) must be order 0 too, which
turns out to be not always correct.

The problem is that shmem_split_large_entry is called before verifying that
the folio will eventually be swapped in. One possible race is:

    CPU1                                    CPU2
 shmem_swapin_folio
 /* swap in of order > 0 swap entry S1 */
   folio = swap_cache_get_folio
   /* folio = NULL */
   order = xa_get_order
   /* order > 0 */
   folio = shmem_swap_alloc_folio
   /* mTHP alloc failure, folio = NULL */
   <... Interrupted ...>
                                         shmem_swapin_folio
                                         /* S1 is swapped in */
                                         shmem_writeout
                                         /* S1 is swapped out, folio cached */
   shmem_split_large_entry(..., S1)
   /* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
folio lookup will return a folio with order > 0. The
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check will
always be true, causing swap-in to return -EEXIST.
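The stuck state described above can be modeled in a few lines of userspace C. This is only a sketch: `old_order_check` and `retry_old_swapin` are hypothetical names, and the bounded retry loop assumes the caller simply retries on -EEXIST without anything changing the mapping or the cached folio in between.

```c
#include <assert.h>
#include <errno.h>

/* Model of the old check: mapping entry order must equal folio order. */
static int old_order_check(unsigned int entry_order, unsigned int folio_order)
{
	if (entry_order != folio_order)
		return -EEXIST;
	return 0;
}

/*
 * Hypothetical retry loop: neither order changes between attempts, so a
 * mismatch is never resolved no matter how often the check is repeated.
 */
static int retry_old_swapin(unsigned int entry_order, unsigned int folio_order,
			    int max_tries)
{
	assert(max_tries > 0);

	for (int i = 0; i < max_tries; i++) {
		if (old_order_check(entry_order, folio_order) == 0)
			return 0;
	}
	return -EEXIST;
}
```

With a split entry (order 0) and a cached large folio (say order 2), `retry_old_swapin(0, 2, n)` fails for any `n`, which is the hang in miniature.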

This looks fragile. So fix it by allowing a larger folio to be seen in the
swap cache, and by checking that the whole shmem mapping range covered by
the swapin has the right swap values upon inserting the folio. Also drop the
redundant tree walks before the insertion.
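The range check on insertion can be illustrated with a userspace model. This is a sketch under stated assumptions, not the kernel code: `struct slot_entry` and `range_matches_swap` are hypothetical stand-ins for the XArray entries and the conflict walk, and only the case where a swap entry is expected is modeled (not the empty range).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one mapping entry: a swap value and its order. */
struct slot_entry {
	unsigned long swap_val;	/* first swap value the entry covers */
	unsigned int order;	/* the entry covers 1 << order indices */
};

/*
 * Return true if the entries, walked in index order, cover exactly nr
 * consecutive swap values starting at first_swap, with no holes.
 */
static bool range_matches_swap(const struct slot_entry *entries, size_t count,
			       unsigned long first_swap, unsigned long nr)
{
	unsigned long iter = first_swap;

	assert(nr > 0);

	for (size_t i = 0; i < count; i++) {
		if (entries[i].swap_val != iter)
			return false;	/* unexpected or out-of-sequence entry */
		iter += 1UL << entries[i].order;
	}
	/* The walk must have accounted for every index in the range. */
	return iter - nr == first_swap;
}
```

A range made of mixed orders (for example two order 0 entries followed by one order 1 entry) passes for a 4-page folio as long as the swap values are consecutive, while a range that stops short fails the final comparison.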

This will actually improve performance, as it avoids two redundant XArray
tree walks in the hot path, and the only side effect is that in the
failure path, shmem may redundantly reallocate a few folios causing
temporary slight memory pressure.

Also worth noting: it may seem that the order and value check before
inserting helps reduce lock contention, which is not true. The swap
cache layer ensures that a raced swapin will either see a swap cache folio or
fail to do a swapin (we have the SWAP_HAS_CACHE bit even if the swap cache is
bypassed), so holding the folio lock and checking the folio flag is
already good enough for avoiding the lock contention. The chance that a
folio passes the swap entry value check but the shmem mapping slot has
changed should be very low.

Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Kairui Song and committed by Andrew Morton 11 months ago 5c241ed8 af915c3c

+30 -9

1 changed file

mm/shmem.c

···
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
···
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = radix_to_swp_entry(expected);
 
 	do {
+		iter = swap;
 		xas_lock_irq(&xas);
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
···
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
···
 
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swap.val, 1 << folio_order(folio));
+		index = round_down(index, 1 << folio_order(folio));
 	}
 
 alloced:
-	/* We have to do this with folio locked to prevent races */
+	/*
+	 * We have to do this with the folio locked to prevent races.
+	 * The shmem_confirm_swap below only checks if the first swap
+	 * entry matches the folio, that's enough to ensure the folio
+	 * is not used outside of shmem, as shmem swap entries
+	 * and swap cache folios are never partially freed.
+	 */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
 	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
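The new `order < folio_order(folio)` branch aligns both the swap value and the index down to the start of the cached large folio. A minimal userspace sketch of that arithmetic follows; `round_down_pow2`, `struct swapin_target` and `align_to_folio` are hypothetical names, and the sketch assumes the swap offset occupies the low bits of the swap value so that rounding the whole value is equivalent to rounding the offset.

```c
#include <assert.h>

/* Round x down to a multiple of align, which must be a power of two. */
static unsigned long round_down_pow2(unsigned long x, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Hypothetical pair of values that the swap-in path adjusts together. */
struct swapin_target {
	unsigned long swap_val;
	unsigned long index;
};

/*
 * When the cached folio is larger than the mapping entry, move both the
 * swap value and the mapping index back to the first page of the folio.
 */
static struct swapin_target align_to_folio(unsigned long swap_val,
					   unsigned long index,
					   unsigned int folio_order)
{
	unsigned long nr = 1UL << folio_order;
	struct swapin_target t = {
		.swap_val = round_down_pow2(swap_val, nr),
		.index = round_down_pow2(index, nr),
	};

	return t;
}
```

For an order 2 folio, a fault on swap value 0x1236 at index 22 is redirected to swap value 0x1234 at index 20, so the later comparison against `folio->swap.val` is made against the folio's first entry.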
