mm: set folio swapbacked iff folios are dirty in try_to_unmap_one

Patch series "mm: batched unmap lazyfree large folios during reclamation",
v4.

Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during
shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in
madvise.c.

However, those folios are still added to the deferred_split list in
try_to_unmap_one() because we are unmapping PTEs and removing rmap entries
one by one.

Firstly, this has rendered the following counter somewhat confusing:
/sys/kernel/mm/transparent_hugepage/hugepages-size/stats/split_deferred
The split_deferred counter was originally designed to track operations
such as partial unmap or madvise of large folios. However, in practice,
most split_deferred cases arise from memory reclamation of aligned
lazyfree mTHPs as observed by Tangquan. This discrepancy has made the
split_deferred counter highly misleading.

Secondly, this approach is slow because it requires iterating through each
PTE and removing the rmap one by one for a large folio. In fact, all PTEs
of a pte-mapped large folio should be unmapped at once, and the entire
folio should be removed from the rmap as a whole.

Thirdly, it also increases the risk of a race condition where lazyfree
folios are incorrectly set back to swapbacked, as a speculative folio_get
may occur in the shrinker's callback.

deferred_split_scan() might call folio_try_get(folio), since we added the
folio to the deferred_split list while removing the rmap for the 1st
subpage. While we are still scanning the 2nd to nr_pages PTEs of this
folio in try_to_unmap_one(), that extra reference makes
"ref_count == 1 + map_count" false, so the entire mTHP could be
transitioned back to swap-backed.

   /*
    * The only page refs must be one from isolation
    * plus the rmap(s) (dropped by discard:).
    */
   if (ref_count == 1 + map_count &&
       (!folio_test_dirty(folio) ||
        ...
        (vma->vm_flags & VM_DROPPABLE))) {
           dec_mm_counter(mm, MM_ANONPAGES);
           goto discard;
   }
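
The effect of that check can be modelled in userspace. The following is a
minimal sketch, not kernel code: struct folio_model, enum unmap_result and
old_lazyfree_check() are hypothetical names, and the fallthrough path is
taken from the lines this patch removes from mm/rmap.c.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the folio state read in try_to_unmap_one(). */
struct folio_model {
	int ref_count;   /* all references: isolation, rmap, anything else */
	int map_count;   /* number of mappings */
	bool dirty;
	bool swapbacked;
};

enum unmap_result { UNMAP_DISCARD, UNMAP_ABORT };

/* Pre-patch decision: one combined test, one shared failure path. */
static enum unmap_result old_lazyfree_check(struct folio_model *f,
					    bool vm_droppable)
{
	if (f->ref_count == 1 + f->map_count &&
	    (!f->dirty || vm_droppable))
		return UNMAP_DISCARD;

	/*
	 * Reached for a redirtied folio, but also for a clean folio that
	 * merely has a transient extra reference.
	 */
	if (!vm_droppable)
		f->swapbacked = true;
	return UNMAP_ABORT;
}
```

With ref_count = 3 and map_count = 1 on a clean folio, the model returns
UNMAP_ABORT and sets swapbacked, which is the unwanted transition described
above.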

This patchset resolves the issue by marking only genuinely dirty folios as
swap-backed, as suggested by David, and transitioning to batched unmapping
of entire folios in try_to_unmap_one(). Consequently, the deferred_split
count drops to zero, and memory reclamation performance improves
significantly: reclaiming 64KiB lazyfree large folios is now 2.5x
faster (the specific data is embedded in the changelog of patch 3/4).

By the way, while the patchset is primarily aimed at PTE-mapped large
folios, Baolin and Lance also found that try_to_unmap_one() handles
lazyfree redirtied PMD-mapped large folios inefficiently: it splits the
PMD into PTEs and iterates over them. This patchset removes the
unnecessary splitting, enabling us to skip redirtied PMD-mapped large
folios 3.5x faster during memory reclamation (the specific data can be
found in the changelog of patch 4/4).

This patch (of 4):

The refcount may be temporarily or long-term increased, but this does not
change the fundamental nature of the folio already being lazy-freed.
Therefore, we only reset 'swapbacked' when we are certain the folio is
dirty and not droppable.
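
The resulting three-way decision can be sketched as a small userspace
model. This is an illustration only: struct lazyfree_state, enum
lazyfree_action and new_lazyfree_check() are hypothetical names, and the
ordering of the tests follows the lines added to mm/rmap.c by this patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the folio state read in try_to_unmap_one(). */
struct lazyfree_state {
	int ref_count;   /* all references: isolation, rmap, anything else */
	int map_count;   /* number of mappings */
	bool dirty;
	bool swapbacked;
};

enum lazyfree_action {
	LAZYFREE_DISCARD,          /* drop the folio */
	LAZYFREE_KEEP_LAZYFREE,    /* abort the walk, folio stays lazyfree */
	LAZYFREE_MAKE_SWAPBACKED,  /* abort the walk, folio is swap-backed */
};

static enum lazyfree_action new_lazyfree_check(struct lazyfree_state *s,
					       bool vm_droppable)
{
	if (s->dirty && !vm_droppable) {
		/* Redirtied: the only case that sets swapbacked. */
		s->swapbacked = true;
		return LAZYFREE_MAKE_SWAPBACKED;
	}
	if (s->ref_count != 1 + s->map_count) {
		/* Extra reference: retry later, leave the flags alone. */
		return LAZYFREE_KEEP_LAZYFREE;
	}
	return LAZYFREE_DISCARD;
}
```

In this model a clean folio with ref_count = 3 and map_count = 1 yields
LAZYFREE_KEEP_LAZYFREE with swapbacked untouched, so a later reclaim pass
can still discard it once the extra reference is gone.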

Link: https://lkml.kernel.org/r/20250214093015.51024-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20250214093015.51024-2-21cnbao@gmail.com
Fixes: 6c8e2a256915 ("mm: fix race between MADV_FREE reclaim and blkdev direct IO read")
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Cc: Mauricio Faria de Oliveira <mfo@canonical.com>
Cc: Chris Li <chrisl@kernel.org> (Google)
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Barry Song and committed by Andrew Morton 1 year ago (faeb2831 b23ceebd)

+22 -27

1 changed file

mm/rmap.c

@@ -1963,34 +1963,29 @@
 				 */
 				smp_rmb();
 
-				/*
-				 * The only page refs must be one from isolation
-				 * plus the rmap(s) (dropped by discard:).
-				 */
-				if (ref_count == 1 + map_count &&
-				    (!folio_test_dirty(folio) ||
-				     /*
-				      * Unlike MADV_FREE mappings, VM_DROPPABLE
-				      * ones can be dropped even if they've
-				      * been dirtied.
-				      */
-				     (vma->vm_flags & VM_DROPPABLE))) {
-					dec_mm_counter(mm, MM_ANONPAGES);
-					goto discard;
-				}
-
-				/*
-				 * If the folio was redirtied, it cannot be
-				 * discarded. Remap the page to page table.
-				 */
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				/*
-				 * Unlike MADV_FREE mappings, VM_DROPPABLE ones
-				 * never get swap backed on failure to drop.
-				 */
-				if (!(vma->vm_flags & VM_DROPPABLE))
+				if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+					/*
+					 * redirtied either using the page table or a previously
+					 * obtained GUP reference.
+					 */
+					set_pte_at(mm, address, pvmw.pte, pteval);
 					folio_set_swapbacked(folio);
-				goto walk_abort;
+					goto walk_abort;
+				} else if (ref_count != 1 + map_count) {
+					/*
+					 * Additional reference. Could be a GUP reference or any
+					 * speculative reference. GUP users must mark the folio
+					 * dirty if there was a modification. This folio cannot be
+					 * reclaimed right now either way, so act just like nothing
+					 * happened.
+					 * We'll come back here later and detect if the folio was
+					 * dirtied when the additional reference is gone.
+					 */
+					set_pte_at(mm, address, pvmw.pte, pteval);
+					goto walk_abort;
+				}
+				dec_mm_counter(mm, MM_ANONPAGES);
+				goto discard;
 			}
 
 			if (swap_duplicate(entry) < 0) {