Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()

The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing. Create collapse_single_pmd
to increase code reuse and create an entry point to these two users.

Refactor madvise_collapse and collapse_scan_mm_slot to use the new
collapse_single_pmd function. To help reduce confusion around the
mmap_locked variable, we rename mmap_locked to lock_dropped in the
collapse_scan_mm_slot() function, and remove the redundant mmap_locked in
madvise_collapse(); this further improves code readability. The
SCAN_PTE_MAPPED_HUGEPAGE enum is no longer reachable in the
madvise_collapse() function, so we drop it from the list of "continuing"
enums.
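
The lock_dropped convention described above can be sketched in userspace as follows. This is only an illustration under stated assumptions: toy_mm, toy_lock(), toy_unlock(), toy_scan() and toy_scan_range() are hypothetical names that do not exist in the kernel, and the "lock" merely records whether it is held. The helper sets the flag when it releases the lock, and the caller retakes the lock before the next iteration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for mmap_lock: only tracks whether the caller holds it. */
struct toy_mm {
	bool locked;
};

static void toy_lock(struct toy_mm *mm)
{
	assert(!mm->locked);
	mm->locked = true;
}

static void toy_unlock(struct toy_mm *mm)
{
	assert(mm->locked);
	mm->locked = false;
}

/*
 * Hypothetical helper: must be entered with the lock held. It sets
 * *lock_dropped to true only when it released the lock and never clears it.
 */
static void toy_scan(struct toy_mm *mm, bool must_drop, bool *lock_dropped)
{
	assert(mm->locked);
	if (must_drop) {
		toy_unlock(mm);
		*lock_dropped = true;
	}
}

/* Caller: returns how many times the lock had to be retaken mid-loop. */
static int toy_scan_range(struct toy_mm *mm, const bool *must_drop, int n)
{
	bool lock_dropped = false;
	int relocks = 0;

	for (int i = 0; i < n; i++) {
		if (lock_dropped) {
			toy_lock(mm);
			lock_dropped = false;
			relocks++;
		}
		toy_scan(mm, must_drop[i], &lock_dropped);
	}
	/* Return with the lock held, mirroring the out_maybelock path. */
	if (lock_dropped)
		toy_lock(mm);
	return relocks;
}
```

With a positive flag the caller's checks read as "if the lock was dropped, retake it", which avoids the double negative of testing !mmap_locked.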

This introduces a minor behavioral change that most likely fixes an
undiscovered bug. The current implementation of khugepaged tests
collapse_test_exit_or_disable() before calling
try_collapse_pte_mapped_thp(), but we weren't doing so in the
madvise_collapse case. By unifying these two callers, madvise_collapse now
also performs this check. We also modify the return value to be
SCAN_ANY_PROCESS, which properly indicates that this process is no longer
valid to operate on.
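
A small userspace sketch of the unified check, assuming simplified result codes. The TOY_* enum values, toy_mm, toy_try_collapse() and toy_finish_file_scan() are hypothetical stand-ins, not kernel definitions. Both callers now go through the same exit test before the PTE-mapped collapse is attempted.

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of result codes named in the commit; the values are illustrative. */
enum toy_scan_result {
	TOY_SCAN_FAIL,
	TOY_SCAN_SUCCEED,
	TOY_SCAN_ANY_PROCESS,
	TOY_SCAN_PMD_MAPPED,
	TOY_SCAN_PTE_MAPPED_HUGEPAGE,
};

struct toy_mm {
	bool exiting_or_disabled;
};

/* Hypothetical stand-in for try_collapse_pte_mapped_thp(). */
static enum toy_scan_result toy_try_collapse(const struct toy_mm *mm)
{
	/* The caller is expected to have checked the mm first. */
	assert(!mm->exiting_or_disabled);
	return TOY_SCAN_PMD_MAPPED;
}

/* Post-processing of a file scan result, shared by both callers. */
static enum toy_scan_result toy_finish_file_scan(const struct toy_mm *mm,
						 enum toy_scan_result result)
{
	if (result != TOY_SCAN_PTE_MAPPED_HUGEPAGE)
		return result;
	if (mm->exiting_or_disabled)
		return TOY_SCAN_ANY_PROCESS;	/* mm is no longer valid */
	result = toy_try_collapse(mm);
	if (result == TOY_SCAN_PMD_MAPPED)
		result = TOY_SCAN_SUCCEED;
	return result;
}
```

Returning a dedicated result instead of jumping out of the loop lets each caller decide how to react to an mm that is going away.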

By moving the madvise_collapse writeback-retry logic into the helper
function we can also avoid having to revalidate the VMA.
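
The retry-once logic and its byte-range arithmetic can be sketched in userspace as follows. This is a model under stated assumptions, not the kernel code: it assumes 4 KiB base pages and 2 MiB PMD-sized huge pages, and toy_file, toy_scan_file(), toy_writeback_range() and toy_collapse_file() are hypothetical names. Because the retry jumps back to the scan inside the helper, it never passes through the caller's relock-and-revalidate path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes for illustration: 4 KiB pages, 2 MiB PMD-sized huge pages. */
#define TOY_PAGE_SHIFT		12
#define TOY_HPAGE_PMD_SIZE	((int64_t)1 << 21)

enum toy_result { TOY_SUCCEED, TOY_DIRTY_OR_WRITEBACK };

struct toy_file {
	bool dirty;		/* the scan fails while this is set */
	bool redirtied;		/* if set, writeback does not clear dirty */
	int scans;		/* number of scan attempts made */
	int64_t wb_start, wb_end;	/* last byte range written back */
};

static enum toy_result toy_scan_file(struct toy_file *f)
{
	f->scans++;
	return f->dirty ? TOY_DIRTY_OR_WRITEBACK : TOY_SUCCEED;
}

static void toy_writeback_range(struct toy_file *f, int64_t start, int64_t end)
{
	assert(start <= end);
	f->wb_start = start;
	f->wb_end = end;
	if (!f->redirtied)
		f->dirty = false;
}

static enum toy_result toy_collapse_file(struct toy_file *f, uint64_t pgoff,
					 bool is_khugepaged)
{
	bool triggered_wb = false;
	enum toy_result result;

retry:
	result = toy_scan_file(f);
	/* Only the madvise-style caller writes back, and only once. */
	if (!is_khugepaged && result == TOY_DIRTY_OR_WRITEBACK && !triggered_wb) {
		const int64_t lstart = (int64_t)pgoff << TOY_PAGE_SHIFT;
		const int64_t lend = lstart + TOY_HPAGE_PMD_SIZE - 1;

		toy_writeback_range(f, lstart, lend);
		triggered_wb = true;
		goto retry;
	}
	return result;
}
```

For example, under these assumed sizes a page offset of 512 maps to the inclusive byte range 2097152..4194303, which is exactly one huge page.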

We guard the khugepaged_pages_collapsed variable to ensure it's only
incremented for khugepaged.
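
A tiny userspace sketch of that guard, assuming a simplified control structure. The names toy_cc, toy_pages_collapsed and toy_account_collapse() are hypothetical; only the is_khugepaged field name is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct collapse_control. */
struct toy_cc {
	bool is_khugepaged;
};

/* Stand-in for the daemon's statistics counter. */
static unsigned long toy_pages_collapsed;

/* Count a collapse only when the daemon, not madvise, performed it. */
static void toy_account_collapse(const struct toy_cc *cc, bool succeeded)
{
	assert(cc);
	if (cc->is_khugepaged && succeeded)
		++toy_pages_collapsed;
}
```

Without the guard, a shared helper would also count collapses requested through madvise, which the daemon's counter is not meant to include.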

As requested, we also convert a VM_BUG_ON() to a VM_WARN_ON_ONCE().

Link: https://lkml.kernel.org/r/20260325114022.444081-6-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Takashi Iwai (SUSE) <tiwai@suse.de>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Nico Pache and committed by Andrew Morton
a155d945 ff7e03a8

+72 -70
mm/khugepaged.c
···

 static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		bool *mmap_locked, struct collapse_control *cc)
+		bool *lock_dropped, struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
···
 		result = collapse_huge_page(mm, start_addr, referenced,
 					    unmapped, cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+		*lock_dropped = true;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
···
 	return result;
 }

+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static enum scan_result collapse_single_pmd(unsigned long addr,
+		struct vm_area_struct *vma, bool *lock_dropped,
+		struct collapse_control *cc)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	bool triggered_wb = false;
+	enum scan_result result;
+	struct file *file;
+	pgoff_t pgoff;
+
+	mmap_assert_locked(mm);
+
+	if (vma_is_anonymous(vma)) {
+		result = collapse_scan_pmd(mm, vma, addr, lock_dropped, cc);
+		goto end;
+	}
+
+	file = get_file(vma->vm_file);
+	pgoff = linear_page_index(vma, addr);
+
+	mmap_read_unlock(mm);
+	*lock_dropped = true;
+retry:
+	result = collapse_scan_file(mm, addr, file, pgoff, cc);
+
+	/*
+	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
+	 * then retry the collapse one time.
+	 */
+	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
+	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
+		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+		triggered_wb = true;
+		goto retry;
+	}
+	fput(file);
+
+	if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+		mmap_read_lock(mm);
+		if (collapse_test_exit_or_disable(mm))
+			result = SCAN_ANY_PROCESS;
+		else
+			result = try_collapse_pte_mapped_thp(mm, addr,
+							     !cc->is_khugepaged);
+		if (result == SCAN_PMD_MAPPED)
+			result = SCAN_SUCCEED;
+		mmap_read_unlock(mm);
+	}
+end:
+	if (cc->is_khugepaged && result == SCAN_SUCCEED)
+		++khugepaged_pages_collapsed;
+	return result;
+}
+
 static void collapse_scan_mm_slot(unsigned int progress_max,
 		enum scan_result *result, struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
···
 		VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);

 		while (khugepaged_scan.address < hend) {
-			bool mmap_locked = true;
+			bool lock_dropped = false;

 			cond_resched();
 			if (unlikely(collapse_test_exit_or_disable(mm)))
 				goto breakouterloop;

-			VM_BUG_ON(khugepaged_scan.address < hstart ||
+			VM_WARN_ON_ONCE(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (!vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);

-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = try_collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
-
+			*result = collapse_single_pmd(khugepaged_scan.address,
+						      vma, &lock_dropped, cc);
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			if (!mmap_locked)
+			if (lock_dropped)
 				/*
 				 * We released mmap_lock so break loop. Note
 				 * that we drop mmap_lock before all hugepage
···
 	unsigned long hstart, hend, addr;
 	enum scan_result last_fail = SCAN_FAIL;
 	int thps = 0;
-	bool mmap_locked = true;

 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
···

 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		enum scan_result result = SCAN_FAIL;
-		bool triggered_wb = false;

-retry:
-		if (!mmap_locked) {
+		if (*lock_dropped) {
 			cond_resched();
 			mmap_read_lock(mm);
-			mmap_locked = true;
+			*lock_dropped = false;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
 							 cc);
 			if (result != SCAN_SUCCEED) {
···

 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 		}
-		mmap_assert_locked(mm);
-		if (!vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);

-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			*lock_dropped = true;
-			result = collapse_scan_file(mm, addr, file, pgoff, cc);
+		result = collapse_single_pmd(addr, vma, lock_dropped, cc);

-			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
-			    mapping_can_writeback(file->f_mapping)) {
-				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
-				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
-
-				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
-				triggered_wb = true;
-				fput(file);
-				goto retry;
-			}
-			fput(file);
-		} else {
-			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
-		}
-		if (!mmap_locked)
-			*lock_dropped = true;
-
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
-		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			mmap_read_lock(mm);
-			result = try_collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_NO_PTE_TABLE:
 		case SCAN_PTE_NON_PRESENT:
···

 out_maybelock:
 	/* Caller expects us to hold mmap_lock on return */
-	if (!mmap_locked)
+	if (*lock_dropped)
 		mmap_read_lock(mm);
 out_nolock:
 	mmap_assert_locked(mm);