Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


Revert "mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk"

Patch series "ksm: perform a range-walk to jump over holes in break_ksm",
v4.

When unmerging an address range, the unmerge_ksm_pages() function walks
every page address in the specified range to locate KSM pages. This becomes
highly inefficient when scanning large virtual memory areas that contain
mostly unmapped regions, causing the process to get blocked for several
minutes.

This patch makes break_ksm(), the function that unmerge_ksm_pages() calls
for every page in an address range, perform a range walk. This allows it to
skip over entire unmapped holes in a VMA and avoid unnecessary lookups.
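
To illustrate why a range walk helps, here is a minimal userspace sketch
that counts lookups under both strategies. It is not kernel code: struct
extent and all function names are hypothetical, and the extents are assumed
not to overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a populated region; not a kernel type. */
struct extent {
	uint64_t first_page;	/* index of the first mapped page */
	uint64_t nr_pages;	/* number of mapped pages */
};

/* Return 1 if page index pg lies inside any extent (linear search). */
static int page_is_mapped(const struct extent *ext, size_t n, uint64_t pg)
{
	for (size_t i = 0; i < n; i++)
		if (pg >= ext[i].first_page &&
		    pg - ext[i].first_page < ext[i].nr_pages)
			return 1;
	return 0;
}

/*
 * Per-page strategy: one lookup for every page in [start, end), whether
 * it is mapped or not. Returns the number of lookups performed.
 */
static uint64_t visit_per_page(const struct extent *ext, size_t n,
			       uint64_t start, uint64_t end, uint64_t *mapped)
{
	uint64_t lookups = 0;

	*mapped = 0;
	for (uint64_t pg = start; pg < end; pg++) {
		lookups++;
		*mapped += page_is_mapped(ext, n, pg);
	}
	return lookups;
}

/*
 * Range strategy: clamp each extent to [start, end) and touch only the
 * pages inside it, so holes cost nothing.
 */
static uint64_t visit_by_range(const struct extent *ext, size_t n,
			       uint64_t start, uint64_t end, uint64_t *mapped)
{
	uint64_t lookups = 0;

	*mapped = 0;
	for (size_t i = 0; i < n; i++) {
		uint64_t lo = ext[i].first_page;
		uint64_t hi = lo + ext[i].nr_pages;

		if (lo < start)
			lo = start;
		if (hi > end)
			hi = end;
		if (lo >= hi)
			continue;	/* extent lies outside the range */
		lookups += hi - lo;
		*mapped += hi - lo;
	}
	return lookups;
}
```

Both functions report the same number of mapped pages, but the per-page
variant performs one lookup per page in the range while the range variant
performs one per mapped page.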

As pointed out by David Hildenbrand in [1], unmerge_ksm_pages() is called
from:

* ksm_madvise() through madvise(MADV_UNMERGEABLE). There are not a lot
of users of that function.

* __ksm_del_vma() through ksm_del_vmas(). Effectively called when
disabling KSM for a process either through the sysctl or from s390x gmap
code when enabling storage keys for a VM.

Consider the following test program which creates a 32 TiB mapping in the
virtual address space but only populates a single page:

	#include <unistd.h>
	#include <stdio.h>
	#include <sys/mman.h>

	/* 32 TiB */
	const size_t size = 32ul * 1024 * 1024 * 1024 * 1024;

	int main(void) {
		char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0);

		if (area == MAP_FAILED) {
			/* perror() appends the errno text and a newline itself. */
			perror("mmap() failed");
			return -1;
		}

		/* Populate a single page such that we get an anon_vma. */
		*area = 0;

		/* Enable KSM, then unmerge the whole range again. */
		madvise(area, size, MADV_MERGEABLE);
		madvise(area, size, MADV_UNMERGEABLE);
		return 0;
	}


Without this patch, this program takes 9 minutes to finish, while with
this patch it finishes in less than 5 seconds.
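
As a rough sanity check on those numbers, the arithmetic can be sketched as
follows. The 4 KiB base page size is an assumption (the message does not
state it), as is attributing the full 9 minutes to the per-page loop. The
helper names are hypothetical.

```c
#include <stdint.h>

/* Number of base pages in a mapping of `bytes` bytes. */
static uint64_t pages_in_mapping(uint64_t bytes, uint64_t page_size)
{
	return bytes / page_size;
}

/* Average nanoseconds spent per page when `pages` visits take `seconds`. */
static double ns_per_page(uint64_t pages, double seconds)
{
	return seconds * 1e9 / (double)pages;
}
```

Under these assumptions, a 32 TiB mapping spans 2^33 (about 8.6 billion)
page addresses, and 9 minutes works out to roughly 63 ns per address, which
is plausible for a page table lookup repeated once per page.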


This patch (of 3):

This reverts commit e317a8d8b4f600fc7ec9725e26417030ee594f52 and changes
function break_ksm_pmd_entry() to use folios.

This reverts break_ksm() to use walk_page_range_vma() instead of
folio_walk_start().

Change break_ksm_pmd_entry() to call is_ksm_zero_pte() only if we know the
folio is present, and also rename variable ret to found. This will make
it easier to later modify break_ksm() to perform a proper range walk.
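
The "found" condition described above can be modelled in userspace as a
small predicate. This is a simplified sketch, not the kernel code: the enum,
the struct and the function name are hypothetical, and the folio lookup is
reduced to a flag.

```c
/* Hypothetical userspace model; none of these are kernel definitions. */
enum pte_kind {
	PTE_NONE,		/* nothing mapped */
	PTE_PRESENT,		/* present entry */
	PTE_MIGRATION,		/* non-present migration entry */
	PTE_OTHER_SWAP		/* any other non-present entry */
};

struct model_pte {
	enum pte_kind kind;
	int folio_is_ksm;	/* the folio behind the entry is a KSM folio */
	int is_ksm_zero;	/* the entry maps the KSM-placed zero page */
};

/*
 * Return 1 for a normal KSM page or a KSM-placed zero page, else 0.
 * A folio is only looked up for present and migration entries, and the
 * zero-page check is only applied to present entries.
 */
static int model_found(const struct model_pte *p)
{
	int have_folio = p->kind == PTE_PRESENT || p->kind == PTE_MIGRATION;

	return (have_folio && p->folio_is_ksm) ||
	       (p->kind == PTE_PRESENT && p->is_ksm_zero);
}
```

In this model, a migration entry counts as found only through its folio,
and an empty entry is never found.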

Link: https://lkml.kernel.org/r/20251105184912.186329-1-pedrodemargomes@gmail.com
Link: https://lkml.kernel.org/r/20251105184912.186329-2-pedrodemargomes@gmail.com
Link: https://lore.kernel.org/linux-mm/e0886fdf-d198-4130-bd9a-be276c59da37@redhat.com/ [1]
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Pedro Demarchi Gomes and committed by Andrew Morton
912aa825 ed1f8855

+48 -16
mm/ksm.c
···
 	return atomic_read(&mm->mm_users) == 0;
 }
 
+static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
+			struct mm_walk *walk)
+{
+	struct folio *folio = NULL;
+	spinlock_t *ptl;
+	pte_t *pte;
+	pte_t ptent;
+	int found;
+
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
+	ptent = ptep_get(pte);
+	if (pte_present(ptent)) {
+		folio = vm_normal_folio(walk->vma, addr, ptent);
+	} else if (!pte_none(ptent)) {
+		swp_entry_t entry = pte_to_swp_entry(ptent);
+
+		/*
+		 * As KSM pages remain KSM pages until freed, no need to wait
+		 * here for migration to end.
+		 */
+		if (is_migration_entry(entry))
+			folio = pfn_swap_entry_folio(entry);
+	}
+	/* return 1 if the page is an normal ksm page or KSM-placed zero page */
+	found = (folio && folio_test_ksm(folio)) ||
+		(pte_present(ptent) && is_ksm_zero_pte(ptent));
+	pte_unmap_unlock(pte, ptl);
+	return found;
+}
+
+static const struct mm_walk_ops break_ksm_ops = {
+	.pmd_entry = break_ksm_pmd_entry,
+	.walk_lock = PGWALK_RDLOCK,
+};
+
+static const struct mm_walk_ops break_ksm_lock_vma_ops = {
+	.pmd_entry = break_ksm_pmd_entry,
+	.walk_lock = PGWALK_WRLOCK,
+};
+
 /*
  * We use break_ksm to break COW on a ksm page by triggering unsharing,
  * such that the ksm page will get replaced by an exclusive anonymous page.
···
 static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_vma)
 {
 	vm_fault_t ret = 0;
-
-	if (lock_vma)
-		vma_start_write(vma);
+	const struct mm_walk_ops *ops = lock_vma ?
+				&break_ksm_lock_vma_ops : &break_ksm_ops;
 
 	do {
-		bool ksm_page = false;
-		struct folio_walk fw;
-		struct folio *folio;
+		int ksm_page;
 
 		cond_resched();
-		folio = folio_walk_start(&fw, vma, addr,
-					 FW_MIGRATION | FW_ZEROPAGE);
-		if (folio) {
-			/* Small folio implies FW_LEVEL_PTE. */
-			if (!folio_test_large(folio) &&
-			    (folio_test_ksm(folio) || is_ksm_zero_pte(fw.pte)))
-				ksm_page = true;
-			folio_walk_end(&fw, vma);
-		}
-
+		ksm_page = walk_page_range_vma(vma, addr, addr + 1, ops, NULL);
+		if (WARN_ON_ONCE(ksm_page < 0))
+			return ksm_page;
 		if (!ksm_page)
 			return 0;
 		ret = handle_mm_fault(vma, addr,