Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/huge_memory: mark PMD mappings of the huge zero folio special

The huge zero folio is refcounted (+mapcounted -- is that a word?)
differently than "normal" folios, similar (but not identical) to the
ordinary shared zeropage.

For this reason, we special-case these pages in
vm_normal_page*/vm_normal_folio*, and only allow selected callers to still
use them (e.g., GUP can still take a reference on them).

vm_normal_page_pmd() already filters out the huge zero folio, to indicate
that it is special (returning NULL). However, so far we are not making use of
pmd_special() on architectures that support it
(CONFIG_ARCH_HAS_PTE_SPECIAL), like we would with the ordinary shared
zeropage.

Let's mark PMD mappings of the huge zero folio similarly as special, so we
can avoid the manual check for the huge zero folio with
CONFIG_ARCH_HAS_PTE_SPECIAL next, and only perform the check on
!CONFIG_ARCH_HAS_PTE_SPECIAL.
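
The idea can be sketched as a toy userspace model. This is not kernel code: every toy_* name below is hypothetical, the arch_has_special flag merely stands in for CONFIG_ARCH_HAS_PTE_SPECIAL, and the PFNMAP/MIXEDMAP handling of the real lookup is left out. It only illustrates that, once the mapping is marked special, the special bit alone can answer "is this a normal folio?", while configurations without the bit keep the manual check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, not kernel code; every toy_* name is hypothetical. */
struct toy_folio {
	const char *name;
};

struct toy_pmd {
	struct toy_folio *folio;	/* folio this entry maps */
	bool special;			/* models the PMD "special" bit */
};

struct toy_folio toy_huge_zero_folio = { "huge zero folio" };

/*
 * Models mapping a folio at PMD level. arch_has_special stands in for
 * CONFIG_ARCH_HAS_PTE_SPECIAL; without it, marking special is a no-op.
 */
struct toy_pmd toy_map_folio(struct toy_folio *folio, bool arch_has_special)
{
	struct toy_pmd pmd = { folio, false };

	if (folio == &toy_huge_zero_folio && arch_has_special)
		pmd.special = true;
	return pmd;
}

/* Models the lookup: NULL means "special, not a normal folio". */
struct toy_folio *toy_normal_folio_pmd(struct toy_pmd pmd, bool arch_has_special)
{
	if (arch_has_special)
		return pmd.special ? NULL : pmd.folio;	/* the bit decides */

	/* No special bit available: keep the manual huge zero folio check. */
	return pmd.folio == &toy_huge_zero_folio ? NULL : pmd.folio;
}

/* Small demonstration of the intended behavior in both configurations. */
void toy_demo(void)
{
	struct toy_folio thp = { "ordinary THP" };
	struct toy_pmd zero = toy_map_folio(&toy_huge_zero_folio, true);
	struct toy_pmd normal = toy_map_folio(&thp, true);

	assert(zero.special);
	assert(toy_normal_folio_pmd(zero, true) == NULL);
	assert(!normal.special);
	assert(toy_normal_folio_pmd(normal, true) == &thp);
}
```

In this model the result for the huge zero folio is NULL in both configurations, which matches the "no functional change" intent of the patch.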

In copy_huge_pmd(), where we have a manual pmd_special() check to handle
PFNMAP, we have to manually rule out the huge zero folio. That code needs
a serious cleanup, but that's something for another day.
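
As a rough illustration of why the extra condition is needed, the predicate can be modeled in plain C. The struct, enum, and function names are hypothetical and exist only for this sketch; the real copy_huge_pmd() does considerably more than choose between two paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state inspected at the top of the copy path. */
struct sketch_pmd {
	bool present;	/* models pmd_present() */
	bool special;	/* models pmd_special() */
	bool huge_zero;	/* models is_huge_zero_pmd() */
};

enum sketch_copy_path {
	SKETCH_COPY_PFNMAP,	/* early special-mapping (PFNMAP) handling */
	SKETCH_COPY_REGULAR,	/* falls through to the rest of the function */
};

/*
 * With huge zero folio mappings now marked special, "present && special"
 * alone would also match them, so they are excluded explicitly.
 */
enum sketch_copy_path sketch_pick_copy_path(struct sketch_pmd pmd)
{
	if (pmd.present && pmd.special && !pmd.huge_zero)
		return SKETCH_COPY_PFNMAP;
	return SKETCH_COPY_REGULAR;
}

/* Small demonstration of the three interesting cases. */
void sketch_demo(void)
{
	struct sketch_pmd pfnmap = { true, true, false };
	struct sketch_pmd zero = { true, true, true };
	struct sketch_pmd thp = { true, false, false };

	assert(sketch_pick_copy_path(pfnmap) == SKETCH_COPY_PFNMAP);
	assert(sketch_pick_copy_path(zero) == SKETCH_COPY_REGULAR);
	assert(sketch_pick_copy_path(thp) == SKETCH_COPY_REGULAR);
}
```

Without the huge_zero term, the second case would be routed into the PFNMAP handling, which is the behavior change the patch avoids.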

While at it, update the doc regarding the shared zero folios.

No functional change intended: vm_normal_page_pmd() still returns NULL
when it encounters the huge zero folio.

Link: https://lkml.kernel.org/r/20250811112631.759341-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by David Hildenbrand and committed by Andrew Morton
d82d09e4 b0f86aae

+16 -7
+6 -2
mm/huge_memory.c
···
 {
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
+	entry = pmd_mkspecial(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	mm_inc_nr_ptes(mm);
···
 	if (fop.is_folio) {
 		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
 
-		if (!is_huge_zero_folio(fop.folio)) {
+		if (is_huge_zero_folio(fop.folio)) {
+			entry = pmd_mkspecial(entry);
+		} else {
 			folio_get(fop.folio);
 			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
 			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
···
 	int ret = -ENOMEM;
 
 	pmd = pmdp_get_lockless(src_pmd);
-	if (unlikely(pmd_present(pmd) && pmd_special(pmd))) {
+	if (unlikely(pmd_present(pmd) && pmd_special(pmd) &&
+		     !is_huge_zero_pmd(pmd))) {
 		dst_ptl = pmd_lock(dst_mm, dst_pmd);
 		src_ptl = pmd_lockptr(src_mm, src_pmd);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+10 -5
mm/memory.c
···
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
  * it doesn't exist, or it exists but they don't want to touch it). In this
- * case, NULL is returned here. "Normal" mappings do have a struct page.
+ * case, NULL is returned here. "Normal" mappings do have a struct page and
+ * are ordinarily refcounted.
+ *
+ * Page mappings of the shared zero folios are always considered "special", as
+ * they are not ordinarily refcounted: neither the refcount nor the mapcount
+ * of these folios is adjusted when mapping them into user page tables.
+ * Selected page table walkers (such as GUP) can still identify mappings of the
+ * shared zero folios and work with the underlying "struct page".
  *
  * There are 2 broad cases. Firstly, an architecture may define a pte_special()
  * pte bit, in which case this function is trivial. Secondly, an architecture
···
  *
  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
  * page" backing, however the difference is that _all_ pages with a struct
- * page (that is, those where pfn_valid is true) are refcounted and considered
- * normal pages by the VM. The only exception are zeropages, which are
- * *never* refcounted.
+ * page (that is, those where pfn_valid is true, except the shared zero
+ * folios) are refcounted and considered normal pages by the VM.
  *
  * The disadvantage is that pages are refcounted (which can be slower and
  * simply not an option for some PFNMAP users). The advantage is that we
···
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
-	/* Currently it's only used for huge pfnmaps */
 	if (unlikely(pmd_special(pmd)))
 		return NULL;
 