Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
mm: correctly handle UFFD PTE markers

Patch series "mm: remove is_swap_[pte, pmd]() + non-swap entries,
introduce leaf entries", v3.

There's an established convention in the kernel that we treat leaf page
tables (so far at the PTE, PMD level) as containing 'swap entries' should
they be neither empty (i.e. p**_none() evaluating true) nor present (i.e.
p**_present() evaluating true).

However, at the same time we also have helper predicates - is_swap_pte(),
is_swap_pmd() - which are inconsistently used.

This is problematic, as it is logical to assume that should somebody wish
to operate upon a page table swap entry they should first check to see if
it is in fact one.

It also implies that perhaps, in future, we might introduce a non-present,
non-none page table entry that is not a swap entry.

This series resolves this issue by systematically eliminating all use of
the is_swap_pte() and is_swap_pmd() predicates so we retain only the
convention that should a leaf page table entry be neither none nor present
it is a swap entry.
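
The convention described above can be modelled in a few lines of C. This is
a minimal user-space sketch, not kernel code: toy_pte_t, the TOY_PTE_PRESENT
bit and the toy_* helpers are hypothetical stand-ins for the
architecture-specific pte_t, pte_none() and pte_present().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy encoding (assumption): bit 0 set means "present". */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PTE_PRESENT 0x1ULL

enum toy_pte_kind { TOY_KIND_NONE, TOY_KIND_PRESENT, TOY_KIND_SWAP };

static bool toy_pte_none(toy_pte_t pte)
{
	return pte.val == 0;
}

static bool toy_pte_present(toy_pte_t pte)
{
	return (pte.val & TOY_PTE_PRESENT) != 0;
}

/* Neither none nor present: by convention, treated as a swap entry. */
static enum toy_pte_kind toy_pte_classify(toy_pte_t pte)
{
	if (toy_pte_none(pte))
		return TOY_KIND_NONE;
	if (toy_pte_present(pte))
		return TOY_KIND_PRESENT;
	return TOY_KIND_SWAP;
}

/* Small usage check of the three-way classification. */
void toy_classify_self_check(void)
{
	toy_pte_t empty = { 0 };
	toy_pte_t mapped = { TOY_PTE_PRESENT };
	toy_pte_t swapped = { 0x10 };	/* non-zero, present bit clear */

	assert(toy_pte_classify(empty) == TOY_KIND_NONE);
	assert(toy_pte_classify(mapped) == TOY_KIND_PRESENT);
	assert(toy_pte_classify(swapped) == TOY_KIND_SWAP);
}
```

Under this convention no separate "is this a swap entry?" predicate is
needed: the third case falls out of the first two checks.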

We also have the further issue that 'swap entry' is unfortunately a really
rather overloaded term and in fact refers to both entries for swap and for
other information such as migration entries, page table markers, and
device private entries.

We therefore have the rather 'unique' concept of a 'non-swap' swap entry.

This series therefore introduces the concept of 'software leaf entries',
of type softleaf_t, to eliminate this confusion.

A software leaf entry in this sense is any page table entry which is
non-present, and represented by the softleaf_t type. That is - page table
leaf entries which are software-controlled by the kernel.

This includes 'none' or empty entries, which are simply represented by a
zero leaf entry value.

In order to maintain compatibility as we transition the kernel to this new
type, we simply typedef swp_entry_t to softleaf_t.

We introduce a number of predicates and helpers to interact with software
leaf entries in include/linux/leafops.h which, as it imports swapops.h,
can be treated as a drop-in replacement for swapops.h wherever leaf entry
helpers are used.

Since softleaf_from_[pte, pmd]() treats present entries as if they were
empty/none leaf entries, this allows for a great deal of simplification of
code throughout the code base, which this series utilises a great deal.
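
To illustrate why mapping present entries to the empty leaf entry simplifies
callers, here is a minimal user-space sketch. It is not the kernel
implementation: the bit layout and every toy_* name are assumptions made for
illustration, and the real helpers in include/linux/leafops.h may look
different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy encoding (assumption): bit 0 = present, bits 1-2 = non-present type. */
typedef struct { uint64_t val; } toy_pte_t;
typedef struct { uint64_t val; } toy_softleaf_t;

#define TOY_PRESENT        0x1ULL
#define TOY_TYPE_MASK      0x6ULL
#define TOY_TYPE_SWAP      0x2ULL
#define TOY_TYPE_MIGRATION 0x4ULL

/* Present entries are deliberately mapped to the empty (zero) leaf entry. */
static toy_softleaf_t toy_softleaf_from_pte(toy_pte_t pte)
{
	toy_softleaf_t entry = { 0 };

	if (!(pte.val & TOY_PRESENT))
		entry.val = pte.val;
	return entry;
}

static bool toy_softleaf_is_none(toy_softleaf_t entry)
{
	return entry.val == 0;
}

static bool toy_softleaf_is_migration(toy_softleaf_t entry)
{
	return (entry.val & TOY_TYPE_MASK) == TOY_TYPE_MIGRATION;
}

/* Caller needs no separate none/present check before testing the type. */
static bool toy_pte_is_migration(toy_pte_t pte)
{
	return toy_softleaf_is_migration(toy_softleaf_from_pte(pte));
}

void toy_softleaf_self_check(void)
{
	toy_pte_t present = { TOY_PRESENT | TOY_TYPE_MIGRATION };
	toy_pte_t migration = { TOY_TYPE_MIGRATION };

	assert(toy_softleaf_is_none(toy_softleaf_from_pte(present)));
	assert(!toy_pte_is_migration(present));
	assert(toy_pte_is_migration(migration));
}
```

Because the conversion itself absorbs the present case, a type test on the
result is safe for any input entry.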

We additionally change from swap entry to software leaf entry handling
where it makes sense to and eliminate functions from swapops.h where
software leaf entries obviate the need for the functions.


This patch (of 16):

PTE markers were previously only concerned with UFFD-specific logic - that
is, PTE entries with the UFFD WP marker set or those marked via
UFFDIO_POISON.

However since the introduction of guard markers in commit 7c53dfbdb024
("mm: add PTE_MARKER_GUARD PTE marker"), this has no longer been the case.

Issues have been avoided as guard regions are not permitted in conjunction
with UFFD, but it still leaves very confusing logic in place, most notably
the misleading and poorly named pte_none_mostly() and
huge_pte_none_mostly().

These predicates return true for PTE entries that ought to be treated as
none, but only in certain circumstances, and on the assumption we are
dealing with H/W poison markers or UFFD WP markers.

This patch removes these functions and makes each invocation of these
functions instead explicitly check what it needs to check.

As part of this effort it introduces is_uffd_pte_marker() to explicitly
determine if a marker in fact is used as part of UFFD or not.
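
The difference between the removed catch-all predicate and the explicit
checks can be sketched in user space as follows. This is a toy model under
stated assumptions: struct toy_pte, the TOY_MARKER_* bits and the toy_*
functions are hypothetical and only mirror the shape of the kernel helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy marker bits (assumption; the kernel's PTE_MARKER_* values may differ). */
#define TOY_MARKER_UFFD_WP  (1u << 0)
#define TOY_MARKER_POISONED (1u << 1)
#define TOY_MARKER_GUARD    (1u << 2)

enum toy_kind { TOY_NONE, TOY_PRESENT, TOY_MARKER, TOY_OTHER };

struct toy_pte {
	enum toy_kind kind;
	unsigned int marker;	/* only meaningful when kind == TOY_MARKER */
};

/* Old-style predicate: any marker at all counts as "mostly none". */
static bool toy_pte_none_mostly(struct toy_pte pte)
{
	return pte.kind == TOY_NONE || pte.kind == TOY_MARKER;
}

/* New-style predicate: only UFFD WP and poison markers are UFFD-handled. */
static bool toy_is_uffd_pte_marker(struct toy_pte pte)
{
	if (pte.kind != TOY_MARKER)
		return false;
	return (pte.marker & (TOY_MARKER_UFFD_WP | TOY_MARKER_POISONED)) != 0;
}

/* Explicit replacement for the old check at a UFFD call site. */
static bool toy_uffd_may_install(struct toy_pte pte)
{
	return pte.kind == TOY_NONE || toy_is_uffd_pte_marker(pte);
}

void toy_marker_self_check(void)
{
	struct toy_pte guard = { TOY_MARKER, TOY_MARKER_GUARD };
	struct toy_pte wp = { TOY_MARKER, TOY_MARKER_UFFD_WP };

	assert(toy_pte_none_mostly(guard));	/* old: guard looks "none" */
	assert(!toy_uffd_may_install(guard));	/* new: guard is rejected */
	assert(toy_uffd_may_install(wp));
}
```

Since guard regions are not permitted together with UFFD, the two
predicates should not disagree in practice; the explicit form simply makes
the intent of each call site visible.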

In the HMM logic we note that the only time we would need to check for a
fault is in the case of a UFFD WP marker; otherwise we simply encounter a
fault error (VM_FAULT_HWPOISON for H/W poisoned marker, VM_FAULT_SIGSEGV
for a guard marker), so only check for the UFFD WP case.
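
The reasoning for the HMM change can be written down as a small decision
table. The sketch below is a user-space toy, not the mm/hmm.c code: the
enums and function names are hypothetical, and TOY_FAULT_RESOLVABLE is an
assumed placeholder for "the fault can be serviced normally".

```c
#include <assert.h>
#include <stdbool.h>

enum toy_entry {
	TOY_ENTRY_NONE,
	TOY_ENTRY_MARKER_UFFD_WP,
	TOY_ENTRY_MARKER_POISONED,
	TOY_ENTRY_MARKER_GUARD,
};

enum toy_fault_result {
	TOY_FAULT_RESOLVABLE,	/* assumption: fault can be serviced */
	TOY_FAULT_HWPOISON,	/* stands in for VM_FAULT_HWPOISON */
	TOY_FAULT_SIGSEGV,	/* stands in for VM_FAULT_SIGSEGV */
};

/* Mirrors the narrowed check: only none or UFFD WP entries are considered. */
static bool toy_hmm_checks_fault(enum toy_entry e)
{
	return e == TOY_ENTRY_NONE || e == TOY_ENTRY_MARKER_UFFD_WP;
}

/* What a fault on each entry would report, per the description above. */
static enum toy_fault_result toy_fault_outcome(enum toy_entry e)
{
	switch (e) {
	case TOY_ENTRY_MARKER_POISONED:
		return TOY_FAULT_HWPOISON;
	case TOY_ENTRY_MARKER_GUARD:
		return TOY_FAULT_SIGSEGV;
	default:
		return TOY_FAULT_RESOLVABLE;
	}
}

void toy_hmm_self_check(void)
{
	/* Entries that only ever produce a fault error are not checked. */
	assert(!toy_hmm_checks_fault(TOY_ENTRY_MARKER_POISONED));
	assert(!toy_hmm_checks_fault(TOY_ENTRY_MARKER_GUARD));
	assert(toy_hmm_checks_fault(TOY_ENTRY_MARKER_UFFD_WP));
}
```

In this model the entries skipped by the check are exactly those whose
outcome is an error code.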

While we're here we also refactor code to make it easier to understand.

[akpm@linux-foundation.org: fix comment typo, per Mike]
Link: https://lkml.kernel.org/r/cover.1762812360.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/c38625fd9a1c1f1cf64ae8a248858e45b3dcdf11.1762812360.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Lorenzo Stoakes and committed by Andrew Morton
c093cf45 8a0e4bdd

+148 -110
+65 -46
fs/userfaultfd.c
···
 {
 	struct vm_area_struct *vma = vmf->vma;
 	pte_t *ptep, pte;
-	bool ret = true;

 	assert_fault_locked(vmf);

 	ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
 	if (!ptep)
-		goto out;
+		return true;

-	ret = false;
 	pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep);

 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
-	 * changes under us.  PTE markers should be handled the same as none
-	 * ptes here.
+	 * changes under us.
 	 */
-	if (huge_pte_none_mostly(pte))
-		ret = true;
+
+	/* Entry is still missing, wait for userspace to resolve the fault. */
+	if (huge_pte_none(pte))
+		return true;
+	/* UFFD PTE markers require userspace to resolve the fault. */
+	if (is_uffd_pte_marker(pte))
+		return true;
+	/*
+	 * If VMA has UFFD WP faults enabled and WP fault, wait for userspace to
+	 * resolve the fault.
+	 */
 	if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
-		ret = true;
-out:
-	return ret;
+		return true;
+
+	return false;
 }
 #else
 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 					      struct vm_fault *vmf,
 					      unsigned long reason)
 {
-	return false;	/* should never get here */
+	/* Should never get here. */
+	VM_WARN_ON_ONCE(1);
+	return false;
 }
 #endif /* CONFIG_HUGETLB_PAGE */

 /*
- * Verify the pagetables are still not ok after having reigstered into
+ * Verify the pagetables are still not ok after having registered into
  * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
  * userfault that has already been resolved, if userfaultfd_read_iter and
  * UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different
···
 	pmd_t *pmd, _pmd;
 	pte_t *pte;
 	pte_t ptent;
-	bool ret = true;
+	bool ret;

 	assert_fault_locked(vmf);

 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
-		goto out;
+		return true;
 	p4d = p4d_offset(pgd, address);
 	if (!p4d_present(*p4d))
-		goto out;
+		return true;
 	pud = pud_offset(p4d, address);
 	if (!pud_present(*pud))
-		goto out;
+		return true;
 	pmd = pmd_offset(pud, address);
 again:
 	_pmd = pmdp_get_lockless(pmd);
 	if (pmd_none(_pmd))
+		return true;
+
+	/*
+	 * A race could arise which would result in a softleaf entry such as
+	 * migration entry unexpectedly being present in the PMD, so explicitly
+	 * check for this and bail out if so.
+	 */
+	if (!pmd_present(_pmd))
+		return false;
+
+	if (pmd_trans_huge(_pmd))
+		return !pmd_write(_pmd) && (reason & VM_UFFD_WP);
+
+	pte = pte_offset_map(pmd, address);
+	if (!pte)
+		goto again;
+
+	/*
+	 * Lockless access: we're in a wait_event so it's ok if it
+	 * changes under us.
+	 */
+	ptent = ptep_get(pte);
+
+	ret = true;
+	/* Entry is still missing, wait for userspace to resolve the fault. */
+	if (pte_none(ptent))
+		goto out;
+	/* UFFD PTE markers require userspace to resolve the fault. */
+	if (is_uffd_pte_marker(ptent))
+		goto out;
+	/*
+	 * If VMA has UFFD WP faults enabled and WP fault, wait for userspace to
+	 * resolve the fault.
+	 */
+	if (!pte_write(ptent) && (reason & VM_UFFD_WP))
 		goto out;

 	ret = false;
-	if (!pmd_present(_pmd))
-		goto out;
-
-	if (pmd_trans_huge(_pmd)) {
-		if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
-			ret = true;
-		goto out;
-	}
-
-	pte = pte_offset_map(pmd, address);
-	if (!pte) {
-		ret = true;
-		goto again;
-	}
-	/*
-	 * Lockless access: we're in a wait_event so it's ok if it
-	 * changes under us.  PTE markers should be handled the same as none
-	 * ptes here.
-	 */
-	ptent = ptep_get(pte);
-	if (pte_none_mostly(ptent))
-		ret = true;
-	if (!pte_write(ptent) && (reason & VM_UFFD_WP))
-		ret = true;
-	pte_unmap(pte);
-
 out:
+	pte_unmap(pte);
 	return ret;
 }

···
 	set_current_state(blocking_state);
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);

-	if (!is_vm_hugetlb_page(vma))
-		must_wait = userfaultfd_must_wait(ctx, vmf, reason);
-	else
+	if (is_vm_hugetlb_page(vma)) {
 		must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
-	if (is_vm_hugetlb_page(vma))
 		hugetlb_vma_unlock_read(vma);
+	} else {
+		must_wait = userfaultfd_must_wait(ctx, vmf, reason);
+	}
+
 	release_fault_lock(vmf);

 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
-8
include/asm-generic/hugetlb.h
···
 }
 #endif

-/* Please refer to comments above pte_none_mostly() for the usage */
-#ifndef __HAVE_ARCH_HUGE_PTE_NONE_MOSTLY
-static inline int huge_pte_none_mostly(pte_t pte)
-{
-	return huge_pte_none(pte) || is_pte_marker(pte);
-}
-#endif
-
 #ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
 static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 		unsigned long addr, pte_t *ptep)
-18
include/linux/swapops.h
···
 		(pte_marker_get(entry) & PTE_MARKER_GUARD);
 }

-/*
- * This is a special version to check pte_none() just to cover the case when
- * the pte is a pte marker.  It existed because in many cases the pte marker
- * should be seen as a none pte; it's just that we have stored some information
- * onto the none pte so it becomes not-none any more.
- *
- * It should be used when the pte is file-backed, ram-based and backing
- * userspace pages, like shmem.  It is not needed upon pgtables that do not
- * support pte markers at all.  For example, it's not needed on anonymous
- * memory, kernel-only memory (including when the system is during-boot),
- * non-ram based generic file-system.  It's fine to be used even there, but the
- * extra pte marker check will be pure overhead.
- */
-static inline int pte_none_mostly(pte_t pte)
-{
-	return pte_none(pte) || is_pte_marker(pte);
-}
-
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
 	struct page *p = pfn_to_page(swp_offset_pfn(entry));
+21
include/linux/userfaultfd_k.h
···
 	return false;
 }

+
+static inline bool is_uffd_pte_marker(pte_t pte)
+{
+	swp_entry_t entry;
+
+	if (pte_present(pte))
+		return false;
+
+	entry = pte_to_swp_entry(pte);
+	if (!is_pte_marker_entry(entry))
+		return false;
+
+	/* UFFD WP, poisoned swap entries are UFFD handled. */
+	if (pte_marker_entry_uffd_wp(entry))
+		return true;
+	if (is_poisoned_swp_entry(entry))
+		return true;
+
+	return false;
+}
+
 #endif /* _LINUX_USERFAULTFD_K_H */
+6 -1
mm/hmm.c
···
 	uint64_t pfn_req_flags = *hmm_pfn;
 	uint64_t new_pfn_flags = 0;

-	if (pte_none_mostly(pte)) {
+	/*
+	 * Any other marker than a UFFD WP marker will result in a fault error
+	 * that will be correctly handled, so we need only check for UFFD WP
+	 * here.
+	 */
+	if (pte_none(pte) || pte_marker_uffd_wp(pte)) {
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (required_fault)
+25 -24
mm/hugetlb.c
···
 	}

 	vmf.orig_pte = huge_ptep_get(mm, vmf.address, vmf.pte);
-	if (huge_pte_none_mostly(vmf.orig_pte)) {
-		if (is_pte_marker(vmf.orig_pte)) {
-			pte_marker marker =
-				pte_marker_get(pte_to_swp_entry(vmf.orig_pte));
-
-			if (marker & PTE_MARKER_POISONED) {
-				ret = VM_FAULT_HWPOISON_LARGE |
-				      VM_FAULT_SET_HINDEX(hstate_index(h));
-				goto out_mutex;
-			} else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) {
-				/* This isn't supported in hugetlb. */
-				ret = VM_FAULT_SIGSEGV;
-				goto out_mutex;
-			}
-		}
-
+	if (huge_pte_none(vmf.orig_pte))
 		/*
-		 * Other PTE markers should be handled the same way as none PTE.
-		 *
 		 * hugetlb_no_page will drop vma lock and hugetlb fault
 		 * mutex internally, which make us return immediately.
 		 */
+		return hugetlb_no_page(mapping, &vmf);
+
+	if (is_pte_marker(vmf.orig_pte)) {
+		const pte_marker marker =
+			pte_marker_get(pte_to_swp_entry(vmf.orig_pte));
+
+		if (marker & PTE_MARKER_POISONED) {
+			ret = VM_FAULT_HWPOISON_LARGE |
+				VM_FAULT_SET_HINDEX(hstate_index(h));
+			goto out_mutex;
+		} else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) {
+			/* This isn't supported in hugetlb. */
+			ret = VM_FAULT_SIGSEGV;
+			goto out_mutex;
+		}
+
 		return hugetlb_no_page(mapping, &vmf);
 	}

···
 	int ret = -ENOMEM;
 	struct folio *folio;
 	bool folio_in_pagecache = false;
+	pte_t dst_ptep;

 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
 		ptl = huge_pte_lock(h, dst_mm, dst_pte);
···
 	if (folio_test_hwpoison(folio))
 		goto out_release_unlock;

-	/*
-	 * We allow to overwrite a pte marker: consider when both MISSING|WP
-	 * registered, we firstly wr-protect a none pte which has no page cache
-	 * page backing it, then access the page.
-	 */
 	ret = -EEXIST;
-	if (!huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte)))
+
+	dst_ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
+	/*
+	 * See comment about UFFD marker overwriting in
+	 * mfill_atomic_install_pte().
+	 */
+	if (!huge_pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
 		goto out_release_unlock;

 	if (folio_in_pagecache)
+14 -3
mm/mincore.c
···
 	spinlock_t *ptl;

 	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
+
 	/*
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
-	present = pte && !huge_pte_none_mostly(huge_ptep_get(walk->mm, addr, pte));
+	if (!pte) {
+		present = 0;
+	} else {
+		const pte_t ptep = huge_ptep_get(walk->mm, addr, pte);
+
+		if (huge_pte_none(ptep) || is_pte_marker(ptep))
+			present = 0;
+		else
+			present = 1;
+	}
+
 	for (; addr != end; vec++, addr += PAGE_SIZE)
 		*vec = present;
 	walk->private = vec;
···
 		pte_t pte = ptep_get(ptep);

 		step = 1;
-		/* We need to do cache lookup too for pte markers */
-		if (pte_none_mostly(pte))
+		/* We need to do cache lookup too for markers */
+		if (pte_none(pte) || is_pte_marker(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
 						 vma, vec);
 		else if (pte_present(pte)) {
+17 -10
mm/userfaultfd.c
···
 	spinlock_t *ptl;
 	struct folio *folio = page_folio(page);
 	bool page_in_cache = folio_mapping(folio);
+	pte_t dst_ptep;

 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
 	_dst_pte = pte_mkdirty(_dst_pte);
···
 	}

 	ret = -EEXIST;
+
+	dst_ptep = ptep_get(dst_pte);
+
 	/*
-	 * We allow to overwrite a pte marker: consider when both MISSING|WP
-	 * registered, we firstly wr-protect a none pte which has no page cache
-	 * page backing it, then access the page.
+	 * We are allowed to overwrite a UFFD pte marker: consider when both
+	 * MISSING|WP registered, we firstly wr-protect a none pte which has no
+	 * page cache page backing it, then access the page.
 	 */
-	if (!pte_none_mostly(ptep_get(dst_pte)))
+	if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
 		goto out_unlock;

 	if (page_in_cache) {
···
 			goto out_unlock;
 		}

-		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) &&
-		    !huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) {
-			err = -EEXIST;
-			hugetlb_vma_unlock_read(dst_vma);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			goto out_unlock;
+		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
+			const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
+
+			if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
+				err = -EEXIST;
+				hugetlb_vma_unlock_read(dst_vma);
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+				goto out_unlock;
+			}
 		}

 		err = hugetlb_mfill_atomic_pte(dst_pte, dst_vma, dst_addr,