Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing

The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
PMDs respectively as requiring balancing upon a subsequent page fault.
User-defined PROT_NONE memory regions, which also have this flag set, do
not normally invoke the NUMA balancing code, because do_page_fault() sends
a SIGSEGV to the process before handle_mm_fault() is even called.
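The ordering that normally keeps user PROT_NONE regions away from the NUMA
code can be illustrated with a small userspace model. This is a sketch
under simplifying assumptions, not kernel code: struct model_vma, the
MODEL_VM_* flags and model_hw_fault() are hypothetical stand-ins for
struct vm_area_struct, VM_READ/VM_WRITE/VM_EXEC and the permission check
the arch do_page_fault() performs before it calls handle_mm_fault().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VM_READ / VM_WRITE / VM_EXEC (values arbitrary). */
#define MODEL_VM_READ  0x1u
#define MODEL_VM_WRITE 0x2u
#define MODEL_VM_EXEC  0x4u

enum model_access { MODEL_ACCESS_READ, MODEL_ACCESS_WRITE, MODEL_ACCESS_EXEC };

enum model_result {
	MODEL_SIGSEGV,     /* arch fault handler rejected the access */
	MODEL_NUMA_FAULT,  /* handle_mm_fault() would reach the NUMA code */
	MODEL_OTHER_FAULT  /* any other fault handling */
};

/* Hypothetical, cut-down stand-in for struct vm_area_struct. */
struct model_vma {
	unsigned long vm_flags;
};

/* Simplified arch permission check: does the VMA allow this access? */
static bool model_access_permitted(const struct model_vma *vma,
				   enum model_access access)
{
	switch (access) {
	case MODEL_ACCESS_READ:
		return vma->vm_flags & MODEL_VM_READ;
	case MODEL_ACCESS_WRITE:
		return vma->vm_flags & MODEL_VM_WRITE;
	case MODEL_ACCESS_EXEC:
		return vma->vm_flags & MODEL_VM_EXEC;
	}
	return false;
}

/*
 * Model of the hardware fault path: permissions are checked first, and only
 * then does the generic handler look at the page table entry.
 */
static enum model_result model_hw_fault(const struct model_vma *vma,
					enum model_access access,
					bool pte_protnone)
{
	if (!model_access_permitted(vma, access))
		return MODEL_SIGSEGV;	/* do_page_fault() signals the task */
	return pte_protnone ? MODEL_NUMA_FAULT : MODEL_OTHER_FAULT;
}
```

In this model a protnone entry reaches the NUMA branch only when the VMA
grants the attempted access, which is the situation a genuine NUMA hinting
entry is in; a user PROT_NONE VMA is rejected before the entry is examined.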

However, if access_remote_vm() is invoked to access a PROT_NONE region of
memory, handle_mm_fault() is called via __get_user_pages() and
faultin_page() without any access checks being performed, meaning the
NUMA balancing logic is incorrectly invoked on a region that was never
marked for NUMA balancing.
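Unlike the hardware fault path described above, faultin_page() hands the
fault to handle_mm_fault() directly. The sketch below models the
pre-patch dispatch under the same assumptions: every model_* name is
hypothetical, the PTE is reduced to two booleans, and the permission
check is omitted entirely, as the paragraph above describes. Because the
VMA is never consulted, a present protnone entry in a user PROT_NONE
region is routed to the NUMA handler, where the condition tested by the
old BUG_ON() in do_numa_page() holds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VM_READ / VM_WRITE / VM_EXEC (values arbitrary). */
#define MODEL_VM_READ  0x1u
#define MODEL_VM_WRITE 0x2u
#define MODEL_VM_EXEC  0x4u

enum model_handler { MODEL_DO_NUMA_PAGE, MODEL_DO_SWAP_PAGE, MODEL_GENERIC };

struct model_vma {
	unsigned long vm_flags;
};

struct model_pte {
	bool present;	/* stands in for pte_present() */
	bool protnone;	/* stands in for pte_protnone() */
};

/*
 * Pre-patch dispatch in handle_pte_fault(), simplified (the real function
 * handles empty PTEs earlier). The VMA is never consulted.
 */
static enum model_handler model_handle_pte_fault_old(const struct model_vma *vma,
						     struct model_pte pte)
{
	(void)vma;	/* unused: this is exactly the problem */
	if (!pte.present)
		return MODEL_DO_SWAP_PAGE;
	if (pte.protnone)
		return MODEL_DO_NUMA_PAGE;
	return MODEL_GENERIC;
}

/* Model of faultin_page(): calls the handler with no permission check. */
static enum model_handler model_faultin_page(const struct model_vma *vma,
					     struct model_pte pte)
{
	return model_handle_pte_fault_old(vma, pte);
}

/* The condition the pre-patch BUG_ON() in do_numa_page() tested. */
static bool model_numa_bug_on_would_fire(const struct model_vma *vma)
{
	return !(vma->vm_flags & (MODEL_VM_READ | MODEL_VM_EXEC | MODEL_VM_WRITE));
}
```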

A simple way to trigger this problem is to access PROT_NONE mmap'd memory
through /proc/self/mem, which reliably results in the NUMA handling
functions being invoked when CONFIG_NUMA_BALANCING is set.

This issue was reported in the kernel bugzilla (bug 99101), which
includes some simple repro code.

There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page(),
added in commit c0e7cad, to avoid accidentally provoking strange
behaviour by attempting to apply NUMA balancing to pages that are in
fact PROT_NONE. These BUG_ON()s are consistently triggered by the repro.

This patch removes those BUG_ON()s and moves the PROT_NONE check into
mm/memory.c: do_numa_page() and do_huge_pmd_numa_page() are now only
called if the VMA has at least one of VM_READ, VM_EXEC or VM_WRITE set.
Faulting in these pages via faultin_page() is a valid way of reaching
the NUMA check with the PROT_NONE page table flag set, so it is not
always a bug.
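The patched checks can be modelled the same way. In this hypothetical
sketch, model_vma_is_accessible() applies the same test as the
vma_is_accessible() helper the patch adds to mm/memory.c, while
model_pte_dispatch() and model_pmd_dispatch() stand in for the guarded
calls in handle_pte_fault() and __handle_mm_fault(). The MODEL_VM_*
values are arbitrary placeholders, not the kernel's flag values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VM_READ / VM_WRITE / VM_EXEC (values arbitrary). */
#define MODEL_VM_READ  0x1u
#define MODEL_VM_WRITE 0x2u
#define MODEL_VM_EXEC  0x4u

enum model_handler {
	MODEL_DO_NUMA_PAGE,
	MODEL_DO_HUGE_PMD_NUMA_PAGE,
	MODEL_GENERIC	/* the ordinary fault handling that follows the check */
};

struct model_vma {
	unsigned long vm_flags;
};

/* Same test as the patch's vma_is_accessible(): any of read/exec/write. */
static inline bool model_vma_is_accessible(const struct model_vma *vma)
{
	return vma->vm_flags & (MODEL_VM_READ | MODEL_VM_EXEC | MODEL_VM_WRITE);
}

/* Post-patch PTE-level check, modelled on handle_pte_fault(). */
static enum model_handler model_pte_dispatch(const struct model_vma *vma,
					     bool pte_protnone)
{
	if (pte_protnone && model_vma_is_accessible(vma))
		return MODEL_DO_NUMA_PAGE;
	return MODEL_GENERIC;
}

/* Post-patch huge-PMD check, modelled on __handle_mm_fault(). */
static enum model_handler model_pmd_dispatch(const struct model_vma *vma,
					     bool pmd_protnone)
{
	if (pmd_protnone && model_vma_is_accessible(vma))
		return MODEL_DO_HUGE_PMD_NUMA_PAGE;
	return MODEL_GENERIC;
}
```

Guarding the call sites, rather than asserting inside the NUMA handlers,
lets a fault on an inaccessible VMA continue into the ordinary fault
handling instead of hitting a BUG_ON().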

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
Reported-by: Trevor Saunders <tbsaunde@tbsaunde.org>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Lorenzo Stoakes and committed by Linus Torvalds
38e08854 831e45d8

2 files changed, 7 insertions(+), 8 deletions(-)
mm/huge_memory.c (-3)
···
 	bool was_writable;
 	int flags = 0;
 
-	/* A PROT_NONE fault should not end up here */
-	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
-
 	fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
 	if (unlikely(!pmd_same(pmd, *fe->pmd)))
 		goto out_unlock;
mm/memory.c (+7 -5)
···
 	bool was_writable = pte_write(pte);
 	int flags = 0;
 
-	/* A PROT_NONE fault should not end up here */
-	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
-
 	/*
 	 * The "pte" at this point cannot be used safely without
 	 * validation through pte_unmap_same(). It's of NUMA type but
···
 	return VM_FAULT_FALLBACK;
 }
 
+static inline bool vma_is_accessible(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE);
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
···
 	if (!pte_present(entry))
 		return do_swap_page(fe, entry);
 
-	if (pte_protnone(entry))
+	if (pte_protnone(entry) && vma_is_accessible(fe->vma))
 		return do_numa_page(fe, entry);
 
 	fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
···
 
 		barrier();
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
-			if (pmd_protnone(orig_pmd))
+			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&fe, orig_pmd);
 
 			if ((fe.flags & FAULT_FLAG_WRITE) &&