Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: set the VM_MAYBE_GUARD flag on guard region install

Now that we have established the VM_MAYBE_GUARD flag and added the capacity
to set it atomically, do so upon MADV_GUARD_INSTALL.

The places where this flag is currently used, and where its value matters,
are:

* VMA merge - performed under mmap/VMA write lock, therefore excluding
racing writes.

* /proc/$pid/smaps - can race the write, however this isn't meaningful
as the flag write is performed at the point of the guard region being
established, and thus an smaps reader can't reasonably expect to avoid
races. Due to atomicity, a reader will observe either the flag being
set or not. Therefore consistency will be maintained.

In all other cases the flag being set is irrelevant and atomicity
guarantees other flags will be read correctly.

Note that non-atomic updates of unrelated flags do not cause an issue with
this flag being set atomically, as writes of other flags are performed
under mmap/VMA write lock, and these atomic writes are performed under
mmap/VMA read lock, which excludes the write, avoiding RMW races.

Note that we do not encounter issues with KCSAN by adjusting this flag
atomically, as we are only updating a single bit in the flag bitmap and
therefore we do not need to annotate these changes.
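
As a user-space illustration of the single-bit atomic update described
above, here is a minimal sketch using C11 atomics. It is not the kernel
implementation: struct sketch_vma, SKETCH_MAYBE_GUARD_BIT (including its
value) and the sketch_flag_*() helpers are hypothetical names invented for
this example, standing in for the VMA flags word and for
vma_flag_set_atomic() / vma_flag_test_atomic(), whose internals are not
shown in this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit index, chosen only for this sketch. */
#define SKETCH_MAYBE_GUARD_BIT 11u

/* Hypothetical stand-in for the flags word of a VMA. */
struct sketch_vma {
	_Atomic unsigned long flags;
};

/* Set a single bit with one atomic read-modify-write; other bits are kept. */
static void sketch_flag_set_atomic(struct sketch_vma *vma, unsigned int bit)
{
	assert(bit < sizeof(unsigned long) * 8);
	atomic_fetch_or(&vma->flags, 1UL << bit);
}

/* Atomically load the flags word and report whether the bit is set. */
static bool sketch_flag_test_atomic(struct sketch_vma *vma, unsigned int bit)
{
	assert(bit < sizeof(unsigned long) * 8);
	return (atomic_load(&vma->flags) >> bit) & 1UL;
}
```

Because the update is a single atomic OR, a concurrent reader of the word
sees either the old value or the old value with the one extra bit set, and
never a torn mixture of other flags.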

We intentionally set this flag in advance of actually updating the page
tables, to ensure that any racing atomic read of this flag will only
return false prior to page tables being updated, to allow for
serialisation via page table locks.

Note that we set vma->anon_vma for anonymous mappings. This is because
the expectation for anonymous mappings is that an anon_vma is established
should they possess any page table mappings. This is also consistent with
what we were doing prior to this patch (unconditionally setting anon_vma
on guard region installation).

We also need to update retract_page_tables() to ensure that madvise(...,
MADV_COLLAPSE) doesn't incorrectly collapse file-backed ranges containing
guard regions.

This was previously guarded by anon_vma being set to catch MAP_PRIVATE
cases, but the introduction of VM_MAYBE_GUARD necessitates that we check
this flag instead.

We utilise vma_flag_test_atomic() to do so - we first perform an
optimistic check, then after the PTE page table lock is held, we can check
again safely, as upon guard marker install the flag is set atomically
prior to the page table lock being taken to actually apply it.

So if the initial optimistic check does not observe the flag, then either:

* Page table retraction acquires page table lock prior to VM_MAYBE_GUARD
being set - guard marker installation will be blocked until page table
retraction is complete.

OR:

* Guard marker installation acquires page table lock after setting
VM_MAYBE_GUARD, which raced and didn't pick this up in the initial
optimistic check, blocking page table retraction until the guard regions
are installed - the second VM_MAYBE_GUARD check will prevent page table
retraction.

Either way we're safe.
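
The ordering argument above can be modelled in user space. The following is
a minimal sketch under stated assumptions, not kernel code: struct
sketch_range, the sketch_*() functions and the atomic_flag spinlock are all
hypothetical stand-ins (for the VMA flag, the PTE page table lock, guard
marker installation and page table retraction respectively).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one VMA range and its page table lock. */
struct sketch_range {
	atomic_bool maybe_guard;	/* stands in for VM_MAYBE_GUARD */
	atomic_flag ptl;		/* stands in for the PTE page table lock */
	bool markers_installed;		/* protected by ptl */
	bool retracted;			/* protected by ptl */
};

static void sketch_range_init(struct sketch_range *r)
{
	atomic_init(&r->maybe_guard, false);
	atomic_flag_clear(&r->ptl);
	r->markers_installed = false;
	r->retracted = false;
}

static void sketch_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set(l))
		;	/* spin until the lock is released */
}

static void sketch_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/* Installer: publish the flag first, only then take the lock and install. */
static void sketch_guard_install(struct sketch_range *r)
{
	atomic_store(&r->maybe_guard, true);
	sketch_lock(&r->ptl);
	r->markers_installed = true;
	sketch_unlock(&r->ptl);
}

/* Retractor: optimistic check, then the same check again under the lock. */
static bool sketch_retract(struct sketch_range *r)
{
	bool done = false;

	if (atomic_load(&r->maybe_guard))
		return false;	/* optimistic check saw the flag */

	sketch_lock(&r->ptl);
	if (!atomic_load(&r->maybe_guard)) {
		/* Flag unset under the lock implies no markers exist yet. */
		assert(!r->markers_installed);
		r->retracted = true;
		done = true;
	}
	sketch_unlock(&r->ptl);
	return done;
}
```

In this model the installer sets the flag before it can ever hold the lock,
so a retractor that sees the flag clear while holding the lock knows no
markers have been installed yet, which is what the assert() expresses.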

We refactor the retraction checks into a single
file_backed_vma_is_retractable(), as there doesn't seem to be any reason
for the checks to have been separated as before.

Note that VM_MAYBE_GUARD being set atomically remains correct as
vma_needs_copy() is invoked with the mmap and VMA write locks held,
excluding any race with madvise_guard_install().

Link: https://lkml.kernel.org/r/e9e9ce95b6ac17497de7f60fc110c7dd9e489e8d.1763460113.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Lorenzo Stoakes and committed by Andrew Morton
49e14dab ab04b530

+61 -32
+47 -24
mm/khugepaged.c
···
 	return result;
 }
 
+/* Can we retract page tables for this file-backed VMA? */
+static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
+{
+	/*
+	 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
+	 * got written to. These VMAs are likely not worth removing
+	 * page tables from, as PMD-mapping is likely to be split later.
+	 */
+	if (READ_ONCE(vma->anon_vma))
+		return false;
+
+	/*
+	 * When a vma is registered with uffd-wp, we cannot recycle
+	 * the page table because there may be pte markers installed.
+	 * Other vmas can still have the same file mapped hugely, but
+	 * skip this one: it will always be mapped in small page size
+	 * for uffd-wp registered ranges.
+	 */
+	if (userfaultfd_wp(vma))
+		return false;
+
+	/*
+	 * If the VMA contains guard regions then we can't collapse it.
+	 *
+	 * This is set atomically on guard marker installation under mmap/VMA
+	 * read lock, and here we may not hold any VMA or mmap lock at all.
+	 *
+	 * This is therefore serialised on the PTE page table lock, which is
+	 * obtained on guard region installation after the flag is set, so this
+	 * check being performed under this lock excludes races.
+	 */
+	if (vma_flag_test_atomic(vma, VM_MAYBE_GUARD_BIT))
+		return false;
+
+	return true;
+}
+
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
···
 		spinlock_t *ptl;
 		bool success = false;
 
-		/*
-		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth removing
-		 * page tables from, as PMD-mapping is likely to be split later.
-		 */
-		if (READ_ONCE(vma->anon_vma))
-			continue;
-
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		if (addr & ~HPAGE_PMD_MASK ||
 		    vma->vm_end < addr + HPAGE_PMD_SIZE)
···
 
 		if (hpage_collapse_test_exit(mm))
 			continue;
-		/*
-		 * When a vma is registered with uffd-wp, we cannot recycle
-		 * the page table because there may be pte markers installed.
-		 * Other vmas can still have the same file mapped hugely, but
-		 * skip this one: it will always be mapped in small page size
-		 * for uffd-wp registered ranges.
-		 */
-		if (userfaultfd_wp(vma))
+
+		if (!file_backed_vma_is_retractable(vma))
 			continue;
 
 		/* PTEs were notified when unmapped; but now for the PMD? */
···
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
 		/*
-		 * Huge page lock is still held, so normally the page table
-		 * must remain empty; and we have already skipped anon_vma
-		 * and userfaultfd_wp() vmas. But since the mmap_lock is not
-		 * held, it is still possible for a racing userfaultfd_ioctl()
-		 * to have inserted ptes or markers. Now that we hold ptlock,
-		 * repeating the anon_vma check protects from one category,
-		 * and repeating the userfaultfd_wp() check from another.
+		 * Huge page lock is still held, so normally the page table must
+		 * remain empty; and we have already skipped anon_vma and
+		 * userfaultfd_wp() vmas. But since the mmap_lock is not held,
+		 * it is still possible for a racing userfaultfd_ioctl() or
+		 * madvise() to have inserted ptes or markers. Now that we hold
+		 * ptlock, repeating the retractable checks protects us from
+		 * races against the prior checks.
 		 */
-		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
+		if (likely(file_backed_vma_is_retractable(vma))) {
 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
 			pmdp_get_lockless_sync();
 			success = true;
+14 -8
mm/madvise.c
···
 		return -EINVAL;
 
 	/*
-	 * If we install guard markers, then the range is no longer
-	 * empty from a page table perspective and therefore it's
-	 * appropriate to have an anon_vma.
-	 *
-	 * This ensures that on fork, we copy page tables correctly.
+	 * Set atomically under read lock. All pertinent readers will need to
+	 * acquire an mmap/VMA write lock to read it. All remaining readers may
+	 * or may not see the flag set, but we don't care.
 	 */
-	err = anon_vma_prepare(vma);
-	if (err)
-		return err;
+	vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
+
+	/*
+	 * If anonymous and we are establishing page tables the VMA ought to
+	 * have an anon_vma associated with it.
+	 */
+	if (vma_is_anonymous(vma)) {
+		err = anon_vma_prepare(vma);
+		if (err)
+			return err;
+	}
 
 	/*
 	 * Optimistically try to install the guard marker pages first. If any