Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c

Patch series "Make core VMA operations internal and testable", v4.

There are a number of "core" VMA manipulation functions implemented in
mm/mmap.c, notably those concerning VMA merging, splitting, modifying,
expanding and shrinking, which logically don't belong there.

More importantly this functionality represents an internal implementation
detail of memory management and should not be exposed outside of mm/
itself.

This patch series isolates core VMA manipulation functionality into its
own file, mm/vma.c, and provides an API to the rest of the mm code in
mm/vma.h.

Importantly, it also carefully implements mm/vma_internal.h, which
specifies which headers need to be imported by vma.c, leading to the very
useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h.

This means we can then re-implement vma_internal.h in userland, adding
shims for kernel mechanisms as required, allowing us to unit test internal
VMA functionality.
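
As a rough illustration of the shimming idea described above (not code from this series), the sketch below keeps the "core" logic dependent only on a small set of declarations, which a userland build satisfies with trivial stand-ins. Every name in it (toy_lock, toy_alloc, toy_range_create and so on) is hypothetical and is not part of mm/vma.c, mm/vma.h or mm/vma_internal.h.

```c
/* Sketch only: all names are hypothetical, not the kernel's real API. */
#include <assert.h>
#include <stdlib.h>

/*
 * Userland stand-in for an "internal" header: minimal shims for the
 * kernel mechanisms that the core logic below relies on.
 */
struct toy_lock {
	int held;
};

void toy_write_lock(struct toy_lock *lock)
{
	assert(!lock->held);	/* catch double-locking in tests */
	lock->held = 1;
}

void toy_write_unlock(struct toy_lock *lock)
{
	assert(lock->held);
	lock->held = 0;
}

/* Stands in for a kernel allocator; returns zeroed memory or NULL. */
void *toy_alloc(size_t size)
{
	return calloc(1, size);
}

void toy_free(void *ptr)
{
	free(ptr);
}

/*
 * "Core" logic: it uses only the declarations above, so it builds
 * unchanged against either the real or the shimmed implementation.
 */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

struct toy_range *toy_range_create(struct toy_lock *lock,
				   unsigned long start, unsigned long end)
{
	struct toy_range *range;

	if (start >= end)
		return NULL;

	toy_write_lock(lock);
	range = toy_alloc(sizeof(*range));
	if (range) {
		range->start = start;
		range->end = end;
	}
	toy_write_unlock(lock);

	return range;
}
```

A test can then call toy_range_create() directly and inspect both the returned range and the lock state, with no kernel side-effects involved.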

This approach has advantages over, e.g., a kunit implementation: we avoid
all external kernel side-effects while testing, run tests very quickly,
and can iterate on and debug problems quickly.

Excitingly this opens the door to, in the future, recreating precise
problems observed in production in userland and very quickly debugging
problems that might otherwise be very difficult to reproduce.

This patch series takes advantage of existing shim logic and full userland
maple tree support contained in tools/testing/radix-tree/ and
tools/include/linux/, separating out shared components of the radix tree
implementation to provide this testing.

Kernel functionality is stubbed and shimmed as needed in
tools/testing/vma/ which contains a fully functional userland
vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly
tested from userland.

A simple, skeleton testing implementation is provided in
tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA
merge, modify (testing split), expand and shrink functionality work
correctly.
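
To give a feel for what such assertions might look like, here is a deliberately simplified sketch. It is not the contents of tools/testing/vma/vma.c: the toy_vma type and the toy_merge() and toy_split() helpers are hypothetical, and the real merge rules consider far more than adjacency and equal flags.

```c
/* Sketch only: a toy model, far simpler than the kernel's VMA logic. */
#include <assert.h>
#include <stdbool.h>

struct toy_vma {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	unsigned long flags;
};

/* Assumed rule: mergeable if exactly adjacent and flags are identical. */
bool toy_can_merge(const struct toy_vma *prev, const struct toy_vma *next)
{
	return prev->end == next->start && prev->flags == next->flags;
}

/* Expand prev to cover next when permitted; returns true on success. */
bool toy_merge(struct toy_vma *prev, const struct toy_vma *next)
{
	if (!toy_can_merge(prev, next))
		return false;

	prev->end = next->end;
	return true;
}

/* Split vma at addr, shrinking vma and writing the upper part to tail. */
bool toy_split(struct toy_vma *vma, unsigned long addr, struct toy_vma *tail)
{
	if (addr <= vma->start || addr >= vma->end)
		return false;

	tail->start = addr;
	tail->end = vma->end;
	tail->flags = vma->flags;
	vma->end = addr;
	return true;
}

/* A skeleton test in the same spirit: merge, refuse to merge, split. */
int toy_run_tests(void)
{
	struct toy_vma a = { 0x1000, 0x2000, 1 };
	struct toy_vma b = { 0x2000, 0x3000, 1 };
	struct toy_vma c = { 0x3000, 0x4000, 2 };
	struct toy_vma tail;

	assert(toy_merge(&a, &b));
	assert(a.start == 0x1000 && a.end == 0x3000);

	assert(!toy_merge(&a, &c));	/* flags differ */

	assert(toy_split(&a, 0x2000, &tail));
	assert(a.end == 0x2000);
	assert(tail.start == 0x2000 && tail.end == 0x3000);

	return 0;
}
```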


This patch (of 4):

This patch forms part of a patch series intending to separate out VMA
logic and render it testable from userspace, which requires that core
manipulation functions be exposed in an mm/-internal header file.

In order to do this, we must abstract APIs we wish to test, in this
instance functions which ultimately invoke vma_modify().

This patch therefore moves all logic which ultimately invokes vma_modify()
to mm/userfaultfd.c, trying to transfer code at a functional granularity
where possible.

[lorenzo.stoakes@oracle.com: fix use-after-free in userfaultfd_clear_vma()]
Link: https://lkml.kernel.org/r/3c947ddc-b804-49b7-8fe9-3ea3ca13def5@lucifer.local
Link: https://lkml.kernel.org/r/cover.1722251717.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Gow <davidgow@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kees Cook <kees@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rae Moar <rmoar@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Lorenzo Stoakes and committed by Andrew Morton
a17c7d8f d075bcce

+198 -149
+11 -149
fs/userfaultfd.c
···
 	return ctx->features & UFFD_FEATURE_WP_UNPOPULATED;
 }

-static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
-				     vm_flags_t flags)
-{
-	const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
-
-	vm_flags_reset(vma, flags);
-	/*
-	 * For shared mappings, we want to enable writenotify while
-	 * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
-	 * recalculate vma->vm_page_prot whenever userfaultfd-wp changes.
-	 */
-	if ((vma->vm_flags & VM_SHARED) && uffd_wp_changed)
-		vma_set_page_prot(vma);
-}
-
 static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
 				     int wake_flags, void *key)
 {
···
 	spin_unlock_irq(&ctx->event_wqh.lock);

 	if (release_new_ctx) {
-		struct vm_area_struct *vma;
-		struct mm_struct *mm = release_new_ctx->mm;
-		VMA_ITERATOR(vmi, mm, 0);
-
-		/* the various vma->vm_userfaultfd_ctx still points to it */
-		mmap_write_lock(mm);
-		for_each_vma(vmi, vma) {
-			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
-				vma_start_write(vma);
-				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-				userfaultfd_set_vm_flags(vma,
-							 vma->vm_flags & ~__VM_UFFD_FLAGS);
-			}
-		}
-		mmap_write_unlock(mm);
-
+		userfaultfd_release_new(release_new_ctx);
 		userfaultfd_ctx_put(release_new_ctx);
 	}

···
 		return 0;

 	if (!(octx->features & UFFD_FEATURE_EVENT_FORK)) {
-		vma_start_write(vma);
-		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
+		userfaultfd_reset_ctx(vma);
 		return 0;
 	}

···
 		up_write(&ctx->map_changing_lock);
 	} else {
 		/* Drop uffd context if remap feature not enabled */
-		vma_start_write(vma);
-		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
+		userfaultfd_reset_ctx(vma);
 	}
 }

···
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
 	struct mm_struct *mm = ctx->mm;
-	struct vm_area_struct *vma, *prev;
 	/* len == 0 means wake all */
 	struct userfaultfd_wake_range range = { .len = 0, };
-	unsigned long new_flags;
-	VMA_ITERATOR(vmi, mm, 0);

 	WRITE_ONCE(ctx->released, true);

-	if (!mmget_not_zero(mm))
-		goto wakeup;
+	userfaultfd_release_all(mm, ctx);

-	/*
-	 * Flush page faults out of all CPUs. NOTE: all page faults
-	 * must be retried without returning VM_FAULT_SIGBUS if
-	 * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
-	 * changes while handle_userfault released the mmap_lock. So
-	 * it's critical that released is set to true (above), before
-	 * taking the mmap_lock for writing.
-	 */
-	mmap_write_lock(mm);
-	prev = NULL;
-	for_each_vma(vmi, vma) {
-		cond_resched();
-		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
-		       !!(vma->vm_flags & __VM_UFFD_FLAGS));
-		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
-			prev = vma;
-			continue;
-		}
-		/* Reset ptes for the whole vma range if wr-protected */
-		if (userfaultfd_wp(vma))
-			uffd_wp_range(vma, vma->vm_start,
-				      vma->vm_end - vma->vm_start, false);
-		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
-		vma = vma_modify_flags_uffd(&vmi, prev, vma, vma->vm_start,
-					    vma->vm_end, new_flags,
-					    NULL_VM_UFFD_CTX);
-
-		vma_start_write(vma);
-		userfaultfd_set_vm_flags(vma, new_flags);
-		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-
-		prev = vma;
-	}
-	mmap_write_unlock(mm);
-	mmput(mm);
-wakeup:
 	/*
 	 * After no new page faults can wait on this fault_*wqh, flush
 	 * the last page faults that may have been already waiting on
···
 				unsigned long arg)
 {
 	struct mm_struct *mm = ctx->mm;
-	struct vm_area_struct *vma, *prev, *cur;
+	struct vm_area_struct *vma, *cur;
 	int ret;
 	struct uffdio_register uffdio_register;
 	struct uffdio_register __user *user_uffdio_register;
-	unsigned long vm_flags, new_flags;
+	unsigned long vm_flags;
 	bool found;
 	bool basic_ioctls;
-	unsigned long start, end, vma_end;
+	unsigned long start, end;
 	struct vma_iterator vmi;
 	bool wp_async = userfaultfd_wp_async_ctx(ctx);

···
 	} for_each_vma_range(vmi, cur, end);
 	BUG_ON(!found);

-	vma_iter_set(&vmi, start);
-	prev = vma_prev(&vmi);
-	if (vma->vm_start < start)
-		prev = vma;
-
-	ret = 0;
-	for_each_vma_range(vmi, vma, end) {
-		cond_resched();
-
-		BUG_ON(!vma_can_userfault(vma, vm_flags, wp_async));
-		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
-		       vma->vm_userfaultfd_ctx.ctx != ctx);
-		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
-
-		/*
-		 * Nothing to do: this vma is already registered into this
-		 * userfaultfd and with the right tracking mode too.
-		 */
-		if (vma->vm_userfaultfd_ctx.ctx == ctx &&
-		    (vma->vm_flags & vm_flags) == vm_flags)
-			goto skip;
-
-		if (vma->vm_start > start)
-			start = vma->vm_start;
-		vma_end = min(end, vma->vm_end);
-
-		new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
-		vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end,
-					    new_flags,
-					    (struct vm_userfaultfd_ctx){ctx});
-		if (IS_ERR(vma)) {
-			ret = PTR_ERR(vma);
-			break;
-		}
-
-		/*
-		 * In the vma_merge() successful mprotect-like case 8:
-		 * the next vma was merged into the current one and
-		 * the current one has not been updated yet.
-		 */
-		vma_start_write(vma);
-		userfaultfd_set_vm_flags(vma, new_flags);
-		vma->vm_userfaultfd_ctx.ctx = ctx;
-
-		if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
-			hugetlb_unshare_all_pmds(vma);
-
-	skip:
-		prev = vma;
-		start = vma->vm_end;
-	}
+	ret = userfaultfd_register_range(ctx, vma, vm_flags, start, end,
+					 wp_async);

 out_unlock:
 	mmap_write_unlock(mm);
···
 	struct vm_area_struct *vma, *prev, *cur;
 	int ret;
 	struct uffdio_range uffdio_unregister;
-	unsigned long new_flags;
 	bool found;
 	unsigned long start, end, vma_end;
 	const void __user *buf = (void __user *)arg;
···
 			wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
 		}

-		/* Reset ptes for the whole vma range if wr-protected */
-		if (userfaultfd_wp(vma))
-			uffd_wp_range(vma, start, vma_end - start, false);
-
-		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
-		vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end,
-					    new_flags, NULL_VM_UFFD_CTX);
+		vma = userfaultfd_clear_vma(&vmi, prev, vma,
+					    start, vma_end);
 		if (IS_ERR(vma)) {
 			ret = PTR_ERR(vma);
 			break;
 		}
-
-		/*
-		 * In the vma_merge() successful mprotect-like case 8:
-		 * the next vma was merged into the current one and
-		 * the current one has not been updated yet.
-		 */
-		vma_start_write(vma);
-		userfaultfd_set_vm_flags(vma, new_flags);
-		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

 	skip:
 		prev = vma;
+19
include/linux/userfaultfd_k.h
···
 extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma);
 extern bool userfaultfd_wp_async(struct vm_area_struct *vma);

+void userfaultfd_reset_ctx(struct vm_area_struct *vma);
+
+struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi,
+					     struct vm_area_struct *prev,
+					     struct vm_area_struct *vma,
+					     unsigned long start,
+					     unsigned long end);
+
+int userfaultfd_register_range(struct userfaultfd_ctx *ctx,
+			       struct vm_area_struct *vma,
+			       unsigned long vm_flags,
+			       unsigned long start, unsigned long end,
+			       bool wp_async);
+
+void userfaultfd_release_new(struct userfaultfd_ctx *ctx);
+
+void userfaultfd_release_all(struct mm_struct *mm,
+			     struct userfaultfd_ctx *ctx);
+
 #else /* CONFIG_USERFAULTFD */

 /* mm helpers */
+168
mm/userfaultfd.c
···
 	VM_WARN_ON(!moved && !err);
 	return moved ? moved : err;
 }
+
+static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
+				     vm_flags_t flags)
+{
+	const bool uffd_wp_changed = (vma->vm_flags ^ flags) & VM_UFFD_WP;
+
+	vm_flags_reset(vma, flags);
+	/*
+	 * For shared mappings, we want to enable writenotify while
+	 * userfaultfd-wp is enabled (see vma_wants_writenotify()). We'll simply
+	 * recalculate vma->vm_page_prot whenever userfaultfd-wp changes.
+	 */
+	if ((vma->vm_flags & VM_SHARED) && uffd_wp_changed)
+		vma_set_page_prot(vma);
+}
+
+static void userfaultfd_set_ctx(struct vm_area_struct *vma,
+				struct userfaultfd_ctx *ctx,
+				unsigned long flags)
+{
+	vma_start_write(vma);
+	vma->vm_userfaultfd_ctx = (struct vm_userfaultfd_ctx){ctx};
+	userfaultfd_set_vm_flags(vma,
+				 (vma->vm_flags & ~__VM_UFFD_FLAGS) | flags);
+}
+
+void userfaultfd_reset_ctx(struct vm_area_struct *vma)
+{
+	userfaultfd_set_ctx(vma, NULL, 0);
+}
+
+struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi,
+					     struct vm_area_struct *prev,
+					     struct vm_area_struct *vma,
+					     unsigned long start,
+					     unsigned long end)
+{
+	struct vm_area_struct *ret;
+
+	/* Reset ptes for the whole vma range if wr-protected */
+	if (userfaultfd_wp(vma))
+		uffd_wp_range(vma, start, end - start, false);
+
+	ret = vma_modify_flags_uffd(vmi, prev, vma, start, end,
+				    vma->vm_flags & ~__VM_UFFD_FLAGS,
+				    NULL_VM_UFFD_CTX);
+
+	/*
+	 * In the vma_merge() successful mprotect-like case 8:
+	 * the next vma was merged into the current one and
+	 * the current one has not been updated yet.
+	 */
+	if (!IS_ERR(ret))
+		userfaultfd_reset_ctx(ret);
+
+	return ret;
+}
+
+/* Assumes mmap write lock taken, and mm_struct pinned. */
+int userfaultfd_register_range(struct userfaultfd_ctx *ctx,
+			       struct vm_area_struct *vma,
+			       unsigned long vm_flags,
+			       unsigned long start, unsigned long end,
+			       bool wp_async)
+{
+	VMA_ITERATOR(vmi, ctx->mm, start);
+	struct vm_area_struct *prev = vma_prev(&vmi);
+	unsigned long vma_end;
+	unsigned long new_flags;
+
+	if (vma->vm_start < start)
+		prev = vma;
+
+	for_each_vma_range(vmi, vma, end) {
+		cond_resched();
+
+		BUG_ON(!vma_can_userfault(vma, vm_flags, wp_async));
+		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
+		       vma->vm_userfaultfd_ctx.ctx != ctx);
+		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
+
+		/*
+		 * Nothing to do: this vma is already registered into this
+		 * userfaultfd and with the right tracking mode too.
+		 */
+		if (vma->vm_userfaultfd_ctx.ctx == ctx &&
+		    (vma->vm_flags & vm_flags) == vm_flags)
+			goto skip;
+
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+		vma_end = min(end, vma->vm_end);
+
+		new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
+		vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end,
+					    new_flags,
+					    (struct vm_userfaultfd_ctx){ctx});
+		if (IS_ERR(vma))
+			return PTR_ERR(vma);
+
+		/*
+		 * In the vma_merge() successful mprotect-like case 8:
+		 * the next vma was merged into the current one and
+		 * the current one has not been updated yet.
+		 */
+		userfaultfd_set_ctx(vma, ctx, vm_flags);
+
+		if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
+			hugetlb_unshare_all_pmds(vma);
+
+skip:
+		prev = vma;
+		start = vma->vm_end;
+	}
+
+	return 0;
+}
+
+void userfaultfd_release_new(struct userfaultfd_ctx *ctx)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	/* the various vma->vm_userfaultfd_ctx still points to it */
+	mmap_write_lock(mm);
+	for_each_vma(vmi, vma) {
+		if (vma->vm_userfaultfd_ctx.ctx == ctx)
+			userfaultfd_reset_ctx(vma);
+	}
+	mmap_write_unlock(mm);
+}
+
+void userfaultfd_release_all(struct mm_struct *mm,
+			     struct userfaultfd_ctx *ctx)
+{
+	struct vm_area_struct *vma, *prev;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	if (!mmget_not_zero(mm))
+		return;
+
+	/*
+	 * Flush page faults out of all CPUs. NOTE: all page faults
+	 * must be retried without returning VM_FAULT_SIGBUS if
+	 * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
+	 * changes while handle_userfault released the mmap_lock. So
+	 * it's critical that released is set to true (above), before
+	 * taking the mmap_lock for writing.
+	 */
+	mmap_write_lock(mm);
+	prev = NULL;
+	for_each_vma(vmi, vma) {
+		cond_resched();
+		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
+		       !!(vma->vm_flags & __VM_UFFD_FLAGS));
+		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
+			prev = vma;
+			continue;
+		}
+
+		vma = userfaultfd_clear_vma(&vmi, prev, vma,
+					    vma->vm_start, vma->vm_end);
+		prev = vma;
+	}
+	mmap_write_unlock(mm);
+	mmput(mm);
+}
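
The consolidation above hinges on one idiom: clear every uffd-related flag, OR in the requested ones, and recalculate page protection only when the write-protect bit changes on a shared mapping. The standalone sketch below models that idiom in userland. The toy_* names, the flag values and the prot_recalcs counter are all assumptions made for illustration; they are not the kernel's definitions.

```c
/* Sketch only: flag values and names are hypothetical stand-ins. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SHARED		0x0008UL
#define TOY_UFFD_MISSING	0x0200UL
#define TOY_UFFD_WP		0x1000UL
#define TOY_UFFD_FLAGS		(TOY_UFFD_MISSING | TOY_UFFD_WP)

struct toy_vma {
	unsigned long flags;
	void *ctx;		/* stands in for the uffd context pointer */
	int prot_recalcs;	/* counts simulated page-prot recalculations */
};

/* Replace the uffd flags and context of vma in a single step. */
void toy_set_ctx(struct toy_vma *vma, void *ctx, unsigned long flags)
{
	/* Drop all uffd flags, then apply the requested ones. */
	unsigned long new_flags = (vma->flags & ~TOY_UFFD_FLAGS) | flags;
	bool wp_changed = ((vma->flags ^ new_flags) & TOY_UFFD_WP) != 0;

	vma->ctx = ctx;
	vma->flags = new_flags;

	/* Only shared mappings need recalculating when WP toggles. */
	if ((vma->flags & TOY_SHARED) && wp_changed)
		vma->prot_recalcs++;
}

/* Resetting is just "set" with no context and no uffd flags. */
void toy_reset_ctx(struct toy_vma *vma)
{
	toy_set_ctx(vma, NULL, 0);
}
```

Expressing the reset as a special case of the setter mirrors how userfaultfd_reset_ctx() is built on userfaultfd_set_ctx() in the diff, so the several call sites that previously open-coded the same three statements share one code path.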