Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: change dup_mmap() recovery

When dup_mmap() fails during vma duplication or setup, don't write an
XA_ZERO entry into the vma tree. Instead, destroy the tree and free the
new resources, leaving an empty vma tree.
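To illustrate this recovery policy, here is a minimal userspace sketch, assuming a toy array-backed "tree" in place of the maple tree. The names `toy_vma`, `toy_mm`, `toy_dup()` and `toy_destroy_tree()` are hypothetical and are not kernel APIs. The point is only that a failed copy releases everything already copied and leaves the new tree empty, instead of leaving a marker entry in a partially populated tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical toy model, not kernel code: the "tree" is a fixed array of
 * pointers to copied vma records, and map_count is the number of slots used.
 */
#define TOY_MAX_VMAS 8

struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

struct toy_mm {
	struct toy_vma *slots[TOY_MAX_VMAS];
	int map_count;
};

/* Free every copied vma and reset the tree so that it is empty. */
static void toy_destroy_tree(struct toy_mm *mm)
{
	for (int i = 0; i < mm->map_count; i++) {
		free(mm->slots[i]);
		mm->slots[i] = NULL;
	}
	mm->map_count = 0;
}

/*
 * Copy n vmas from src into dst. fail_at simulates an allocation failure
 * at that index (-1 for no failure). On failure, everything copied so far
 * is released and dst is left empty; no marker entry is written.
 */
static bool toy_dup(struct toy_mm *dst, const struct toy_vma *src, int n,
		    int fail_at)
{
	assert(n <= TOY_MAX_VMAS);
	dst->map_count = 0;
	for (int i = 0; i < n; i++) {
		struct toy_vma *copy = (i == fail_at) ? NULL : malloc(sizeof(*copy));

		if (!copy) {
			toy_destroy_tree(dst);
			return false;
		}
		*copy = src[i];
		dst->slots[dst->map_count++] = copy;
	}
	return true;
}
```

After a failed `toy_dup()`, a reader of the toy tree sees no entries at all, so there is no partially built state to race against.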

Using XA_ZERO introduced races in which the vma could be found in the
window between dup_mmap() dropping all locks and exit_mmap() taking them.
The race can occur because the mm is still reachable through other trees
via the successfully copied vmas, and by other paths such as the swapoff
code.

XA_ZERO marked the location at which to stop vma removal and page table
freeing. The newly added arguments to unmap_vmas() and free_pgtables()
now serve this purpose.
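A sketch of the difference between stopping at an in-tree marker and stopping at a caller-supplied bound, assuming a toy sorted array of vma pointers. `TOY_ZERO` merely stands in for XA_ZERO_ENTRY, and `count_until_sentinel()` and `count_below()` are hypothetical helpers, not kernel functions. With a marker, every walker has to recognise the special entry; in this model, with an explicit end address a walker simply stops at the first vma that starts at or beyond `end`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy vma; vmas are kept sorted by vm_start. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Sentinel object standing in for an XA_ZERO_ENTRY (toy model only). */
static struct toy_vma toy_zero_entry;
#define TOY_ZERO (&toy_zero_entry)

/* Old style: every walker must check for the sentinel and stop there. */
static size_t count_until_sentinel(struct toy_vma *const *tree, size_t n)
{
	size_t count = 0;

	assert(tree || n == 0);
	for (size_t i = 0; i < n && tree[i]; i++) {
		if (tree[i] == TOY_ZERO)
			break;
		count++;
	}
	return count;
}

/* New style: the caller passes an end address; no sentinel check needed. */
static size_t count_below(struct toy_vma *const *tree, size_t n,
			  unsigned long end)
{
	size_t count = 0;

	assert(tree || n == 0);
	for (size_t i = 0; i < n && tree[i]; i++) {
		if (tree[i]->vm_start >= end)
			break;
		count++;
	}
	return count;
}
```

If a copy fails at the second vma, the marker version needs the sentinel stored in that slot, while the bounded version only needs that vma's start address passed as `end`.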

Replacing the XA_ZERO entry with the new argument list also makes the
xa_is_zero() checks unnecessary, so they are removed as well.

Note that dup_mmap() now cleans up even when ALL vmas are successfully
copied but some other aspect of the duplication fails to be set up
completely.
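The three cases behind that cleanup can be isolated in a small pure function. This is a hedged restatement of the branch that the patch adds to dup_mmap(), using a hypothetical `toy_vma` type and `toy_cleanup_end()` helper rather than kernel structures: 0 means nothing was written to the new tree, the failing vma's start bounds a partial copy, and ULONG_MAX covers a tree in which every vma was copied.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical toy vma, standing in for struct vm_area_struct. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Pick the cleanup bound after a failed duplication.
 * map_count: number of vmas already written to the new tree.
 * failed:    vma whose copy failed, or NULL if every vma was copied and a
 *            later setup step failed.
 * Returns 0 for "nothing to unmap", ULONG_MAX for "everything".
 */
static unsigned long toy_cleanup_end(int map_count,
				     const struct toy_vma *failed)
{
	assert(map_count >= 0);
	if (!map_count)
		return 0;                /* zero vmas were written */
	if (failed)
		return failed->vm_start; /* partial tree failure */
	return ULONG_MAX;                /* all vmas were written */
}
```

For example, `toy_cleanup_end(3, NULL)` yields ULONG_MAX, so the whole copied range would be released.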

Link: https://lkml.kernel.org/r/20260121164946.2093480-8-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Liam R. Howlett and committed by Andrew Morton.
43873af7 243de0c0

2 files changed, +32 -16
mm/memory.c (+1 -5)

```diff
@@ -411,8 +411,6 @@
 		struct vm_area_struct *next;
 
 		next = mas_find(mas, vma_end - 1);
-		if (unlikely(xa_is_zero(next)))
-			next = NULL;
 
 		/*
 		 * Hide vma from rmap and truncate_pagecache before freeing
@@ -429,8 +431,6 @@
 		while (next && next->vm_start <= vma->vm_end + PMD_SIZE) {
 			vma = next;
 			next = mas_find(mas, vma_end - 1);
-			if (unlikely(xa_is_zero(next)))
-				next = NULL;
 			if (mm_wr_locked)
 				vma_start_write(vma);
 			unlink_anon_vmas(vma);
@@ -2182,7 +2186,7 @@
 		unmap_single_vma(tlb, vma, start, end, &details);
 		hugetlb_zap_end(vma, &details);
 		vma = mas_find(mas, tree_end - 1);
-	} while (vma && likely(!xa_is_zero(vma)));
+	} while (vma);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
```
mm/mmap.c (+31 -11)

```diff
@@ -1285,7 +1285,7 @@
 	arch_exit_mmap(mm);
 
 	vma = vma_next(&vmi);
-	if (!vma || unlikely(xa_is_zero(vma))) {
+	if (!vma) {
 		/* Can happen if dup_mmap() received an OOM */
 		mmap_read_unlock(mm);
 		mmap_write_lock(mm);
@@ -1851,20 +1851,40 @@
 		ksm_fork(mm, oldmm);
 		khugepaged_fork(mm, oldmm);
 	} else {
+		unsigned long end;
 
 		/*
-		 * The entire maple tree has already been duplicated. If the
-		 * mmap duplication fails, mark the failure point with
-		 * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
-		 * stop releasing VMAs that have not been duplicated after this
-		 * point.
+		 * The entire maple tree has already been duplicated, but
+		 * replacing the vmas failed at mpnt (which could be NULL if
+		 * all were allocated but the last vma was not fully set up).
+		 * Use the start address of the failure point to clean up the
+		 * partially initialized tree.
 		 */
-		if (mpnt) {
-			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
-			mas_store(&vmi.mas, XA_ZERO_ENTRY);
-			/* Avoid OOM iterating a broken tree */
-			mm_flags_set(MMF_OOM_SKIP, mm);
+		if (!mm->map_count) {
+			/* zero vmas were written to the new tree. */
+			end = 0;
+		} else if (mpnt) {
+			/* partial tree failure */
+			end = mpnt->vm_start;
+		} else {
+			/* All vmas were written to the new tree */
+			end = ULONG_MAX;
 		}
+
+		/* Hide mm from oom killer because the memory is being freed */
+		mm_flags_set(MMF_OOM_SKIP, mm);
+		if (end) {
+			vma_iter_set(&vmi, 0);
+			tmp = vma_next(&vmi);
+			flush_cache_mm(mm);
+			unmap_region(&vmi.mas, /* vma = */ tmp,
+				     /* vma_start = */ 0, /* vma_end = */ end,
+				     /* pg_end = */ end, /* prev = */ NULL,
+				     /* next = */ NULL);
+			charge = tear_down_vmas(mm, &vmi, tmp, end);
+			vm_unacct_memory(charge);
+		}
+		__mt_destroy(&mm->mm_mt);
 		/*
 		 * The mm_struct is going to exit, but the locks will be dropped
 		 * first. Set the mm_struct as unstable is advisable as it is
```
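The `if (end)` block in dup_mmap() unmaps and tears down only the vmas below `end`, then returns their accounted memory with vm_unacct_memory(). The sketch below models just that bounded walk and the charge total, assuming 4 KiB pages and a hypothetical per-vma `accounted` flag; `toy_tear_down()` is illustrative and is not a reproduction of tear_down_vmas().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Hypothetical toy vma; vmas are kept sorted by vm_start. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	bool accounted; /* hypothetical stand-in for an accounting flag */
};

/*
 * Walk vmas in address order, "tearing down" those that start below end.
 * Returns the number of accounted pages to give back and stores how many
 * vmas were visited in *torn_down.
 */
static unsigned long toy_tear_down(const struct toy_vma *vmas, size_t n,
				   unsigned long end, size_t *torn_down)
{
	unsigned long charge = 0;
	size_t i;

	for (i = 0; i < n && vmas[i].vm_start < end; i++) {
		assert(vmas[i].vm_end > vmas[i].vm_start);
		if (vmas[i].accounted)
			charge += (vmas[i].vm_end - vmas[i].vm_start) /
				  TOY_PAGE_SIZE;
	}
	*torn_down = i;
	return charge;
}
```

With `end` set to ULONG_MAX every vma is visited; with `end` equal to a vma's start address, the walk stops just before that vma, so vmas beyond the failure point are never touched.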