mm: use per_vma lock for MADV_DONTNEED

Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.

Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads, such as Android's HeapTaskDaemon,
hold the lock and block higher-priority threads.

This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal.
In practice, userspace heaps rarely issue MADV_DONTNEED across multiple
VMAs.
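
For illustration, the common case described above looks roughly like this
from userspace. This is a minimal sketch, not code from the patch: the
helper name dontneed_within_one_vma() is hypothetical, and it assumes a
Linux system where a fresh private anonymous mapping is backed by a single
VMA. The advised range sits entirely inside that one mapping.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one anonymous region, dirty it, then drop the
 * second page with MADV_DONTNEED.
 *
 * Returns 1 if the dropped page reads back as zeroes and the first page is
 * untouched, 0 if not, and -1 if a system call fails.
 */
int dontneed_within_one_vma(size_t npages)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *heap;
	size_t len, i;
	int ok = 1;

	assert(npages >= 2);
	if (page <= 0)
		return -1;
	len = npages * (size_t)page;

	heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		return -1;

	memset(heap, 0xAB, len);

	/* [heap + page, heap + 2 * page) lies inside the one mapping. */
	if (madvise(heap + page, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(heap, len);
		return -1;
	}

	/* Private anonymous pages are zero-filled on the next access. */
	for (i = 0; i < (size_t)page; i++)
		if (heap[(size_t)page + i] != 0)
			ok = 0;
	/* The page that was not advised keeps its contents. */
	if (heap[0] != 0xAB)
		ok = 0;

	munmap(heap, len);
	return ok;
}
```

A heap allocator returning free pages to the kernel issues calls of this
shape, which is why the single-VMA case dominates in practice.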

Tangquan's testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.

To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().

Many thanks to Lorenzo's work[1] on "mm/madvise: support VMA read locks
for MADV_DONTNEED[_LOCKED]", which says:

"Then use this mechanism to permit VMA locking to be done later in the
madvise() logic and also to allow altering of the locking mode to permit
falling back to an mmap read lock if required."
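
The idea of retaining the locking mode and changing it on fallback can be
sketched as a small userspace model. Everything below is an assumption made
for illustration: the model_* names are hypothetical and the counters only
stand in for the real locks. It shows the lock step deferring the VMA lock,
the walk step either using the per-VMA path or recording a fallback to the
mmap read lock, and the unlock step releasing whatever mode was recorded.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; the names echo the patch but this is not kernel code. */
enum lock_mode { NO_LOCK, MMAP_READ_LOCK, MMAP_WRITE_LOCK, VMA_READ_LOCK };

struct model_mm {
	int mmap_readers;	/* stand-in for mmap_lock held for read */
	int mmap_writers;	/* stand-in for mmap_lock held for write */
	int vma_readers;	/* stand-in for a per-VMA read lock */
};

struct model_behavior {
	enum lock_mode lock_mode;	/* state retained across the call */
};

/* Take the mmap lock now, or defer when a per-VMA lock is wanted. */
void model_lock(struct model_mm *mm, struct model_behavior *b,
		enum lock_mode wanted)
{
	switch (wanted) {
	case MMAP_WRITE_LOCK:
		mm->mmap_writers++;
		break;
	case MMAP_READ_LOCK:
		mm->mmap_readers++;
		break;
	case VMA_READ_LOCK:	/* acquired later, in the walk step */
	case NO_LOCK:
		break;
	}
	b->lock_mode = wanted;
}

/* Returns true if the per-VMA path was used, false otherwise. */
bool model_walk(struct model_mm *mm, struct model_behavior *b,
		bool vma_lock_ok)
{
	if (b->lock_mode != VMA_READ_LOCK)
		return false;

	if (vma_lock_ok) {
		mm->vma_readers++;	/* lock the single VMA */
		/* ... visit the single VMA here ... */
		mm->vma_readers--;	/* drop it before returning */
		return true;
	}

	/* Fall back and record the new mode so unlock does the right thing. */
	mm->mmap_readers++;
	b->lock_mode = MMAP_READ_LOCK;
	return false;
}

/* Release according to the recorded mode, then reset it. */
void model_unlock(struct model_mm *mm, struct model_behavior *b)
{
	switch (b->lock_mode) {
	case MMAP_WRITE_LOCK:
		mm->mmap_writers--;
		break;
	case MMAP_READ_LOCK:
		mm->mmap_readers--;
		break;
	case VMA_READ_LOCK:	/* already dropped in the walk step */
	case NO_LOCK:
		break;
	}
	b->lock_mode = NO_LOCK;
}
```

Keeping the mode in the behaviour structure, rather than recomputing it from
the advice value at unlock time, is what makes the fallback safe in this
model: unlock always matches what was actually taken.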

One important point, as pointed out by Jann[2], is that
untagged_addr_remote() requires holding mmap_lock. This is because
address tagging on x86 and RISC-V is quite complex.

Until untagged_addr_remote() becomes atomic—which seems unlikely in the
near future—we cannot support per-VMA locks for remote processes. So
for now, only local processes are supported.
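
Taken together, the conditions in this message (single VMA, no userfaultfd,
local process only) amount to one eligibility check. The following is a
userspace sketch under stated assumptions: struct model_vma and
can_use_vma_read_lock() are hypothetical names, the uffd_armed flag stands
in for the kernel's userfaultfd check, and the start-address test models a
VMA lookup at start that may fail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a VMA covering [vm_start, vm_end). */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	bool uffd_armed;	/* models userfaultfd being enabled on the VMA */
};

/*
 * Returns true if the advised range [start, end) may take the per-VMA
 * read lock path, false if the caller should fall back to mmap_lock.
 */
bool can_use_vma_read_lock(const struct model_vma *vma,
			   unsigned long start, unsigned long end,
			   bool is_local_mm)
{
	/* Models the lookup at 'start' finding no VMA. */
	if (vma == NULL || start < vma->vm_start || start >= vma->vm_end)
		return false;
	/* The range must not extend past this single VMA. */
	if (end > vma->vm_end)
		return false;
	/* Remote processes still need mmap_lock for address untagging. */
	if (!is_local_mm)
		return false;
	/* userfaultfd handling is left to the mmap_lock path. */
	if (vma->uffd_armed)
		return false;
	return true;
}
```

Any failed condition falls back to the existing mmap read lock path, so the
fast path only has to be correct for the simple case.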

Lance said:

: Just to put some numbers on it, I ran a micro-benchmark with 100
: parallel threads, where each thread calls madvise() on its own 1GiB
: chunk of 64KiB mTHP-backed memory. The performance gain is huge:
:
: 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s
: (~47% faster)
:
: 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
: faster)

[lorenzo.stoakes@oracle.com: avoid any chance of uninitialised pointer deref]
Link: https://lkml.kernel.org/r/309d22ca-6cd9-4601-8402-d441a07d9443@lucifer.local
Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
Link: https://lkml.kernel.org/r/20250607220150.2980-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Barry Song and committed by Andrew Morton 11 months ago (a6fde7ad 5e00e318)

+152 -50

1 changed file

mm/madvise.c

···
 	bool pageout;
 };

+enum madvise_lock_mode {
+	MADVISE_NO_LOCK,
+	MADVISE_MMAP_READ_LOCK,
+	MADVISE_MMAP_WRITE_LOCK,
+	MADVISE_VMA_READ_LOCK,
+};
+
 struct madvise_behavior {
 	int behavior;
 	struct mmu_gather *tlb;
+	enum madvise_lock_mode lock_mode;
 };
-
-/*
- * Any behaviour which results in changes to the vma->vm_flags needs to
- * take mmap_lock for writing. Others, which simply traverse vmas, need
- * to only take it for reading.
- */
-static int madvise_need_mmap_write(int behavior)
-{
-	switch (behavior) {
-	case MADV_REMOVE:
-	case MADV_WILLNEED:
-	case MADV_DONTNEED:
-	case MADV_DONTNEED_LOCKED:
-	case MADV_COLD:
-	case MADV_PAGEOUT:
-	case MADV_FREE:
-	case MADV_POPULATE_READ:
-	case MADV_POPULATE_WRITE:
-	case MADV_COLLAPSE:
-	case MADV_GUARD_INSTALL:
-	case MADV_GUARD_REMOVE:
-		return 0;
-	default:
-		/* be safe, default to 1. list exceptions explicitly */
-		return 1;
-	}
-}

 #ifdef CONFIG_ANON_VMA_NAME
 struct anon_vma_name *anon_vma_name_alloc(const char *name)
···
 		return madvise_guard_remove(vma, prev, start, end);
 	}

+	/* We cannot provide prev in this lock mode. */
+	VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
 	anon_name = anon_vma_name(vma);
 	anon_vma_name_get(anon_name);
 	error = madvise_update_vma(vma, prev, start, end, new_flags,
···
 }

 /*
+ * Try to acquire a VMA read lock if possible.
+ *
+ * We only support this lock over a single VMA, which the input range must
+ * span either partially or fully.
+ *
+ * This function always returns with an appropriate lock held. If a VMA read
+ * lock could be acquired, we return the locked VMA.
+ *
+ * If a VMA read lock could not be acquired, we return NULL and expect caller to
+ * fallback to mmap lock behaviour.
+ */
+static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
+		struct madvise_behavior *madv_behavior,
+		unsigned long start, unsigned long end)
+{
+	struct vm_area_struct *vma;
+
+	vma = lock_vma_under_rcu(mm, start);
+	if (!vma)
+		goto take_mmap_read_lock;
+	/*
+	 * Must span only a single VMA; uffd and remote processes are
+	 * unsupported.
+	 */
+	if (end > vma->vm_end || current->mm != mm ||
+	    userfaultfd_armed(vma)) {
+		vma_end_read(vma);
+		goto take_mmap_read_lock;
+	}
+	return vma;
+
+take_mmap_read_lock:
+	mmap_read_lock(mm);
+	madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
+	return NULL;
+}
+
+/*
  * Walk the vmas in range [start,end), and call the visit function on each one.
  * The visit function will get start and end parameters that cover the overlap
  * between the current vma and the original range. Any unmapped regions in the
···
  */
 static
 int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
-		      unsigned long end, void *arg,
+		      unsigned long end, struct madvise_behavior *madv_behavior,
+		      void *arg,
 		      int (*visit)(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev, unsigned long start,
 				   unsigned long end, void *arg))
···
 	struct vm_area_struct *prev;
 	unsigned long tmp;
 	int unmapped_error = 0;
+	int error;
+
+	/*
+	 * If VMA read lock is supported, apply madvise to a single VMA
+	 * tentatively, avoiding walking VMAs.
+	 */
+	if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+		vma = try_vma_read_lock(mm, madv_behavior, start, end);
+		if (vma) {
+			prev = vma;
+			error = visit(vma, &prev, start, end, arg);
+			vma_end_read(vma);
+			return error;
+		}
+	}

 	/*
 	 * If the interval [start,end) covers some unmapped address
···
 		prev = vma;

 	for (;;) {
-		int error;
-
 		/* Still start < end. */
 		if (!vma)
 			return -ENOMEM;
···
 	if (end == start)
 		return 0;

-	return madvise_walk_vmas(mm, start, end, anon_name,
+	return madvise_walk_vmas(mm, start, end, NULL, anon_name,
 				 madvise_vma_anon_name);
 }
 #endif /* CONFIG_ANON_VMA_NAME */

-static int madvise_lock(struct mm_struct *mm, int behavior)
-{
-	if (is_memory_failure(behavior))
-		return 0;

-	if (madvise_need_mmap_write(behavior)) {
+/*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_lock for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
+{
+	int behavior = madv_behavior->behavior;
+
+	if (is_memory_failure(behavior))
+		return MADVISE_NO_LOCK;
+
+	switch (behavior) {
+	case MADV_REMOVE:
+	case MADV_WILLNEED:
+	case MADV_COLD:
+	case MADV_PAGEOUT:
+	case MADV_FREE:
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
+	case MADV_GUARD_INSTALL:
+	case MADV_GUARD_REMOVE:
+		return MADVISE_MMAP_READ_LOCK;
+	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
+		return MADVISE_VMA_READ_LOCK;
+	default:
+		return MADVISE_MMAP_WRITE_LOCK;
+	}
+}
+
+static int madvise_lock(struct mm_struct *mm,
+		struct madvise_behavior *madv_behavior)
+{
+	enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
+
+	switch (lock_mode) {
+	case MADVISE_NO_LOCK:
+		break;
+	case MADVISE_MMAP_WRITE_LOCK:
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-	} else {
+		break;
+	case MADVISE_MMAP_READ_LOCK:
 		mmap_read_lock(mm);
+		break;
+	case MADVISE_VMA_READ_LOCK:
+		/* We will acquire the lock per-VMA in madvise_walk_vmas(). */
+		break;
 	}
+
+	madv_behavior->lock_mode = lock_mode;
 	return 0;
 }

-static void madvise_unlock(struct mm_struct *mm, int behavior)
+static void madvise_unlock(struct mm_struct *mm,
+		struct madvise_behavior *madv_behavior)
 {
-	if (is_memory_failure(behavior))
+	switch (madv_behavior->lock_mode) {
+	case MADVISE_NO_LOCK:
 		return;
-
-	if (madvise_need_mmap_write(behavior))
+	case MADVISE_MMAP_WRITE_LOCK:
 		mmap_write_unlock(mm);
-	else
+		break;
+	case MADVISE_MMAP_READ_LOCK:
 		mmap_read_unlock(mm);
+		break;
+	case MADVISE_VMA_READ_LOCK:
+		/* We will drop the lock per-VMA in madvise_walk_vmas(). */
+		break;
+	}
+
+	madv_behavior->lock_mode = MADVISE_NO_LOCK;
 }

 static bool madvise_batch_tlb_flush(int behavior)
···
 	}
 }

+/*
+ * untagged_addr_remote() assumes mmap_lock is already held. On
+ * architectures like x86 and RISC-V, tagging is tricky because each
+ * mm may have a different tagging mask. However, we might only hold
+ * the per-VMA lock (currently only local processes are supported),
+ * so untagged_addr is used to avoid the mmap_lock assertion for
+ * local processes.
+ */
+static inline unsigned long get_untagged_addr(struct mm_struct *mm,
+		unsigned long start)
+{
+	return current->mm == mm ? untagged_addr(start) :
+				   untagged_addr_remote(mm, start);
+}
+
 static int madvise_do_behavior(struct mm_struct *mm,
 		unsigned long start, size_t len_in,
 		struct madvise_behavior *madv_behavior)
···

 	if (is_memory_failure(behavior))
 		return madvise_inject_error(behavior, start, start + len_in);
-	start = untagged_addr_remote(mm, start);
+	start = get_untagged_addr(mm, start);
 	end = start + PAGE_ALIGN(len_in);

 	blk_start_plug(&plug);
···
 		error = madvise_populate(mm, start, end, behavior);
 	else
 		error = madvise_walk_vmas(mm, start, end, madv_behavior,
-				madvise_vma_behavior);
+				madv_behavior, madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	return error;
 }
···

 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
-	error = madvise_lock(mm, behavior);
+	error = madvise_lock(mm, &madv_behavior);
 	if (error)
 		return error;
 	madvise_init_tlb(&madv_behavior, mm);
 	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
 	madvise_finish_tlb(&madv_behavior);
-	madvise_unlock(mm, behavior);
+	madvise_unlock(mm, &madv_behavior);

 	return error;
 }
···

 	total_len = iov_iter_count(iter);

-	ret = madvise_lock(mm, behavior);
+	ret = madvise_lock(mm, &madv_behavior);
 	if (ret)
 		return ret;
 	madvise_init_tlb(&madv_behavior, mm);
···

 			/* Drop and reacquire lock to unwind race. */
 			madvise_finish_tlb(&madv_behavior);
-			madvise_unlock(mm, behavior);
-			ret = madvise_lock(mm, behavior);
+			madvise_unlock(mm, &madv_behavior);
+			ret = madvise_lock(mm, &madv_behavior);
 			if (ret)
 				goto out;
 			madvise_init_tlb(&madv_behavior, mm);
···
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
 	madvise_finish_tlb(&madv_behavior);
-	madvise_unlock(mm, behavior);
+	madvise_unlock(mm, &madv_behavior);

 out:
 	ret = (total_len - iov_iter_count(iter)) ? : ret;
