Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

mm: resolve faulty mmap_region() error path behaviour

The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and incomplete
state, memory leaks and other unpleasantness can occur.

A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.

Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().

Doing this allows us to perform a number of checks up front before we do
any real work, and allows us to unwind the writable unmap check
unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
validation unconditionally also.

We move a number of things here:

1. We preallocate memory for the iterator before we call the file-backed
memory hook, allowing us to exit early and avoid having to perform
complicated and error-prone close/free logic. We carefully free
iterator state on both success and error paths.

2. The enclosing mmap_region() function handles the mapping_map_writable()
logic early. Previously the logic had the mapping_map_writable() at the
point of mapping a newly allocated file-backed VMA, and a matching
mapping_unmap_writable() on success and error paths.

We now do this unconditionally if this is a file-backed, shared writable
mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
doing so does not invalidate the seal check we just performed, and we in
any case always decrement the counter in the wrapper.

We perform a debug assert to ensure a driver does not attempt to do the
opposite.

3. We also move arch_validate_flags() up into the mmap_region()
function. This is only relevant on arm64 and sparc64, and the check is
only meaningful for SPARC with ADI enabled. We explicitly add a warning
for this arch if a driver invalidates this check, though the code ought
eventually to be fixed to eliminate the need for this.

With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.

This eliminates an entire class of errors, makes the code easier to reason
about and more robust.

Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.1730224667.git.lorenzo.stoakes@oracle.com
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by

Lorenzo Stoakes and committed by
Andrew Morton
5de19506 5baf8b03

+65 -54
+65 -54
mm/mmap.c
··· 1358 1358 return do_vmi_munmap(&vmi, mm, start, len, uf, false); 1359 1359 } 1360 1360 1361 - unsigned long mmap_region(struct file *file, unsigned long addr, 1361 + static unsigned long __mmap_region(struct file *file, unsigned long addr, 1362 1362 unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, 1363 1363 struct list_head *uf) 1364 1364 { 1365 1365 struct mm_struct *mm = current->mm; 1366 1366 struct vm_area_struct *vma = NULL; 1367 1367 pgoff_t pglen = PHYS_PFN(len); 1368 - struct vm_area_struct *merge; 1369 1368 unsigned long charged = 0; 1370 1369 struct vma_munmap_struct vms; 1371 1370 struct ma_state mas_detach; 1372 1371 struct maple_tree mt_detach; 1373 1372 unsigned long end = addr + len; 1374 - bool writable_file_mapping = false; 1375 1373 int error; 1376 1374 VMA_ITERATOR(vmi, mm, addr); 1377 1375 VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff); ··· 1443 1445 vm_flags_init(vma, vm_flags); 1444 1446 vma->vm_page_prot = vm_get_page_prot(vm_flags); 1445 1447 1448 + if (vma_iter_prealloc(&vmi, vma)) { 1449 + error = -ENOMEM; 1450 + goto free_vma; 1451 + } 1452 + 1446 1453 if (file) { 1447 1454 vma->vm_file = get_file(file); 1448 1455 error = mmap_file(file, vma); 1449 1456 if (error) 1450 - goto unmap_and_free_vma; 1457 + goto unmap_and_free_file_vma; 1451 1458 1452 - if (vma_is_shared_maywrite(vma)) { 1453 - error = mapping_map_writable(file->f_mapping); 1454 - if (error) 1455 - goto close_and_free_vma; 1456 - 1457 - writable_file_mapping = true; 1458 - } 1459 - 1459 + /* Drivers cannot alter the address of the VMA. */ 1460 + WARN_ON_ONCE(addr != vma->vm_start); 1460 1461 /* 1461 - * Expansion is handled above, merging is handled below. 1462 - * Drivers should not alter the address of the VMA. 1462 + * Drivers should not permit writability when previously it was 1463 + * disallowed. 1463 1464 */ 1464 - if (WARN_ON((addr != vma->vm_start))) { 1465 - error = -EINVAL; 1466 - goto close_and_free_vma; 1467 - } 1465 + VM_WARN_ON_ONCE(vm_flags != vma->vm_flags && 1466 + !(vm_flags & VM_MAYWRITE) && 1467 + (vma->vm_flags & VM_MAYWRITE)); 1468 1468 1469 1469 vma_iter_config(&vmi, addr, end); 1470 1470 /* ··· 1470 1474 * vma again as we may succeed this time. 1471 1475 */ 1472 1476 if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) { 1477 + struct vm_area_struct *merge; 1478 + 1473 1479 vmg.flags = vma->vm_flags; 1474 1480 /* If this fails, state is reset ready for a reattempt. */ 1475 1481 merge = vma_merge_new_range(&vmg); ··· 1489 1491 vma = merge; 1490 1492 /* Update vm_flags to pick up the change. */ 1491 1493 vm_flags = vma->vm_flags; 1492 - goto unmap_writable; 1494 + goto file_expanded; 1493 1495 } 1494 1496 vma_iter_config(&vmi, addr, end); 1495 1497 } ··· 1498 1500 } else if (vm_flags & VM_SHARED) { 1499 1501 error = shmem_zero_setup(vma); 1500 1502 if (error) 1501 - goto free_vma; 1503 + goto free_iter_vma; 1502 1504 } else { 1503 1505 vma_set_anonymous(vma); 1504 1506 } 1505 1507 1506 - if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) { 1507 - error = -EACCES; 1508 - goto close_and_free_vma; 1509 - } 1510 - 1511 - /* Allow architectures to sanity-check the vm_flags */ 1512 - if (!arch_validate_flags(vma->vm_flags)) { 1513 - error = -EINVAL; 1514 - goto close_and_free_vma; 1515 - } 1516 - 1517 - if (vma_iter_prealloc(&vmi, vma)) { 1518 - error = -ENOMEM; 1519 - goto close_and_free_vma; 1520 - } 1508 + #ifdef CONFIG_SPARC64 1509 + /* TODO: Fix SPARC ADI! */ 1510 + WARN_ON_ONCE(!arch_validate_flags(vm_flags)); 1511 + #endif 1521 1512 1522 1513 /* Lock the VMA since it is modified after insertion into VMA tree */ 1523 1514 vma_start_write(vma); ··· 1520 1533 */ 1521 1534 khugepaged_enter_vma(vma, vma->vm_flags); 1522 1535 1523 - /* Once vma denies write, undo our temporary denial count */ 1524 - unmap_writable: 1525 - if (writable_file_mapping) 1526 - mapping_unmap_writable(file->f_mapping); 1536 + file_expanded: 1527 1537 file = vma->vm_file; 1528 1538 ksm_add_vma(vma); 1529 1539 expanded: ··· 1553 1569 1554 1570 vma_set_page_prot(vma); 1555 1571 1556 - validate_mm(mm); 1557 1572 return addr; 1558 1573 1559 - close_and_free_vma: 1560 - vma_close(vma); 1574 + unmap_and_free_file_vma: 1575 + fput(vma->vm_file); 1576 + vma->vm_file = NULL; 1561 1577 1562 - if (file || vma->vm_file) { 1563 - unmap_and_free_vma: 1564 - fput(vma->vm_file); 1565 - vma->vm_file = NULL; 1566 - 1567 - vma_iter_set(&vmi, vma->vm_end); 1568 - /* Undo any partial mapping done by a device driver. */ 1569 - unmap_region(&vmi.mas, vma, vmg.prev, vmg.next); 1570 - } 1571 - if (writable_file_mapping) 1572 - mapping_unmap_writable(file->f_mapping); 1578 + vma_iter_set(&vmi, vma->vm_end); 1579 + /* Undo any partial mapping done by a device driver. */ 1580 + unmap_region(&vmi.mas, vma, vmg.prev, vmg.next); 1581 + free_iter_vma: 1582 + vma_iter_free(&vmi); 1573 1583 free_vma: 1574 1584 vm_area_free(vma); 1575 1585 unacct_error: ··· 1573 1595 abort_munmap: 1574 1596 vms_abort_munmap_vmas(&vms, &mas_detach); 1575 1597 gather_failed: 1576 - validate_mm(mm); 1577 1598 return error; 1599 + } 1600 + 1601 + unsigned long mmap_region(struct file *file, unsigned long addr, 1602 + unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, 1603 + struct list_head *uf) 1604 + { 1605 + unsigned long ret; 1606 + bool writable_file_mapping = false; 1607 + 1608 + /* Check to see if MDWE is applicable. */ 1609 + if (map_deny_write_exec(vm_flags, vm_flags)) 1610 + return -EACCES; 1611 + 1612 + /* Allow architectures to sanity-check the vm_flags. */ 1613 + if (!arch_validate_flags(vm_flags)) 1614 + return -EINVAL; 1615 + 1616 + /* Map writable and ensure this isn't a sealed memfd. */ 1617 + if (file && is_shared_maywrite(vm_flags)) { 1618 + int error = mapping_map_writable(file->f_mapping); 1619 + 1620 + if (error) 1621 + return error; 1622 + writable_file_mapping = true; 1623 + } 1624 + 1625 + ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf); 1626 + 1627 + /* Clear our write mapping regardless of error. */ 1628 + if (writable_file_mapping) 1629 + mapping_unmap_writable(file->f_mapping); 1630 + 1631 + validate_mm(current->mm); 1632 + return ret; 1578 1633 } 1579 1634 1580 1635 static int __vm_munmap(unsigned long start, size_t len, bool unlock)