arm64: mm: Handle invalid large leaf mappings correctly

It has been possible for a long time to mark ptes in the linear map as
invalid. This is done for secretmem, kfence, realm DMA memory un/share,
and others, by simply clearing the PTE_VALID bit. But until commit
a166563e7ec3 ("arm64: mm: support large block mapping when
rodata=full"), large leaf mappings were never made invalid in this way.

It turns out various parts of the code base are not equipped to handle
invalid large leaf mappings (in the way they are currently encoded) and
I've observed a kernel panic while booting a realm guest on a
BBML2_NOABORT system as a result:

[ 15.432706] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
[ 15.476896] Unable to handle kernel paging request at virtual address ffff000019600000
[ 15.513762] Mem abort info:
[ 15.527245] ESR = 0x0000000096000046
[ 15.548553] EC = 0x25: DABT (current EL), IL = 32 bits
[ 15.572146] SET = 0, FnV = 0
[ 15.592141] EA = 0, S1PTW = 0
[ 15.612694] FSC = 0x06: level 2 translation fault
[ 15.640644] Data abort info:
[ 15.661983] ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000
[ 15.694875] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[ 15.723740] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 15.755776] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000081f3f000
[ 15.800410] [ffff000019600000] pgd=0000000000000000, p4d=180000009ffff403, pud=180000009fffe403, pmd=00e8000199600704
[ 15.855046] Internal error: Oops: 0000000096000046 [#1] SMP
[ 15.886394] Modules linked in:
[ 15.900029] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-dirty #4 PREEMPT
[ 15.935258] Hardware name: linux,dummy-virt (DT)
[ 15.955612] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 15.986009] pc : __pi_memcpy_generic+0x128/0x22c
[ 16.006163] lr : swiotlb_bounce+0xf4/0x158
[ 16.024145] sp : ffff80008000b8f0
[ 16.038896] x29: ffff80008000b8f0 x28: 0000000000000000 x27: 0000000000000000
[ 16.069953] x26: ffffb3976d261ba8 x25: 0000000000000000 x24: ffff000019600000
[ 16.100876] x23: 0000000000000001 x22: ffff0000043430d0 x21: 0000000000007ff0
[ 16.131946] x20: 0000000084570010 x19: 0000000000000000 x18: ffff00001ffe3fcc
[ 16.163073] x17: 0000000000000000 x16: 00000000003fffff x15: 646e612065766974
[ 16.194131] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 16.225059] x11: 0000000000000000 x10: 0000000000000010 x9 : 0000000000000018
[ 16.256113] x8 : 0000000000000018 x7 : 0000000000000000 x6 : 0000000000000000
[ 16.287203] x5 : ffff000019607ff0 x4 : ffff000004578000 x3 : ffff000019600000
[ 16.318145] x2 : 0000000000007ff0 x1 : ffff000004570010 x0 : ffff000019600000
[ 16.349071] Call trace:
[ 16.360143] __pi_memcpy_generic+0x128/0x22c (P)
[ 16.380310] swiotlb_tbl_map_single+0x154/0x2b4
[ 16.400282] swiotlb_map+0x5c/0x228
[ 16.415984] dma_map_phys+0x244/0x2b8
[ 16.432199] dma_map_page_attrs+0x44/0x58
[ 16.449782] virtqueue_map_page_attrs+0x38/0x44
[ 16.469596] virtqueue_map_single_attrs+0xc0/0x130
[ 16.490509] virtnet_rq_alloc.isra.0+0xa4/0x1fc
[ 16.510355] try_fill_recv+0x2a4/0x584
[ 16.526989] virtnet_open+0xd4/0x238
[ 16.542775] __dev_open+0x110/0x24c
[ 16.558280] __dev_change_flags+0x194/0x20c
[ 16.576879] netif_change_flags+0x24/0x6c
[ 16.594489] dev_change_flags+0x48/0x7c
[ 16.611462] ip_auto_config+0x258/0x1114
[ 16.628727] do_one_initcall+0x80/0x1c8
[ 16.645590] kernel_init_freeable+0x208/0x2f0
[ 16.664917] kernel_init+0x24/0x1e0
[ 16.680295] ret_from_fork+0x10/0x20
[ 16.696369] Code: 927cec03 cb0e0021 8b0e0042 a9411c26 (a900340c)
[ 16.723106] ---[ end trace 0000000000000000 ]---
[ 16.752866] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 16.792556] Kernel Offset: 0x3396ea200000 from 0xffff800080000000
[ 16.818966] PHYS_OFFSET: 0xfff1000080000000
[ 16.837237] CPU features: 0x0000000,00060005,13e38581,957e772f
[ 16.862904] Memory Limit: none
[ 16.876526] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

This panic occurs because the swiotlb memory was previously shared to
the host (__set_memory_enc_dec()), which involves transitioning the
(large) leaf mappings to invalid, sharing the memory with the host,
then marking the mappings valid again. But pageattr_p[mu]d_entry()
would only update an entry if it was a section mapping; otherwise it
concluded the entry must be a table entry and so shouldn't be modified.
And p[mu]d_sect() only returns true if the entry is valid. So the large
leaf entry was made invalid in the first pass, then ignored in the
second pass. It remains invalid until the above code tries to access it
and blows up.
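
To see concretely why, here is a minimal, user-space-compilable sketch
(the bit values match arm64's pgtable-hwdef.h, but the helpers are
simplified): clearing PTE_VALID turns a block descriptor's type bits
from 0b01 into 0b00, so pmd_sect() stops matching, while !pmd_table()
would still be true.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_VALID	(1ULL << 0)
#define PMD_TYPE_MASK	(3ULL << 0)	/* descriptor type: bits [1:0] */
#define PMD_TYPE_SECT	(1ULL << 0)	/* block/section: 0b01 */
#define PMD_TYPE_TABLE	(3ULL << 0)	/* table: 0b11 */

static bool pmd_sect(uint64_t pmd)  { return (pmd & PMD_TYPE_MASK) == PMD_TYPE_SECT; }
static bool pmd_table(uint64_t pmd) { return (pmd & PMD_TYPE_MASK) == PMD_TYPE_TABLE; }

int main(void)
{
	/* valid counterpart of the pmd from the oops above (...704 -> ...705) */
	uint64_t pmd = 0x00e8000199600705ULL;

	printf("valid:   sect=%d table=%d\n", pmd_sect(pmd), pmd_table(pmd));
	pmd &= ~PTE_VALID;	/* first pass: mark invalid */
	printf("invalid: sect=%d table=%d\n", pmd_sect(pmd), pmd_table(pmd));
	/* second pass skips the entry because pmd_sect() is now false */
	return 0;
}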

The simple fix would be to update pageattr_pmd_entry() to use
!pmd_table() instead of pmd_sect(). That would solve this problem.

But the ptdump code also suffers from a similar issue. It checks
pmd_leaf() and doesn't call into the arch-specific note_page()
machinery if that returns false. As a result, ptdump wasn't even able
to show the invalid large leaf mappings; they looked as if they were
valid, which made this super fun to debug. The ptdump code is core-mm
and pmd_table() is arm64-specific, so we can't use the same trick to
solve that.

But we already support the concept of "present-invalid" for user space
entries. And even better, pmd_leaf() will return true for a leaf mapping
that is marked present-invalid. So let's just use that encoding for
present-invalid kernel mappings too. Then we can use pmd_leaf() where we
previously used pmd_sect() and everything is magically fixed.
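
As a rough standalone model of the two encodings (again using arm64's
bit values but simplified helpers; the real pmd_leaf() is roughly
defined in terms of pmd_present() in pgtable.h), note how only the
present-invalid encoding keeps the entry visible to pmd_leaf():

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_VALID		(1ULL << 0)
#define PTE_PRESENT_INVALID	(1ULL << 11)	/* aliases PTE_NG; only when !PTE_VALID */
#define PMD_TYPE_MASK		(3ULL << 0)
#define PMD_TYPE_TABLE		(3ULL << 0)

static bool pmd_valid(uint64_t pmd) { return pmd & PTE_VALID; }
static bool pmd_table(uint64_t pmd) { return (pmd & PMD_TYPE_MASK) == PMD_TYPE_TABLE; }

static bool pmd_present(uint64_t pmd)
{
	/* valid, or marked present-invalid while the valid bit is clear */
	return pmd_valid(pmd) ||
	       (pmd & (PTE_VALID | PTE_PRESENT_INVALID)) == PTE_PRESENT_INVALID;
}

static bool pmd_leaf(uint64_t pmd)
{
	return pmd_present(pmd) && !pmd_table(pmd);
}

int main(void)
{
	uint64_t old_inval = 0x00e8000199600704ULL;		/* PTE_VALID just cleared */
	uint64_t new_inval = old_inval | PTE_PRESENT_INVALID;	/* present-invalid */

	printf("old encoding: leaf=%d\n", pmd_leaf(old_inval));	/* 0: ptdump skips it */
	printf("new encoding: leaf=%d\n", pmd_leaf(new_inval));	/* 1: visible again */
	return 0;
}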

Additionally, from inspection, kernel_page_present() was broken in a
similar way, so I'm also updating it to use pmd_leaf().

The transitional page tables component (trans_pgd) was similarly
broken; it creates a copy of the kernel page tables, making RO leaf
mappings RW in the process. It also makes invalid (but not none) pte
mappings valid, but it was not doing this for large leaf mappings. This
could have resulted in crashes at kexec or hibernate time. This code is
fixed to flip "present-invalid" mappings back to "present-valid" at all
levels.

Finally, I have hardened split_pmd()/split_pud() so that if either is
passed a "present-invalid" leaf, it will maintain that property in the
split leaves, since I wasn't able to convince myself that they would
only ever be called for "present-valid" leaves.

Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>


+48 -59
+2
arch/arm64/include/asm/pgtable-prot.h
···
  */
 #define PTE_PRESENT_INVALID	(PTE_NG)	/* only when !PTE_VALID */
 
+#define PTE_PRESENT_VALID_KERNEL	(PTE_VALID | PTE_MAYBE_NG)
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 #define PTE_UFFD_WP		(_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
 #define PTE_SWP_UFFD_WP		(_AT(pteval_t, 1) << 3)	/* only for swp ptes */
+7 -2
arch/arm64/include/asm/pgtable.h
···
 	return clear_pte_bit(pte, __pgprot(PTE_CONT));
 }
 
-static inline pte_t pte_mkvalid(pte_t pte)
+static inline pte_t pte_mkvalid_k(pte_t pte)
 {
-	return set_pte_bit(pte, __pgprot(PTE_VALID));
+	pte = clear_pte_bit(pte, __pgprot(PTE_PRESENT_INVALID));
+	pte = set_pte_bit(pte, __pgprot(PTE_PRESENT_VALID_KERNEL));
+	return pte;
 }
 
 static inline pte_t pte_mkinvalid(pte_t pte)
···
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
+#define pmd_mkvalid_k(pmd)	pte_pmd(pte_mkvalid_k(pmd_pte(pmd)))
 #define pmd_mkinvalid(pmd)	pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 #define pmd_uffd_wp(pmd)	pte_uffd_wp(pmd_pte(pmd))
···
 #define pud_young(pud)		pte_young(pud_pte(pud))
 #define pud_mkyoung(pud)	pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite_novma(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkvalid_k(pud)	pte_pud(pte_mkvalid_k(pud_pte(pud)))
 #define pud_write(pud)		pte_write(pud_pte(pud))
 
 static inline pud_t pud_mkhuge(pud_t pud)
+4
arch/arm64/mm/mmu.c
···
 		tableprot |= PMD_TABLE_PXN;
 
 	prot = __pgprot((pgprot_val(prot) & ~PTE_TYPE_MASK) | PTE_TYPE_PAGE);
+	if (!pmd_valid(pmd))
+		prot = pte_pgprot(pte_mkinvalid(pfn_pte(0, prot)));
 	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 	if (to_cont)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
···
 		tableprot |= PUD_TABLE_PXN;
 
 	prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PMD_TYPE_SECT);
+	if (!pud_valid(pud))
+		prot = pmd_pgprot(pmd_mkinvalid(pfn_pmd(0, prot)));
 	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 	if (to_cont)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
+28 -22
arch/arm64/mm/pageattr.c
···
 {
 	struct page_change_data *masks = walk->private;
 
+	/*
+	 * Some users clear and set bits which alias each other (e.g. PTE_NG and
+	 * PTE_PRESENT_INVALID). It is therefore important that we always clear
+	 * first then set.
+	 */
 	val &= ~(pgprot_val(masks->clear_mask));
 	val |= (pgprot_val(masks->set_mask));
···
 {
 	pud_t val = pudp_get(pud);
 
-	if (pud_sect(val)) {
+	if (pud_leaf(val)) {
 		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
 			return -EINVAL;
 		val = __pud(set_pageattr_masks(pud_val(val), walk));
···
 {
 	pmd_t val = pmdp_get(pmd);
 
-	if (pmd_sect(val)) {
+	if (pmd_leaf(val)) {
 		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
 			return -EINVAL;
 		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
···
 	ret = update_range_prot(start, size, set_mask, clear_mask);
 
 	/*
-	 * If the memory is being made valid without changing any other bits
-	 * then a TLBI isn't required as a non-valid entry cannot be cached in
-	 * the TLB.
+	 * If the memory is being switched from present-invalid to valid without
+	 * changing any other bits then a TLBI isn't required as a non-valid
+	 * entry cannot be cached in the TLB.
 	 */
-	if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
+	if (pgprot_val(set_mask) != PTE_PRESENT_VALID_KERNEL ||
+	    pgprot_val(clear_mask) != PTE_PRESENT_INVALID)
 		flush_tlb_kernel_range(start, start + size);
 	return ret;
 }
···
 {
 	if (enable)
 		return __change_memory_common(addr, PAGE_SIZE * numpages,
-					      __pgprot(PTE_VALID),
-					      __pgprot(0));
+					      __pgprot(PTE_PRESENT_VALID_KERNEL),
+					      __pgprot(PTE_PRESENT_INVALID));
 	else
 		return __change_memory_common(addr, PAGE_SIZE * numpages,
-					      __pgprot(0),
-					      __pgprot(PTE_VALID));
+					      __pgprot(PTE_PRESENT_INVALID),
+					      __pgprot(PTE_PRESENT_VALID_KERNEL));
 }
 
 int set_direct_map_invalid_noflush(struct page *page)
 {
-	pgprot_t clear_mask = __pgprot(PTE_VALID);
-	pgprot_t set_mask = __pgprot(0);
+	pgprot_t clear_mask = __pgprot(PTE_PRESENT_VALID_KERNEL);
+	pgprot_t set_mask = __pgprot(PTE_PRESENT_INVALID);
 
 	if (!can_set_direct_map())
 		return 0;
···
 int set_direct_map_default_noflush(struct page *page)
 {
-	pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
-	pgprot_t clear_mask = __pgprot(PTE_RDONLY);
+	pgprot_t set_mask = __pgprot(PTE_PRESENT_VALID_KERNEL | PTE_WRITE);
+	pgprot_t clear_mask = __pgprot(PTE_PRESENT_INVALID | PTE_RDONLY);
 
 	if (!can_set_direct_map())
 		return 0;
···
 	 * entries or Synchronous External Aborts caused by RIPAS_EMPTY
 	 */
 	ret = __change_memory_common(addr, PAGE_SIZE * numpages,
-				     __pgprot(set_prot),
-				     __pgprot(clear_prot | PTE_VALID));
+				     __pgprot(set_prot | PTE_PRESENT_INVALID),
+				     __pgprot(clear_prot | PTE_PRESENT_VALID_KERNEL));
 
 	if (ret)
 		return ret;
···
 		return ret;
 
 	return __change_memory_common(addr, PAGE_SIZE * numpages,
-				      __pgprot(PTE_VALID),
-				      __pgprot(0));
+				      __pgprot(PTE_PRESENT_VALID_KERNEL),
+				      __pgprot(PTE_PRESENT_INVALID));
 }
 
 static int realm_set_memory_encrypted(unsigned long addr, int numpages)
···
 	pud = READ_ONCE(*pudp);
 	if (pud_none(pud))
 		return false;
-	if (pud_sect(pud))
-		return true;
+	if (pud_leaf(pud))
+		return pud_valid(pud);
 
 	pmdp = pmd_offset(pudp, addr);
 	pmd = READ_ONCE(*pmdp);
 	if (pmd_none(pmd))
 		return false;
-	if (pmd_sect(pmd))
-		return true;
+	if (pmd_leaf(pmd))
+		return pmd_valid(pmd);
 
 	ptep = pte_offset_kernel(pmdp, addr);
 	return pte_valid(__ptep_get(ptep));
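
The new comment in set_pageattr_masks() above is worth unpacking.
Because PTE_PRESENT_INVALID aliases PTE_NG, a single call may now both
clear and set bit 11, and only the clear-then-set order gives the right
result. A small standalone demonstration, assuming KPTI is enabled so
that PTE_MAYBE_NG == PTE_NG:

#include <stdint.h>
#include <stdio.h>

#define PTE_VALID		 (1ULL << 0)
#define PTE_NG			 (1ULL << 11)
#define PTE_PRESENT_INVALID	 PTE_NG			/* only when !PTE_VALID */
#define PTE_MAYBE_NG		 PTE_NG			/* assumed: KPTI enabled */
#define PTE_PRESENT_VALID_KERNEL (PTE_VALID | PTE_MAYBE_NG)

int main(void)
{
	uint64_t val = PTE_PRESENT_INVALID;		/* present-invalid entry */
	uint64_t set = PTE_PRESENT_VALID_KERNEL;	/* make it valid again */
	uint64_t clear = PTE_PRESENT_INVALID;

	uint64_t ok  = (val & ~clear) | set;		/* clear first, then set */
	uint64_t bad = (val | set) & ~clear;		/* wrong order */

	printf("clear-then-set: %#llx (nG kept)\n", (unsigned long long)ok);
	printf("set-then-clear: %#llx (nG lost)\n", (unsigned long long)bad);
	return 0;
}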
+7 -35
arch/arm64/mm/trans_pgd.c
···
 	return info->trans_alloc_page(info->trans_alloc_arg);
 }
 
-static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
-{
-	pte_t pte = __ptep_get(src_ptep);
-
-	if (pte_valid(pte)) {
-		/*
-		 * Resume will overwrite areas that may be marked
-		 * read only (code, rodata). Clear the RDONLY bit from
-		 * the temporary mappings we use during restore.
-		 */
-		__set_pte(dst_ptep, pte_mkwrite_novma(pte));
-	} else if (!pte_none(pte)) {
-		/*
-		 * debug_pagealloc will removed the PTE_VALID bit if
-		 * the page isn't in use by the resume kernel. It may have
-		 * been in use by the original kernel, in which case we need
-		 * to put it back in our copy to do the restore.
-		 *
-		 * Other cases include kfence / vmalloc / memfd_secret which
-		 * may call `set_direct_map_invalid_noflush()`.
-		 *
-		 * Before marking this entry valid, check the pfn should
-		 * be mapped.
-		 */
-		BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-		__set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte)));
-	}
-}
-
 static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 		    pmd_t *src_pmdp, unsigned long start, unsigned long end)
 {
···
 	src_ptep = pte_offset_kernel(src_pmdp, start);
 	do {
-		_copy_pte(dst_ptep, src_ptep, addr);
+		pte_t pte = __ptep_get(src_ptep);
+
+		if (pte_none(pte))
+			continue;
+		__set_pte(dst_ptep, pte_mkvalid_k(pte_mkwrite_novma(pte)));
 	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
 
 	return 0;
···
 			if (copy_pte(info, dst_pmdp, src_pmdp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pmd(dst_pmdp,
-				__pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY));
+			set_pmd(dst_pmdp, pmd_mkvalid_k(pmd_mkwrite_novma(pmd)));
 		}
 	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
···
 			if (copy_pmd(info, dst_pudp, src_pudp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pud(dst_pudp,
-				__pud(pud_val(pud) & ~PUD_SECT_RDONLY));
+			set_pud(dst_pudp, pud_mkvalid_k(pud_mkwrite_novma(pud)));
 		}
 	} while (dst_pudp++, src_pudp++, addr = next, addr != end);