Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 shadow stack support from Dave Hansen:
"This is the long awaited x86 shadow stack support, part of Intel's
Control-flow Enforcement Technology (CET).

CET consists of two related security features: shadow stacks and
indirect branch tracking. This series implements just the shadow stack
part of this feature, and just for userspace.

The main use case for shadow stack is providing protection against
return oriented programming attacks. It works by maintaining a
secondary (shadow) stack using a special memory type that has
protections against modification. When executing a CALL instruction,
the processor pushes the return address to both the normal stack and
to the special permission shadow stack. Upon RET, the processor pops
the shadow stack copy and compares it to the normal stack copy.

For more information, refer to the links below for the earlier
versions of this patch set"

Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/
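
The CALL/RET behaviour described in the pull message can be illustrated with a small user-space model. This is only a sketch of the idea, not kernel or hardware code: `struct toy_cpu`, `toy_call()` and `toy_ret()` are hypothetical names, and a `false` return merely stands in for the fault the processor would raise on a mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DEPTH 16

/* Toy machine state: a normal stack plus a shadow stack of return addresses. */
struct toy_cpu {
	uint64_t stack[TOY_DEPTH];   /* writable by the "program" */
	uint64_t shadow[TOY_DEPTH];  /* models the protected shadow stack */
	size_t sp;                   /* number of entries on the normal stack */
	size_t ssp;                  /* number of entries on the shadow stack */
};

/* CALL: push the return address to both stacks. */
static void toy_call(struct toy_cpu *cpu, uint64_t ret_addr)
{
	assert(cpu->sp < TOY_DEPTH && cpu->ssp < TOY_DEPTH);
	cpu->stack[cpu->sp++] = ret_addr;
	cpu->shadow[cpu->ssp++] = ret_addr;
}

/*
 * RET: pop both copies and compare them. Returns false when they differ,
 * which models the fault raised on a corrupted return address.
 */
static bool toy_ret(struct toy_cpu *cpu, uint64_t *target)
{
	uint64_t normal, shadow;

	assert(cpu->sp > 0 && cpu->ssp > 0);
	normal = cpu->stack[--cpu->sp];
	shadow = cpu->shadow[--cpu->ssp];
	if (normal != shadow)
		return false;
	*target = normal;
	return true;
}
```

Overwriting `cpu.stack[cpu.sp - 1]` between `toy_call()` and `toy_ret()` plays the role of a ROP-style stack smash: the shadow copy is untouched, so the comparison fails.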

* tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
x86/shstk: Change order of __user in type
x86/ibt: Convert IBT selftest to asm
x86/shstk: Don't retry vm_munmap() on -EINTR
x86/kbuild: Fix Documentation/ reference
x86/shstk: Move arch detail comment out of core mm
x86/shstk: Add ARCH_SHSTK_STATUS
x86/shstk: Add ARCH_SHSTK_UNLOCK
x86: Add PTRACE interface for shadow stack
selftests/x86: Add shadow stack test
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/shstk: Wire in shadow stack interface
x86: Expose thread features in /proc/$PID/status
x86/shstk: Support WRSS for userspace
x86/shstk: Introduce map_shadow_stack syscall
x86/shstk: Check that signal frame is shadow stack mem
x86/shstk: Check that SSP is aligned on sigreturn
x86/shstk: Handle signals for shadow stack
x86/shstk: Introduce routines modifying shstk
x86/shstk: Handle thread shadow stack
x86/shstk: Add user-mode shadow stack support
...

+2790 -308
+1
Documentation/arch/x86/index.rst
···
    mtrr
    pat
    intel-hfi
+   shstk
    iommu
    intel_txt
    amd-memory-encryption
+179
Documentation/arch/x86/shstk.rst
···
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Control-flow Enforcement Technology (CET) Shadow Stack
+======================================================
+
+CET Background
+==============
+
+Control-flow Enforcement Technology (CET) covers several related x86 processor
+features that provide protection against control flow hijacking attacks. CET
+can protect both applications and the kernel.
+
+CET introduces shadow stack and indirect branch tracking (IBT). A shadow stack
+is a secondary stack allocated from memory which cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies indirect CALL/JMP targets are intended
+as marked by the compiler with 'ENDBR' opcodes. Not all CPU's have both Shadow
+Stack and Indirect Branch Tracking. Today in the 64-bit kernel, only userspace
+shadow stack and kernel IBT are supported.
+
+Requirements to use Shadow Stack
+================================
+
+To use userspace shadow stack you need HW that supports it, a kernel
+configured with it and userspace libraries compiled with it.
+
+The kernel Kconfig option is X86_USER_SHADOW_STACK. When compiled in, shadow
+stacks can be disabled at runtime with the kernel parameter: nousershstk.
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "user_shstk" means that userspace shadow stack is supported on the current
+kernel and HW.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output::
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: SHSTK
+
+The kernel does not process these applications markers directly. Applications
+or loaders must enable CET features using the interface described in section 4.
+Typically this would be done in dynamic loader or static runtime objects, as is
+the case in GLIBC.
+
+Enabling arch_prctl()'s
+=======================
+
+Elf features should be enabled by the loader using the below arch_prctl's. They
+are only supported in 64 bit user applications. These operate on the features
+on a per-thread basis. The enablement status is inherited on clone, so if the
+feature is enabled on the first thread, it will propagate to all the thread's
+in an app.
+
+arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
+    Enable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
+    Disable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
+    Lock in features at their current enabled or disabled status. 'features'
+    is a mask of all features to lock. All bits set are processed, unset bits
+    are ignored. The mask is ORed with the existing value. So any feature bits
+    set here cannot be enabled or disabled afterwards.
+
+arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
+    Unlock features. 'features' is a mask of all features to unlock. All
+    bits set are processed, unset bits are ignored. Only works via ptrace.
+
+arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr)
+    Copy the currently enabled features to the address passed in addr. The
+    features are described using the bits passed into the others in
+    'features'.
+
+The return values are as follows. On success, return 0. On error, errno can
+be::
+
+        -EPERM if any of the passed feature are locked.
+        -ENOTSUPP if the feature is not supported by the hardware or
+         kernel.
+        -EINVAL arguments (non existing feature, etc)
+        -EFAULT if could not copy information back to userspace
+
+The feature's bits supported are::
+
+    ARCH_SHSTK_SHSTK - Shadow stack
+    ARCH_SHSTK_WRSS - WRSS
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc Status
+===========
+To check if an application is actually running with shadow stack, the
+user can read the /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+    x86_Thread_features: shstk wrss
+    x86_Thread_features_locked: shstk wrss
+
+Implementation of the Shadow Stack
+==================================
+
+Shadow Stack Size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. In the case
+of the clone3 syscall, there is a stack size passed in and shadow stack
+uses this instead of the rlimit.
+
+Signal
+------
+
+The main program and its signal handlers use the same shadow stack. Because
+the shadow stack stores only return addresses, a large shadow stack covers
+the condition that both the program stack and the signal alternate stack run
+out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+    |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+                    (bit 63 set to 1)
+    |        ...| - Other state may be added in the future
+
+
+32 bit ABI signals are not supported in shadow stack processes. Linux prevents
+32 bit execution while shadow stack is enabled by the allocating shadow stacks
+outside of the 32 bit address space. When execution enters 32 bit mode, either
+via far call or returning to userspace, a #GP is generated by the hardware
+which, will be delivered to the process as a segfault. When transitioning to
+userspace the register's state will be as if the userspace ip being returned to
+caused the segfault.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stack creation behaves like mmap() with respect
+to ASLR behavior. Similarly, on thread exit the thread's shadow stack is
+disabled.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel. At which point,
+userspace can choose to re-enable, or lock them.
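
The sigframe token format described in the "Signal" section of shstk.rst (the old SSP stored with bit 63 set, then verified on sigreturn) can be modelled in a few lines. This is a simplified sketch, not the kernel's implementation: the macro and function names are hypothetical, and the only check modelled is the bit 63 marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the marker bit the documentation describes. */
#define TOY_SIGFRAME_TOKEN_BIT (1ULL << 63)

/*
 * Signal delivery: wrap the pre-signal SSP in token format.
 * An SSP that already has bit 63 set cannot be represented, so refuse it.
 */
static bool toy_encode_sigframe_token(uint64_t old_ssp, uint64_t *token)
{
	if (old_ssp & TOY_SIGFRAME_TOKEN_BIT)
		return false;
	*token = old_ssp | TOY_SIGFRAME_TOKEN_BIT;
	return true;
}

/*
 * sigreturn: accept only values carrying the marker bit, then strip it
 * to recover the SSP to restore.
 */
static bool toy_decode_sigframe_token(uint64_t token, uint64_t *old_ssp)
{
	if (!(token & TOY_SIGFRAME_TOKEN_BIT))
		return false;
	*old_ssp = token & ~TOY_SIGFRAME_TOKEN_BIT;
	return true;
}
```

Because user addresses on x86-64 never have bit 63 set, an ordinary return address found on the shadow stack cannot be mistaken for a sigframe token in this model.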
+1
Documentation/filesystems/proc.rst
···
     mt    arm64 MTE allocation tags are enabled
     um    userfaultfd missing tracking
     uw    userfaultfd wr-protect tracking
+    ss    shadow stack page
     ==    =======================================

 Note that there is no guarantee that every flag and associated mnemonic will
+10 -2
Documentation/mm/arch_pgtable_helpers.rst
···
 +---------------------------+--------------------------------------------------+
 | pte_mkclean               | Creates a clean PTE                              |
 +---------------------------+--------------------------------------------------+
-| pte_mkwrite               | Creates a writable PTE                           |
+| pte_mkwrite               | Creates a writable PTE of the type specified by  |
+|                           | the VMA.                                         |
++---------------------------+--------------------------------------------------+
+| pte_mkwrite_novma         | Creates a writable PTE, of the conventional type |
+|                           | of writable.                                     |
 +---------------------------+--------------------------------------------------+
 | pte_wrprotect             | Creates a write protected PTE                    |
 +---------------------------+--------------------------------------------------+
···
 +---------------------------+--------------------------------------------------+
 | pmd_mkclean               | Creates a clean PMD                              |
 +---------------------------+--------------------------------------------------+
-| pmd_mkwrite               | Creates a writable PMD                           |
+| pmd_mkwrite               | Creates a writable PMD of the type specified by  |
+|                           | the VMA.                                         |
++---------------------------+--------------------------------------------------+
+| pmd_mkwrite_novma         | Creates a writable PMD, of the conventional type |
+|                           | of writable.                                     |
 +---------------------------+--------------------------------------------------+
 | pmd_wrprotect             | Creates a write protected PMD                    |
 +---------------------------+--------------------------------------------------+
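
The split documented above (a VMA-aware `pte_mkwrite` next to a plain `pte_mkwrite_novma`) can be pictured with a toy model. Everything here is an assumption made for illustration: the `toy_` types, the bit values, the two-argument signature, and the idea that a shadow stack VMA selects a different "writable" encoding. Real PTE layouts and helpers are architecture specific.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define TOY_PAGE_WRITE       (1u << 0)  /* conventional writable */
#define TOY_PAGE_SHSTK       (1u << 1)  /* stand-in for a shadow stack encoding */
#define TOY_VM_SHADOW_STACK  (1u << 0)  /* stand-in for a VMA flag */

typedef struct { uint32_t val; } toy_pte_t;
struct toy_vma { uint32_t vm_flags; };

/* "Creates a writable PTE, of the conventional type of writable." */
static toy_pte_t toy_pte_mkwrite_novma(toy_pte_t pte)
{
	pte.val |= TOY_PAGE_WRITE;
	return pte;
}

/* Assumed helper: mark the PTE writable in the shadow stack sense. */
static toy_pte_t toy_pte_mkwrite_shstk(toy_pte_t pte)
{
	pte.val &= ~TOY_PAGE_WRITE;
	pte.val |= TOY_PAGE_SHSTK;
	return pte;
}

/* "Creates a writable PTE of the type specified by the VMA." */
static toy_pte_t toy_pte_mkwrite(toy_pte_t pte, const struct toy_vma *vma)
{
	if (vma->vm_flags & TOY_VM_SHADOW_STACK)
		return toy_pte_mkwrite_shstk(pte);
	return toy_pte_mkwrite_novma(pte);
}
```

Under this reading, the per-architecture renames in the rest of the diff keep the old behaviour under the `_novma` name, leaving the unsuffixed name free for a VMA-aware variant.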
+8
arch/Kconfig
···
 config ARCH_WANT_HUGE_PMD_SHARE
 	bool

+# Archs that want to use pmd_mkwrite on kernel memory need it defined even
+# if there are no userspace memory management features that use it
+config ARCH_WANT_KERNEL_PMD_MKWRITE
+	bool
+
+config ARCH_WANT_PMD_MKWRITE
+	def_bool TRANSPARENT_HUGEPAGE || ARCH_WANT_KERNEL_PMD_MKWRITE
+
 config HAVE_ARCH_SOFT_DIRTY
 	bool
+1 -1
arch/alpha/include/asm/pgtable.h
···
 extern inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_FOW; return pte; }
 extern inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~(__DIRTY_BITS); return pte; }
 extern inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~(__ACCESS_BITS); return pte; }
-extern inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) &= ~_PAGE_FOW; return pte; }
+extern inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) &= ~_PAGE_FOW; return pte; }
 extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= __DIRTY_BITS; return pte; }
 extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; }
+1 -1
arch/arc/include/asm/hugepage.h
···
 }

 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite_novma(pmd) pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
+1 -1
arch/arc/include/asm/pgtable-bits-arcv2.h
···

 PTE_BIT_FUNC(mknotpresent, &= ~(_PAGE_PRESENT));
 PTE_BIT_FUNC(wrprotect, &= ~(_PAGE_WRITE));
-PTE_BIT_FUNC(mkwrite, |= (_PAGE_WRITE));
+PTE_BIT_FUNC(mkwrite_novma, |= (_PAGE_WRITE));
 PTE_BIT_FUNC(mkclean, &= ~(_PAGE_DIRTY));
 PTE_BIT_FUNC(mkdirty, |= (_PAGE_DIRTY));
 PTE_BIT_FUNC(mkold, &= ~(_PAGE_ACCESSED));
+1 -1
arch/arm/include/asm/pgtable-3level.h
···

 PMD_BIT_FUNC(wrprotect, |= L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkold, &= ~PMD_SECT_AF);
-PMD_BIT_FUNC(mkwrite, &= ~L_PMD_SECT_RDONLY);
+PMD_BIT_FUNC(mkwrite_novma, &= ~L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkdirty, |= L_PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkclean, &= ~L_PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF);
+1 -1
arch/arm/include/asm/pgtable.h
···
 	return set_pte_bit(pte, __pgprot(L_PTE_RDONLY));
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return clear_pte_bit(pte, __pgprot(L_PTE_RDONLY));
 }
+1 -1
arch/arm/kernel/signal.c
···
  */
 static_assert(NSIGILL == 11);
 static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
+2 -2
arch/arm64/include/asm/pgtable.h
···
 	return pmd;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
 	pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY));
···
 #define pmd_cont(pmd) pte_cont(pmd_pte(pmd))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite_novma(pmd) pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)))
 #define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
+1 -1
arch/arm64/kernel/signal.c
···
  */
 static_assert(NSIGILL == 11);
 static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
+1 -1
arch/arm64/kernel/signal32.c
···
  */
 static_assert(NSIGILL == 11);
 static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
+2 -2
arch/arm64/mm/trans_pgd.c
···
 		 * read only (code, rodata). Clear the RDONLY bit from
 		 * the temporary mappings we use during restore.
 		 */
-		set_pte(dst_ptep, pte_mkwrite(pte));
+		set_pte(dst_ptep, pte_mkwrite_novma(pte));
 	} else if ((debug_pagealloc_enabled() ||
 		   is_kfence_address((void *)addr)) && !pte_none(pte)) {
 		/*
···
 		 */
 		BUG_ON(!pfn_valid(pte_pfn(pte)));

-		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
+		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_novma(pte)));
 	}
 }
+1 -1
arch/csky/include/asm/pgtable.h
···
 	return pte;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
+1 -1
arch/hexagon/include/asm/pgtable.h
···
 }

 /* pte_mkwrite - mark page as writable */
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
+1 -1
arch/ia64/include/asm/pgtable.h
···
  * access rights:
  */
 #define pte_wrprotect(pte) (__pte(pte_val(pte) & ~_PAGE_AR_RW))
-#define pte_mkwrite(pte) (__pte(pte_val(pte) | _PAGE_AR_RW))
+#define pte_mkwrite_novma(pte) (__pte(pte_val(pte) | _PAGE_AR_RW))
 #define pte_mkold(pte) (__pte(pte_val(pte) & ~_PAGE_A))
 #define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A))
 #define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D))
+2 -2
arch/loongarch/include/asm/pgtable.h
···
 	return pte;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
···
 	return !!(pmd_val(pmd) & _PAGE_WRITE);
 }

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	pmd_val(pmd) |= _PAGE_WRITE;
 	if (pmd_val(pmd) & _PAGE_MODIFIED)
+1 -1
arch/m68k/include/asm/mcf_pgtable.h
···
 	return pte;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= CF_PAGE_WRITABLE;
 	return pte;
+1 -1
arch/m68k/include/asm/motorola_pgtable.h
···
 static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_RONLY; return pte; }
 static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~_PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) &= ~_PAGE_RONLY; return pte; }
+static inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) &= ~_PAGE_RONLY; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mknocache(pte_t pte)
+1 -1
arch/m68k/include/asm/sun3_pgtable.h
···
 static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; }
 static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_MODIFIED; return pte; }
 static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) |= SUN3_PAGE_WRITEABLE; return pte; }
+static inline pte_t pte_mkwrite_novma(pte_t pte){ pte_val(pte) |= SUN3_PAGE_WRITEABLE; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= SUN3_PAGE_MODIFIED; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= SUN3_PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mknocache(pte_t pte) { pte_val(pte) |= SUN3_PAGE_NOCACHE; return pte; }
+1 -1
arch/microblaze/include/asm/pgtable.h
···
 	{ pte_val(pte) |= _PAGE_USER; return pte; }
 static inline pte_t pte_mkexec(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_USER | _PAGE_EXEC; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) \
+static inline pte_t pte_mkwrite_novma(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte) \
 	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
+3 -3
arch/mips/include/asm/pgtable.h
···
 	return pte;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte.pte_low |= _PAGE_WRITE;
 	if (pte.pte_low & _PAGE_MODIFIED) {
···
 	return pte;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	if (pte_val(pte) & _PAGE_MODIFIED)
···
 	return pmd;
 }

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	pmd_val(pmd) |= _PAGE_WRITE;
 	if (pmd_val(pmd) & _PAGE_MODIFIED)
+1 -1
arch/nios2/include/asm/pgtable.h
···
 	return pte;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
+1 -1
arch/openrisc/include/asm/pgtable.h
···
 	return pte;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte_val(pte) |= _PAGE_WRITE;
 	return pte;
+1 -1
arch/parisc/include/asm/pgtable.h
···
 static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= ~_PAGE_WRITE; return pte; }
 static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= _PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) |= _PAGE_WRITE; return pte; }
+static inline pte_t pte_mkwrite_novma(pte_t pte) { pte_val(pte) |= _PAGE_WRITE; return pte; }
 static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; return pte; }

 /*
+1 -1
arch/powerpc/include/asm/book3s/32/pgtable.h
···
 	return pte;
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
+2 -2
arch/powerpc/include/asm/book3s/64/pgtable.h
···
 	return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC));
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	/*
 	 * write implies read, hence set both
···
 #define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite_novma(pmd) pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)))

 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #define pmd_soft_dirty(pmd) pte_soft_dirty(pmd_pte(pmd))
+2 -2
arch/powerpc/include/asm/nohash/32/pgtable.h
···
 #define pte_clear(mm, addr, ptep) \
 	do { pte_update(mm, addr, ptep, ~0, 0, 0); } while (0)

-#ifndef pte_mkwrite
-static inline pte_t pte_mkwrite(pte_t pte)
+#ifndef pte_mkwrite_novma
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
+2 -2
arch/powerpc/include/asm/nohash/32/pte-8xx.h
···

 #define pte_write pte_write

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) & ~_PAGE_RO);
 }

-#define pte_mkwrite pte_mkwrite
+#define pte_mkwrite_novma pte_mkwrite_novma

 static inline bool pte_user(pte_t pte)
 {
+1 -1
arch/powerpc/include/asm/nohash/64/pgtable.h
···
 #ifndef __ASSEMBLY__
 /* pte_clear moved to later in this file */

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_RW);
 }
+3 -3
arch/riscv/include/asm/pgtable.h
···

 /* static inline pte_t pte_mkread(pte_t pte) */

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_WRITE);
 }
···
 	return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
 }

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
-	return pte_pmd(pte_mkwrite(pmd_pte(pmd)));
+	return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
 }

 static inline pmd_t pmd_wrprotect(pmd_t pmd)
+1
arch/s390/Kconfig
···
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_IPC_PARSE_VERSION
+	select ARCH_WANT_KERNEL_PMD_MKWRITE
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	select BUILDTIME_TABLE_SORT
 	select CLONE_BACKWARDS2
+1 -1
arch/s390/include/asm/hugetlb.h
···

 static inline pte_t huge_pte_mkwrite(pte_t pte)
 {
-	return pte_mkwrite(pte);
+	return pte_mkwrite_novma(pte);
 }

 static inline pte_t huge_pte_mkdirty(pte_t pte)
+2 -2
arch/s390/include/asm/pgtable.h
···
 	return set_pte_bit(pte, __pgprot(_PAGE_PROTECT));
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	pte = set_pte_bit(pte, __pgprot(_PAGE_WRITE));
 	if (pte_val(pte) & _PAGE_DIRTY)
···
 	return set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_PROTECT));
 }

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	pmd = set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_WRITE));
 	if (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY)
+2 -2
arch/s390/mm/pageattr.c
···
 		if (flags & SET_MEMORY_RO)
 			new = pte_wrprotect(new);
 		else if (flags & SET_MEMORY_RW)
-			new = pte_mkwrite(pte_mkdirty(new));
+			new = pte_mkwrite_novma(pte_mkdirty(new));
 		if (flags & SET_MEMORY_NX)
 			new = set_pte_bit(new, __pgprot(_PAGE_NOEXEC));
 		else if (flags & SET_MEMORY_X)
···
 	if (flags & SET_MEMORY_RO)
 		new = pmd_wrprotect(new);
 	else if (flags & SET_MEMORY_RW)
-		new = pmd_mkwrite(pmd_mkdirty(new));
+		new = pmd_mkwrite_novma(pmd_mkdirty(new));
 	if (flags & SET_MEMORY_NX)
 		new = set_pmd_bit(new, __pgprot(_SEGMENT_ENTRY_NOEXEC));
 	else if (flags & SET_MEMORY_X)
+2 -2
arch/sh/include/asm/pgtable_32.h
···
  * kernel permissions), we attempt to couple them a bit more sanely here.
  */
 PTE_BIT_FUNC(high, wrprotect, &= ~(_PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE));
-PTE_BIT_FUNC(high, mkwrite, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
+PTE_BIT_FUNC(high, mkwrite_novma, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
 PTE_BIT_FUNC(high, mkhuge, |= _PAGE_SZHUGE);
 #else
 PTE_BIT_FUNC(low, wrprotect, &= ~_PAGE_RW);
-PTE_BIT_FUNC(low, mkwrite, |= _PAGE_RW);
+PTE_BIT_FUNC(low, mkwrite_novma, |= _PAGE_RW);
 PTE_BIT_FUNC(low, mkhuge, |= _PAGE_SZHUGE);
 #endif
+1 -1
arch/sparc/include/asm/pgtable_32.h
···
 	return __pte(pte_val(pte) & ~SRMMU_REF);
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	return __pte(pte_val(pte) | SRMMU_WRITE);
 }
+3 -3
arch/sparc/include/asm/pgtable_64.h
···
 	return __pte(val);
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	unsigned long val = pte_val(pte), mask;

···
 	return __pmd(pte_val(pte));
 }

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 {
 	pte_t pte = __pte(pmd_val(pmd));

-	pte = pte_mkwrite(pte);
+	pte = pte_mkwrite_novma(pte);

 	return __pmd(pte_val(pte));
 }
+1 -1
arch/sparc/kernel/signal32.c
···
  */
 static_assert(NSIGILL == 11);
 static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
+1 -1
arch/sparc/kernel/signal_64.c
···
  */
 static_assert(NSIGILL == 11);
 static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
+1 -1
arch/um/include/asm/pgtable.h
···
 	return(pte);
 }

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 {
 	if (unlikely(pte_get_bits(pte, _PAGE_RW)))
 		return pte;
+24
arch/x86/Kconfig
···
 		  (CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
 		  $(as-instr,endbr64)

+config X86_CET
+	def_bool n
+	help
+	  CET features configured (Shadow stack or IBT)
+
 config X86_KERNEL_IBT
 	prompt "Indirect Branch Tracking"
 	def_bool y
···
 	# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
 	depends on !LD_IS_LLD || LLD_VERSION >= 140000
 	select OBJTOOL
+	select X86_CET
 	help
 	  Build the kernel with support for Indirect Branch Tracking, a
 	  hardware support course-grain forward-edge Control Flow Integrity
···
 	  only be accessed by code running within the enclave. Accesses from
 	  outside the enclave, including other enclaves, are disallowed by
 	  hardware.
+
+	  If unsure, say N.
+
+config X86_USER_SHADOW_STACK
+	bool "X86 userspace shadow stack"
+	depends on AS_WRUSS
+	depends on X86_64
+	select ARCH_USES_HIGH_VMA_FLAGS
+	select X86_CET
+	help
+	  Shadow stack protection is a hardware feature that detects function
+	  return address corruption. This helps mitigate ROP attacks.
+	  Applications must be enabled to use it, and old userspace does not
+	  get protection "for free".
+
+	  CPUs supporting shadow stacks were first released in 2020.
+
+	  See Documentation/arch/x86/shstk.rst for more information.

 	  If unsure, say N.
+5
arch/x86/Kconfig.assembler
··· 24 24 def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2) 25 25 help 26 26 Supported by binutils >= 2.30 and LLVM integrated assembler 27 + 28 + config AS_WRUSS 29 + def_bool $(as-instr,wrussq %rax$(comma)(%rbx)) 30 + help 31 + Supported by binutils >= 2.31 and LLVM integrated assembler
+1
arch/x86/entry/syscalls/syscall_64.tbl
··· 374 374 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 375 375 451 common cachestat sys_cachestat 376 376 452 common fchmodat2 sys_fchmodat2 377 + 453 64 map_shadow_stack sys_map_shadow_stack 377 378 378 379 # 379 380 # Due to a historical design error, certain syscalls are numbered differently
+2
arch/x86/include/asm/cpufeatures.h
··· 307 307 #define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */ 308 308 #define X86_FEATURE_SMBA (11*32+21) /* "" Slow Memory Bandwidth Allocation */ 309 309 #define X86_FEATURE_BMEC (11*32+22) /* "" Bandwidth Monitoring Event Configuration */ 310 + #define X86_FEATURE_USER_SHSTK (11*32+23) /* Shadow stack support for user mode applications */ 310 311 311 312 #define X86_FEATURE_SRSO (11*32+24) /* "" AMD BTB untrain RETs */ 312 313 #define X86_FEATURE_SRSO_ALIAS (11*32+25) /* "" AMD BTB untrain RETs through aliasing */ ··· 384 383 #define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */ 385 384 #define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */ 386 385 #define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */ 386 + #define X86_FEATURE_SHSTK (16*32+ 7) /* "" Shadow stack */ 387 387 #define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */ 388 388 #define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */ 389 389 #define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
+14 -2
arch/x86/include/asm/disabled-features.h
··· 105 105 # define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31)) 106 106 #endif 107 107 108 + #ifdef CONFIG_X86_USER_SHADOW_STACK 109 + #define DISABLE_USER_SHSTK 0 110 + #else 111 + #define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31)) 112 + #endif 113 + 114 + #ifdef CONFIG_X86_KERNEL_IBT 115 + #define DISABLE_IBT 0 116 + #else 117 + #define DISABLE_IBT (1 << (X86_FEATURE_IBT & 31)) 118 + #endif 119 + 108 120 /* 109 121 * Make sure to add features to the correct mask 110 122 */ ··· 132 120 #define DISABLED_MASK9 (DISABLE_SGX) 133 121 #define DISABLED_MASK10 0 134 122 #define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \ 135 - DISABLE_CALL_DEPTH_TRACKING) 123 + DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK) 136 124 #define DISABLED_MASK12 (DISABLE_LAM) 137 125 #define DISABLED_MASK13 0 138 126 #define DISABLED_MASK14 0 ··· 140 128 #define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \ 141 129 DISABLE_ENQCMD) 142 130 #define DISABLED_MASK17 0 143 - #define DISABLED_MASK18 0 131 + #define DISABLED_MASK18 (DISABLE_IBT) 144 132 #define DISABLED_MASK19 0 145 133 #define DISABLED_MASK20 0 146 134 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
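Reviewer note: the DISABLE_USER_SHSTK mask above relies on the usual cpufeature encoding, where a feature number is `word*32 + bit`. The sketch below only illustrates that arithmetic for the (11*32+23) value added in cpufeatures.h; the helper names `feature_word` and `feature_mask` are hypothetical and are not kernel APIs.

```c
#include <stdint.h>

/* Value copied from the cpufeatures.h hunk in this diff. */
#define X86_FEATURE_USER_SHSTK (11 * 32 + 23)

/* Hypothetical helper: which 32-bit capability word the feature lives in. */
static inline unsigned int feature_word(unsigned int feature)
{
	return feature >> 5;            /* same as feature / 32 */
}

/* Hypothetical helper: the bit mask inside that word, as in
 * (1 << (X86_FEATURE_USER_SHSTK & 31)) from disabled-features.h. */
static inline uint32_t feature_mask(unsigned int feature)
{
	return (uint32_t)1 << (feature & 31);
}
```

With these values the feature lands in word 11, which is presumably why the hunk adds DISABLE_USER_SHSTK to DISABLED_MASK11 rather than to any other mask.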
+9
arch/x86/include/asm/fpu/api.h
··· 82 82 preempt_enable(); 83 83 } 84 84 85 + /* 86 + * FPU state gets lazily restored before returning to userspace. So when in the 87 + * kernel, the valid FPU state may be kept in the buffer. This function will force 88 + * restore all the fpu state to the registers early if needed, and lock them from 89 + * being automatically saved/restored. Then FPU state can be modified safely in the 90 + * registers, before unlocking with fpregs_unlock(). 91 + */ 92 + void fpregs_lock_and_load(void); 93 + 85 94 #ifdef CONFIG_X86_DEBUG_FPU 86 95 extern void fpregs_assert_state_consistent(void); 87 96 #else
+4 -3
arch/x86/include/asm/fpu/regset.h
··· 7 7 8 8 #include <linux/regset.h> 9 9 10 - extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active; 10 + extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active, 11 + ssp_active; 11 12 extern user_regset_get2_fn fpregs_get, xfpregs_get, fpregs_soft_get, 12 - xstateregs_get; 13 + xstateregs_get, ssp_get; 13 14 extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set, 14 - xstateregs_set; 15 + xstateregs_set, ssp_set; 15 16 16 17 /* 17 18 * xstateregs_active == regset_fpregs_active. Please refer to the comment
+2 -1
arch/x86/include/asm/fpu/sched.h
··· 11 11 12 12 extern void save_fpregs_to_fpstate(struct fpu *fpu); 13 13 extern void fpu__drop(struct fpu *fpu); 14 - extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal); 14 + extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal, 15 + unsigned long shstk_addr); 15 16 extern void fpu_flush_thread(void); 16 17 17 18 /*
+14 -2
arch/x86/include/asm/fpu/types.h
··· 115 115 XFEATURE_PT_UNIMPLEMENTED_SO_FAR, 116 116 XFEATURE_PKRU, 117 117 XFEATURE_PASID, 118 - XFEATURE_RSRVD_COMP_11, 119 - XFEATURE_RSRVD_COMP_12, 118 + XFEATURE_CET_USER, 119 + XFEATURE_CET_KERNEL_UNUSED, 120 120 XFEATURE_RSRVD_COMP_13, 121 121 XFEATURE_RSRVD_COMP_14, 122 122 XFEATURE_LBR, ··· 138 138 #define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR) 139 139 #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU) 140 140 #define XFEATURE_MASK_PASID (1 << XFEATURE_PASID) 141 + #define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER) 142 + #define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED) 141 143 #define XFEATURE_MASK_LBR (1 << XFEATURE_LBR) 142 144 #define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG) 143 145 #define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA) ··· 253 251 u32 pkru; 254 252 u32 pad; 255 253 } __packed; 254 + 255 + /* 256 + * State component 11 is Control-flow Enforcement user states 257 + */ 258 + struct cet_user_state { 259 + /* user control-flow settings */ 260 + u64 user_cet; 261 + /* user shadow stack pointer */ 262 + u64 user_ssp; 263 + }; 256 264 257 265 /* 258 266 * State component 15: Architectural LBR configuration state.
+4 -2
arch/x86/include/asm/fpu/xstate.h
··· 50 50 #define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA 51 51 52 52 /* All currently supported supervisor features */ 53 - #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID) 53 + #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \ 54 + XFEATURE_MASK_CET_USER) 54 55 55 56 /* 56 57 * A supervisor state component may not always contain valuable information, ··· 78 77 * Unsupported supervisor features. When a supervisor feature in this mask is 79 78 * supported in the future, move it to the supported supervisor feature mask. 80 79 */ 81 - #define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT) 80 + #define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \ 81 + XFEATURE_MASK_CET_KERNEL) 82 82 83 83 /* All supervisor states including supported and unsupported states. */ 84 84 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
+1 -1
arch/x86/include/asm/idtentry.h
··· 614 614 #endif 615 615 616 616 /* #CP */ 617 - #ifdef CONFIG_X86_KERNEL_IBT 617 + #ifdef CONFIG_X86_CET 618 618 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection); 619 619 #endif 620 620
+2
arch/x86/include/asm/mmu_context.h
··· 186 186 #else 187 187 #define deactivate_mm(tsk, mm) \ 188 188 do { \ 189 + if (!tsk->vfork_done) \ 190 + shstk_free(tsk); \ 189 191 load_gs_index(0); \ 190 192 loadsegment(fs, 0); \ 191 193 } while (0)
+266 -36
arch/x86/include/asm/pgtable.h
··· 125 125 * The following only work if pte_present() is true. 126 126 * Undefined behaviour if not.. 127 127 */ 128 - static inline int pte_dirty(pte_t pte) 128 + static inline bool pte_dirty(pte_t pte) 129 129 { 130 - return pte_flags(pte) & _PAGE_DIRTY; 130 + return pte_flags(pte) & _PAGE_DIRTY_BITS; 131 + } 132 + 133 + static inline bool pte_shstk(pte_t pte) 134 + { 135 + return cpu_feature_enabled(X86_FEATURE_SHSTK) && 136 + (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY; 131 137 } 132 138 133 139 static inline int pte_young(pte_t pte) ··· 141 135 return pte_flags(pte) & _PAGE_ACCESSED; 142 136 } 143 137 144 - static inline int pmd_dirty(pmd_t pmd) 138 + static inline bool pmd_dirty(pmd_t pmd) 145 139 { 146 - return pmd_flags(pmd) & _PAGE_DIRTY; 140 + return pmd_flags(pmd) & _PAGE_DIRTY_BITS; 141 + } 142 + 143 + static inline bool pmd_shstk(pmd_t pmd) 144 + { 145 + return cpu_feature_enabled(X86_FEATURE_SHSTK) && 146 + (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) == 147 + (_PAGE_DIRTY | _PAGE_PSE); 147 148 } 148 149 149 150 #define pmd_young pmd_young ··· 159 146 return pmd_flags(pmd) & _PAGE_ACCESSED; 160 147 } 161 148 162 - static inline int pud_dirty(pud_t pud) 149 + static inline bool pud_dirty(pud_t pud) 163 150 { 164 - return pud_flags(pud) & _PAGE_DIRTY; 151 + return pud_flags(pud) & _PAGE_DIRTY_BITS; 165 152 } 166 153 167 154 static inline int pud_young(pud_t pud) ··· 171 158 172 159 static inline int pte_write(pte_t pte) 173 160 { 174 - return pte_flags(pte) & _PAGE_RW; 161 + /* 162 + * Shadow stack pages are logically writable, but do not have 163 + * _PAGE_RW. Check for them separately from _PAGE_RW itself. 164 + */ 165 + return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte); 166 + } 167 + 168 + #define pmd_write pmd_write 169 + static inline int pmd_write(pmd_t pmd) 170 + { 171 + /* 172 + * Shadow stack pages are logically writable, but do not have 173 + * _PAGE_RW. Check for them separately from _PAGE_RW itself. 
174 + */ 175 + return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd); 176 + } 177 + 178 + #define pud_write pud_write 179 + static inline int pud_write(pud_t pud) 180 + { 181 + return pud_flags(pud) & _PAGE_RW; 175 182 } 176 183 177 184 static inline int pte_huge(pte_t pte) ··· 325 292 return native_make_pte(v & ~clear); 326 293 } 327 294 295 + /* 296 + * Write protection operations can result in Dirty=1,Write=0 PTEs. But in the 297 + * case of X86_FEATURE_USER_SHSTK, these PTEs denote shadow stack memory. So 298 + * when creating dirty, write-protected memory, a software bit is used: 299 + * _PAGE_BIT_SAVED_DIRTY. The following functions take a PTE and transition the 300 + * Dirty bit to SavedDirty, and vice-vesra. 301 + * 302 + * This shifting is only done if needed. In the case of shifting 303 + * Dirty->SavedDirty, the condition is if the PTE is Write=0. In the case of 304 + * shifting SavedDirty->Dirty, the condition is Write=1. 305 + */ 306 + static inline pgprotval_t mksaveddirty_shift(pgprotval_t v) 307 + { 308 + pgprotval_t cond = (~v >> _PAGE_BIT_RW) & 1; 309 + 310 + v |= ((v >> _PAGE_BIT_DIRTY) & cond) << _PAGE_BIT_SAVED_DIRTY; 311 + v &= ~(cond << _PAGE_BIT_DIRTY); 312 + 313 + return v; 314 + } 315 + 316 + static inline pgprotval_t clear_saveddirty_shift(pgprotval_t v) 317 + { 318 + pgprotval_t cond = (v >> _PAGE_BIT_RW) & 1; 319 + 320 + v |= ((v >> _PAGE_BIT_SAVED_DIRTY) & cond) << _PAGE_BIT_DIRTY; 321 + v &= ~(cond << _PAGE_BIT_SAVED_DIRTY); 322 + 323 + return v; 324 + } 325 + 326 + static inline pte_t pte_mksaveddirty(pte_t pte) 327 + { 328 + pteval_t v = native_pte_val(pte); 329 + 330 + v = mksaveddirty_shift(v); 331 + return native_make_pte(v); 332 + } 333 + 334 + static inline pte_t pte_clear_saveddirty(pte_t pte) 335 + { 336 + pteval_t v = native_pte_val(pte); 337 + 338 + v = clear_saveddirty_shift(v); 339 + return native_make_pte(v); 340 + } 341 + 328 342 static inline pte_t pte_wrprotect(pte_t pte) 329 343 { 330 - return pte_clear_flags(pte, 
_PAGE_RW); 344 + pte = pte_clear_flags(pte, _PAGE_RW); 345 + 346 + /* 347 + * Blindly clearing _PAGE_RW might accidentally create 348 + * a shadow stack PTE (Write=0,Dirty=1). Move the hardware 349 + * dirty value to the software bit, if present. 350 + */ 351 + return pte_mksaveddirty(pte); 331 352 } 332 353 333 354 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP ··· 419 332 420 333 static inline pte_t pte_mkclean(pte_t pte) 421 334 { 422 - return pte_clear_flags(pte, _PAGE_DIRTY); 335 + return pte_clear_flags(pte, _PAGE_DIRTY_BITS); 423 336 } 424 337 425 338 static inline pte_t pte_mkold(pte_t pte) ··· 434 347 435 348 static inline pte_t pte_mkdirty(pte_t pte) 436 349 { 437 - return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); 350 + pte = pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); 351 + 352 + return pte_mksaveddirty(pte); 353 + } 354 + 355 + static inline pte_t pte_mkwrite_shstk(pte_t pte) 356 + { 357 + pte = pte_clear_flags(pte, _PAGE_RW); 358 + 359 + return pte_set_flags(pte, _PAGE_DIRTY); 438 360 } 439 361 440 362 static inline pte_t pte_mkyoung(pte_t pte) ··· 451 355 return pte_set_flags(pte, _PAGE_ACCESSED); 452 356 } 453 357 454 - static inline pte_t pte_mkwrite(pte_t pte) 358 + static inline pte_t pte_mkwrite_novma(pte_t pte) 455 359 { 456 360 return pte_set_flags(pte, _PAGE_RW); 457 361 } 362 + 363 + struct vm_area_struct; 364 + pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma); 365 + #define pte_mkwrite pte_mkwrite 458 366 459 367 static inline pte_t pte_mkhuge(pte_t pte) 460 368 { ··· 504 404 return native_make_pmd(v & ~clear); 505 405 } 506 406 407 + /* See comments above mksaveddirty_shift() */ 408 + static inline pmd_t pmd_mksaveddirty(pmd_t pmd) 409 + { 410 + pmdval_t v = native_pmd_val(pmd); 411 + 412 + v = mksaveddirty_shift(v); 413 + return native_make_pmd(v); 414 + } 415 + 416 + /* See comments above mksaveddirty_shift() */ 417 + static inline pmd_t pmd_clear_saveddirty(pmd_t pmd) 418 + { 419 + pmdval_t v = native_pmd_val(pmd); 420 
+ 421 + v = clear_saveddirty_shift(v); 422 + return native_make_pmd(v); 423 + } 424 + 507 425 static inline pmd_t pmd_wrprotect(pmd_t pmd) 508 426 { 509 - return pmd_clear_flags(pmd, _PAGE_RW); 427 + pmd = pmd_clear_flags(pmd, _PAGE_RW); 428 + 429 + /* 430 + * Blindly clearing _PAGE_RW might accidentally create 431 + * a shadow stack PMD (RW=0, Dirty=1). Move the hardware 432 + * dirty value to the software bit. 433 + */ 434 + return pmd_mksaveddirty(pmd); 510 435 } 511 436 512 437 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP ··· 558 433 559 434 static inline pmd_t pmd_mkclean(pmd_t pmd) 560 435 { 561 - return pmd_clear_flags(pmd, _PAGE_DIRTY); 436 + return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS); 562 437 } 563 438 564 439 static inline pmd_t pmd_mkdirty(pmd_t pmd) 565 440 { 566 - return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); 441 + pmd = pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); 442 + 443 + return pmd_mksaveddirty(pmd); 444 + } 445 + 446 + static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd) 447 + { 448 + pmd = pmd_clear_flags(pmd, _PAGE_RW); 449 + 450 + return pmd_set_flags(pmd, _PAGE_DIRTY); 567 451 } 568 452 569 453 static inline pmd_t pmd_mkdevmap(pmd_t pmd) ··· 590 456 return pmd_set_flags(pmd, _PAGE_ACCESSED); 591 457 } 592 458 593 - static inline pmd_t pmd_mkwrite(pmd_t pmd) 459 + static inline pmd_t pmd_mkwrite_novma(pmd_t pmd) 594 460 { 595 461 return pmd_set_flags(pmd, _PAGE_RW); 596 462 } 463 + 464 + pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); 465 + #define pmd_mkwrite pmd_mkwrite 597 466 598 467 static inline pud_t pud_set_flags(pud_t pud, pudval_t set) 599 468 { ··· 612 475 return native_make_pud(v & ~clear); 613 476 } 614 477 478 + /* See comments above mksaveddirty_shift() */ 479 + static inline pud_t pud_mksaveddirty(pud_t pud) 480 + { 481 + pudval_t v = native_pud_val(pud); 482 + 483 + v = mksaveddirty_shift(v); 484 + return native_make_pud(v); 485 + } 486 + 487 + /* See comments above mksaveddirty_shift() */ 488 + static 
inline pud_t pud_clear_saveddirty(pud_t pud) 489 + { 490 + pudval_t v = native_pud_val(pud); 491 + 492 + v = clear_saveddirty_shift(v); 493 + return native_make_pud(v); 494 + } 495 + 615 496 static inline pud_t pud_mkold(pud_t pud) 616 497 { 617 498 return pud_clear_flags(pud, _PAGE_ACCESSED); ··· 637 482 638 483 static inline pud_t pud_mkclean(pud_t pud) 639 484 { 640 - return pud_clear_flags(pud, _PAGE_DIRTY); 485 + return pud_clear_flags(pud, _PAGE_DIRTY_BITS); 641 486 } 642 487 643 488 static inline pud_t pud_wrprotect(pud_t pud) 644 489 { 645 - return pud_clear_flags(pud, _PAGE_RW); 490 + pud = pud_clear_flags(pud, _PAGE_RW); 491 + 492 + /* 493 + * Blindly clearing _PAGE_RW might accidentally create 494 + * a shadow stack PUD (RW=0, Dirty=1). Move the hardware 495 + * dirty value to the software bit. 496 + */ 497 + return pud_mksaveddirty(pud); 646 498 } 647 499 648 500 static inline pud_t pud_mkdirty(pud_t pud) 649 501 { 650 - return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); 502 + pud = pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); 503 + 504 + return pud_mksaveddirty(pud); 651 505 } 652 506 653 507 static inline pud_t pud_mkdevmap(pud_t pud) ··· 676 512 677 513 static inline pud_t pud_mkwrite(pud_t pud) 678 514 { 679 - return pud_set_flags(pud, _PAGE_RW); 515 + pud = pud_set_flags(pud, _PAGE_RW); 516 + 517 + return pud_clear_saveddirty(pud); 680 518 } 681 519 682 520 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY ··· 795 629 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) 796 630 { 797 631 pteval_t val = pte_val(pte), oldval = val; 632 + pte_t pte_result; 798 633 799 634 /* 800 635 * Chop off the NX bit (if present), and add the NX portion of ··· 804 637 val &= _PAGE_CHG_MASK; 805 638 val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK; 806 639 val = flip_protnone_guard(oldval, val, PTE_PFN_MASK); 807 - return __pte(val); 640 + 641 + pte_result = __pte(val); 642 + 643 + /* 644 + * To avoid creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid: 
645 + * 1. Marking Write=0 PTEs Dirty=1 646 + * 2. Marking Dirty=1 PTEs Write=0 647 + * 648 + * The first case cannot happen because the _PAGE_CHG_MASK will filter 649 + * out any Dirty bit passed in newprot. Handle the second case by 650 + * going through the mksaveddirty exercise. Only do this if the old 651 + * value was Write=1 to avoid doing this on Shadow Stack PTEs. 652 + */ 653 + if (oldval & _PAGE_RW) 654 + pte_result = pte_mksaveddirty(pte_result); 655 + else 656 + pte_result = pte_clear_saveddirty(pte_result); 657 + 658 + return pte_result; 808 659 } 809 660 810 661 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) 811 662 { 812 663 pmdval_t val = pmd_val(pmd), oldval = val; 664 + pmd_t pmd_result; 813 665 814 - val &= _HPAGE_CHG_MASK; 666 + val &= (_HPAGE_CHG_MASK & ~_PAGE_DIRTY); 815 667 val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK; 816 668 val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK); 817 - return __pmd(val); 669 + 670 + pmd_result = __pmd(val); 671 + 672 + /* 673 + * To avoid creating Write=0,Dirty=1 PMDs, pte_modify() needs to avoid: 674 + * 1. Marking Write=0 PMDs Dirty=1 675 + * 2. Marking Dirty=1 PMDs Write=0 676 + * 677 + * The first case cannot happen because the _PAGE_CHG_MASK will filter 678 + * out any Dirty bit passed in newprot. Handle the second case by 679 + * going through the mksaveddirty exercise. Only do this if the old 680 + * value was Write=1 to avoid doing this on Shadow Stack PTEs. 
681 + */ 682 + if (oldval & _PAGE_RW) 683 + pmd_result = pmd_mksaveddirty(pmd_result); 684 + else 685 + pmd_result = pmd_clear_saveddirty(pmd_result); 686 + 687 + return pmd_result; 818 688 } 819 689 820 690 /* ··· 1035 831 * (Currently stuck as a macro because of indirect forward reference 1036 832 * to linux/mm.h:page_to_nid()) 1037 833 */ 1038 - #define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot)) 834 + #define mk_pte(page, pgprot) \ 835 + ({ \ 836 + pgprot_t __pgprot = pgprot; \ 837 + \ 838 + WARN_ON_ONCE((pgprot_val(__pgprot) & (_PAGE_DIRTY | _PAGE_RW)) == \ 839 + _PAGE_DIRTY); \ 840 + pfn_pte(page_to_pfn(page), __pgprot); \ 841 + }) 1039 842 1040 843 static inline int pmd_bad(pmd_t pmd) 1041 844 { ··· 1301 1090 static inline void ptep_set_wrprotect(struct mm_struct *mm, 1302 1091 unsigned long addr, pte_t *ptep) 1303 1092 { 1304 - clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte); 1093 + /* 1094 + * Avoid accidentally creating shadow stack PTEs 1095 + * (Write=0,Dirty=1). Use cmpxchg() to prevent races with 1096 + * the hardware setting Dirty=1. 
1097 + */ 1098 + pte_t old_pte, new_pte; 1099 + 1100 + old_pte = READ_ONCE(*ptep); 1101 + do { 1102 + new_pte = pte_wrprotect(old_pte); 1103 + } while (!try_cmpxchg((long *)&ptep->pte, (long *)&old_pte, *(long *)&new_pte)); 1305 1104 } 1306 1105 1307 1106 #define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0) ··· 1336 1115 extern int pmdp_clear_flush_young(struct vm_area_struct *vma, 1337 1116 unsigned long address, pmd_t *pmdp); 1338 1117 1339 - 1340 - #define pmd_write pmd_write 1341 - static inline int pmd_write(pmd_t pmd) 1342 - { 1343 - return pmd_flags(pmd) & _PAGE_RW; 1344 - } 1345 1118 1346 1119 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR 1347 1120 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long addr, ··· 1363 1148 static inline void pmdp_set_wrprotect(struct mm_struct *mm, 1364 1149 unsigned long addr, pmd_t *pmdp) 1365 1150 { 1366 - clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp); 1367 - } 1151 + /* 1152 + * Avoid accidentally creating shadow stack PTEs 1153 + * (Write=0,Dirty=1). Use cmpxchg() to prevent races with 1154 + * the hardware setting Dirty=1. 1155 + */ 1156 + pmd_t old_pmd, new_pmd; 1368 1157 1369 - #define pud_write pud_write 1370 - static inline int pud_write(pud_t pud) 1371 - { 1372 - return pud_flags(pud) & _PAGE_RW; 1158 + old_pmd = READ_ONCE(*pmdp); 1159 + do { 1160 + new_pmd = pmd_wrprotect(old_pmd); 1161 + } while (!try_cmpxchg((long *)pmdp, (long *)&old_pmd, *(long *)&new_pmd)); 1373 1162 } 1374 1163 1375 1164 #ifndef pmdp_establish ··· 1631 1412 { 1632 1413 unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER; 1633 1414 1415 + /* 1416 + * Write=0,Dirty=1 PTEs are shadow stack, which the kernel 1417 + * shouldn't generally allow access to, but since they 1418 + * are already Write=0, the below logic covers both cases. 
1419 + */ 1634 1420 if (write) 1635 1421 need_pte_bits |= _PAGE_RW; 1636 1422 ··· 1676 1452 { 1677 1453 return true; 1678 1454 } 1455 + 1456 + #define arch_check_zapped_pte arch_check_zapped_pte 1457 + void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte); 1458 + 1459 + #define arch_check_zapped_pmd arch_check_zapped_pmd 1460 + void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd); 1679 1461 1680 1462 #ifdef CONFIG_XEN_PV 1681 1463 #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young
+36 -8
arch/x86/include/asm/pgtable_types.h
··· 21 21 #define _PAGE_BIT_SOFTW2 10 /* " */ 22 22 #define _PAGE_BIT_SOFTW3 11 /* " */ 23 23 #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ 24 - #define _PAGE_BIT_SOFTW4 58 /* available for programmer */ 24 + #define _PAGE_BIT_SOFTW4 57 /* available for programmer */ 25 + #define _PAGE_BIT_SOFTW5 58 /* available for programmer */ 25 26 #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ 26 27 #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ 27 28 #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ ··· 34 33 #define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */ 35 34 #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ 36 35 #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 36 + 37 + #ifdef CONFIG_X86_64 38 + #define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit */ 39 + #else 40 + /* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */ 41 + #define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW2 /* Saved Dirty bit */ 42 + #endif 37 43 38 44 /* If _PAGE_BIT_PRESENT is clear, we use these: */ 39 45 /* - if the user mapped it with PROT_NONE; pte_present gives true */ ··· 125 117 #define _PAGE_SOFTW4 (_AT(pteval_t, 0)) 126 118 #endif 127 119 120 + /* 121 + * The hardware requires shadow stack to be Write=0,Dirty=1. However, 122 + * there are valid cases where the kernel might create read-only PTEs that 123 + * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty tracking). In 124 + * this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bit, 125 + * to avoid creating a wrong "shadow stack" PTEs. Such PTEs have 126 + * (Write=0,SavedDirty=1,Dirty=0) set. 
127 + */ 128 + #define _PAGE_SAVED_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_SAVED_DIRTY) 129 + 130 + #define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY) 131 + 128 132 #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) 129 133 130 134 /* ··· 145 125 * instance, and is *not* included in this mask since 146 126 * pte_modify() does modify it. 147 127 */ 148 - #define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ 149 - _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |\ 150 - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \ 151 - _PAGE_UFFD_WP) 128 + #define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ 129 + _PAGE_SPECIAL | _PAGE_ACCESSED | \ 130 + _PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY | \ 131 + _PAGE_DEVMAP | _PAGE_ENC | _PAGE_UFFD_WP) 152 132 #define _PAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PAT) 153 133 #define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE) 154 134 ··· 209 189 210 190 #define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G) 211 191 #define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G) 192 + 193 + /* 194 + * Page tables needs to have Write=1 in order for any lower PTEs to be 195 + * writable. 
This includes shadow stack memory (Write=0, Dirty=1) 196 + */ 212 197 #define _KERNPG_TABLE_NOENC (__PP|__RW| 0|___A| 0|___D| 0| 0) 213 198 #define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC) 214 199 #define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0) 215 200 #define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC) 216 - #define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G) 217 - #define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0|___D| 0|___G) 201 + 202 + #define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G) 203 + #define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G) 204 + #define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G) 205 + #define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G) 218 206 #define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC) 219 - #define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX|___D| 0|___G) 207 + #define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX| 0| 0|___G) 220 208 #define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G) 221 209 #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW| 0|___A| 0|___D|_PSE|___G) 222 210 #define __PAGE_KERNEL_WP (__PP|__RW| 0|___A|__NX|___D| 0|___G| __WP)
+8
arch/x86/include/asm/processor.h
··· 28 28 #include <asm/unwind_hints.h> 29 29 #include <asm/vmxfeatures.h> 30 30 #include <asm/vdso/processor.h> 31 + #include <asm/shstk.h> 31 32 32 33 #include <linux/personality.h> 33 34 #include <linux/cache.h> ··· 474 473 * PKRU is the hardware itself. 475 474 */ 476 475 u32 pkru; 476 + 477 + #ifdef CONFIG_X86_USER_SHADOW_STACK 478 + unsigned long features; 479 + unsigned long features_locked; 480 + 481 + struct thread_shstk shstk; 482 + #endif 477 483 478 484 /* Floating point and extended processor state */ 479 485 struct fpu fpu;
+38
arch/x86/include/asm/shstk.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _ASM_X86_SHSTK_H 3 + #define _ASM_X86_SHSTK_H 4 + 5 + #ifndef __ASSEMBLY__ 6 + #include <linux/types.h> 7 + 8 + struct task_struct; 9 + struct ksignal; 10 + 11 + #ifdef CONFIG_X86_USER_SHADOW_STACK 12 + struct thread_shstk { 13 + u64 base; 14 + u64 size; 15 + }; 16 + 17 + long shstk_prctl(struct task_struct *task, int option, unsigned long arg2); 18 + void reset_thread_features(void); 19 + unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags, 20 + unsigned long stack_size); 21 + void shstk_free(struct task_struct *p); 22 + int setup_signal_shadow_stack(struct ksignal *ksig); 23 + int restore_signal_shadow_stack(void); 24 + #else 25 + static inline long shstk_prctl(struct task_struct *task, int option, 26 + unsigned long arg2) { return -EINVAL; } 27 + static inline void reset_thread_features(void) {} 28 + static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p, 29 + unsigned long clone_flags, 30 + unsigned long stack_size) { return 0; } 31 + static inline void shstk_free(struct task_struct *p) {} 32 + static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; } 33 + static inline int restore_signal_shadow_stack(void) { return 0; } 34 + #endif /* CONFIG_X86_USER_SHADOW_STACK */ 35 + 36 + #endif /* __ASSEMBLY__ */ 37 + 38 + #endif /* _ASM_X86_SHSTK_H */
+13
arch/x86/include/asm/special_insns.h
··· 202 202 : [pax] "a" (p)); 203 203 } 204 204 205 + #ifdef CONFIG_X86_USER_SHADOW_STACK 206 + static inline int write_user_shstk_64(u64 __user *addr, u64 val) 207 + { 208 + asm_volatile_goto("1: wrussq %[val], (%[addr])\n" 209 + _ASM_EXTABLE(1b, %l[fail]) 210 + :: [addr] "r" (addr), [val] "r" (val) 211 + :: fail); 212 + return 0; 213 + fail: 214 + return -EFAULT; 215 + } 216 + #endif /* CONFIG_X86_USER_SHADOW_STACK */ 217 + 205 218 #define nop() asm volatile ("nop") 206 219 207 220 static inline void serialize(void)
+2 -1
arch/x86/include/asm/tlbflush.h
··· 306 306 const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT | 307 307 _PAGE_ACCESSED; 308 308 const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 | 309 - _PAGE_SOFTW3 | _PAGE_SOFTW4; 309 + _PAGE_SOFTW3 | _PAGE_SOFTW4 | 310 + _PAGE_SAVED_DIRTY; 310 311 const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT | 311 312 _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT | 312 313 _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
+2
arch/x86/include/asm/trap_pf.h
··· 11 11 * bit 3 == 1: use of reserved bit detected 12 12 * bit 4 == 1: fault was an instruction fetch 13 13 * bit 5 == 1: protection keys block access 14 + * bit 6 == 1: shadow stack access fault 14 15 * bit 15 == 1: SGX MMU page-fault 15 16 */ 16 17 enum x86_pf_error_code { ··· 21 20 X86_PF_RSVD = 1 << 3, 22 21 X86_PF_INSTR = 1 << 4, 23 22 X86_PF_PK = 1 << 5, 23 + X86_PF_SHSTK = 1 << 6, 24 24 X86_PF_SGX = 1 << 15, 25 25 }; 26 26
+14 -1
arch/x86/include/asm/traps.h
··· 18 18 asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs); 19 19 #endif 20 20 21 - extern bool ibt_selftest(void); 21 + extern int ibt_selftest(void); 22 + extern int ibt_selftest_noendbr(void); 22 23 23 24 #ifdef CONFIG_X86_F00F_BUG 24 25 /* For handling the FOOF bug */ ··· 47 46 unsigned long fault_address, 48 47 struct stack_info *info); 49 48 #endif 49 + 50 + static inline void cond_local_irq_enable(struct pt_regs *regs) 51 + { 52 + if (regs->flags & X86_EFLAGS_IF) 53 + local_irq_enable(); 54 + } 55 + 56 + static inline void cond_local_irq_disable(struct pt_regs *regs) 57 + { 58 + if (regs->flags & X86_EFLAGS_IF) 59 + local_irq_disable(); 60 + } 50 61 51 62 #endif /* _ASM_X86_TRAPS_H */
+4
arch/x86/include/uapi/asm/mman.h
··· 3 3 #define _ASM_X86_MMAN_H 4 4 5 5 #define MAP_32BIT 0x40 /* only give out 32bit addresses */ 6 + #define MAP_ABOVE4G 0x80 /* only map above 4GB */ 6 7 7 8 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS 8 9 #define arch_calc_vm_prot_bits(prot, key) ( \ ··· 12 11 ((key) & 0x4 ? VM_PKEY_BIT2 : 0) | \ 13 12 ((key) & 0x8 ? VM_PKEY_BIT3 : 0)) 14 13 #endif 14 + 15 + /* Flags for map_shadow_stack(2) */ 16 + #define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in the shadow stack */ 15 17 16 18 #include <asm-generic/mman.h> 17 19
+12
arch/x86/include/uapi/asm/prctl.h
··· 23 23 #define ARCH_MAP_VDSO_32 0x2002 24 24 #define ARCH_MAP_VDSO_64 0x2003 25 25 26 + /* Don't use 0x3001-0x3004 because of old glibcs */ 27 + 26 28 #define ARCH_GET_UNTAG_MASK 0x4001 27 29 #define ARCH_ENABLE_TAGGED_ADDR 0x4002 28 30 #define ARCH_GET_MAX_TAG_BITS 0x4003 29 31 #define ARCH_FORCE_TAGGED_SVA 0x4004 32 + 33 + #define ARCH_SHSTK_ENABLE 0x5001 34 + #define ARCH_SHSTK_DISABLE 0x5002 35 + #define ARCH_SHSTK_LOCK 0x5003 36 + #define ARCH_SHSTK_UNLOCK 0x5004 37 + #define ARCH_SHSTK_STATUS 0x5005 38 + 39 + /* ARCH_SHSTK_ features bits */ 40 + #define ARCH_SHSTK_SHSTK (1ULL << 0) 41 + #define ARCH_SHSTK_WRSS (1ULL << 1) 30 42 31 43 #endif /* _ASM_X86_PRCTL_H */
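Reviewer note: for anyone wanting to poke at the new prctl numbers from userspace, here is a hedged sketch of a status query. It assumes that ARCH_SHSTK_STATUS takes a pointer to an unsigned long in its second argument and writes the enabled feature bits there, as described in Documentation/arch/x86/shstk.rst; that detail is not visible in this part of the diff. `shstk_enabled` is a hypothetical helper name. The call is expected to fail on kernels or CPUs without shadow stack support, which the helper reports as -1.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Values copied from the uapi/asm/prctl.h hunk in this diff. */
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005
#endif
#ifndef ARCH_SHSTK_SHSTK
#define ARCH_SHSTK_SHSTK (1ULL << 0)
#endif

/*
 * Hypothetical helper: returns 1 if userspace shadow stack is enabled for
 * the calling thread, 0 if it is not, and -1 if the query itself fails
 * (older kernel, unsupported CPU, non-x86 build, or a filtered syscall).
 */
static inline int shstk_enabled(void)
{
	unsigned long features = 0;

#ifdef SYS_arch_prctl
	if (syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &features) != 0)
		return -1;
	return (features & ARCH_SHSTK_SHSTK) ? 1 : 0;
#else
	(void)features;
	return -1;
#endif
}
```

The sketch only queries status on purpose. Calling ARCH_SHSTK_ENABLE from an ordinary C function and then returning from it is likely to fault, since the new shadow stack would not hold that function's return address.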
+5
arch/x86/kernel/Makefile
···
 obj-y += traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
 obj-y += time.o ioport.o dumpstack.o nmi.o
 obj-$(CONFIG_MODIFY_LDT_SYSCALL) += ldt.o
+obj-$(CONFIG_X86_KERNEL_IBT) += ibt_selftest.o
 obj-y += setup.o x86_init.o i8259.o irqinit.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
···
 obj-$(CONFIG_CFI_CLANG) += cfi.o

 obj-$(CONFIG_CALL_THUNKS) += callthunks.o
+
+obj-$(CONFIG_X86_CET) += cet.o
+
+obj-$(CONFIG_X86_USER_SHADOW_STACK) += shstk.o

 ###
 # 64 bit specific files
+131
arch/x86/kernel/cet.c
···
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <asm/bugs.h>
+#include <asm/traps.h>
+
+enum cp_error_code {
+	CP_EC = (1 << 15) - 1,
+
+	CP_RET = 1,
+	CP_IRET = 2,
+	CP_ENDBR = 3,
+	CP_RSTRORSSP = 4,
+	CP_SETSSBSY = 5,
+
+	CP_ENCL = 1 << 15,
+};
+
+static const char cp_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(cp_err))
+		cpec = 0;
+	return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		  user_mode(regs) ? "user mode" : "kernel mode",
+		  cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts so locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
+		return;
+	}
+
+	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_noendbr)) {
+		regs->ax = 0;
+		return;
+	}
+
+	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
+	if (!ibt_fatal) {
+		printk(KERN_DEFAULT CUT_HERE);
+		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+		return;
+	}
+	BUG();
+}
+
+static int __init ibt_setup(char *str)
+{
+	if (!strcmp(str, "off"))
+		setup_clear_cpu_cap(X86_FEATURE_IBT);
+
+	if (!strcmp(str, "warn"))
+		ibt_fatal = false;
+
+	return 1;
+}
+
+__setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
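The error value printed by the pr_emerg() line in cet.c can be decoded outside the kernel with the same masks. The following is a rough userspace sketch of that decoding, assuming only the values shown in the diff (low 15 bits select the cause, bit 15 marks an enclave). The helper cp_in_enclave is a hypothetical convenience and is not part of the patch.

```c
#include <stddef.h>

#define CP_EC   ((1UL << 15) - 1)   /* low 15 bits: cause of the #CP fault */
#define CP_ENCL (1UL << 15)         /* set when the fault happened in an enclave */

static const char *const cp_err[] = {
	"unknown", "near ret", "far/iret", "endbranch", "rstorssp", "setssbsy",
};

/* Same lookup as the kernel helper: out-of-range codes map to "unknown". */
static inline const char *cp_err_string(unsigned long error_code)
{
	unsigned long cpec = error_code & CP_EC;

	if (cpec >= sizeof(cp_err) / sizeof(cp_err[0]))
		cpec = 0;
	return cp_err[cpec];
}

/* Hypothetical helper: report whether the enclave bit is set. */
static inline int cp_in_enclave(unsigned long error_code)
{
	return (error_code & CP_ENCL) != 0;
}
```

For instance, an error value of 0x8001 would decode as a near-RET violation inside an enclave.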
+27 -8
arch/x86/kernel/cpu/common.c
···

 static __always_inline void setup_cet(struct cpuinfo_x86 *c)
 {
-	u64 msr = CET_ENDBR_EN;
+	bool user_shstk, kernel_ibt;

-	if (!HAS_KERNEL_IBT ||
-	    !cpu_feature_enabled(X86_FEATURE_IBT))
+	if (!IS_ENABLED(CONFIG_X86_CET))
 		return;

-	wrmsrl(MSR_IA32_S_CET, msr);
+	kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
+	user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+		     IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK);
+
+	if (!kernel_ibt && !user_shstk)
+		return;
+
+	if (user_shstk)
+		set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
+
+	if (kernel_ibt)
+		wrmsrl(MSR_IA32_S_CET, CET_ENDBR_EN);
+	else
+		wrmsrl(MSR_IA32_S_CET, 0);
+
 	cr4_set_bits(X86_CR4_CET);

-	if (!ibt_selftest()) {
+	if (kernel_ibt && ibt_selftest()) {
 		pr_err("IBT selftest: Failed!\n");
 		wrmsrl(MSR_IA32_S_CET, 0);
 		setup_clear_cpu_cap(X86_FEATURE_IBT);
-		return;
 	}
 }

 __noendbr void cet_disable(void)
 {
-	if (cpu_feature_enabled(X86_FEATURE_IBT))
-		wrmsrl(MSR_IA32_S_CET, 0);
+	if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
+	      cpu_feature_enabled(X86_FEATURE_SHSTK)))
+		return;
+
+	wrmsrl(MSR_IA32_S_CET, 0);
+	wrmsrl(MSR_IA32_U_CET, 0);
 }

 /*
···

 	if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
 		setup_clear_cpu_cap(X86_FEATURE_XSAVES);
+
+	if (cmdline_find_option_bool(boot_command_line, "nousershstk"))
+		setup_clear_cpu_cap(X86_FEATURE_USER_SHSTK);

 	arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg));
 	if (arglen <= 0)
+1
arch/x86/kernel/cpu/cpuid-deps.c
···
 	{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
 	{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
 	{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+	{ X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
 	{}
 };

+23
arch/x86/kernel/cpu/proc.c
···
 #include <linux/string.h>
 #include <linux/seq_file.h>
 #include <linux/cpufreq.h>
+#include <asm/prctl.h>
+#include <linux/proc_fs.h>

 #include "cpu.h"

···
 	.stop = c_stop,
 	.show = show_cpuinfo,
 };
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static void dump_x86_features(struct seq_file *m, unsigned long features)
+{
+	if (features & ARCH_SHSTK_SHSTK)
+		seq_puts(m, "shstk ");
+	if (features & ARCH_SHSTK_WRSS)
+		seq_puts(m, "wrss ");
+}
+
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task)
+{
+	seq_puts(m, "x86_Thread_features:\t");
+	dump_x86_features(m, task->thread.features);
+	seq_putc(m, '\n');
+
+	seq_puts(m, "x86_Thread_features_locked:\t");
+	dump_x86_features(m, task->thread.features_locked);
+	seq_putc(m, '\n');
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
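To illustrate what the two new /proc/$PID/status fields contain, here is a small userspace sketch that renders a feature mask the way dump_x86_features() does. The function name format_x86_features and its buffer interface are hypothetical. Only the bit values and the "shstk " / "wrss " strings come from the diff.

```c
#include <stdio.h>
#include <string.h>

#define ARCH_SHSTK_SHSTK (1ULL << 0)
#define ARCH_SHSTK_WRSS  (1ULL << 1)

/*
 * Hypothetical helper: write the space-terminated feature names for
 * 'features' into buf, in the same order the kernel prints them.
 */
static inline void format_x86_features(unsigned long long features,
				       char *buf, size_t len)
{
	snprintf(buf, len, "%s%s",
		 (features & ARCH_SHSTK_SHSTK) ? "shstk " : "",
		 (features & ARCH_SHSTK_WRSS) ? "wrss " : "");
}
```

A task with both features enabled would therefore show a line of the form "x86_Thread_features:" followed by a tab and "shstk wrss ".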
+53 -1
arch/x86/kernel/fpu/core.c
···
 	}
 }

+/* A passed ssp of zero will not cause any update */
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	struct cet_user_state *xstate;
+
+	/* If ssp update is not needed. */
+	if (!ssp)
+		return 0;
+
+	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+				XFEATURE_CET_USER);
+
+	/*
+	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+	 * stack and the fpu state should be up to date since it was just copied
+	 * from the parent in fpu_clone(). So there must be a valid non-init CET
+	 * state location in the buffer.
+	 */
+	if (WARN_ON_ONCE(!xstate))
+		return 1;
+
+	xstate->user_ssp = (u64)ssp;
+#endif
+	return 0;
+}
+
 /* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+	      unsigned long ssp)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
···
 	 */
 	if (use_xsave())
 		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
+
+	/*
+	 * Update shadow stack pointer, in case it changed during clone.
+	 */
+	if (update_fpu_shstk(dst, ssp))
+		return 1;

 	trace_x86_fpu_copy_src(src_fpu);
 	trace_x86_fpu_copy_dst(dst_fpu);
···
 	fpregs_restore_userregs();
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
+
+void fpregs_lock_and_load(void)
+{
+	/*
+	 * fpregs_lock() only disables preemption (mostly). So modifying state
+	 * in an interrupt could screw up some in progress fpregs operation.
+	 * Warn about it.
+	 */
+	WARN_ON_ONCE(!irq_fpu_usable());
+	WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+	fpregs_lock();
+
+	fpregs_assert_state_consistent();
+
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		fpregs_restore_userregs();
+}

 #ifdef CONFIG_X86_DEBUG_FPU
 /*
+81
arch/x86/kernel/fpu/regset.c
···
 #include <asm/fpu/api.h>
 #include <asm/fpu/signal.h>
 #include <asm/fpu/regset.h>
+#include <asm/prctl.h>

 #include "context.h"
 #include "internal.h"
···
 	vfree(tmpbuf);
 	return ret;
 }
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+int ssp_active(struct task_struct *target, const struct user_regset *regset)
+{
+	if (target->thread.features & ARCH_SHSTK_SHSTK)
+		return regset->n;
+
+	return 0;
+}
+
+int ssp_get(struct task_struct *target, const struct user_regset *regset,
+	    struct membuf to)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct cet_user_state *cetregs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -ENODEV;
+
+	sync_fpstate(fpu);
+	cetregs = get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER);
+	if (WARN_ON(!cetregs)) {
+		/*
+		 * This shouldn't ever be NULL because shadow stack was
+		 * verified to be enabled above. This means
+		 * MSR_IA32_U_CET.CET_SHSTK_EN should be 1 and so
+		 * XFEATURE_CET_USER should not be in the init state.
+		 */
+		return -ENODEV;
+	}
+
+	return membuf_write(&to, (unsigned long *)&cetregs->user_ssp,
+			    sizeof(cetregs->user_ssp));
+}
+
+int ssp_set(struct task_struct *target, const struct user_regset *regset,
+	    unsigned int pos, unsigned int count,
+	    const void *kbuf, const void __user *ubuf)
+{
+	struct fpu *fpu = &target->thread.fpu;
+	struct xregs_state *xsave = &fpu->fpstate->regs.xsave;
+	struct cet_user_state *cetregs;
+	unsigned long user_ssp;
+	int r;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+	    !ssp_active(target, regset))
+		return -ENODEV;
+
+	if (pos != 0 || count != sizeof(user_ssp))
+		return -EINVAL;
+
+	r = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_ssp, 0, -1);
+	if (r)
+		return r;
+
+	/*
+	 * Some kernel instructions (IRET, etc) can cause exceptions in the case
+	 * of disallowed CET register values. Just prevent invalid values.
+	 */
+	if (user_ssp >= TASK_SIZE_MAX || !IS_ALIGNED(user_ssp, 8))
+		return -EINVAL;
+
+	fpu_force_restore(fpu);
+
+	cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
+	if (WARN_ON(!cetregs)) {
+		/*
+		 * This shouldn't ever be NULL because shadow stack was
+		 * verified to be enabled above. This means
+		 * MSR_IA32_U_CET.CET_SHSTK_EN should be 1 and so
+		 * XFEATURE_CET_USER should not be in the init state.
+		 */
+		return -ENODEV;
+	}
+
+	cetregs->user_ssp = user_ssp;
+	return 0;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */

 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION

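The value check in ssp_set() is what keeps a ptrace writer from installing an SSP that would fault later in kernel mode. The following is a minimal sketch of that predicate as a standalone function. SKETCH_TASK_SIZE_MAX is an assumption (the 4-level-paging value, 2^47 minus one 4 KiB page); the real TASK_SIZE_MAX depends on the paging mode, and the name ssp_value_ok is hypothetical.

```c
#include <stdint.h>

/* Assumption: 4-level paging with 4 KiB pages; not taken from the diff. */
#define SKETCH_TASK_SIZE_MAX ((1ULL << 47) - 4096ULL)

/* Hypothetical helper: nonzero when the SSP would pass ssp_set()'s check. */
static inline int ssp_value_ok(uint64_t user_ssp)
{
	/* Must be a userspace address. */
	if (user_ssp >= SKETCH_TASK_SIZE_MAX)
		return 0;

	/* Must be 8-byte aligned. */
	if (user_ssp & 7)
		return 0;

	return 1;
}
```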
+43 -47
arch/x86/kernel/fpu/xstate.c
···
 */
 static const char *xfeature_names[] =
 {
-	"x87 floating point registers" ,
-	"SSE registers" ,
-	"AVX registers" ,
-	"MPX bounds registers" ,
-	"MPX CSR" ,
-	"AVX-512 opmask" ,
-	"AVX-512 Hi256" ,
-	"AVX-512 ZMM_Hi256" ,
-	"Processor Trace (unused)" ,
+	"x87 floating point registers",
+	"SSE registers",
+	"AVX registers",
+	"MPX bounds registers",
+	"MPX CSR",
+	"AVX-512 opmask",
+	"AVX-512 Hi256",
+	"AVX-512 ZMM_Hi256",
+	"Processor Trace (unused)",
 	"Protection Keys User registers",
 	"PASID state",
-	"unknown xstate feature" ,
-	"unknown xstate feature" ,
-	"unknown xstate feature" ,
-	"unknown xstate feature" ,
-	"unknown xstate feature" ,
-	"unknown xstate feature" ,
-	"AMX Tile config" ,
-	"AMX Tile data" ,
-	"unknown xstate feature" ,
+	"Control-flow User registers",
+	"Control-flow Kernel registers (unused)",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"unknown xstate feature",
+	"AMX Tile config",
+	"AMX Tile data",
+	"unknown xstate feature",
 };

 static unsigned short xsave_cpuid_features[] __initdata = {
···
 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
 	[XFEATURE_PKRU] = X86_FEATURE_PKU,
 	[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+	[XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
 	[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
 	[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
 };
···
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_CET_USER);
 	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
 	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
···
 	 XFEATURE_MASK_BNDREGS | \
 	 XFEATURE_MASK_BNDCSR | \
 	 XFEATURE_MASK_PASID | \
+	 XFEATURE_MASK_CET_USER | \
 	 XFEATURE_MASK_XTILE)

 /*
···
 	} \
 } while (0)

-#define XCHECK_SZ(sz, nr, nr_macro, __struct) do { \
-	if ((nr == nr_macro) && \
-	    WARN_ONCE(sz != sizeof(__struct), \
-		"%s: struct is %zu bytes, cpu state %d bytes\n", \
-		__stringify(nr_macro), sizeof(__struct), sz)) { \
+#define XCHECK_SZ(sz, nr, __struct) ({ \
+	if (WARN_ONCE(sz != sizeof(__struct), \
+	    "[%s]: struct is %zu bytes, cpu state %d bytes\n", \
+	    xfeature_names[nr], sizeof(__struct), sz)) { \
 		__xstate_dump_leaves(); \
 	} \
-} while (0)
+	true; \
+})
+

 /**
  * check_xtile_data_against_struct - Check tile data state size.
···
 	 * Ask the CPU for the size of the state.
 	 */
 	int sz = xfeature_size(nr);
+
 	/*
 	 * Match each CPU state with the corresponding software
 	 * structure.
 	 */
-	XCHECK_SZ(sz, nr, XFEATURE_YMM, struct ymmh_struct);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDREGS, struct mpx_bndreg_state);
-	XCHECK_SZ(sz, nr, XFEATURE_BNDCSR, struct mpx_bndcsr_state);
-	XCHECK_SZ(sz, nr, XFEATURE_OPMASK, struct avx_512_opmask_state);
-	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
-	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
-	XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
-	XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
-
-	/* The tile data size varies between implementations. */
-	if (nr == XFEATURE_XTILE_DATA)
-		check_xtile_data_against_struct(sz);
-
-	/*
-	 * Make *SURE* to add any feature numbers in below if
-	 * there are "holes" in the xsave state component
-	 * numbers.
-	 */
-	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX) ||
-	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
+	switch (nr) {
+	case XFEATURE_YMM: return XCHECK_SZ(sz, nr, struct ymmh_struct);
+	case XFEATURE_BNDREGS: return XCHECK_SZ(sz, nr, struct mpx_bndreg_state);
+	case XFEATURE_BNDCSR: return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state);
+	case XFEATURE_OPMASK: return XCHECK_SZ(sz, nr, struct avx_512_opmask_state);
+	case XFEATURE_ZMM_Hi256: return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state);
+	case XFEATURE_Hi16_ZMM: return XCHECK_SZ(sz, nr, struct avx_512_hi16_state);
+	case XFEATURE_PKRU: return XCHECK_SZ(sz, nr, struct pkru_state);
+	case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
+	case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
+	case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+	case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
+	default:
 		XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
 		return false;
 	}
+
 	return true;
 }

+17
arch/x86/kernel/ibt_selftest.S
···
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <linux/objtool.h>
+#include <asm/nospec-branch.h>
+
+SYM_CODE_START(ibt_selftest_noendbr)
+	ANNOTATE_NOENDBR
+	UNWIND_HINT_FUNC
+	/* #CP handler sets %ax to 0 */
+	RET
+SYM_CODE_END(ibt_selftest_noendbr)
+
+SYM_FUNC_START(ibt_selftest)
+	lea ibt_selftest_noendbr(%rip), %rax
+	ANNOTATE_RETPOLINE_SAFE
+	jmp *%rax
+SYM_FUNC_END(ibt_selftest)
+1 -1
arch/x86/kernel/idt.c
···
 	ISTG(X86_TRAP_MC, asm_exc_machine_check, IST_INDEX_MCE),
 #endif

-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP, asm_exc_control_protection),
 #endif

+20 -1
arch/x86/kernel/process.c
···
 #include <asm/unwind.h>
 #include <asm/tdx.h>
 #include <asm/mmu_context.h>
+#include <asm/shstk.h>

 #include "process.h"

···

 	free_vm86(t);

+	shstk_free(tsk);
 	fpu__drop(fpu);
 }

···
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
+	unsigned long new_ssp;
 	int ret = 0;

 	childregs = task_pt_regs(p);
···
 	frame->flags = X86_EFLAGS_FIXED;
 #endif

-	fpu_clone(p, clone_flags, args->fn);
+	/*
+	 * Allocate a new shadow stack for thread if needed. If shadow stack,
+	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
+	 * update it.
+	 */
+	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	if (IS_ERR_VALUE(new_ssp))
+		return PTR_ERR((void *)new_ssp);
+
+	fpu_clone(p, clone_flags, args->fn, new_ssp);

 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {
···

 	if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
 		io_bitmap_share(p);
+
+	/*
+	 * If copy_thread() if failing, don't leak the shadow stack possibly
+	 * allocated in shstk_alloc_thread_stack() above.
+	 */
+	if (ret)
+		shstk_free(p);

 	return ret;
 }
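Whether copy_thread() ends up with a non-zero new_ssp is decided by the clone-flag test inside shstk_alloc_thread_stack() (added in shstk.c by this merge). The following is a small sketch of that decision as a pure function. The SKETCH_CLONE_* values are the Linux UAPI flag values, redefined locally so the block is self-contained, and the function name is hypothetical.

```c
/* Linux UAPI clone flag values, redefined locally for a self-contained sketch. */
#define SKETCH_CLONE_VM    0x00000100UL
#define SKETCH_CLONE_VFORK 0x00004000UL

/*
 * Hypothetical helper: nonzero when the child gets a freshly allocated
 * shadow stack. Only a CLONE_VM child that is not a vfork child (i.e. a
 * thread) needs one, and only if the parent has shadow stack enabled.
 */
static inline int child_needs_new_shadow_stack(int parent_shstk_enabled,
					       unsigned long clone_flags)
{
	if (!parent_shstk_enabled)
		return 0;

	return (clone_flags & (SKETCH_CLONE_VFORK | SKETCH_CLONE_VM)) ==
	       SKETCH_CLONE_VM;
}
```

In the other cases new_ssp stays 0, so update_fpu_shstk() leaves the SSP copied from the parent untouched.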
+8
arch/x86/kernel/process_64.c
···
 		load_gs_index(__USER_DS);
 	}

+	reset_thread_features();
+
 	loadsegment(fs, 0);
 	loadsegment(es, _ds);
 	loadsegment(ds, _ds);
···
 		else
 			return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
 #endif
+	case ARCH_SHSTK_ENABLE:
+	case ARCH_SHSTK_DISABLE:
+	case ARCH_SHSTK_LOCK:
+	case ARCH_SHSTK_UNLOCK:
+	case ARCH_SHSTK_STATUS:
+		return shstk_prctl(task, option, arg2);
 	default:
 		ret = -EINVAL;
 		break;
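For ARCH_SHSTK_ENABLE and ARCH_SHSTK_DISABLE issued by a task on itself, shstk_prctl() (in shstk.c from this merge) applies two checks before touching any state. The sketch below reproduces just those two checks and their order. The function name shstk_prctl_precheck is hypothetical, and __builtin_popcountl (a GCC/Clang builtin) stands in for the kernel's hweight_long().

```c
#include <errno.h>

/*
 * Hypothetical helper covering only the self-directed ENABLE/DISABLE path:
 * locked features are refused first, then requests naming more than one
 * feature bit.
 */
static inline int shstk_prctl_precheck(unsigned long features,
				       unsigned long features_locked)
{
	/* Do not allow changing locked features. */
	if (features & features_locked)
		return -EPERM;

	/* Only one feature may be enabled or disabled per call. */
	if (__builtin_popcountl(features) > 1)
		return -EINVAL;

	return 0;
}
```

Because the lock check runs first, a request naming two features of which one is locked reports -EPERM rather than -EINVAL.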
+12
arch/x86/kernel/ptrace.c
···
 	REGSET64_FP,
 	REGSET64_IOPERM,
 	REGSET64_XSTATE,
+	REGSET64_SSP,
 };

 #define REGSET_GENERAL \
···
 		.active = ioperm_active,
 		.regset_get = ioperm_get
 	},
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+	[REGSET64_SSP] = {
+		.core_note_type = NT_X86_SHSTK,
+		.n = 1,
+		.size = sizeof(u64),
+		.align = sizeof(u64),
+		.active = ssp_active,
+		.regset_get = ssp_get,
+		.set = ssp_set
+	},
+#endif
 };

 static const struct user_regset_view user_x86_64_view = {
+550
arch/x86/kernel/shstk.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * shstk.c - Intel shadow stack support 4 + * 5 + * Copyright (c) 2021, Intel Corporation. 6 + * Yu-cheng Yu <yu-cheng.yu@intel.com> 7 + */ 8 + 9 + #include <linux/sched.h> 10 + #include <linux/bitops.h> 11 + #include <linux/types.h> 12 + #include <linux/mm.h> 13 + #include <linux/mman.h> 14 + #include <linux/slab.h> 15 + #include <linux/uaccess.h> 16 + #include <linux/sched/signal.h> 17 + #include <linux/compat.h> 18 + #include <linux/sizes.h> 19 + #include <linux/user.h> 20 + #include <linux/syscalls.h> 21 + #include <asm/msr.h> 22 + #include <asm/fpu/xstate.h> 23 + #include <asm/fpu/types.h> 24 + #include <asm/shstk.h> 25 + #include <asm/special_insns.h> 26 + #include <asm/fpu/api.h> 27 + #include <asm/prctl.h> 28 + 29 + #define SS_FRAME_SIZE 8 30 + 31 + static bool features_enabled(unsigned long features) 32 + { 33 + return current->thread.features & features; 34 + } 35 + 36 + static void features_set(unsigned long features) 37 + { 38 + current->thread.features |= features; 39 + } 40 + 41 + static void features_clr(unsigned long features) 42 + { 43 + current->thread.features &= ~features; 44 + } 45 + 46 + /* 47 + * Create a restore token on the shadow stack. A token is always 8-byte 48 + * and aligned to 8. 49 + */ 50 + static int create_rstor_token(unsigned long ssp, unsigned long *token_addr) 51 + { 52 + unsigned long addr; 53 + 54 + /* Token must be aligned */ 55 + if (!IS_ALIGNED(ssp, 8)) 56 + return -EINVAL; 57 + 58 + addr = ssp - SS_FRAME_SIZE; 59 + 60 + /* 61 + * SSP is aligned, so reserved bits and mode bit are a zero, just mark 62 + * the token 64-bit. 63 + */ 64 + ssp |= BIT(0); 65 + 66 + if (write_user_shstk_64((u64 __user *)addr, (u64)ssp)) 67 + return -EFAULT; 68 + 69 + if (token_addr) 70 + *token_addr = addr; 71 + 72 + return 0; 73 + } 74 + 75 + /* 76 + * VM_SHADOW_STACK will have a guard page. This helps userspace protect 77 + * itself from attacks. 
The reasoning is as follows: 78 + * 79 + * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The 80 + * INCSSP instruction can increment the shadow stack pointer. It is the 81 + * shadow stack analog of an instruction like: 82 + * 83 + * addq $0x80, %rsp 84 + * 85 + * However, there is one important difference between an ADD on %rsp 86 + * and INCSSP. In addition to modifying SSP, INCSSP also reads from the 87 + * memory of the first and last elements that were "popped". It can be 88 + * thought of as acting like this: 89 + * 90 + * READ_ONCE(ssp); // read+discard top element on stack 91 + * ssp += nr_to_pop * 8; // move the shadow stack 92 + * READ_ONCE(ssp-8); // read+discard last popped stack element 93 + * 94 + * The maximum distance INCSSP can move the SSP is 2040 bytes, before 95 + * it would read the memory. Therefore a single page gap will be enough 96 + * to prevent any operation from shifting the SSP to an adjacent stack, 97 + * since it would have to land in the gap at least once, causing a 98 + * fault. 
99 + */ 100 + static unsigned long alloc_shstk(unsigned long addr, unsigned long size, 101 + unsigned long token_offset, bool set_res_tok) 102 + { 103 + int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G; 104 + struct mm_struct *mm = current->mm; 105 + unsigned long mapped_addr, unused; 106 + 107 + if (addr) 108 + flags |= MAP_FIXED_NOREPLACE; 109 + 110 + mmap_write_lock(mm); 111 + mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags, 112 + VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); 113 + mmap_write_unlock(mm); 114 + 115 + if (!set_res_tok || IS_ERR_VALUE(mapped_addr)) 116 + goto out; 117 + 118 + if (create_rstor_token(mapped_addr + token_offset, NULL)) { 119 + vm_munmap(mapped_addr, size); 120 + return -EINVAL; 121 + } 122 + 123 + out: 124 + return mapped_addr; 125 + } 126 + 127 + static unsigned long adjust_shstk_size(unsigned long size) 128 + { 129 + if (size) 130 + return PAGE_ALIGN(size); 131 + 132 + return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G)); 133 + } 134 + 135 + static void unmap_shadow_stack(u64 base, u64 size) 136 + { 137 + int r; 138 + 139 + r = vm_munmap(base, size); 140 + 141 + /* 142 + * mmap_write_lock_killable() failed with -EINTR. This means 143 + * the process is about to die and have it's MM cleaned up. 144 + * This task shouldn't ever make it back to userspace. In this 145 + * case it is ok to leak a shadow stack, so just exit out. 146 + */ 147 + if (r == -EINTR) 148 + return; 149 + 150 + /* 151 + * For all other types of vm_munmap() failure, either the 152 + * system is out of memory or there is bug. 
153 + */ 154 + WARN_ON_ONCE(r); 155 + } 156 + 157 + static int shstk_setup(void) 158 + { 159 + struct thread_shstk *shstk = &current->thread.shstk; 160 + unsigned long addr, size; 161 + 162 + /* Already enabled */ 163 + if (features_enabled(ARCH_SHSTK_SHSTK)) 164 + return 0; 165 + 166 + /* Also not supported for 32 bit and x32 */ 167 + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall()) 168 + return -EOPNOTSUPP; 169 + 170 + size = adjust_shstk_size(0); 171 + addr = alloc_shstk(0, size, 0, false); 172 + if (IS_ERR_VALUE(addr)) 173 + return PTR_ERR((void *)addr); 174 + 175 + fpregs_lock_and_load(); 176 + wrmsrl(MSR_IA32_PL3_SSP, addr + size); 177 + wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN); 178 + fpregs_unlock(); 179 + 180 + shstk->base = addr; 181 + shstk->size = size; 182 + features_set(ARCH_SHSTK_SHSTK); 183 + 184 + return 0; 185 + } 186 + 187 + void reset_thread_features(void) 188 + { 189 + memset(&current->thread.shstk, 0, sizeof(struct thread_shstk)); 190 + current->thread.features = 0; 191 + current->thread.features_locked = 0; 192 + } 193 + 194 + unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags, 195 + unsigned long stack_size) 196 + { 197 + struct thread_shstk *shstk = &tsk->thread.shstk; 198 + unsigned long addr, size; 199 + 200 + /* 201 + * If shadow stack is not enabled on the new thread, skip any 202 + * switch to a new shadow stack. 203 + */ 204 + if (!features_enabled(ARCH_SHSTK_SHSTK)) 205 + return 0; 206 + 207 + /* 208 + * For CLONE_VM, except vfork, the child needs a separate shadow 209 + * stack. 
210 + */ 211 + if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM) 212 + return 0; 213 + 214 + size = adjust_shstk_size(stack_size); 215 + addr = alloc_shstk(0, size, 0, false); 216 + if (IS_ERR_VALUE(addr)) 217 + return addr; 218 + 219 + shstk->base = addr; 220 + shstk->size = size; 221 + 222 + return addr + size; 223 + } 224 + 225 + static unsigned long get_user_shstk_addr(void) 226 + { 227 + unsigned long long ssp; 228 + 229 + fpregs_lock_and_load(); 230 + 231 + rdmsrl(MSR_IA32_PL3_SSP, ssp); 232 + 233 + fpregs_unlock(); 234 + 235 + return ssp; 236 + } 237 + 238 + #define SHSTK_DATA_BIT BIT(63) 239 + 240 + static int put_shstk_data(u64 __user *addr, u64 data) 241 + { 242 + if (WARN_ON_ONCE(data & SHSTK_DATA_BIT)) 243 + return -EINVAL; 244 + 245 + /* 246 + * Mark the high bit so that the sigframe can't be processed as a 247 + * return address. 248 + */ 249 + if (write_user_shstk_64(addr, data | SHSTK_DATA_BIT)) 250 + return -EFAULT; 251 + return 0; 252 + } 253 + 254 + static int get_shstk_data(unsigned long *data, unsigned long __user *addr) 255 + { 256 + unsigned long ldata; 257 + 258 + if (unlikely(get_user(ldata, addr))) 259 + return -EFAULT; 260 + 261 + if (!(ldata & SHSTK_DATA_BIT)) 262 + return -EINVAL; 263 + 264 + *data = ldata & ~SHSTK_DATA_BIT; 265 + 266 + return 0; 267 + } 268 + 269 + static int shstk_push_sigframe(unsigned long *ssp) 270 + { 271 + unsigned long target_ssp = *ssp; 272 + 273 + /* Token must be aligned */ 274 + if (!IS_ALIGNED(target_ssp, 8)) 275 + return -EINVAL; 276 + 277 + *ssp -= SS_FRAME_SIZE; 278 + if (put_shstk_data((void __user *)*ssp, target_ssp)) 279 + return -EFAULT; 280 + 281 + return 0; 282 + } 283 + 284 + static int shstk_pop_sigframe(unsigned long *ssp) 285 + { 286 + struct vm_area_struct *vma; 287 + unsigned long token_addr; 288 + bool need_to_check_vma; 289 + int err = 1; 290 + 291 + /* 292 + * It is possible for the SSP to be off the end of a shadow stack by 4 293 + * or 8 bytes. 
If the shadow stack is at the start of a page or 4 bytes 294 + * before it, it might be this case, so check that the address being 295 + * read is actually shadow stack. 296 + */ 297 + if (!IS_ALIGNED(*ssp, 8)) 298 + return -EINVAL; 299 + 300 + need_to_check_vma = PAGE_ALIGN(*ssp) == *ssp; 301 + 302 + if (need_to_check_vma) 303 + mmap_read_lock_killable(current->mm); 304 + 305 + err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp); 306 + if (unlikely(err)) 307 + goto out_err; 308 + 309 + if (need_to_check_vma) { 310 + vma = find_vma(current->mm, *ssp); 311 + if (!vma || !(vma->vm_flags & VM_SHADOW_STACK)) { 312 + err = -EFAULT; 313 + goto out_err; 314 + } 315 + 316 + mmap_read_unlock(current->mm); 317 + } 318 + 319 + /* Restore SSP aligned? */ 320 + if (unlikely(!IS_ALIGNED(token_addr, 8))) 321 + return -EINVAL; 322 + 323 + /* SSP in userspace? */ 324 + if (unlikely(token_addr >= TASK_SIZE_MAX)) 325 + return -EINVAL; 326 + 327 + *ssp = token_addr; 328 + 329 + return 0; 330 + out_err: 331 + if (need_to_check_vma) 332 + mmap_read_unlock(current->mm); 333 + return err; 334 + } 335 + 336 + int setup_signal_shadow_stack(struct ksignal *ksig) 337 + { 338 + void __user *restorer = ksig->ka.sa.sa_restorer; 339 + unsigned long ssp; 340 + int err; 341 + 342 + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || 343 + !features_enabled(ARCH_SHSTK_SHSTK)) 344 + return 0; 345 + 346 + if (!restorer) 347 + return -EINVAL; 348 + 349 + ssp = get_user_shstk_addr(); 350 + if (unlikely(!ssp)) 351 + return -EINVAL; 352 + 353 + err = shstk_push_sigframe(&ssp); 354 + if (unlikely(err)) 355 + return err; 356 + 357 + /* Push restorer address */ 358 + ssp -= SS_FRAME_SIZE; 359 + err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer); 360 + if (unlikely(err)) 361 + return -EFAULT; 362 + 363 + fpregs_lock_and_load(); 364 + wrmsrl(MSR_IA32_PL3_SSP, ssp); 365 + fpregs_unlock(); 366 + 367 + return 0; 368 + } 369 + 370 + int restore_signal_shadow_stack(void) 371 + { 372 + unsigned 
long ssp; 373 + int err; 374 + 375 + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || 376 + !features_enabled(ARCH_SHSTK_SHSTK)) 377 + return 0; 378 + 379 + ssp = get_user_shstk_addr(); 380 + if (unlikely(!ssp)) 381 + return -EINVAL; 382 + 383 + err = shstk_pop_sigframe(&ssp); 384 + if (unlikely(err)) 385 + return err; 386 + 387 + fpregs_lock_and_load(); 388 + wrmsrl(MSR_IA32_PL3_SSP, ssp); 389 + fpregs_unlock(); 390 + 391 + return 0; 392 + } 393 + 394 + void shstk_free(struct task_struct *tsk) 395 + { 396 + struct thread_shstk *shstk = &tsk->thread.shstk; 397 + 398 + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || 399 + !features_enabled(ARCH_SHSTK_SHSTK)) 400 + return; 401 + 402 + /* 403 + * When fork() with CLONE_VM fails, the child (tsk) already has a 404 + * shadow stack allocated, and exit_thread() calls this function to 405 + * free it. In this case the parent (current) and the child share 406 + * the same mm struct. 407 + */ 408 + if (!tsk->mm || tsk->mm != current->mm) 409 + return; 410 + 411 + unmap_shadow_stack(shstk->base, shstk->size); 412 + } 413 + 414 + static int wrss_control(bool enable) 415 + { 416 + u64 msrval; 417 + 418 + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) 419 + return -EOPNOTSUPP; 420 + 421 + /* 422 + * Only enable WRSS if shadow stack is enabled. If shadow stack is not 423 + * enabled, WRSS will already be disabled, so don't bother clearing it 424 + * when disabling. 425 + */ 426 + if (!features_enabled(ARCH_SHSTK_SHSTK)) 427 + return -EPERM; 428 + 429 + /* Already enabled/disabled? 
*/ 430 + if (features_enabled(ARCH_SHSTK_WRSS) == enable) 431 + return 0; 432 + 433 + fpregs_lock_and_load(); 434 + rdmsrl(MSR_IA32_U_CET, msrval); 435 + 436 + if (enable) { 437 + features_set(ARCH_SHSTK_WRSS); 438 + msrval |= CET_WRSS_EN; 439 + } else { 440 + features_clr(ARCH_SHSTK_WRSS); 441 + if (!(msrval & CET_WRSS_EN)) 442 + goto unlock; 443 + 444 + msrval &= ~CET_WRSS_EN; 445 + } 446 + 447 + wrmsrl(MSR_IA32_U_CET, msrval); 448 + 449 + unlock: 450 + fpregs_unlock(); 451 + 452 + return 0; 453 + } 454 + 455 + static int shstk_disable(void) 456 + { 457 + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) 458 + return -EOPNOTSUPP; 459 + 460 + /* Already disabled? */ 461 + if (!features_enabled(ARCH_SHSTK_SHSTK)) 462 + return 0; 463 + 464 + fpregs_lock_and_load(); 465 + /* Disable WRSS too when disabling shadow stack */ 466 + wrmsrl(MSR_IA32_U_CET, 0); 467 + wrmsrl(MSR_IA32_PL3_SSP, 0); 468 + fpregs_unlock(); 469 + 470 + shstk_free(current); 471 + features_clr(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS); 472 + 473 + return 0; 474 + } 475 + 476 + SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags) 477 + { 478 + bool set_tok = flags & SHADOW_STACK_SET_TOKEN; 479 + unsigned long aligned_size; 480 + 481 + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) 482 + return -EOPNOTSUPP; 483 + 484 + if (flags & ~SHADOW_STACK_SET_TOKEN) 485 + return -EINVAL; 486 + 487 + /* If there isn't space for a token */ 488 + if (set_tok && size < 8) 489 + return -ENOSPC; 490 + 491 + if (addr && addr < SZ_4G) 492 + return -ERANGE; 493 + 494 + /* 495 + * An overflow would result in attempting to write the restore token 496 + * to the wrong location. Not catastrophic, but just return the right 497 + * error code and block it. 
498 + */ 499 + aligned_size = PAGE_ALIGN(size); 500 + if (aligned_size < size) 501 + return -EOVERFLOW; 502 + 503 + return alloc_shstk(addr, aligned_size, size, set_tok); 504 + } 505 + 506 + long shstk_prctl(struct task_struct *task, int option, unsigned long arg2) 507 + { 508 + unsigned long features = arg2; 509 + 510 + if (option == ARCH_SHSTK_STATUS) { 511 + return put_user(task->thread.features, (unsigned long __user *)arg2); 512 + } 513 + 514 + if (option == ARCH_SHSTK_LOCK) { 515 + task->thread.features_locked |= features; 516 + return 0; 517 + } 518 + 519 + /* Only allow via ptrace */ 520 + if (task != current) { 521 + if (option == ARCH_SHSTK_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) { 522 + task->thread.features_locked &= ~features; 523 + return 0; 524 + } 525 + return -EINVAL; 526 + } 527 + 528 + /* Do not allow to change locked features */ 529 + if (features & task->thread.features_locked) 530 + return -EPERM; 531 + 532 + /* Only support enabling/disabling one feature at a time. */ 533 + if (hweight_long(features) > 1) 534 + return -EINVAL; 535 + 536 + if (option == ARCH_SHSTK_DISABLE) { 537 + if (features & ARCH_SHSTK_WRSS) 538 + return wrss_control(false); 539 + if (features & ARCH_SHSTK_SHSTK) 540 + return shstk_disable(); 541 + return -EINVAL; 542 + } 543 + 544 + /* Handle ARCH_SHSTK_ENABLE */ 545 + if (features & ARCH_SHSTK_SHSTK) 546 + return shstk_setup(); 547 + if (features & ARCH_SHSTK_WRSS) 548 + return wrss_control(true); 549 + return -EINVAL; 550 + }
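The argument checks in the map_shadow_stack syscall above run in a fixed order, and that order decides which error userspace sees. A minimal userspace sketch of the ordering follows. It is a model, not kernel code: `model_map_shadow_stack_check()`, `MODEL_PAGE_SIZE` and `MODEL_SZ_4G` are hypothetical names, a 4 KiB page size is assumed, and the CPU feature check and the allocation itself are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* value used by the selftest */
#define MODEL_PAGE_SIZE		4096ULL		/* assumption: 4 KiB pages */
#define MODEL_SZ_4G		(1ULL << 32)

/* Returns 0 if the arguments would be passed on to alloc_shstk(). */
int model_map_shadow_stack_check(uint64_t addr, uint64_t size, uint64_t flags)
{
	uint64_t aligned_size;

	/* Only SHADOW_STACK_SET_TOKEN is a known flag. */
	if (flags & ~SHADOW_STACK_SET_TOKEN)
		return -EINVAL;

	/* A restore token needs 8 bytes of space. */
	if ((flags & SHADOW_STACK_SET_TOKEN) && size < 8)
		return -ENOSPC;

	/* An explicit address must be at or above 4 GiB. */
	if (addr && addr < MODEL_SZ_4G)
		return -ERANGE;

	/* Rounding the size up to a page must not wrap around. */
	aligned_size = (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
	if (aligned_size < size)
		return -EOVERFLOW;

	return 0;
}

/* Usage example: a too-small size only matters when a token is requested. */
void model_map_shadow_stack_examples(void)
{
	assert(model_map_shadow_stack_check(0, 4, 0) == 0);
	assert(model_map_shadow_stack_check(0, 4, SHADOW_STACK_SET_TOKEN) == -ENOSPC);
}
```

Because the flag check comes first, an unknown flag is reported as -EINVAL even when the size or address would also be rejected.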
+1
arch/x86/kernel/signal.c
···
 #include <asm/syscall.h>
 #include <asm/sigframe.h>
 #include <asm/signal.h>
+#include <asm/shstk.h>
 
 static inline int is_ia32_compat_frame(struct ksignal *ksig)
 {
+1 -1
arch/x86/kernel/signal_32.c
···
  */
 static_assert(NSIGILL == 11);
 static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
+7 -1
arch/x86/kernel/signal_64.c
···
 	frame = get_sigframe(ksig, regs, sizeof(struct rt_sigframe), &fp);
 	uc_flags = frame_uc_flags(regs);
 
+	if (setup_signal_shadow_stack(ksig))
+		return -EFAULT;
+
 	if (!user_access_begin(frame, sizeof(*frame)))
 		return -EFAULT;
 
···
 	set_current_blocked(&set);
 
 	if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
+		goto badframe;
+
+	if (restore_signal_shadow_stack())
 		goto badframe;
 
 	if (restore_altstack(&frame->uc.uc_stack))
···
  */
 static_assert(NSIGILL == 11);
 static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
+5 -1
arch/x86/kernel/sys_x86_64.c
···
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
-	info.low_limit = PAGE_SIZE;
+	if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
+		info.low_limit = SZ_4G;
+	else
+		info.low_limit = PAGE_SIZE;
+
 	info.high_limit = get_mmap_base(0);
 
 	/*
-87
arch/x86/kernel/traps.c
··· 77 77 78 78 DECLARE_BITMAP(system_vectors, NR_VECTORS); 79 79 80 - static inline void cond_local_irq_enable(struct pt_regs *regs) 81 - { 82 - if (regs->flags & X86_EFLAGS_IF) 83 - local_irq_enable(); 84 - } 85 - 86 - static inline void cond_local_irq_disable(struct pt_regs *regs) 87 - { 88 - if (regs->flags & X86_EFLAGS_IF) 89 - local_irq_disable(); 90 - } 91 - 92 80 __always_inline int is_valid_bugaddr(unsigned long addr) 93 81 { 94 82 if (addr < TASK_SIZE_MAX) ··· 200 212 { 201 213 do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL); 202 214 } 203 - 204 - #ifdef CONFIG_X86_KERNEL_IBT 205 - 206 - static __ro_after_init bool ibt_fatal = true; 207 - 208 - extern void ibt_selftest_ip(void); /* code label defined in asm below */ 209 - 210 - enum cp_error_code { 211 - CP_EC = (1 << 15) - 1, 212 - 213 - CP_RET = 1, 214 - CP_IRET = 2, 215 - CP_ENDBR = 3, 216 - CP_RSTRORSSP = 4, 217 - CP_SETSSBSY = 5, 218 - 219 - CP_ENCL = 1 << 15, 220 - }; 221 - 222 - DEFINE_IDTENTRY_ERRORCODE(exc_control_protection) 223 - { 224 - if (!cpu_feature_enabled(X86_FEATURE_IBT)) { 225 - pr_err("Unexpected #CP\n"); 226 - BUG(); 227 - } 228 - 229 - if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR)) 230 - return; 231 - 232 - if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) { 233 - regs->ax = 0; 234 - return; 235 - } 236 - 237 - pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs)); 238 - if (!ibt_fatal) { 239 - printk(KERN_DEFAULT CUT_HERE); 240 - __warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL); 241 - return; 242 - } 243 - BUG(); 244 - } 245 - 246 - /* Must be noinline to ensure uniqueness of ibt_selftest_ip. 
*/ 247 - noinline bool ibt_selftest(void) 248 - { 249 - unsigned long ret; 250 - 251 - asm (" lea ibt_selftest_ip(%%rip), %%rax\n\t" 252 - ANNOTATE_RETPOLINE_SAFE 253 - " jmp *%%rax\n\t" 254 - "ibt_selftest_ip:\n\t" 255 - UNWIND_HINT_FUNC 256 - ANNOTATE_NOENDBR 257 - " nop\n\t" 258 - 259 - : "=a" (ret) : : "memory"); 260 - 261 - return !ret; 262 - } 263 - 264 - static int __init ibt_setup(char *str) 265 - { 266 - if (!strcmp(str, "off")) 267 - setup_clear_cpu_cap(X86_FEATURE_IBT); 268 - 269 - if (!strcmp(str, "warn")) 270 - ibt_fatal = false; 271 - 272 - return 1; 273 - } 274 - 275 - __setup("ibt=", ibt_setup); 276 - 277 - #endif /* CONFIG_X86_KERNEL_IBT */ 278 215 279 216 #ifdef CONFIG_X86_F00F_BUG 280 217 void handle_invalid_op(struct pt_regs *regs)
+22
arch/x86/mm/fault.c
···
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
+	/*
+	 * Shadow stack accesses (PF_SHSTK=1) are only permitted to
+	 * shadow stack VMAs. All other accesses result in an error.
+	 */
+	if (error_code & X86_PF_SHSTK) {
+		if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
+			return 1;
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return 1;
+		return 0;
+	}
+
 	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
+		if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
+			return 1;
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
 			return 1;
 		return 0;
···
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
+	/*
+	 * Read-only permissions can not be expressed in shadow stack PTEs.
+	 * Treat all shadow stack accesses as WRITE faults. This ensures
+	 * that the MM will prepare everything (e.g., break COW) such that
+	 * maybe_mkwrite() can create a proper shadow stack PTE.
+	 */
+	if (error_code & X86_PF_SHSTK)
+		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (error_code & X86_PF_INSTR)
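The two new branches in the access check above form a small decision table: shadow stack accesses are allowed only on writable shadow stack VMAs, and ordinary writes are refused on shadow stack VMAs. The following userspace sketch models just that table. The function name and the `MODEL_` constants are hypothetical, the bit values are given for illustration only, and every other check in the kernel function (protection keys, instruction fetches, reads) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; the kernel headers are authoritative. */
#define MODEL_PF_WRITE		(1u << 1)
#define MODEL_PF_SHSTK		(1u << 6)
#define MODEL_VM_WRITE		(1ull << 1)
#define MODEL_VM_SHADOW_STACK	(1ull << 37)

/* Returns true if the access must be refused (the kernel returns 1). */
bool model_access_error(uint32_t error_code, uint64_t vm_flags)
{
	if (error_code & MODEL_PF_SHSTK) {
		/* Shadow stack accesses need a writable shadow stack VMA. */
		if (!(vm_flags & MODEL_VM_SHADOW_STACK))
			return true;
		if (!(vm_flags & MODEL_VM_WRITE))
			return true;
		return false;
	}

	if (error_code & MODEL_PF_WRITE) {
		/* Ordinary writes may never target a shadow stack VMA. */
		if (vm_flags & MODEL_VM_SHADOW_STACK)
			return true;
		if (!(vm_flags & MODEL_VM_WRITE))
			return true;
		return false;
	}

	return false;	/* other fault types are outside this sketch */
}

/* Usage example: a plain store into a shadow stack VMA is refused. */
void model_access_error_examples(void)
{
	uint64_t shstk_vma = MODEL_VM_SHADOW_STACK | MODEL_VM_WRITE;

	assert(model_access_error(MODEL_PF_WRITE, shstk_vma));
	assert(!model_access_error(MODEL_PF_SHSTK, shstk_vma));
}
```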
+2 -2
arch/x86/mm/pat/set_memory.c
···
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
-	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
 }
 
 int set_memory_rox(unsigned long addr, int numpages)
 {
-	pgprot_t clr = __pgprot(_PAGE_RW);
+	pgprot_t clr = __pgprot(_PAGE_RW | _PAGE_DIRTY);
 
 	if (__supported_pte_mask & _PAGE_NX)
 		clr.pgprot |= _PAGE_NX;
+40
arch/x86/mm/pgtable.c
···
 
 #endif /* CONFIG_X86_64 */
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return pte_mkwrite_shstk(pte);
+
+	pte = pte_mkwrite_novma(pte);
+
+	return pte_clear_saveddirty(pte);
+}
+
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return pmd_mkwrite_shstk(pmd);
+
+	pmd = pmd_mkwrite_novma(pmd);
+
+	return pmd_clear_saveddirty(pmd);
+}
+
+void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
+{
+	/*
+	 * Hardware before shadow stack can (rarely) set Dirty=1
+	 * on a Write=0 PTE. So the below condition
+	 * only indicates a software bug when shadow stack is
+	 * supported by the HW. This checking is covered in
+	 * pte_shstk().
+	 */
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pte_shstk(pte));
+}
+
+void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd)
+{
+	/* See note in arch_check_zapped_pte() */
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pmd_shstk(pmd));
+}
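The zap-time check above warns when a PTE that looks like shadow stack memory turns up outside a shadow stack VMA. The sketch below models that condition in userspace, assuming a shadow stack PTE is encoded as Write=0, Dirty=1, as the comment in `arch_check_zapped_pte()` and the `set_memory_ro()` change suggest. All names and bit values here are illustrative and hypothetical, and the hardware support test is reduced to a boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative x86 PTE bits; the kernel headers are authoritative. */
#define MODEL_PAGE_RW		(1ull << 1)
#define MODEL_PAGE_DIRTY	(1ull << 6)

/* Assumed shadow stack encoding: Write=0, Dirty=1, on capable hardware. */
bool model_pte_shstk(uint64_t pte_flags, bool hw_has_shstk)
{
	return hw_has_shstk &&
	       (pte_flags & (MODEL_PAGE_RW | MODEL_PAGE_DIRTY)) == MODEL_PAGE_DIRTY;
}

/* Mirrors the VM_WARN_ON_ONCE() condition in arch_check_zapped_pte(). */
bool model_zap_should_warn(bool vma_is_shadow_stack, uint64_t pte_flags,
			   bool hw_has_shstk)
{
	return !vma_is_shadow_stack && model_pte_shstk(pte_flags, hw_has_shstk);
}

/* Usage example: only a Write=0, Dirty=1 PTE outside a shadow stack VMA warns. */
void model_zap_examples(void)
{
	assert(model_zap_should_warn(false, MODEL_PAGE_DIRTY, true));
	assert(!model_zap_should_warn(true, MODEL_PAGE_DIRTY, true));
}
```

On hardware without shadow stack support the model never warns, which matches the comment's point that older CPUs can legitimately produce a Dirty=1, Write=0 PTE.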
+1 -1
arch/x86/xen/enlighten_pv.c
···
 	TRAP_ENTRY(exc_coprocessor_error, false ),
 	TRAP_ENTRY(exc_alignment_check, false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error, false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection, false ),
 #endif
 };
+1 -1
arch/x86/xen/mmu_pv.c
···
 	if (pte == NULL)
 		return; /* vaddr missing */
 
-	ptev = pte_mkwrite(*pte);
+	ptev = pte_mkwrite_novma(*pte);
 
 	if (HYPERVISOR_update_va_mapping(address, ptev, 0))
 		BUG();
+1 -1
arch/x86/xen/xen-asm.S
···
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
+1 -1
arch/xtensa/include/asm/pgtable.h
···
 	{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)
 	{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_novma(pte_t pte)
 	{ pte_val(pte) |= _PAGE_WRITABLE; return pte; }
 
 #define pgprot_noncached(prot) \
+1 -1
fs/aio.c
···
 
 	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
 				 PROT_READ | PROT_WRITE,
-				 MAP_SHARED, 0, &unused, NULL);
+				 MAP_SHARED, 0, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
+6
fs/proc/array.c
···
 	seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm));
 }
 
+__weak void arch_proc_pid_thread_features(struct seq_file *m,
+					  struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
···
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_proc_pid_thread_features(m, task);
 	return 0;
 }
 
+3
fs/proc/task_mmu.c
···
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)] = "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+		[ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
 	};
 	size_t i;
 
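With the "ss" mnemonic added above, a shadow stack mapping can be recognised from the VmFlags line of /proc/<pid>/smaps. A minimal userspace sketch of such a check follows. `vmflags_has()` is a hypothetical helper written for this note, and it assumes the usual line layout of a "VmFlags:" prefix followed by space-separated two-letter mnemonics.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if a "VmFlags:" line lists `flag` as a whole token.
 * Lines that do not start with "VmFlags:" never match.
 */
bool vmflags_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	size_t flag_len = strlen(flag);
	const char *p;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;

	p = line + sizeof(prefix) - 1;
	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of the next token */
		if (len > 0 && len == flag_len && memcmp(p, flag, len) == 0)
			return true;
		p += len;
	}
	return false;
}

/* Usage example: "ss" marks a shadow stack VMA, "sd" does not. */
void vmflags_examples(void)
{
	assert(vmflags_has("VmFlags: rd wr mr mw me ac ss \n", "ss"));
	assert(!vmflags_has("VmFlags: rd wr mr mw me ac sd \n", "ss"));
}
```

Matching whole tokens rather than substrings avoids false positives from longer or neighbouring mnemonics.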
+1 -1
include/asm-generic/hugetlb.h
···
 
 static inline pte_t huge_pte_mkwrite(pte_t pte)
 {
-	return pte_mkwrite(pte);
+	return pte_mkwrite_novma(pte);
 }
 
 #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
+39 -8
include/linux/mm.h
···
 #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
···
 # define VM_PKEY_BIT4 0
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+/*
+ * VM_SHADOW_STACK should not be set with VM_SHARED because of lack of
+ * support core mm.
+ *
+ * These VMAs will get a single end guard page. This helps userspace protect
+ * itself from attacks. A single page is enough for current shadow stack archs
+ * (x86). See the comments near alloc_shstk() in arch/x86/kernel/shstk.c
+ * for more details on the guard size.
+ */
+# define VM_SHADOW_STACK VM_HIGH_ARCH_5
+#else
+# define VM_SHADOW_STACK VM_NONE
+#endif
 
 #if defined(CONFIG_X86)
 # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
···
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT 37
+# define VM_UFFD_MINOR_BIT 38
 # define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */
 #else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 # define VM_UFFD_MINOR VM_NONE
···
 #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
+
+#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK)
 
 #ifdef CONFIG_STACK_GROWSUP
 #define VM_STACK VM_GROWSUP
···
 static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 {
 	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
+		pte = pte_mkwrite(pte, vma);
 	return pte;
 }
 
···
 	struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf);
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 	unsigned long start, size_t len, struct list_head *uf,
 	bool unlock);
···
 	return mtree_load(&mm->mm_mt, addr);
 }
 
+static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_GROWSDOWN)
+		return stack_guard_gap;
+
+	/* See reasoning around the VM_SHADOW_STACK definition */
+	if (vma->vm_flags & VM_SHADOW_STACK)
+		return PAGE_SIZE;
+
+	return 0;
+}
+
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
+	unsigned long gap = stack_guard_start_gap(vma);
 	unsigned long vm_start = vma->vm_start;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
 	return vm_start;
 }
 
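The reworked vm_start_gap() above subtracts a guard gap that depends on the VMA type and clamps the result to zero if the subtraction wraps. The sketch below reproduces that arithmetic in userspace so the clamp is easy to check. The function name and the `MODEL_` constants are hypothetical, a 4 KiB page is assumed, and the kernel's `stack_guard_gap` tunable is passed in as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values; the kernel headers are authoritative. */
#define MODEL_VM_GROWSDOWN	(1ull << 8)
#define MODEL_VM_SHADOW_STACK	(1ull << 37)
#define MODEL_PAGE_SIZE		4096ull		/* assumption: 4 KiB pages */

uint64_t model_vm_start_gap(uint64_t vm_start, uint64_t vm_flags,
			    uint64_t stack_guard_gap)
{
	uint64_t gap = 0;
	uint64_t start;

	if (vm_flags & MODEL_VM_GROWSDOWN)
		gap = stack_guard_gap;		/* classic stack guard gap */
	else if (vm_flags & MODEL_VM_SHADOW_STACK)
		gap = MODEL_PAGE_SIZE;		/* single guard page */

	start = vm_start - gap;
	if (start > vm_start)			/* the subtraction wrapped */
		start = 0;
	return start;
}

/* Usage example: a shadow stack VMA reserves exactly one page below it. */
void model_vm_start_gap_examples(void)
{
	assert(model_vm_start_gap(0x200000, MODEL_VM_SHADOW_STACK, 0x100000)
	       == 0x1ff000);
	assert(model_vm_start_gap(0x800, MODEL_VM_SHADOW_STACK, 0x100000) == 0);
}
```

For a VMA with neither flag the gap is zero, so the unconditional subtraction in the new code leaves the start address unchanged.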
+4
include/linux/mman.h
···
 #ifndef MAP_32BIT
 #define MAP_32BIT 0
 #endif
+#ifndef MAP_ABOVE4G
+#define MAP_ABOVE4G 0
+#endif
 #ifndef MAP_HUGE_2MB
 #define MAP_HUGE_2MB 0
 #endif
···
 		| MAP_STACK \
 		| MAP_HUGETLB \
 		| MAP_32BIT \
+		| MAP_ABOVE4G \
 		| MAP_HUGE_2MB \
 		| MAP_HUGE_1GB)
 
+28
include/linux/pgtable.h
···
 }
 #endif
 
+#ifndef arch_check_zapped_pte
+static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
+					 pte_t pte)
+{
+}
+#endif
+
+#ifndef arch_check_zapped_pmd
+static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
+					 pmd_t pmd)
+{
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
···
 extern pud_t pudp_huge_clear_flush(struct vm_area_struct *vma,
 			      unsigned long address,
 			      pud_t *pudp);
+#endif
+
+#ifndef pte_mkwrite
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	return pte_mkwrite_novma(pte);
+}
+#endif
+
+#if defined(CONFIG_ARCH_WANT_PMD_MKWRITE) && !defined(pmd_mkwrite)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	return pmd_mkwrite_novma(pmd);
+}
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
+1
include/linux/proc_fs.h
···
 #endif /* CONFIG_PROC_PID_ARCH_STATUS */
 
 void arch_report_meminfo(struct seq_file *m);
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
+1
include/linux/syscalls.h
···
 asmlinkage long sys_cachestat(unsigned int fd,
 		struct cachestat_range __user *cstat_range,
 		struct cachestat __user *cstat, unsigned int flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
 
 /*
  * Architecture-specific system calls
+2 -1
include/uapi/asm-generic/siginfo.h
···
 #define SEGV_ADIPERR 7 /* Precise MCD exception */
 #define SEGV_MTEAERR 8 /* Asynchronous ARM MTE error */
 #define SEGV_MTESERR 9 /* Synchronous ARM MTE exception */
-#define NSIGSEGV 9
+#define SEGV_CPERR 10 /* Control protection fault */
+#define NSIGSEGV 10
 
 /*
  * SIGBUS si_codes
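The new SEGV_CPERR si_code above lets a SIGSEGV handler tell a control protection fault apart from an ordinary bad memory access. A minimal userspace sketch of that test follows. `is_control_protection_fault()` is a hypothetical helper, and the fallback definition of `SEGV_CPERR` is only there for building against older libc headers that predate this series.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stdbool.h>

#ifndef SEGV_CPERR
#define SEGV_CPERR 10	/* value added by this series */
#endif

/* True if `info` describes a SIGSEGV raised for a control protection fault. */
bool is_control_protection_fault(const siginfo_t *info)
{
	return info->si_signo == SIGSEGV && info->si_code == SEGV_CPERR;
}

/* Usage example with hand-built siginfo values, as a handler would see them. */
void control_protection_examples(void)
{
	siginfo_t info = {0};

	info.si_signo = SIGSEGV;
	info.si_code = SEGV_CPERR;
	assert(is_control_protection_fault(&info));

	info.si_code = SEGV_MAPERR;
	assert(!is_control_protection_fault(&info));
}
```

In a real program the helper would be called from a handler installed with `sigaction()` and `SA_SIGINFO`, which is how the `siginfo_t` pointer is delivered.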
+2
include/uapi/linux/elf.h
···
 #define NT_386_TLS 0x200 /* i386 TLS slots (struct user_desc) */
 #define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=deny) */
 #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
+/* Old binutils treats 0x203 as a CET state */
+#define NT_X86_SHSTK 0x204 /* x86 SHSTK state */
 #define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */
 #define NT_S390_TIMER 0x301 /* s390 timer register */
 #define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */
+1 -1
ipc/shm.c
···
 			goto invalid;
 	}
 
-	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
+1
kernel/sys_ni.c
···
 COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);
+6 -6
mm/debug_vm_pgtable.c
··· 109 109 WARN_ON(!pte_same(pte, pte)); 110 110 WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte)))); 111 111 WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte)))); 112 - WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte)))); 112 + WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte), args->vma))); 113 113 WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte)))); 114 114 WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte)))); 115 - WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte)))); 115 + WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte, args->vma)))); 116 116 WARN_ON(pte_dirty(pte_wrprotect(pte_mkclean(pte)))); 117 117 WARN_ON(!pte_dirty(pte_wrprotect(pte_mkdirty(pte)))); 118 118 } ··· 156 156 pte = pte_mkclean(pte); 157 157 set_pte_at(args->mm, args->vaddr, args->ptep, pte); 158 158 flush_dcache_page(page); 159 - pte = pte_mkwrite(pte); 159 + pte = pte_mkwrite(pte, args->vma); 160 160 pte = pte_mkdirty(pte); 161 161 ptep_set_access_flags(args->vma, args->vaddr, args->ptep, pte, 1); 162 162 pte = ptep_get(args->ptep); ··· 202 202 WARN_ON(!pmd_same(pmd, pmd)); 203 203 WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd)))); 204 204 WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd)))); 205 - WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd)))); 205 + WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd), args->vma))); 206 206 WARN_ON(pmd_young(pmd_mkold(pmd_mkyoung(pmd)))); 207 207 WARN_ON(pmd_dirty(pmd_mkclean(pmd_mkdirty(pmd)))); 208 - WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd)))); 208 + WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd, args->vma)))); 209 209 WARN_ON(pmd_dirty(pmd_wrprotect(pmd_mkclean(pmd)))); 210 210 WARN_ON(!pmd_dirty(pmd_wrprotect(pmd_mkdirty(pmd)))); 211 211 /* ··· 256 256 pmd = pmd_mkclean(pmd); 257 257 set_pmd_at(args->mm, vaddr, args->pmdp, pmd); 258 258 flush_dcache_page(page); 259 - pmd = pmd_mkwrite(pmd); 259 + pmd = pmd_mkwrite(pmd, args->vma); 260 260 pmd = pmd_mkdirty(pmd); 261 261 pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1); 262 262 pmd = 
READ_ONCE(*args->pmdp);
+1 -1
mm/gup.c
···
 		    !writable_file_mapping_allowed(vma, gup_flags))
 			return -EFAULT;
 
-		if (!(vm_flags & VM_WRITE)) {
+		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
 			/* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
+6 -5
mm/huge_memory.c
··· 551 551 pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma) 552 552 { 553 553 if (likely(vma->vm_flags & VM_WRITE)) 554 - pmd = pmd_mkwrite(pmd); 554 + pmd = pmd_mkwrite(pmd, vma); 555 555 return pmd; 556 556 } 557 557 ··· 1566 1566 pmd = pmd_modify(oldpmd, vma->vm_page_prot); 1567 1567 pmd = pmd_mkyoung(pmd); 1568 1568 if (writable) 1569 - pmd = pmd_mkwrite(pmd); 1569 + pmd = pmd_mkwrite(pmd, vma); 1570 1570 set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); 1571 1571 update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); 1572 1572 spin_unlock(vmf->ptl); ··· 1675 1675 */ 1676 1676 orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd, 1677 1677 tlb->fullmm); 1678 + arch_check_zapped_pmd(vma, orig_pmd); 1678 1679 tlb_remove_pmd_tlb_entry(tlb, pmd, addr); 1679 1680 if (vma_is_special_huge(vma)) { 1680 1681 if (arch_needs_pgtable_deposit()) ··· 1920 1919 /* See change_pte_range(). */ 1921 1920 if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) && 1922 1921 can_change_pmd_writable(vma, addr, entry)) 1923 - entry = pmd_mkwrite(entry); 1922 + entry = pmd_mkwrite(entry, vma); 1924 1923 1925 1924 ret = HPAGE_PMD_NR; 1926 1925 set_pmd_at(mm, addr, pmd, entry); ··· 2234 2233 } else { 2235 2234 entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); 2236 2235 if (write) 2237 - entry = pte_mkwrite(entry); 2236 + entry = pte_mkwrite(entry, vma); 2238 2237 if (anon_exclusive) 2239 2238 SetPageAnonExclusive(page + i); 2240 2239 if (!young) ··· 3266 3265 if (pmd_swp_soft_dirty(*pvmw->pmd)) 3267 3266 pmde = pmd_mksoft_dirty(pmde); 3268 3267 if (is_writable_migration_entry(entry)) 3269 - pmde = pmd_mkwrite(pmde); 3268 + pmde = pmd_mkwrite(pmde, vma); 3270 3269 if (pmd_swp_uffd_wp(*pvmw->pmd)) 3271 3270 pmde = pmd_mkuffd_wp(pmde); 3272 3271 if (!is_migration_entry_young(entry))
+2 -2
mm/internal.h
··· 556 556 } 557 557 558 558 /* 559 - * Stack area - automatically grows in one direction 559 + * Stack area (including shadow stacks) 560 560 * 561 561 * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous: 562 562 * do_mmap() forbids all other combinations. 563 563 */ 564 564 static inline bool is_stack_mapping(vm_flags_t flags) 565 565 { 566 - return (flags & VM_STACK) == VM_STACK; 566 + return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK); 567 567 } 568 568 569 569 /*
+3 -2
mm/memory.c
···
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			arch_check_zapped_pte(vma, ptent);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
 						      ptent);
···
 	entry = mk_pte(&folio->page, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
 	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
+		entry = pte_mkwrite(pte_mkdirty(entry), vma);
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
···
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (writable)
-		pte = pte_mkwrite(pte);
+		pte = pte_mkwrite(pte, vma);
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+1 -1
mm/migrate.c
···
 		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
 			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
-			pte = pte_mkwrite(pte);
+			pte = pte_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(old_pte))
 			pte = pte_mkuffd_wp(pte);
 
+1 -1
mm/migrate_device.c
···
 		}
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
-			entry = pte_mkwrite(pte_mkdirty(entry));
+			entry = pte_mkwrite(pte_mkdirty(entry), vma);
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+7 -7
mm/mmap.c
···
  */
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
-			unsigned long flags, unsigned long pgoff,
-			unsigned long *populate, struct list_head *uf)
+			unsigned long flags, vm_flags_t vm_flags,
+			unsigned long pgoff, unsigned long *populate,
+			struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	vm_flags_t vm_flags;
 	int pkey = 0;
 
 	*populate = 0;
···
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
···
 	gap = mas.index;
 	gap += (info->align_offset - gap) & info->align_mask;
 	tmp = mas_next(&mas, ULONG_MAX);
-	if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */
+	if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
 		if (vm_start_gap(tmp) < gap + length - 1) {
 			low_limit = tmp->vm_end;
 			mas_reset(&mas);
···
 	gap -= (gap - info->align_offset) & info->align_mask;
 	gap_end = mas.last;
 	tmp = mas_next(&mas, ULONG_MAX);
-	if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */
+	if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
 		if (vm_start_gap(tmp) <= gap_end) {
 			high_limit = vm_start_gap(tmp);
 			mas_reset(&mas);
···
 
 	file = get_file(vma->vm_file);
 	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+			prot, flags, 0, pgoff, &populate, NULL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
+1 -1
mm/mprotect.c
···
 			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
 			    !pte_write(ptent) &&
 			    can_change_pte_writable(vma, addr, ptent))
-				ptent = pte_mkwrite(ptent);
+				ptent = pte_mkwrite(ptent, vma);
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_needs_flush(oldpte, ptent))
+2 -2
mm/nommu.c
···
 			unsigned long len,
 			unsigned long prot,
 			unsigned long flags,
+			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
 			struct list_head *uf)
···
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	struct rb_node *rb;
-	vm_flags_t vm_flags;
 	unsigned long capabilities, result;
 	int ret;
 	VMA_ITERATOR(vmi, current->mm, 0);
···
 
 	/* we've determined that we can make the mapping, now translate what we
 	 * now know into VMA flags */
-	vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+	vm_flags |= determine_vm_flags(file, prot, flags, capabilities);
 
 
 	/* we're going to need to record the mapping */
+1 -1
mm/userfaultfd.c
···
 	if (page_in_cache && !vm_shared)
 		writable = false;
 	if (writable)
-		_dst_pte = pte_mkwrite(_dst_pte);
+		_dst_pte = pte_mkwrite(_dst_pte, dst_vma);
 	if (flags & MFILL_ATOMIC_WP)
 		_dst_pte = pte_mkuffd_wp(_dst_pte);
 
+1 -1
mm/util.c
···
 	if (!ret) {
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-		ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
 			      &uf);
 		mmap_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
+1 -1
tools/testing/selftests/x86/Makefile
···
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
 TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
-			corrupt_xstate_header amx lam
+			corrupt_xstate_header amx lam test_shadow_stack
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
+884
tools/testing/selftests/x86/test_shadow_stack.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * This program test's basic kernel shadow stack support. It enables shadow 4 + * stack manual via the arch_prctl(), instead of relying on glibc. It's 5 + * Makefile doesn't compile with shadow stack support, so it doesn't rely on 6 + * any particular glibc. As a result it can't do any operations that require 7 + * special glibc shadow stack support (longjmp(), swapcontext(), etc). Just 8 + * stick to the basics and hope the compiler doesn't do anything strange. 9 + */ 10 + 11 + #define _GNU_SOURCE 12 + 13 + #include <sys/syscall.h> 14 + #include <asm/mman.h> 15 + #include <sys/mman.h> 16 + #include <sys/stat.h> 17 + #include <sys/wait.h> 18 + #include <stdio.h> 19 + #include <stdlib.h> 20 + #include <fcntl.h> 21 + #include <unistd.h> 22 + #include <string.h> 23 + #include <errno.h> 24 + #include <stdbool.h> 25 + #include <x86intrin.h> 26 + #include <asm/prctl.h> 27 + #include <sys/prctl.h> 28 + #include <stdint.h> 29 + #include <signal.h> 30 + #include <pthread.h> 31 + #include <sys/ioctl.h> 32 + #include <linux/userfaultfd.h> 33 + #include <setjmp.h> 34 + #include <sys/ptrace.h> 35 + #include <sys/signal.h> 36 + #include <linux/elf.h> 37 + 38 + /* 39 + * Define the ABI defines if needed, so people can run the tests 40 + * without building the headers. 
+ */
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 452
+
+#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
+
+#define ARCH_SHSTK_ENABLE 0x5001
+#define ARCH_SHSTK_DISABLE 0x5002
+#define ARCH_SHSTK_LOCK 0x5003
+#define ARCH_SHSTK_UNLOCK 0x5004
+#define ARCH_SHSTK_STATUS 0x5005
+
+#define ARCH_SHSTK_SHSTK (1ULL << 0)
+#define ARCH_SHSTK_WRSS (1ULL << 1)
+
+#define NT_X86_SHSTK 0x204
+#endif
+
+#define SS_SIZE 0x200000
+#define PAGE_SIZE 0x1000
+
+#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
+int main(int argc, char *argv[])
+{
+	printf("[SKIP]\tCompiler does not support CET.\n");
+	return 0;
+}
+#else
+void write_shstk(unsigned long *addr, unsigned long val)
+{
+	asm volatile("wrssq %[val], (%[addr])\n"
+		     : "=m" (addr)
+		     : [addr] "r" (addr), [val] "r" (val));
+}
+
+static inline unsigned long __attribute__((always_inline)) get_ssp(void)
+{
+	unsigned long ret = 0;
+
+	asm volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
+	return ret;
+}
+
+/*
+ * For use in inline enablement of shadow stack.
+ *
+ * The program can't return from the point where shadow stack gets enabled
+ * because there will be no address on the shadow stack. So it can't use
+ * syscall() for enablement, since it is a function.
+ *
+ * Based on code from nolibc.h. Keep a copy here because this can't pull in all
+ * of nolibc.h.
+ */
+#define ARCH_PRCTL(arg1, arg2) \
+({ \
+	long _ret; \
+	register long _num asm("eax") = __NR_arch_prctl; \
+	register long _arg1 asm("rdi") = (long)(arg1); \
+	register long _arg2 asm("rsi") = (long)(arg2); \
+	\
+	asm volatile ( \
+		"syscall\n" \
+		: "=a"(_ret) \
+		: "r"(_arg1), "r"(_arg2), \
+		  "0"(_num) \
+		: "rcx", "r11", "memory", "cc" \
+	); \
+	_ret; \
+})
+
+void *create_shstk(void *addr)
+{
+	return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
+}
+
+void *create_normal_mem(void *addr)
+{
+	return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE,
+		    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+}
+
+void free_shstk(void *shstk)
+{
+	munmap(shstk, SS_SIZE);
+}
+
+int reset_shstk(void *shstk)
+{
+	return madvise(shstk, SS_SIZE, MADV_DONTNEED);
+}
+
+void try_shstk(unsigned long new_ssp)
+{
+	unsigned long ssp;
+
+	printf("[INFO]\tnew_ssp = %lx, *new_ssp = %lx\n",
+	       new_ssp, *((unsigned long *)new_ssp));
+
+	ssp = get_ssp();
+	printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp);
+
+	asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
+	asm volatile("saveprevssp");
+	printf("[INFO]\tssp is now %lx\n", get_ssp());
+
+	/* Switch back to original shadow stack */
+	ssp -= 8;
+	asm volatile("rstorssp (%0)\n":: "r" (ssp));
+	asm volatile("saveprevssp");
+}
+
+int test_shstk_pivot(void)
+{
+	void *shstk = create_shstk(0);
+
+	if (shstk == MAP_FAILED) {
+		printf("[FAIL]\tError creating shadow stack: %d\n", errno);
+		return 1;
+	}
+	try_shstk((unsigned long)shstk + SS_SIZE - 8);
+	free_shstk(shstk);
+
+	printf("[OK]\tShadow stack pivot\n");
+	return 0;
+}
+
+int test_shstk_faults(void)
+{
+	unsigned long *shstk = create_shstk(0);
+
+	/* Read shadow stack, test if it's zero to not get read optimized out */
+	if (*shstk != 0)
+		goto err;
+
+	/* Wrss memory that was already read. */
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	/* Page out memory, so we can wrss it again. */
+	if (reset_shstk((void *)shstk))
+		goto err;
+
+	write_shstk(shstk, 1);
+	if (*shstk != 1)
+		goto err;
+
+	printf("[OK]\tShadow stack faults\n");
+	return 0;
+
+err:
+	return 1;
+}
+
+unsigned long saved_ssp;
+unsigned long saved_ssp_val;
+volatile bool segv_triggered;
+
+void __attribute__((noinline)) violate_ss(void)
+{
+	saved_ssp = get_ssp();
+	saved_ssp_val = *(unsigned long *)saved_ssp;
+
+	/* Corrupt shadow stack */
+	printf("[INFO]\tCorrupting shadow stack\n");
+	write_shstk((void *)saved_ssp, 0);
+}
+
+void segv_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tGenerated shadow stack violation successfully\n");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	write_shstk((void *)saved_ssp, saved_ssp_val);
+}
+
+int test_shstk_violation(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = segv_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before violate_ss() */
+	asm volatile("" : : : "memory");
+
+	violate_ss();
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow stack violation test\n");
+
+	return !segv_triggered;
+}
+
+/* Gup test state */
+#define MAGIC_VAL 0x12345678
+bool is_shstk_access;
+void *shstk_ptr;
+int fd;
+
+void reset_test_shstk(void *addr)
+{
+	if (shstk_ptr)
+		free_shstk(shstk_ptr);
+	shstk_ptr = create_shstk(addr);
+}
+
+void test_access_fix_handler(int signum, siginfo_t *si, void *uc)
+{
+	printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : "normal write");
+
+	segv_triggered = true;
+
+	/* Fix shadow stack */
+	if (is_shstk_access) {
+		reset_test_shstk(shstk_ptr);
+		return;
+	}
+
+	free_shstk(shstk_ptr);
+	create_normal_mem(shstk_ptr);
+}
+
+bool test_shstk_access(void *ptr)
+{
+	is_shstk_access = true;
+	segv_triggered = false;
+	write_shstk(ptr, MAGIC_VAL);
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool test_write_access(void *ptr)
+{
+	is_shstk_access = false;
+	segv_triggered = false;
+	*(unsigned long *)ptr = MAGIC_VAL;
+
+	asm volatile("" : : : "memory");
+
+	return segv_triggered;
+}
+
+bool gup_write(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (write(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+bool gup_read(void *ptr)
+{
+	unsigned long val;
+
+	lseek(fd, (unsigned long)ptr, SEEK_SET);
+	if (read(fd, &val, sizeof(val)) < 0)
+		return 1;
+
+	return 0;
+}
+
+int test_gup(void)
+{
+	struct sigaction sa = {};
+	int status;
+	pid_t pid;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	fd = open("/proc/self/mem", O_RDWR);
+	if (fd == -1)
+		return 1;
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (test_shstk_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> shstk access success\n");
+
+	reset_test_shstk(0);
+	if (gup_read(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup read -> write access success\n");
+
+	reset_test_shstk(0);
+	if (gup_write(shstk_ptr))
+		return 1;
+	if (!test_write_access(shstk_ptr))
+		return 1;
+	printf("[INFO]\tGup write -> write access success\n");
+
+	close(fd);
+
+	/* COW/gup test */
+	reset_test_shstk(0);
+	pid = fork();
+	if (!pid) {
+		fd = open("/proc/self/mem", O_RDWR);
+		if (fd == -1)
+			exit(1);
+
+		if (gup_write(shstk_ptr)) {
+			close(fd);
+			exit(1);
+		}
+		close(fd);
+		exit(0);
+	}
+	waitpid(pid, &status, 0);
+	if (WEXITSTATUS(status)) {
+		printf("[FAIL]\tWrite in child failed\n");
+		return 1;
+	}
+	if (*(unsigned long *)shstk_ptr == MAGIC_VAL) {
+		printf("[FAIL]\tWrite in child wrote through to shared memory\n");
+		return 1;
+	}
+
+	printf("[INFO]\tCow gup write -> write access success\n");
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tShadow gup test\n");
+
+	return 0;
+}
+
+int test_mprotect(void)
+{
+	struct sigaction sa = {};
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	segv_triggered = false;
+
+	/* mprotect a shadow stack as read only */
+	reset_test_shstk(0);
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+		return 1;
+	}
+
+	/* try to wrss it and fail */
+	if (!test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to read-only memory succeeded\n");
+		return 1;
+	}
+
+	/*
+	 * The shadow stack was reset above to resolve the fault, make the new one
+	 * read-only.
+	 */
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+		return 1;
+	}
+
+	/* then back to writable */
+	if (mprotect(shstk_ptr, SS_SIZE, PROT_WRITE | PROT_READ) < 0) {
+		printf("[FAIL]\tmprotect(PROT_WRITE) failed\n");
+		return 1;
+	}
+
+	/* then wrss to it and succeed */
+	if (test_shstk_access(shstk_ptr)) {
+		printf("[FAIL]\tShadow stack access to mprotect() writable memory failed\n");
+		return 1;
+	}
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	printf("[OK]\tmprotect() test\n");
+
+	return 0;
+}
+
+char zero[4096];
+
+static void *uffd_thread(void *arg)
+{
+	struct uffdio_copy req;
+	int uffd = *(int *)arg;
+	struct uffd_msg msg;
+	int ret;
+
+	while (1) {
+		ret = read(uffd, &msg, sizeof(msg));
+		if (ret > 0)
+			break;
+		else if (errno == EAGAIN)
+			continue;
+		return (void *)1;
+	}
+
+	req.dst = msg.arg.pagefault.address;
+	req.src = (__u64)zero;
+	req.len = 4096;
+	req.mode = 0;
+
+	if (ioctl(uffd, UFFDIO_COPY, &req))
+		return (void *)1;
+
+	return (void *)0;
+}
+
+int test_userfaultfd(void)
+{
+	struct uffdio_register uffdio_register;
+	struct uffdio_api uffdio_api;
+	struct sigaction sa = {};
+	pthread_t thread;
+	void *res;
+	int uffd;
+
+	sa.sa_sigaction = test_access_fix_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		printf("[SKIP]\tUserfaultfd unavailable.\n");
+		return 0;
+	}
+
+	reset_test_shstk(0);
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+		goto err;
+
+	uffdio_register.range.start = (__u64)shstk_ptr;
+	uffdio_register.range.len = 4096;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		goto err;
+
+	if (pthread_create(&thread, NULL, &uffd_thread, &uffd))
+		goto err;
+
+	reset_shstk(shstk_ptr);
+	test_shstk_access(shstk_ptr);
+
+	if (pthread_join(thread, &res))
+		goto err;
+
+	if (test_shstk_access(shstk_ptr))
+		goto err;
+
+	free_shstk(shstk_ptr);
+
+	signal(SIGSEGV, SIG_DFL);
+
+	if (!res)
+		printf("[OK]\tUserfaultfd test\n");
+	return !!res;
+err:
+	free_shstk(shstk_ptr);
+	close(uffd);
+	signal(SIGSEGV, SIG_DFL);
+	return 1;
+}
+
+/* Simple linked list for keeping track of mappings in test_guard_gap() */
+struct node {
+	struct node *next;
+	void *mapping;
+};
+
+/*
+ * This tests whether mmap will place other mappings in a shadow stack's guard
+ * gap. The steps are:
+ *   1. Finds an empty place by mapping and unmapping something.
+ *   2. Map a shadow stack in the middle of the known empty area.
+ *   3. Map a bunch of PAGE_SIZE mappings. These will use the search down
+ *      direction, filling any gaps until it encounters the shadow stack's
+ *      guard gap.
+ *   4. When a mapping lands below the shadow stack from step 2, then all
+ *      of the above gaps are filled. The search down algorithm will have
+ *      looked at the shadow stack gaps.
+ *   5. See if it landed in the gap.
+ */
+int test_guard_gap(void)
+{
+	void *free_area, *shstk, *test_map = (void *)0xFFFFFFFFFFFFFFFF;
+	struct node *head = NULL, *cur;
+
+	free_area = mmap(0, SS_SIZE * 3, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	munmap(free_area, SS_SIZE * 3);
+
+	shstk = create_shstk(free_area + SS_SIZE);
+	if (shstk == MAP_FAILED)
+		return 1;
+
+	while (test_map > shstk) {
+		test_map = mmap(0, PAGE_SIZE, PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (test_map == MAP_FAILED)
+			return 1;
+		cur = malloc(sizeof(*cur));
+		cur->mapping = test_map;
+
+		cur->next = head;
+		head = cur;
+	}
+
+	while (head) {
+		cur = head;
+		head = cur->next;
+		munmap(cur->mapping, PAGE_SIZE);
+		free(cur);
+	}
+
+	free_shstk(shstk);
+
+	if (shstk - test_map - PAGE_SIZE != PAGE_SIZE)
+		return 1;
+
+	printf("[OK]\tGuard gap test\n");
+
+	return 0;
+}
+
+/*
+ * Too complicated to pull it out of the 32 bit header, but also get the
+ * 64 bit one needed above. Just define a copy here.
+ */
+#define __NR_compat_sigaction 67
+
+/*
+ * Call 32 bit signal handler to get 32 bit signals ABI. Make sure
+ * to push the registers that will get clobbered.
+ */
+int sigaction32(int signum, const struct sigaction *restrict act,
+		struct sigaction *restrict oldact)
+{
+	register long syscall_reg asm("eax") = __NR_compat_sigaction;
+	register long signum_reg asm("ebx") = signum;
+	register long act_reg asm("ecx") = (long)act;
+	register long oldact_reg asm("edx") = (long)oldact;
+	int ret = 0;
+
+	asm volatile ("int $0x80;"
+		      : "=a"(ret), "=m"(oldact)
+		      : "r"(syscall_reg), "r"(signum_reg), "r"(act_reg),
+			"r"(oldact_reg)
+		      : "r8", "r9", "r10", "r11"
+		     );
+
+	return ret;
+}
+
+sigjmp_buf jmp_buffer;
+
+void segv_gp_handler(int signum, siginfo_t *si, void *uc)
+{
+	segv_triggered = true;
+
+	/*
+	 * To work with old glibc, this can't rely on siglongjmp working with
+	 * shadow stack enabled, so disable shadow stack before siglongjmp().
+	 */
+	ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
+	siglongjmp(jmp_buffer, -1);
+}
+
+/*
+ * Transition to 32 bit mode and check that a #GP triggers a segfault.
+ */
+int test_32bit(void)
+{
+	struct sigaction sa = {};
+	struct sigaction *sa32;
+
+	/* Create sigaction in 32 bit address range */
+	sa32 = mmap(0, 4096, PROT_READ | PROT_WRITE,
+		    MAP_32BIT | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	sa32->sa_flags = SA_SIGINFO;
+
+	sa.sa_sigaction = segv_gp_handler;
+	sa.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGSEGV, &sa, NULL))
+		return 1;
+
+
+	segv_triggered = false;
+
+	/* Make sure segv_triggered is set before triggering the #GP */
+	asm volatile("" : : : "memory");
+
+	/*
+	 * Set handler to somewhere in 32 bit address space
+	 */
+	sa32->sa_handler = (void *)sa32;
+	if (sigaction32(SIGUSR1, sa32, NULL))
+		return 1;
+
+	if (!sigsetjmp(jmp_buffer, 1))
+		raise(SIGUSR1);
+
+	if (segv_triggered)
+		printf("[OK]\t32 bit test\n");
+
+	return !segv_triggered;
+}
+
+void segv_handler_ptrace(int signum, siginfo_t *si, void *uc)
+{
+	/* The SSP adjustment caused a segfault. */
+	exit(0);
+}
+
+int test_ptrace(void)
+{
+	unsigned long saved_ssp, ssp = 0;
+	struct sigaction sa= {};
+	struct iovec iov;
+	int status;
+	int pid;
+
+	iov.iov_base = &ssp;
+	iov.iov_len = sizeof(ssp);
+
+	pid = fork();
+	if (!pid) {
+		ssp = get_ssp();
+
+		sa.sa_sigaction = segv_handler_ptrace;
+		sa.sa_flags = SA_SIGINFO;
+		if (sigaction(SIGSEGV, &sa, NULL))
+			return 1;
+
+		ptrace(PTRACE_TRACEME, NULL, NULL, NULL);
+		/*
+		 * The parent will tweak the SSP and return from this function
+		 * will #CP.
+		 */
+		raise(SIGTRAP);
+
+		exit(1);
+	}
+
+	while (waitpid(pid, &status, 0) != -1 && WSTOPSIG(status) != SIGTRAP);
+
+	if (ptrace(PTRACE_GETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tFailed to PTRACE_GETREGS\n");
+		goto out_kill;
+	}
+
+	if (!ssp) {
+		printf("[INFO]\tPtrace child SSP was 0\n");
+		goto out_kill;
+	}
+
+	saved_ssp = ssp;
+
+	iov.iov_len = 0;
+	if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tToo small size accepted via PTRACE_SETREGS\n");
+		goto out_kill;
+	}
+
+	iov.iov_len = sizeof(ssp) + 1;
+	if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tToo large size accepted via PTRACE_SETREGS\n");
+		goto out_kill;
+	}
+
+	ssp += 1;
+	if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tUnaligned SSP written via PTRACE_SETREGS\n");
+		goto out_kill;
+	}
+
+	ssp = 0xFFFFFFFFFFFF0000;
+	if (!ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tKernel range SSP written via PTRACE_SETREGS\n");
+		goto out_kill;
+	}
+
+	/*
+	 * Tweak the SSP so the child with #CP when it resumes and returns
+	 * from raise()
+	 */
+	ssp = saved_ssp + 8;
+	iov.iov_len = sizeof(ssp);
+	if (ptrace(PTRACE_SETREGSET, pid, NT_X86_SHSTK, &iov)) {
+		printf("[INFO]\tFailed to PTRACE_SETREGS\n");
+		goto out_kill;
+	}
+
+	if (ptrace(PTRACE_DETACH, pid, NULL, NULL)) {
+		printf("[INFO]\tFailed to PTRACE_DETACH\n");
+		goto out_kill;
+	}
+
+	waitpid(pid, &status, 0);
+	if (WEXITSTATUS(status))
+		return 1;
+
+	printf("[OK]\tPtrace test\n");
+	return 0;
+
+out_kill:
+	kill(pid, SIGKILL);
+	return 1;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = 0;
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+		printf("[SKIP]\tCould not enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+		printf("[SKIP]\tCould not re-enable Shadow stack\n");
+		return 1;
+	}
+
+	if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS)) {
+		printf("[SKIP]\tCould not enable WRSS\n");
+		ret = 1;
+		goto out;
+	}
+
+	/* Should have succeeded if here, but this is a test, so double check. */
+	if (!get_ssp()) {
+		printf("[FAIL]\tShadow stack disabled\n");
+		return 1;
+	}
+
+	if (test_shstk_pivot()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack pivot\n");
+		goto out;
+	}
+
+	if (test_shstk_faults()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack fault test\n");
+		goto out;
+	}
+
+	if (test_shstk_violation()) {
+		ret = 1;
+		printf("[FAIL]\tShadow stack violation test\n");
+		goto out;
+	}
+
+	if (test_gup()) {
+		ret = 1;
+		printf("[FAIL]\tShadow shadow stack gup\n");
+		goto out;
+	}
+
+	if (test_mprotect()) {
+		ret = 1;
+		printf("[FAIL]\tShadow shadow mprotect test\n");
+		goto out;
+	}
+
+	if (test_userfaultfd()) {
+		ret = 1;
+		printf("[FAIL]\tUserfaultfd test\n");
+		goto out;
+	}
+
+	if (test_guard_gap()) {
+		ret = 1;
+		printf("[FAIL]\tGuard gap test\n");
+		goto out;
+	}
+
+	if (test_ptrace()) {
+		ret = 1;
+		printf("[FAIL]\tptrace test\n");
+	}
+
+	if (test_32bit()) {
+		ret = 1;
+		printf("[FAIL]\t32 bit test\n");
+		goto out;
+	}
+
+	return ret;
+
+out:
+	/*
+	 * Disable shadow stack before the function returns, or there will be a
+	 * shadow stack violation.
+	 */
+	if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+		ret = 1;
+		printf("[FAIL]\tDisabling shadow stack failed\n");
+	}
+
+	return ret;
+}
+#endif
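
Reviewer note on the pivot test: test_shstk_pivot() passes `shstk + SS_SIZE - 8` to try_shstk(), which is where a mapping created with SHADOW_STACK_SET_TOKEN is expected to hold its restore token. The arithmetic can be sketched as below. This is only an illustration, assuming the x86-64 token layout in which the token stores the address just above itself with bit 0 set to mark 64-bit mode. `shstk_token_addr()` and `shstk_token_value()` are hypothetical helper names and are not part of the selftest.

```c
#include <assert.h>
#include <stdint.h>

#define SS_SIZE 0x200000ULL /* same mapping size the selftest uses */

/*
 * Hypothetical helper: address of the restore token for a shadow stack
 * mapped at base. The token occupies the last 8-byte slot of the mapping.
 */
static inline uint64_t shstk_token_addr(uint64_t base, uint64_t size)
{
	assert((base & 7) == 0 && (size & 7) == 0); /* slots are 8-byte aligned */
	return base + size - 8;
}

/*
 * Hypothetical helper: value assumed to be stored in that slot, namely the
 * address just above the token with bit 0 set (assumed 64-bit mode marker).
 */
static inline uint64_t shstk_token_value(uint64_t base, uint64_t size)
{
	assert((base & 7) == 0 && (size & 7) == 0);
	return (base + size) | 1ULL;
}
```

For a mapping returned by create_shstk(0), `shstk_token_addr((uint64_t)shstk, SS_SIZE)` is the same value that test_shstk_pivot() hands to try_shstk(), and the first [INFO] line printed by try_shstk() would be expected to show the token value as `*new_ssp`.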