Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

Merge tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fsgsbase from Thomas Gleixner:
"Support for FSGSBASE. Almost 5 years after the first RFC to support
it, this has been brought into a shape which is maintainable and
actually works.

This final version was done by Sasha Levin who took it up after Intel
dropped the ball. Sasha discovered that the SGX (sic!) offerings out
there ship rogue kernel modules enabling FSGSBASE behind the kernel's
back, which opens an instantaneous unprivileged root hole.

The FSGSBASE instructions provide a considerable speedup of the
context switch path and enable user space to write GSBASE without
kernel interaction. This enablement requires careful handling of the
exception entries which go through the paranoid entry path as they
can no longer rely on the assumption that user GSBASE is positive (as
enforced via prctl() on non-FSGSBASE-enabled systems).

All other entries (syscalls, interrupts and exceptions) can still just
utilize SWAPGS unconditionally when the entry comes from user space.
Converting these entries to use FSGSBASE has no benefit as SWAPGS is
only marginally slower than WRGSBASE and locating and retrieving the
kernel GSBASE value is not a free operation either. The real benefit
of RD/WRGSBASE is the avoidance of the MSR reads and writes.

The changes come with appropriate selftests and have held up in field
testing against the (sanitized) Graphene-SGX driver"

* tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/fsgsbase: Fix Xen PV support
x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbase
selftests/x86/fsgsbase: Add a missing memory constraint
selftests/x86/fsgsbase: Fix a comment in the ptrace_write_gsbase test
selftests/x86: Add a syscall_arg_fault_64 test for negative GSBASE
selftests/x86/fsgsbase: Test ptracer-induced GS base write with FSGSBASE
selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base write
Documentation/x86/64: Add documentation for GS/FS addressing mode
x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit
x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit
x86/entry/64: Introduce the FIND_PERCPU_BASE macro
x86/entry/64: Switch CR3 before SWAPGS in paranoid entry
x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation
x86/process/64: Use FSGSBASE instructions on thread copy and ptrace
x86/process/64: Use FSBSBASE in switch_to() if available
x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE
x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
...

+894 -110
+2
Documentation/admin-guide/kernel-parameters.txt
···
 	no5lvl		[X86-64] Disable 5-level paging mode. Forces
 			kernel to use 4-level paging instead.
 
+	nofsgsbase	[X86] Disables FSGSBASE instructions.
+
 	no_console_suspend
 			[HW] Never suspend the console
 			Disable suspending of consoles during suspend and
+199
Documentation/x86/x86_64/fsgs.rst
···
+.. SPDX-License-Identifier: GPL-2.0
+
+Using FS and GS segments in user space applications
+===================================================
+
+The x86 architecture supports segmentation. Instructions which access
+memory can use segment register based addressing mode. The following
+notation is used to address a byte within a segment:
+
+  Segment-register:Byte-address
+
+The segment base address is added to the Byte-address to compute the
+resulting virtual address which is accessed. This allows access to multiple
+instances of data with the identical Byte-address, i.e. the same code. The
+selection of a particular instance is purely based on the base-address in
+the segment register.
+
+In 32-bit mode the CPU provides 6 segments, which also support segment
+limits. The limits can be used to enforce address space protections.
+
+In 64-bit mode the CS/SS/DS/ES segments are ignored and the base address is
+always 0 to provide a full 64bit address space. The FS and GS segments are
+still functional in 64-bit mode.
+
+Common FS and GS usage
+------------------------------
+
+The FS segment is commonly used to address Thread Local Storage (TLS). FS
+is usually managed by runtime code or a threading library. Variables
+declared with the '__thread' storage class specifier are instantiated per
+thread and the compiler emits the FS: address prefix for accesses to these
+variables. Each thread has its own FS base address so common code can be
+used without complex address offset calculations to access the per thread
+instances. Applications should not use FS for other purposes when they use
+runtimes or threading libraries which manage the per thread FS.
+
+The GS segment has no common use and can be used freely by
+applications. GCC and Clang support GS based addressing via address space
+identifiers.
+
+Reading and writing the FS/GS base address
+------------------------------------------
+
+There exist two mechanisms to read and write the FS/GS base address:
+
+ - the arch_prctl() system call
+
+ - the FSGSBASE instruction family
+
+Accessing FS/GS base with arch_prctl()
+--------------------------------------
+
+ The arch_prctl(2) based mechanism is available on all 64-bit CPUs and all
+ kernel versions.
+
+ Reading the base:
+
+   arch_prctl(ARCH_GET_FS, &fsbase);
+   arch_prctl(ARCH_GET_GS, &gsbase);
+
+ Writing the base:
+
+   arch_prctl(ARCH_SET_FS, fsbase);
+   arch_prctl(ARCH_SET_GS, gsbase);
+
+ The ARCH_SET_GS prctl may be disabled depending on kernel configuration
+ and security settings.
+
+Accessing FS/GS base with the FSGSBASE instructions
+---------------------------------------------------
+
+ With the Ivy Bridge CPU generation Intel introduced a new set of
+ instructions to access the FS and GS base registers directly from user
+ space. These instructions are also supported on AMD Family 17H CPUs. The
+ following instructions are available:
+
+  =============== ===========================
+  RDFSBASE %reg   Read the FS base register
+  RDGSBASE %reg   Read the GS base register
+  WRFSBASE %reg   Write the FS base register
+  WRGSBASE %reg   Write the GS base register
+  =============== ===========================
+
+ The instructions avoid the overhead of the arch_prctl() syscall and allow
+ more flexible usage of the FS/GS addressing modes in user space
+ applications. This does not prevent conflicts between threading libraries
+ and runtimes which utilize FS and applications which want to use it for
+ their own purpose.
+
+FSGSBASE instructions enablement
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ The instructions are enumerated in CPUID leaf 7, bit 0 of EBX. If
+ available /proc/cpuinfo shows 'fsgsbase' in the flag entry of the CPUs.
+
+ The availability of the instructions does not enable them
+ automatically. The kernel has to enable them explicitly in CR4. The
+ reason for this is that older kernels make assumptions about the values in
+ the GS register and enforce them when GS base is set via
+ arch_prctl(). Allowing user space to write arbitrary values to GS base
+ would violate these assumptions and cause malfunction.
+
+ On kernels which do not enable FSGSBASE the execution of the FSGSBASE
+ instructions will fault with a #UD exception.
+
+ The kernel provides reliable information about the enabled state in the
+ ELF AUX vector. If the HWCAP2_FSGSBASE bit is set in the AUX vector, the
+ kernel has FSGSBASE instructions enabled and applications can use them.
+ The following code example shows how this detection works::
+
+   #include <sys/auxv.h>
+   #include <elf.h>
+
+   /* Will be eventually in asm/hwcap.h */
+   #ifndef HWCAP2_FSGSBASE
+   #define HWCAP2_FSGSBASE        (1 << 1)
+   #endif
+
+   ....
+
+   unsigned val = getauxval(AT_HWCAP2);
+
+   if (val & HWCAP2_FSGSBASE)
+        printf("FSGSBASE enabled\n");
+
+FSGSBASE instructions compiler support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+GCC version 4.6.4 and newer provide intrinsics for the FSGSBASE
+instructions. Clang 5 supports them as well.
+
+  =================== ===========================
+  _readfsbase_u64()   Read the FS base register
+  _readgsbase_u64()   Read the GS base register
+  _writefsbase_u64()  Write the FS base register
+  _writegsbase_u64()  Write the GS base register
+  =================== ===========================
+
+To utilize these intrinsics <immintrin.h> must be included in the source
+code and the compiler option -mfsgsbase has to be added.
+
+Compiler support for FS/GS based addressing
+-------------------------------------------
+
+GCC version 6 and newer provide support for FS/GS based addressing via
+Named Address Spaces. GCC implements the following address space
+identifiers for x86:
+
+  ========= ====================================
+  __seg_fs  Variable is addressed relative to FS
+  __seg_gs  Variable is addressed relative to GS
+  ========= ====================================
+
+The preprocessor symbols __SEG_FS and __SEG_GS are defined when these
+address spaces are supported. Code which implements fallback modes should
+check whether these symbols are defined. Usage example::
+
+  #ifdef __SEG_GS
+
+  long data0 = 0;
+  long data1 = 1;
+
+  long __seg_gs *ptr;
+
+  /* Check whether FSGSBASE is enabled by the kernel (HWCAP2_FSGSBASE) */
+  ....
+
+  /* Set GS base to point to data0 */
+  _writegsbase_u64((unsigned long long)&data0);
+
+  /* Access offset 0 of GS */
+  ptr = 0;
+  printf("data0 = %ld\n", *ptr);
+
+  /* Set GS base to point to data1 */
+  _writegsbase_u64((unsigned long long)&data1);
+  /* ptr still addresses offset 0! */
+  printf("data1 = %ld\n", *ptr);
+
+
+Clang does not provide the GCC address space identifiers, but it provides
+address spaces via an attribute based mechanism in Clang 2.6 and newer
+versions:
+
+ ==================================== =====================================
+  __attribute__((address_space(256)))  Variable is addressed relative to GS
+  __attribute__((address_space(257)))  Variable is addressed relative to FS
+ ==================================== =====================================
+
+FS/GS based addressing with inline assembly
+-------------------------------------------
+
+In case the compiler does not support address spaces, inline assembly can
+be used for FS/GS based addressing mode::
+
+	mov %fs:offset, %reg
+	mov %gs:offset, %reg
+
+	mov %reg, %fs:offset
+	mov %reg, %gs:offset
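The pieces described in fsgs.rst (HWCAP2 detection, the FSGSBASE instructions, the arch_prctl() fallback and GS-relative loads) can be combined roughly as follows. This is a minimal sketch, assuming x86-64 Linux with GCC or Clang; the helper names fsgsbase_enabled(), gs_base_write(), gs_base_read() and gs_load_long0() are hypothetical and not part of the patch. Inline assembly is used instead of the intrinsics so that no -mfsgsbase option is needed, and arch_prctl() is reached through syscall(2).

```c
#define _GNU_SOURCE
#include <elf.h>
#include <stdint.h>
#include <sys/auxv.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <asm/prctl.h>

/* Same fallback definition as in the documentation example. */
#ifndef HWCAP2_FSGSBASE
#define HWCAP2_FSGSBASE (1 << 1)
#endif

/* Nonzero if the kernel advertises FSGSBASE in the ELF AUX vector. */
static int fsgsbase_enabled(void)
{
	return (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) != 0;
}

/* Set the GS base: WRGSBASE if enabled, otherwise arch_prctl(). */
static void gs_base_write(uint64_t base)
{
	if (fsgsbase_enabled())
		__asm__ volatile("wrgsbase %0" :: "r" (base) : "memory");
	else
		syscall(SYS_arch_prctl, ARCH_SET_GS, base);
}

/* Read the GS base: RDGSBASE if enabled, otherwise arch_prctl(). */
static uint64_t gs_base_read(void)
{
	uint64_t base = 0;

	if (fsgsbase_enabled())
		__asm__ volatile("rdgsbase %0" : "=r" (base) :: "memory");
	else
		syscall(SYS_arch_prctl, ARCH_GET_GS, &base);

	return base;
}

/* Load the long stored at offset 0 of the GS segment. */
static long gs_load_long0(void)
{
	long val;

	__asm__ volatile("movq %%gs:0, %0" : "=r" (val) :: "memory");
	return val;
}
```

Calling gs_base_write() with the address of a long and then gs_load_long0() should return that variable's value; pointing the base at a second variable changes what the same GS-relative load returns, which is the effect the documentation's __seg_gs example demonstrates.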
+1
Documentation/x86/x86_64/index.rst
···
    fake-numa-for-cpusets
    cpu-hotplug-spec
    machinecheck
+   fsgs
+40
arch/x86/entry/calling.h
···
 #include <asm/percpu.h>
 #include <asm/asm-offsets.h>
 #include <asm/processor-flags.h>
+#include <asm/inst.h>
 
 /*
 
···
 #endif
 .endm
 
+.macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req
+	rdgsbase \save_reg
+	GET_PERCPU_BASE \scratch_reg
+	wrgsbase \scratch_reg
+.endm
+
 #else /* CONFIG_X86_64 */
 # undef UNWIND_HINT_IRET_REGS
 # define UNWIND_HINT_IRET_REGS
···
 	call stackleak_erase
 #endif
 .endm
+
+#ifdef CONFIG_SMP
+
+/*
+ * CPU/node NR is loaded from the limit (size) field of a special segment
+ * descriptor entry in GDT.
+ */
+.macro LOAD_CPU_AND_NODE_SEG_LIMIT reg:req
+	movq $__CPUNODE_SEG, \reg
+	lsl \reg, \reg
+.endm
+
+/*
+ * Fetch the per-CPU GSBASE value for this processor and put it in @reg.
+ * We normally use %gs for accessing per-CPU data, but we are setting up
+ * %gs here and obviously can not use %gs itself to access per-CPU data.
+ */
+.macro GET_PERCPU_BASE reg:req
+	ALTERNATIVE \
+		"LOAD_CPU_AND_NODE_SEG_LIMIT \reg", \
+		"RDPID \reg", \
+		X86_FEATURE_RDPID
+	andq $VDSO_CPUNODE_MASK, \reg
+	movq __per_cpu_offset(, \reg, 8), \reg
+.endm
+
+#else
+
+.macro GET_PERCPU_BASE reg:req
+	movq pcpu_unit_offsets(%rip), \reg
+.endm
+
+#endif /* CONFIG_SMP */
+107 -32
arch/x86/entry/entry_64.S
···
 #include <asm/frame.h>
 #include <asm/trapnr.h>
 #include <asm/nospec-branch.h>
+#include <asm/fsgsbase.h>
 #include <linux/err.h>
 
 #include "calling.h"
···
 	testb $3, CS-ORIG_RAX(%rsp)
 	jnz .Lfrom_usermode_switch_stack_\@
 
-	/*
-	 * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
-	 * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
-	 */
+	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call paranoid_entry
 
 	UNWIND_HINT_REGS
···
 	UNWIND_HINT_IRET_REGS offset=8
 	ASM_CLAC
 
-	/*
-	 * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
-	 * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
-	 */
+	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call paranoid_entry
 	UNWIND_HINT_REGS
 
···
 #endif /* CONFIG_XEN_PV */
 
 /*
- * Save all registers in pt_regs, and switch gs if needed.
- * Use slow, but surefire "are we in kernel?" check.
- * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
+ * Save all registers in pt_regs. Return GSBASE related information
+ * in EBX depending on the availability of the FSGSBASE instructions:
+ *
+ * FSGSBASE	R/EBX
+ *     N        0 -> SWAPGS on exit
+ *              1 -> no SWAPGS on exit
+ *
+ *     Y        GSBASE value at entry, must be restored in paranoid_exit
  */
 SYM_CODE_START_LOCAL(paranoid_entry)
 	UNWIND_HINT_FUNC
 	cld
 	PUSH_AND_CLEAR_REGS save_ret=1
 	ENCODE_FRAME_POINTER 8
-	movl $1, %ebx
-	movl $MSR_GS_BASE, %ecx
-	rdmsr
-	testl %edx, %edx
-	js 1f /* negative -> in kernel */
-	SWAPGS
-	xorl %ebx, %ebx
 
-1:
 	/*
 	 * Always stash CR3 in %r14. This value will be restored,
 	 * verbatim, at exit. Needed if paranoid_entry interrupted
···
 	 * This is also why CS (stashed in the "iret frame" by the
 	 * hardware at entry) can not be used: this may be a return
 	 * to kernel code, but with a user CR3 value.
+	 *
+	 * Switching CR3 does not depend on kernel GSBASE so it can
+	 * be done before switching to the kernel GSBASE. This is
+	 * required for FSGSBASE because the kernel GSBASE has to
+	 * be retrieved from a kernel internal table.
 	 */
 	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
+
+	/*
+	 * Handling GSBASE depends on the availability of FSGSBASE.
+	 *
+	 * Without FSGSBASE the kernel enforces that negative GSBASE
+	 * values indicate kernel GSBASE. With FSGSBASE no assumptions
+	 * can be made about the GSBASE value when entering from user
+	 * space.
+	 */
+	ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE
+
+	/*
+	 * Read the current GSBASE and store it in %rbx unconditionally,
+	 * retrieve and set the current CPUs kernel GSBASE. The stored value
+	 * has to be restored in paranoid_exit unconditionally.
+	 *
+	 * The MSR write ensures that no subsequent load is based on a
+	 * mispredicted GSBASE. No extra FENCE required.
+	 */
+	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
+	ret
+
+.Lparanoid_entry_checkgs:
+	/* EBX = 1 -> kernel GSBASE active, no restore required */
+	movl $1, %ebx
+	/*
+	 * The kernel-enforced convention is a negative GSBASE indicates
+	 * a kernel value. No SWAPGS needed on entry and exit.
+	 */
+	movl $MSR_GS_BASE, %ecx
+	rdmsr
+	testl %edx, %edx
+	jns .Lparanoid_entry_swapgs
+	ret
+
+.Lparanoid_entry_swapgs:
+	SWAPGS
 
 	/*
 	 * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
···
 	 */
 	FENCE_SWAPGS_KERNEL_ENTRY
 
+	/* EBX = 0 -> SWAPGS required on exit */
+	xorl %ebx, %ebx
 	ret
 SYM_CODE_END(paranoid_entry)
 
···
  *
  * We may be returning to very strange contexts (e.g. very early
  * in syscall entry), so checking for preemption here would
- * be complicated. Fortunately, we there's no good reason
- * to try to handle preemption here.
+ * be complicated. Fortunately, there's no good reason to try
+ * to handle preemption here.
  *
- * On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it)
+ * R/EBX contains the GSBASE related information depending on the
+ * availability of the FSGSBASE instructions:
+ *
+ * FSGSBASE	R/EBX
+ *     N        0 -> SWAPGS on exit
+ *              1 -> no SWAPGS on exit
+ *
+ *     Y        User space GSBASE, must be restored unconditionally
  */
 SYM_CODE_START_LOCAL(paranoid_exit)
 	UNWIND_HINT_REGS
-	testl %ebx, %ebx /* swapgs needed? */
-	jnz .Lparanoid_exit_no_swapgs
-	/* Always restore stashed CR3 value (see paranoid_entry) */
-	RESTORE_CR3 scratch_reg=%rbx save_reg=%r14
+	/*
+	 * The order of operations is important. RESTORE_CR3 requires
+	 * kernel GSBASE.
+	 *
+	 * NB to anyone to try to optimize this code: this code does
+	 * not execute at all for exceptions from user mode. Those
+	 * exceptions go through error_exit instead.
+	 */
+	RESTORE_CR3 scratch_reg=%rax save_reg=%r14
+
+	/* Handle the three GSBASE cases */
+	ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
+
+	/* With FSGSBASE enabled, unconditionally restore GSBASE */
+	wrgsbase %rbx
+	jmp restore_regs_and_return_to_kernel
+
+.Lparanoid_exit_checkgs:
+	/* On non-FSGSBASE systems, conditionally do SWAPGS */
+	testl %ebx, %ebx
+	jnz restore_regs_and_return_to_kernel
+
+	/* We are returning to a context with user GSBASE */
 	SWAPGS_UNSAFE_STACK
-	jmp restore_regs_and_return_to_kernel
-.Lparanoid_exit_no_swapgs:
-	/* Always restore stashed CR3 value (see paranoid_entry) */
-	RESTORE_CR3 scratch_reg=%rbx save_reg=%r14
-	jmp restore_regs_and_return_to_kernel
+	jmp restore_regs_and_return_to_kernel
 SYM_CODE_END(paranoid_exit)
 
 /*
···
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
 
-	testl %ebx, %ebx /* swapgs needed? */
+	/*
+	 * The above invocation of paranoid_entry stored the GSBASE
+	 * related information in R/EBX depending on the availability
+	 * of FSGSBASE.
+	 *
+	 * If FSGSBASE is enabled, restore the saved GSBASE value
+	 * unconditionally, otherwise take the conditional SWAPGS path.
+	 */
+	ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE
+
+	wrgsbase %rbx
+	jmp nmi_restore
+
+nmi_no_fsgsbase:
+	/* EBX == 0 -> invoke SWAPGS */
+	testl %ebx, %ebx
 	jnz nmi_restore
+
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
+
 nmi_restore:
 	POP_REGS
 
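The R/EBX contract between paranoid_entry and paranoid_exit can be easier to follow as a small user-space model. The following is only an illustrative sketch, not kernel code: struct cpu_model and the model_*() functions are hypothetical names, CR3 handling and speculation fences are left out, and percpu_base stands in for whatever GET_PERCPU_BASE would return.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the state the entry code cares about. */
struct cpu_model {
	bool has_fsgsbase;      /* stands in for X86_FEATURE_FSGSBASE */
	uint64_t gsbase;        /* currently active GSBASE */
	uint64_t kernel_gsbase; /* inactive value, i.e. MSR_KERNEL_GS_BASE */
	uint64_t percpu_base;   /* value GET_PERCPU_BASE would produce */
};

/* SWAPGS exchanges the active and the inactive GSBASE. */
static void model_swapgs(struct cpu_model *c)
{
	uint64_t tmp = c->gsbase;

	c->gsbase = c->kernel_gsbase;
	c->kernel_gsbase = tmp;
}

/* Returns the value paranoid_entry would leave in RBX. */
static uint64_t model_paranoid_entry(struct cpu_model *c)
{
	if (c->has_fsgsbase) {
		/* SAVE_AND_SET_GSBASE: save unconditionally, then set. */
		uint64_t saved = c->gsbase;

		c->gsbase = c->percpu_base;
		return saved;
	}

	/* Legacy convention: a negative GSBASE is a kernel GSBASE. */
	if ((int64_t)c->gsbase < 0)
		return 1;	/* no SWAPGS on exit */

	model_swapgs(c);
	return 0;		/* SWAPGS on exit */
}

/* Undo what model_paranoid_entry() did, based on RBX alone. */
static void model_paranoid_exit(struct cpu_model *c, uint64_t rbx)
{
	if (c->has_fsgsbase) {
		c->gsbase = rbx;	/* wrgsbase %rbx */
		return;
	}

	if ((uint32_t)rbx == 0)
		model_swapgs(c);
}
```

The model shows why the sign check cannot be kept once user space can write GSBASE directly: with has_fsgsbase set, a user-chosen value with the top bit set would be mistaken for a kernel GSBASE, so the FSGSBASE path saves and restores the value without inspecting it.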
+44 -15
arch/x86/include/asm/fsgsbase.h
···
 extern void x86_fsbase_write_task(struct task_struct *task, unsigned long fsbase);
 extern void x86_gsbase_write_task(struct task_struct *task, unsigned long gsbase);
 
+/* Must be protected by X86_FEATURE_FSGSBASE check. */
+
+static __always_inline unsigned long rdfsbase(void)
+{
+	unsigned long fsbase;
+
+	asm volatile("rdfsbase %0" : "=r" (fsbase) :: "memory");
+
+	return fsbase;
+}
+
+static __always_inline unsigned long rdgsbase(void)
+{
+	unsigned long gsbase;
+
+	asm volatile("rdgsbase %0" : "=r" (gsbase) :: "memory");
+
+	return gsbase;
+}
+
+static __always_inline void wrfsbase(unsigned long fsbase)
+{
+	asm volatile("wrfsbase %0" :: "r" (fsbase) : "memory");
+}
+
+static __always_inline void wrgsbase(unsigned long gsbase)
+{
+	asm volatile("wrgsbase %0" :: "r" (gsbase) : "memory");
+}
+
+#include <asm/cpufeature.h>
+
 /* Helper functions for reading/writing FS/GS base */
 
 static inline unsigned long x86_fsbase_read_cpu(void)
 {
 	unsigned long fsbase;
 
-	rdmsrl(MSR_FS_BASE, fsbase);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE))
+		fsbase = rdfsbase();
+	else
+		rdmsrl(MSR_FS_BASE, fsbase);
 
 	return fsbase;
 }
 
-static inline unsigned long x86_gsbase_read_cpu_inactive(void)
-{
-	unsigned long gsbase;
-
-	rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
-
-	return gsbase;
-}
-
 static inline void x86_fsbase_write_cpu(unsigned long fsbase)
 {
-	wrmsrl(MSR_FS_BASE, fsbase);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE))
+		wrfsbase(fsbase);
+	else
+		wrmsrl(MSR_FS_BASE, fsbase);
 }
 
-static inline void x86_gsbase_write_cpu_inactive(unsigned long gsbase)
-{
-	wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
-}
+extern unsigned long x86_gsbase_read_cpu_inactive(void);
+extern void x86_gsbase_write_cpu_inactive(unsigned long gsbase);
+extern unsigned long x86_fsgsbase_read_task(struct task_struct *task,
+					    unsigned short selector);
 
 #endif /* CONFIG_X86_64 */
 
+15
arch/x86/include/asm/inst.h
···
 	.macro MODRM mod opd1 opd2
 	.byte \mod | (\opd1 & 7) | ((\opd2 & 7) << 3)
 	.endm
+
+	.macro RDPID opd
+	REG_TYPE rdpid_opd_type \opd
+	.if rdpid_opd_type == REG_TYPE_R64
+	R64_NUM rdpid_opd \opd
+	.else
+	R32_NUM rdpid_opd \opd
+	.endif
+	.byte 0xf3
+	.if rdpid_opd > 7
+	PFX_REX rdpid_opd 0
+	.endif
+	.byte 0x0f, 0xc7
+	MODRM 0xc0 rdpid_opd 0x7
+	.endm
 #endif
 
 #endif
+2 -4
arch/x86/include/asm/processor.h
···
 DECLARE_PER_CPU(unsigned int, irq_count);
 extern asmlinkage void ignore_sysret(void);
 
-#if IS_ENABLED(CONFIG_KVM)
 /* Save actual FS/GS selectors and bases to current->thread */
-void save_fsgs_for_kvm(void);
-#endif
+void current_save_fsgs(void);
 #else /* X86_64 */
 #ifdef CONFIG_STACKPROTECTOR
 /*
···
 	this_cpu_write(cpu_tss_rw.x86_tss.sp0, sp0);
 }
 
-static inline void native_swapgs(void)
+static __always_inline void native_swapgs(void)
 {
 #ifdef CONFIG_X86_64
 	asm volatile("swapgs" ::: "memory");
+3
arch/x86/include/uapi/asm/hwcap2.h
···
 /* MONITOR/MWAIT enabled in Ring 3 */
 #define HWCAP2_RING3MWAIT	(1 << 0)
 
+/* Kernel allows FSGSBASE instructions available in Ring 3 */
+#define HWCAP2_FSGSBASE		BIT(1)
+
 #endif
+2 -4
arch/x86/kernel/cpu/bugs.c
···
 		 * If FSGSBASE is enabled, the user can put a kernel address in
 		 * GS, in which case SMAP provides no protection.
 		 *
-		 * [ NOTE: Don't check for X86_FEATURE_FSGSBASE until the
-		 *	   FSGSBASE enablement patches have been merged. ]
-		 *
 		 * If FSGSBASE is disabled, the user can only put a user space
 		 * address in GS. That makes an attack harder, but still
 		 * possible if there's no SMAP protection.
 		 */
-		if (!smap_works_speculatively()) {
+		if (boot_cpu_has(X86_FEATURE_FSGSBASE) ||
+		    !smap_works_speculatively()) {
 			/*
 			 * Mitigation can be provided from SWAPGS itself or
 			 * PTI as the CR3 write in the Meltdown mitigation
+22
arch/x86/kernel/cpu/common.c
···
 	static_key_enable(&cr_pinning.key);
 }
 
+static __init int x86_nofsgsbase_setup(char *arg)
+{
+	/* Require an exact match without trailing characters. */
+	if (strlen(arg))
+		return 0;
+
+	/* Do not emit a message if the feature is not present. */
+	if (!boot_cpu_has(X86_FEATURE_FSGSBASE))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_FSGSBASE);
+	pr_info("FSGSBASE disabled via kernel command line\n");
+	return 1;
+}
+__setup("nofsgsbase", x86_nofsgsbase_setup);
+
 /*
  * Protection Keys are not available in 32-bit mode.
  */
···
 	setup_smep(c);
 	setup_smap(c);
 	setup_umip(c);
+
+	/* Enable FSGSBASE instructions if available. */
+	if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
+		cr4_set_bits(X86_CR4_FSGSBASE);
+		elf_hwcap2 |= HWCAP2_FSGSBASE;
+	}
 
 	/*
 	 * The vendor-specific functions might have changed features.
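Because the kernel sets CR4.FSGSBASE and HWCAP2_FSGSBASE together, and "nofsgsbase" clears the feature before either happens, user space can tell "CPU lacks the instructions" apart from "CPU has them but the kernel left them disabled" by comparing CPUID with AT_HWCAP2. A rough sketch, assuming an x86 build with the GCC/Clang <cpuid.h> helper __get_cpuid_count(); the function names and the 0/1/2 return convention are hypothetical, chosen only for illustration.

```c
#include <cpuid.h>
#include <elf.h>
#include <stdbool.h>
#include <sys/auxv.h>

/* Fallback until the uapi header that defines it is installed. */
#ifndef HWCAP2_FSGSBASE
#define HWCAP2_FSGSBASE (1 << 1)
#endif

/* CPU capability: CPUID leaf 7, subleaf 0, bit 0 of EBX. */
static bool cpu_has_fsgsbase(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Returns 0 when leaf 7 is not supported by the CPU. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;

	return (ebx & 1u) != 0;
}

/* Kernel enablement as reported through the ELF AUX vector. */
static bool kernel_enabled_fsgsbase(void)
{
	return (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) != 0;
}

/*
 * 0: the CPU does not enumerate FSGSBASE
 * 1: enumerated, but not enabled by the kernel (older kernel or
 *    "nofsgsbase" on the command line); the instructions raise #UD
 * 2: enabled, user space may execute the instructions
 */
static int fsgsbase_state(void)
{
	if (kernel_enabled_fsgsbase())
		return 2;

	return cpu_has_fsgsbase() ? 1 : 0;
}
```

Only state 2 makes the instructions safe to execute; the CPUID bit alone is not sufficient, which is why the documentation added by this merge points applications at AT_HWCAP2.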
+6 -4
arch/x86/kernel/process.c
···
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
 
 #ifdef CONFIG_X86_64
-	savesegment(gs, p->thread.gsindex);
-	p->thread.gsbase = p->thread.gsindex ? 0 : current->thread.gsbase;
-	savesegment(fs, p->thread.fsindex);
-	p->thread.fsbase = p->thread.fsindex ? 0 : current->thread.fsbase;
+	current_save_fsgs();
+	p->thread.fsindex = current->thread.fsindex;
+	p->thread.fsbase = current->thread.fsbase;
+	p->thread.gsindex = current->thread.gsindex;
+	p->thread.gsbase = current->thread.gsbase;
+
 	savesegment(es, p->thread.es);
 	savesegment(ds, p->thread.ds);
 #else
+123 -16
arch/x86/kernel/process_64.c
···
 };
 
 /*
+ * Out of line to be protected from kprobes and tracing. If this would be
+ * traced or probed than any access to a per CPU variable happens with
+ * the wrong GS.
+ *
+ * It is not used on Xen paravirt. When paravirt support is needed, it
+ * needs to be renamed with native_ prefix.
+ */
+static noinstr unsigned long __rdgsbase_inactive(void)
+{
+	unsigned long gsbase;
+
+	lockdep_assert_irqs_disabled();
+
+	if (!static_cpu_has(X86_FEATURE_XENPV)) {
+		native_swapgs();
+		gsbase = rdgsbase();
+		native_swapgs();
+	} else {
+		instrumentation_begin();
+		rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
+		instrumentation_end();
+	}
+
+	return gsbase;
+}
+
+/*
+ * Out of line to be protected from kprobes and tracing. If this would be
+ * traced or probed than any access to a per CPU variable happens with
+ * the wrong GS.
+ *
+ * It is not used on Xen paravirt. When paravirt support is needed, it
+ * needs to be renamed with native_ prefix.
+ */
+static noinstr void __wrgsbase_inactive(unsigned long gsbase)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (!static_cpu_has(X86_FEATURE_XENPV)) {
+		native_swapgs();
+		wrgsbase(gsbase);
+		native_swapgs();
+	} else {
+		instrumentation_begin();
+		wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+		instrumentation_end();
+	}
+}
+
+/*
  * Saves the FS or GS base for an outgoing thread if FSGSBASE extensions are
  * not available. The goal is to be reasonably fast on non-FSGSBASE systems.
  * It's forcibly inlined because it'll generate better code and this function
···
 {
 	savesegment(fs, task->thread.fsindex);
 	savesegment(gs, task->thread.gsindex);
-	save_base_legacy(task, task->thread.fsindex, FS);
-	save_base_legacy(task, task->thread.gsindex, GS);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		/*
+		 * If FSGSBASE is enabled, we can't make any useful guesses
+		 * about the base, and user code expects us to save the current
+		 * value. Fortunately, reading the base directly is efficient.
+		 */
+		task->thread.fsbase = rdfsbase();
+		task->thread.gsbase = __rdgsbase_inactive();
+	} else {
+		save_base_legacy(task, task->thread.fsindex, FS);
+		save_base_legacy(task, task->thread.gsindex, GS);
+	}
 }
 
-#if IS_ENABLED(CONFIG_KVM)
 /*
  * While a process is running,current->thread.fsbase and current->thread.gsbase
- * may not match the corresponding CPU registers (see save_base_legacy()). KVM
- * wants an efficient way to save and restore FSBASE and GSBASE.
- * When FSGSBASE extensions are enabled, this will have to use RD{FS,GS}BASE.
+ * may not match the corresponding CPU registers (see save_base_legacy()).
  */
-void save_fsgs_for_kvm(void)
+void current_save_fsgs(void)
 {
+	unsigned long flags;
+
+	/* Interrupts need to be off for FSGSBASE */
+	local_irq_save(flags);
 	save_fsgs(current);
+	local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(save_fsgs_for_kvm);
+#if IS_ENABLED(CONFIG_KVM)
+EXPORT_SYMBOL_GPL(current_save_fsgs);
 #endif
 
 static __always_inline void loadseg(enum which_selector which,
···
 static __always_inline void x86_fsgsbase_load(struct thread_struct *prev,
 					      struct thread_struct *next)
 {
-	load_seg_legacy(prev->fsindex, prev->fsbase,
-			next->fsindex, next->fsbase, FS);
-	load_seg_legacy(prev->gsindex, prev->gsbase,
-			next->gsindex, next->gsbase, GS);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		/* Update the FS and GS selectors if they could have changed. */
+		if (unlikely(prev->fsindex || next->fsindex))
+			loadseg(FS, next->fsindex);
+		if (unlikely(prev->gsindex || next->gsindex))
+			loadseg(GS, next->gsindex);
+
+		/* Update the bases. */
+		wrfsbase(next->fsbase);
+		__wrgsbase_inactive(next->gsbase);
+	} else {
+		load_seg_legacy(prev->fsindex, prev->fsbase,
+				next->fsindex, next->fsbase, FS);
+		load_seg_legacy(prev->gsindex, prev->gsbase,
+				next->gsindex, next->gsbase, GS);
+	}
 }
 
-static unsigned long x86_fsgsbase_read_task(struct task_struct *task,
-					    unsigned short selector)
+unsigned long x86_fsgsbase_read_task(struct task_struct *task,
+				     unsigned short selector)
 {
 	unsigned short idx = selector >> 3;
 	unsigned long base;
···
 	return base;
 }
 
+unsigned long x86_gsbase_read_cpu_inactive(void)
+{
+	unsigned long gsbase;
+
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		gsbase = __rdgsbase_inactive();
+		local_irq_restore(flags);
+	} else {
+		rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
+	}
+
+	return gsbase;
+}
+
+void x86_gsbase_write_cpu_inactive(unsigned long gsbase)
+{
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		__wrgsbase_inactive(gsbase);
+		local_irq_restore(flags);
+	} else {
+		wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+	}
+}
+
 unsigned long x86_fsbase_read_task(struct task_struct *task)
 {
 	unsigned long fsbase;
 
 	if (task == current)
 		fsbase = x86_fsbase_read_cpu();
-	else if (task->thread.fsindex == 0)
+	else if (static_cpu_has(X86_FEATURE_FSGSBASE) ||
+		 (task->thread.fsindex == 0))
 		fsbase = task->thread.fsbase;
 	else
 		fsbase = x86_fsgsbase_read_task(task, task->thread.fsindex);
···
 
 	if (task == current)
 		gsbase = x86_gsbase_read_cpu_inactive();
-	else if (task->thread.gsindex == 0)
+	else if (static_cpu_has(X86_FEATURE_FSGSBASE) ||
+		 (task->thread.gsindex == 0))
 		gsbase = task->thread.gsbase;
 	else
 		gsbase = x86_fsgsbase_read_task(task, task->thread.gsindex);
+32 -28
arch/x86/kernel/ptrace.c
···
 		return -EIO;
 
 	/*
-	 * This function has some ABI oddities.
-	 *
-	 * A 32-bit ptracer probably expects that writing FS or GS will change
-	 * FSBASE or GSBASE respectively. In the absence of FSGSBASE support,
-	 * this code indeed has that effect. When FSGSBASE is added, this
-	 * will require a special case.
-	 *
-	 * For existing 64-bit ptracers, writing FS or GS *also* currently
-	 * changes the base if the selector is nonzero the next time the task
-	 * is run. This behavior may not be needed, and trying to preserve it
-	 * when FSGSBASE is added would be complicated at best.
+	 * Writes to FS and GS will change the stored selector. Whether
+	 * this changes the segment base as well depends on whether
+	 * FSGSBASE is enabled.
 	 */
 
 	switch (offset) {
···
 	case offsetof(struct user_regs_struct,fs_base):
 		if (value >= TASK_SIZE_MAX)
 			return -EIO;
-		/*
-		 * When changing the FS base, use do_arch_prctl_64()
-		 * to set the index to zero and to set the base
-		 * as requested.
-		 *
-		 * NB: This behavior is nonsensical and likely needs to
-		 * change when FSGSBASE support is added.
-		 */
-		if (child->thread.fsbase != value)
-			return do_arch_prctl_64(child, ARCH_SET_FS, value);
+		x86_fsbase_write_task(child, value);
 		return 0;
 	case offsetof(struct user_regs_struct,gs_base):
-		/*
-		 * Exactly the same here as the %fs handling above.
-		 */
 		if (value >= TASK_SIZE_MAX)
 			return -EIO;
-		if (child->thread.gsbase != value)
-			return do_arch_prctl_64(child, ARCH_SET_GS, value);
+		x86_gsbase_write_task(child, value);
 		return 0;
 #endif
 	}
···
 static int putreg32(struct task_struct *child, unsigned regno, u32 value)
 {
 	struct pt_regs *regs = task_pt_regs(child);
+	int ret;
 
 	switch (regno) {
 
 	SEG32(cs);
 	SEG32(ds);
 	SEG32(es);
-	SEG32(fs);
-	SEG32(gs);
+
+	/*
+	 * A 32-bit ptracer on a 64-bit kernel expects that writing
+	 * FS or GS will also update the base. This is needed for
+	 * operations like PTRACE_SETREGS to fully restore a saved
+	 * CPU state.
+	 */
+
+	case offsetof(struct user32, regs.fs):
+		ret = set_segment_reg(child,
+				      offsetof(struct user_regs_struct, fs),
+				      value);
+		if (ret == 0)
+			child->thread.fsbase =
+				x86_fsgsbase_read_task(child, value);
+		return ret;
+
+	case offsetof(struct user32, regs.gs):
+		ret = set_segment_reg(child,
+				      offsetof(struct user_regs_struct, gs),
+				      value);
+		if (ret == 0)
+			child->thread.gsbase =
+				x86_fsgsbase_read_task(child, value);
+		return ret;
+
 	SEG32(ss);
 
 	R32(ebx, bx);
+1 -1
arch/x86/kvm/vmx/vmx.c
···
 
 	gs_base = cpu_kernelmode_gs_base(cpu);
 	if (likely(is_64bit_mm(current->mm))) {
-		save_fsgs_for_kvm();
+		current_save_fsgs();
 		fs_sel = current->thread.fsindex;
 		gs_sel = current->thread.gsindex;
 		fs_base = current->thread.fsbase;
+1 -1
tools/testing/selftests/x86/Makefile
···
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
 			check_initial_reg_state sigreturn iopl ioperm \
 			test_vdso test_vsyscall mov_ss_trap \
-			syscall_arg_fault
+			syscall_arg_fault fsgsbase_restore
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
+23 -5
tools/testing/selftests/x86/fsgsbase.c
···
 		/* 32-bit set_thread_area */
 		long ret;
 		asm volatile ("int $0x80"
-			      : "=a" (ret) : "a" (243), "b" (low_desc)
+			      : "=a" (ret), "+m" (*low_desc)
+			      : "a" (243), "b" (low_desc)
 			      : "r8", "r9", "r10", "r11");
 		memcpy(&desc, low_desc, sizeof(desc));
 		munmap(low_desc, sizeof(desc));
···
 		 * selector value is changed or not by the GSBASE write in
 		 * a ptracer.
 		 */
-		if (gs == 0 && base == 0xFF) {
-			printf("[OK]\tGS was reset as expected\n");
-		} else {
+		if (gs != *shared_scratch) {
 			nerrs++;
-			printf("[FAIL]\tGS=0x%lx, GSBASE=0x%lx (should be 0, 0xFF)\n", gs, base);
+			printf("[FAIL]\tGS changed to %lx\n", gs);
+
+			/*
+			 * On older kernels, poking a nonzero value into the
+			 * base would zero the selector. On newer kernels,
+			 * this behavior has changed -- poking the base
+			 * changes only the base and, if FSGSBASE is not
+			 * available, this may have no effect once the tracee
+			 * is resumed.
+			 */
+			if (gs == 0)
+				printf("\tNote: this is expected behavior on older kernels.\n");
+		} else if (have_fsgsbase && (base != 0xFF)) {
+			nerrs++;
+			printf("[FAIL]\tGSBASE changed to %lx\n", base);
+		} else {
+			printf("[OK]\tGS remained 0x%hx", *shared_scratch);
+			if (have_fsgsbase)
+				printf(" and GSBASE changed to 0xFF");
+			printf("\n");
 		}
 	}
 
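The new pass/fail rules in the ptrace_write_gsbase hunk above can be restated as a pure function, which makes the three outcomes easier to see. This is only a sketch: the enum and the function name check_gs_after_poke() are hypothetical and do not appear in the selftest; the conditions are copied from the added lines.

```c
#include <stdbool.h>

/* Hypothetical result type; the selftest itself only prints and counts errors. */
enum gs_check { GS_OK, GS_FAIL_SELECTOR, GS_FAIL_BASE };

/*
 * Mirror of the new test logic: the selector must be unchanged, and the
 * base is only required to be 0xFF when FSGSBASE is available.
 */
static enum gs_check check_gs_after_poke(unsigned long gs,
					 unsigned short expected_gs,
					 unsigned long base,
					 bool have_fsgsbase)
{
	if (gs != expected_gs)
		return GS_FAIL_SELECTOR;	/* includes the old "GS reset to 0" behavior */
	if (have_fsgsbase && base != 0xFF)
		return GS_FAIL_BASE;
	return GS_OK;
}
```

Without FSGSBASE the base value is deliberately not checked, matching the comment that the poke "may have no effect once the tracee is resumed".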
+245
tools/testing/selftests/x86/fsgsbase_restore.c
···
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * fsgsbase_restore.c, test ptrace vs fsgsbase
+ * Copyright (c) 2020 Andy Lutomirski
+ *
+ * This test case simulates a tracer redirecting tracee execution to
+ * a function and then restoring tracee state using PTRACE_GETREGS and
+ * PTRACE_SETREGS. This is similar to what gdb does when doing
+ * 'p func()'. The catch is that this test has the called function
+ * modify a segment register. This makes sure that ptrace correctly
+ * restores segment state when using PTRACE_SETREGS.
+ *
+ * This is not part of fsgsbase.c, because that test is 64-bit only.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <err.h>
+#include <sys/user.h>
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#include <asm/ldt.h>
+#include <sys/mman.h>
+#include <stddef.h>
+#include <sys/ptrace.h>
+#include <sys/wait.h>
+#include <stdint.h>
+
+#define EXPECTED_VALUE 0x1337f00d
+
+#ifdef __x86_64__
+# define SEG "%gs"
+#else
+# define SEG "%fs"
+#endif
+
+static unsigned int dereference_seg_base(void)
+{
+	int ret;
+	asm volatile ("mov %" SEG ":(0), %0" : "=rm" (ret));
+	return ret;
+}
+
+static void init_seg(void)
+{
+	unsigned int *target = mmap(
+		NULL, sizeof(unsigned int),
+		PROT_READ | PROT_WRITE,
+		MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
+	if (target == MAP_FAILED)
+		err(1, "mmap");
+
+	*target = EXPECTED_VALUE;
+
+	printf("\tsegment base address = 0x%lx\n", (unsigned long)target);
+
+	struct user_desc desc = {
+		.entry_number = 0,
+		.base_addr = (unsigned int)(uintptr_t)target,
+		.limit = sizeof(unsigned int) - 1,
+		.seg_32bit = 1,
+		.contents = 0, /* Data, grow-up */
+		.read_exec_only = 0,
+		.limit_in_pages = 0,
+		.seg_not_present = 0,
+		.useable = 0
+	};
+	if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) == 0) {
+		printf("\tusing LDT slot 0\n");
+		asm volatile ("mov %0, %" SEG :: "rm" ((unsigned short)0x7));
+	} else {
+		/* No modify_ldt for us (configured out, perhaps) */
+
+		struct user_desc *low_desc = mmap(
+			NULL, sizeof(desc),
+			PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
+		memcpy(low_desc, &desc, sizeof(desc));
+
+		low_desc->entry_number = -1;
+
+		/* 32-bit set_thread_area */
+		long ret;
+		asm volatile ("int $0x80"
+			      : "=a" (ret), "+m" (*low_desc)
+			      : "a" (243), "b" (low_desc)
+#ifdef __x86_64__
+			      : "r8", "r9", "r10", "r11"
+#endif
+			);
+		memcpy(&desc, low_desc, sizeof(desc));
+		munmap(low_desc, sizeof(desc));
+
+		if (ret != 0) {
+			printf("[NOTE]\tcould not create a segment -- can't test anything\n");
+			exit(0);
+		}
+		printf("\tusing GDT slot %d\n", desc.entry_number);
+
+		unsigned short sel = (unsigned short)((desc.entry_number << 3) | 0x3);
+		asm volatile ("mov %0, %" SEG :: "rm" (sel));
+	}
+}
+
+static void tracee_zap_segment(void)
+{
+	/*
+	 * The tracer will redirect execution here. This is meant to
+	 * work like gdb's 'p func()' feature. The tricky bit is that
+	 * we modify a segment register in order to make sure that ptrace
+	 * can correctly restore segment registers.
+	 */
+	printf("\tTracee: in tracee_zap_segment()\n");
+
+	/*
+	 * Write a nonzero selector with base zero to the segment register.
+	 * Using a null selector would defeat the test on AMD pre-Zen2
+	 * CPUs, as such CPUs don't clear the base when loading a null
+	 * selector.
+	 */
+	unsigned short sel;
+	asm volatile ("mov %%ss, %0\n\t"
+		      "mov %0, %" SEG
+		      : "=rm" (sel));
+
+	pid_t pid = getpid(), tid = syscall(SYS_gettid);
+
+	printf("\tTracee is going back to sleep\n");
+	syscall(SYS_tgkill, pid, tid, SIGSTOP);
+
+	/* Should not get here. */
+	while (true) {
+		printf("[FAIL]\tTracee hit unreachable code\n");
+		pause();
+	}
+}
+
+int main()
+{
+	printf("\tSetting up a segment\n");
+	init_seg();
+
+	unsigned int val = dereference_seg_base();
+	if (val != EXPECTED_VALUE) {
+		printf("[FAIL]\tseg[0] == %x; should be %x\n", val, EXPECTED_VALUE);
+		return 1;
+	}
+	printf("[OK]\tThe segment points to the right place.\n");
+
+	pid_t chld = fork();
+	if (chld < 0)
+		err(1, "fork");
+
+	if (chld == 0) {
+		prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0, 0);
+
+		if (ptrace(PTRACE_TRACEME, 0, 0, 0) != 0)
+			err(1, "PTRACE_TRACEME");
+
+		pid_t pid = getpid(), tid = syscall(SYS_gettid);
+
+		printf("\tTracee will take a nap until signaled\n");
+		syscall(SYS_tgkill, pid, tid, SIGSTOP);
+
+		printf("\tTracee was resumed. Will re-check segment.\n");
+
+		val = dereference_seg_base();
+		if (val != EXPECTED_VALUE) {
+			printf("[FAIL]\tseg[0] == %x; should be %x\n", val, EXPECTED_VALUE);
+			exit(1);
+		}
+
+		printf("[OK]\tThe segment points to the right place.\n");
+		exit(0);
+	}
+
+	int status;
+
+	/* Wait for SIGSTOP. */
+	if (waitpid(chld, &status, 0) != chld || !WIFSTOPPED(status))
+		err(1, "waitpid");
+
+	struct user_regs_struct regs;
+
+	if (ptrace(PTRACE_GETREGS, chld, NULL, &regs) != 0)
+		err(1, "PTRACE_GETREGS");
+
+#ifdef __x86_64__
+	printf("\tChild GS=0x%lx, GSBASE=0x%lx\n", (unsigned long)regs.gs, (unsigned long)regs.gs_base);
+#else
+	printf("\tChild FS=0x%lx\n", (unsigned long)regs.xfs);
+#endif
+
+	struct user_regs_struct regs2 = regs;
+#ifdef __x86_64__
+	regs2.rip = (unsigned long)tracee_zap_segment;
+	regs2.rsp -= 128; /* Don't clobber the redzone. */
+#else
+	regs2.eip = (unsigned long)tracee_zap_segment;
+#endif
+
+	printf("\tTracer: redirecting tracee to tracee_zap_segment()\n");
+	if (ptrace(PTRACE_SETREGS, chld, NULL, &regs2) != 0)
+		err(1, "PTRACE_GETREGS");
+	if (ptrace(PTRACE_CONT, chld, NULL, NULL) != 0)
+		err(1, "PTRACE_GETREGS");
+
+	/* Wait for SIGSTOP. */
+	if (waitpid(chld, &status, 0) != chld || !WIFSTOPPED(status))
+		err(1, "waitpid");
+
+	printf("\tTracer: restoring tracee state\n");
+	if (ptrace(PTRACE_SETREGS, chld, NULL, &regs) != 0)
+		err(1, "PTRACE_GETREGS");
+	if (ptrace(PTRACE_DETACH, chld, NULL, NULL) != 0)
+		err(1, "PTRACE_GETREGS");
+
+	/* Wait for SIGSTOP. */
+	if (waitpid(chld, &status, 0) != chld)
+		err(1, "waitpid");
+
+	if (WIFSIGNALED(status)) {
+		printf("[FAIL]\tTracee crashed\n");
+		return 1;
+	}
+
+	if (!WIFEXITED(status)) {
+		printf("[FAIL]\tTracee stopped for an unexpected reason: %d\n", status);
+		return 1;
+	}
+
+	int exitcode = WEXITSTATUS(status);
+	if (exitcode != 0) {
+		printf("[FAIL]\tTracee reported failure\n");
+		return 1;
+	}
+
+	printf("[OK]\tAll is well.\n");
+	return 0;
+}
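The selector constants in init_seg() (0x7 for LDT slot 0, and (entry_number << 3) | 0x3 for a GDT slot) follow the x86 segment selector layout: bits 15..3 hold the descriptor index, bit 2 is the table indicator (0 = GDT, 1 = LDT), and bits 1..0 are the requested privilege level. A minimal sketch of that encoding follows; make_selector() and selector_index() are hypothetical helper names used for illustration and are not part of the selftest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: pack index, table indicator and RPL into a selector. */
static uint16_t make_selector(unsigned int index, bool use_ldt, unsigned int rpl)
{
	assert(index < 8192 && rpl < 4);	/* 13-bit index, 2-bit RPL */
	return (uint16_t)((index << 3) | (use_ldt ? 0x4u : 0x0u) | rpl);
}

/* Hypothetical helper: recover the descriptor index, as in "selector >> 3". */
static unsigned int selector_index(uint16_t sel)
{
	return sel >> 3;
}
```

With these helpers, make_selector(0, true, 3) gives the 0x7 used for LDT slot 0, and selector_index() matches the idx = selector >> 3 computation in x86_fsgsbase_read_task().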
+26
tools/testing/selftests/x86/syscall_arg_fault.c
···
 	if (ax != -EFAULT && ax != -ENOSYS) {
 		printf("[FAIL]\tAX had the wrong value: 0x%lx\n",
 		       (unsigned long)ax);
+		printf("\tIP = 0x%lx\n", (unsigned long)ctx->uc_mcontext.gregs[REG_IP]);
 		n_errs++;
 	} else {
 		printf("[OK]\tSeems okay\n");
···
 			: : : "memory", "flags");
 	}
 	set_eflags(get_eflags() & ~X86_EFLAGS_TF);
+
+#ifdef __x86_64__
+	printf("[RUN]\tSYSENTER with TF, invalid state, and GSBASE < 0\n");
+
+	if (sigsetjmp(jmpbuf, 1) == 0) {
+		sigtrap_consecutive_syscalls = 0;
+
+		asm volatile ("wrgsbase %%rax\n\t"
+			      :: "a" (0xffffffffffff0000UL));
+
+		set_eflags(get_eflags() | X86_EFLAGS_TF);
+		asm volatile (
+			"movl $-1, %%eax\n\t"
+			"movl $-1, %%ebx\n\t"
+			"movl $-1, %%ecx\n\t"
+			"movl $-1, %%edx\n\t"
+			"movl $-1, %%esi\n\t"
+			"movl $-1, %%edi\n\t"
+			"movl $-1, %%ebp\n\t"
+			"movl $-1, %%esp\n\t"
+			"sysenter"
+			: : : "memory", "flags");
+	}
+	set_eflags(get_eflags() & ~X86_EFLAGS_TF);
+#endif
 
 	return 0;
 }
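The new test case loads 0xffffffffffff0000 into GSBASE, which is "negative" in the sense used by the merge description: the paranoid entry path used to assume that a user GSBASE always has its top bit clear. A small sketch of that sign test, assuming a 64-bit base value; gsbase_is_negative() is a hypothetical name and not a kernel or selftest function.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: a base is "negative" when bit 63 is set, i.e. it
 * lies in the upper (kernel) half of the canonical address space.
 */
static bool gsbase_is_negative(uint64_t gsbase)
{
	return (gsbase >> 63) != 0;
}
```

The value used by the test is negative under this check, while any ordinary user-space address such as 0x00007fffffffffff is not.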