Linux kernel mirror (for testing) - git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 pti updates from Thomas Gleixner:
"This contains:

- a PTI bugfix to avoid setting reserved CR3 bits when PCID is
disabled. This seems to cause issues on a virtual machine at least
and is incorrect according to the AMD manual.

- a PTI bugfix which disables the perf BTS facility if PTI is
enabled. The BTS AUX buffer is not globally visible and causes the
CPU to fault when the mapping disappears on switching CR3 to user
space. A full fix which restores BTS on PTI is non-trivial and will
be worked on.

- PTI bugfixes for EFI and trusted boot which make sure that the user
space visible page table entries have the NX bit cleared

- removal of dead code in the PTI pagetable setup functions

- add PTI documentation

- add a selftest for vsyscall to verify that the kernel actually
implements what it advertises.

- a sysfs interface to expose vulnerability and mitigation
information so there is a coherent way for users to retrieve the
status.

- the initial spectre_v2 mitigations, aka retpoline:

+ The necessary ASM thunk and compiler support

+ The ASM variants of retpoline and the conversion of affected ASM
code

+ Make LFENCE serializing on AMD so it can be used as speculation
trap

+ The RSB fill after vmexit

- initial objtool support for retpoline

As I said in the status mail, this is most of the set of patches that
should go into 4.15, except for two straightforward patches still on
hold:

- the LFENCE retpoline add-on, which is waiting for ACKs

- the RSB fill after context switch

Both should be ready to go early next week, and with that we'll have
covered the major holes of spectre_v2 and can go back to normality"
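
As context for the retpoline entries below: a retpoline replaces an
indirect branch ("jmp *%reg" or "call *%reg") with a call/ret construct
whose speculative return path is captured in a harmless pause/jmp loop,
so the indirect branch predictor never gets to steer speculation. A
minimal, hypothetical user-space sketch of the trick (x86-64, GCC
syntax; the thunk name is made up) - the kernel's real macros are in
the asm/nospec-branch.h hunk further down:

    /* retpoline_sketch.c - illustration only, not the kernel's code.
     * Build (assumption): gcc -O2 -o retpoline_sketch retpoline_sketch.c
     */
    #include <stdio.h>

    __asm__(
    "   .text\n"
    "demo_thunk_rax:\n"          /* jump to the address in %rax...     */
    "   call 1f\n"               /* ...by pushing a return address     */
    "0: pause\n"                 /* speculation trap: predicted 'ret'  */
    "   jmp 0b\n"                /*   lands here and spins harmlessly  */
    "1: mov %rax, (%rsp)\n"      /* replace return address with target */
    "   ret\n"                   /* architecturally jumps to *%rax     */
    );

    static void hello(void)
    {
        puts("reached via retpoline");
    }

    int main(void)
    {
        /* Instead of "call *%rax", call the thunk with the target in
         * %rax. (A real build might want -mno-red-zone; sketch only.) */
        __asm__ volatile("call demo_thunk_rax"
                         : : "a" (hello)
                         : "rcx", "rdx", "rsi", "rdi", "r8", "r9",
                           "r10", "r11", "cc", "memory");
        return 0;
    }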

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
x86,perf: Disable intel_bts when PTI
security/Kconfig: Correct the Documentation reference for PTI
x86/pti: Fix !PCID and sanitize defines
selftests/x86: Add test_vsyscall
x86/retpoline: Fill return stack buffer on vmexit
x86/retpoline/irq32: Convert assembler indirect jumps
x86/retpoline/checksum32: Convert assembler indirect jumps
x86/retpoline/xen: Convert Xen hypercall indirect jumps
x86/retpoline/hyperv: Convert assembler indirect jumps
x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
x86/retpoline/entry: Convert entry assembler indirect jumps
x86/retpoline/crypto: Convert crypto assembler indirect jumps
x86/spectre: Add boot time option to select Spectre v2 mitigation
x86/retpoline: Add initial retpoline support
objtool: Allow alternatives to be ignored
objtool: Detect jumps to retpoline thunks
x86/pti: Make unpoison of pgd for trusted boot work for real
x86/alternatives: Fix optimize_nops() checking
sysfs/cpu: Fix typos in vulnerability documentation
x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
...

+1525 -100
+16
Documentation/ABI/testing/sysfs-devices-system-cpu
···
  Description:	information about CPUs heterogeneity.
  
  		cpu_capacity: capacity of cpu#.
+ 
+ What:		/sys/devices/system/cpu/vulnerabilities
+ 		/sys/devices/system/cpu/vulnerabilities/meltdown
+ 		/sys/devices/system/cpu/vulnerabilities/spectre_v1
+ 		/sys/devices/system/cpu/vulnerabilities/spectre_v2
+ Date:		January 2018
+ Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+ Description:	Information about CPU vulnerabilities
+ 
+ 		The files are named after the code names of CPU
+ 		vulnerabilities. The output of those files reflects the
+ 		state of the CPUs in the system. Possible output values:
+ 
+ 		"Not affected"	  CPU is not affected by the vulnerability
+ 		"Vulnerable"	  CPU is affected and no mitigation in effect
+ 		"Mitigation: $M"  CPU is affected and mitigation $M is in effect
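
The new files are plain text, one status line each, so they are trivial
to consume from user space. A hedged sketch (paths from the ABI entry
above; everything else is illustrative):

    /* vuln_status.c - dump the three vulnerability files. */
    #include <stdio.h>

    int main(void)
    {
        static const char *names[] = { "meltdown", "spectre_v1", "spectre_v2" };
        char path[96], line[128];

        for (int i = 0; i < 3; i++) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/vulnerabilities/%s", names[i]);
            FILE *f = fopen(path, "r");

            if (f && fgets(line, sizeof(line), f))
                printf("%-10s %s", names[i], line); /* e.g. "Mitigation: PTI" */
            if (f)
                fclose(f);
        }
        return 0;
    }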
+42 -7
Documentation/admin-guide/kernel-parameters.txt
···
  	nosmt		[KNL,S390] Disable symmetric multithreading (SMT).
  			Equivalent to smt=1.
  
+ 	nospectre_v2	[X86] Disable all mitigations for the Spectre variant 2
+ 			(indirect branch prediction) vulnerability. System may
+ 			allow data leaks with this option, which is equivalent
+ 			to spectre_v2=off.
+ 
  	noxsave		[BUGS=X86] Disables x86 extended register state save
  			and restore using xsave. The kernel will fallback to
  			enabling legacy floating-point and sse state.
···
  	no-steal-acc	[X86,KVM] Disable paravirtualized steal time accounting.
  			steal time is computed, but won't influence scheduler
  			behaviour
- 
- 	nopti		[X86-64] Disable kernel page table isolation
  
  	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
···
  	pt.		[PARIDE]
  			See Documentation/blockdev/paride.txt.
  
- 	pti=		[X86_64]
- 			Control user/kernel address space isolation:
- 			on - enable
- 			off - disable
- 			auto - default setting
+ 	pti=		[X86_64] Control Page Table Isolation of user and
+ 			kernel address spaces.  Disabling this feature
+ 			removes hardening, but improves performance of
+ 			system calls and interrupts.
+ 
+ 			on   - unconditionally enable
+ 			off  - unconditionally disable
+ 			auto - kernel detects whether your CPU model is
+ 			       vulnerable to issues that PTI mitigates
+ 
+ 			Not specifying this option is equivalent to pti=auto.
+ 
+ 	nopti		[X86_64]
+ 			Equivalent to pti=off
  
  	pty.legacy_count=
  			[KNL] Number of legacy pty's. Overwrites compiled-in
···
  
  	sonypi.*=	[HW] Sony Programmable I/O Control Device driver
  			See Documentation/laptops/sonypi.txt
+ 
+ 	spectre_v2=	[X86] Control mitigation of Spectre variant 2
+ 			(indirect branch speculation) vulnerability.
+ 
+ 			on   - unconditionally enable
+ 			off  - unconditionally disable
+ 			auto - kernel detects whether your CPU model is
+ 			       vulnerable
+ 
+ 			Selecting 'on' will, and 'auto' may, choose a
+ 			mitigation method at run time according to the
+ 			CPU, the available microcode, the setting of the
+ 			CONFIG_RETPOLINE configuration option, and the
+ 			compiler with which the kernel was built.
+ 
+ 			Specific mitigations can also be selected manually:
+ 
+ 			retpoline	  - replace indirect branches
+ 			retpoline,generic - google's original retpoline
+ 			retpoline,amd	  - AMD-specific minimal thunk
+ 
+ 			Not specifying this option is equivalent to
+ 			spectre_v2=auto.
  
  	spia_io_base=	[HW,MTD]
  	spia_fio_base=
+186
Documentation/x86/pti.txt
···
+ Overview
+ ========
+ 
+ Page Table Isolation (pti, previously known as KAISER[1]) is a
+ countermeasure against attacks on the shared user/kernel address
+ space such as the "Meltdown" approach[2].
+ 
+ To mitigate this class of attacks, we create an independent set of
+ page tables for use only when running userspace applications.  When
+ the kernel is entered via syscalls, interrupts or exceptions, the
+ page tables are switched to the full "kernel" copy.  When the system
+ switches back to user mode, the user copy is used again.
+ 
+ The userspace page tables contain only a minimal amount of kernel
+ data: only what is needed to enter/exit the kernel such as the
+ entry/exit functions themselves and the interrupt descriptor table
+ (IDT).  There are a few strictly unnecessary things that get mapped
+ such as the first C function when entering an interrupt (see
+ comments in pti.c).
+ 
+ This approach helps to ensure that side-channel attacks leveraging
+ the paging structures do not function when PTI is enabled.  It can be
+ enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+ Once enabled at compile-time, it can be disabled at boot with the
+ 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+ 
+ Page Table Management
+ =====================
+ 
+ When PTI is enabled, the kernel manages two sets of page tables.
+ The first set is very similar to the single set which is present in
+ kernels without PTI.  This includes a complete mapping of userspace
+ that the kernel can use for things like copy_to_user().
+ 
+ Although _complete_, the user portion of the kernel page tables is
+ crippled by setting the NX bit in the top level.  This ensures
+ that any missed kernel->user CR3 switch will immediately crash
+ userspace upon executing its first instruction.
+ 
+ The userspace page tables map only the kernel data needed to enter
+ and exit the kernel.  This data is entirely contained in the 'struct
+ cpu_entry_area' structure which is placed in the fixmap which gives
+ each CPU's copy of the area a compile-time-fixed virtual address.
+ 
+ For new userspace mappings, the kernel makes the entries in its
+ page tables like normal.  The only difference is when the kernel
+ makes entries in the top (PGD) level.  In addition to setting the
+ entry in the main kernel PGD, a copy of the entry is made in the
+ userspace page tables' PGD.
+ 
+ This sharing at the PGD level also inherently shares all the lower
+ layers of the page tables.  This leaves a single, shared set of
+ userspace page tables to manage.  One PTE to lock, one set of
+ accessed bits, dirty bits, etc...
+ 
+ Overhead
+ ========
+ 
+ Protection against side-channel attacks is important.  But,
+ this protection comes at a cost:
+ 
+ 1. Increased Memory Use
+   a. Each process now needs an order-1 PGD instead of order-0.
+      (Consumes an additional 4k per process).
+   b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
+      aligned so that it can be mapped by setting a single PMD
+      entry.  This consumes nearly 2MB of RAM once the kernel
+      is decompressed, but no space in the kernel image itself.
+ 
+ 2. Runtime Cost
+   a. CR3 manipulation to switch between the page table copies
+      must be done at interrupt, syscall, and exception entry
+      and exit (it can be skipped when the kernel is interrupted,
+      though.)  Moves to CR3 are on the order of a hundred
+      cycles, and are required at every entry and exit.
+   b. A "trampoline" must be used for SYSCALL entry.  This
+      trampoline depends on a smaller set of resources than the
+      non-PTI SYSCALL entry code, so requires mapping fewer
+      things into the userspace page tables.  The downside is
+      that stacks must be switched at entry time.
+   c. Global pages are disabled for all kernel structures not
+      mapped into both kernel and userspace page tables.  This
+      feature of the MMU allows different processes to share TLB
+      entries mapping the kernel.  Losing the feature means more
+      TLB misses after a context switch.  The actual loss of
+      performance is very small, however, never exceeding 1%.
+   d. Process Context IDentifiers (PCID) is a CPU feature that
+      allows us to skip flushing the entire TLB when switching page
+      tables by setting a special bit in CR3 when the page tables
+      are changed.  This makes switching the page tables (at context
+      switch, or kernel entry/exit) cheaper.  But, on systems with
+      PCID support, the context switch code must flush both the user
+      and kernel entries out of the TLB.  The user PCID TLB flush is
+      deferred until the exit to userspace, minimizing the cost.
+      See intel.com/sdm for the gory PCID/INVPCID details.
+   e. The userspace page tables must be populated for each new
+      process.  Even without PTI, the shared kernel mappings
+      are created by copying top-level (PGD) entries into each
+      new process.  But, with PTI, there are now *two* kernel
+      mappings: one in the kernel page tables that maps everything
+      and one for the entry/exit structures.  At fork(), we need to
+      copy both.
+   f. In addition to the fork()-time copying, there must also
+      be an update to the userspace PGD any time a set_pgd() is done
+      on a PGD used to map userspace.  This ensures that the kernel
+      and userspace copies always map the same userspace
+      memory.
+   g. On systems without PCID support, each CR3 write flushes
+      the entire TLB.  That means that each syscall, interrupt
+      or exception flushes the TLB.
+   h. INVPCID is a TLB-flushing instruction which allows flushing
+      of TLB entries for non-current PCIDs.  Some systems support
+      PCIDs, but do not support INVPCID.  On these systems, addresses
+      can only be flushed from the TLB for the current PCID.  When
+      flushing a kernel address, we need to flush all PCIDs, so a
+      single kernel address flush will require a TLB-flushing CR3
+      write upon the next use of every PCID.
+ 
+ Possible Future Work
+ ====================
+ 1. We can be more careful about not actually writing to CR3
+    unless its value is actually changed.
+ 2. Allow PTI to be enabled/disabled at runtime in addition to the
+    boot-time switching.
+ 
+ Testing
+ ========
+ 
+ To test stability of PTI, the following test procedure is recommended,
+ ideally doing all of these in parallel:
+ 
+ 1. Set CONFIG_DEBUG_ENTRY=y
+ 2. Run several copies of all of the tools/testing/selftests/x86/ tests
+    (excluding MPX and protection_keys) in a loop on multiple CPUs for
+    several minutes.  These tests frequently uncover corner cases in the
+    kernel entry code.  In general, old kernels might cause these tests
+    themselves to crash, but they should never crash the kernel.
+ 3. Run the 'perf' tool in a mode (top or record) that generates many
+    frequent performance monitoring non-maskable interrupts (see "NMI"
+    in /proc/interrupts).  This exercises the NMI entry/exit code which
+    is known to trigger bugs in code paths that did not expect to be
+    interrupted, including nested NMIs.  Using "-c" boosts the rate of
+    NMIs, and using two -c with separate counters encourages nested NMIs
+    and less deterministic behavior.
+ 
+    while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
+ 
+ 4. Launch a KVM virtual machine.
+ 5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
+    This has been a lightly-tested code path and needs extra scrutiny.
+ 
+ Debugging
+ =========
+ 
+ Bugs in PTI cause a few different signatures of crashes
+ that are worth noting here.
+ 
+  * Failures of the selftests/x86 code.  Usually a bug in one of the
+    more obscure corners of entry_64.S
+  * Crashes in early boot, especially around CPU bringup.  Bugs
+    in the trampoline code or mappings cause these.
+  * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
+    like screwing up a page table switch.  Also caused by
+    incorrectly mapping the IRQ handler entry code.
+  * Crashes at the first NMI.  The NMI code is separate from main
+    interrupt handlers and can have bugs that do not affect
+    normal interrupts.  Also caused by incorrectly mapping NMI
+    code.  NMIs that interrupt the entry code must be very
+    careful and can be the cause of crashes that show up when
+    running perf.
+  * Kernel crashes at the first exit to userspace.  entry_64.S
+    bugs, or failing to map some of the exit code.
+  * Crashes at first interrupt that interrupts userspace.  The paths
+    in entry_64.S that return to userspace are sometimes separate
+    from the ones that return to the kernel.
+  * Double faults: overflowing the kernel stack because of page
+    faults upon page faults.  Caused by touching non-pti-mapped
+    data in the entry code, or forgetting to switch to kernel
+    CR3 before calling into C functions which are not pti-mapped.
+  * Userspace segfaults early in boot, sometimes manifesting
+    as mount(8) failing to mount the rootfs.  These have
+    tended to be TLB invalidation issues.  Usually invalidating
+    the wrong PCID, or otherwise missing an invalidation.
+ 
+ 1. https://gruss.cc/files/kaiser.pdf
+ 2. https://meltdownattack.com/meltdown.pdf
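
Whether PTI actually ended up enabled on a given boot can also be
checked from user space: the X86_FEATURE_PTI bit (see the cpufeatures.h
hunk below) surfaces as a "pti" entry in the /proc/cpuinfo flags line.
A hedged sketch of that check (the flag name is assumed from the
feature definition):

    /* pti_check.c - report whether the running kernel enabled PTI. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[4096];
        int found = 0;

        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f)) {
            /* flags are space-separated; " pti " avoids substring hits */
            if (!strncmp(line, "flags", 5) &&
                (strstr(line, " pti ") || strstr(line, " pti\n"))) {
                found = 1;
                break;
            }
        }
        fclose(f);
        printf("PTI %s\n", found ? "enabled" : "not reported");
        return 0;
    }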
+14
arch/x86/Kconfig
···
  	select GENERIC_CLOCKEVENTS_MIN_ADJUST
  	select GENERIC_CMOS_UPDATE
  	select GENERIC_CPU_AUTOPROBE
+ 	select GENERIC_CPU_VULNERABILITIES
  	select GENERIC_EARLY_IOREMAP
  	select GENERIC_FIND_FIRST_BIT
  	select GENERIC_IOMAP
···
  config GOLDFISH
  	def_bool y
  	depends on X86_GOLDFISH
+ 
+ config RETPOLINE
+ 	bool "Avoid speculative indirect branches in kernel"
+ 	default y
+ 	help
+ 	  Compile kernel with the retpoline compiler options to guard against
+ 	  kernel-to-user data leaks by avoiding speculative indirect
+ 	  branches. Requires a compiler with -mindirect-branch=thunk-extern
+ 	  support for full protection. The kernel may run slower.
+ 
+ 	  Without compiler support, at least indirect branches in assembler
+ 	  code are eliminated. Since this includes the syscall entry path,
+ 	  it is not entirely pointless.
  
  config INTEL_RDT
  	bool "Intel Resource Director Technology support"
+10
arch/x86/Makefile
···
  #
  KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
  
+ # Avoid indirect branches in kernel to deal with Spectre
+ ifdef CONFIG_RETPOLINE
+     RETPOLINE_CFLAGS += $(call cc-option,-mindirect-branch=thunk-extern -mindirect-branch-register)
+     ifneq ($(RETPOLINE_CFLAGS),)
+         KBUILD_CFLAGS += $(RETPOLINE_CFLAGS) -DRETPOLINE
+     else
+         $(warning CONFIG_RETPOLINE=y, but not supported by the compiler. Toolchain update recommended.)
+     endif
+ endif
+ 
  archscripts: scripts_basic
  	$(Q)$(MAKE) $(build)=arch/x86/tools relocs
  
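
What those flags do to generated code can be seen on any small test
case. A hedged sketch (requires a retpoline-capable gcc, i.e. one that
accepts -mindirect-branch=thunk-extern; the register gcc picks may
vary):

    /* indirect.c - compile twice and compare the assembly:
     *
     *   gcc -O2 -S indirect.c
     *     -> the call is emitted as "jmp *%rax" / "call *%rax"
     *   gcc -O2 -S -mindirect-branch=thunk-extern \
     *       -mindirect-branch-register indirect.c
     *     -> it becomes "jmp __x86_indirect_thunk_rax", and the thunk
     *        symbol must be provided externally - in the kernel, by
     *        arch/x86/lib/retpoline.S further down.
     */
    void (*handler)(void);

    void fire(void)
    {
        handler();  /* the indirect branch the compiler rewrites */
    }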
+3 -2
arch/x86/crypto/aesni-intel_asm.S
···
  #include <linux/linkage.h>
  #include <asm/inst.h>
  #include <asm/frame.h>
+ #include <asm/nospec-branch.h>
  
  /*
   * The following macros are used to move an (un)aligned 16 byte value to/from
···
  	pxor INC, STATE4
  	movdqu IV, 0x30(OUTP)
  
- 	call *%r11
+ 	CALL_NOSPEC %r11
  
  	movdqu 0x00(OUTP), INC
  	pxor INC, STATE1
···
  	_aesni_gf128mul_x_ble()
  	movups IV, (IVP)
  
- 	call *%r11
+ 	CALL_NOSPEC %r11
  
  	movdqu 0x40(OUTP), INC
  	pxor INC, STATE1
+2 -1
arch/x86/crypto/camellia-aesni-avx-asm_64.S
···
  
  #include <linux/linkage.h>
  #include <asm/frame.h>
+ #include <asm/nospec-branch.h>
  
  #define CAMELLIA_TABLE_BYTE_LEN 272
  
···
  	vpxor 14 * 16(%rax), %xmm15, %xmm14;
  	vpxor 15 * 16(%rax), %xmm15, %xmm15;
  
- 	call *%r9;
+ 	CALL_NOSPEC %r9;
  
  	addq $(16 * 16), %rsp;
  
+2 -1
arch/x86/crypto/camellia-aesni-avx2-asm_64.S
···
  
  #include <linux/linkage.h>
  #include <asm/frame.h>
+ #include <asm/nospec-branch.h>
  
  #define CAMELLIA_TABLE_BYTE_LEN 272
  
···
  	vpxor 14 * 32(%rax), %ymm15, %ymm14;
  	vpxor 15 * 32(%rax), %ymm15, %ymm15;
  
- 	call *%r9;
+ 	CALL_NOSPEC %r9;
  
  	addq $(16 * 32), %rsp;
  
+2 -1
arch/x86/crypto/crc32c-pcl-intel-asm_64.S
···
  
  #include <asm/inst.h>
  #include <linux/linkage.h>
+ #include <asm/nospec-branch.h>
  
  ## ISCSI CRC 32 Implementation with crc32 and pclmulqdq Instruction
  
···
  	movzxw  (bufp, %rax, 2), len
  	lea	crc_array(%rip), bufp
  	lea     (bufp, len, 1), bufp
- 	jmp     *bufp
+ 	JMP_NOSPEC bufp
  
  ################################################################
  ## 2a) PROCESS FULL BLOCKS:
+19 -17
arch/x86/entry/calling.h
···
   * PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two
   * halves:
   */
- #define PTI_SWITCH_PGTABLES_MASK	(1<<PAGE_SHIFT)
- #define PTI_SWITCH_MASK		(PTI_SWITCH_PGTABLES_MASK|(1<<X86_CR3_PTI_SWITCH_BIT))
+ #define PTI_USER_PGTABLE_BIT		PAGE_SHIFT
+ #define PTI_USER_PGTABLE_MASK		(1 << PTI_USER_PGTABLE_BIT)
+ #define PTI_USER_PCID_BIT		X86_CR3_PTI_PCID_USER_BIT
+ #define PTI_USER_PCID_MASK		(1 << PTI_USER_PCID_BIT)
+ #define PTI_USER_PGTABLE_AND_PCID_MASK  (PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)
  
  .macro SET_NOFLUSH_BIT reg:req
  	bts	$X86_CR3_PCID_NOFLUSH_BIT, \reg
···
  .macro ADJUST_KERNEL_CR3 reg:req
  	ALTERNATIVE "", "SET_NOFLUSH_BIT \reg", X86_FEATURE_PCID
  	/* Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
- 	andq	$(~PTI_SWITCH_MASK), \reg
+ 	andq	$(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
  .endm
  
  .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
···
  	/* Flush needed, clear the bit */
  	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
  	movq	\scratch_reg2, \scratch_reg
- 	jmp	.Lwrcr3_\@
+ 	jmp	.Lwrcr3_pcid_\@
  
  .Lnoflush_\@:
  	movq	\scratch_reg2, \scratch_reg
  	SET_NOFLUSH_BIT \scratch_reg
  
+ .Lwrcr3_pcid_\@:
+ 	/* Flip the ASID to the user version */
+ 	orq	$(PTI_USER_PCID_MASK), \scratch_reg
+ 
  .Lwrcr3_\@:
- 	/* Flip the PGD and ASID to the user version */
- 	orq	$(PTI_SWITCH_MASK), \scratch_reg
+ 	/* Flip the PGD to the user version */
+ 	orq	$(PTI_USER_PGTABLE_MASK), \scratch_reg
  	mov	\scratch_reg, %cr3
  .Lend_\@:
  .endm
···
  	movq	%cr3, \scratch_reg
  	movq	\scratch_reg, \save_reg
  	/*
- 	 * Is the "switch mask" all zero?  That means that both of
- 	 * these are zero:
- 	 *
- 	 *	1. The user/kernel PCID bit, and
- 	 *	2. The user/kernel "bit" that points CR3 to the
- 	 *	   bottom half of the 8k PGD
- 	 *
- 	 * That indicates a kernel CR3 value, not a user CR3.
+ 	 * Test the user pagetable bit. If set, then the user page tables
+ 	 * are active. If clear CR3 already has the kernel page table
+ 	 * active.
  	 */
- 	testq	$(PTI_SWITCH_MASK), \scratch_reg
- 	jz	.Ldone_\@
+ 	bt	$PTI_USER_PGTABLE_BIT, \scratch_reg
+ 	jnc	.Ldone_\@
  
  	ADJUST_KERNEL_CR3 \scratch_reg
  	movq	\scratch_reg, %cr3
···
  	 * KERNEL pages can always resume with NOFLUSH as we do
  	 * explicit flushes.
  	 */
- 	bt	$X86_CR3_PTI_SWITCH_BIT, \save_reg
+ 	bt	$PTI_USER_PGTABLE_BIT, \save_reg
  	jnc	.Lnoflush_\@
  
  	/*
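
The bit arithmetic behind the renamed masks is easy to check in
isolation: with an 8k, order-1 PGD, bit 12 of CR3 selects the user
half, and bit 11 is the user PCID bit. A hedged sketch (constants
mirror the hunk above; the CR3 value is hypothetical):

    /* cr3_bits.c - kernel vs. user CR3 under PTI. */
    #include <stdint.h>
    #include <stdio.h>

    #define PTI_USER_PGTABLE_BIT 12  /* PAGE_SHIFT */
    #define PTI_USER_PCID_BIT    11  /* X86_CR3_PTI_PCID_USER_BIT */

    int main(void)
    {
        uint64_t kernel_cr3 = 0x1000000;  /* hypothetical PGD address */
        uint64_t user_cr3 = kernel_cr3
                          | (1ULL << PTI_USER_PGTABLE_BIT)
                          | (1ULL << PTI_USER_PCID_BIT);

        /* SWITCH_TO_KERNEL_CR3 is the reverse: clear both bits again. */
        printf("kernel CR3 %#llx -> user CR3 %#llx\n",
               (unsigned long long)kernel_cr3,
               (unsigned long long)user_cr3);
        return 0;
    }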
+3 -2
arch/x86/entry/entry_32.S
···
  #include <asm/asm.h>
  #include <asm/smap.h>
  #include <asm/frame.h>
+ #include <asm/nospec-branch.h>
  
  	.section .entry.text, "ax"
  
···
  
  	/* kernel thread */
  1:	movl	%edi, %eax
- 	call	*%ebx
+ 	CALL_NOSPEC %ebx
  	/*
  	 * A kernel thread is allowed to return here after successfully
  	 * calling do_execve().  Exit to userspace to complete the execve()
···
  	movl	%ecx, %es
  	TRACE_IRQS_OFF
  	movl	%esp, %eax			# pt_regs pointer
- 	call	*%edi
+ 	CALL_NOSPEC %edi
  	jmp	ret_from_exception
  END(common_exception)
  
+9 -3
arch/x86/entry/entry_64.S
···
  #include <asm/pgtable_types.h>
  #include <asm/export.h>
  #include <asm/frame.h>
+ #include <asm/nospec-branch.h>
  #include <linux/err.h>
  
  #include "calling.h"
···
  	 */
  	pushq	%rdi
  	movq	$entry_SYSCALL_64_stage2, %rdi
- 	jmp	*%rdi
+ 	JMP_NOSPEC %rdi
  END(entry_SYSCALL_64_trampoline)
  
  	.popsection
···
  	 * It might end up jumping to the slow path.  If it jumps, RAX
  	 * and all argument registers are clobbered.
  	 */
+ #ifdef CONFIG_RETPOLINE
+ 	movq	sys_call_table(, %rax, 8), %rax
+ 	call	__x86_indirect_thunk_rax
+ #else
  	call	*sys_call_table(, %rax, 8)
+ #endif
  .Lentry_SYSCALL_64_after_fastpath_call:
  
  	movq	%rax, RAX(%rsp)
···
  	jmp	entry_SYSCALL64_slow_path
  
  1:
- 	jmp	*%rax				/* Called from C */
+ 	JMP_NOSPEC %rax			/* Called from C */
  END(stub_ptregs_64)
  
  .macro ptregs_stub func
···
  1:
  	/* kernel thread */
  	movq	%r12, %rdi
- 	call	*%rbx
+ 	CALL_NOSPEC %rbx
  	/*
  	 * A kernel thread is allowed to return here after successfully
  	 * calling do_execve().  Exit to userspace to complete the execve()
+18
arch/x86/events/intel/bts.c
···
  	if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
  		return -ENODEV;
  
+ 	if (boot_cpu_has(X86_FEATURE_PTI)) {
+ 		/*
+ 		 * BTS hardware writes through a virtual memory map; we must
+ 		 * either use the kernel physical map, or the user mapping of
+ 		 * the AUX buffer.
+ 		 *
+ 		 * However, since this driver supports per-CPU and per-task inherit
+ 		 * we cannot use the user mapping since it will not be available
+ 		 * if we're not running the owning process.
+ 		 *
+ 		 * With PTI we can't use the kernel map either, because it's not
+ 		 * there when we run userspace.
+ 		 *
+ 		 * For now, disable this driver when using PTI.
+ 		 */
+ 		return -ENODEV;
+ 	}
+ 
  	bts_pmu.capabilities	= PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_ITRACE |
  				  PERF_PMU_CAP_EXCLUSIVE;
  	bts_pmu.task_ctx_nr	= perf_sw_context;
+25
arch/x86/include/asm/asm-prototypes.h
···
  #include <asm/pgtable.h>
  #include <asm/special_insns.h>
  #include <asm/preempt.h>
+ #include <asm/asm.h>
  
  #ifndef CONFIG_X86_CMPXCHG64
  extern void cmpxchg8b_emu(void);
  #endif
+ 
+ #ifdef CONFIG_RETPOLINE
+ #ifdef CONFIG_X86_32
+ #define INDIRECT_THUNK(reg) extern asmlinkage void __x86_indirect_thunk_e ## reg(void);
+ #else
+ #define INDIRECT_THUNK(reg) extern asmlinkage void __x86_indirect_thunk_r ## reg(void);
+ INDIRECT_THUNK(8)
+ INDIRECT_THUNK(9)
+ INDIRECT_THUNK(10)
+ INDIRECT_THUNK(11)
+ INDIRECT_THUNK(12)
+ INDIRECT_THUNK(13)
+ INDIRECT_THUNK(14)
+ INDIRECT_THUNK(15)
+ #endif
+ INDIRECT_THUNK(ax)
+ INDIRECT_THUNK(bx)
+ INDIRECT_THUNK(cx)
+ INDIRECT_THUNK(dx)
+ INDIRECT_THUNK(si)
+ INDIRECT_THUNK(di)
+ INDIRECT_THUNK(bp)
+ INDIRECT_THUNK(sp)
+ #endif /* CONFIG_RETPOLINE */
+4
arch/x86/include/asm/cpufeatures.h
···
  #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
  #define X86_FEATURE_SME			( 7*32+10) /* AMD Secure Memory Encryption */
  #define X86_FEATURE_PTI			( 7*32+11) /* Kernel Page Table Isolation enabled */
+ #define X86_FEATURE_RETPOLINE		( 7*32+12) /* Generic Retpoline mitigation for Spectre variant 2 */
+ #define X86_FEATURE_RETPOLINE_AMD	( 7*32+13) /* AMD Retpoline mitigation for Spectre variant 2 */
  #define X86_FEATURE_INTEL_PPIN		( 7*32+14) /* Intel Processor Inventory Number */
  #define X86_FEATURE_INTEL_PT		( 7*32+15) /* Intel Processor Trace */
  #define X86_FEATURE_AVX512_4VNNIW	( 7*32+16) /* AVX-512 Neural Network Instructions */
···
  #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
  #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
  #define X86_BUG_CPU_MELTDOWN		X86_BUG(14) /* CPU is affected by meltdown attack and needs kernel page table isolation */
+ #define X86_BUG_SPECTRE_V1		X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
+ #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
  
  #endif /* _ASM_X86_CPUFEATURES_H */
+10 -8
arch/x86/include/asm/mshyperv.h
···
  #include <linux/nmi.h>
  #include <asm/io.h>
  #include <asm/hyperv.h>
+ #include <asm/nospec-branch.h>
  
  /*
   * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
···
  		return U64_MAX;
  
  	__asm__ __volatile__("mov %4, %%r8\n"
- 			     "call *%5"
+ 			     CALL_NOSPEC
  			     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
  			       "+c" (control), "+d" (input_address)
- 			     :  "r" (output_address), "m" (hv_hypercall_pg)
+ 			     :  "r" (output_address),
+ 				THUNK_TARGET(hv_hypercall_pg)
  			     : "cc", "memory", "r8", "r9", "r10", "r11");
  #else
  	u32 input_address_hi = upper_32_bits(input_address);
···
  	if (!hv_hypercall_pg)
  		return U64_MAX;
  
- 	__asm__ __volatile__("call *%7"
+ 	__asm__ __volatile__(CALL_NOSPEC
  			     : "=A" (hv_status),
  			       "+c" (input_address_lo), ASM_CALL_CONSTRAINT
  			     : "A" (control),
  			       "b" (input_address_hi),
  			       "D"(output_address_hi), "S"(output_address_lo),
- 			       "m" (hv_hypercall_pg)
+ 			       THUNK_TARGET(hv_hypercall_pg)
  			     : "cc", "memory");
  #endif /* !x86_64 */
  	return hv_status;
···
  
  #ifdef CONFIG_X86_64
  	{
- 		__asm__ __volatile__("call *%4"
+ 		__asm__ __volatile__(CALL_NOSPEC
  				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
  				       "+c" (control), "+d" (input1)
- 				     : "m" (hv_hypercall_pg)
+ 				     : THUNK_TARGET(hv_hypercall_pg)
  				     : "cc", "r8", "r9", "r10", "r11");
  	}
  #else
···
  		u32 input1_hi = upper_32_bits(input1);
  		u32 input1_lo = lower_32_bits(input1);
  
- 		__asm__ __volatile__ ("call *%5"
+ 		__asm__ __volatile__ (CALL_NOSPEC
  				      : "=A"(hv_status),
  					"+c"(input1_lo),
  					ASM_CALL_CONSTRAINT
  				      : "A" (control),
  					"b" (input1_hi),
- 					"m" (hv_hypercall_pg)
+ 					THUNK_TARGET(hv_hypercall_pg)
  				      : "cc", "edi", "esi");
  	}
  #endif
+3
arch/x86/include/asm/msr-index.h
···
  #define FAM10H_MMIO_CONF_BASE_MASK	0xfffffffULL
  #define FAM10H_MMIO_CONF_BASE_SHIFT	20
  #define MSR_FAM10H_NODE_ID		0xc001100c
+ #define MSR_F10H_DECFG			0xc0011029
+ #define MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT	1
+ #define MSR_F10H_DECFG_LFENCE_SERIALIZE	BIT_ULL(MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT)
  
  /* K8 MSRs */
  #define MSR_K8_TOP_MEM1			0xc001001a
+214
arch/x86/include/asm/nospec-branch.h
···
+ /* SPDX-License-Identifier: GPL-2.0 */
+ 
+ #ifndef __NOSPEC_BRANCH_H__
+ #define __NOSPEC_BRANCH_H__
+ 
+ #include <asm/alternative.h>
+ #include <asm/alternative-asm.h>
+ #include <asm/cpufeatures.h>
+ 
+ /*
+  * Fill the CPU return stack buffer.
+  *
+  * Each entry in the RSB, if used for a speculative 'ret', contains an
+  * infinite 'pause; jmp' loop to capture speculative execution.
+  *
+  * This is required in various cases for retpoline and IBRS-based
+  * mitigations for the Spectre variant 2 vulnerability. Sometimes to
+  * eliminate potentially bogus entries from the RSB, and sometimes
+  * purely to ensure that it doesn't get empty, which on some CPUs would
+  * allow predictions from other (unwanted!) sources to be used.
+  *
+  * We define a CPP macro such that it can be used from both .S files and
+  * inline assembly. It's possible to do a .macro and then include that
+  * from C via asm(".include <asm/nospec-branch.h>") but let's not go there.
+  */
+ 
+ #define RSB_CLEAR_LOOPS		32	/* To forcibly overwrite all entries */
+ #define RSB_FILL_LOOPS		16	/* To avoid underflow */
+ 
+ /*
+  * Google experimented with loop-unrolling and this turned out to be
+  * the optimal version - two calls, each with their own speculation
+  * trap should their return address end up getting used, in a loop.
+  */
+ #define __FILL_RETURN_BUFFER(reg, nr, sp)	\
+ 	mov	$(nr/2), reg;			\
+ 771:						\
+ 	call	772f;				\
+ 773:	/* speculation trap */			\
+ 	pause;					\
+ 	jmp	773b;				\
+ 772:						\
+ 	call	774f;				\
+ 775:	/* speculation trap */			\
+ 	pause;					\
+ 	jmp	775b;				\
+ 774:						\
+ 	dec	reg;				\
+ 	jnz	771b;				\
+ 	add	$(BITS_PER_LONG/8) * nr, sp;
+ 
+ #ifdef __ASSEMBLY__
+ 
+ /*
+  * This should be used immediately before a retpoline alternative.  It tells
+  * objtool where the retpolines are so that it can make sense of the control
+  * flow by just reading the original instruction(s) and ignoring the
+  * alternatives.
+  */
+ .macro ANNOTATE_NOSPEC_ALTERNATIVE
+ 	.Lannotate_\@:
+ 	.pushsection .discard.nospec
+ 	.long .Lannotate_\@ - .
+ 	.popsection
+ .endm
+ 
+ /*
+  * These are the bare retpoline primitives for indirect jmp and call.
+  * Do not use these directly; they only exist to make the ALTERNATIVE
+  * invocation below less ugly.
+  */
+ .macro RETPOLINE_JMP reg:req
+ 	call	.Ldo_rop_\@
+ .Lspec_trap_\@:
+ 	pause
+ 	jmp	.Lspec_trap_\@
+ .Ldo_rop_\@:
+ 	mov	\reg, (%_ASM_SP)
+ 	ret
+ .endm
+ 
+ /*
+  * This is a wrapper around RETPOLINE_JMP so the called function in reg
+  * returns to the instruction after the macro.
+  */
+ .macro RETPOLINE_CALL reg:req
+ 	jmp	.Ldo_call_\@
+ .Ldo_retpoline_jmp_\@:
+ 	RETPOLINE_JMP	\reg
+ .Ldo_call_\@:
+ 	call	.Ldo_retpoline_jmp_\@
+ .endm
+ 
+ /*
+  * JMP_NOSPEC and CALL_NOSPEC macros can be used instead of a simple
+  * indirect jmp/call which may be susceptible to the Spectre variant 2
+  * attack.
+  */
+ .macro JMP_NOSPEC reg:req
+ #ifdef CONFIG_RETPOLINE
+ 	ANNOTATE_NOSPEC_ALTERNATIVE
+ 	ALTERNATIVE_2 __stringify(jmp *\reg),				\
+ 		__stringify(RETPOLINE_JMP \reg), X86_FEATURE_RETPOLINE,	\
+ 		__stringify(lfence; jmp *\reg), X86_FEATURE_RETPOLINE_AMD
+ #else
+ 	jmp	*\reg
+ #endif
+ .endm
+ 
+ .macro CALL_NOSPEC reg:req
+ #ifdef CONFIG_RETPOLINE
+ 	ANNOTATE_NOSPEC_ALTERNATIVE
+ 	ALTERNATIVE_2 __stringify(call *\reg),				\
+ 		__stringify(RETPOLINE_CALL \reg), X86_FEATURE_RETPOLINE,\
+ 		__stringify(lfence; call *\reg), X86_FEATURE_RETPOLINE_AMD
+ #else
+ 	call	*\reg
+ #endif
+ .endm
+ 
+ /*
+  * A simpler FILL_RETURN_BUFFER macro. Don't make people use the CPP
+  * monstrosity above, manually.
+  */
+ .macro FILL_RETURN_BUFFER reg:req nr:req ftr:req
+ #ifdef CONFIG_RETPOLINE
+ 	ANNOTATE_NOSPEC_ALTERNATIVE
+ 	ALTERNATIVE "jmp .Lskip_rsb_\@",				\
+ 		__stringify(__FILL_RETURN_BUFFER(\reg,\nr,%_ASM_SP))	\
+ 		\ftr
+ .Lskip_rsb_\@:
+ #endif
+ .endm
+ 
+ #else /* __ASSEMBLY__ */
+ 
+ #define ANNOTATE_NOSPEC_ALTERNATIVE				\
+ 	"999:\n\t"						\
+ 	".pushsection .discard.nospec\n\t"			\
+ 	".long 999b - .\n\t"					\
+ 	".popsection\n\t"
+ 
+ #if defined(CONFIG_X86_64) && defined(RETPOLINE)
+ 
+ /*
+  * Since the inline asm uses the %V modifier which is only in newer GCC,
+  * the 64-bit one is dependent on RETPOLINE not CONFIG_RETPOLINE.
+  */
+ # define CALL_NOSPEC						\
+ 	ANNOTATE_NOSPEC_ALTERNATIVE				\
+ 	ALTERNATIVE(						\
+ 	"call *%[thunk_target]\n",				\
+ 	"call __x86_indirect_thunk_%V[thunk_target]\n",		\
+ 	X86_FEATURE_RETPOLINE)
+ # define THUNK_TARGET(addr) [thunk_target] "r" (addr)
+ 
+ #elif defined(CONFIG_X86_32) && defined(CONFIG_RETPOLINE)
+ /*
+  * For i386 we use the original ret-equivalent retpoline, because
+  * otherwise we'll run out of registers. We don't care about CET
+  * here, anyway.
+  */
+ # define CALL_NOSPEC ALTERNATIVE("call *%[thunk_target]\n",	\
+ 	"       jmp    904f;\n"					\
+ 	"       .align 16\n"					\
+ 	"901:	call   903f;\n"					\
+ 	"902:	pause;\n"					\
+ 	"       jmp    902b;\n"					\
+ 	"       .align 16\n"					\
+ 	"903:	addl   $4, %%esp;\n"				\
+ 	"       pushl  %[thunk_target];\n"			\
+ 	"       ret;\n"						\
+ 	"       .align 16\n"					\
+ 	"904:	call   901b;\n",				\
+ 	X86_FEATURE_RETPOLINE)
+ 
+ # define THUNK_TARGET(addr) [thunk_target] "rm" (addr)
+ #else /* No retpoline for C / inline asm */
+ # define CALL_NOSPEC "call *%[thunk_target]\n"
+ # define THUNK_TARGET(addr) [thunk_target] "rm" (addr)
+ #endif
+ 
+ /* The Spectre V2 mitigation variants */
+ enum spectre_v2_mitigation {
+ 	SPECTRE_V2_NONE,
+ 	SPECTRE_V2_RETPOLINE_MINIMAL,
+ 	SPECTRE_V2_RETPOLINE_MINIMAL_AMD,
+ 	SPECTRE_V2_RETPOLINE_GENERIC,
+ 	SPECTRE_V2_RETPOLINE_AMD,
+ 	SPECTRE_V2_IBRS,
+ };
+ 
+ /*
+  * On VMEXIT we must ensure that no RSB predictions learned in the guest
+  * can be followed in the host, by overwriting the RSB completely. Both
+  * retpoline and IBRS mitigations for Spectre v2 need this; only on future
+  * CPUs with IBRS_ATT *might* it be avoided.
+  */
+ static inline void vmexit_fill_RSB(void)
+ {
+ #ifdef CONFIG_RETPOLINE
+ 	unsigned long loops = RSB_CLEAR_LOOPS / 2;
+ 
+ 	asm volatile (ANNOTATE_NOSPEC_ALTERNATIVE
+ 		      ALTERNATIVE("jmp 910f",
+ 				  __stringify(__FILL_RETURN_BUFFER(%0, RSB_CLEAR_LOOPS, %1)),
+ 				  X86_FEATURE_RETPOLINE)
+ 		      "910:"
+ 		      : "=&r" (loops), ASM_CALL_CONSTRAINT
+ 		      : "r" (loops) : "memory" );
+ #endif
+ }
+ #endif /* __ASSEMBLY__ */
+ #endif /* __NOSPEC_BRANCH_H__ */
+1 -1
arch/x86/include/asm/processor-flags.h
···
  #define CR3_NOFLUSH	BIT_ULL(63)
  
  #ifdef CONFIG_PAGE_TABLE_ISOLATION
- # define X86_CR3_PTI_SWITCH_BIT	11
+ # define X86_CR3_PTI_PCID_USER_BIT	11
  #endif
  
  #else
+3 -3
arch/x86/include/asm/tlbflush.h
···
  	 * Make sure that the dynamic ASID space does not conflict with the
  	 * bit we are using to switch between user and kernel ASIDs.
  	 */
- 	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1 << X86_CR3_PTI_SWITCH_BIT));
+ 	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1 << X86_CR3_PTI_PCID_USER_BIT));
  
  	/*
  	 * The ASID being passed in here should have respected the
  	 * MAX_ASID_AVAILABLE and thus never have the switch bit set.
  	 */
- 	VM_WARN_ON_ONCE(asid & (1 << X86_CR3_PTI_SWITCH_BIT));
+ 	VM_WARN_ON_ONCE(asid & (1 << X86_CR3_PTI_PCID_USER_BIT));
  #endif
  	/*
  	 * The dynamically-assigned ASIDs that get passed in are small
···
  {
  	u16 ret = kern_pcid(asid);
  #ifdef CONFIG_PAGE_TABLE_ISOLATION
- 	ret |= 1 << X86_CR3_PTI_SWITCH_BIT;
+ 	ret |= 1 << X86_CR3_PTI_PCID_USER_BIT;
  #endif
  	return ret;
  }
+3 -2
arch/x86/include/asm/xen/hypercall.h
···
  #include <asm/page.h>
  #include <asm/pgtable.h>
  #include <asm/smap.h>
+ #include <asm/nospec-branch.h>
  
  #include <xen/interface/xen.h>
  #include <xen/interface/sched.h>
···
  	__HYPERCALL_5ARG(a1, a2, a3, a4, a5);
  
  	stac();
- 	asm volatile("call *%[call]"
+ 	asm volatile(CALL_NOSPEC
  		     : __HYPERCALL_5PARAM
- 		     : [call] "a" (&hypercall_page[call])
+ 		     : [thunk_target] "a" (&hypercall_page[call])
  		     : __HYPERCALL_CLOBBER5);
  	clac();
  
+5 -2
arch/x86/kernel/alternative.c
···
  static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *instr)
  {
  	unsigned long flags;
+ 	int i;
  
- 	if (instr[0] != 0x90)
- 		return;
+ 	for (i = 0; i < a->padlen; i++) {
+ 		if (instr[i] != 0x90)
+ 			return;
+ 	}
  
  	local_irq_save(flags);
  	add_nops(instr + (a->instrlen - a->padlen), a->padlen);
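
The logic of that fix in isolation: the padding area following a
patched instruction may only be collapsed into an optimized NOP if
every byte of it is a NOP (0x90), not just the first. A hedged,
stand-alone restatement (buffer contents are hypothetical):

    /* nops_check.c - the optimize_nops() guard as plain C. */
    #include <stdio.h>

    static int all_nops(const unsigned char *pad, int padlen)
    {
        for (int i = 0; i < padlen; i++)
            if (pad[i] != 0x90)
                return 0;
        return 1;
    }

    int main(void)
    {
        unsigned char ok[]  = { 0x90, 0x90, 0x90 };
        unsigned char bad[] = { 0x90, 0xcc, 0x90 };  /* 0xcc is not a NOP */

        printf("ok:  %s\n", all_nops(ok, 3)  ? "optimize" : "leave alone");
        printf("bad: %s\n", all_nops(bad, 3) ? "optimize" : "leave alone");
        return 0;
    }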
+26 -2
arch/x86/kernel/cpu/amd.c
···
  		set_cpu_cap(c, X86_FEATURE_K8);
  
  	if (cpu_has(c, X86_FEATURE_XMM2)) {
- 		/* MFENCE stops RDTSC speculation */
- 		set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
+ 		unsigned long long val;
+ 		int ret;
+ 
+ 		/*
+ 		 * A serializing LFENCE has less overhead than MFENCE, so
+ 		 * use it for execution serialization.  On families which
+ 		 * don't have that MSR, LFENCE is already serializing.
+ 		 * msr_set_bit() uses the safe accessors, too, even if the MSR
+ 		 * is not present.
+ 		 */
+ 		msr_set_bit(MSR_F10H_DECFG,
+ 			    MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT);
+ 
+ 		/*
+ 		 * Verify that the MSR write was successful (could be running
+ 		 * under a hypervisor) and only then assume that LFENCE is
+ 		 * serializing.
+ 		 */
+ 		ret = rdmsrl_safe(MSR_F10H_DECFG, &val);
+ 		if (!ret && (val & MSR_F10H_DECFG_LFENCE_SERIALIZE)) {
+ 			/* A serializing LFENCE stops RDTSC speculation */
+ 			set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
+ 		} else {
+ 			/* MFENCE stops RDTSC speculation */
+ 			set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
+ 		}
  	}
  
  	/*
+185
arch/x86/kernel/cpu/bugs.c
···
   */
  #include <linux/init.h>
  #include <linux/utsname.h>
+ #include <linux/cpu.h>
+ 
+ #include <asm/nospec-branch.h>
+ #include <asm/cmdline.h>
  #include <asm/bugs.h>
  #include <asm/processor.h>
  #include <asm/processor-flags.h>
···
  #include <asm/pgtable.h>
  #include <asm/set_memory.h>
  
+ static void __init spectre_v2_select_mitigation(void);
+ 
  void __init check_bugs(void)
  {
  	identify_boot_cpu();
···
  		pr_info("CPU: ");
  		print_cpu_info(&boot_cpu_data);
  	}
+ 
+ 	/* Select the proper spectre mitigation before patching alternatives */
+ 	spectre_v2_select_mitigation();
  
  #ifdef CONFIG_X86_32
  	/*
···
  	set_memory_4k((unsigned long)__va(0), 1);
  #endif
  }
+ 
+ /* The kernel command line selection */
+ enum spectre_v2_mitigation_cmd {
+ 	SPECTRE_V2_CMD_NONE,
+ 	SPECTRE_V2_CMD_AUTO,
+ 	SPECTRE_V2_CMD_FORCE,
+ 	SPECTRE_V2_CMD_RETPOLINE,
+ 	SPECTRE_V2_CMD_RETPOLINE_GENERIC,
+ 	SPECTRE_V2_CMD_RETPOLINE_AMD,
+ };
+ 
+ static const char *spectre_v2_strings[] = {
+ 	[SPECTRE_V2_NONE]			= "Vulnerable",
+ 	[SPECTRE_V2_RETPOLINE_MINIMAL]		= "Vulnerable: Minimal generic ASM retpoline",
+ 	[SPECTRE_V2_RETPOLINE_MINIMAL_AMD]	= "Vulnerable: Minimal AMD ASM retpoline",
+ 	[SPECTRE_V2_RETPOLINE_GENERIC]		= "Mitigation: Full generic retpoline",
+ 	[SPECTRE_V2_RETPOLINE_AMD]		= "Mitigation: Full AMD retpoline",
+ };
+ 
+ #undef pr_fmt
+ #define pr_fmt(fmt)     "Spectre V2 mitigation: " fmt
+ 
+ static enum spectre_v2_mitigation spectre_v2_enabled = SPECTRE_V2_NONE;
+ 
+ static void __init spec2_print_if_insecure(const char *reason)
+ {
+ 	if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
+ 		pr_info("%s\n", reason);
+ }
+ 
+ static void __init spec2_print_if_secure(const char *reason)
+ {
+ 	if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
+ 		pr_info("%s\n", reason);
+ }
+ 
+ static inline bool retp_compiler(void)
+ {
+ 	return __is_defined(RETPOLINE);
+ }
+ 
+ static inline bool match_option(const char *arg, int arglen, const char *opt)
+ {
+ 	int len = strlen(opt);
+ 
+ 	return len == arglen && !strncmp(arg, opt, len);
+ }
+ 
+ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
+ {
+ 	char arg[20];
+ 	int ret;
+ 
+ 	ret = cmdline_find_option(boot_command_line, "spectre_v2", arg,
+ 				  sizeof(arg));
+ 	if (ret > 0)  {
+ 		if (match_option(arg, ret, "off")) {
+ 			goto disable;
+ 		} else if (match_option(arg, ret, "on")) {
+ 			spec2_print_if_secure("force enabled on command line.");
+ 			return SPECTRE_V2_CMD_FORCE;
+ 		} else if (match_option(arg, ret, "retpoline")) {
+ 			spec2_print_if_insecure("retpoline selected on command line.");
+ 			return SPECTRE_V2_CMD_RETPOLINE;
+ 		} else if (match_option(arg, ret, "retpoline,amd")) {
+ 			if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD) {
+ 				pr_err("retpoline,amd selected but CPU is not AMD. Switching to AUTO select\n");
+ 				return SPECTRE_V2_CMD_AUTO;
+ 			}
+ 			spec2_print_if_insecure("AMD retpoline selected on command line.");
+ 			return SPECTRE_V2_CMD_RETPOLINE_AMD;
+ 		} else if (match_option(arg, ret, "retpoline,generic")) {
+ 			spec2_print_if_insecure("generic retpoline selected on command line.");
+ 			return SPECTRE_V2_CMD_RETPOLINE_GENERIC;
+ 		} else if (match_option(arg, ret, "auto")) {
+ 			return SPECTRE_V2_CMD_AUTO;
+ 		}
+ 	}
+ 
+ 	if (!cmdline_find_option_bool(boot_command_line, "nospectre_v2"))
+ 		return SPECTRE_V2_CMD_AUTO;
+ disable:
+ 	spec2_print_if_insecure("disabled on command line.");
+ 	return SPECTRE_V2_CMD_NONE;
+ }
+ 
+ static void __init spectre_v2_select_mitigation(void)
+ {
+ 	enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
+ 	enum spectre_v2_mitigation mode = SPECTRE_V2_NONE;
+ 
+ 	/*
+ 	 * If the CPU is not affected and the command line mode is NONE or AUTO
+ 	 * then nothing to do.
+ 	 */
+ 	if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2) &&
+ 	    (cmd == SPECTRE_V2_CMD_NONE || cmd == SPECTRE_V2_CMD_AUTO))
+ 		return;
+ 
+ 	switch (cmd) {
+ 	case SPECTRE_V2_CMD_NONE:
+ 		return;
+ 
+ 	case SPECTRE_V2_CMD_FORCE:
+ 		/* FALLTHRU */
+ 	case SPECTRE_V2_CMD_AUTO:
+ 		goto retpoline_auto;
+ 
+ 	case SPECTRE_V2_CMD_RETPOLINE_AMD:
+ 		if (IS_ENABLED(CONFIG_RETPOLINE))
+ 			goto retpoline_amd;
+ 		break;
+ 	case SPECTRE_V2_CMD_RETPOLINE_GENERIC:
+ 		if (IS_ENABLED(CONFIG_RETPOLINE))
+ 			goto retpoline_generic;
+ 		break;
+ 	case SPECTRE_V2_CMD_RETPOLINE:
+ 		if (IS_ENABLED(CONFIG_RETPOLINE))
+ 			goto retpoline_auto;
+ 		break;
+ 	}
+ 	pr_err("kernel not compiled with retpoline; no mitigation available!");
+ 	return;
+ 
+ retpoline_auto:
+ 	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+ 	retpoline_amd:
+ 		if (!boot_cpu_has(X86_FEATURE_LFENCE_RDTSC)) {
+ 			pr_err("LFENCE not serializing. Switching to generic retpoline\n");
+ 			goto retpoline_generic;
+ 		}
+ 		mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_AMD :
+ 					 SPECTRE_V2_RETPOLINE_MINIMAL_AMD;
+ 		setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
+ 		setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+ 	} else {
+ 	retpoline_generic:
+ 		mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_GENERIC :
+ 					 SPECTRE_V2_RETPOLINE_MINIMAL;
+ 		setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+ 	}
+ 
+ 	spectre_v2_enabled = mode;
+ 	pr_info("%s\n", spectre_v2_strings[mode]);
+ }
+ 
+ #undef pr_fmt
+ 
+ #ifdef CONFIG_SYSFS
+ ssize_t cpu_show_meltdown(struct device *dev,
+ 			  struct device_attribute *attr, char *buf)
+ {
+ 	if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
+ 		return sprintf(buf, "Not affected\n");
+ 	if (boot_cpu_has(X86_FEATURE_PTI))
+ 		return sprintf(buf, "Mitigation: PTI\n");
+ 	return sprintf(buf, "Vulnerable\n");
+ }
+ 
+ ssize_t cpu_show_spectre_v1(struct device *dev,
+ 			    struct device_attribute *attr, char *buf)
+ {
+ 	if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V1))
+ 		return sprintf(buf, "Not affected\n");
+ 	return sprintf(buf, "Vulnerable\n");
+ }
+ 
+ ssize_t cpu_show_spectre_v2(struct device *dev,
+ 			    struct device_attribute *attr, char *buf)
+ {
+ 	if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
+ 		return sprintf(buf, "Not affected\n");
+ 
+ 	return sprintf(buf, "%s\n", spectre_v2_strings[spectre_v2_enabled]);
+ }
+ #endif
+3
arch/x86/kernel/cpu/common.c
···
  	if (c->x86_vendor != X86_VENDOR_AMD)
  		setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
  
+ 	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
+ 	setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
+ 
  	fpu__init_system(c);
  
  #ifdef CONFIG_X86_32
+4 -2
arch/x86/kernel/ftrace_32.S
···
  #include <asm/segment.h>
  #include <asm/export.h>
  #include <asm/ftrace.h>
+ #include <asm/nospec-branch.h>
  
  #ifdef CC_USING_FENTRY
  # define function_hook	__fentry__
···
  	movl	0x4(%ebp), %edx
  	subl	$MCOUNT_INSN_SIZE, %eax
  
- 	call	*ftrace_trace_function
+ 	movl	ftrace_trace_function, %ecx
+ 	CALL_NOSPEC %ecx
  
  	popl	%edx
  	popl	%ecx
···
  	movl	%eax, %ecx
  	popl	%edx
  	popl	%eax
- 	jmp	*%ecx
+ 	JMP_NOSPEC %ecx
  #endif
+4 -4
arch/x86/kernel/ftrace_64.S
···
  #include <asm/ptrace.h>
  #include <asm/ftrace.h>
  #include <asm/export.h>
- 
+ #include <asm/nospec-branch.h>
  
  	.code64
  	.section .entry.text, "ax"
···
   * ip and parent ip are used and the list function is called when
   * function tracing is enabled.
   */
- 	call   *ftrace_trace_function
- 
+ 	movq ftrace_trace_function, %r8
+ 	CALL_NOSPEC %r8
  	restore_mcount_regs
  
  	jmp fgraph_trace
···
  	movq 8(%rsp), %rdx
  	movq (%rsp), %rax
  	addq $24, %rsp
- 	jmp *%rdi
+ 	JMP_NOSPEC %rdi
  #endif
+5 -4
arch/x86/kernel/irq_32.c
···
  #include <linux/mm.h>
  
  #include <asm/apic.h>
+ #include <asm/nospec-branch.h>
  
  #ifdef CONFIG_DEBUG_STACKOVERFLOW
  
···
  static void call_on_stack(void *func, void *stack)
  {
  	asm volatile("xchgl	%%ebx,%%esp	\n"
- 		     "call	*%%edi		\n"
+ 		     CALL_NOSPEC
  		     "movl	%%ebx,%%esp	\n"
  		     : "=b" (stack)
  		     : "0" (stack),
- 		       "D"(func)
+ 		       [thunk_target] "D"(func)
  		     : "memory", "cc", "edx", "ecx", "eax");
  }
  
···
  		call_on_stack(print_stack_overflow, isp);
  
  	asm volatile("xchgl	%%ebx,%%esp	\n"
- 		     "call	*%%edi		\n"
+ 		     CALL_NOSPEC
  		     "movl	%%ebx,%%esp	\n"
  		     : "=a" (arg1), "=b" (isp)
  		     :  "0" (desc),   "1" (isp),
- 			"D" (desc->handle_irq)
+ 			[thunk_target] "D" (desc->handle_irq)
  		     : "memory", "cc", "ecx");
  	return 1;
  }
+11
arch/x86/kernel/tboot.c
···
  		return -1;
  	set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
  	pte_unmap(pte);
+ 
+ 	/*
+ 	 * PTI poisons low addresses in the kernel page tables in the
+ 	 * name of making them unusable for userspace.  To execute
+ 	 * code at such a low address, the poison must be cleared.
+ 	 *
+ 	 * Note: 'pgd' actually gets set in p4d_alloc() _or_
+ 	 * pud_alloc() depending on 4/5-level paging.
+ 	 */
+ 	pgd->pgd &= ~_PAGE_NX;
+ 
  	return 0;
  }
  
+4
arch/x86/kvm/svm.c
···
  #include <asm/debugreg.h>
  #include <asm/kvm_para.h>
  #include <asm/irq_remapping.h>
+ #include <asm/nospec-branch.h>
  
  #include <asm/virtext.h>
  #include "trace.h"
···
  		, "ebx", "ecx", "edx", "esi", "edi"
  #endif
  		);
+ 
+ 	/* Eliminate branch target predictions from guest mode */
+ 	vmexit_fill_RSB();
  
  #ifdef CONFIG_X86_64
  	wrmsrl(MSR_GS_BASE, svm->host.gs_base);
+4
arch/x86/kvm/vmx.c
···
  #include <asm/apic.h>
  #include <asm/irq_remapping.h>
  #include <asm/mmu_context.h>
+ #include <asm/nospec-branch.h>
  
  #include "trace.h"
  #include "pmu.h"
···
  		, "eax", "ebx", "edi", "esi"
  #endif
  	      );
+ 
+ 	/* Eliminate branch target predictions from guest mode */
+ 	vmexit_fill_RSB();
  
  	/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
  	if (debugctlmsr)
+1
arch/x86/lib/Makefile
···
  lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
  lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
  lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
+ lib-$(CONFIG_RETPOLINE) += retpoline.o
  
  obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
  
+4 -3
arch/x86/lib/checksum_32.S
···
  #include <asm/errno.h>
  #include <asm/asm.h>
  #include <asm/export.h>
- 
+ #include <asm/nospec-branch.h>
+ 
  /*
   * computes a partial checksum, e.g. for TCP/UDP fragments
   */
···
  	negl %ebx
  	lea 45f(%ebx,%ebx,2), %ebx
  	testl %esi, %esi
- 	jmp *%ebx
+ 	JMP_NOSPEC %ebx
  
  	# Handle 2-byte-aligned regions
  20:	addw (%esi), %ax
···
  	andl $-32,%edx
  	lea 3f(%ebx,%ebx), %ebx
  	testl %esi, %esi
- 	jmp *%ebx
+ 	JMP_NOSPEC %ebx
  1:	addl $64,%esi
  	addl $64,%edi
  	SRC(movb -32(%edx),%bl)	; SRC(movb (%edx),%bl)
+48
arch/x86/lib/retpoline.S
···
+ /* SPDX-License-Identifier: GPL-2.0 */
+ 
+ #include <linux/stringify.h>
+ #include <linux/linkage.h>
+ #include <asm/dwarf2.h>
+ #include <asm/cpufeatures.h>
+ #include <asm/alternative-asm.h>
+ #include <asm/export.h>
+ #include <asm/nospec-branch.h>
+ 
+ .macro THUNK reg
+ 	.section .text.__x86.indirect_thunk.\reg
+ 
+ ENTRY(__x86_indirect_thunk_\reg)
+ 	CFI_STARTPROC
+ 	JMP_NOSPEC %\reg
+ 	CFI_ENDPROC
+ ENDPROC(__x86_indirect_thunk_\reg)
+ .endm
+ 
+ /*
+  * Despite being an assembler file we can't just use .irp here
+  * because __KSYM_DEPS__ only uses the C preprocessor and would
+  * only see one instance of "__x86_indirect_thunk_\reg" rather
+  * than one per register with the correct names. So we do it
+  * the simple and nasty way...
+  */
+ #define EXPORT_THUNK(reg) EXPORT_SYMBOL(__x86_indirect_thunk_ ## reg)
+ #define GENERATE_THUNK(reg) THUNK reg ; EXPORT_THUNK(reg)
+ 
+ GENERATE_THUNK(_ASM_AX)
+ GENERATE_THUNK(_ASM_BX)
+ GENERATE_THUNK(_ASM_CX)
+ GENERATE_THUNK(_ASM_DX)
+ GENERATE_THUNK(_ASM_SI)
+ GENERATE_THUNK(_ASM_DI)
+ GENERATE_THUNK(_ASM_BP)
+ GENERATE_THUNK(_ASM_SP)
+ #ifdef CONFIG_64BIT
+ GENERATE_THUNK(r8)
+ GENERATE_THUNK(r9)
+ GENERATE_THUNK(r10)
+ GENERATE_THUNK(r11)
+ GENERATE_THUNK(r12)
+ GENERATE_THUNK(r13)
+ GENERATE_THUNK(r14)
+ GENERATE_THUNK(r15)
+ #endif
+6 -26
arch/x86/mm/pti.c
···
   *
   * Returns a pointer to a P4D on success, or NULL on failure.
   */
- static p4d_t *pti_user_pagetable_walk_p4d(unsigned long address)
+ static __init p4d_t *pti_user_pagetable_walk_p4d(unsigned long address)
  {
  	pgd_t *pgd = kernel_to_user_pgdp(pgd_offset_k(address));
  	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
···
  		if (!new_p4d_page)
  			return NULL;
  
- 		if (pgd_none(*pgd)) {
- 			set_pgd(pgd, __pgd(_KERNPG_TABLE | __pa(new_p4d_page)));
- 			new_p4d_page = 0;
- 		}
- 		if (new_p4d_page)
- 			free_page(new_p4d_page);
+ 		set_pgd(pgd, __pgd(_KERNPG_TABLE | __pa(new_p4d_page)));
  	}
  	BUILD_BUG_ON(pgd_large(*pgd) != 0);
  
···
   *
   * Returns a pointer to a PMD on success, or NULL on failure.
   */
- static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
+ static __init pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
  {
  	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
  	p4d_t *p4d = pti_user_pagetable_walk_p4d(address);
···
  		if (!new_pud_page)
  			return NULL;
  
- 		if (p4d_none(*p4d)) {
- 			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
- 			new_pud_page = 0;
- 		}
- 		if (new_pud_page)
- 			free_page(new_pud_page);
+ 		set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
  	}
  
  	pud = pud_offset(p4d, address);
···
  		if (!new_pmd_page)
  			return NULL;
  
- 		if (pud_none(*pud)) {
- 			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
- 			new_pmd_page = 0;
- 		}
- 		if (new_pmd_page)
- 			free_page(new_pmd_page);
+ 		set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
  	}
  
  	return pmd_offset(pud, address);
···
  		if (!new_pte_page)
  			return NULL;
  
- 		if (pmd_none(*pmd)) {
- 			set_pmd(pmd, __pmd(_KERNPG_TABLE | __pa(new_pte_page)));
- 			new_pte_page = 0;
- 		}
- 		if (new_pte_page)
- 			free_page(new_pte_page);
+ 		set_pmd(pmd, __pmd(_KERNPG_TABLE | __pa(new_pte_page)));
  	}
  
  	pte = pte_offset_kernel(pmd, address);
+2
arch/x86/platform/efi/efi_64.c
···
  			pud[j] = *pud_offset(p4d_k, vaddr);
  		}
  	}
+ 	pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
  }
+ 
  out:
  	__flush_tlb_all();
  
+3
drivers/base/Kconfig
···
  config GENERIC_CPU_AUTOPROBE
  	bool
  
+ config GENERIC_CPU_VULNERABILITIES
+ 	bool
+ 
  config SOC_BUS
  	bool
  	select GLOB
+48
drivers/base/cpu.c
···
  #endif
  }
  
+ #ifdef CONFIG_GENERIC_CPU_VULNERABILITIES
+ 
+ ssize_t __weak cpu_show_meltdown(struct device *dev,
+ 				 struct device_attribute *attr, char *buf)
+ {
+ 	return sprintf(buf, "Not affected\n");
+ }
+ 
+ ssize_t __weak cpu_show_spectre_v1(struct device *dev,
+ 				   struct device_attribute *attr, char *buf)
+ {
+ 	return sprintf(buf, "Not affected\n");
+ }
+ 
+ ssize_t __weak cpu_show_spectre_v2(struct device *dev,
+ 				   struct device_attribute *attr, char *buf)
+ {
+ 	return sprintf(buf, "Not affected\n");
+ }
+ 
+ static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
+ static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
+ static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
+ 
+ static struct attribute *cpu_root_vulnerabilities_attrs[] = {
+ 	&dev_attr_meltdown.attr,
+ 	&dev_attr_spectre_v1.attr,
+ 	&dev_attr_spectre_v2.attr,
+ 	NULL
+ };
+ 
+ static const struct attribute_group cpu_root_vulnerabilities_group = {
+ 	.name  = "vulnerabilities",
+ 	.attrs = cpu_root_vulnerabilities_attrs,
+ };
+ 
+ static void __init cpu_register_vulnerabilities(void)
+ {
+ 	if (sysfs_create_group(&cpu_subsys.dev_root->kobj,
+ 			       &cpu_root_vulnerabilities_group))
+ 		pr_err("Unable to register CPU vulnerabilities\n");
+ }
+ 
+ #else
+ static inline void cpu_register_vulnerabilities(void) { }
+ #endif
+ 
  void __init cpu_dev_init(void)
  {
  	if (subsys_system_register(&cpu_subsys, cpu_root_attr_groups))
  		panic("Failed to register CPU subsystem");
  
  	cpu_dev_register_generic();
+ 	cpu_register_vulnerabilities();
  }
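
The __weak functions above are plain link-time defaults: an
architecture that knows better overrides them simply by defining a
strong symbol of the same name, as arch/x86/kernel/cpu/bugs.c does in
this series. The pattern in isolation (a hedged sketch; the function
name is hypothetical):

    /* weak_demo.c - weak default, strong override wins at link time. */
    #include <stdio.h>

    __attribute__((weak)) const char *show_meltdown(void)
    {
        return "Not affected\n";  /* generic default */
    }

    /* Linking in another object that defines a strong show_meltdown()
     * - say, returning "Mitigation: PTI\n" - silently replaces this. */

    int main(void)
    {
        fputs(show_meltdown(), stdout);
        return 0;
    }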
+7
include/linux/cpu.h
··· 47 47 extern int cpu_add_dev_attr_group(struct attribute_group *attrs); 48 48 extern void cpu_remove_dev_attr_group(struct attribute_group *attrs); 49 49 50 + extern ssize_t cpu_show_meltdown(struct device *dev, 51 + struct device_attribute *attr, char *buf); 52 + extern ssize_t cpu_show_spectre_v1(struct device *dev, 53 + struct device_attribute *attr, char *buf); 54 + extern ssize_t cpu_show_spectre_v2(struct device *dev, 55 + struct device_attribute *attr, char *buf); 56 + 50 57 extern __printf(4, 5) 51 58 struct device *cpu_device_create(struct device *parent, void *drvdata, 52 59 const struct attribute_group **groups,
+1 -1
security/Kconfig
··· 63 63 ensuring that the majority of kernel addresses are not mapped 64 64 into userspace. 65 65 66 - See Documentation/x86/pagetable-isolation.txt for more details. 66 + See Documentation/x86/pti.txt for more details. 67 67 68 68 config SECURITY_INFINIBAND 69 69 bool "Infiniband Security Hooks"
+63 -6
tools/objtool/check.c
··· 428 428 } 429 429 430 430 /* 431 + * FIXME: For now, just ignore any alternatives which add retpolines. This is 432 + * a temporary hack, as it doesn't allow ORC to unwind from inside a retpoline. 433 + * But it at least allows objtool to understand the control flow *around* the 434 + * retpoline. 435 + */ 436 + static int add_nospec_ignores(struct objtool_file *file) 437 + { 438 + struct section *sec; 439 + struct rela *rela; 440 + struct instruction *insn; 441 + 442 + sec = find_section_by_name(file->elf, ".rela.discard.nospec"); 443 + if (!sec) 444 + return 0; 445 + 446 + list_for_each_entry(rela, &sec->rela_list, list) { 447 + if (rela->sym->type != STT_SECTION) { 448 + WARN("unexpected relocation symbol type in %s", sec->name); 449 + return -1; 450 + } 451 + 452 + insn = find_insn(file, rela->sym->sec, rela->addend); 453 + if (!insn) { 454 + WARN("bad .discard.nospec entry"); 455 + return -1; 456 + } 457 + 458 + insn->ignore_alts = true; 459 + } 460 + 461 + return 0; 462 + } 463 + 464 + /* 431 465 * Find the destination instructions for all jumps. 432 466 */ 433 467 static int add_jump_destinations(struct objtool_file *file) ··· 490 456 } else if (rela->sym->sec->idx) { 491 457 dest_sec = rela->sym->sec; 492 458 dest_off = rela->sym->sym.st_value + rela->addend + 4; 459 + } else if (strstr(rela->sym->name, "_indirect_thunk_")) { 460 + /* 461 + * Retpoline jumps are really dynamic jumps in 462 + * disguise, so convert them accordingly. 463 + */ 464 + insn->type = INSN_JUMP_DYNAMIC; 465 + continue; 493 466 } else { 494 467 /* sibling call */ 495 468 insn->jump_dest = 0; ··· 543 502 dest_off = insn->offset + insn->len + insn->immediate; 544 503 insn->call_dest = find_symbol_by_offset(insn->sec, 545 504 dest_off); 505 + /* 506 + * FIXME: Thanks to retpolines, it's now considered 507 + * normal for a function to call within itself. So 508 + * disable this warning for now. 509 + */ 510 + #if 0 546 511 if (!insn->call_dest) { 547 512 WARN_FUNC("can't find call dest symbol at offset 0x%lx", 548 513 insn->sec, insn->offset, dest_off); 549 514 return -1; 550 515 } 516 + #endif 551 517 } else if (rela->sym->type == STT_SECTION) { 552 518 insn->call_dest = find_symbol_by_offset(rela->sym->sec, 553 519 rela->addend+4); ··· 719 671 return ret; 720 672 721 673 list_for_each_entry_safe(special_alt, tmp, &special_alts, list) { 722 - alt = malloc(sizeof(*alt)); 723 - if (!alt) { 724 - WARN("malloc failed"); 725 - ret = -1; 726 - goto out; 727 - } 728 674 729 675 orig_insn = find_insn(file, special_alt->orig_sec, 730 676 special_alt->orig_off); ··· 728 686 ret = -1; 729 687 goto out; 730 688 } 689 + 690 + /* Ignore retpoline alternatives. */ 691 + if (orig_insn->ignore_alts) 692 + continue; 731 693 732 694 new_insn = NULL; 733 695 if (!special_alt->group || special_alt->new_len) { ··· 756 710 &new_insn); 757 711 if (ret) 758 712 goto out; 713 + } 714 + 715 + alt = malloc(sizeof(*alt)); 716 + if (!alt) { 717 + WARN("malloc failed"); 718 + ret = -1; 719 + goto out; 759 720 } 760 721 761 722 alt->insn = new_insn; ··· 1080 1027 return ret; 1081 1028 1082 1029 add_ignores(file); 1030 + 1031 + ret = add_nospec_ignores(file); 1032 + if (ret) 1033 + return ret; 1083 1034 1084 1035 ret = add_jump_destinations(file); 1085 1036 if (ret)
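How the .discard.nospec handshake above works: the kernel's retpoline macros emit a self-relative pointer to the start of each retpoline alternative into a section that is discarded at link time; add_nospec_ignores() walks the matching .rela.discard.nospec relocations, resolves each addend back to an instruction with find_insn(), and sets ignore_alts so the alternative-processing loop skips it. The producer side looks roughly like this (from asm/nospec-branch.h in this series; reproduced from memory, treat details as approximate):

	#define ANNOTATE_NOSPEC_ALTERNATIVE			\
		"999:\n\t"					\
		".pushsection .discard.nospec\n\t"		\
		".long 999b - .\n\t"				\
		".popsection\n\t"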
+1 -1
tools/objtool/check.h
··· 44 44 unsigned int len; 45 45 unsigned char type; 46 46 unsigned long immediate; 47 - bool alt_group, visited, dead_end, ignore, hint, save, restore; 47 + bool alt_group, visited, dead_end, ignore, hint, save, restore, ignore_alts; 48 48 struct symbol *call_dest; 49 49 struct instruction *jump_dest; 50 50 struct list_head alts;
+1 -1
tools/testing/selftests/x86/Makefile
··· 7 7 8 8 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt ptrace_syscall test_mremap_vdso \ 9 9 check_initial_reg_state sigreturn ldt_gdt iopl mpx-mini-test ioperm \ 10 - protection_keys test_vdso 10 + protection_keys test_vdso test_vsyscall 11 11 TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso unwind_vdso \ 12 12 test_FCMOV test_FCOMI test_FISTTP \ 13 13 vdso_restorer
+500
tools/testing/selftests/x86/test_vsyscall.c
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */
2 +
3 + #define _GNU_SOURCE
4 +
5 + #include <stdio.h>
6 + #include <sys/time.h>
7 + #include <time.h>
8 + #include <stdlib.h>
9 + #include <sys/syscall.h>
10 + #include <unistd.h>
11 + #include <dlfcn.h>
12 + #include <string.h>
13 + #include <inttypes.h>
14 + #include <signal.h>
15 + #include <sys/ucontext.h>
16 + #include <errno.h>
17 + #include <err.h>
18 + #include <sched.h>
19 + #include <stdbool.h>
20 + #include <setjmp.h>
21 +
22 + #ifdef __x86_64__
23 + # define VSYS(x) (x)
24 + #else
25 + # define VSYS(x) 0
26 + #endif
27 +
28 + #ifndef SYS_getcpu
29 + # ifdef __x86_64__
30 + # define SYS_getcpu 309
31 + # else
32 + # define SYS_getcpu 318
33 + # endif
34 + #endif
35 +
36 + static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
37 + int flags)
38 + {
39 + struct sigaction sa;
40 + memset(&sa, 0, sizeof(sa));
41 + sa.sa_sigaction = handler;
42 + sa.sa_flags = SA_SIGINFO | flags;
43 + sigemptyset(&sa.sa_mask);
44 + if (sigaction(sig, &sa, 0))
45 + err(1, "sigaction");
46 + }
47 +
48 + /* vsyscalls and vDSO */
49 + bool should_read_vsyscall = false;
50 +
51 + typedef long (*gtod_t)(struct timeval *tv, struct timezone *tz);
52 + gtod_t vgtod = (gtod_t)VSYS(0xffffffffff600000);
53 + gtod_t vdso_gtod;
54 +
55 + typedef int (*vgettime_t)(clockid_t, struct timespec *);
56 + vgettime_t vdso_gettime;
57 +
58 + typedef long (*time_func_t)(time_t *t);
59 + time_func_t vtime = (time_func_t)VSYS(0xffffffffff600400);
60 + time_func_t vdso_time;
61 +
62 + typedef long (*getcpu_t)(unsigned *, unsigned *, void *);
63 + getcpu_t vgetcpu = (getcpu_t)VSYS(0xffffffffff600800);
64 + getcpu_t vdso_getcpu;
65 +
66 + static void init_vdso(void)
67 + {
68 + void *vdso = dlopen("linux-vdso.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
69 + if (!vdso)
70 + vdso = dlopen("linux-gate.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
71 + if (!vdso) {
72 + printf("[WARN]\tfailed to find vDSO\n");
73 + return;
74 + }
75 +
76 + vdso_gtod = (gtod_t)dlsym(vdso, "__vdso_gettimeofday");
77 + if (!vdso_gtod)
78 + printf("[WARN]\tfailed to find gettimeofday in vDSO\n");
79 +
80 + vdso_gettime = (vgettime_t)dlsym(vdso, "__vdso_clock_gettime");
81 + if (!vdso_gettime)
82 + printf("[WARN]\tfailed to find clock_gettime in vDSO\n");
83 +
84 + vdso_time = (time_func_t)dlsym(vdso, "__vdso_time");
85 + if (!vdso_time)
86 + printf("[WARN]\tfailed to find time in vDSO\n");
87 +
88 + vdso_getcpu = (getcpu_t)dlsym(vdso, "__vdso_getcpu");
89 + if (!vdso_getcpu) {
90 + /* getcpu() was never wired up in the 32-bit vDSO. */
91 + printf("[%s]\tfailed to find getcpu in vDSO\n",
92 + sizeof(long) == 8 ? "WARN" : "NOTE");
93 + }
94 + }
95 +
96 + static int init_vsys(void)
97 + {
98 + #ifdef __x86_64__
99 + int nerrs = 0;
100 + FILE *maps;
101 + char line[128];
102 + bool found = false;
103 +
104 + maps = fopen("/proc/self/maps", "r");
105 + if (!maps) {
106 + printf("[WARN]\tCould not open /proc/self/maps -- assuming vsyscall is r-x\n");
107 + should_read_vsyscall = true;
108 + return 0;
109 + }
110 +
111 + while (fgets(line, sizeof(line), maps)) {
112 + char r, x;
113 + void *start, *end;
114 + char name[128];
115 + if (sscanf(line, "%p-%p %c-%cp %*x %*x:%*x %*u %s",
116 + &start, &end, &r, &x, name) != 5)
117 + continue;
118 +
119 + if (strcmp(name, "[vsyscall]"))
120 + continue;
121 +
122 + printf("\tvsyscall map: %s", line);
123 +
124 + if (start != (void *)0xffffffffff600000 ||
125 + end != (void *)0xffffffffff601000) {
126 + printf("[FAIL]\taddress range is nonsense\n");
127 + nerrs++;
128 + }
129 +
130 + printf("\tvsyscall permissions are %c-%c\n", r, x);
131 + should_read_vsyscall = (r == 'r');
132 + if (x != 'x') {
133 + vgtod = NULL;
134 + vtime = NULL;
135 + vgetcpu = NULL;
136 + }
137 +
138 + found = true;
139 + break;
140 + }
141 +
142 + fclose(maps);
143 +
144 + if (!found) {
145 + printf("\tno vsyscall map in /proc/self/maps\n");
146 + should_read_vsyscall = false;
147 + vgtod = NULL;
148 + vtime = NULL;
149 + vgetcpu = NULL;
150 + }
151 +
152 + return nerrs;
153 + #else
154 + return 0;
155 + #endif
156 + }
157 +
158 + /* syscalls */
159 + static inline long sys_gtod(struct timeval *tv, struct timezone *tz)
160 + {
161 + return syscall(SYS_gettimeofday, tv, tz);
162 + }
163 +
164 + static inline int sys_clock_gettime(clockid_t id, struct timespec *ts)
165 + {
166 + return syscall(SYS_clock_gettime, id, ts);
167 + }
168 +
169 + static inline long sys_time(time_t *t)
170 + {
171 + return syscall(SYS_time, t);
172 + }
173 +
174 + static inline long sys_getcpu(unsigned * cpu, unsigned * node,
175 + void* cache)
176 + {
177 + return syscall(SYS_getcpu, cpu, node, cache);
178 + }
179 +
180 + static jmp_buf jmpbuf;
181 +
182 + static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
183 + {
184 + siglongjmp(jmpbuf, 1);
185 + }
186 +
187 + static double tv_diff(const struct timeval *a, const struct timeval *b)
188 + {
189 + return (double)(a->tv_sec - b->tv_sec) +
190 + (double)((int)a->tv_usec - (int)b->tv_usec) * 1e-6;
191 + }
192 +
193 + static int check_gtod(const struct timeval *tv_sys1,
194 + const struct timeval *tv_sys2,
195 + const struct timezone *tz_sys,
196 + const char *which,
197 + const struct timeval *tv_other,
198 + const struct timezone *tz_other)
199 + {
200 + int nerrs = 0;
201 + double d1, d2;
202 +
203 + if (tz_other && (tz_sys->tz_minuteswest != tz_other->tz_minuteswest || tz_sys->tz_dsttime != tz_other->tz_dsttime)) {
204 + printf("[FAIL] %s tz mismatch\n", which);
205 + nerrs++;
206 + }
207 +
208 + d1 = tv_diff(tv_other, tv_sys1);
209 + d2 = tv_diff(tv_sys2, tv_other);
210 + printf("\t%s time offsets: %lf %lf\n", which, d1, d2);
211 +
212 + if (d1 < 0 || d2 < 0) {
213 + printf("[FAIL]\t%s time was inconsistent with the syscall\n", which);
214 + nerrs++;
215 + } else {
216 + printf("[OK]\t%s gettimeofday()'s timeval was okay\n", which);
217 + }
218 +
219 + return nerrs;
220 + }
221 +
222 + static int test_gtod(void)
223 + {
224 + struct timeval tv_sys1, tv_sys2, tv_vdso, tv_vsys;
225 + struct timezone tz_sys, tz_vdso, tz_vsys;
226 + long ret_vdso = -1;
227 + long ret_vsys = -1;
228 + int nerrs = 0;
229 +
230 + printf("[RUN]\ttest gettimeofday()\n");
231 +
232 + if (sys_gtod(&tv_sys1, &tz_sys) != 0)
233 + err(1, "syscall gettimeofday");
234 + if (vdso_gtod)
235 + ret_vdso = vdso_gtod(&tv_vdso, &tz_vdso);
236 + if (vgtod)
237 + ret_vsys = vgtod(&tv_vsys, &tz_vsys);
238 + if (sys_gtod(&tv_sys2, &tz_sys) != 0)
239 + err(1, "syscall gettimeofday");
240 +
241 + if (vdso_gtod) {
242 + if (ret_vdso == 0) {
243 + nerrs += check_gtod(&tv_sys1, &tv_sys2, &tz_sys, "vDSO", &tv_vdso, &tz_vdso);
244 + } else {
245 + printf("[FAIL]\tvDSO gettimeofday() failed: %ld\n", ret_vdso);
246 + nerrs++;
247 + }
248 + }
249 +
250 + if (vgtod) {
251 + if (ret_vsys == 0) {
252 + nerrs += check_gtod(&tv_sys1, &tv_sys2, &tz_sys, "vsyscall", &tv_vsys, &tz_vsys);
253 + } else {
254 + printf("[FAIL]\tvsys gettimeofday() failed: %ld\n", ret_vsys);
255 + nerrs++;
256 + }
257 + }
258 +
259 + return nerrs;
260 + }
261 +
262 + static int test_time(void) {
263 + int nerrs = 0;
264 +
265 + printf("[RUN]\ttest time()\n");
266 + long t_sys1, t_sys2, t_vdso = 0, t_vsys = 0;
267 + long t2_sys1 = -1, t2_sys2 = -1, t2_vdso = -1, t2_vsys = -1;
268 + t_sys1 = sys_time(&t2_sys1);
269 + if (vdso_time)
270 + t_vdso = vdso_time(&t2_vdso);
271 + if (vtime)
272 + t_vsys = vtime(&t2_vsys);
273 + t_sys2 = sys_time(&t2_sys2);
274 + if (t_sys1 < 0 || t_sys1 != t2_sys1 || t_sys2 < 0 || t_sys2 != t2_sys2) {
275 + printf("[FAIL]\tsyscall failed (ret1:%ld output1:%ld ret2:%ld output2:%ld)\n", t_sys1, t2_sys1, t_sys2, t2_sys2);
276 + nerrs++;
277 + return nerrs;
278 + }
279 +
280 + if (vdso_time) {
281 + if (t_vdso < 0 || t_vdso != t2_vdso) {
282 + printf("[FAIL]\tvDSO failed (ret:%ld output:%ld)\n", t_vdso, t2_vdso);
283 + nerrs++;
284 + } else if (t_vdso < t_sys1 || t_vdso > t_sys2) {
285 + printf("[FAIL]\tvDSO returned the wrong time (%ld %ld %ld)\n", t_sys1, t_vdso, t_sys2);
286 + nerrs++;
287 + } else {
288 + printf("[OK]\tvDSO time() is okay\n");
289 + }
290 + }
291 +
292 + if (vtime) {
293 + if (t_vsys < 0 || t_vsys != t2_vsys) {
294 + printf("[FAIL]\tvsyscall failed (ret:%ld output:%ld)\n", t_vsys, t2_vsys);
295 + nerrs++;
296 + } else if (t_vsys < t_sys1 || t_vsys > t_sys2) {
297 + printf("[FAIL]\tvsyscall returned the wrong time (%ld %ld %ld)\n", t_sys1, t_vsys, t_sys2);
298 + nerrs++;
299 + } else {
300 + printf("[OK]\tvsyscall time() is okay\n");
301 + }
302 + }
303 +
304 + return nerrs;
305 + }
306 +
307 + static int test_getcpu(int cpu)
308 + {
309 + int nerrs = 0;
310 + long ret_sys, ret_vdso = -1, ret_vsys = -1;
311 +
312 + printf("[RUN]\tgetcpu() on CPU %d\n", cpu);
313 +
314 + cpu_set_t cpuset;
315 + CPU_ZERO(&cpuset);
316 + CPU_SET(cpu, &cpuset);
317 + if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
318 + printf("[SKIP]\tfailed to force CPU %d\n", cpu);
319 + return nerrs;
320 + }
321 +
322 + unsigned cpu_sys, cpu_vdso, cpu_vsys, node_sys, node_vdso, node_vsys;
323 + unsigned node = 0;
324 + bool have_node = false;
325 + ret_sys = sys_getcpu(&cpu_sys, &node_sys, 0);
326 + if (vdso_getcpu)
327 + ret_vdso = vdso_getcpu(&cpu_vdso, &node_vdso, 0);
328 + if (vgetcpu)
329 + ret_vsys = vgetcpu(&cpu_vsys, &node_vsys, 0);
330 +
331 + if (ret_sys == 0) {
332 + if (cpu_sys != cpu) {
333 + printf("[FAIL]\tsyscall reported CPU %hu but should be %d\n", cpu_sys, cpu);
334 + nerrs++;
335 + }
336 +
337 + have_node = true;
338 + node = node_sys;
339 + }
340 +
341 + if (vdso_getcpu) {
342 + if (ret_vdso) {
343 + printf("[FAIL]\tvDSO getcpu() failed\n");
344 + nerrs++;
345 + } else {
346 + if (!have_node) {
347 + have_node = true;
348 + node = node_vdso;
349 + }
350 +
351 + if (cpu_vdso != cpu) {
352 + printf("[FAIL]\tvDSO reported CPU %hu but should be %d\n", cpu_vdso, cpu);
353 + nerrs++;
354 + } else {
355 + printf("[OK]\tvDSO reported correct CPU\n");
356 + }
357 +
358 + if (node_vdso != node) {
359 + printf("[FAIL]\tvDSO reported node %hu but should be %hu\n", node_vdso, node);
360 + nerrs++;
361 + } else {
362 + printf("[OK]\tvDSO reported correct node\n");
363 + }
364 + }
365 + }
366 +
367 + if (vgetcpu) {
368 + if (ret_vsys) {
369 + printf("[FAIL]\tvsyscall getcpu() failed\n");
370 + nerrs++;
371 + } else {
372 + if (!have_node) {
373 + have_node = true;
374 + node = node_vsys;
375 + }
376 +
377 + if (cpu_vsys != cpu) {
378 + printf("[FAIL]\tvsyscall reported CPU %hu but should be %d\n", cpu_vsys, cpu);
379 + nerrs++;
380 + } else {
381 + printf("[OK]\tvsyscall reported correct CPU\n");
382 + }
383 +
384 + if (node_vsys != node) {
385 + printf("[FAIL]\tvsyscall reported node %hu but should be %hu\n", node_vsys, node);
386 + nerrs++;
387 + } else {
388 + printf("[OK]\tvsyscall reported correct node\n");
389 + }
390 + }
391 + }
392 +
393 + return nerrs;
394 + }
395 +
396 + static int test_vsys_r(void)
397 + {
398 + #ifdef __x86_64__
399 + printf("[RUN]\tChecking read access to the vsyscall page\n");
400 + bool can_read;
401 + if (sigsetjmp(jmpbuf, 1) == 0) {
402 + *(volatile int *)0xffffffffff600000;
403 + can_read = true;
404 + } else {
405 + can_read = false;
406 + }
407 +
408 + if (can_read && !should_read_vsyscall) {
409 + printf("[FAIL]\tWe have read access, but we shouldn't\n");
410 + return 1;
411 + } else if (!can_read && should_read_vsyscall) {
412 + printf("[FAIL]\tWe don't have read access, but we should\n");
413 + return 1;
414 + } else {
415 + printf("[OK]\tgot expected result\n");
416 + }
417 + #endif
418 +
419 + return 0;
420 + }
421 +
422 +
423 + #ifdef __x86_64__
424 + #define X86_EFLAGS_TF (1UL << 8)
425 + static volatile sig_atomic_t num_vsyscall_traps;
426 +
427 + static unsigned long get_eflags(void)
428 + {
429 + unsigned long eflags;
430 + asm volatile ("pushfq\n\tpopq %0" : "=rm" (eflags));
431 + return eflags;
432 + }
433 +
434 + static void set_eflags(unsigned long eflags)
435 + {
436 + asm volatile ("pushq %0\n\tpopfq" : : "rm" (eflags) : "flags");
437 + }
438 +
439 + static void sigtrap(int sig, siginfo_t *info, void *ctx_void)
440 + {
441 + ucontext_t *ctx = (ucontext_t *)ctx_void;
442 + unsigned long ip = ctx->uc_mcontext.gregs[REG_RIP];
443 +
444 + if (((ip ^ 0xffffffffff600000UL) & ~0xfffUL) == 0)
445 + num_vsyscall_traps++;
446 + }
447 +
448 + static int test_native_vsyscall(void)
449 + {
450 + time_t tmp;
451 + bool is_native;
452 +
453 + if (!vtime)
454 + return 0;
455 +
456 + printf("[RUN]\tchecking for native vsyscall\n");
457 + sethandler(SIGTRAP, sigtrap, 0);
458 + set_eflags(get_eflags() | X86_EFLAGS_TF);
459 + vtime(&tmp);
460 + set_eflags(get_eflags() & ~X86_EFLAGS_TF);
461 +
462 + /*
463 + * If vsyscalls are emulated, we expect a single trap in the
464 + * vsyscall page -- the call instruction will trap with RIP
465 + * pointing to the entry point before emulation takes over.
466 + * In native mode, we expect two traps, since whatever code
467 + * the vsyscall page contains will be more than just a ret
468 + * instruction.
469 + */
470 + is_native = (num_vsyscall_traps > 1);
471 +
472 + printf("\tvsyscalls are %s (%d instructions in vsyscall page)\n",
473 + (is_native ? "native" : "emulated"),
474 + (int)num_vsyscall_traps);
475 +
476 + return 0;
477 + }
478 + #endif
479 +
480 + int main(int argc, char **argv)
481 + {
482 + int nerrs = 0;
483 +
484 + init_vdso();
485 + nerrs += init_vsys();
486 +
487 + nerrs += test_gtod();
488 + nerrs += test_time();
489 + nerrs += test_getcpu(0);
490 + nerrs += test_getcpu(1);
491 +
492 + sethandler(SIGSEGV, sigsegv, 0);
493 + nerrs += test_vsys_r();
494 +
495 + #ifdef __x86_64__
496 + nerrs += test_native_vsyscall();
497 + #endif
498 +
499 + return nerrs ? 1 : 0;
500 + }