Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Pull kvm fixes from Paolo Bonzini:
"ARM:

- A couple of fixes when handling an exception while a SError has
been delivered

- Workaround for Cortex-A510's single-step erratum

RISC-V:

- Make CY, TM, and IR counters accessible in VU mode

- Fix SBI implementation version

x86:

- Report deprecation of x87 features in supported CPUID

- Preparation for fixing an interrupt delivery race on AMD hardware

- Sparse fix

All except POWER and s390:

- Rework guest entry code to correctly mark noinstr areas and fix
vtime accounting (for x86, this was already mostly correct but not
entirely; for ARM, MIPS and RISC-V it wasn't)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Use ERR_PTR_USR() to return -EFAULT as a __user pointer
KVM: x86: Report deprecated x87 features in supported CPUID
KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata
KVM: arm64: Stop handle_exit() from handling HVC twice when an SError occurs
KVM: arm64: Avoid consuming a stale esr value when SError occur
RISC-V: KVM: Fix SBI implementation version
RISC-V: KVM: make CY, TM, and IR counters accessible in VU mode
kvm/riscv: rework guest entry logic
kvm/arm64: rework guest entry logic
kvm/x86: rework guest entry logic
kvm/mips: rework guest entry logic
kvm: add guest_state_{enter,exit}_irqoff()
KVM: x86: Move delivery of non-APICv interrupt into vendor code
kvm: Move KVM_GET_XSAVE2 IOCTL definition at the end of kvm.h

Linus Torvalds 4 years ago 5fdb2621 fbc04bf0

+336 -121

20 changed files

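Documentation/arm64/silicon-errata.rst
arch/arm64/Kconfig
arch/arm64/kernel/cpu_errata.c
arch/arm64/kvm/arm.c
arch/arm64/kvm/handle_exit.c
arch/arm64/kvm/hyp/include/hyp/switch.h
arch/arm64/tools/cpucaps
arch/mips/kvm/mips.c
arch/riscv/kvm/vcpu.c
arch/riscv/kvm/vcpu_sbi_base.c
arch/x86/include/asm/kvm-x86-ops.h
arch/x86/include/asm/kvm_host.h
arch/x86/kvm/cpuid.c
arch/x86/kvm/lapic.c
arch/x86/kvm/svm/svm.c
arch/x86/kvm/vmx/vmx.c
arch/x86/kvm/x86.c
arch/x86/kvm/x86.h
include/linux/kvm_host.h
include/uapi/linux/kvm.h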
Documentation/arm64/silicon-errata.rst

···
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Cortex-A510     | #2051678        | ARM64_ERRATUM_2051678       |
 +----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A510     | #2077057        | ARM64_ERRATUM_2077057       |
++----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Cortex-A710     | #2119858        | ARM64_ERRATUM_2119858       |
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Cortex-A710     | #2054223        | ARM64_ERRATUM_2054223       |

+16

arch/arm64/Kconfig

···
 
 	  If unsure, say Y.
 
+config ARM64_ERRATUM_2077057
+	bool "Cortex-A510: 2077057: workaround software-step corrupting SPSR_EL2"
+	help
+	  This option adds the workaround for ARM Cortex-A510 erratum 2077057.
+	  Affected Cortex-A510 may corrupt SPSR_EL2 when the a step exception is
+	  expected, but a Pointer Authentication trap is taken instead. The
+	  erratum causes SPSR_EL1 to be copied to SPSR_EL2, which could allow
+	  EL1 to cause a return to EL2 with a guest controlled ELR_EL2.
+
+	  This can only happen when EL2 is stepping EL1.
+
+	  When these conditions occur, the SPSR_EL2 value is unchanged from the
+	  previous guest entry, and can be restored from the in-memory copy.
+
+	  If unsure, say Y.
+
 config ARM64_ERRATUM_2119858
 	bool "Cortex-A710/X2: 2119858: workaround TRBE overwriting trace data in FILL mode"
 	default y

arch/arm64/kernel/cpu_errata.c

···
 		CAP_MIDR_RANGE_LIST(trbe_write_out_of_range_cpus),
 	},
 #endif
+#ifdef CONFIG_ARM64_ERRATUM_2077057
+	{
+		.desc = "ARM erratum 2077057",
+		.capability = ARM64_WORKAROUND_2077057,
+		.type = ARM64_CPUCAP_LOCAL_CPU_ERRATUM,
+		ERRATA_MIDR_REV_RANGE(MIDR_CORTEX_A510, 0, 0, 2),
+	},
+#endif
 #ifdef CONFIG_ARM64_ERRATUM_2064142
 	{
 		.desc = "ARM erratum 2064142",
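The ERRATA_MIDR_REV_RANGE(MIDR_CORTEX_A510, 0, 0, 2) entry appears to restrict the workaround to Cortex-A510 variant 0, revisions 0 through 2 (r0p0 to r0p2). A minimal user-space sketch of that match follows, assuming the usual MIDR_EL1 field layout, implementer code 0x41 for Arm and part number 0xd46 for Cortex-A510. The helper name is_affected_by_2077057() is hypothetical and is not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 fields (assumed layout: revision [3:0], part [15:4],
 * variant [23:20], implementer [31:24]). */
#define MIDR_REVISION(midr)	((midr) & 0xfu)
#define MIDR_PARTNUM(midr)	(((midr) >> 4) & 0xfffu)
#define MIDR_VARIANT(midr)	(((midr) >> 20) & 0xfu)
#define MIDR_IMPLEMENTER(midr)	(((midr) >> 24) & 0xffu)

#define IMPL_ARM		0x41u	/* assumed implementer code */
#define PART_CORTEX_A510	0xd46u	/* assumed part number */

/* Hypothetical helper: true for Cortex-A510 r0p0..r0p2 only. */
static bool is_affected_by_2077057(uint32_t midr)
{
	return MIDR_IMPLEMENTER(midr) == IMPL_ARM &&
	       MIDR_PARTNUM(midr) == PART_CORTEX_A510 &&
	       MIDR_VARIANT(midr) == 0 &&
	       MIDR_REVISION(midr) <= 2;
}
```

In the kernel the capability type is ARM64_CPUCAP_LOCAL_CPU_ERRATUM, so the real check runs per CPU; this sketch only models the range comparison.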

+33 -18

arch/arm64/kvm/arm.c

···
 			xfer_to_guest_mode_work_pending();
 }
 
+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+	int ret;
+
+	guest_state_enter_irqoff();
+	ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
+	guest_state_exit_irqoff();
+
+	return ret;
+}
+
 /**
  * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
  * @vcpu: The VCPU pointer
···
 		 * Enter the guest
 		 */
 		trace_kvm_entry(*vcpu_pc(vcpu));
-		guest_enter_irqoff();
+		guest_timing_enter_irqoff();
 
-		ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
+		ret = kvm_arm_vcpu_enter_exit(vcpu);
 
 		vcpu->mode = OUTSIDE_GUEST_MODE;
 		vcpu->stat.exits++;
···
 		kvm_arch_vcpu_ctxsync_fp(vcpu);
 
 		/*
-		 * We may have taken a host interrupt in HYP mode (ie
-		 * while executing the guest). This interrupt is still
-		 * pending, as we haven't serviced it yet!
+		 * We must ensure that any pending interrupts are taken before
+		 * we exit guest timing so that timer ticks are accounted as
+		 * guest time. Transiently unmask interrupts so that any
+		 * pending interrupts are taken.
 		 *
-		 * We're now back in SVC mode, with interrupts
-		 * disabled. Enabling the interrupts now will have
-		 * the effect of taking the interrupt again, in SVC
-		 * mode this time.
+		 * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
+		 * context synchronization event) is necessary to ensure that
+		 * pending interrupts are taken.
 		 */
 		local_irq_enable();
+		isb();
+		local_irq_disable();
 
-		/*
-		 * We do local_irq_enable() before calling guest_exit() so
-		 * that if a timer interrupt hits while running the guest we
-		 * account that tick as being spent in the guest. We enable
-		 * preemption after calling guest_exit() so that if we get
-		 * preempted we make sure ticks after that is not counted as
-		 * guest time.
-		 */
-		guest_exit();
+		guest_timing_exit_irqoff();
+
+		local_irq_enable();
+
 		trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
 
 		/* Exit types that need handling before we can be preempted */

arch/arm64/kvm/handle_exit.c

···
 {
 	struct kvm_run *run = vcpu->run;
 
+	if (ARM_SERROR_PENDING(exception_index)) {
+		/*
+		 * The SError is handled by handle_exit_early(). If the guest
+		 * survives it will re-execute the original instruction.
+		 */
+		return 1;
+	}
+
 	exception_index = ARM_EXCEPTION_CODE(exception_index);
 
 	switch (exception_index) {
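The early return above relies on the exit code carrying an "SError pending" flag alongside the exception code. A small sketch of that encoding and of the new control flow follows, assuming the flag lives in bit 31 and that the trap exception code is 2. All names here (SERROR_PENDING, EXCEPTION_CODE, handle_exit_model) are illustrative stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

#define EXIT_WITH_SERROR_BIT	31	/* assumed flag position */
#define SERROR_PENDING(x)	(!!((x) & (1u << EXIT_WITH_SERROR_BIT)))
#define EXCEPTION_CODE(x)	((x) & ~(1u << EXIT_WITH_SERROR_BIT))

enum { EXC_IRQ = 0, EXC_TRAP = 2 };	/* assumed code values */

/*
 * Model of the fixed handle_exit(): when an SError is pending, the trap is
 * not handled here (the early handler deals with the SError and the guest
 * re-executes the instruction), so the trap is never processed twice.
 * Returns 1, meaning "resume the guest".
 */
static int handle_exit_model(uint32_t exception_index, int *traps_handled)
{
	if (SERROR_PENDING(exception_index))
		return 1;

	if (EXCEPTION_CODE(exception_index) == EXC_TRAP)
		(*traps_handled)++;

	return 1;
}
```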

+21 -2

arch/arm64/kvm/hyp/include/hyp/switch.h

···
 	return false;
 }
 
+static inline void synchronize_vcpu_pstate(struct kvm_vcpu *vcpu, u64 *exit_code)
+{
+	/*
+	 * Check for the conditions of Cortex-A510's #2077057. When these occur
+	 * SPSR_EL2 can't be trusted, but isn't needed either as it is
+	 * unchanged from the value in vcpu_gp_regs(vcpu)->pstate.
+	 * Are we single-stepping the guest, and took a PAC exception from the
+	 * active-not-pending state?
+	 */
+	if (cpus_have_final_cap(ARM64_WORKAROUND_2077057) &&
+	    vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP &&
+	    *vcpu_cpsr(vcpu) & DBG_SPSR_SS &&
+	    ESR_ELx_EC(read_sysreg_el2(SYS_ESR)) == ESR_ELx_EC_PAC)
+		write_sysreg_el2(*vcpu_cpsr(vcpu), SYS_SPSR);
+
+	vcpu->arch.ctxt.regs.pstate = read_sysreg_el2(SYS_SPSR);
+}
+
 /*
  * Return true when we were able to fixup the guest exit and should return to
  * the guest, false when we should restore the host state and return to the
···
 	 * Save PSTATE early so that we can evaluate the vcpu mode
 	 * early on.
 	 */
-	vcpu->arch.ctxt.regs.pstate = read_sysreg_el2(SYS_SPSR);
+	synchronize_vcpu_pstate(vcpu, exit_code);
 
 	/*
 	 * Check whether we want to repaint the state one way or
···
 	if (ARM_EXCEPTION_CODE(*exit_code) != ARM_EXCEPTION_IRQ)
 		vcpu->arch.fault.esr_el2 = read_sysreg_el2(SYS_ESR);
 
-	if (ARM_SERROR_PENDING(*exit_code)) {
+	if (ARM_SERROR_PENDING(*exit_code) &&
+	    ARM_EXCEPTION_CODE(*exit_code) != ARM_EXCEPTION_IRQ) {
 		u8 esr_ec = kvm_vcpu_trap_get_class(vcpu);
 
 		/*
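The condition in synchronize_vcpu_pstate() can be read as a pure function: if all four erratum conditions hold, trust the in-memory PSTATE copy, otherwise trust the hardware SPSR_EL2. The sketch below models only that decision in user space. The constants are assumptions (single-step flag 0x2, SPSR.SS at bit 21, exception class in ESR bits [31:26], PAC class 0x09), and pick_pstate() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define GUESTDBG_SINGLESTEP	0x2u		/* assumed KVM_GUESTDBG_SINGLESTEP */
#define SPSR_SS			(1ull << 21)	/* assumed DBG_SPSR_SS */
#define ESR_EC(esr)		(((esr) >> 26) & 0x3fu)
#define EC_PAC			0x09u		/* assumed ESR_ELx_EC_PAC */

/*
 * Hypothetical model: returns the PSTATE value that should be recorded for
 * the vCPU. When the erratum conditions are met, the hardware value is
 * ignored and the saved copy from the previous guest entry is kept.
 */
static uint64_t pick_pstate(bool cpu_has_erratum, uint32_t guest_debug,
			    uint64_t saved_pstate, uint64_t esr,
			    uint64_t hw_spsr)
{
	if (cpu_has_erratum &&
	    (guest_debug & GUESTDBG_SINGLESTEP) &&
	    (saved_pstate & SPSR_SS) &&
	    ESR_EC(esr) == EC_PAC)
		return saved_pstate;

	return hw_spsr;
}
```

The real helper additionally writes the saved value back to SPSR_EL2 before reading it; the sketch collapses that into a return value.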

+3 -2

arch/arm64/tools/cpucaps

···
 WORKAROUND_1463225
 WORKAROUND_1508412
 WORKAROUND_1542419
-WORKAROUND_2064142
-WORKAROUND_2038923
 WORKAROUND_1902691
+WORKAROUND_2038923
+WORKAROUND_2064142
+WORKAROUND_2077057
 WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 WORKAROUND_TSB_FLUSH_FAILURE
 WORKAROUND_TRBE_WRITE_OUT_OF_RANGE

+46 -4

arch/mips/kvm/mips.c

···
 	return -ENOIOCTLCMD;
 }
 
+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static int noinstr kvm_mips_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+	int ret;
+
+	guest_state_enter_irqoff();
+	ret = kvm_mips_callbacks->vcpu_run(vcpu);
+	guest_state_exit_irqoff();
+
+	return ret;
+}
+
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 {
 	int r = -EINTR;
···
 	lose_fpu(1);
 
 	local_irq_disable();
-	guest_enter_irqoff();
+	guest_timing_enter_irqoff();
 	trace_kvm_enter(vcpu);
 
 	/*
···
 	 */
 	smp_store_mb(vcpu->mode, IN_GUEST_MODE);
 
-	r = kvm_mips_callbacks->vcpu_run(vcpu);
+	r = kvm_mips_vcpu_enter_exit(vcpu);
+
+	/*
+	 * We must ensure that any pending interrupts are taken before
+	 * we exit guest timing so that timer ticks are accounted as
+	 * guest time. Transiently unmask interrupts so that any
+	 * pending interrupts are taken.
+	 *
+	 * TODO: is there a barrier which ensures that pending interrupts are
+	 * recognised? Currently this just hopes that the CPU takes any pending
+	 * interrupts between the enable and disable.
+	 */
+	local_irq_enable();
+	local_irq_disable();
 
 	trace_kvm_out(vcpu);
-	guest_exit_irqoff();
+	guest_timing_exit_irqoff();
 	local_irq_enable();
 
 out:
···
 /*
  * Return value is in the form (errcode<<2 | RESUME_FLAG_HOST | RESUME_FLAG_NV)
  */
-int kvm_mips_handle_exit(struct kvm_vcpu *vcpu)
+static int __kvm_mips_handle_exit(struct kvm_vcpu *vcpu)
 {
 	struct kvm_run *run = vcpu->run;
 	u32 cause = vcpu->arch.host_cp0_cause;
···
 		    read_c0_config5() & MIPS_CONF5_MSAEN)
 			__kvm_restore_msacsr(&vcpu->arch);
 	}
+	return ret;
+}
+
+int noinstr kvm_mips_handle_exit(struct kvm_vcpu *vcpu)
+{
+	int ret;
+
+	guest_state_exit_irqoff();
+	ret = __kvm_mips_handle_exit(vcpu);
+	guest_state_enter_irqoff();
+
 	return ret;
 }
 

+31 -17

arch/riscv/kvm/vcpu.c

···
 int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpu_context *cntx;
+	struct kvm_vcpu_csr *reset_csr = &vcpu->arch.guest_reset_csr;
 
 	/* Mark this VCPU never ran */
 	vcpu->arch.ran_atleast_once = false;
···
 	cntx->hstatus |= HSTATUS_VTW;
 	cntx->hstatus |= HSTATUS_SPVP;
 	cntx->hstatus |= HSTATUS_SPV;
+
+	/* By default, make CY, TM, and IR counters accessible in VU mode */
+	reset_csr->scounteren = 0x7;
 
 	/* Setup VCPU timer */
 	kvm_riscv_vcpu_timer_init(vcpu);
···
 	csr_write(CSR_HVIP, csr->hvip);
 }
 
+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static void noinstr kvm_riscv_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+	guest_state_enter_irqoff();
+	__kvm_riscv_switch_to(&vcpu->arch);
+	guest_state_exit_irqoff();
+}
+
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 {
 	int ret;
···
 			continue;
 		}
 
-		guest_enter_irqoff();
+		guest_timing_enter_irqoff();
 
-		__kvm_riscv_switch_to(&vcpu->arch);
+		kvm_riscv_vcpu_enter_exit(vcpu);
 
 		vcpu->mode = OUTSIDE_GUEST_MODE;
 		vcpu->stat.exits++;
···
 		kvm_riscv_vcpu_sync_interrupts(vcpu);
 
 		/*
-		 * We may have taken a host interrupt in VS/VU-mode (i.e.
-		 * while executing the guest). This interrupt is still
-		 * pending, as we haven't serviced it yet!
+		 * We must ensure that any pending interrupts are taken before
+		 * we exit guest timing so that timer ticks are accounted as
+		 * guest time. Transiently unmask interrupts so that any
+		 * pending interrupts are taken.
 		 *
-		 * We're now back in HS-mode with interrupts disabled
-		 * so enabling the interrupts now will have the effect
-		 * of taking the interrupt again, in HS-mode this time.
+		 * There's no barrier which ensures that pending interrupts are
+		 * recognised, so we just hope that the CPU takes any pending
+		 * interrupts between the enable and disable.
 		 */
 		local_irq_enable();
+		local_irq_disable();
 
-		/*
-		 * We do local_irq_enable() before calling guest_exit() so
-		 * that if a timer interrupt hits while running the guest
-		 * we account that tick as being spent in the guest. We
-		 * enable preemption after calling guest_exit() so that if
-		 * we get preempted we make sure ticks after that is not
-		 * counted as guest time.
-		 */
-		guest_exit();
+		guest_timing_exit_irqoff();
+
+		local_irq_enable();
 
 		preempt_enable();
 
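The reset value scounteren = 0x7 sets the three low bits of the counter-enable CSR. As a minimal sketch, assuming the RISC-V privileged-spec layout (CY in bit 0, TM in bit 1, IR in bit 2, HPM3 and up from bit 3), the value decodes as follows. The macro and function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define COUNTEREN_CY	(1u << 0)	/* cycle counter */
#define COUNTEREN_TM	(1u << 1)	/* time counter */
#define COUNTEREN_IR	(1u << 2)	/* instructions-retired counter */
#define COUNTEREN_HPM3	(1u << 3)	/* first hardware performance counter */

/* True when the given counter bit is enabled for the lower privilege mode. */
static bool counter_accessible(uint32_t scounteren, uint32_t bit)
{
	return (scounteren & bit) != 0;
}
```

With the value 0x7, cycle, time and instret reads from VU mode are allowed, while the hpmcounter registers stay inaccessible.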

+2 -1

arch/riscv/kvm/vcpu_sbi_base.c

···
 #include <linux/errno.h>
 #include <linux/err.h>
 #include <linux/kvm_host.h>
+#include <linux/version.h>
 #include <asm/csr.h>
 #include <asm/sbi.h>
 #include <asm/kvm_vcpu_timer.h>
···
 		*out_val = KVM_SBI_IMPID;
 		break;
 	case SBI_EXT_BASE_GET_IMP_VERSION:
-		*out_val = 0;
+		*out_val = LINUX_VERSION_CODE;
 		break;
 	case SBI_EXT_BASE_PROBE_EXT:
 		if ((cp->a0 >= SBI_EXT_EXPERIMENTAL_START &&
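With this change a guest calling SBI_EXT_BASE_GET_IMP_VERSION receives the host's LINUX_VERSION_CODE instead of 0. A sketch of how a guest might decode that value follows, assuming the conventional KERNEL_VERSION() packing (major in bits 16 and up, minor in bits 15:8, sublevel in bits 7:0 and clamped to 255). The function names are made up for illustration.

```c
#include <stdint.h>

/* Pack a version the way KERNEL_VERSION(a, b, c) is assumed to. */
static uint32_t version_code(uint32_t major, uint32_t minor, uint32_t sub)
{
	return (major << 16) + (minor << 8) + (sub > 255 ? 255 : sub);
}

/* Unpack the three fields from a packed version code. */
static void version_decode(uint32_t code, uint32_t *major, uint32_t *minor,
			   uint32_t *sub)
{
	*major = code >> 16;
	*minor = (code >> 8) & 0xffu;
	*sub = code & 0xffu;
}
```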

+1 -1

arch/x86/include/asm/kvm-x86-ops.h

···
 KVM_X86_OP(load_eoi_exitmap)
 KVM_X86_OP(set_virtual_apic_mode)
 KVM_X86_OP_NULL(set_apic_access_page_addr)
-KVM_X86_OP(deliver_posted_interrupt)
+KVM_X86_OP(deliver_interrupt)
 KVM_X86_OP_NULL(sync_pir_to_irr)
 KVM_X86_OP(set_tss_addr)
 KVM_X86_OP(set_identity_map_addr)

+2 -1

arch/x86/include/asm/kvm_host.h

···
 	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 	void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
 	void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu);
-	int (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
+	void (*deliver_interrupt)(struct kvm_lapic *apic, int delivery_mode,
+				  int trig_mode, int vector);
 	int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);

+7 -6

arch/x86/kvm/cpuid.c

···
 	);
 
 	kvm_cpu_cap_mask(CPUID_7_0_EBX,
-		F(FSGSBASE) | F(SGX) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
-		F(BMI2) | F(ERMS) | F(INVPCID) | F(RTM) | 0 /*MPX*/ | F(RDSEED) |
-		F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) |
-		F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) |
-		F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | 0 /*INTEL_PT*/
-	);
+		F(FSGSBASE) | F(SGX) | F(BMI1) | F(HLE) | F(AVX2) |
+		F(FDP_EXCPTN_ONLY) | F(SMEP) | F(BMI2) | F(ERMS) | F(INVPCID) |
+		F(RTM) | F(ZERO_FCS_FDS) | 0 /*MPX*/ | F(AVX512F) |
+		F(AVX512DQ) | F(RDSEED) | F(ADX) | F(SMAP) | F(AVX512IFMA) |
+		F(CLFLUSHOPT) | F(CLWB) | 0 /*INTEL_PT*/ | F(AVX512PF) |
+		F(AVX512ER) | F(AVX512CD) | F(SHA_NI) | F(AVX512BW) |
+		F(AVX512VL));
 
 	kvm_cpu_cap_mask(CPUID_7_ECX,
 		F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ | F(RDPID) |
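The mask passed to kvm_cpu_cap_mask() acts as an allow-list: a host feature bit is only reported as supported if it is also present in the mask. The sketch below models that with plain integers, assuming FDP_EXCPTN_ONLY is CPUID.(EAX=7,ECX=0):EBX bit 6 and ZERO_FCS_FDS is bit 13. The function supported_ebx() is hypothetical and ignores everything else the real helper does.

```c
#include <stdint.h>

#define EBX_SMEP		(1u << 7)
#define EBX_FDP_EXCPTN_ONLY	(1u << 6)	/* assumed bit position */
#define EBX_ZERO_FCS_FDS	(1u << 13)	/* assumed bit position */

/* Hypothetical model: only host bits that are also in the mask survive. */
static uint32_t supported_ebx(uint32_t host_ebx, uint32_t kvm_mask)
{
	return host_ebx & kvm_mask;
}
```

Before this change the two x87 deprecation bits were missing from the mask, so they were dropped from the supported CPUID even on hosts that set them.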

+2 -8

arch/x86/kvm/lapic.c

···
 					apic->regs + APIC_TMR);
 		}
 
-		if (static_call(kvm_x86_deliver_posted_interrupt)(vcpu, vector)) {
-			kvm_lapic_set_irr(vector, apic);
-			kvm_make_request(KVM_REQ_EVENT, vcpu);
-			kvm_vcpu_kick(vcpu);
-		} else {
-			trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
-						   trig_mode, vector);
-		}
+		static_call(kvm_x86_deliver_interrupt)(apic, delivery_mode,
+						       trig_mode, vector);
 		break;
 
 	case APIC_DM_REMRD:

+18 -3

arch/x86/kvm/svm/svm.c

···
 		SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
 }
 
+static void svm_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+				  int trig_mode, int vector)
+{
+	struct kvm_vcpu *vcpu = apic->vcpu;
+
+	if (svm_deliver_avic_intr(vcpu, vector)) {
+		kvm_lapic_set_irr(vector, apic);
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		kvm_vcpu_kick(vcpu);
+	} else {
+		trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
+					   trig_mode, vector);
+	}
+}
+
 static void svm_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
···
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned long vmcb_pa = svm->current_vmcb->pa;
 
-	kvm_guest_enter_irqoff();
+	guest_state_enter_irqoff();
 
 	if (sev_es_guest(vcpu->kvm)) {
 		__svm_sev_es_vcpu_run(vmcb_pa);
···
 		vmload(__sme_page_pa(sd->save_area));
 	}
 
-	kvm_guest_exit_irqoff();
+	guest_state_exit_irqoff();
 }
 
 static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
···
 	.pmu_ops = &amd_pmu_ops,
 	.nested_ops = &svm_nested_ops,
 
-	.deliver_posted_interrupt = svm_deliver_avic_intr,
+	.deliver_interrupt = svm_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = svm_dy_apicv_has_pending_interrupt,
 	.update_pi_irte = svm_update_pi_irte,
 	.setup_mce = svm_setup_mce,

+18 -3

arch/x86/kvm/vmx/vmx.c

···
 	return 0;
 }
 
+static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+				  int trig_mode, int vector)
+{
+	struct kvm_vcpu *vcpu = apic->vcpu;
+
+	if (vmx_deliver_posted_interrupt(vcpu, vector)) {
+		kvm_lapic_set_irr(vector, apic);
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		kvm_vcpu_kick(vcpu);
+	} else {
+		trace_kvm_apicv_accept_irq(vcpu->vcpu_id, delivery_mode,
+					   trig_mode, vector);
+	}
+}
+
 /*
  * Set up the vmcs's constant host-state fields, i.e., host-state fields that
  * will not change in the lifetime of the guest.
···
 static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 					struct vcpu_vmx *vmx)
 {
-	kvm_guest_enter_irqoff();
+	guest_state_enter_irqoff();
 
 	/* L1D Flush includes CPU buffer clear to mitigate MDS */
 	if (static_branch_unlikely(&vmx_l1d_should_flush))
···
 
 	vcpu->arch.cr2 = native_read_cr2();
 
-	kvm_guest_exit_irqoff();
+	guest_state_exit_irqoff();
 }
 
 static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
···
 	.hwapic_isr_update = vmx_hwapic_isr_update,
 	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
 	.sync_pir_to_irr = vmx_sync_pir_to_irr,
-	.deliver_posted_interrupt = vmx_deliver_posted_interrupt,
+	.deliver_interrupt = vmx_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
 
 	.set_tss_addr = vmx_set_tss_addr,
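The refactoring moves the "try accelerated delivery, otherwise fall back to IRR plus kick" logic out of common lapic code and behind a vendor hook, which is what lets the AMD side change its ordering later. A reduced user-space model of that shape follows. Every type and function in it (vcpu_model, try_posted, vendor_deliver_interrupt, ops_model) is a stand-in invented for the sketch, and the counters only record which path ran.

```c
#include <stdbool.h>

struct vcpu_model {
	bool posted_ok;	/* would hardware-assisted delivery succeed? */
	int irr_sets;	/* fallback path: vector recorded in IRR */
	int kicks;	/* fallback path: vCPU kicked */
	int traced;	/* accelerated path: delivery traced */
};

/* Mirrors the convention in the diff: nonzero means "not delivered". */
static int try_posted(struct vcpu_model *v)
{
	return v->posted_ok ? 0 : -1;
}

/* Vendor hook: owns both the accelerated path and the fallback. */
static void vendor_deliver_interrupt(struct vcpu_model *v)
{
	if (try_posted(v)) {
		v->irr_sets++;
		v->kicks++;
	} else {
		v->traced++;
	}
}

/* Common code only sees the hook, never the vendor details. */
struct ops_model {
	void (*deliver_interrupt)(struct vcpu_model *v);
};

static const struct ops_model ops = {
	.deliver_interrupt = vendor_deliver_interrupt,
};
```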

+6 -4

arch/x86/kvm/x86.c

···
 u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;
 EXPORT_SYMBOL_GPL(kvm_mce_cap_supported);
 
+#define ERR_PTR_USR(e) ((void __user *)ERR_PTR(e))
+
 #define emul_to_vcpu(ctxt) \
 	((struct kvm_vcpu *)(ctxt)->vcpu)
 
···
 	void __user *uaddr = (void __user*)(unsigned long)attr->addr;
 
 	if ((u64)(unsigned long)uaddr != attr->addr)
-		return ERR_PTR(-EFAULT);
+		return ERR_PTR_USR(-EFAULT);
 	return uaddr;
 }
 
···
 		set_debugreg(0, 7);
 	}
 
+	guest_timing_enter_irqoff();
+
 	for (;;) {
 		/*
 		 * Assert that vCPU vs. VM APICv state is consistent. An APICv
···
 	 * of accounting via context tracking, but the loss of accuracy is
 	 * acceptable for all known use cases.
 	 */
-	vtime_account_guest_exit();
+	guest_timing_exit_irqoff();
 
 	if (lapic_in_kernel(vcpu)) {
 		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
···
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_pit(kvm);
 }
-
-#define ERR_PTR_USR(e) ((void __user *)ERR_PTR(e))
 
 /**
  * __x86_set_memory_region: Setup KVM internal memory slot
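ERR_PTR_USR() only adds a __user cast so that sparse accepts an error pointer where a user pointer is expected; the value is the same error-in-pointer encoding as ERR_PTR(). A user-space sketch of that encoding follows, assuming the kernel convention that the top 4095 addresses are reserved for negated errno values. The lower-case helpers are local re-implementations for illustration, and __user has no user-space equivalent, so it is omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* assumed size of the reserved error range */

/* Encode a negative errno value as a pointer. */
static void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True when the pointer falls in the reserved error range. */
static bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```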

-45

arch/x86/kvm/x86.h

···
 
 void kvm_spurious_fault(void);
 
-static __always_inline void kvm_guest_enter_irqoff(void)
-{
-	/*
-	 * VMENTER enables interrupts (host state), but the kernel state is
-	 * interrupts disabled when this is invoked. Also tell RCU about
-	 * it. This is the same logic as for exit_to_user_mode().
-	 *
-	 * This ensures that e.g. latency analysis on the host observes
-	 * guest mode as interrupt enabled.
-	 *
-	 * guest_enter_irqoff() informs context tracking about the
-	 * transition to guest mode and if enabled adjusts RCU state
-	 * accordingly.
-	 */
-	instrumentation_begin();
-	trace_hardirqs_on_prepare();
-	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-	instrumentation_end();
-
-	guest_enter_irqoff();
-	lockdep_hardirqs_on(CALLER_ADDR0);
-}
-
-static __always_inline void kvm_guest_exit_irqoff(void)
-{
-	/*
-	 * VMEXIT disables interrupts (host state), but tracing and lockdep
-	 * have them in state 'on' as recorded before entering guest mode.
-	 * Same as enter_from_user_mode().
-	 *
-	 * context_tracking_guest_exit() restores host context and reinstates
-	 * RCU if enabled and required.
-	 *
-	 * This needs to be done immediately after VM-Exit, before any code
-	 * that might contain tracepoints or call out to the greater world,
-	 * e.g. before x86_spec_ctrl_restore_host().
-	 */
-	lockdep_hardirqs_off(CALLER_ADDR0);
-	context_tracking_guest_exit();
-
-	instrumentation_begin();
-	trace_hardirqs_off_finish();
-	instrumentation_end();
-}
-
 #define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check) \
 ({ \
 	bool failed = (consistency_check); \

+109 -3

include/linux/kvm_host.h

···
 #include <linux/refcount.h>
 #include <linux/nospec.h>
 #include <linux/notifier.h>
+#include <linux/ftrace.h>
 #include <linux/hashtable.h>
+#include <linux/instrumentation.h>
 #include <linux/interval_tree.h>
 #include <linux/rbtree.h>
 #include <linux/xarray.h>
···
 	u64 last_used_slot_gen;
 };
 
-/* must be called with irqs disabled */
-static __always_inline void guest_enter_irqoff(void)
+/*
+ * Start accounting time towards a guest.
+ * Must be called before entering guest context.
+ */
+static __always_inline void guest_timing_enter_irqoff(void)
 {
 	/*
 	 * This is running in ioctl context so its safe to assume that it's the
···
 	instrumentation_begin();
 	vtime_account_guest_enter();
 	instrumentation_end();
+}
 
+/*
+ * Enter guest context and enter an RCU extended quiescent state.
+ *
+ * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
+ * unsafe to use any code which may directly or indirectly use RCU, tracing
+ * (including IRQ flag tracing), or lockdep. All code in this period must be
+ * non-instrumentable.
+ */
+static __always_inline void guest_context_enter_irqoff(void)
+{
 	/*
 	 * KVM does not hold any references to rcu protected data when it
 	 * switches CPU into a guest mode. In fact switching to a guest mode
···
 	}
 }
 
-static __always_inline void guest_exit_irqoff(void)
+/*
+ * Deprecated. Architectures should move to guest_timing_enter_irqoff() and
+ * guest_state_enter_irqoff().
+ */
+static __always_inline void guest_enter_irqoff(void)
+{
+	guest_timing_enter_irqoff();
+	guest_context_enter_irqoff();
+}
+
+/**
+ * guest_state_enter_irqoff - Fixup state when entering a guest
+ *
+ * Entry to a guest will enable interrupts, but the kernel state is interrupts
+ * disabled when this is invoked. Also tell RCU about it.
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Tell lockdep that interrupts are enabled
+ *
+ * Invoked from architecture specific code before entering a guest.
+ * Must be called with interrupts disabled and the caller must be
+ * non-instrumentable.
+ * The caller has to invoke guest_timing_enter_irqoff() before this.
+ *
+ * Note: this is analogous to exit_to_user_mode().
+ */
+static __always_inline void guest_state_enter_irqoff(void)
+{
+	instrumentation_begin();
+	trace_hardirqs_on_prepare();
+	lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+	instrumentation_end();
+
+	guest_context_enter_irqoff();
+	lockdep_hardirqs_on(CALLER_ADDR0);
+}
+
+/*
+ * Exit guest context and exit an RCU extended quiescent state.
+ *
+ * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
+ * unsafe to use any code which may directly or indirectly use RCU, tracing
+ * (including IRQ flag tracing), or lockdep. All code in this period must be
+ * non-instrumentable.
+ */
+static __always_inline void guest_context_exit_irqoff(void)
 {
 	context_tracking_guest_exit();
+}
 
+/*
+ * Stop accounting time towards a guest.
+ * Must be called after exiting guest context.
+ */
+static __always_inline void guest_timing_exit_irqoff(void)
+{
 	instrumentation_begin();
 	/* Flush the guest cputime we spent on the guest */
 	vtime_account_guest_exit();
 	instrumentation_end();
+}
+
+/*
+ * Deprecated. Architectures should move to guest_state_exit_irqoff() and
+ * guest_timing_exit_irqoff().
+ */
+static __always_inline void guest_exit_irqoff(void)
+{
+	guest_context_exit_irqoff();
+	guest_timing_exit_irqoff();
 }
 
 static inline void guest_exit(void)
···
 	local_irq_save(flags);
 	guest_exit_irqoff();
 	local_irq_restore(flags);
+}
+
+/**
+ * guest_state_exit_irqoff - Establish state when returning from guest mode
+ *
+ * Entry from a guest disables interrupts, but guest mode is traced as
+ * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ *
+ * Invoked from architecture specific code after exiting a guest.
+ * Must be invoked with interrupts disabled and the caller must be
+ * non-instrumentable.
+ * The caller has to invoke guest_timing_exit_irqoff() after this.
+ *
+ * Note: this is analogous to enter_from_user_mode().
+ */
+static __always_inline void guest_state_exit_irqoff(void)
+{
+	lockdep_hardirqs_off(CALLER_ADDR0);
+	guest_context_exit_irqoff();
+
+	instrumentation_begin();
+	trace_hardirqs_off_finish();
+	instrumentation_end();
 }
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
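Taken together, the new helpers split guest entry into an outer "timing" bracket and an inner "state" bracket, with a short interrupt window between leaving guest state and stopping guest time accounting. The user-space model below only records the order of those steps as used by the arm64, MIPS and RISC-V callers in this merge. The event names and the vcpu_run_once() function are invented for the sketch and do not touch RCU, lockdep or interrupts.

```c
#include <stddef.h>

enum event {
	TIMING_ENTER,	/* guest_timing_enter_irqoff() */
	STATE_ENTER,	/* guest_state_enter_irqoff() */
	GUEST_RUN,	/* the actual world switch */
	STATE_EXIT,	/* guest_state_exit_irqoff() */
	IRQ_WINDOW,	/* local_irq_enable(); local_irq_disable(); */
	TIMING_EXIT,	/* guest_timing_exit_irqoff() */
};

struct trace {
	enum event ev[8];
	size_t n;
};

static void record(struct trace *t, enum event e)
{
	if (t->n < sizeof(t->ev) / sizeof(t->ev[0]))
		t->ev[t->n++] = e;
}

/* Stands in for the noinstr kvm_<arch>_vcpu_enter_exit() helpers. */
static void vcpu_enter_exit(struct trace *t)
{
	record(t, STATE_ENTER);
	record(t, GUEST_RUN);
	record(t, STATE_EXIT);
}

/* Stands in for one pass through the arch run loop. */
static void vcpu_run_once(struct trace *t)
{
	record(t, TIMING_ENTER);
	vcpu_enter_exit(t);
	record(t, IRQ_WINDOW);	/* pending ticks land while still "guest time" */
	record(t, TIMING_EXIT);
}
```

Keeping the interrupt window inside the timing bracket is what makes a timer tick that arrived during guest execution count as guest time.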

+3 -3

include/uapi/linux/kvm.h

···
 #define KVM_S390_NORMAL_RESET _IO(KVMIO, 0xc3)
 #define KVM_S390_CLEAR_RESET _IO(KVMIO, 0xc4)
 
-/* Available with KVM_CAP_XSAVE2 */
-#define KVM_GET_XSAVE2 _IOR(KVMIO, 0xcf, struct kvm_xsave)
-
 struct kvm_s390_pv_sec_parm {
 	__u64 origin;
 	__u64 length;
···
 };
 
 #define KVM_GET_STATS_FD _IO(KVMIO, 0xce)
+
+/* Available with KVM_CAP_XSAVE2 */
+#define KVM_GET_XSAVE2 _IOR(KVMIO, 0xcf, struct kvm_xsave)
 
 #endif /* __LINUX_KVM_H */
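Moving the definition does not change the ioctl number, since the number depends only on the direction, type, sequence number and argument size. The sketch below reproduces the generic asm-generic encoding by hand, assuming KVMIO is 0xAE and that struct kvm_xsave is 4096 bytes. Some architectures use a different bit layout, so treat the resulting constants as illustrative only. The ioc() helper is a local stand-in for the _IO/_IOR macros.

```c
#include <stdint.h>

#define IOC_NONE	0u
#define IOC_READ	2u	/* generic layout; some arches differ */
#define KVMIO_TYPE	0xaeu	/* assumed value of KVMIO */
#define XSAVE_SIZE	4096u	/* assumed sizeof(struct kvm_xsave) */

/* Generic layout: dir[31:30] size[29:16] type[15:8] nr[7:0]. */
static uint32_t ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
	return (dir << 30) | (size << 16) | (type << 8) | nr;
}
```

Under these assumptions, ioc(IOC_READ, KVMIO_TYPE, 0xcf, XSAVE_SIZE) yields the number a VMM would pass for KVM_GET_XSAVE2, and ioc(IOC_NONE, KVMIO_TYPE, 0xce, 0) the one for KVM_GET_STATS_FD.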