Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"This is a bit on the large side, mostly due to two changes:

- Changes to disable some broken PMU virtualization (see below for
details under "x86 PMU")

- Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched
return thunk in use. This should not happen!" when running KVM
selftests.

Everything else is small bugfixes and selftest changes:

- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure
where KVM would allow userspace to refresh the cache with a bogus
GPA. The bug has existed for quite some time, but was exposed by a
new sanity check added in 6.9 (to ensure a cache is either
GPA-based or HVA-based).

- Drop an unused param from gfn_to_pfn_cache_invalidate_start() that
got left behind during a 6.9 cleanup.

- Fix a math goof in x86's hugepage logic for
KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow
(detected by KASAN).

- Fix a bug where KVM incorrectly clears root_role.direct when
userspace sets guest CPUID.

- Fix a dirty logging bug in the TDP MMU where KVM fails to write-protect
SPTEs used by a nested guest, if KVM is using Page-Modification
Logging and the nested hypervisor is NOT using EPT.

x86 PMU:

- Drop support for virtualizing adaptive PEBS, as KVM's
implementation is architecturally broken without an obvious/easy
path forward, and because exposing adaptive PEBS can leak host LBRs
to the guest, i.e. can leak host kernel addresses to the guest.

- Set the enable bits for general purpose counters in
PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD
processors.

- Disable LBR virtualization on CPUs that don't support LBR
callstacks, as KVM unconditionally uses
PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and
would fail on such CPUs.

Tests:

- Fix a flaw in the max_guest_memory selftest that results in it
exhausting the supply of ucall structures when run with more than
256 vCPUs.

- Mark KVM_MEM_READONLY as supported for RISC-V in
set_memory_region_test"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits)
KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()
KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test
KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting
KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}()
KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status
KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update
KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks
perf/x86/intel: Expose existence of callback support to KVM
KVM: VMX: Snapshot LBR capabilities during module initialization
KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD
KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run()
KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area
KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area
KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted
KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run()
KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV
KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding
KVM: SVM: Remove a useless zeroing of allocated memory
...

+267 -159
+1
arch/x86/events/intel/lbr.c
···
 	lbr->from = x86_pmu.lbr_from;
 	lbr->to = x86_pmu.lbr_to;
 	lbr->info = x86_pmu.lbr_info;
+	lbr->has_callstack = x86_pmu_has_lbr_callstack();
 }
 EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
+1
arch/x86/include/asm/kvm_host.h
···
 	int cpuid_nent;
 	struct kvm_cpuid_entry2 *cpuid_entries;
 	struct kvm_hypervisor_cpuid kvm_cpuid;
+	bool is_amd_compatible;

 	/*
 	 * FIXME: Drop this macro and use KVM_NR_GOVERNED_FEATURES directly
+1
arch/x86/include/asm/perf_event.h
···
 	unsigned int from;
 	unsigned int to;
 	unsigned int info;
+	bool has_callstack;
 };

 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
-5
arch/x86/kvm/Makefile
···
 ccflags-y += -I $(srctree)/arch/x86/kvm
 ccflags-$(CONFIG_KVM_WERROR) += -Werror

-ifeq ($(CONFIG_FRAME_POINTER),y)
-OBJECT_FILES_NON_STANDARD_vmx/vmenter.o := y
-OBJECT_FILES_NON_STANDARD_svm/vmenter.o := y
-endif
-
 include $(srctree)/virt/kvm/Makefile.kvm

 kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
+1
arch/x86/kvm/cpuid.c
···
 	kvm_update_pv_runtime(vcpu);

+	vcpu->arch.is_amd_compatible = guest_cpuid_is_amd_or_hygon(vcpu);
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
 	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
+10
arch/x86/kvm/cpuid.h
···
 	return best && is_guest_vendor_intel(best->ebx, best->ecx, best->edx);
 }

+static inline bool guest_cpuid_is_amd_compatible(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.is_amd_compatible;
+}
+
+static inline bool guest_cpuid_is_intel_compatible(struct kvm_vcpu *vcpu)
+{
+	return !guest_cpuid_is_amd_compatible(vcpu);
+}
+
 static inline int guest_cpuid_family(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
+2 -1
arch/x86/kvm/lapic.c
···
 		trig_mode = reg & APIC_LVT_LEVEL_TRIGGER;

 		r = __apic_accept_irq(apic, mode, vector, 1, trig_mode, NULL);
-		if (r && lvt_type == APIC_LVTPC)
+		if (r && lvt_type == APIC_LVTPC &&
+		    guest_cpuid_is_intel_compatible(apic->vcpu))
 			kvm_lapic_set_reg(apic, APIC_LVTPC, reg | APIC_LVT_MASKED);
 		return r;
 	}
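As a rough illustration of the lapic.c change above, the following standalone C sketch models what happens to LVTPC once a PMI has been accepted. The names `toy_lvtpc_after_pmi` and `TOY_APIC_LVT_MASKED` are hypothetical and not part of KVM; only the rule itself (mask LVTPC for Intel-compatible vCPUs, leave it untouched for AMD-compatible ones) is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 16 is the "masked" bit of a local APIC LVT entry. */
#define TOY_APIC_LVT_MASKED (1u << 16)

/*
 * Hypothetical model, not KVM code: return the LVTPC register value after
 * a PMI has been accepted by the local APIC.  Intel-compatible vCPUs get
 * the entry masked; AMD-compatible vCPUs keep the value unchanged.
 */
static inline uint32_t toy_lvtpc_after_pmi(bool is_amd_compatible,
					   uint32_t lvtpc)
{
	if (!is_amd_compatible)
		lvtpc |= TOY_APIC_LVT_MASKED;

	return lvtpc;
}
```

With this model, an AMD-compatible guest does not have to unmask LVTPC in its PMI handler before the next PMI can be delivered, which is the behavior the commit title "Do not mask LVTPC when handling a PMI on AMD platforms" describes.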
+6 -5
arch/x86/kvm/mmu/mmu.c
···
 				context->cpu_role.base.level, is_efer_nx(context),
 				guest_can_use(vcpu, X86_FEATURE_GBPAGES),
 				is_cr4_pse(context),
-				guest_cpuid_is_amd_or_hygon(vcpu));
+				guest_cpuid_is_amd_compatible(vcpu));
 }

 static void __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
···
 	 * that problem is swept under the rug; KVM's CPUID API is horrific and
 	 * it's all but impossible to solve it without introducing a new API.
 	 */
-	vcpu->arch.root_mmu.root_role.word = 0;
-	vcpu->arch.guest_mmu.root_role.word = 0;
-	vcpu->arch.nested_mmu.root_role.word = 0;
+	vcpu->arch.root_mmu.root_role.invalid = 1;
+	vcpu->arch.guest_mmu.root_role.invalid = 1;
+	vcpu->arch.nested_mmu.root_role.invalid = 1;
 	vcpu->arch.root_mmu.cpu_role.ext.valid = 0;
 	vcpu->arch.guest_mmu.cpu_role.ext.valid = 0;
 	vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
···
 			 * by the memslot, KVM can't use a hugepage due to the
 			 * misaligned address regardless of memory attributes.
 			 */
-			if (gfn >= slot->base_gfn) {
+			if (gfn >= slot->base_gfn &&
+			    gfn + nr_pages <= slot->base_gfn + slot->npages) {
 				if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
 					hugepage_clear_mixed(slot, gfn, level);
 				else
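The last hunk above tightens a range check: the old condition only verified that the head of the hugepage falls inside the memslot, while the new one also requires the tail to fit. A minimal standalone sketch of the two predicates follows; `toy_head_in_slot` and `toy_range_in_slot` are hypothetical names, and the parameters are plain integers standing in for `gfn`, `nr_pages`, `slot->base_gfn` and `slot->npages`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old check (sketch): only the first page of the hugepage is validated. */
static inline bool toy_head_in_slot(uint64_t gfn, uint64_t base_gfn)
{
	return gfn >= base_gfn;
}

/*
 * New check (sketch): the whole range [gfn, gfn + nr_pages) must lie
 * inside the slot [base_gfn, base_gfn + npages).
 */
static inline bool toy_range_in_slot(uint64_t gfn, uint64_t nr_pages,
				     uint64_t base_gfn, uint64_t npages)
{
	return gfn >= base_gfn && gfn + nr_pages <= base_gfn + npages;
}
```

For example, with a slot of 768 pages starting at GFN 0 and a 512-page (2 MiB) hugepage starting at GFN 512, the old predicate accepts the range even though its tail runs past the slot, while the new predicate rejects it. That kind of overrun is consistent with the array overflow the pull request says KASAN detected.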
+22 -29
arch/x86/kvm/mmu/tdp_mmu.c
···
 	}
 }

-/*
- * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
- * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
- * If AD bits are not enabled, this will require clearing the writable bit on
- * each SPTE. Returns true if an SPTE has been changed and the TLBs need to
- * be flushed.
- */
+static bool tdp_mmu_need_write_protect(struct kvm_mmu_page *sp)
+{
+	/*
+	 * All TDP MMU shadow pages share the same role as their root, aside
+	 * from level, so it is valid to key off any shadow page to determine if
+	 * write protection is needed for an entire tree.
+	 */
+	return kvm_mmu_page_ad_need_write_protect(sp) || !kvm_ad_enabled();
+}
+
 static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 				  gfn_t start, gfn_t end)
 {
-	u64 dbit = kvm_ad_enabled() ? shadow_dirty_mask : PT_WRITABLE_MASK;
+	const u64 dbit = tdp_mmu_need_write_protect(root) ? PT_WRITABLE_MASK :
+							    shadow_dirty_mask;
 	struct tdp_iter iter;
 	bool spte_set = false;
···
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;

-		KVM_MMU_WARN_ON(kvm_ad_enabled() &&
+		KVM_MMU_WARN_ON(dbit == shadow_dirty_mask &&
 				spte_ad_need_write_protect(iter.old_spte));

 		if (!(iter.old_spte & dbit))
···
 }

 /*
- * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
- * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
- * If AD bits are not enabled, this will require clearing the writable bit on
- * each SPTE. Returns true if an SPTE has been changed and the TLBs need to
- * be flushed.
+ * Clear the dirty status (D-bit or W-bit) of all the SPTEs mapping GFNs in the
+ * memslot. Returns true if an SPTE has been changed and the TLBs need to be
+ * flushed.
  */
 bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
 				  const struct kvm_memory_slot *slot)
···
 	return spte_set;
 }

-/*
- * Clears the dirty status of all the 4k SPTEs mapping GFNs for which a bit is
- * set in mask, starting at gfn. The given memslot is expected to contain all
- * the GFNs represented by set bits in the mask. If AD bits are enabled,
- * clearing the dirty status will involve clearing the dirty bit on each SPTE
- * or, if AD bits are not enabled, clearing the writable bit on each SPTE.
- */
 static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 				  gfn_t gfn, unsigned long mask, bool wrprot)
 {
-	u64 dbit = (wrprot || !kvm_ad_enabled()) ? PT_WRITABLE_MASK :
-						   shadow_dirty_mask;
+	const u64 dbit = (wrprot || tdp_mmu_need_write_protect(root)) ? PT_WRITABLE_MASK :
+									 shadow_dirty_mask;
 	struct tdp_iter iter;

 	lockdep_assert_held_write(&kvm->mmu_lock);
···
 		if (!mask)
 			break;

-		KVM_MMU_WARN_ON(kvm_ad_enabled() &&
+		KVM_MMU_WARN_ON(dbit == shadow_dirty_mask &&
 				spte_ad_need_write_protect(iter.old_spte));

 		if (iter.level > PG_LEVEL_4K ||
···
 }

 /*
- * Clears the dirty status of all the 4k SPTEs mapping GFNs for which a bit is
- * set in mask, starting at gfn. The given memslot is expected to contain all
- * the GFNs represented by set bits in the mask. If AD bits are enabled,
- * clearing the dirty status will involve clearing the dirty bit on each SPTE
- * or, if AD bits are not enabled, clearing the writable bit on each SPTE.
+ * Clear the dirty status (D-bit or W-bit) of all the 4k SPTEs mapping GFNs for
+ * which a bit is set in mask, starting at gfn. The given memslot is expected to
+ * contain all the GFNs represented by set bits in the mask.
  */
 void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				       struct kvm_memory_slot *slot,
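The tdp_mmu.c hunks above change how the "dirty bit" to clear is chosen: instead of keying only off the global AD-bit setting, the choice now also depends on whether the root's shadow pages need write protection. The sketch below models that selection as a pure function. The function name, the parameter names and the `TOY_DIRTY_MASK` value are hypothetical (in KVM the dirty mask is a runtime value, `shadow_dirty_mask`); bit 1 as the writable bit matches x86 page-table entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_WRITABLE_MASK (1ull << 1)  /* x86 PTE R/W bit */
#define TOY_DIRTY_MASK    (1ull << 6)  /* illustrative stand-in for shadow_dirty_mask */

/*
 * Sketch of the post-patch selection: write-protect (clear the W-bit) if
 * the caller asked for it, if the shadow pages under this root require
 * write protection, or if A/D bits are disabled.  Otherwise clear the D-bit.
 */
static inline uint64_t toy_dirty_clear_mask(bool wrprot,
					    bool root_needs_wrprot,
					    bool ad_enabled)
{
	if (wrprot || root_needs_wrprot || !ad_enabled)
		return TOY_WRITABLE_MASK;

	return TOY_DIRTY_MASK;
}
```

The case the old code got wrong, per the pull request text, corresponds to `root_needs_wrprot == true` with `ad_enabled == true`: the pre-patch logic would have picked the D-bit there, whereas the sketch (like the patched code) picks the W-bit.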
+14 -2
arch/x86/kvm/pmu.c
···
 	pmu->pebs_data_cfg_mask = ~0ull;
 	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);

-	if (vcpu->kvm->arch.enable_pmu)
-		static_call(kvm_x86_pmu_refresh)(vcpu);
+	if (!vcpu->kvm->arch.enable_pmu)
+		return;
+
+	static_call(kvm_x86_pmu_refresh)(vcpu);
+
+	/*
+	 * At RESET, both Intel and AMD CPUs set all enable bits for general
+	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
+	 * was written for v1 PMUs don't unknowingly leave GP counters disabled
+	 * in the global controls). Emulate that behavior when refreshing the
+	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
+	 */
+	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
+		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
 }

 void kvm_pmu_init(struct kvm_vcpu *vcpu)
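To make the RESET value concrete, here is a small userspace sketch of the mask computed by `GENMASK_ULL(nr_arch_gp_counters - 1, 0)` in the hunk above. `toy_reset_global_ctrl` is a hypothetical helper, not a kernel function; it assumes at most 64 counters and returns 0 when there is no PERF_GLOBAL_CTRL or no general purpose counter, matching the guarded assignment in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: post-RESET value of PERF_GLOBAL_CTRL, with bits n-1:0 set for
 * n general purpose counters and all upper bits clear.
 */
static inline uint64_t toy_reset_global_ctrl(bool has_global_ctrl,
					     unsigned int nr_gp_counters)
{
	assert(nr_gp_counters <= 64);

	if (!has_global_ctrl || nr_gp_counters == 0)
		return 0;

	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	if (nr_gp_counters == 64)
		return ~0ull;

	return (1ull << nr_gp_counters) - 1;
}
```

For a vCPU with 8 general purpose counters this yields 0xff, so a guest written for a v1 PMU that never touches PERF_GLOBAL_CTRL still finds its counters globally enabled.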
+1 -1
arch/x86/kvm/svm/sev.c
···
 	/* Avoid using vmalloc for smaller buffers. */
 	size = npages * sizeof(struct page *);
 	if (size > PAGE_SIZE)
-		pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+		pages = __vmalloc(size, GFP_KERNEL_ACCOUNT);
 	else
 		pages = kmalloc(size, GFP_KERNEL_ACCOUNT);
+10 -7
arch/x86/kvm/svm/svm.c
···
 	__free_pages(virt_to_page(svm->msrpm), get_order(MSRPM_SIZE));
 }

+static struct sev_es_save_area *sev_es_host_save_area(struct svm_cpu_data *sd)
+{
+	return page_address(sd->save_area) + 0x400;
+}
+
 static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
···
 	 * or subsequent vmload of host save area.
 	 */
 	vmsave(sd->save_area_pa);
-	if (sev_es_guest(vcpu->kvm)) {
-		struct sev_es_save_area *hostsa;
-		hostsa = (struct sev_es_save_area *)(page_address(sd->save_area) + 0x400);
-
-		sev_es_prepare_switch_to_guest(hostsa);
-	}
+	if (sev_es_guest(vcpu->kvm))
+		sev_es_prepare_switch_to_guest(sev_es_host_save_area(sd));

 	if (tsc_scaling)
 		__svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
···

 static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_intercepted)
 {
+	struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, vcpu->cpu);
 	struct vcpu_svm *svm = to_svm(vcpu);

 	guest_state_enter_irqoff();
···
 	amd_clear_divider();

 	if (sev_es_guest(vcpu->kvm))
-		__svm_sev_es_vcpu_run(svm, spec_ctrl_intercepted);
+		__svm_sev_es_vcpu_run(svm, spec_ctrl_intercepted,
+				      sev_es_host_save_area(sd));
 	else
 		__svm_vcpu_run(svm, spec_ctrl_intercepted);
+2 -1
arch/x86/kvm/svm/svm.h
···
 /* vmenter.S */

-void __svm_sev_es_vcpu_run(struct vcpu_svm *svm, bool spec_ctrl_intercepted);
+void __svm_sev_es_vcpu_run(struct vcpu_svm *svm, bool spec_ctrl_intercepted,
+			   struct sev_es_save_area *hostsa);
 void __svm_vcpu_run(struct vcpu_svm *svm, bool spec_ctrl_intercepted);

 #define DEFINE_KVM_GHCB_ACCESSORS(field) \
+44 -53
arch/x86/kvm/svm/vmenter.S
···
 #include <asm/asm.h>
 #include <asm/asm-offsets.h>
 #include <asm/bitsperlong.h>
+#include <asm/frame.h>
 #include <asm/kvm_vcpu_regs.h>
 #include <asm/nospec-branch.h>
 #include "kvm-asm-offsets.h"
···
 		"", X86_FEATURE_V_SPEC_CTRL
 901:
 .endm
-.macro RESTORE_HOST_SPEC_CTRL_BODY
+.macro RESTORE_HOST_SPEC_CTRL_BODY spec_ctrl_intercepted:req
 900:
 	/* Same for after vmexit. */
 	mov $MSR_IA32_SPEC_CTRL, %ecx
···
 	 * Load the value that the guest had written into MSR_IA32_SPEC_CTRL,
 	 * if it was not intercepted during guest execution.
 	 */
-	cmpb $0, (%_ASM_SP)
+	cmpb $0, \spec_ctrl_intercepted
 	jnz 998f
 	rdmsr
 	movl %eax, SVM_spec_ctrl(%_ASM_DI)
···
  */
 SYM_FUNC_START(__svm_vcpu_run)
 	push %_ASM_BP
+	mov %_ASM_SP, %_ASM_BP
 #ifdef CONFIG_X86_64
 	push %r15
 	push %r14
···
 	RET

 	RESTORE_GUEST_SPEC_CTRL_BODY
-	RESTORE_HOST_SPEC_CTRL_BODY
+	RESTORE_HOST_SPEC_CTRL_BODY (%_ASM_SP)

 10:	cmpb $0, _ASM_RIP(kvm_rebooting)
 	jne 2b
···

 SYM_FUNC_END(__svm_vcpu_run)

+#ifdef CONFIG_KVM_AMD_SEV
+
+
+#ifdef CONFIG_X86_64
+#define SEV_ES_GPRS_BASE 0x300
+#define SEV_ES_RBX (SEV_ES_GPRS_BASE + __VCPU_REGS_RBX * WORD_SIZE)
+#define SEV_ES_RBP (SEV_ES_GPRS_BASE + __VCPU_REGS_RBP * WORD_SIZE)
+#define SEV_ES_RSI (SEV_ES_GPRS_BASE + __VCPU_REGS_RSI * WORD_SIZE)
+#define SEV_ES_RDI (SEV_ES_GPRS_BASE + __VCPU_REGS_RDI * WORD_SIZE)
+#define SEV_ES_R12 (SEV_ES_GPRS_BASE + __VCPU_REGS_R12 * WORD_SIZE)
+#define SEV_ES_R13 (SEV_ES_GPRS_BASE + __VCPU_REGS_R13 * WORD_SIZE)
+#define SEV_ES_R14 (SEV_ES_GPRS_BASE + __VCPU_REGS_R14 * WORD_SIZE)
+#define SEV_ES_R15 (SEV_ES_GPRS_BASE + __VCPU_REGS_R15 * WORD_SIZE)
+#endif
+
 /**
  * __svm_sev_es_vcpu_run - Run a SEV-ES vCPU via a transition to SVM guest mode
  * @svm: struct vcpu_svm *
  * @spec_ctrl_intercepted: bool
  */
 SYM_FUNC_START(__svm_sev_es_vcpu_run)
-	push %_ASM_BP
-#ifdef CONFIG_X86_64
-	push %r15
-	push %r14
-	push %r13
-	push %r12
-#else
-	push %edi
-	push %esi
-#endif
-	push %_ASM_BX
+	FRAME_BEGIN

 	/*
-	 * Save variables needed after vmexit on the stack, in inverse
-	 * order compared to when they are needed.
+	 * Save non-volatile (callee-saved) registers to the host save area.
+	 * Except for RAX and RSP, all GPRs are restored on #VMEXIT, but not
+	 * saved on VMRUN.
 	 */
+	mov %rbp, SEV_ES_RBP (%rdx)
+	mov %r15, SEV_ES_R15 (%rdx)
+	mov %r14, SEV_ES_R14 (%rdx)
+	mov %r13, SEV_ES_R13 (%rdx)
+	mov %r12, SEV_ES_R12 (%rdx)
+	mov %rbx, SEV_ES_RBX (%rdx)

-	/* Accessed directly from the stack in RESTORE_HOST_SPEC_CTRL. */
-	push %_ASM_ARG2
-
-	/* Save @svm. */
-	push %_ASM_ARG1
-
-.ifnc _ASM_ARG1, _ASM_DI
 	/*
-	 * Stash @svm in RDI early. On 32-bit, arguments are in RAX, RCX
-	 * and RDX which are clobbered by RESTORE_GUEST_SPEC_CTRL.
+	 * Save volatile registers that hold arguments that are needed after
+	 * #VMEXIT (RDI=@svm and RSI=@spec_ctrl_intercepted).
 	 */
-	mov %_ASM_ARG1, %_ASM_DI
-.endif
+	mov %rdi, SEV_ES_RDI (%rdx)
+	mov %rsi, SEV_ES_RSI (%rdx)

-	/* Clobbers RAX, RCX, RDX. */
+	/* Clobbers RAX, RCX, RDX (@hostsa). */
 	RESTORE_GUEST_SPEC_CTRL

 	/* Get svm->current_vmcb->pa into RAX. */
-	mov SVM_current_vmcb(%_ASM_DI), %_ASM_AX
-	mov KVM_VMCB_pa(%_ASM_AX), %_ASM_AX
+	mov SVM_current_vmcb(%rdi), %rax
+	mov KVM_VMCB_pa(%rax), %rax

 	/* Enter guest mode */
 	sti

-1:	vmrun %_ASM_AX
+1:	vmrun %rax

 2:	cli

-	/* Pop @svm to RDI, guest registers have been saved already. */
-	pop %_ASM_DI
-
 #ifdef CONFIG_MITIGATION_RETPOLINE
 	/* IMPORTANT: Stuff the RSB immediately after VM-Exit, before RET! */
-	FILL_RETURN_BUFFER %_ASM_AX, RSB_CLEAR_LOOPS, X86_FEATURE_RETPOLINE
+	FILL_RETURN_BUFFER %rax, RSB_CLEAR_LOOPS, X86_FEATURE_RETPOLINE
 #endif

-	/* Clobbers RAX, RCX, RDX. */
+	/* Clobbers RAX, RCX, RDX, consumes RDI (@svm) and RSI (@spec_ctrl_intercepted). */
 	RESTORE_HOST_SPEC_CTRL

 	/*
···
 	 */
 	UNTRAIN_RET_VM

-	/* "Pop" @spec_ctrl_intercepted. */
-	pop %_ASM_BX
-
-	pop %_ASM_BX
-
-#ifdef CONFIG_X86_64
-	pop %r12
-	pop %r13
-	pop %r14
-	pop %r15
-#else
-	pop %esi
-	pop %edi
-#endif
-	pop %_ASM_BP
+	FRAME_END
 	RET

 	RESTORE_GUEST_SPEC_CTRL_BODY
-	RESTORE_HOST_SPEC_CTRL_BODY
+	RESTORE_HOST_SPEC_CTRL_BODY %sil

-3:	cmpb $0, _ASM_RIP(kvm_rebooting)
+3:	cmpb $0, kvm_rebooting(%rip)
 	jne 2b
 	ud2

 	_ASM_EXTABLE(1b, 3b)

 SYM_FUNC_END(__svm_sev_es_vcpu_run)
+#endif /* CONFIG_KVM_AMD_SEV */
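The `SEV_ES_*` macros added above compute byte offsets into the host save area as `SEV_ES_GPRS_BASE + index * WORD_SIZE`. The sketch below evaluates that formula in plain C. It assumes, and this is an assumption not shown in the diff, that the `__VCPU_REGS_*` indices follow the usual x86 register encoding order (RAX=0, RCX=1, RDX=2, RBX=3, RSP=4, RBP=5, RSI=6, RDI=7, R8..R15=8..15) and that `WORD_SIZE` is 8 bytes on 64-bit. All `toy_`/`TOY_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WORD_SIZE        8u     /* bytes per 64-bit GPR (assumed) */
#define TOY_SEV_ES_GPRS_BASE 0x300u /* mirrors SEV_ES_GPRS_BASE in the diff */

/* Assumed to match the __VCPU_REGS_* numbering; not taken from the diff. */
enum toy_vcpu_reg {
	TOY_REG_RAX = 0, TOY_REG_RCX, TOY_REG_RDX, TOY_REG_RBX,
	TOY_REG_RSP, TOY_REG_RBP, TOY_REG_RSI, TOY_REG_RDI,
	TOY_REG_R8, TOY_REG_R9, TOY_REG_R10, TOY_REG_R11,
	TOY_REG_R12, TOY_REG_R13, TOY_REG_R14, TOY_REG_R15,
};

/* Byte offset of a GPR slot relative to the start of the save area struct. */
static inline uint32_t toy_sev_es_gpr_offset(enum toy_vcpu_reg reg)
{
	assert(reg <= TOY_REG_R15);
	return TOY_SEV_ES_GPRS_BASE + (uint32_t)reg * TOY_WORD_SIZE;
}
```

Under these assumptions RBX lands at 0x318 and RBP at 0x328. Because the hardware restores these slots on #VMEXIT, writing the callee-saved registers there before VMRUN replaces the push/pop sequence the diff removes.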
+1 -1
arch/x86/kvm/vmx/pmu_intel.c
···
 	perf_capabilities = vcpu_get_perf_capabilities(vcpu);
 	if (cpuid_model_is_consistent(vcpu) &&
 	    (perf_capabilities & PMU_CAP_LBR_FMT))
-		x86_perf_get_lbr(&lbr_desc->records);
+		memcpy(&lbr_desc->records, &vmx_lbr_caps, sizeof(vmx_lbr_caps));
 	else
 		lbr_desc->records.nr = 0;
+35 -6
arch/x86/kvm/vmx/vmx.c
···
 int __read_mostly pt_mode = PT_MODE_SYSTEM;
 module_param(pt_mode, int, S_IRUGO);

+struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
+
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
···
 	vmx_update_exception_bitmap(vcpu);
 }

-static u64 vmx_get_perf_capabilities(void)
+static __init u64 vmx_get_perf_capabilities(void)
 {
 	u64 perf_cap = PMU_CAP_FW_WRITES;
-	struct x86_pmu_lbr lbr;
 	u64 host_perf_cap = 0;

 	if (!enable_pmu)
···
 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);

 	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
-		x86_perf_get_lbr(&lbr);
-		if (lbr.nr)
+		x86_perf_get_lbr(&vmx_lbr_caps);
+
+		/*
+		 * KVM requires LBR callstack support, as the overhead due to
+		 * context switching LBRs without said support is too high.
+		 * See intel_pmu_create_guest_lbr_event() for more info.
+		 */
+		if (!vmx_lbr_caps.has_callstack)
+			memset(&vmx_lbr_caps, 0, sizeof(vmx_lbr_caps));
+		else if (vmx_lbr_caps.nr)
 			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
 	}

 	if (vmx_pebs_supported()) {
 		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
-		if ((perf_cap & PERF_CAP_PEBS_FORMAT) < 4)
-			perf_cap &= ~PERF_CAP_PEBS_BASELINE;
+
+		/*
+		 * Disallow adaptive PEBS as it is functionally broken, can be
+		 * used by the guest to read *host* LBRs, and can be used to
+		 * bypass userspace event filters. To correctly and safely
+		 * support adaptive PEBS, KVM needs to:
+		 *
+		 * 1. Account for the ADAPTIVE flag when (re)programming fixed
+		 *    counters.
+		 *
+		 * 2. Gain support from perf (or take direct control of counter
+		 *    programming) to support events without adaptive PEBS
+		 *    enabled for the hardware counter.
+		 *
+		 * 3. Ensure LBR MSRs cannot hold host data on VM-Entry with
+		 *    adaptive PEBS enabled and MSR_PEBS_DATA_CFG.LBRS=1.
+		 *
+		 * 4. Document which PMU events are effectively exposed to the
+		 *    guest via adaptive PEBS, and make adaptive PEBS mutually
+		 *    exclusive with KVM_SET_PMU_EVENT_FILTER if necessary.
+		 */
+		perf_cap &= ~PERF_CAP_PEBS_BASELINE;
 	}

 	return perf_cap;
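The capability filtering in `vmx_get_perf_capabilities()` above can be summarized with two small pure functions. This is only a sketch: the `TOY_*` bit masks are illustrative stand-ins for `PMU_CAP_LBR_FMT`, `PERF_CAP_PEBS_MASK` and `PERF_CAP_PEBS_BASELINE` and are not copied from kernel headers, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's capability masks. */
#define TOY_LBR_FMT_MASK  0x3full
#define TOY_PEBS_BASELINE (1ull << 14)
#define TOY_PEBS_MASK     (0x0fc0ull | TOY_PEBS_BASELINE)

/*
 * LBR format bits are exposed only if the host has LBR records and the
 * CPU supports LBR callstacks; otherwise LBR virtualization is disabled.
 */
static inline uint64_t toy_lbr_caps(uint64_t host_perf_cap,
				    unsigned int nr_lbr_records,
				    bool has_callstack)
{
	if (!has_callstack || nr_lbr_records == 0)
		return 0;

	return host_perf_cap & TOY_LBR_FMT_MASK;
}

/* PEBS bits are passed through, but adaptive PEBS is always stripped. */
static inline uint64_t toy_pebs_caps(uint64_t host_perf_cap)
{
	return (host_perf_cap & TOY_PEBS_MASK) & ~TOY_PEBS_BASELINE;
}
```

The notable difference from the pre-patch code is that the baseline (adaptive PEBS) bit is cleared unconditionally, rather than only when the PEBS format is below 4.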
+5 -1
arch/x86/kvm/vmx/vmx.h
···
 #include "vmx_ops.h"
 #include "../cpuid.h"
 #include "run_flags.h"
+#include "../mmu.h"

 #define MSR_TYPE_R 1
 #define MSR_TYPE_W 2
···
 	/* True if LBRs are marked as not intercepted in the MSR bitmap */
 	bool msr_passthrough;
 };
+
+extern struct x86_pmu_lbr vmx_lbr_caps;

 /*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
···
 	if (!enable_ept)
 		return true;

-	return allow_smaller_maxphyaddr && cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
+	return allow_smaller_maxphyaddr &&
+	       cpuid_maxphyaddr(vcpu) < kvm_get_shadow_phys_bits();
 }

 static inline bool is_unrestricted_guest(struct kvm_vcpu *vcpu)
+1 -1
arch/x86/kvm/x86.c
···
 static bool can_set_mci_status(struct kvm_vcpu *vcpu)
 {
 	/* McStatusWrEn enabled? */
-	if (guest_cpuid_is_amd_or_hygon(vcpu))
+	if (guest_cpuid_is_amd_compatible(vcpu))
 		return !!(vcpu->arch.msr_hwcr & BIT_ULL(18));

 	return false;
+6 -9
tools/testing/selftests/kvm/max_guest_memory_test.c
···
 {
 	uint64_t gpa;

-	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
-		*((volatile uint64_t *)gpa) = gpa;
-
-	GUEST_DONE();
+	for (;;) {
+		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+			*((volatile uint64_t *)gpa) = gpa;
+		GUEST_SYNC(0);
+	}
 }

 struct vcpu_info {
···
 static void run_vcpu(struct kvm_vcpu *vcpu)
 {
 	vcpu_run(vcpu);
-	TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_DONE);
+	TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_SYNC);
 }

 static void *vcpu_worker(void *data)
···
 	struct kvm_vcpu *vcpu = info->vcpu;
 	struct kvm_vm *vm = vcpu->vm;
 	struct kvm_sregs sregs;
-	struct kvm_regs regs;

 	vcpu_args_set(vcpu, 3, info->start_gpa, info->end_gpa, vm->page_size);

-	/* Snapshot regs before the first run. */
-	vcpu_regs_get(vcpu, &regs);
 	rendezvous_with_boss();

 	run_vcpu(vcpu);
 	rendezvous_with_boss();
-	vcpu_regs_set(vcpu, &regs);
 	vcpu_sregs_get(vcpu, &sregs);
 #ifdef __x86_64__
 	/* Toggle CR0.WP to trigger a MMU context reset. */
+1 -1
tools/testing/selftests/kvm/set_memory_region_test.c
···
 	struct kvm_vm *vm;
 	int r, i;

-#if defined __aarch64__ || defined __x86_64__
+#if defined __aarch64__ || defined __riscv || defined __x86_64__
 	supported_flags |= KVM_MEM_READONLY;
 #endif
+19 -1
tools/testing/selftests/kvm/x86_64/pmu_counters_test.c
···
 static void guest_test_gp_counters(void)
 {
+	uint8_t pmu_version = guest_get_pmu_version();
 	uint8_t nr_gp_counters = 0;
 	uint32_t base_msr;

-	if (guest_get_pmu_version())
+	if (pmu_version)
 		nr_gp_counters = this_cpu_property(X86_PROPERTY_PMU_NR_GP_COUNTERS);
+
+	/*
+	 * For v2+ PMUs, PERF_GLOBAL_CTRL's architectural post-RESET value is
+	 * "Sets bits n-1:0 and clears the upper bits", where 'n' is the number
+	 * of GP counters. If there are no GP counters, require KVM to leave
+	 * PERF_GLOBAL_CTRL '0'. This edge case isn't covered by the SDM, but
+	 * follow the spirit of the architecture and only globally enable GP
+	 * counters, of which there are none.
+	 */
+	if (pmu_version > 1) {
+		uint64_t global_ctrl = rdmsr(MSR_CORE_PERF_GLOBAL_CTRL);
+
+		if (nr_gp_counters)
+			GUEST_ASSERT_EQ(global_ctrl, GENMASK_ULL(nr_gp_counters - 1, 0));
+		else
+			GUEST_ASSERT_EQ(global_ctrl, 0);
+	}

 	if (this_cpu_has(X86_FEATURE_PDCM) &&
 	    rdmsr(MSR_IA32_PERF_CAPABILITIES) & PMU_CAP_FW_WRITES)
+46 -14
tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c
···
 #define NESTED_TEST_MEM1 0xc0001000
 #define NESTED_TEST_MEM2 0xc0002000

-static void l2_guest_code(void)
+static void l2_guest_code(u64 *a, u64 *b)
 {
-	*(volatile uint64_t *)NESTED_TEST_MEM1;
-	*(volatile uint64_t *)NESTED_TEST_MEM1 = 1;
+	READ_ONCE(*a);
+	WRITE_ONCE(*a, 1);
 	GUEST_SYNC(true);
 	GUEST_SYNC(false);

-	*(volatile uint64_t *)NESTED_TEST_MEM2 = 1;
+	WRITE_ONCE(*b, 1);
 	GUEST_SYNC(true);
-	*(volatile uint64_t *)NESTED_TEST_MEM2 = 1;
+	WRITE_ONCE(*b, 1);
 	GUEST_SYNC(true);
 	GUEST_SYNC(false);
···
 	vmcall();
 }

+static void l2_guest_code_ept_enabled(void)
+{
+	l2_guest_code((u64 *)NESTED_TEST_MEM1, (u64 *)NESTED_TEST_MEM2);
+}
+
+static void l2_guest_code_ept_disabled(void)
+{
+	/* Access the same L1 GPAs as l2_guest_code_ept_enabled() */
+	l2_guest_code((u64 *)GUEST_TEST_MEM, (u64 *)GUEST_TEST_MEM);
+}
+
 void l1_guest_code(struct vmx_pages *vmx)
 {
 #define L2_GUEST_STACK_SIZE 64
 	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+	void *l2_rip;

 	GUEST_ASSERT(vmx->vmcs_gpa);
 	GUEST_ASSERT(prepare_for_vmx_operation(vmx));
 	GUEST_ASSERT(load_vmcs(vmx));

-	prepare_vmcs(vmx, l2_guest_code,
-		     &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+	if (vmx->eptp_gpa)
+		l2_rip = l2_guest_code_ept_enabled;
+	else
+		l2_rip = l2_guest_code_ept_disabled;
+
+	prepare_vmcs(vmx, l2_rip, &l2_guest_stack[L2_GUEST_STACK_SIZE]);

 	GUEST_SYNC(false);
 	GUEST_ASSERT(!vmlaunch());
···
 	GUEST_DONE();
 }

-int main(int argc, char *argv[])
+static void test_vmx_dirty_log(bool enable_ept)
 {
 	vm_vaddr_t vmx_pages_gva = 0;
 	struct vmx_pages *vmx;
···
 	struct ucall uc;
 	bool done = false;

-	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
-	TEST_REQUIRE(kvm_cpu_has_ept());
+	pr_info("Nested EPT: %s\n", enable_ept ? "enabled" : "disabled");

 	/* Create VM */
 	vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
···
 	 *
 	 * Note that prepare_eptp should be called only L1's GPA map is done,
 	 * meaning after the last call to virt_map.
+	 *
+	 * When EPT is disabled, the L2 guest code will still access the same L1
+	 * GPAs as the EPT enabled case.
 	 */
-	prepare_eptp(vmx, vm, 0);
-	nested_map_memslot(vmx, vm, 0);
-	nested_map(vmx, vm, NESTED_TEST_MEM1, GUEST_TEST_MEM, 4096);
-	nested_map(vmx, vm, NESTED_TEST_MEM2, GUEST_TEST_MEM, 4096);
+	if (enable_ept) {
+		prepare_eptp(vmx, vm, 0);
+		nested_map_memslot(vmx, vm, 0);
+		nested_map(vmx, vm, NESTED_TEST_MEM1, GUEST_TEST_MEM, 4096);
+		nested_map(vmx, vm, NESTED_TEST_MEM2, GUEST_TEST_MEM, 4096);
+	}

 	bmap = bitmap_zalloc(TEST_MEM_PAGES);
 	host_test_mem = addr_gpa2hva(vm, GUEST_TEST_MEM);
···
 			TEST_FAIL("Unknown ucall %lu", uc.cmd);
 		}
 	}
+}
+
+int main(int argc, char *argv[])
+{
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
+
+	test_vmx_dirty_log(/*enable_ept=*/false);
+
+	if (kvm_cpu_has_ept())
+		test_vmx_dirty_log(/*enable_ept=*/true);
+
+	return 0;
 }
+1 -2
virt/kvm/kvm_main.c
···
 	 * mn_active_invalidate_count (see above) instead of
 	 * mmu_invalidate_in_progress.
 	 */
-	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
-					  hva_range.may_block);
+	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);

 	/*
 	 * If one or more memslots were found and thus zapped, notify arch code
+2 -4
virt/kvm/kvm_mm.h
···
 #ifdef CONFIG_HAVE_KVM_PFNCACHE
 void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 				       unsigned long start,
-				       unsigned long end,
-				       bool may_block);
+				       unsigned long end);
 #else
 static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 						     unsigned long start,
-						     unsigned long end,
-						     bool may_block)
+						     unsigned long end)
 {
 }
 #endif /* HAVE_KVM_PFNCACHE */
+35 -15
virt/kvm/pfncache.c
···
  * MMU notifier 'invalidate_range_start' hook.
  */
 void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
-				       unsigned long end, bool may_block)
+				       unsigned long end)
 {
 	struct gfn_to_pfn_cache *gpc;
···
 	spin_unlock(&kvm->gpc_lock);
 }

+static bool kvm_gpc_is_valid_len(gpa_t gpa, unsigned long uhva,
+				 unsigned long len)
+{
+	unsigned long offset = kvm_is_error_gpa(gpa) ? offset_in_page(uhva) :
+						       offset_in_page(gpa);
+
+	/*
+	 * The cached access must fit within a single page. The 'len' argument
+	 * to activate() and refresh() exists only to enforce that.
+	 */
+	return offset + len <= PAGE_SIZE;
+}
+
 bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(gpc->kvm);
···
 	if (kvm_is_error_hva(gpc->uhva))
 		return false;

-	if (offset_in_page(gpc->uhva) + len > PAGE_SIZE)
+	if (!kvm_gpc_is_valid_len(gpc->gpa, gpc->uhva, len))
 		return false;

 	if (!gpc->valid)
···
 	return -EFAULT;
 }

-static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
-			     unsigned long len)
+static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva)
 {
 	unsigned long page_offset;
 	bool unmap_old = false;
···

 	/* Either gpa or uhva must be valid, but not both */
 	if (WARN_ON_ONCE(kvm_is_error_gpa(gpa) == kvm_is_error_hva(uhva)))
-		return -EINVAL;
-
-	/*
-	 * The cached acces must fit within a single page. The 'len' argument
-	 * exists only to enforce that.
-	 */
-	page_offset = kvm_is_error_gpa(gpa) ? offset_in_page(uhva) :
-					       offset_in_page(gpa);
-	if (page_offset + len > PAGE_SIZE)
 		return -EINVAL;

 	lockdep_assert_held(&gpc->refresh_lock);
···
 	old_uhva = PAGE_ALIGN_DOWN(gpc->uhva);

 	if (kvm_is_error_gpa(gpa)) {
+		page_offset = offset_in_page(uhva);
+
 		gpc->gpa = INVALID_GPA;
 		gpc->memslot = NULL;
 		gpc->uhva = PAGE_ALIGN_DOWN(uhva);
···
 			hva_change = true;
 	} else {
 		struct kvm_memslots *slots = kvm_memslots(gpc->kvm);
+
+		page_offset = offset_in_page(gpa);

 		if (gpc->gpa != gpa || gpc->generation != slots->generation ||
 		    kvm_is_error_hva(gpc->uhva)) {
···

 	guard(mutex)(&gpc->refresh_lock);

+	if (!kvm_gpc_is_valid_len(gpc->gpa, gpc->uhva, len))
+		return -EINVAL;
+
 	/*
 	 * If the GPA is valid then ignore the HVA, as a cache can be GPA-based
 	 * or HVA-based, not both. For GPA-based caches, the HVA will be
···
 	 */
 	uhva = kvm_is_error_gpa(gpc->gpa) ? gpc->uhva : KVM_HVA_ERR_BAD;

-	return __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len);
+	return __kvm_gpc_refresh(gpc, gpc->gpa, uhva);
 }

 void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
···
 			      unsigned long len)
 {
 	struct kvm *kvm = gpc->kvm;
+
+	if (!kvm_gpc_is_valid_len(gpa, uhva, len))
+		return -EINVAL;

 	guard(mutex)(&gpc->refresh_lock);
···
 		gpc->active = true;
 		write_unlock_irq(&gpc->lock);
 	}
-	return __kvm_gpc_refresh(gpc, gpa, uhva, len);
+	return __kvm_gpc_refresh(gpc, gpa, uhva);
 }

 int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len)
 {
+	/*
+	 * Explicitly disallow INVALID_GPA so that the magic value can be used
+	 * by KVM to differentiate between GPA-based and HVA-based caches.
+	 */
+	if (WARN_ON_ONCE(kvm_is_error_gpa(gpa)))
+		return -EINVAL;
+
 	return __kvm_gpc_activate(gpc, gpa, KVM_HVA_ERR_BAD, len);
 }