Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Documentation/virt/kvm/amd-memory-encryption.rst → Documentation/virt/kvm/x86/amd-memory-encryption.rst

+55 -6

Documentation/virt/kvm/api.rst

···
151 151 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
152 152 privileged user (CAP_SYS_ADMIN).
153 153
154 - To use hardware assisted virtualization on MIPS (VZ ASE) rather than
155 - the default trap & emulate implementation (which changes the virtual
156 - memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
157 - flag KVM_VM_MIPS_VZ.
158 -
159 -
160 154 On arm64, the physical address size for a VM (IPA Size limit) is limited
161 155 to 40bits by default. The limit can be configured if the host supports the
162 156 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
···
4075 4081 and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
4076 4082 register.
4077 4083
4084 + .. warning::
4085 +    MSR accesses coming from nested vmentry/vmexit are not filtered.
4086 +    This includes both writes to individual VMCS fields and reads/writes
4087 +    through the MSR lists pointed to by the VMCS.
4088 +
4078 4089 If a bit is within one of the defined ranges, read and write accesses are
4079 4090 guarded by the bitmap's value for the MSR index if the kind of access
4080 4091 is included in the ``struct kvm_msr_filter_range`` flags. If no range
···
5292 5293
5293 5294 KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
5294 5295   Sets the guest physical address of the vcpu_info for a given vCPU.
5296 +   As with the shared_info page for the VM, the corresponding page may be
5297 +   dirtied at any time if event channel interrupt delivery is enabled, so
5298 +   userspace should always assume that the page is dirty without relying
5299 +   on dirty logging.
5295 5300
5296 5301 KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO
5297 5302   Sets the guest physical address of an additional pvclock structure
···
7722 7719 At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting
7723 7720 this capability will disable PMU virtualization for that VM. Usermode
7724 7721 should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
7722 +
7723 + 9. Known KVM API problems
7724 + =========================
7725 +
7726 + In some cases, KVM's API has some inconsistencies or common pitfalls
7727 + that userspace need to be aware of. This section details some of
7728 + these issues.
7729 +
7730 + Most of them are architecture specific, so the section is split by
7731 + architecture.
7732 +
7733 + 9.1. x86
7734 + --------
7735 +
7736 + ``KVM_GET_SUPPORTED_CPUID`` issues
7737 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7738 +
7739 + In general, ``KVM_GET_SUPPORTED_CPUID`` is designed so that it is possible
7740 + to take its result and pass it directly to ``KVM_SET_CPUID2``. This section
7741 + documents some cases in which that requires some care.
7742 +
7743 + Local APIC features
7744 + ~~~~~~~~~~~~~~~~~~~
7745 +
7746 + CPU[EAX=1]:ECX[21] (X2APIC) is reported by ``KVM_GET_SUPPORTED_CPUID``,
7747 + but it can only be enabled if ``KVM_CREATE_IRQCHIP`` or
7748 + ``KVM_ENABLE_CAP(KVM_CAP_IRQCHIP_SPLIT)`` are used to enable in-kernel emulation of
7749 + the local APIC.
7750 +
7751 + The same is true for the ``KVM_FEATURE_PV_UNHALT`` paravirtualized feature.
7752 +
7753 + CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``.
7754 + It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
7755 + has enabled in-kernel emulation of the local APIC.
7756 +
7757 + Obsolete ioctls and capabilities
7758 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7759 +
7760 + KVM_CAP_DISABLE_QUIRKS does not let userspace know which quirks are actually
7761 + available. Use ``KVM_CHECK_EXTENSION(KVM_CAP_DISABLE_QUIRKS2)`` instead if
7762 + available.
7763 +
7764 + Ordering of KVM_GET_*/KVM_SET_* ioctls
7765 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7766 +
7767 + TBD

Documentation/virt/kvm/cpuid.rst → Documentation/virt/kvm/x86/cpuid.rst

Documentation/virt/kvm/halt-polling.rst → Documentation/virt/kvm/x86/halt-polling.rst

Documentation/virt/kvm/hypercalls.rst → Documentation/virt/kvm/x86/hypercalls.rst

+8 -20

Documentation/virt/kvm/index.rst

···
8 8    :maxdepth: 2
9 9
10 10    api
11 -    amd-memory-encryption
12 -    cpuid
13 -    halt-polling
14 -    hypercalls
15 -    locking
16 -    mmu
17 -    msr
18 -    nested-vmx
19 -    ppc-pv
20 -    s390-diag
21 -    s390-pv
22 -    s390-pv-boot
23 -    timekeeping
24 -    vcpu-requests
25 -
26 -    review-checklist
27 -
28 -    arm/index
29 -
30 11    devices/index
31 12
32 -    running-nested-guests
13 +    arm/index
14 +    s390/index
15 +    ppc-pv
16 +    x86/index
17 +
18 +    locking
19 +    vcpu-requests
20 +    review-checklist

+34 -9

Documentation/virt/kvm/locking.rst

··· 210 210 3. Reference 211 211 ------------ 212 212 213 - :Name: kvm_lock 213 + ``kvm_lock`` 214 + ^^^^^^^^^^^^ 215 + 214 216 :Type: mutex 215 217 :Arch: any 216 218 :Protects: - vm_list 217 219 218 - :Name: kvm_count_lock 220 + ``kvm_count_lock`` 221 + ^^^^^^^^^^^^^^^^^^ 222 + 219 223 :Type: raw_spinlock_t 220 224 :Arch: any 221 225 :Protects: - hardware virtualization enable/disable 222 226 :Comment: 'raw' because hardware enabling/disabling must be atomic /wrt 223 227 migration. 224 228 225 - :Name: kvm_arch::tsc_write_lock 226 - :Type: raw_spinlock 229 + ``kvm->mn_invalidate_lock`` 230 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 231 + 232 + :Type: spinlock_t 233 + :Arch: any 234 + :Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait 235 + 236 + ``kvm_arch::tsc_write_lock`` 237 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 238 + 239 + :Type: raw_spinlock_t 227 240 :Arch: x86 228 241 :Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} 229 242 - tsc offset in vmcb 230 243 :Comment: 'raw' because updating the tsc offsets must not be preempted. 231 244 232 - :Name: kvm->mmu_lock 233 - :Type: spinlock_t 245 + ``kvm->mmu_lock`` 246 + ^^^^^^^^^^^^^^^^^ 247 + :Type: spinlock_t or rwlock_t 234 248 :Arch: any 235 249 :Protects: -shadow page/shadow tlb entry 236 250 :Comment: it is a spinlock since it is used in mmu notifier. 237 251 238 - :Name: kvm->srcu 252 + ``kvm->srcu`` 253 + ^^^^^^^^^^^^^ 239 254 :Type: srcu lock 240 255 :Arch: any 241 256 :Protects: - kvm->memslots ··· 261 246 The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu 262 247 if it is needed by multiple functions. 263 248 264 - :Name: blocked_vcpu_on_cpu_lock 249 + ``kvm->slots_arch_lock`` 250 + ^^^^^^^^^^^^^^^^^^^^^^^^ 251 + :Type: mutex 252 + :Arch: any (only needed on x86 though) 253 + :Protects: any arch-specific fields of memslots that have to be modified 254 + in a ``kvm->srcu`` read-side critical section. 
255 + :Comment: must be held before reading the pointer to the current memslots, 256 + until after all changes to the memslots are complete 257 + 258 + ``wakeup_vcpus_on_cpu_lock`` 259 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 265 260 :Type: spinlock_t 266 261 :Arch: x86 267 - :Protects: blocked_vcpu_on_cpu 262 + :Protects: wakeup_vcpus_on_cpu 268 263 :Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts. 269 264 When VT-d posted-interrupts is supported and the VM has assigned 270 265 devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu

Documentation/virt/kvm/mmu.rst → Documentation/virt/kvm/x86/mmu.rst

Documentation/virt/kvm/msr.rst → Documentation/virt/kvm/x86/msr.rst

Documentation/virt/kvm/nested-vmx.rst → Documentation/virt/kvm/x86/nested-vmx.rst

Documentation/virt/kvm/running-nested-guests.rst → Documentation/virt/kvm/x86/running-nested-guests.rst

Documentation/virt/kvm/s390-diag.rst → Documentation/virt/kvm/s390/s390-diag.rst

Documentation/virt/kvm/s390-pv-boot.rst → Documentation/virt/kvm/s390/s390-pv-boot.rst

Documentation/virt/kvm/s390-pv.rst → Documentation/virt/kvm/s390/s390-pv.rst

+12

Documentation/virt/kvm/s390/index.rst

···
1 + .. SPDX-License-Identifier: GPL-2.0
2 +
3 + ====================
4 + KVM for s390 systems
5 + ====================
6 +
7 + .. toctree::
8 +    :maxdepth: 2
9 +
10 +    s390-diag
11 +    s390-pv
12 +    s390-pv-boot

Documentation/virt/kvm/timekeeping.rst → Documentation/virt/kvm/x86/timekeeping.rst

+10

Documentation/virt/kvm/vcpu-requests.rst

···
135 135   such as a pending signal, which does not indicate the VCPU's halt
136 136   emulation should stop, and therefore does not make the request.
137 137
138 + KVM_REQ_OUTSIDE_GUEST_MODE
139 +
140 +   This "request" ensures the target vCPU has exited guest mode prior to the
141 +   sender of the request continuing on. No action needs be taken by the target,
142 +   and so no request is actually logged for the target. This request is similar
143 +   to a "kick", but unlike a kick it guarantees the vCPU has actually exited
144 +   guest mode. A kick only guarantees the vCPU will exit at some point in the
145 +   future, e.g. a previous kick may have started the process, but there's no
146 +   guarantee the to-be-kicked vCPU has fully exited guest mode.
147 +
138 148 KVM_REQUEST_MASK
139 149 ----------------
140 150

+39

Documentation/virt/kvm/x86/errata.rst

···
1 +
2 + =======================================
3 + Known limitations of CPU virtualization
4 + =======================================
5 +
6 + Whenever perfect emulation of a CPU feature is impossible or too hard, KVM
7 + has to choose between not implementing the feature at all or introducing
8 + behavioral differences between virtual machines and bare metal systems.
9 +
10 + This file documents some of the known limitations that KVM has in
11 + virtualizing CPU features.
12 +
13 + x86
14 + ===
15 +
16 + ``KVM_GET_SUPPORTED_CPUID`` issues
17 + ----------------------------------
18 +
19 + x87 features
20 + ~~~~~~~~~~~~
21 +
22 + Unlike most other CPUID feature bits, CPUID[EAX=7,ECX=0]:EBX[6]
23 + (FDP_EXCPTN_ONLY) and CPUID[EAX=7,ECX=0]:EBX]13] (ZERO_FCS_FDS) are
24 + clear if the features are present and set if the features are not present.
25 +
26 + Clearing these bits in CPUID has no effect on the operation of the guest;
27 + if these bits are set on hardware, the features will not be present on
28 + any virtual machine that runs on that hardware.
29 +
30 + **Workaround:** It is recommended to always set these bits in guest CPUID.
31 + Note however that any software (e.g ``WIN87EM.DLL``) expecting these features
32 + to be present likely predates these CPUID feature bits, and therefore
33 + doesn't know to check for them anyway.
34 +
35 + Nested virtualization features
36 + ------------------------------
37 +
38 + TBD
39 +
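The workaround described in the x87 section amounts to OR-ing two bits into the EBX value of CPUID leaf 7, subleaf 0, before the table is handed to the guest. A minimal sketch, assuming the caller already holds that register value; the helper and macro names are hypothetical and are not part of KVM or of this merge:

```c
#include <stdint.h>

/* Bit positions from the errata text: CPUID[EAX=7,ECX=0]:EBX bits 6 and 13. */
#define X87_FDP_EXCPTN_ONLY_BIT 6
#define X87_ZERO_FCS_FDS_BIT    13

/*
 * Hypothetical helper: compute the EBX value to advertise to the guest.
 * Both bits are inverted ("set" means the legacy behaviour is absent), so the
 * recommended workaround is to set them unconditionally.
 */
uint32_t guest_leaf7_ebx(uint32_t host_ebx)
{
    return host_ebx
        | (UINT32_C(1) << X87_FDP_EXCPTN_ONLY_BIT)
        | (UINT32_C(1) << X87_ZERO_FCS_FDS_BIT);
}
```

All other bits pass through unchanged, so the helper can be applied to whatever value userspace derived from KVM_GET_SUPPORTED_CPUID.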

+19

Documentation/virt/kvm/x86/index.rst

···
1 + .. SPDX-License-Identifier: GPL-2.0
2 +
3 + ===================
4 + KVM for x86 systems
5 + ===================
6 +
7 + .. toctree::
8 +    :maxdepth: 2
9 +
10 +    amd-memory-encryption
11 +    cpuid
12 +    errata
13 +    halt-polling
14 +    hypercalls
15 +    mmu
16 +    msr
17 +    nested-vmx
18 +    running-nested-guests
19 +    timekeeping

+1 -1

arch/s390/kvm/kvm-s390.c

···
3462 3462 /* Kick a guest cpu out of SIE to process a request synchronously */
3463 3463 void kvm_s390_sync_request(int req, struct kvm_vcpu *vcpu)
3464 3464 {
3465 -     kvm_make_request(req, vcpu);
3465 +     __kvm_make_request(req, vcpu);
3466 3466     kvm_s390_vcpu_request(vcpu);
3467 3467 }
3468 3468

+31 -15

arch/x86/include/asm/kvm_host.h

··· 249 249 #define PFERR_SGX_BIT 15 250 250 #define PFERR_GUEST_FINAL_BIT 32 251 251 #define PFERR_GUEST_PAGE_BIT 33 252 + #define PFERR_IMPLICIT_ACCESS_BIT 48 252 253 253 254 #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT) 254 255 #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT) ··· 260 259 #define PFERR_SGX_MASK (1U << PFERR_SGX_BIT) 261 260 #define PFERR_GUEST_FINAL_MASK (1ULL << PFERR_GUEST_FINAL_BIT) 262 261 #define PFERR_GUEST_PAGE_MASK (1ULL << PFERR_GUEST_PAGE_BIT) 262 + #define PFERR_IMPLICIT_ACCESS (1ULL << PFERR_IMPLICIT_ACCESS_BIT) 263 263 264 264 #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \ 265 265 PFERR_WRITE_MASK | \ ··· 432 430 void (*inject_page_fault)(struct kvm_vcpu *vcpu, 433 431 struct x86_exception *fault); 434 432 gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 435 - gpa_t gva_or_gpa, u32 access, 433 + gpa_t gva_or_gpa, u64 access, 436 434 struct x86_exception *exception); 437 435 int (*sync_page)(struct kvm_vcpu *vcpu, 438 436 struct kvm_mmu_page *sp); ··· 514 512 u64 global_ctrl_mask; 515 513 u64 global_ovf_ctrl_mask; 516 514 u64 reserved_bits; 515 + u64 raw_event_mask; 517 516 u8 version; 518 517 struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC]; 519 518 struct kvm_pmc fixed_counters[KVM_PMC_MAX_FIXED]; ··· 1043 1040 struct msr_bitmap_range ranges[16]; 1044 1041 }; 1045 1042 1046 - #define APICV_INHIBIT_REASON_DISABLE 0 1047 - #define APICV_INHIBIT_REASON_HYPERV 1 1048 - #define APICV_INHIBIT_REASON_NESTED 2 1049 - #define APICV_INHIBIT_REASON_IRQWIN 3 1050 - #define APICV_INHIBIT_REASON_PIT_REINJ 4 1051 - #define APICV_INHIBIT_REASON_X2APIC 5 1052 - #define APICV_INHIBIT_REASON_BLOCKIRQ 6 1053 - #define APICV_INHIBIT_REASON_ABSENT 7 1043 + enum kvm_apicv_inhibit { 1044 + APICV_INHIBIT_REASON_DISABLE, 1045 + APICV_INHIBIT_REASON_HYPERV, 1046 + APICV_INHIBIT_REASON_NESTED, 1047 + APICV_INHIBIT_REASON_IRQWIN, 1048 + APICV_INHIBIT_REASON_PIT_REINJ, 1049 + APICV_INHIBIT_REASON_X2APIC, 1050 + 
APICV_INHIBIT_REASON_BLOCKIRQ, 1051 + APICV_INHIBIT_REASON_ABSENT, 1052 + }; 1054 1053 1055 1054 struct kvm_arch { 1056 1055 unsigned long n_used_mmu_pages; ··· 1406 1401 void (*enable_nmi_window)(struct kvm_vcpu *vcpu); 1407 1402 void (*enable_irq_window)(struct kvm_vcpu *vcpu); 1408 1403 void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr); 1409 - bool (*check_apicv_inhibit_reasons)(ulong bit); 1404 + bool (*check_apicv_inhibit_reasons)(enum kvm_apicv_inhibit reason); 1410 1405 void (*refresh_apicv_exec_ctrl)(struct kvm_vcpu *vcpu); 1411 1406 void (*hwapic_irr_update)(struct kvm_vcpu *vcpu, int max_irr); 1412 1407 void (*hwapic_isr_update)(struct kvm_vcpu *vcpu, int isr); ··· 1590 1585 1591 1586 void kvm_mmu_destroy(struct kvm_vcpu *vcpu); 1592 1587 int kvm_mmu_create(struct kvm_vcpu *vcpu); 1593 - void kvm_mmu_init_vm(struct kvm *kvm); 1588 + int kvm_mmu_init_vm(struct kvm *kvm); 1594 1589 void kvm_mmu_uninit_vm(struct kvm *kvm); 1595 1590 1596 1591 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu); ··· 1800 1795 1801 1796 bool kvm_apicv_activated(struct kvm *kvm); 1802 1797 void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu); 1803 - void kvm_request_apicv_update(struct kvm *kvm, bool activate, 1804 - unsigned long bit); 1798 + void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm, 1799 + enum kvm_apicv_inhibit reason, bool set); 1800 + void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm, 1801 + enum kvm_apicv_inhibit reason, bool set); 1805 1802 1806 - void __kvm_request_apicv_update(struct kvm *kvm, bool activate, 1807 - unsigned long bit); 1803 + static inline void kvm_set_apicv_inhibit(struct kvm *kvm, 1804 + enum kvm_apicv_inhibit reason) 1805 + { 1806 + kvm_set_or_clear_apicv_inhibit(kvm, reason, true); 1807 + } 1808 + 1809 + static inline void kvm_clear_apicv_inhibit(struct kvm *kvm, 1810 + enum kvm_apicv_inhibit reason) 1811 + { 1812 + kvm_set_or_clear_apicv_inhibit(kvm, reason, false); 1813 + } 1808 1814 1809 1815 int 
kvm_emulate_hypercall(struct kvm_vcpu *vcpu); 1810 1816
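The enum plus the set/clear wrappers can be read as a small bitmask API. The sketch below is a user-space model only: struct toy_vm and every toy_* function are hypothetical, and the assumption that each reason maps to one bit of an unsigned long (with APICv active only while no bit is set) is inferred from the old "unsigned long bit" parameter rather than shown in this diff.

```c
#include <stdbool.h>

/* Same reasons, in the same order, as the enum introduced above. */
enum kvm_apicv_inhibit {
    APICV_INHIBIT_REASON_DISABLE,
    APICV_INHIBIT_REASON_HYPERV,
    APICV_INHIBIT_REASON_NESTED,
    APICV_INHIBIT_REASON_IRQWIN,
    APICV_INHIBIT_REASON_PIT_REINJ,
    APICV_INHIBIT_REASON_X2APIC,
    APICV_INHIBIT_REASON_BLOCKIRQ,
    APICV_INHIBIT_REASON_ABSENT,
};

/* Hypothetical stand-in for the per-VM state; the real field is not in this diff. */
struct toy_vm {
    unsigned long inhibit_reasons;
};

/* Set or clear the bit that corresponds to one inhibit reason. */
void toy_set_or_clear_inhibit(struct toy_vm *vm,
                              enum kvm_apicv_inhibit reason, bool set)
{
    if (set)
        vm->inhibit_reasons |= 1UL << reason;
    else
        vm->inhibit_reasons &= ~(1UL << reason);
}

/* Thin wrappers mirroring kvm_set_apicv_inhibit()/kvm_clear_apicv_inhibit(). */
void toy_set_inhibit(struct toy_vm *vm, enum kvm_apicv_inhibit reason)
{
    toy_set_or_clear_inhibit(vm, reason, true);
}

void toy_clear_inhibit(struct toy_vm *vm, enum kvm_apicv_inhibit reason)
{
    toy_set_or_clear_inhibit(vm, reason, false);
}

/* Assumption: APICv counts as active only while no reason is set. */
bool toy_apicv_activated(const struct toy_vm *vm)
{
    return vm->inhibit_reasons == 0;
}
```

Compared with the old (activate, bit) signature, callers no longer have to invert the boolean in their heads: "set an inhibit" and "clear an inhibit" say directly what happens to the reason.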

+11 -3

arch/x86/include/asm/svm.h

···
221 221 #define SVM_NESTED_CTL_SEV_ES_ENABLE BIT(2)
222 222
223 223
224 + #define SVM_TSC_RATIO_RSVD 0xffffff0000000000ULL
225 + #define SVM_TSC_RATIO_MIN 0x0000000000000001ULL
226 + #define SVM_TSC_RATIO_MAX 0x000000ffffffffffULL
227 + #define SVM_TSC_RATIO_DEFAULT 0x0100000000ULL
228 +
229 +
224 230 /* AVIC */
225 - #define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK (0xFF)
231 + #define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK (0xFFULL)
226 232 #define AVIC_LOGICAL_ID_ENTRY_VALID_BIT 31
227 233 #define AVIC_LOGICAL_ID_ENTRY_VALID_MASK (1 << 31)
228 234
···
236 230 #define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK (0xFFFFFFFFFFULL << 12)
237 231 #define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62)
238 232 #define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK (1ULL << 63)
239 - #define AVIC_PHYSICAL_ID_TABLE_SIZE_MASK (0xFF)
233 + #define AVIC_PHYSICAL_ID_TABLE_SIZE_MASK (0xFFULL)
240 234
241 - #define AVIC_DOORBELL_PHYSICAL_ID_MASK (0xFF)
235 + #define AVIC_DOORBELL_PHYSICAL_ID_MASK GENMASK_ULL(11, 0)
236 +
237 + #define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL
242 238
243 239 #define AVIC_UNACCEL_ACCESS_WRITE_MASK 1
244 240 #define AVIC_UNACCEL_ACCESS_OFFSET_MASK 0xFF0

+1 -1

arch/x86/kernel/kvm.c

···
517 517         } else if (apic_id < min && max - apic_id < KVM_IPI_CLUSTER_SIZE) {
518 518             ipi_bitmap <<= min - apic_id;
519 519             min = apic_id;
520 -         } else if (apic_id < min + KVM_IPI_CLUSTER_SIZE) {
520 +         } else if (apic_id > min && apic_id < min + KVM_IPI_CLUSTER_SIZE) {
521 521             max = apic_id < max ? max : apic_id;
522 522         } else {
523 523             ret = kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
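To see why the extra "apic_id > min" test matters, the branch selection can be pulled out into a pure function and compared before and after the change. This is a sketch, not the kernel code: the enum, the classify_* names and the cluster size of 128 (assumed here to be 2 * BITS_PER_LONG on a 64-bit build) are illustrative assumptions.

```c
/* Assumed value; the real macro is not defined in this diff. */
#define IPI_CLUSTER_SIZE 128

enum ipi_action {
    IPI_EXTEND_DOWN, /* shift the bitmap and lower min */
    IPI_EXTEND_UP,   /* keep min, possibly raise max */
    IPI_FLUSH,       /* issue the hypercall and start a new cluster */
};

/* Branch selection before the fix. */
enum ipi_action classify_old(int apic_id, int min, int max)
{
    if (apic_id < min && max - apic_id < IPI_CLUSTER_SIZE)
        return IPI_EXTEND_DOWN;
    if (apic_id < min + IPI_CLUSTER_SIZE)
        return IPI_EXTEND_UP;
    return IPI_FLUSH;
}

/* Branch selection after the fix: the second branch also requires apic_id > min. */
enum ipi_action classify_new(int apic_id, int min, int max)
{
    if (apic_id < min && max - apic_id < IPI_CLUSTER_SIZE)
        return IPI_EXTEND_DOWN;
    if (apic_id > min && apic_id < min + IPI_CLUSTER_SIZE)
        return IPI_EXTEND_UP;
    return IPI_FLUSH;
}
```

With min = 200 and max = 210, an APIC ID of 10 is too far below min to extend the cluster downwards, yet the old second branch still accepted it because 10 is less than min + 128. The new condition sends that case to the flush path instead.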

+1

arch/x86/kvm/cpuid.c

···
735 735         if (function > READ_ONCE(max_cpuid_80000000))
736 736             return entry;
737 737         }
738 +         break;
738 739
739 740     default:
740 741         break;

+5 -3

arch/x86/kvm/emulate.c

···
3540 3540 {
3541 3541     u64 tsc_aux = 0;
3542 3542
3543 -     if (ctxt->ops->get_msr(ctxt, MSR_TSC_AUX, &tsc_aux))
3543 +     if (!ctxt->ops->guest_has_rdpid(ctxt))
3544 3544         return emulate_ud(ctxt);
3545 +
3546 +     ctxt->ops->get_msr(ctxt, MSR_TSC_AUX, &tsc_aux);
3545 3547     ctxt->dst.val = tsc_aux;
3546 3548     return X86EMUL_CONTINUE;
3547 3549 }
···
3644 3642
3645 3643     msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
3646 3644         | ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
3647 -     r = ctxt->ops->set_msr(ctxt, msr_index, msr_data);
3645 +     r = ctxt->ops->set_msr_with_filter(ctxt, msr_index, msr_data);
3648 3646
3649 3647     if (r == X86EMUL_IO_NEEDED)
3650 3648         return r;
···
3661 3659     u64 msr_data;
3662 3660     int r;
3663 3661
3664 -     r = ctxt->ops->get_msr(ctxt, msr_index, &msr_data);
3662 +     r = ctxt->ops->get_msr_with_filter(ctxt, msr_index, &msr_data);
3665 3663
3666 3664     if (r == X86EMUL_IO_NEEDED)
3667 3665         return r;

+16 -6

arch/x86/kvm/hyperv.c

··· 122 122 else 123 123 hv->synic_auto_eoi_used--; 124 124 125 - __kvm_request_apicv_update(vcpu->kvm, 126 - !hv->synic_auto_eoi_used, 127 - APICV_INHIBIT_REASON_HYPERV); 125 + /* 126 + * Inhibit APICv if any vCPU is using SynIC's AutoEOI, which relies on 127 + * the hypervisor to manually inject IRQs. 128 + */ 129 + __kvm_set_or_clear_apicv_inhibit(vcpu->kvm, 130 + APICV_INHIBIT_REASON_HYPERV, 131 + !!hv->synic_auto_eoi_used); 128 132 129 133 up_write(&vcpu->kvm->arch.apicv_update_lock); 130 134 } ··· 243 239 struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic); 244 240 int ret; 245 241 246 - if (!synic->active && !host) 242 + if (!synic->active && (!host || data)) 247 243 return 1; 248 244 249 245 trace_kvm_hv_synic_set_msr(vcpu->vcpu_id, msr, data, host); ··· 288 284 break; 289 285 case HV_X64_MSR_EOM: { 290 286 int i; 287 + 288 + if (!synic->active) 289 + break; 291 290 292 291 for (i = 0; i < ARRAY_SIZE(synic->sint); i++) 293 292 kvm_hv_notify_acked_sint(vcpu, i); ··· 455 448 struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic); 456 449 struct kvm_lapic_irq irq; 457 450 int ret, vector; 451 + 452 + if (KVM_BUG_ON(!lapic_in_kernel(vcpu), vcpu->kvm)) 453 + return -EINVAL; 458 454 459 455 if (sint >= ARRAY_SIZE(synic->sint)) 460 456 return -EINVAL; ··· 671 661 struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu); 672 662 struct kvm_vcpu_hv_synic *synic = to_hv_synic(vcpu); 673 663 674 - if (!synic->active && !host) 664 + if (!synic->active && (!host || config)) 675 665 return 1; 676 666 677 667 if (unlikely(!host && hv_vcpu->enforce_cpuid && new_config.direct_mode && ··· 700 690 struct kvm_vcpu *vcpu = hv_stimer_to_vcpu(stimer); 701 691 struct kvm_vcpu_hv_synic *synic = to_hv_synic(vcpu); 702 692 703 - if (!synic->active && !host) 693 + if (!synic->active && (!host || count)) 704 694 return 1; 705 695 706 696 trace_kvm_hv_stimer_set_count(hv_stimer_to_vcpu(stimer)->vcpu_id,
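The three "(!host || ...)" changes in this file share one predicate: while SynIC is inactive, a guest write is still rejected, and a host-initiated write is now accepted only when the value is zero. A small sketch of the before and after conditions; the function names are hypothetical and "rejected" stands for the "return 1" in the hunks above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old condition: any host-initiated write passed while SynIC was inactive. */
bool synic_write_rejected_old(bool active, bool host, uint64_t data)
{
    (void)data; /* the value was not part of the old check */
    return !active && !host;
}

/* New condition: with SynIC inactive, only a host write of zero passes. */
bool synic_write_rejected_new(bool active, bool host, uint64_t data)
{
    return !active && (!host || data != 0);
}
```

Once SynIC is active the predicate is false for every combination, so the change only affects writes that arrive while it is inactive.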

+2 -4

arch/x86/kvm/i8254.c

···
305 305      * So, deactivate APICv when PIT is in reinject mode.
306 306      */
307 307     if (reinject) {
308 -         kvm_request_apicv_update(kvm, false,
309 -                 APICV_INHIBIT_REASON_PIT_REINJ);
308 +         kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
310 309         /* The initial state is preserved while ps->reinject == 0. */
311 310         kvm_pit_reset_reinject(pit);
312 311         kvm_register_irq_ack_notifier(kvm, &ps->irq_ack_notifier);
313 312         kvm_register_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
314 313     } else {
315 -         kvm_request_apicv_update(kvm, true,
316 -                 APICV_INHIBIT_REASON_PIT_REINJ);
314 +         kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
317 315         kvm_unregister_irq_ack_notifier(kvm, &ps->irq_ack_notifier);
318 316         kvm_unregister_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
319 317     }

+3

arch/x86/kvm/kvm_emulate.h

···
210 210     int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value);
211 211     u64 (*get_smbase)(struct x86_emulate_ctxt *ctxt);
212 212     void (*set_smbase)(struct x86_emulate_ctxt *ctxt, u64 smbase);
213 +     int (*set_msr_with_filter)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data);
214 +     int (*get_msr_with_filter)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata);
213 215     int (*set_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data);
214 216     int (*get_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata);
215 217     int (*check_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc);
···
228 226     bool (*guest_has_long_mode)(struct x86_emulate_ctxt *ctxt);
229 227     bool (*guest_has_movbe)(struct x86_emulate_ctxt *ctxt);
230 228     bool (*guest_has_fxsr)(struct x86_emulate_ctxt *ctxt);
229 +     bool (*guest_has_rdpid)(struct x86_emulate_ctxt *ctxt);
231 230
232 231     void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked);
233 232

+4

arch/x86/kvm/lapic.c

···
1024 1024     *r = -1;
1025 1025
1026 1026     if (irq->shorthand == APIC_DEST_SELF) {
1027 +         if (KVM_BUG_ON(!src, kvm)) {
1028 +             *r = 0;
1029 +             return true;
1030 +         }
1027 1031         *r = kvm_apic_set_irq(src->vcpu, irq, dest_map);
1028 1032         return true;
1029 1033     }

+16 -16

arch/x86/kvm/mmu.h

··· 214 214 */ 215 215 static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 216 216 unsigned pte_access, unsigned pte_pkey, 217 - unsigned pfec) 217 + u64 access) 218 218 { 219 - int cpl = static_call(kvm_x86_get_cpl)(vcpu); 219 + /* strip nested paging fault error codes */ 220 + unsigned int pfec = access; 220 221 unsigned long rflags = static_call(kvm_x86_get_rflags)(vcpu); 221 222 222 223 /* 223 - * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1. 224 + * For explicit supervisor accesses, SMAP is disabled if EFLAGS.AC = 1. 225 + * For implicit supervisor accesses, SMAP cannot be overridden. 224 226 * 225 - * If CPL = 3, SMAP applies to all supervisor-mode data accesses 226 - * (these are implicit supervisor accesses) regardless of the value 227 - * of EFLAGS.AC. 227 + * SMAP works on supervisor accesses only, and not_smap can 228 + * be set or not set when user access with neither has any bearing 229 + * on the result. 228 230 * 229 - * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving 230 - * the result in X86_EFLAGS_AC. We then insert it in place of 231 - * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec, 232 - * but it will be one in index if SMAP checks are being overridden. 233 - * It is important to keep this branchless. 231 + * We put the SMAP checking bit in place of the PFERR_RSVD_MASK bit; 232 + * this bit will always be zero in pfec, but it will be one in index 233 + * if SMAP checks are being disabled. 
234 234 */ 235 - unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); 236 - int index = (pfec >> 1) + 237 - (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)); 235 + u64 implicit_access = access & PFERR_IMPLICIT_ACCESS; 236 + bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC; 237 + int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1; 238 238 bool fault = (mmu->permissions[index] >> pte_access) & 1; 239 239 u32 errcode = PFERR_PRESENT_MASK; 240 240 ··· 317 317 atomic64_add(count, &kvm->stat.pages[level - 1]); 318 318 } 319 319 320 - gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, 320 + gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access, 321 321 struct x86_exception *exception); 322 322 323 323 static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu, 324 324 struct kvm_mmu *mmu, 325 - gpa_t gpa, u32 access, 325 + gpa_t gpa, u64 access, 326 326 struct x86_exception *exception) 327 327 { 328 328 if (mmu != &vcpu->arch.nested_mmu)
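The new index computation in permission_fault() can be exercised on its own. The sketch below copies the three new lines into a standalone function; the function name is hypothetical, and the constants for EFLAGS.AC (bit 18), the reserved-bit position (3) and the write bit (1) are the architectural x86 values, filled in here because this diff does not show them.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_AC          (1UL << 18)   /* architectural: EFLAGS.AC is bit 18 */
#define PFERR_WRITE_MASK       (1U << 1)     /* architectural: W/R is bit 1 */
#define PFERR_RSVD_BIT         3             /* architectural: RSVD is bit 3 */
#define PFERR_IMPLICIT_ACCESS  (1ULL << 48)  /* from the kvm_host.h hunk in this merge */

/*
 * Hypothetical standalone copy of the index computation. The low 32 bits of
 * "access" are the page fault error code; bit 48 marks an implicit supervisor
 * access. SMAP is overridden only when EFLAGS.AC is set AND the access is not
 * implicit, and that fact is folded into the (otherwise zero) RSVD position.
 */
unsigned int smap_permission_index(uint64_t access, unsigned long rflags)
{
    unsigned int pfec = (unsigned int)access; /* drops bits 32 and above */
    uint64_t implicit_access = access & PFERR_IMPLICIT_ACCESS;
    bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;

    return (pfec + ((unsigned int)not_smap << PFERR_RSVD_BIT)) >> 1;
}
```

For a supervisor write with EFLAGS.AC set the index becomes (2 + 8) >> 1 = 5; the same write flagged as implicit, or issued with AC clear, stays at index 1, so the SMAP check is not bypassed.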

+17 -10

arch/x86/kvm/mmu/mmu.c

··· 2696 2696 if (*sptep == spte) { 2697 2697 ret = RET_PF_SPURIOUS; 2698 2698 } else { 2699 - trace_kvm_mmu_set_spte(level, gfn, sptep); 2700 2699 flush |= mmu_spte_update(sptep, spte); 2700 + trace_kvm_mmu_set_spte(level, gfn, sptep); 2701 2701 } 2702 2702 2703 2703 if (wrprot) { ··· 3703 3703 } 3704 3704 3705 3705 static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 3706 - gpa_t vaddr, u32 access, 3706 + gpa_t vaddr, u64 access, 3707 3707 struct x86_exception *exception) 3708 3708 { 3709 3709 if (exception) ··· 4591 4591 * - X86_CR4_SMAP is set in CR4 4592 4592 * - A user page is accessed 4593 4593 * - The access is not a fetch 4594 - * - Page fault in kernel mode 4595 - * - if CPL = 3 or X86_EFLAGS_AC is clear 4594 + * - The access is supervisor mode 4595 + * - If implicit supervisor access or X86_EFLAGS_AC is clear 4596 4596 * 4597 - * Here, we cover the first three conditions. 4598 - * The fourth is computed dynamically in permission_fault(); 4597 + * Here, we cover the first four conditions. 4598 + * The fifth is computed dynamically in permission_fault(); 4599 4599 * PFERR_RSVD_MASK bit will be set in PFEC if the access is 4600 4600 * *not* subject to SMAP restrictions. 
4601 4601 */ ··· 5768 5768 kvm_mmu_zap_all_fast(kvm); 5769 5769 } 5770 5770 5771 - void kvm_mmu_init_vm(struct kvm *kvm) 5771 + int kvm_mmu_init_vm(struct kvm *kvm) 5772 5772 { 5773 5773 struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker; 5774 + int r; 5774 5775 5776 + INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); 5777 + INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages); 5778 + INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages); 5775 5779 spin_lock_init(&kvm->arch.mmu_unsync_pages_lock); 5776 5780 5777 - kvm_mmu_init_tdp_mmu(kvm); 5781 + r = kvm_mmu_init_tdp_mmu(kvm); 5782 + if (r < 0) 5783 + return r; 5778 5784 5779 5785 node->track_write = kvm_mmu_pte_write; 5780 5786 node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot; 5781 5787 kvm_page_track_register_notifier(kvm, node); 5788 + return 0; 5782 5789 } 5783 5790 5784 5791 void kvm_mmu_uninit_vm(struct kvm *kvm) ··· 5849 5842 5850 5843 if (is_tdp_mmu_enabled(kvm)) { 5851 5844 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) 5852 - flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start, 5853 - gfn_end, flush); 5845 + flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, 5846 + gfn_end, true, flush); 5854 5847 } 5855 5848 5856 5849 if (flush)

+36 -42

arch/x86/kvm/mmu/paging_tmpl.h

··· 34 34 #define PT_HAVE_ACCESSED_DIRTY(mmu) true 35 35 #ifdef CONFIG_X86_64 36 36 #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL 37 - #define CMPXCHG cmpxchg 37 + #define CMPXCHG "cmpxchgq" 38 38 #else 39 - #define CMPXCHG cmpxchg64 40 39 #define PT_MAX_FULL_LEVELS 2 41 40 #endif 42 41 #elif PTTYPE == 32 ··· 51 52 #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT 52 53 #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT 53 54 #define PT_HAVE_ACCESSED_DIRTY(mmu) true 54 - #define CMPXCHG cmpxchg 55 + #define CMPXCHG "cmpxchgl" 55 56 #elif PTTYPE == PTTYPE_EPT 56 57 #define pt_element_t u64 57 58 #define guest_walker guest_walkerEPT ··· 64 65 #define PT_GUEST_DIRTY_SHIFT 9 65 66 #define PT_GUEST_ACCESSED_SHIFT 8 66 67 #define PT_HAVE_ACCESSED_DIRTY(mmu) ((mmu)->ept_ad) 67 - #define CMPXCHG cmpxchg64 68 + #ifdef CONFIG_X86_64 69 + #define CMPXCHG "cmpxchgq" 70 + #endif 68 71 #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL 69 72 #else 70 73 #error Invalid PTTYPE value ··· 148 147 pt_element_t __user *ptep_user, unsigned index, 149 148 pt_element_t orig_pte, pt_element_t new_pte) 150 149 { 151 - int npages; 152 - pt_element_t ret; 153 - pt_element_t *table; 154 - struct page *page; 150 + signed char r; 155 151 156 - npages = get_user_pages_fast((unsigned long)ptep_user, 1, FOLL_WRITE, &page); 157 - if (likely(npages == 1)) { 158 - table = kmap_atomic(page); 159 - ret = CMPXCHG(&table[index], orig_pte, new_pte); 160 - kunmap_atomic(table); 152 + if (!user_access_begin(ptep_user, sizeof(pt_element_t))) 153 + return -EFAULT; 161 154 162 - kvm_release_page_dirty(page); 163 - } else { 164 - struct vm_area_struct *vma; 165 - unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK; 166 - unsigned long pfn; 167 - unsigned long paddr; 155 + #ifdef CMPXCHG 156 + asm volatile("1:" LOCK_PREFIX CMPXCHG " %[new], %[ptr]\n" 157 + "setnz %b[r]\n" 158 + "2:" 159 + _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %k[r]) 160 + : [ptr] "+m" (*ptep_user), 161 + [old] "+a" (orig_pte), 162 + [r] 
"=q" (r) 163 + : [new] "r" (new_pte) 164 + : "memory"); 165 + #else 166 + asm volatile("1:" LOCK_PREFIX "cmpxchg8b %[ptr]\n" 167 + "setnz %b[r]\n" 168 + "2:" 169 + _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %k[r]) 170 + : [ptr] "+m" (*ptep_user), 171 + [old] "+A" (orig_pte), 172 + [r] "=q" (r) 173 + : [new_lo] "b" ((u32)new_pte), 174 + [new_hi] "c" ((u32)(new_pte >> 32)) 175 + : "memory"); 176 + #endif 168 177 169 - mmap_read_lock(current->mm); 170 - vma = find_vma_intersection(current->mm, vaddr, vaddr + PAGE_SIZE); 171 - if (!vma || !(vma->vm_flags & VM_PFNMAP)) { 172 - mmap_read_unlock(current->mm); 173 - return -EFAULT; 174 - } 175 - pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; 176 - paddr = pfn << PAGE_SHIFT; 177 - table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB); 178 - if (!table) { 179 - mmap_read_unlock(current->mm); 180 - return -EFAULT; 181 - } 182 - ret = CMPXCHG(&table[index], orig_pte, new_pte); 183 - memunmap(table); 184 - mmap_read_unlock(current->mm); 185 - } 186 - 187 - return (ret != orig_pte); 178 + user_access_end(); 179 + return r; 188 180 } 189 181 190 182 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, ··· 333 339 */ 334 340 static int FNAME(walk_addr_generic)(struct guest_walker *walker, 335 341 struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 336 - gpa_t addr, u32 access) 342 + gpa_t addr, u64 access) 337 343 { 338 344 int ret; 339 345 pt_element_t pte; ··· 341 347 gfn_t table_gfn; 342 348 u64 pt_access, pte_access; 343 349 unsigned index, accessed_dirty, pte_pkey; 344 - unsigned nested_access; 350 + u64 nested_access; 345 351 gpa_t pte_gpa; 346 352 bool have_ad; 347 353 int offset; ··· 534 540 } 535 541 536 542 static int FNAME(walk_addr)(struct guest_walker *walker, 537 - struct kvm_vcpu *vcpu, gpa_t addr, u32 access) 543 + struct kvm_vcpu *vcpu, gpa_t addr, u64 access) 538 544 { 539 545 return FNAME(walk_addr_generic)(walker, vcpu, vcpu->arch.mmu, addr, 540 546 access); ··· 982 988 983 989 /* Note, 
@addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */ 984 990 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 985 - gpa_t addr, u32 access, 991 + gpa_t addr, u64 access, 986 992 struct x86_exception *exception) 987 993 { 988 994 struct guest_walker walker;
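The rewritten FNAME(cmpxchg_gpte) keeps the same return convention as before: 0 when the guest PTE still held orig_pte and was swapped, nonzero when the compare failed, and -EFAULT when the user access faults. A user-space analogue using C11 atomics, offered only to illustrate the 0/1 part of that convention; toy_cmpxchg_gpte is a hypothetical name, and the fault path has no equivalent here.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * User-space analogue, not the kernel code: returns 0 if *ptep held orig_pte
 * and was replaced by new_pte, 1 if the compare failed. This mirrors the
 * "setnz" in the inline asm above, which records a failed compare.
 */
int toy_cmpxchg_gpte(_Atomic uint64_t *ptep, uint64_t orig_pte, uint64_t new_pte)
{
    return !atomic_compare_exchange_strong(ptep, &orig_pte, new_pte);
}
```

A nonzero result tells the page-table walker that the guest PTE changed underneath it, which is why the caller treats it as a reason to retry rather than as an error.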

+26 -46

arch/x86/kvm/mmu/tdp_mmu.c

··· 14 14
 module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);

 /* Initializes the TDP MMU for the VM, if enabled. */
-bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
+int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 {
+	struct workqueue_struct *wq;
+
 	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
-		return false;
+		return 0;
+
+	wq = alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
+	if (!wq)
+		return -ENOMEM;

 	/* This should not be changed for the lifetime of the VM. */
 	kvm->arch.tdp_mmu_enabled = true;
-
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
 	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
-	kvm->arch.tdp_mmu_zap_wq =
-		alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
-
-	return true;
+	kvm->arch.tdp_mmu_zap_wq = wq;
+	return 1;
 }

 /* Arbitrarily returns true so that this may be used in if statements. */
··· 909 906
 }

 /*
- * Tears down the mappings for the range of gfns, [start, end), and frees the
- * non-root pages mapping GFNs strictly within that range. Returns true if
- * SPTEs have been cleared and a TLB flush is needed before releasing the
- * MMU lock.
+ * Zap leafs SPTEs for the range of gfns, [start, end). Returns true if SPTEs
+ * have been cleared and a TLB flush is needed before releasing the MMU lock.
  *
  * If can_yield is true, will release the MMU lock and reschedule if the
  * scheduler needs the CPU or there is contention on the MMU lock. If this
··· 918 917
  * the caller must ensure it does not supply too large a GFN range, or the
  * operation can cause a soft lockup.
  */
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end, bool can_yield, bool flush)
+static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+			      gfn_t start, gfn_t end, bool can_yield, bool flush)
 {
-	bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
 	struct tdp_iter iter;
-
-	/*
-	 * No need to try to step down in the iterator when zapping all SPTEs,
-	 * zapping the top-level non-leaf SPTEs will recurse on their children.
-	 */
-	int min_level = zap_all ? root->role.level : PG_LEVEL_4K;

 	end = min(end, tdp_mmu_max_gfn_host());

··· 929 935

 	rcu_read_lock();

-	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
 			flush = false;
 			continue;
 		}

-		if (!is_shadow_present_pte(iter.old_spte))
-			continue;
-
-		/*
-		 * If this is a non-last-level SPTE that covers a larger range
-		 * than should be zapped, continue, and zap the mappings at a
-		 * lower level, except when zapping all SPTEs.
-		 */
-		if (!zap_all &&
-		    (iter.gfn < start ||
-		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;

··· 944 960
 		flush = true;
 	}

-	/*
-	 * Need to flush before releasing RCU. TODO: do it only if intermediate
-	 * page tables were zapped; there is no need to flush under RCU protection
-	 * if no 'struct kvm_mmu_page' is freed.
-	 */
-	if (flush)
-		kvm_flush_remote_tlbs_with_address(kvm, start, end - start);
-
 	rcu_read_unlock();

-	return false;
+	/*
+	 * Because this flow zaps _only_ leaf SPTEs, the caller doesn't need
+	 * to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
+	 */
+	return flush;
 }

 /*
··· 959 979
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
  */
-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush)
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
+			   bool can_yield, bool flush)
 {
 	struct kvm_mmu_page *root;

 	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
-		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
+		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);

 	return flush;
 }
··· 1213 1233
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush)
 {
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, range->slot->as_id, range->start,
-					   range->end, range->may_block, flush);
+	return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
+				     range->end, range->may_block, flush);
 }

 typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
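
With the return type changed from bool to int, kvm_mmu_init_tdp_mmu() has three outcomes: a negative errno when the workqueue cannot be allocated, 0 when the TDP MMU is disabled, and 1 when it was enabled. The following is a minimal userspace sketch of how a caller might consume that convention. fake_init_tdp_mmu() and fake_mmu_init_vm() are hypothetical stand-ins written for illustration, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for kvm_mmu_init_tdp_mmu(): <0 error, 0 disabled, 1 enabled. */
static int fake_init_tdp_mmu(bool tdp_enabled, bool alloc_ok, bool *enabled_out)
{
	assert(enabled_out != NULL);

	if (!tdp_enabled)
		return 0;
	if (!alloc_ok)
		return -ENOMEM;	/* allocation failed before any state was touched */

	*enabled_out = true;
	return 1;
}

/* Hypothetical caller: "disabled" and "enabled" both count as success, errors propagate. */
static int fake_mmu_init_vm(bool tdp_enabled, bool alloc_ok, bool *enabled_out)
{
	int r = fake_init_tdp_mmu(tdp_enabled, alloc_ok, enabled_out);

	return r < 0 ? r : 0;
}
```

Allocating the workqueue before any VM state is modified, as the diff does, means the error path has nothing to undo.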

+3 -9

arch/x86/kvm/mmu/tdp_mmu.h

··· 15 15
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared);

-bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
 				 gfn_t end, bool can_yield, bool flush);
-static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
-					     gfn_t start, gfn_t end, bool flush)
-{
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
-}
-
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
··· 66 72
 					u64 *spte);

 #ifdef CONFIG_X86_64
-bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);
+int kvm_mmu_init_tdp_mmu(struct kvm *kvm);
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }

··· 87 93
 	return sp && is_tdp_mmu_page(sp) && sp->root_count;
 }
 #else
-static inline bool kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return false; }
+static inline int kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return 0; }
 static inline void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) {}
 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
 static inline bool is_tdp_mmu(struct kvm_mmu *mmu) { return false; }

+7 -11

arch/x86/kvm/pmu.c

··· 96 96

 static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 				  u64 config, bool exclude_user,
-				  bool exclude_kernel, bool intr,
-				  bool in_tx, bool in_tx_cp)
+				  bool exclude_kernel, bool intr)
 {
 	struct perf_event *event;
 	struct perf_event_attr attr = {
··· 115 116

 	attr.sample_period = get_sample_period(pmc, pmc->counter);

-	if (in_tx)
-		attr.config |= HSW_IN_TX;
-	if (in_tx_cp) {
+	if ((attr.config & HSW_IN_TX_CHECKPOINTED) &&
+	    guest_cpuid_is_intel(pmc->vcpu)) {
 		/*
 		 * HSW_IN_TX_CHECKPOINTED is not supported with nonzero
 		 * period. Just clear the sample period so at least
 		 * allocating the counter doesn't fail.
 		 */
 		attr.sample_period = 0;
-		attr.config |= HSW_IN_TX_CHECKPOINTED;
 	}

 	event = perf_event_create_kernel_counter(&attr, -1, current,
··· 182 185
 	u32 type = PERF_TYPE_RAW;
 	struct kvm *kvm = pmc->vcpu->kvm;
 	struct kvm_pmu_event_filter *filter;
+	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
 	bool allow_event = true;

 	if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
··· 219 221
 	}

 	if (type == PERF_TYPE_RAW)
-		config = eventsel & AMD64_RAW_EVENT_MASK;
+		config = eventsel & pmu->raw_event_mask;

 	if (pmc->current_config == eventsel && pmc_resume_counter(pmc))
 		return;
··· 230 232
 	pmc_reprogram_counter(pmc, type, config,
 			      !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
 			      !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
-			      eventsel & ARCH_PERFMON_EVENTSEL_INT,
-			      (eventsel & HSW_IN_TX),
-			      (eventsel & HSW_IN_TX_CHECKPOINTED));
+			      eventsel & ARCH_PERFMON_EVENTSEL_INT);
 }
 EXPORT_SYMBOL_GPL(reprogram_gp_counter);

··· 266 270
 			      kvm_x86_ops.pmu_ops->pmc_perf_hw_id(pmc),
 			      !(en_field & 0x2), /* exclude user */
 			      !(en_field & 0x1), /* exclude kernel */
-			      pmi, false, false);
+			      pmi);
 }
 EXPORT_SYMBOL_GPL(reprogram_fixed_counter);
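
The pmu.c change replaces the fixed AMD64_RAW_EVENT_MASK with a per-vCPU raw_event_mask, so the TSX bits reach the perf config only when the vendor code put them in the mask. The sketch below models that masking in userspace. The bit layouts are assumed to mirror the kernel's perf_event.h and should be verified before reuse. raw_config() and sample_period() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layouts, believed to match the kernel's perf_event.h. */
#define EVENTSEL_EVENT		0x000000FFULL
#define EVENTSEL_UMASK		0x0000FF00ULL
#define EVENTSEL_EDGE		(1ULL << 18)
#define EVENTSEL_INV		(1ULL << 23)
#define EVENTSEL_CMASK		0xFF000000ULL
#define HSW_IN_TX		(1ULL << 32)
#define HSW_IN_TX_CHECKPOINTED	(1ULL << 33)

#define X86_RAW_EVENT_MASK \
	(EVENTSEL_EVENT | EVENTSEL_UMASK | EVENTSEL_EDGE | \
	 EVENTSEL_INV | EVENTSEL_CMASK)
/* AMD extends the event select field into bits 35:32. */
#define AMD64_RAW_EVENT_MASK	(X86_RAW_EVENT_MASK | (0x0FULL << 32))

/* Hypothetical helper: perf config derived from a guest event selector. */
static uint64_t raw_config(uint64_t eventsel, uint64_t raw_event_mask)
{
	return eventsel & raw_event_mask;
}

/* Hypothetical helper: models the "clear the sample period" special case. */
static uint64_t sample_period(uint64_t config, bool guest_is_intel,
			      uint64_t period)
{
	assert(period != 0);

	if ((config & HSW_IN_TX_CHECKPOINTED) && guest_is_intel)
		return 0;
	return period;
}
```

Under these assumptions bit 33 is part of the event select on AMD, which is a plausible reason for the guest_cpuid_is_intel() check added in the diff.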

+10 -4

arch/x86/kvm/svm/avic.c

··· 726 726
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_routing_table *irq_rt;
-	int idx, ret = -EINVAL;
+	int idx, ret = 0;

 	if (!kvm_arch_has_assigned_device(kvm) ||
 	    !irq_remapping_cap(IRQ_POSTING_CAP))
··· 737 737

 	idx = srcu_read_lock(&kvm->irq_srcu);
 	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
-	WARN_ON(guest_irq >= irq_rt->nr_rt_entries);
+
+	if (guest_irq >= irq_rt->nr_rt_entries ||
+	    hlist_empty(&irq_rt->map[guest_irq])) {
+		pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
+			     guest_irq, irq_rt->nr_rt_entries);
+		goto out;
+	}

 	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
 		struct vcpu_data vcpu_info;
··· 828 822
 	return ret;
 }

-bool avic_check_apicv_inhibit_reasons(ulong bit)
+bool avic_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
 {
 	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
 			  BIT(APICV_INHIBIT_REASON_ABSENT) |
··· 839 833
 			  BIT(APICV_INHIBIT_REASON_X2APIC) |
 			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ);

-	return supported & BIT(bit);
+	return supported & BIT(reason);
 }

+4 -5

arch/x86/kvm/svm/pmu.c

··· 262 262
 	/* MSR_EVNTSELn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
 	if (pmc) {
-		if (data == pmc->eventsel)
-			return 0;
-		if (!(data & pmu->reserved_bits)) {
+		data &= ~pmu->reserved_bits;
+		if (data != pmc->eventsel)
 			reprogram_gp_counter(pmc, data);
-			return 0;
-		}
+		return 0;
 	}

 	return 1;
··· 282 284

 	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
 	pmu->reserved_bits = 0xfffffff000280000ull;
+	pmu->raw_event_mask = AMD64_RAW_EVENT_MASK;
 	pmu->version = 1;
 	/* not applicable to AMD; but clean them to prevent any fall out */
 	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
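
The AMD event-select write path above no longer fails a write that sets reserved bits. It clears those bits and reprograms the counter only if the remaining value differs from the current selector. A minimal userspace model of that behaviour follows. struct fake_pmc and fake_evntsel_write() are hypothetical, and the assumption that reprogramming stores the new selector is part of the model rather than something this diff shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reserved-bit mask as set for AMD in the diff above. */
#define AMD_EVNTSEL_RESERVED	0xfffffff000280000ULL

/* Hypothetical counter model: the current selector and a reprogram count. */
struct fake_pmc {
	uint64_t eventsel;
	unsigned int reprograms;
};

/* Hypothetical model of the new MSR_EVNTSELn write path; always returns 0. */
static int fake_evntsel_write(struct fake_pmc *pmc, uint64_t data)
{
	assert(pmc != NULL);

	data &= ~AMD_EVNTSEL_RESERVED;	/* drop reserved bits instead of failing */
	if (data != pmc->eventsel) {
		pmc->eventsel = data;	/* assumed effect of reprogramming */
		pmc->reprograms++;
	}
	return 0;
}
```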

+11 -25

arch/x86/kvm/svm/svm.c

··· 62 62
 #define SEG_TYPE_LDT 2
 #define SEG_TYPE_BUSY_TSS16 3

-#define SVM_FEATURE_LBRV (1 << 1)
-#define SVM_FEATURE_SVML (1 << 2)
-#define SVM_FEATURE_TSC_RATE (1 << 4)
-#define SVM_FEATURE_VMCB_CLEAN (1 << 5)
-#define SVM_FEATURE_FLUSH_ASID (1 << 6)
-#define SVM_FEATURE_DECODE_ASSIST (1 << 7)
-#define SVM_FEATURE_PAUSE_FILTER (1 << 10)
-
 #define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
-
-#define TSC_RATIO_RSVD 0xffffff0000000000ULL
-#define TSC_RATIO_MIN 0x0000000000000001ULL
-#define TSC_RATIO_MAX 0x000000ffffffffffULL

 static bool erratum_383_found __read_mostly;

··· 75 87
 static uint64_t osvw_len = 4, osvw_status;

 static DEFINE_PER_CPU(u64, current_tsc_ratio);
-#define TSC_RATIO_DEFAULT 0x0100000000ULL

 static const struct svm_direct_access_msrs {
 	u32 index; /* Index of the MSR */
··· 467 480
 {
 	/* Make sure we clean up behind us */
 	if (tsc_scaling)
-		wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
+		wrmsrl(MSR_AMD64_TSC_RATIO, SVM_TSC_RATIO_DEFAULT);

 	cpu_svm_disable();

··· 513 526
 		 * Set the default value, even if we don't use TSC scaling
 		 * to avoid having stale value in the msr
 		 */
-		wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
-		__this_cpu_write(current_tsc_ratio, TSC_RATIO_DEFAULT);
+		wrmsrl(MSR_AMD64_TSC_RATIO, SVM_TSC_RATIO_DEFAULT);
+		__this_cpu_write(current_tsc_ratio, SVM_TSC_RATIO_DEFAULT);
 	}


··· 2710 2723
 			break;
 		}

-		if (data & TSC_RATIO_RSVD)
+		if (data & SVM_TSC_RATIO_RSVD)
 			return 1;

 		svm->tsc_ratio_msr = data;
··· 2905 2918
 	 * In this case AVIC was temporarily disabled for
 	 * requesting the IRQ window and we have to re-enable it.
 	 */
-	kvm_request_apicv_update(vcpu->kvm, true, APICV_INHIBIT_REASON_IRQWIN);
+	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);

 	++vcpu->stat.irq_window_exits;
 	return 1;
··· 3503 3516
 		 * via AVIC. In such case, we need to temporarily disable AVIC,
 		 * and fallback to injecting IRQ via V_IRQ.
 		 */
-		kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_IRQWIN);
+		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
 		svm_set_vintr(svm);
 	}
 }
··· 3935 3948
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_cpuid_entry2 *best;
+	struct kvm *kvm = vcpu->kvm;

 	vcpu->arch.xsaves_enabled = guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) &&
 				    boot_cpu_has(X86_FEATURE_XSAVE) &&
··· 3962 3974
 		 * is exposed to the guest, disable AVIC.
 		 */
 		if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC))
-			kvm_request_apicv_update(vcpu->kvm, false,
-						 APICV_INHIBIT_REASON_X2APIC);
+			kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_X2APIC);

 		/*
 		 * Currently, AVIC does not work with nested virtualization.
 		 * So, we disable AVIC when cpuid for SVM is set in the L1 guest.
 		 */
 		if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM))
-			kvm_request_apicv_update(vcpu->kvm, false,
-						 APICV_INHIBIT_REASON_NESTED);
+			kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_NESTED);
 	}
 	init_vmcb_after_set_cpuid(vcpu);
 }
··· 4752 4766
 		} else {
 			pr_info("TSC scaling supported\n");
 			kvm_has_tsc_control = true;
-			kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX;
-			kvm_tsc_scaling_ratio_frac_bits = 32;
 		}
 	}
+	kvm_max_tsc_scaling_ratio = SVM_TSC_RATIO_MAX;
+	kvm_tsc_scaling_ratio_frac_bits = 32;

 	tsc_aux_uret_slot = kvm_add_user_return_msr(MSR_TSC_AUX);
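
The default ratio 0x0100000000 together with 32 fractional bits is a fixed-point encoding of 1.0. A rough userspace sketch of fixed-point TSC scaling under that reading follows. scale_tsc() is a hypothetical helper, not the kernel's implementation, and it relies on the unsigned __int128 extension available in GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* 1.0 with 32 fractional bits, matching the values in the diff above. */
#define SVM_TSC_RATIO_DEFAULT	0x0100000000ULL
#define SVM_TSC_FRAC_BITS	32

/*
 * Hypothetical helper: multiply a TSC value by a fixed-point ratio.
 * The 128-bit intermediate keeps the product from overflowing.
 */
static uint64_t scale_tsc(uint64_t tsc, uint64_t ratio, unsigned int frac_bits)
{
	assert(frac_bits < 64);

	return (uint64_t)(((unsigned __int128)tsc * ratio) >> frac_bits);
}
```

Setting the fractional-bit count unconditionally, as the diff does for both SVM and VMX, keeps the derived default ratio (1ULL << frac_bits) meaningful even when hardware TSC scaling is unavailable.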

+3 -12

arch/x86/kvm/svm/svm.h

··· 22 22
 #include <asm/svm.h>
 #include <asm/sev-common.h>

+#include "kvm_cache_regs.h"
+
 #define __sme_page_pa(x) __sme_set(page_to_pfn(x) << PAGE_SHIFT)

 #define IOPM_SIZE PAGE_SIZE * 3
··· 571 569

 /* avic.c */

-#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK (0xFF)
-#define AVIC_LOGICAL_ID_ENTRY_VALID_BIT 31
-#define AVIC_LOGICAL_ID_ENTRY_VALID_MASK (1 << 31)
-
-#define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK GENMASK_ULL(11, 0)
-#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK (0xFFFFFFFFFFULL << 12)
-#define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62)
-#define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK (1ULL << 63)
-
-#define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL
-
 int avic_ga_log_notifier(u32 ga_tag);
 void avic_vm_destroy(struct kvm *kvm);
 int avic_vm_init(struct kvm *kvm);
··· 583 592
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 void avic_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
-bool avic_check_apicv_inhibit_reasons(ulong bit);
+bool avic_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason);
 void avic_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
 void avic_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr);
 bool avic_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu);

-1

arch/x86/kvm/svm/svm_onhyperv.c

··· 4 4
  */

 #include <linux/kvm_host.h>
-#include "kvm_cache_regs.h"

 #include <asm/mshyperv.h>

+12 -10

arch/x86/kvm/trace.h

··· 1339 1339
 		  __entry->vcpu_id, __entry->timer_index)
 );

-TRACE_EVENT(kvm_apicv_update_request,
-	    TP_PROTO(bool activate, unsigned long bit),
-	    TP_ARGS(activate, bit),
+TRACE_EVENT(kvm_apicv_inhibit_changed,
+	    TP_PROTO(int reason, bool set, unsigned long inhibits),
+	    TP_ARGS(reason, set, inhibits),

 	TP_STRUCT__entry(
-		__field(bool, activate)
-		__field(unsigned long, bit)
+		__field(int, reason)
+		__field(bool, set)
+		__field(unsigned long, inhibits)
 	),

 	TP_fast_assign(
-		__entry->activate = activate;
-		__entry->bit = bit;
+		__entry->reason = reason;
+		__entry->set = set;
+		__entry->inhibits = inhibits;
 	),

-	TP_printk("%s bit=%lu",
-		  __entry->activate ? "activate" : "deactivate",
-		  __entry->bit)
+	TP_printk("%s reason=%u, inhibits=0x%lx",
+		  __entry->set ? "set" : "cleared",
+		  __entry->reason, __entry->inhibits)
 );

 TRACE_EVENT(kvm_apicv_accept_irq,

+11 -3

arch/x86/kvm/vmx/pmu_intel.c

··· 389 389
 	struct kvm_pmc *pmc;
 	u32 msr = msr_info->index;
 	u64 data = msr_info->data;
+	u64 reserved_bits;

 	switch (msr) {
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
··· 444 443
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
 			if (data == pmc->eventsel)
 				return 0;
-			if (!(data & pmu->reserved_bits)) {
+			reserved_bits = pmu->reserved_bits;
+			if ((pmc->idx == 2) &&
+			    (pmu->raw_event_mask & HSW_IN_TX_CHECKPOINTED))
+				reserved_bits ^= HSW_IN_TX_CHECKPOINTED;
+			if (!(data & reserved_bits)) {
 				reprogram_gp_counter(pmc, data);
 				return 0;
 			}
··· 490 485
 	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
 	pmu->version = 0;
 	pmu->reserved_bits = 0xffffffff00200000ull;
+	pmu->raw_event_mask = X86_RAW_EVENT_MASK;

 	entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
 	if (!entry || !vcpu->kvm->arch.enable_pmu)
··· 539 533
 	entry = kvm_find_cpuid_entry(vcpu, 7, 0);
 	if (entry &&
 	    (boot_cpu_has(X86_FEATURE_HLE) || boot_cpu_has(X86_FEATURE_RTM)) &&
-	    (entry->ebx & (X86_FEATURE_HLE|X86_FEATURE_RTM)))
-		pmu->reserved_bits ^= HSW_IN_TX|HSW_IN_TX_CHECKPOINTED;
+	    (entry->ebx & (X86_FEATURE_HLE|X86_FEATURE_RTM))) {
+		pmu->reserved_bits ^= HSW_IN_TX;
+		pmu->raw_event_mask |= (HSW_IN_TX|HSW_IN_TX_CHECKPOINTED);
+	}

 	bitmap_set(pmu->all_valid_pmc_idx,
 		0, pmu->nr_arch_gp_counters);
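
On Intel, the diff keeps HSW_IN_TX_CHECKPOINTED reserved by default and lifts the restriction only for the counter with index 2, and only when the TSX bits were added to raw_event_mask. The predicate below is a hedged userspace model of that check. evntsel_write_ok() is a hypothetical name, and the two HSW_* bit positions are assumptions taken from the kernel's perf headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions for the TSX event-select bits. */
#define HSW_IN_TX		(1ULL << 32)
#define HSW_IN_TX_CHECKPOINTED	(1ULL << 33)

/* Base Intel reserved-bit mask as set in the diff above. */
#define INTEL_EVNTSEL_RESERVED	0xffffffff00200000ULL

/* Hypothetical predicate modelling the reserved-bit check on event-select writes. */
static bool evntsel_write_ok(uint64_t data, uint64_t reserved_bits,
			     uint64_t raw_event_mask, int idx)
{
	assert(idx >= 0);

	/* IN_TX_CP is accepted only on counter 2, and only if TSX is exposed. */
	if (idx == 2 && (raw_event_mask & HSW_IN_TX_CHECKPOINTED))
		reserved_bits ^= HSW_IN_TX_CHECKPOINTED;

	return !(data & reserved_bits);
}
```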

+10 -16

arch/x86/kvm/vmx/vmx.c

··· 2866 2866
 int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	struct vmx_uret_msr *msr = vmx_find_uret_msr(vmx, MSR_EFER);

 	/* Nothing to do if hardware doesn't support EFER. */
-	if (!msr)
+	if (!vmx_find_uret_msr(vmx, MSR_EFER))
 		return 0;

 	vcpu->arch.efer = efer;
-	if (efer & EFER_LMA) {
-		vm_entry_controls_setbit(to_vmx(vcpu), VM_ENTRY_IA32E_MODE);
-		msr->data = efer;
-	} else {
-		vm_entry_controls_clearbit(to_vmx(vcpu), VM_ENTRY_IA32E_MODE);
+	if (efer & EFER_LMA)
+		vm_entry_controls_setbit(vmx, VM_ENTRY_IA32E_MODE);
+	else
+		vm_entry_controls_clearbit(vmx, VM_ENTRY_IA32E_MODE);

-		msr->data = efer & ~EFER_LME;
-	}
 	vmx_setup_uret_msrs(vmx);
 	return 0;
 }
··· 2902 2906

 static void exit_lmode(struct kvm_vcpu *vcpu)
 {
-	vm_entry_controls_clearbit(to_vmx(vcpu), VM_ENTRY_IA32E_MODE);
 	vmx_set_efer(vcpu, vcpu->arch.efer & ~EFER_LMA);
 }

··· 7700 7705
 	free_kvm_area();
 }

-static bool vmx_check_apicv_inhibit_reasons(ulong bit)
+static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
 {
 	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
 			  BIT(APICV_INHIBIT_REASON_ABSENT) |
 			  BIT(APICV_INHIBIT_REASON_HYPERV) |
 			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ);

-	return supported & BIT(bit);
+	return supported & BIT(reason);
 }

 static struct kvm_x86_ops vmx_x86_ops __initdata = {
··· 7975 7980
 	if (!enable_apicv)
 		vmx_x86_ops.sync_pir_to_irr = NULL;

-	if (cpu_has_vmx_tsc_scaling()) {
+	if (cpu_has_vmx_tsc_scaling())
 		kvm_has_tsc_control = true;
-		kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
-		kvm_tsc_scaling_ratio_frac_bits = 48;
-	}

+	kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
+	kvm_tsc_scaling_ratio_frac_bits = 48;
 	kvm_has_bus_lock_exit = cpu_has_vmx_bus_lock_detection();

 	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */

+104 -57

arch/x86/kvm/x86.c

··· 1748 1748 { 1749 1749 struct msr_data msr; 1750 1750 1751 - if (!host_initiated && !kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE)) 1752 - return KVM_MSR_RET_FILTERED; 1753 - 1754 1751 switch (index) { 1755 1752 case MSR_FS_BASE: 1756 1753 case MSR_GS_BASE: ··· 1829 1832 struct msr_data msr; 1830 1833 int ret; 1831 1834 1832 - if (!host_initiated && !kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ)) 1833 - return KVM_MSR_RET_FILTERED; 1834 - 1835 1835 switch (index) { 1836 1836 case MSR_TSC_AUX: 1837 1837 if (!kvm_is_supported_user_return_msr(MSR_TSC_AUX)) ··· 1863 1869 } 1864 1870 1865 1871 return ret; 1872 + } 1873 + 1874 + static int kvm_get_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 *data) 1875 + { 1876 + if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ)) 1877 + return KVM_MSR_RET_FILTERED; 1878 + return kvm_get_msr_ignored_check(vcpu, index, data, false); 1879 + } 1880 + 1881 + static int kvm_set_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 data) 1882 + { 1883 + if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE)) 1884 + return KVM_MSR_RET_FILTERED; 1885 + return kvm_set_msr_ignored_check(vcpu, index, data, false); 1866 1886 } 1867 1887 1868 1888 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data) ··· 1961 1953 u64 data; 1962 1954 int r; 1963 1955 1964 - r = kvm_get_msr(vcpu, ecx, &data); 1956 + r = kvm_get_msr_with_filter(vcpu, ecx, &data); 1965 1957 1966 1958 if (!r) { 1967 1959 trace_kvm_msr_read(ecx, data); ··· 1986 1978 u64 data = kvm_read_edx_eax(vcpu); 1987 1979 int r; 1988 1980 1989 - r = kvm_set_msr(vcpu, ecx, data); 1981 + r = kvm_set_msr_with_filter(vcpu, ecx, data); 1990 1982 1991 1983 if (!r) { 1992 1984 trace_kvm_msr_write(ecx, data); ··· 5946 5938 smp_wmb(); 5947 5939 kvm->arch.irqchip_mode = KVM_IRQCHIP_SPLIT; 5948 5940 kvm->arch.nr_reserved_ioapic_pins = cap->args[0]; 5949 - kvm_request_apicv_update(kvm, true, APICV_INHIBIT_REASON_ABSENT); 5941 + kvm_clear_apicv_inhibit(kvm, 
APICV_INHIBIT_REASON_ABSENT); 5950 5942 r = 0; 5951 5943 split_irqchip_unlock: 5952 5944 mutex_unlock(&kvm->lock); ··· 6343 6335 /* Write kvm->irq_routing before enabling irqchip_in_kernel. */ 6344 6336 smp_wmb(); 6345 6337 kvm->arch.irqchip_mode = KVM_IRQCHIP_KERNEL; 6346 - kvm_request_apicv_update(kvm, true, APICV_INHIBIT_REASON_ABSENT); 6338 + kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_ABSENT); 6347 6339 create_irqchip_unlock: 6348 6340 mutex_unlock(&kvm->lock); 6349 6341 break; ··· 6734 6726 static_call(kvm_x86_get_segment)(vcpu, var, seg); 6735 6727 } 6736 6728 6737 - gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, 6729 + gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access, 6738 6730 struct x86_exception *exception) 6739 6731 { 6740 6732 struct kvm_mmu *mmu = vcpu->arch.mmu; ··· 6754 6746 { 6755 6747 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; 6756 6748 6757 - u32 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; 6749 + u64 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; 6758 6750 return mmu->gva_to_gpa(vcpu, mmu, gva, access, exception); 6759 6751 } 6760 6752 EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_read); ··· 6764 6756 { 6765 6757 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; 6766 6758 6767 - u32 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; 6759 + u64 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; 6768 6760 access |= PFERR_FETCH_MASK; 6769 6761 return mmu->gva_to_gpa(vcpu, mmu, gva, access, exception); 6770 6762 } ··· 6774 6766 { 6775 6767 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; 6776 6768 6777 - u32 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; 6769 + u64 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? 
PFERR_USER_MASK : 0; 6778 6770 access |= PFERR_WRITE_MASK; 6779 6771 return mmu->gva_to_gpa(vcpu, mmu, gva, access, exception); 6780 6772 } ··· 6790 6782 } 6791 6783 6792 6784 static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes, 6793 - struct kvm_vcpu *vcpu, u32 access, 6785 + struct kvm_vcpu *vcpu, u64 access, 6794 6786 struct x86_exception *exception) 6795 6787 { 6796 6788 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; ··· 6827 6819 { 6828 6820 struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); 6829 6821 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; 6830 - u32 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; 6822 + u64 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; 6831 6823 unsigned offset; 6832 6824 int ret; 6833 6825 ··· 6852 6844 gva_t addr, void *val, unsigned int bytes, 6853 6845 struct x86_exception *exception) 6854 6846 { 6855 - u32 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; 6847 + u64 access = (static_call(kvm_x86_get_cpl)(vcpu) == 3) ? 
PFERR_USER_MASK : 0; 6856 6848 6857 6849 /* 6858 6850 * FIXME: this should call handle_emulation_failure if X86EMUL_IO_NEEDED ··· 6871 6863 struct x86_exception *exception, bool system) 6872 6864 { 6873 6865 struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); 6874 - u32 access = 0; 6866 + u64 access = 0; 6875 6867 6876 - if (!system && static_call(kvm_x86_get_cpl)(vcpu) == 3) 6868 + if (system) 6869 + access |= PFERR_IMPLICIT_ACCESS; 6870 + else if (static_call(kvm_x86_get_cpl)(vcpu) == 3) 6877 6871 access |= PFERR_USER_MASK; 6878 6872 6879 6873 return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access, exception); ··· 6891 6881 } 6892 6882 6893 6883 static int kvm_write_guest_virt_helper(gva_t addr, void *val, unsigned int bytes, 6894 - struct kvm_vcpu *vcpu, u32 access, 6884 + struct kvm_vcpu *vcpu, u64 access, 6895 6885 struct x86_exception *exception) 6896 6886 { 6897 6887 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; ··· 6925 6915 bool system) 6926 6916 { 6927 6917 struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); 6928 - u32 access = PFERR_WRITE_MASK; 6918 + u64 access = PFERR_WRITE_MASK; 6929 6919 6930 - if (!system && static_call(kvm_x86_get_cpl)(vcpu) == 3) 6920 + if (system) 6921 + access |= PFERR_IMPLICIT_ACCESS; 6922 + else if (static_call(kvm_x86_get_cpl)(vcpu) == 3) 6931 6923 access |= PFERR_USER_MASK; 6932 6924 6933 6925 return kvm_write_guest_virt_helper(addr, val, bytes, vcpu, ··· 6996 6984 bool write) 6997 6985 { 6998 6986 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; 6999 - u32 access = ((static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0) 6987 + u64 access = ((static_call(kvm_x86_get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0) 7000 6988 | (write ? 
PFERR_WRITE_MASK : 0); 7001 6989 7002 6990 /* ··· 7639 7627 return; 7640 7628 } 7641 7629 7642 - static int emulator_get_msr(struct x86_emulate_ctxt *ctxt, 7643 - u32 msr_index, u64 *pdata) 7630 + static int emulator_get_msr_with_filter(struct x86_emulate_ctxt *ctxt, 7631 + u32 msr_index, u64 *pdata) 7644 7632 { 7645 7633 struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); 7646 7634 int r; 7647 7635 7648 - r = kvm_get_msr(vcpu, msr_index, pdata); 7636 + r = kvm_get_msr_with_filter(vcpu, msr_index, pdata); 7649 7637 7650 7638 if (r && kvm_msr_user_space(vcpu, msr_index, KVM_EXIT_X86_RDMSR, 0, 7651 7639 complete_emulated_rdmsr, r)) { ··· 7656 7644 return r; 7657 7645 } 7658 7646 7659 - static int emulator_set_msr(struct x86_emulate_ctxt *ctxt, 7660 - u32 msr_index, u64 data) 7647 + static int emulator_set_msr_with_filter(struct x86_emulate_ctxt *ctxt, 7648 + u32 msr_index, u64 data) 7661 7649 { 7662 7650 struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); 7663 7651 int r; 7664 7652 7665 - r = kvm_set_msr(vcpu, msr_index, data); 7653 + r = kvm_set_msr_with_filter(vcpu, msr_index, data); 7666 7654 7667 7655 if (r && kvm_msr_user_space(vcpu, msr_index, KVM_EXIT_X86_WRMSR, data, 7668 7656 complete_emulated_msr_access, r)) { ··· 7671 7659 } 7672 7660 7673 7661 return r; 7662 + } 7663 + 7664 + static int emulator_get_msr(struct x86_emulate_ctxt *ctxt, 7665 + u32 msr_index, u64 *pdata) 7666 + { 7667 + return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata); 7668 + } 7669 + 7670 + static int emulator_set_msr(struct x86_emulate_ctxt *ctxt, 7671 + u32 msr_index, u64 data) 7672 + { 7673 + return kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data); 7674 7674 } 7675 7675 7676 7676 static u64 emulator_get_smbase(struct x86_emulate_ctxt *ctxt) ··· 7746 7722 static bool emulator_guest_has_fxsr(struct x86_emulate_ctxt *ctxt) 7747 7723 { 7748 7724 return guest_cpuid_has(emul_to_vcpu(ctxt), X86_FEATURE_FXSR); 7725 + } 7726 + 7727 + static bool emulator_guest_has_rdpid(struct x86_emulate_ctxt *ctxt) 7728 
+ { 7729 + return guest_cpuid_has(emul_to_vcpu(ctxt), X86_FEATURE_RDPID); 7749 7730 } 7750 7731 7751 7732 static ulong emulator_read_gpr(struct x86_emulate_ctxt *ctxt, unsigned reg) ··· 7823 7794 .set_dr = emulator_set_dr, 7824 7795 .get_smbase = emulator_get_smbase, 7825 7796 .set_smbase = emulator_set_smbase, 7797 + .set_msr_with_filter = emulator_set_msr_with_filter, 7798 + .get_msr_with_filter = emulator_get_msr_with_filter, 7826 7799 .set_msr = emulator_set_msr, 7827 7800 .get_msr = emulator_get_msr, 7828 7801 .check_pmc = emulator_check_pmc, ··· 7837 7806 .guest_has_long_mode = emulator_guest_has_long_mode, 7838 7807 .guest_has_movbe = emulator_guest_has_movbe, 7839 7808 .guest_has_fxsr = emulator_guest_has_fxsr, 7809 + .guest_has_rdpid = emulator_guest_has_rdpid, 7840 7810 .set_nmi_mask = emulator_set_nmi_mask, 7841 7811 .get_hflags = emulator_get_hflags, 7842 7812 .exiting_smm = emulator_exiting_smm, ··· 9090 9058 } 9091 9059 EXPORT_SYMBOL_GPL(kvm_apicv_activated); 9092 9060 9061 + 9062 + static void set_or_clear_apicv_inhibit(unsigned long *inhibits, 9063 + enum kvm_apicv_inhibit reason, bool set) 9064 + { 9065 + if (set) 9066 + __set_bit(reason, inhibits); 9067 + else 9068 + __clear_bit(reason, inhibits); 9069 + 9070 + trace_kvm_apicv_inhibit_changed(reason, set, *inhibits); 9071 + } 9072 + 9093 9073 static void kvm_apicv_init(struct kvm *kvm) 9094 9074 { 9075 + unsigned long *inhibits = &kvm->arch.apicv_inhibit_reasons; 9076 + 9095 9077 init_rwsem(&kvm->arch.apicv_update_lock); 9096 9078 9097 - set_bit(APICV_INHIBIT_REASON_ABSENT, 9098 - &kvm->arch.apicv_inhibit_reasons); 9079 + set_or_clear_apicv_inhibit(inhibits, APICV_INHIBIT_REASON_ABSENT, true); 9080 + 9099 9081 if (!enable_apicv) 9100 - set_bit(APICV_INHIBIT_REASON_DISABLE, 9101 - &kvm->arch.apicv_inhibit_reasons); 9082 + set_or_clear_apicv_inhibit(inhibits, 9083 + APICV_INHIBIT_REASON_ABSENT, true); 9102 9084 } 9103 9085 9104 9086 static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long 
dest_id) ··· 9786 9740 } 9787 9741 EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv); 9788 9742 9789 - void __kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit) 9743 + void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm, 9744 + enum kvm_apicv_inhibit reason, bool set) 9790 9745 { 9791 9746 unsigned long old, new; 9792 9747 9793 9748 lockdep_assert_held_write(&kvm->arch.apicv_update_lock); 9794 9749 9795 - if (!static_call(kvm_x86_check_apicv_inhibit_reasons)(bit)) 9750 + if (!static_call(kvm_x86_check_apicv_inhibit_reasons)(reason)) 9796 9751 return; 9797 9752 9798 9753 old = new = kvm->arch.apicv_inhibit_reasons; 9799 9754 9800 - if (activate) 9801 - __clear_bit(bit, &new); 9802 - else 9803 - __set_bit(bit, &new); 9755 + set_or_clear_apicv_inhibit(&new, reason, set); 9804 9756 9805 9757 if (!!old != !!new) { 9806 - trace_kvm_apicv_update_request(activate, bit); 9807 9758 /* 9808 9759 * Kick all vCPUs before setting apicv_inhibit_reasons to avoid 9809 9760 * false positives in the sanity check WARN in svm_vcpu_run(). 
··· 9819 9776 unsigned long gfn = gpa_to_gfn(APIC_DEFAULT_PHYS_BASE); 9820 9777 kvm_zap_gfn_range(kvm, gfn, gfn+1); 9821 9778 } 9822 - } else 9779 + } else { 9823 9780 kvm->arch.apicv_inhibit_reasons = new; 9781 + } 9824 9782 } 9825 9783 9826 - void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit) 9784 + void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm, 9785 + enum kvm_apicv_inhibit reason, bool set) 9827 9786 { 9828 9787 if (!enable_apicv) 9829 9788 return; 9830 9789 9831 9790 down_write(&kvm->arch.apicv_update_lock); 9832 - __kvm_request_apicv_update(kvm, activate, bit); 9791 + __kvm_set_or_clear_apicv_inhibit(kvm, reason, set); 9833 9792 up_write(&kvm->arch.apicv_update_lock); 9834 9793 } 9835 - EXPORT_SYMBOL_GPL(kvm_request_apicv_update); 9794 + EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit); 9836 9795 9837 9796 static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) 9838 9797 { ··· 10982 10937 10983 10938 static void kvm_arch_vcpu_guestdbg_update_apicv_inhibit(struct kvm *kvm) 10984 10939 { 10985 - bool inhibit = false; 10940 + bool set = false; 10986 10941 struct kvm_vcpu *vcpu; 10987 10942 unsigned long i; 10988 10943 ··· 10990 10945 10991 10946 kvm_for_each_vcpu(i, vcpu, kvm) { 10992 10947 if (vcpu->guest_debug & KVM_GUESTDBG_BLOCKIRQ) { 10993 - inhibit = true; 10948 + set = true; 10994 10949 break; 10995 10950 } 10996 10951 } 10997 - __kvm_request_apicv_update(kvm, !inhibit, APICV_INHIBIT_REASON_BLOCKIRQ); 10952 + __kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_BLOCKIRQ, set); 10998 10953 up_write(&kvm->arch.apicv_update_lock); 10999 10954 } 11000 10955 ··· 11602 11557 u64 max = min(0x7fffffffULL, 11603 11558 __scale_tsc(kvm_max_tsc_scaling_ratio, tsc_khz)); 11604 11559 kvm_max_guest_tsc_khz = max; 11605 - 11606 - kvm_default_tsc_scaling_ratio = 1ULL << kvm_tsc_scaling_ratio_frac_bits; 11607 11560 } 11608 - 11561 + kvm_default_tsc_scaling_ratio = 1ULL << kvm_tsc_scaling_ratio_frac_bits; 11609 11562 kvm_init_msr_list(); 11610 
11563 return 0; 11611 11564 } ··· 11672 11629 11673 11630 ret = kvm_page_track_init(kvm); 11674 11631 if (ret) 11675 - return ret; 11632 + goto out; 11633 + 11634 + ret = kvm_mmu_init_vm(kvm); 11635 + if (ret) 11636 + goto out_page_track; 11676 11637 11677 11638 INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list); 11678 - INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); 11679 - INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages); 11680 - INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages); 11681 11639 INIT_LIST_HEAD(&kvm->arch.assigned_dev_head); 11682 11640 atomic_set(&kvm->arch.noncoherent_dma_count, 0); 11683 11641 ··· 11710 11666 11711 11667 kvm_apicv_init(kvm); 11712 11668 kvm_hv_init_vm(kvm); 11713 - kvm_mmu_init_vm(kvm); 11714 11669 kvm_xen_init_vm(kvm); 11715 11670 11716 11671 return static_call(kvm_x86_vm_init)(kvm); 11672 + 11673 + out_page_track: 11674 + kvm_page_track_cleanup(kvm); 11675 + out: 11676 + return ret; 11717 11677 } 11718 11678 11719 11679 int kvm_arch_post_init_vm(struct kvm *kvm) ··· 12641 12593 { 12642 12594 struct kvm_mmu *mmu = vcpu->arch.walk_mmu; 12643 12595 struct x86_exception fault; 12644 - u32 access = error_code & 12596 + u64 access = error_code & 12645 12597 (PFERR_WRITE_MASK | PFERR_FETCH_MASK | PFERR_USER_MASK); 12646 12598 12647 12599 if (!(error_code & PFERR_PRESENT_MASK) || ··· 12981 12933 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access); 12982 12934 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi); 12983 12935 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_ga_log); 12984 - EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_apicv_update_request); 12985 12936 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_apicv_accept_irq); 12986 12937 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter); 12987 12938 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
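Note on the kvm_arch_init_vm() hunk above: kvm_mmu_init_vm() now returns an error code and is called right after kvm_page_track_init(), so a failure in the second step has to undo the first. Below is a minimal user-space sketch of that unwind order. The `toy_*` names, the `fail` switches, and the choice of -ENOMEM are assumptions made for illustration and are not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm; tracks only what the sketch needs. */
struct toy_vm {
	bool page_track_ready;
	bool mmu_ready;
};

/* Stand-in for kvm_page_track_init(); 'fail' forces an error (assumed -ENOMEM). */
static int toy_page_track_init(struct toy_vm *vm, bool fail)
{
	if (fail)
		return -ENOMEM;
	vm->page_track_ready = true;
	return 0;
}

/* Stand-in for kvm_page_track_cleanup(). */
static void toy_page_track_cleanup(struct toy_vm *vm)
{
	vm->page_track_ready = false;
}

/* Stand-in for kvm_mmu_init_vm(), which returns int after this change. */
static int toy_mmu_init_vm(struct toy_vm *vm, bool fail)
{
	if (fail)
		return -ENOMEM;
	vm->mmu_ready = true;
	return 0;
}

/* Same label layout as the hunk: each label undoes the steps before it. */
int toy_init_vm(struct toy_vm *vm, bool fail_page_track, bool fail_mmu)
{
	int ret;

	ret = toy_page_track_init(vm, fail_page_track);
	if (ret)
		goto out;

	ret = toy_mmu_init_vm(vm, fail_mmu);
	if (ret)
		goto out_page_track;

	return 0;

out_page_track:
	toy_page_track_cleanup(vm);
out:
	return ret;
}

/* Runs toy_init_vm() on a fresh VM; true if a failed init left state behind. */
bool toy_failed_init_leaks(bool fail_page_track, bool fail_mmu)
{
	struct toy_vm vm = { false, false };
	int ret = toy_init_vm(&vm, fail_page_track, fail_mmu);

	return ret != 0 && (vm.page_track_ready || vm.mmu_ready);
}
```

In this model, calling `toy_failed_init_leaks(false, true)` exercises the new `out_page_track` path: the MMU step fails and the page-track state is released before the error is returned.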

+3 -4

arch/x86/kvm/xen.c

··· 39 39 } 40 40 41 41 do { 42 - ret = kvm_gfn_to_pfn_cache_init(kvm, gpc, NULL, false, true, 43 - gpa, PAGE_SIZE, false); 42 + ret = kvm_gfn_to_pfn_cache_init(kvm, gpc, NULL, KVM_HOST_USES_PFN, 43 + gpa, PAGE_SIZE); 44 44 if (ret) 45 45 goto out; 46 46 ··· 1025 1025 break; 1026 1026 1027 1027 idx = srcu_read_lock(&kvm->srcu); 1028 - rc = kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpc->gpa, 1029 - PAGE_SIZE, false); 1028 + rc = kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpc->gpa, PAGE_SIZE); 1030 1029 srcu_read_unlock(&kvm->srcu, idx); 1031 1030 } while(!rc); 1032 1031

+40 -20

include/linux/kvm_host.h

··· 148 148 #define KVM_REQUEST_MASK GENMASK(7,0) 149 149 #define KVM_REQUEST_NO_WAKEUP BIT(8) 150 150 #define KVM_REQUEST_WAIT BIT(9) 151 + #define KVM_REQUEST_NO_ACTION BIT(10) 151 152 /* 152 153 * Architecture-independent vcpu->requests bit members 153 154 * Bits 4-7 are reserved for more arch-independent bits. ··· 157 156 #define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 158 157 #define KVM_REQ_UNBLOCK 2 159 158 #define KVM_REQ_UNHALT 3 160 - #define KVM_REQ_GPC_INVALIDATE (5 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 161 159 #define KVM_REQUEST_ARCH_BASE 8 160 + 161 + /* 162 + * KVM_REQ_OUTSIDE_GUEST_MODE exists is purely as way to force the vCPU to 163 + * OUTSIDE_GUEST_MODE. KVM_REQ_OUTSIDE_GUEST_MODE differs from a vCPU "kick" 164 + * in that it ensures the vCPU has reached OUTSIDE_GUEST_MODE before continuing 165 + * on. A kick only guarantees that the vCPU is on its way out, e.g. a previous 166 + * kick may have set vcpu->mode to EXITING_GUEST_MODE, and so there's no 167 + * guarantee the vCPU received an IPI and has actually exited guest mode. 168 + */ 169 + #define KVM_REQ_OUTSIDE_GUEST_MODE (KVM_REQUEST_NO_ACTION | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 162 170 163 171 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \ 164 172 BUILD_BUG_ON((unsigned)(nr) >= (sizeof_field(struct kvm_vcpu, requests) * 8) - KVM_REQUEST_ARCH_BASE); \ ··· 1231 1221 * @gpc: struct gfn_to_pfn_cache object. 1232 1222 * @vcpu: vCPU to be used for marking pages dirty and to be woken on 1233 1223 * invalidation. 1234 - * @guest_uses_pa: indicates that the resulting host physical PFN is used while 1235 - * @vcpu is IN_GUEST_MODE so invalidations should wake it. 1236 - * @kernel_map: requests a kernel virtual mapping (kmap / memremap). 
1224 + * @usage: indicates if the resulting host physical PFN is used while 1225 + * the @vcpu is IN_GUEST_MODE (in which case invalidation of 1226 + * the cache from MMU notifiers---but not for KVM memslot 1227 + * changes!---will also force @vcpu to exit the guest and 1228 + * refresh the cache); and/or if the PFN used directly 1229 + * by KVM (and thus needs a kernel virtual mapping). 1237 1230 * @gpa: guest physical address to map. 1238 1231 * @len: sanity check; the range being access must fit a single page. 1239 - * @dirty: mark the cache dirty immediately. 1240 1232 * 1241 1233 * @return: 0 for success. 1242 1234 * -EINVAL for a mapping which would cross a page boundary. 1243 1235 * -EFAULT for an untranslatable guest physical address. 1244 1236 * 1245 1237 * This primes a gfn_to_pfn_cache and links it into the @kvm's list for 1246 - * invalidations to be processed. Invalidation callbacks to @vcpu using 1247 - * %KVM_REQ_GPC_INVALIDATE will occur only for MMU notifiers, not for KVM 1248 - * memslot changes. Callers are required to use kvm_gfn_to_pfn_cache_check() 1249 - * to ensure that the cache is valid before accessing the target page. 1238 + * invalidations to be processed. Callers are required to use 1239 + * kvm_gfn_to_pfn_cache_check() to ensure that the cache is valid before 1240 + * accessing the target page. 1250 1241 */ 1251 1242 int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, 1252 - struct kvm_vcpu *vcpu, bool guest_uses_pa, 1253 - bool kernel_map, gpa_t gpa, unsigned long len, 1254 - bool dirty); 1243 + struct kvm_vcpu *vcpu, enum pfn_cache_usage usage, 1244 + gpa_t gpa, unsigned long len); 1255 1245 1256 1246 /** 1257 1247 * kvm_gfn_to_pfn_cache_check - check validity of a gfn_to_pfn_cache. ··· 1260 1250 * @gpc: struct gfn_to_pfn_cache object. 1261 1251 * @gpa: current guest physical address to map. 1262 1252 * @len: sanity check; the range being access must fit a single page. 
1263 - * @dirty: mark the cache dirty immediately. 1264 1253 * 1265 1254 * @return: %true if the cache is still valid and the address matches. 1266 1255 * %false if the cache is not valid. ··· 1281 1272 * @gpc: struct gfn_to_pfn_cache object. 1282 1273 * @gpa: updated guest physical address to map. 1283 1274 * @len: sanity check; the range being access must fit a single page. 1284 - * @dirty: mark the cache dirty immediately. 1285 1275 * 1286 1276 * @return: 0 for success. 1287 1277 * -EINVAL for a mapping which would cross a page boundary. ··· 1293 1285 * with the lock still held to permit access. 1294 1286 */ 1295 1287 int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, 1296 - gpa_t gpa, unsigned long len, bool dirty); 1288 + gpa_t gpa, unsigned long len); 1297 1289 1298 1290 /** 1299 1291 * kvm_gfn_to_pfn_cache_unmap - temporarily unmap a gfn_to_pfn_cache. ··· 1301 1293 * @kvm: pointer to kvm instance. 1302 1294 * @gpc: struct gfn_to_pfn_cache object. 1303 1295 * 1304 - * This unmaps the referenced page and marks it dirty, if appropriate. The 1305 - * cache is left in the invalid state but at least the mapping from GPA to 1306 - * userspace HVA will remain cached and can be reused on a subsequent 1307 - * refresh. 1296 + * This unmaps the referenced page. The cache is left in the invalid state 1297 + * but at least the mapping from GPA to userspace HVA will remain cached 1298 + * and can be reused on a subsequent refresh. 
1308 1299 */ 1309 1300 void kvm_gfn_to_pfn_cache_unmap(struct kvm *kvm, struct gfn_to_pfn_cache *gpc); 1310 1301 ··· 1991 1984 1992 1985 void kvm_arch_irq_routing_update(struct kvm *kvm); 1993 1986 1994 - static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu) 1987 + static inline void __kvm_make_request(int req, struct kvm_vcpu *vcpu) 1995 1988 { 1996 1989 /* 1997 1990 * Ensure the rest of the request is published to kvm_check_request's ··· 1999 1992 */ 2000 1993 smp_wmb(); 2001 1994 set_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests); 1995 + } 1996 + 1997 + static __always_inline void kvm_make_request(int req, struct kvm_vcpu *vcpu) 1998 + { 1999 + /* 2000 + * Request that don't require vCPU action should never be logged in 2001 + * vcpu->requests. The vCPU won't clear the request, so it will stay 2002 + * logged indefinitely and prevent the vCPU from entering the guest. 2003 + */ 2004 + BUILD_BUG_ON(!__builtin_constant_p(req) || 2005 + (req & KVM_REQUEST_NO_ACTION)); 2006 + 2007 + __kvm_make_request(req, vcpu); 2002 2008 } 2003 2009 2004 2010 static inline bool kvm_request_pending(struct kvm_vcpu *vcpu)
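Note on the request flags in the kvm_host.h hunk above: the low eight bits of a request select a bit in vcpu->requests, while the higher bits are modifiers. KVM_REQ_OUTSIDE_GUEST_MODE carries only modifiers, including the new KVM_REQUEST_NO_ACTION, so it must never be logged in the bitmap. The sketch below models that rule in user-space C. The flag values are copied from the hunk with BIT() and GENMASK() expanded by hand; `struct toy_vcpu` and the `toy_*` functions are hypothetical, and the real code uses set_bit() with a memory barrier rather than a plain OR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag layout copied from the hunk; BIT()/GENMASK() expanded by hand. */
#define KVM_REQUEST_MASK       0xffu      /* GENMASK(7,0) */
#define KVM_REQUEST_NO_WAKEUP  (1u << 8)  /* BIT(8)  */
#define KVM_REQUEST_WAIT       (1u << 9)  /* BIT(9)  */
#define KVM_REQUEST_NO_ACTION  (1u << 10) /* BIT(10) */

#define KVM_REQ_VM_DEAD \
	(1u | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_OUTSIDE_GUEST_MODE \
	(KVM_REQUEST_NO_ACTION | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)

/* Hypothetical stand-in for struct kvm_vcpu: only the requests bitmap. */
struct toy_vcpu {
	uint64_t requests;
};

/* Models __kvm_make_request(): log the request number, drop the flags. */
static void toy_make_request(unsigned int req, struct toy_vcpu *vcpu)
{
	vcpu->requests |= UINT64_C(1) << (req & KVM_REQUEST_MASK);
}

/*
 * Models the first step of the kick path in kvm_main.c: a NO_ACTION
 * request is not logged, because the vCPU would never clear it.
 * Returns true when the request was recorded.
 */
static bool toy_kick_request(unsigned int req, struct toy_vcpu *vcpu)
{
	if (req & KVM_REQUEST_NO_ACTION)
		return false;
	toy_make_request(req, vcpu);
	return true;
}

/* Returns the requests bitmap of a fresh vCPU after one kick with 'req'. */
uint64_t toy_requests_after(unsigned int req)
{
	struct toy_vcpu vcpu = { 0 };

	(void)toy_kick_request(req, &vcpu);
	return vcpu.requests;
}
```

In this model a kick with KVM_REQ_OUTSIDE_GUEST_MODE leaves the bitmap empty, whereas KVM_REQ_VM_DEAD sets bit 1. The kernel additionally rejects NO_ACTION requests passed to kvm_make_request() at build time via BUILD_BUG_ON, which a run-time sketch like this one does not reproduce.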

+8 -3

include/linux/kvm_types.h

··· 18 18 19 19 enum kvm_mr_change; 20 20 21 + #include <linux/bits.h> 21 22 #include <linux/types.h> 22 23 #include <linux/spinlock_types.h> 23 24 ··· 47 46 48 47 typedef hfn_t kvm_pfn_t; 49 48 49 + enum pfn_cache_usage { 50 + KVM_GUEST_USES_PFN = BIT(0), 51 + KVM_HOST_USES_PFN = BIT(1), 52 + KVM_GUEST_AND_HOST_USE_PFN = KVM_GUEST_USES_PFN | KVM_HOST_USES_PFN, 53 + }; 54 + 50 55 struct gfn_to_hva_cache { 51 56 u64 generation; 52 57 gpa_t gpa; ··· 71 64 rwlock_t lock; 72 65 void *khva; 73 66 kvm_pfn_t pfn; 67 + enum pfn_cache_usage usage; 74 68 bool active; 75 69 bool valid; 76 - bool dirty; 77 - bool kernel_map; 78 - bool guest_uses_pa; 79 70 }; 80 71 81 72 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
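Note on enum pfn_cache_usage in the kvm_types.h hunk above: it replaces the separate `guest_uses_pa` and `kernel_map` booleans with one bitmask. The sketch below shows how the old argument pairs would map onto the new values. `toy_usage_from_bools()` is a hypothetical helper written for this note, not something the patch adds.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the hunk, with BIT(n) written as (1 << n). */
enum pfn_cache_usage {
	KVM_GUEST_USES_PFN         = 1 << 0,
	KVM_HOST_USES_PFN          = 1 << 1,
	KVM_GUEST_AND_HOST_USE_PFN = KVM_GUEST_USES_PFN | KVM_HOST_USES_PFN,
};

/*
 * Hypothetical helper: translate the two removed booleans into the new
 * bitmask. Returns 0 when neither was set, which the new init function
 * treats as invalid usage.
 */
int toy_usage_from_bools(bool guest_uses_pa, bool kernel_map)
{
	int usage = 0;

	if (guest_uses_pa)
		usage |= KVM_GUEST_USES_PFN;
	if (kernel_map)
		usage |= KVM_HOST_USES_PFN;
	return usage;
}
```

The xen.c call site earlier in this diff is an instance of the mapping: it used to pass `false, true` for the two booleans and now passes KVM_HOST_USES_PFN.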

+17 -5

virt/kvm/kvm_main.c

··· 117 117 118 118 static const struct file_operations stat_fops_per_vm; 119 119 120 + static struct file_operations kvm_chardev_ops; 121 + 120 122 static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl, 121 123 unsigned long arg); 122 124 #ifdef CONFIG_KVM_COMPAT ··· 253 251 { 254 252 int cpu; 255 253 256 - kvm_make_request(req, vcpu); 254 + if (likely(!(req & KVM_REQUEST_NO_ACTION))) 255 + __kvm_make_request(req, vcpu); 257 256 258 257 if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu)) 259 258 return; ··· 1134 1131 preempt_notifier_inc(); 1135 1132 kvm_init_pm_notifier(kvm); 1136 1133 1134 + /* 1135 + * When the fd passed to this ioctl() is opened it pins the module, 1136 + * but try_module_get() also prevents getting a reference if the module 1137 + * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait"). 1138 + */ 1139 + if (!try_module_get(kvm_chardev_ops.owner)) { 1140 + r = -ENODEV; 1141 + goto out_err; 1142 + } 1143 + 1137 1144 return kvm; 1138 1145 1139 1146 out_err: ··· 1233 1220 preempt_notifier_dec(); 1234 1221 hardware_disable_all(); 1235 1222 mmdrop(mm); 1223 + module_put(kvm_chardev_ops.owner); 1236 1224 } 1237 1225 1238 1226 void kvm_get_kvm(struct kvm *kvm) ··· 3677 3663 return 0; 3678 3664 } 3679 3665 3680 - static struct file_operations kvm_vcpu_fops = { 3666 + static const struct file_operations kvm_vcpu_fops = { 3681 3667 .release = kvm_vcpu_release, 3682 3668 .unlocked_ioctl = kvm_vcpu_ioctl, 3683 3669 .mmap = kvm_vcpu_mmap, ··· 4728 4714 } 4729 4715 #endif 4730 4716 4731 - static struct file_operations kvm_vm_fops = { 4717 + static const struct file_operations kvm_vm_fops = { 4732 4718 .release = kvm_vm_release, 4733 4719 .unlocked_ioctl = kvm_vm_ioctl, 4734 4720 .llseek = noop_llseek, ··· 5735 5721 goto out_free_5; 5736 5722 5737 5723 kvm_chardev_ops.owner = module; 5738 - kvm_vm_fops.owner = module; 5739 - kvm_vcpu_fops.owner = module; 5740 5724 5741 5725 r = misc_register(&kvm_dev); 5742 5726 if (r) {

+26 -46

virt/kvm/pfncache.c

··· 27 27 { 28 28 DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS); 29 29 struct gfn_to_pfn_cache *gpc; 30 - bool wake_vcpus = false; 30 + bool evict_vcpus = false; 31 31 32 32 spin_lock(&kvm->gpc_lock); 33 33 list_for_each_entry(gpc, &kvm->gpc_list, list) { ··· 40 40 41 41 /* 42 42 * If a guest vCPU could be using the physical address, 43 - * it needs to be woken. 43 + * it needs to be forced out of guest mode. 44 44 */ 45 - if (gpc->guest_uses_pa) { 46 - if (!wake_vcpus) { 47 - wake_vcpus = true; 45 + if (gpc->usage & KVM_GUEST_USES_PFN) { 46 + if (!evict_vcpus) { 47 + evict_vcpus = true; 48 48 bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS); 49 49 } 50 50 __set_bit(gpc->vcpu->vcpu_idx, vcpu_bitmap); 51 51 } 52 - 53 - /* 54 - * We cannot call mark_page_dirty() from here because 55 - * this physical CPU might not have an active vCPU 56 - * with which to do the KVM dirty tracking. 57 - * 58 - * Neither is there any point in telling the kernel MM 59 - * that the underlying page is dirty. A vCPU in guest 60 - * mode might still be writing to it up to the point 61 - * where we wake them a few lines further down anyway. 62 - * 63 - * So all the dirty marking happens on the unmap. 64 - */ 65 52 } 66 53 write_unlock_irq(&gpc->lock); 67 54 } 68 55 spin_unlock(&kvm->gpc_lock); 69 56 70 - if (wake_vcpus) { 71 - unsigned int req = KVM_REQ_GPC_INVALIDATE; 57 + if (evict_vcpus) { 58 + /* 59 + * KVM needs to ensure the vCPU is fully out of guest context 60 + * before allowing the invalidation to continue. 61 + */ 62 + unsigned int req = KVM_REQ_OUTSIDE_GUEST_MODE; 72 63 bool called; 73 64 74 65 /* 75 66 * If the OOM reaper is active, then all vCPUs should have 76 67 * been stopped already, so perform the request without 77 - * KVM_REQUEST_WAIT and be sad if any needed to be woken. 68 + * KVM_REQUEST_WAIT and be sad if any needed to be IPI'd. 
78 69 */ 79 70 if (!may_block) 80 71 req &= ~KVM_REQUEST_WAIT; ··· 95 104 } 96 105 EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_check); 97 106 98 - static void __release_gpc(struct kvm *kvm, kvm_pfn_t pfn, void *khva, 99 - gpa_t gpa, bool dirty) 107 + static void __release_gpc(struct kvm *kvm, kvm_pfn_t pfn, void *khva, gpa_t gpa) 100 108 { 101 109 /* Unmap the old page if it was mapped before, and release it */ 102 110 if (!is_error_noslot_pfn(pfn)) { ··· 108 118 #endif 109 119 } 110 120 111 - kvm_release_pfn(pfn, dirty); 112 - if (dirty) 113 - mark_page_dirty(kvm, gpa); 121 + kvm_release_pfn(pfn, false); 114 122 } 115 123 } 116 124 ··· 140 152 } 141 153 142 154 int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, 143 - gpa_t gpa, unsigned long len, bool dirty) 155 + gpa_t gpa, unsigned long len) 144 156 { 145 157 struct kvm_memslots *slots = kvm_memslots(kvm); 146 158 unsigned long page_offset = gpa & ~PAGE_MASK; ··· 148 160 unsigned long old_uhva; 149 161 gpa_t old_gpa; 150 162 void *old_khva; 151 - bool old_valid, old_dirty; 163 + bool old_valid; 152 164 int ret = 0; 153 165 154 166 /* ··· 165 177 old_khva = gpc->khva - offset_in_page(gpc->khva); 166 178 old_uhva = gpc->uhva; 167 179 old_valid = gpc->valid; 168 - old_dirty = gpc->dirty; 169 180 170 181 /* If the userspace HVA is invalid, refresh that first */ 171 182 if (gpc->gpa != gpa || gpc->generation != slots->generation || 172 183 kvm_is_error_hva(gpc->uhva)) { 173 184 gfn_t gfn = gpa_to_gfn(gpa); 174 185 175 - gpc->dirty = false; 176 186 gpc->gpa = gpa; 177 187 gpc->generation = slots->generation; 178 188 gpc->memslot = __gfn_to_memslot(slots, gfn); 179 189 gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn); 180 190 181 191 if (kvm_is_error_hva(gpc->uhva)) { 192 + gpc->pfn = KVM_PFN_ERR_FAULT; 182 193 ret = -EFAULT; 183 194 goto out; 184 195 } ··· 206 219 goto map_done; 207 220 } 208 221 209 - if (gpc->kernel_map) { 222 + if (gpc->usage & KVM_HOST_USES_PFN) { 210 223 if (new_pfn == 
old_pfn) { 211 224 new_khva = old_khva; 212 225 old_pfn = KVM_PFN_ERR_FAULT; ··· 242 255 } 243 256 244 257 out: 245 - if (ret) 246 - gpc->dirty = false; 247 - else 248 - gpc->dirty = dirty; 249 - 250 258 write_unlock_irq(&gpc->lock); 251 259 252 - __release_gpc(kvm, old_pfn, old_khva, old_gpa, old_dirty); 260 + __release_gpc(kvm, old_pfn, old_khva, old_gpa); 253 261 254 262 return ret; 255 263 } ··· 254 272 { 255 273 void *old_khva; 256 274 kvm_pfn_t old_pfn; 257 - bool old_dirty; 258 275 gpa_t old_gpa; 259 276 260 277 write_lock_irq(&gpc->lock); ··· 261 280 gpc->valid = false; 262 281 263 282 old_khva = gpc->khva - offset_in_page(gpc->khva); 264 - old_dirty = gpc->dirty; 265 283 old_gpa = gpc->gpa; 266 284 old_pfn = gpc->pfn; 267 285 ··· 273 293 274 294 write_unlock_irq(&gpc->lock); 275 295 276 - __release_gpc(kvm, old_pfn, old_khva, old_gpa, old_dirty); 296 + __release_gpc(kvm, old_pfn, old_khva, old_gpa); 277 297 } 278 298 EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_unmap); 279 299 280 300 281 301 int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc, 282 - struct kvm_vcpu *vcpu, bool guest_uses_pa, 283 - bool kernel_map, gpa_t gpa, unsigned long len, 284 - bool dirty) 302 + struct kvm_vcpu *vcpu, enum pfn_cache_usage usage, 303 + gpa_t gpa, unsigned long len) 285 304 { 305 + WARN_ON_ONCE(!usage || (usage & KVM_GUEST_AND_HOST_USE_PFN) != usage); 306 + 286 307 if (!gpc->active) { 287 308 rwlock_init(&gpc->lock); 288 309 ··· 291 310 gpc->pfn = KVM_PFN_ERR_FAULT; 292 311 gpc->uhva = KVM_HVA_ERR_BAD; 293 312 gpc->vcpu = vcpu; 294 - gpc->kernel_map = kernel_map; 295 - gpc->guest_uses_pa = guest_uses_pa; 313 + gpc->usage = usage; 296 314 gpc->valid = false; 297 315 gpc->active = true; 298 316 ··· 299 319 list_add(&gpc->list, &kvm->gpc_list); 300 320 spin_unlock(&kvm->gpc_lock); 301 321 } 302 - return kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpa, len, dirty); 322 + return kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpa, len); 303 323 } 304 324 
EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_init); 305 325
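Note on the WARN_ON_ONCE() added to kvm_gfn_to_pfn_cache_init() above: the condition fires when `usage` is zero or contains any bit outside KVM_GUEST_AND_HOST_USE_PFN. The sketch below restates the inverse of that condition as a stand-alone predicate. `toy_usage_is_valid()` is a hypothetical name used only here; the enum values are copied from the kvm_types.h hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the kvm_types.h hunk, with BIT(n) as (1 << n). */
enum pfn_cache_usage {
	KVM_GUEST_USES_PFN         = 1 << 0,
	KVM_HOST_USES_PFN          = 1 << 1,
	KVM_GUEST_AND_HOST_USE_PFN = KVM_GUEST_USES_PFN | KVM_HOST_USES_PFN,
};

/*
 * Hypothetical predicate: true exactly when the WARN_ON_ONCE() condition
 *   !usage || (usage & KVM_GUEST_AND_HOST_USE_PFN) != usage
 * would be false, i.e. at least one known bit and no unknown bits.
 */
bool toy_usage_is_valid(unsigned int usage)
{
	return usage != 0 && (usage & KVM_GUEST_AND_HOST_USE_PFN) == usage;
}
```

Under this predicate the three enum values are accepted, while 0 and any value with a bit above BIT(1) are rejected.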
