Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

+70 -5

Documentation/virt/kvm/api.rst

···
   };

 Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
-``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned.
+``length`` must not be bigger than 2^31 - PAGE_SIZE bytes. The ``addr``
 field must point to a buffer which the tags will be copied to or from.

 ``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
···
 The "pad" and "reserved" fields may be used for future extensions and should be
 set to 0s by userspace.

+4.138 KVM_ARM_SET_COUNTER_OFFSET
+--------------------------------
+
+:Capability: KVM_CAP_COUNTER_OFFSET
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_counter_offset (in)
+:Returns: 0 on success, < 0 on error
+
+This capability indicates that userspace is able to apply a single VM-wide
+offset to both the virtual and physical counters as viewed by the guest
+using the KVM_ARM_SET_CNT_OFFSET ioctl and the following data structure:
+
+::
+
+	struct kvm_arm_counter_offset {
+		__u64 counter_offset;
+		__u64 reserved;
+	};
+
+The offset describes a number of counter cycles that are subtracted from
+both virtual and physical counter views (similar to the effects of the
+CNTVOFF_EL2 and CNTPOFF_EL2 system registers, but only global). The offset
+always applies to all vcpus (already created or created after this ioctl)
+for this VM.
+
+It is userspace's responsibility to compute the offset based, for example,
+on previous values of the guest counters.
+
+Any value other than 0 for the "reserved" field may result in an error
+(-EINVAL) being returned. This ioctl can also return -EBUSY if any vcpu
+ioctl is issued concurrently.
+
+Note that using this ioctl results in KVM ignoring subsequent userspace
+writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG
+interface. No error will be returned, but the resulting offset will not be
+applied.
+
 5. The kvm_run structure
 ========================

···
 			__u64 nr;
 			__u64 args[6];
 			__u64 ret;
-			__u32 longmode;
-			__u32 pad;
+			__u64 flags;
 		} hypercall;

-Unused. This was once used for 'hypercall to userspace'. To implement
-such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+
+It is strongly recommended that userspace use ``KVM_EXIT_IO`` (x86) or
+``KVM_EXIT_MMIO`` (all except s390) to implement functionality that
+requires a guest to interact with host userpace.

 .. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
+
+For arm64:
+----------
+
+SMCCC exits can be enabled depending on the configuration of the SMCCC
+filter. See the Documentation/virt/kvm/devices/vm.rst
+``KVM_ARM_SMCCC_FILTER`` for more details.
+
+``nr`` contains the function ID of the guest's SMCCC call. Userspace is
+expected to use the ``KVM_GET_ONE_REG`` ioctl to retrieve the call
+parameters from the vCPU's GPRs.
+
+Definition of ``flags``:
+ - ``KVM_HYPERCALL_EXIT_SMC``: Indicates that the guest used the SMC
+   conduit to initiate the SMCCC call. If this bit is 0 then the guest
+   used the HVC conduit for the SMCCC call.
+
+ - ``KVM_HYPERCALL_EXIT_16BIT``: Indicates that the guest used a 16bit
+   instruction to initiate the SMCCC call. If this bit is 0 then the
+   guest used a 32bit instruction. An AArch64 guest always has this
+   bit set to 0.
+
+At the point of exit, PC points to the instruction immediately following
+the trapping instruction.

 ::

···
 will clear DR6.RTM.

 7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
+--------------------------------------

 :Architectures: x86, arm64, mips
 :Parameters: args[0] whether feature should be enabled or not
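The KVM_ARM_SET_COUNTER_OFFSET text above leaves the offset computation to userspace. A minimal sketch of that arithmetic follows, assuming the guest view is simply the host counter minus the offset, as the documentation describes. The struct is a local mirror of the UAPI layout so the snippet builds without arm64 kernel headers. The helper names (`compute_counter_offset`, `guest_counter_view`, `validate_counter_offset`) are hypothetical and not part of any KVM interface.

```c
#include <errno.h>
#include <stdint.h>

/* Local mirror of struct kvm_arm_counter_offset (assumption: same layout). */
struct counter_offset_arg {
	uint64_t counter_offset;
	uint64_t reserved;
};

/*
 * Hypothetical helper: choose the offset that makes the guest counter
 * resume at saved_guest_count when the host counter reads host_count.
 * The guest sees (host - offset), so offset = host - saved.
 * Unsigned wrap-around is intentional and harmless here.
 */
static uint64_t compute_counter_offset(uint64_t host_count,
				       uint64_t saved_guest_count)
{
	return host_count - saved_guest_count;
}

/* What the guest would read for a given host counter value. */
static uint64_t guest_counter_view(uint64_t host_count, uint64_t offset)
{
	return host_count - offset;
}

/*
 * Userspace-side pre-check mirroring the documented rule that a nonzero
 * "reserved" field may be rejected with -EINVAL.
 */
static int validate_counter_offset(const struct counter_offset_arg *arg)
{
	return arg->reserved ? -EINVAL : 0;
}
```

A VMM restoring a guest would typically fill the structure with the computed offset and pass it to the ioctl on the VM file descriptor before running any vCPU; the ioctl call itself is omitted here because it needs arm64 KVM headers.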

+79

Documentation/virt/kvm/devices/vm.rst

···
 		if it is enabled
 :Returns:	-EFAULT if the given address is not accessible from kernel space;
 		0 in case of success.
+
+6. GROUP: KVM_ARM_VM_SMCCC_CTRL
+===============================
+
+:Architectures: arm64
+
+6.1. ATTRIBUTE: KVM_ARM_VM_SMCCC_FILTER (w/o)
+---------------------------------------------
+
+:Parameters: Pointer to a ``struct kvm_smccc_filter``
+
+:Returns:
+
+        ======  ===========================================
+        EEXIST  Range intersects with a previously inserted
+                or reserved range
+        EBUSY   A vCPU in the VM has already run
+        EINVAL  Invalid filter configuration
+        ENOMEM  Failed to allocate memory for the in-kernel
+                representation of the SMCCC filter
+        ======  ===========================================
+
+Requests the installation of an SMCCC call filter described as follows::
+
+        enum kvm_smccc_filter_action {
+                KVM_SMCCC_FILTER_HANDLE = 0,
+                KVM_SMCCC_FILTER_DENY,
+                KVM_SMCCC_FILTER_FWD_TO_USER,
+        };
+
+        struct kvm_smccc_filter {
+                __u32 base;
+                __u32 nr_functions;
+                __u8 action;
+                __u8 pad[15];
+        };
+
+The filter is defined as a set of non-overlapping ranges. Each
+range defines an action to be applied to SMCCC calls within the range.
+Userspace can insert multiple ranges into the filter by using
+successive calls to this attribute.
+
+The default configuration of KVM is such that all implemented SMCCC
+calls are allowed. Thus, the SMCCC filter can be defined sparsely
+by userspace, only describing ranges that modify the default behavior.
+
+The range expressed by ``struct kvm_smccc_filter`` is
+[``base``, ``base + nr_functions``). The range is not allowed to wrap,
+i.e. userspace cannot rely on ``base + nr_functions`` overflowing.
+
+The SMCCC filter applies to both SMC and HVC calls initiated by the
+guest. The SMCCC filter gates the in-kernel emulation of SMCCC calls
+and as such takes effect before other interfaces that interact with
+SMCCC calls (e.g. hypercall bitmap registers).
+
+Actions:
+
+ - ``KVM_SMCCC_FILTER_HANDLE``: Allows the guest SMCCC call to be
+   handled in-kernel. It is strongly recommended that userspace *not*
+   explicitly describe the allowed SMCCC call ranges.
+
+ - ``KVM_SMCCC_FILTER_DENY``: Rejects the guest SMCCC call in-kernel
+   and returns to the guest.
+
+ - ``KVM_SMCCC_FILTER_FWD_TO_USER``: The guest SMCCC call is forwarded
+   to userspace with an exit reason of ``KVM_EXIT_HYPERCALL``.
+
+The ``pad`` field is reserved for future use and must be zero. KVM may
+return ``-EINVAL`` if the field is nonzero.
+
+KVM reserves the 'Arm Architecture Calls' range of function IDs and
+will reject attempts to define a filter for any portion of these ranges:
+
+        =========== ===============
+        Start       End (inclusive)
+        =========== ===============
+        0x8000_0000 0x8000_FFFF
+        0xC000_0000 0xC000_FFFF
+        =========== ===============
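The rules above (zero padding, a known action, a non-wrapping range, no overlap with the reserved 'Arm Architecture Calls' ranges) can be checked in userspace before the attribute is written. The following is a rough sketch of such a pre-check, not the kernel's implementation. The struct and enum are local mirrors of the documented layout. `check_smccc_filter` and `ranges_intersect` are hypothetical names. Treating an empty range as invalid is an assumption, as is the mapping of each failure to a specific errno beyond what the table above states.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of the documented UAPI types (assumed identical layout). */
enum filter_action {
	FILTER_HANDLE = 0,
	FILTER_DENY,
	FILTER_FWD_TO_USER,
	NR_FILTER_ACTIONS,
};

struct smccc_filter {
	uint32_t base;
	uint32_t nr_functions;
	uint8_t action;
	uint8_t pad[15];
};

/* Half-open interval intersection: [a_start, a_end) vs [b_start, b_end). */
static bool ranges_intersect(uint64_t a_start, uint64_t a_end,
			     uint64_t b_start, uint64_t b_end)
{
	return a_start < b_end && b_start < a_end;
}

/* Hypothetical userspace pre-check; returns 0 or a negative errno. */
static int check_smccc_filter(const struct smccc_filter *f)
{
	/* Do the arithmetic in 64 bits so that wrap-around is detectable. */
	uint64_t start = f->base;
	uint64_t end = start + f->nr_functions;

	for (unsigned int i = 0; i < sizeof(f->pad); i++)
		if (f->pad[i])
			return -EINVAL;

	if (f->action >= NR_FILTER_ACTIONS)
		return -EINVAL;

	/* Assumption: an empty range is rejected; a wrapping one must be. */
	if (f->nr_functions == 0 || end > (1ULL << 32))
		return -EINVAL;

	/* Reserved 'Arm Architecture Calls' ranges from the table above. */
	if (ranges_intersect(start, end, 0x80000000ULL, 0x80010000ULL) ||
	    ranges_intersect(start, end, 0xC0000000ULL, 0xC0010000ULL))
		return -EEXIST;

	return 0;
}
```

Overlap with previously inserted ranges is not covered here, since that depends on state held by the VMM or the kernel.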

+1 -1

Documentation/virt/kvm/locking.rst

···
 - kvm->mn_active_invalidate_count ensures that pairs of
   invalidate_range_start() and invalidate_range_end() callbacks
   use the same memslots array. kvm->slots_lock and kvm->slots_arch_lock
-  are taken on the waiting side in install_new_memslots, so MMU notifiers
+  are taken on the waiting side when modifying memslots, so MMU notifiers
   must not take either kvm->slots_lock or kvm->slots_arch_lock.

 For SRCU:

+26 -3

arch/arm64/include/asm/kvm_host.h

···
 #include <linux/types.h>
 #include <linux/jump_label.h>
 #include <linux/kvm_types.h>
+#include <linux/maple_tree.h>
 #include <linux/percpu.h>
 #include <linux/psci.h>
 #include <asm/arch_gicv3.h>
···
 	/* Mandated version of PSCI */
 	u32 psci_version;

+	/* Protects VM-scoped configuration data */
+	struct mutex config_lock;
+
 	/*
 	 * If we encounter a data abort without valid instruction syndrome
 	 * information, report this to user space. User space can (and
···
 #define KVM_ARCH_FLAG_EL1_32BIT				4
 	/* PSCI SYSTEM_SUSPEND enabled for the guest */
 #define KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED		5
-
+	/* VM counter offset */
+#define KVM_ARCH_FLAG_VM_COUNTER_OFFSET			6
+	/* Timer PPIs made immutable */
+#define KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE		7
+	/* SMCCC filter initialized for the VM */
+#define KVM_ARCH_FLAG_SMCCC_FILTER_CONFIGURED		8
 	unsigned long flags;

 	/*
···

 	/* Hypercall features firmware registers' descriptor */
 	struct kvm_smccc_features smccc_feat;
+	struct maple_tree smccc_filter;

 	/*
 	 * For an untrusted host VM, 'pkvm.handle' is used to lookup
···
 	TPIDR_EL2,	/* EL2 Software Thread ID Register */
 	CNTHCTL_EL2,	/* Counter-timer Hypervisor Control register */
 	SP_EL2,		/* EL2 Stack Pointer */
+	CNTHP_CTL_EL2,
+	CNTHP_CVAL_EL2,
+	CNTHV_CTL_EL2,
+	CNTHV_CVAL_EL2,

 	NR_SYS_REGS	/* Nothing after this line! */
 };
···

 	/* vcpu power state */
 	struct kvm_mp_state mp_state;
+	spinlock_t mp_state_lock;

 	/* Cache some mmu pages needed inside spinlock regions */
 	struct kvm_mmu_memory_cache mmu_page_cache;
···

 int __init kvm_sys_reg_table_init(void);

+bool lock_all_vcpus(struct kvm *kvm);
+void unlock_all_vcpus(struct kvm *kvm);
+
 /* MMIO helpers */
 void kvm_mmio_write_buf(void *buf, unsigned int len, unsigned long data);
 unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len);
···
 int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 			       struct kvm_device_attr *attr);

-long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
-				struct kvm_arm_copy_mte_tags *copy_tags);
+int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+			       struct kvm_arm_copy_mte_tags *copy_tags);
+int kvm_vm_ioctl_set_counter_offset(struct kvm *kvm,
+				    struct kvm_arm_counter_offset *offset);

 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
···
 #define kvm_supports_32bit_el0()				\
 	(system_supports_32bit_el0() &&				\
 	 !static_branch_unlikely(&arm64_mismatched_32bit_el0))
+
+#define kvm_vm_has_ran_once(kvm)					\
+	(test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &(kvm)->arch.flags))

 int kvm_trng_call(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM

+4

arch/arm64/include/asm/kvm_mmu.h

···
  * specific registers encoded in the instructions).
  */
 .macro kern_hyp_va	reg
+#ifndef __KVM_VHE_HYPERVISOR__
 alternative_cb ARM64_ALWAYS_SYSTEM, kvm_update_va_mask
 	and	\reg, \reg, #1		/* mask with va_mask */
 	ror	\reg, \reg, #1		/* rotate to the first tag bit */
···
 	add	\reg, \reg, #0, lsl 12	/* insert the top 12 bits of the tag */
 	ror	\reg, \reg, #63		/* rotate back */
 alternative_cb_end
+#endif
 .endm

 /*
···

 static __always_inline unsigned long __kern_hyp_va(unsigned long v)
 {
+#ifndef __KVM_VHE_HYPERVISOR__
 	asm volatile(ALTERNATIVE_CB("and %0, %0, #1\n"
 				    "ror %0, %0, #1\n"
 				    "add %0, %0, #0\n"
···
 				    ARM64_ALWAYS_SYSTEM,
 				    kvm_update_va_mask)
 		     : "+r" (v));
+#endif
 	return v;
 }


+3

arch/arm64/include/asm/sysreg.h

···

 #define SYS_CNTFRQ_EL0			sys_reg(3, 3, 14, 0, 0)

+#define SYS_CNTPCT_EL0			sys_reg(3, 3, 14, 0, 1)
 #define SYS_CNTPCTSS_EL0		sys_reg(3, 3, 14, 0, 5)
 #define SYS_CNTVCTSS_EL0		sys_reg(3, 3, 14, 0, 6)

···

 #define SYS_AARCH32_CNTP_TVAL		sys_reg(0, 0, 14, 2, 0)
 #define SYS_AARCH32_CNTP_CTL		sys_reg(0, 0, 14, 2, 1)
+#define SYS_AARCH32_CNTPCT		sys_reg(0, 0, 0, 14, 0)
 #define SYS_AARCH32_CNTP_CVAL		sys_reg(0, 2, 0, 14, 0)
+#define SYS_AARCH32_CNTPCTSS		sys_reg(0, 8, 0, 14, 0)

 #define __PMEV_op2(n)			((n) & 0x7)
 #define __CNTR_CRm(n)			(0x8 | (((n) >> 3) & 0x3))

+36

arch/arm64/include/uapi/asm/kvm.h

···
 	__u64 reserved[2];
 };

+/*
+ * Counter/Timer offset structure. Describe the virtual/physical offset.
+ * To be used with KVM_ARM_SET_COUNTER_OFFSET.
+ */
+struct kvm_arm_counter_offset {
+	__u64 counter_offset;
+	__u64 reserved;
+};
+
 #define KVM_ARM_TAGS_TO_GUEST		0
 #define KVM_ARM_TAGS_FROM_GUEST		1

···
 #endif
 };

+/* Device Control API on vm fd */
+#define KVM_ARM_VM_SMCCC_CTRL		0
+#define   KVM_ARM_VM_SMCCC_FILTER	0
+
 /* Device Control API: ARM VGIC */
 #define KVM_DEV_ARM_VGIC_GRP_ADDR	0
 #define KVM_DEV_ARM_VGIC_GRP_DIST_REGS	1
···
 #define KVM_ARM_VCPU_TIMER_CTRL		1
 #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
 #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
+#define   KVM_ARM_VCPU_TIMER_IRQ_HVTIMER	2
+#define   KVM_ARM_VCPU_TIMER_IRQ_HPTIMER	3
 #define KVM_ARM_VCPU_PVTIME_CTRL	2
 #define   KVM_ARM_VCPU_PVTIME_IPA	0

···

 /* run->fail_entry.hardware_entry_failure_reason codes. */
 #define KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED	(1ULL << 0)
+
+enum kvm_smccc_filter_action {
+	KVM_SMCCC_FILTER_HANDLE = 0,
+	KVM_SMCCC_FILTER_DENY,
+	KVM_SMCCC_FILTER_FWD_TO_USER,
+
+#ifdef __KERNEL__
+	NR_SMCCC_FILTER_ACTIONS
+#endif
+};
+
+struct kvm_smccc_filter {
+	__u32 base;
+	__u32 nr_functions;
+	__u8 action;
+	__u8 pad[15];
+};
+
+/* arm64-specific KVM_EXIT_HYPERCALL flags */
+#define KVM_HYPERCALL_EXIT_SMC		(1U << 0)
+#define KVM_HYPERCALL_EXIT_16BIT	(1U << 1)

 #endif

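The two KVM_EXIT_HYPERCALL flag bits defined above could be decoded by a VMM roughly as follows. This is a sketch under stated assumptions: the flag values are redefined locally so it builds without arm64 headers, and `struct hypercall_info`, `decode_hypercall_flags` and `trapping_insn_addr` are hypothetical names. The address computation relies only on the documented statement that PC points just past the trapping instruction at exit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the UAPI flag bits shown in the diff above. */
#ifndef KVM_HYPERCALL_EXIT_SMC
#define KVM_HYPERCALL_EXIT_SMC		(1U << 0)
#define KVM_HYPERCALL_EXIT_16BIT	(1U << 1)
#endif

/* Hypothetical decoded view of hypercall.flags. */
struct hypercall_info {
	bool used_smc;			/* false means the HVC conduit */
	unsigned int insn_bytes;	/* 2 for a 16-bit encoding, else 4 */
};

static struct hypercall_info decode_hypercall_flags(uint64_t flags)
{
	struct hypercall_info info;

	info.used_smc = (flags & KVM_HYPERCALL_EXIT_SMC) != 0;
	info.insn_bytes = (flags & KVM_HYPERCALL_EXIT_16BIT) ? 2 : 4;
	return info;
}

/*
 * PC at exit points to the instruction after the trapping one, so the
 * trapping instruction sits insn_bytes before it.
 */
static uint64_t trapping_insn_addr(uint64_t pc_at_exit, uint64_t flags)
{
	return pc_at_exit - decode_hypercall_flags(flags).insn_bytes;
}
```

In a real VMM, `flags` would come from `run->hypercall.flags` after a `KVM_EXIT_HYPERCALL` exit, and the PC from a `KVM_GET_ONE_REG` read.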

+11

arch/arm64/kernel/cpufeature.c

···
 		.matches = has_cpuid_feature,
 		ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, ECV, IMP)
 	},
+	{
+		.desc = "Enhanced Counter Virtualization (CNTPOFF)",
+		.capability = ARM64_HAS_ECV_CNTPOFF,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_cpuid_feature,
+		.sys_reg = SYS_ID_AA64MMFR0_EL1,
+		.field_pos = ID_AA64MMFR0_EL1_ECV_SHIFT,
+		.field_width = 4,
+		.sign = FTR_UNSIGNED,
+		.min_field_value = ID_AA64MMFR0_EL1_ECV_CNTPOFF,
+	},
 #ifdef CONFIG_ARM64_PAN
 	{
 		.desc = "Privileged Access Never",
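The new capability entry describes an unsigned 4-bit ID register field compared against a minimum value. The check it encodes can be sketched as below. The shift of 60 and the CNTPOFF value of 2 are assumptions taken from my reading of the Arm architecture's ID_AA64MMFR0_EL1.ECV field, not from this diff. The helper names are hypothetical, and the kernel's own `has_cpuid_feature()` is more general than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encodings: ECV in bits [63:60], 0b0010 meaning CNTPOFF_EL2 exists. */
#define ECV_SHIFT	60
#define ECV_WIDTH	4
#define ECV_CNTPOFF	2

/* Extract an unsigned bitfield of the given width starting at shift. */
static unsigned int extract_unsigned_field(uint64_t reg, unsigned int shift,
					   unsigned int width)
{
	return (unsigned int)((reg >> shift) & ((1ULL << width) - 1));
}

/* True when the ECV field advertises at least the CNTPOFF level. */
static bool has_ecv_cntpoff(uint64_t id_aa64mmfr0)
{
	return extract_unsigned_field(id_aa64mmfr0, ECV_SHIFT, ECV_WIDTH) >=
	       ECV_CNTPOFF;
}
```

The comparison is "greater than or equal" because ID register fields of this kind are meant to be read as minimum feature levels.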

+431 -121

arch/arm64/kvm/arch_timer.c

···
 #include <asm/arch_timer.h>
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_hyp.h>
+#include <asm/kvm_nested.h>

 #include <kvm/arm_vgic.h>
 #include <kvm/arm_arch_timer.h>
···

 static DEFINE_STATIC_KEY_FALSE(has_gic_active_state);

-static const struct kvm_irq_level default_ptimer_irq = {
-	.irq	= 30,
-	.level	= 1,
-};
-
-static const struct kvm_irq_level default_vtimer_irq = {
-	.irq	= 27,
-	.level	= 1,
+static const u8 default_ppi[] = {
+	[TIMER_PTIMER]  = 30,
+	[TIMER_VTIMER]  = 27,
+	[TIMER_HPTIMER] = 26,
+	[TIMER_HVTIMER] = 28,
 };

 static bool kvm_timer_irq_can_fire(struct arch_timer_context *timer_ctx);
···
 static u64 kvm_arm_timer_read(struct kvm_vcpu *vcpu,
 			      struct arch_timer_context *timer,
 			      enum kvm_arch_timer_regs treg);
+static bool kvm_arch_timer_get_input_level(int vintid);
+
+static struct irq_ops arch_timer_irq_ops = {
+	.get_input_level = kvm_arch_timer_get_input_level,
+};
+
+static bool has_cntpoff(void)
+{
+	return (has_vhe() && cpus_have_final_cap(ARM64_HAS_ECV_CNTPOFF));
+}
+
+static int nr_timers(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu_has_nv(vcpu))
+		return NR_KVM_EL0_TIMERS;
+
+	return NR_KVM_TIMERS;
+}

 u32 timer_get_ctl(struct arch_timer_context *ctxt)
 {
···
 		return __vcpu_sys_reg(vcpu, CNTV_CTL_EL0);
 	case TIMER_PTIMER:
 		return __vcpu_sys_reg(vcpu, CNTP_CTL_EL0);
+	case TIMER_HVTIMER:
+		return __vcpu_sys_reg(vcpu, CNTHV_CTL_EL2);
+	case TIMER_HPTIMER:
+		return __vcpu_sys_reg(vcpu, CNTHP_CTL_EL2);
 	default:
 		WARN_ON(1);
 		return 0;
···
 		return __vcpu_sys_reg(vcpu, CNTV_CVAL_EL0);
 	case TIMER_PTIMER:
 		return __vcpu_sys_reg(vcpu, CNTP_CVAL_EL0);
+	case TIMER_HVTIMER:
+		return __vcpu_sys_reg(vcpu, CNTHV_CVAL_EL2);
+	case TIMER_HPTIMER:
+		return __vcpu_sys_reg(vcpu, CNTHP_CVAL_EL2);
 	default:
 		WARN_ON(1);
 		return 0;
···

 static u64 timer_get_offset(struct arch_timer_context *ctxt)
 {
-	if (ctxt->offset.vm_offset)
-		return *ctxt->offset.vm_offset;
+	u64 offset = 0;

-	return 0;
+	if (!ctxt)
+		return 0;
+
+	if (ctxt->offset.vm_offset)
+		offset += *ctxt->offset.vm_offset;
+	if (ctxt->offset.vcpu_offset)
+		offset += *ctxt->offset.vcpu_offset;
+
+	return offset;
 }

 static void timer_set_ctl(struct arch_timer_context *ctxt, u32 ctl)
···
 		break;
 	case TIMER_PTIMER:
 		__vcpu_sys_reg(vcpu, CNTP_CTL_EL0) = ctl;
+		break;
+	case TIMER_HVTIMER:
+		__vcpu_sys_reg(vcpu, CNTHV_CTL_EL2) = ctl;
+		break;
+	case TIMER_HPTIMER:
+		__vcpu_sys_reg(vcpu, CNTHP_CTL_EL2) = ctl;
 		break;
 	default:
 		WARN_ON(1);
···
 		break;
 	case TIMER_PTIMER:
 		__vcpu_sys_reg(vcpu, CNTP_CVAL_EL0) = cval;
+		break;
+	case TIMER_HVTIMER:
+		__vcpu_sys_reg(vcpu, CNTHV_CVAL_EL2) = cval;
+		break;
+	case TIMER_HPTIMER:
+		__vcpu_sys_reg(vcpu, CNTHP_CVAL_EL2) = cval;
 		break;
 	default:
 		WARN_ON(1);
···

 static void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map)
 {
-	if (has_vhe()) {
+	if (vcpu_has_nv(vcpu)) {
+		if (is_hyp_ctxt(vcpu)) {
+			map->direct_vtimer = vcpu_hvtimer(vcpu);
+			map->direct_ptimer = vcpu_hptimer(vcpu);
+			map->emul_vtimer = vcpu_vtimer(vcpu);
+			map->emul_ptimer = vcpu_ptimer(vcpu);
+		} else {
+			map->direct_vtimer = vcpu_vtimer(vcpu);
+			map->direct_ptimer = vcpu_ptimer(vcpu);
+			map->emul_vtimer = vcpu_hvtimer(vcpu);
+			map->emul_ptimer = vcpu_hptimer(vcpu);
+		}
+	} else if (has_vhe()) {
 		map->direct_vtimer = vcpu_vtimer(vcpu);
 		map->direct_ptimer = vcpu_ptimer(vcpu);
+		map->emul_vtimer = NULL;
 		map->emul_ptimer = NULL;
 	} else {
 		map->direct_vtimer = vcpu_vtimer(vcpu);
 		map->direct_ptimer = NULL;
+		map->emul_vtimer = NULL;
 		map->emul_ptimer = vcpu_ptimer(vcpu);
 	}

···
 		ns = cyclecounter_cyc2ns(timecounter->cc,
 					 val - now,
 					 timecounter->mask,
-					 &timecounter->frac);
+					 &timer_ctx->ns_frac);
 		return ns;
 	}

···

 static u64 wfit_delay_ns(struct kvm_vcpu *vcpu)
 {
-	struct arch_timer_context *ctx = vcpu_vtimer(vcpu);
 	u64 val = vcpu_get_reg(vcpu, kvm_vcpu_sys_get_rt(vcpu));
+	struct arch_timer_context *ctx;
+
+	ctx = (vcpu_has_nv(vcpu) && is_hyp_ctxt(vcpu)) ? vcpu_hvtimer(vcpu)
+						       : vcpu_vtimer(vcpu);

 	return kvm_counter_compute_delta(ctx, val);
 }
···
 	u64 min_delta = ULLONG_MAX;
 	int i;

-	for (i = 0; i < NR_KVM_TIMERS; i++) {
+	for (i = 0; i < nr_timers(vcpu); i++) {
 		struct arch_timer_context *ctx = &vcpu->arch.timer_cpu.timers[i];

 		WARN(ctx->loaded, "timer %d loaded\n", i);
···

 		switch (index) {
 		case TIMER_VTIMER:
+		case TIMER_HVTIMER:
 			cnt_ctl = read_sysreg_el0(SYS_CNTV_CTL);
 			break;
 		case TIMER_PTIMER:
+		case TIMER_HPTIMER:
 			cnt_ctl = read_sysreg_el0(SYS_CNTP_CTL);
 			break;
 		case NR_KVM_TIMERS:
···
 	int ret;

 	timer_ctx->irq.level = new_level;
-	trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_ctx->irq.irq,
+	trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_irq(timer_ctx),
 				   timer_ctx->irq.level);

 	if (!userspace_irqchip(vcpu->kvm)) {
 		ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id,
-					  timer_ctx->irq.irq,
+					  timer_irq(timer_ctx),
 					  timer_ctx->irq.level,
 					  timer_ctx);
 		WARN_ON(ret);
···
 	kvm_call_hyp(__kvm_timer_set_cntvoff, cntvoff);
 }

+static void set_cntpoff(u64 cntpoff)
+{
+	if (has_cntpoff())
+		write_sysreg_s(cntpoff, SYS_CNTPOFF_EL2);
+}
+
 static void timer_save_state(struct arch_timer_context *ctx)
 {
 	struct arch_timer_cpu *timer = vcpu_timer(ctx->vcpu);
···
 		goto out;

 	switch (index) {
+		u64 cval;
+
 	case TIMER_VTIMER:
+	case TIMER_HVTIMER:
 		timer_set_ctl(ctx, read_sysreg_el0(SYS_CNTV_CTL));
 		timer_set_cval(ctx, read_sysreg_el0(SYS_CNTV_CVAL));

···
 		set_cntvoff(0);
 		break;
 	case TIMER_PTIMER:
+	case TIMER_HPTIMER:
 		timer_set_ctl(ctx, read_sysreg_el0(SYS_CNTP_CTL));
-		timer_set_cval(ctx, read_sysreg_el0(SYS_CNTP_CVAL));
+		cval = read_sysreg_el0(SYS_CNTP_CVAL);
+
+		if (!has_cntpoff())
+			cval -= timer_get_offset(ctx);
+
+		timer_set_cval(ctx, cval);

 		/* Disable the timer */
 		write_sysreg_el0(0, SYS_CNTP_CTL);
 		isb();

+		set_cntpoff(0);
 		break;
 	case NR_KVM_TIMERS:
 		BUG();
···
 	 */
 	if (!kvm_timer_irq_can_fire(map.direct_vtimer) &&
 	    !kvm_timer_irq_can_fire(map.direct_ptimer) &&
+	    !kvm_timer_irq_can_fire(map.emul_vtimer) &&
 	    !kvm_timer_irq_can_fire(map.emul_ptimer) &&
 	    !vcpu_has_wfit_active(vcpu))
 		return;
···
 		goto out;

 	switch (index) {
+		u64 cval, offset;
+
 	case TIMER_VTIMER:
+	case TIMER_HVTIMER:
 		set_cntvoff(timer_get_offset(ctx));
 		write_sysreg_el0(timer_get_cval(ctx), SYS_CNTV_CVAL);
 		isb();
 		write_sysreg_el0(timer_get_ctl(ctx), SYS_CNTV_CTL);
 		break;
 	case TIMER_PTIMER:
-		write_sysreg_el0(timer_get_cval(ctx), SYS_CNTP_CVAL);
+	case TIMER_HPTIMER:
+		cval = timer_get_cval(ctx);
+		offset = timer_get_offset(ctx);
+		set_cntpoff(offset);
+		if (!has_cntpoff())
+			cval += offset;
+		write_sysreg_el0(cval, SYS_CNTP_CVAL);
 		isb();
 		write_sysreg_el0(timer_get_ctl(ctx), SYS_CNTP_CTL);
 		break;
···
 	kvm_timer_update_irq(ctx->vcpu, kvm_timer_should_fire(ctx), ctx);

 	if (irqchip_in_kernel(vcpu->kvm))
-		phys_active = kvm_vgic_map_is_active(vcpu, ctx->irq.irq);
+		phys_active = kvm_vgic_map_is_active(vcpu, timer_irq(ctx));

 	phys_active |= ctx->irq.level;

···
 		enable_percpu_irq(host_vtimer_irq, host_vtimer_irq_flags);
 }

+/* If _pred is true, set bit in _set, otherwise set it in _clr */
+#define assign_clear_set_bit(_pred, _bit, _clr, _set)			\
+	do {								\
+		if (_pred)						\
+			(_set) |= (_bit);				\
+		else							\
+			(_clr) |= (_bit);				\
+	} while (0)
+
+static void kvm_timer_vcpu_load_nested_switch(struct kvm_vcpu *vcpu,
+					      struct timer_map *map)
+{
+	int hw, ret;
+
+	if (!irqchip_in_kernel(vcpu->kvm))
+		return;
+
+	/*
+	 * We only ever unmap the vtimer irq on a VHE system that runs nested
+	 * virtualization, in which case we have both a valid emul_vtimer,
+	 * emul_ptimer, direct_vtimer, and direct_ptimer.
+	 *
+	 * Since this is called from kvm_timer_vcpu_load(), a change between
+	 * vEL2 and vEL1/0 will have just happened, and the timer_map will
+	 * represent this, and therefore we switch the emul/direct mappings
+	 * below.
+	 */
+	hw = kvm_vgic_get_map(vcpu, timer_irq(map->direct_vtimer));
+	if (hw < 0) {
+		kvm_vgic_unmap_phys_irq(vcpu, timer_irq(map->emul_vtimer));
+		kvm_vgic_unmap_phys_irq(vcpu, timer_irq(map->emul_ptimer));
+
+		ret = kvm_vgic_map_phys_irq(vcpu,
+					    map->direct_vtimer->host_timer_irq,
+					    timer_irq(map->direct_vtimer),
+					    &arch_timer_irq_ops);
+		WARN_ON_ONCE(ret);
+		ret = kvm_vgic_map_phys_irq(vcpu,
+					    map->direct_ptimer->host_timer_irq,
+					    timer_irq(map->direct_ptimer),
+					    &arch_timer_irq_ops);
+		WARN_ON_ONCE(ret);
+
+		/*
+		 * The virtual offset behaviour is "interresting", as it
+		 * always applies when HCR_EL2.E2H==0, but only when
+		 * accessed from EL1 when HCR_EL2.E2H==1. So make sure we
+		 * track E2H when putting the HV timer in "direct" mode.
+		 */
+		if (map->direct_vtimer == vcpu_hvtimer(vcpu)) {
+			struct arch_timer_offset *offs = &map->direct_vtimer->offset;
+
+			if (vcpu_el2_e2h_is_set(vcpu))
+				offs->vcpu_offset = NULL;
+			else
+				offs->vcpu_offset = &__vcpu_sys_reg(vcpu, CNTVOFF_EL2);
+		}
+	}
+}
+
+static void timer_set_traps(struct kvm_vcpu *vcpu, struct timer_map *map)
+{
+	bool tpt, tpc;
+	u64 clr, set;
+
+	/*
+	 * No trapping gets configured here with nVHE. See
+	 * __timer_enable_traps(), which is where the stuff happens.
+	 */
+	if (!has_vhe())
+		return;
+
+	/*
+	 * Our default policy is not to trap anything. As we progress
+	 * within this function, reality kicks in and we start adding
+	 * traps based on emulation requirements.
+	 */
+	tpt = tpc = false;
+
+	/*
+	 * We have two possibility to deal with a physical offset:
+	 *
+	 * - Either we have CNTPOFF (yay!) or the offset is 0:
+	 *   we let the guest freely access the HW
+	 *
+	 * - or neither of these condition apply:
+	 *   we trap accesses to the HW, but still use it
+	 *   after correcting the physical offset
+	 */
+	if (!has_cntpoff() && timer_get_offset(map->direct_ptimer))
+		tpt = tpc = true;
+
+	/*
+	 * Apply the enable bits that the guest hypervisor has requested for
+	 * its own guest. We can only add traps that wouldn't have been set
+	 * above.
+	 */
+	if (vcpu_has_nv(vcpu) && !is_hyp_ctxt(vcpu)) {
+		u64 val = __vcpu_sys_reg(vcpu, CNTHCTL_EL2);
+
+		/* Use the VHE format for mental sanity */
+		if (!vcpu_el2_e2h_is_set(vcpu))
+			val = (val & (CNTHCTL_EL1PCEN | CNTHCTL_EL1PCTEN)) << 10;
+
+		tpt |= !(val & (CNTHCTL_EL1PCEN << 10));
+		tpc |= !(val & (CNTHCTL_EL1PCTEN << 10));
+	}
+
+	/*
+	 * Now that we have collected our requirements, compute the
+	 * trap and enable bits.
+	 */
+	set = 0;
+	clr = 0;
+
+	assign_clear_set_bit(tpt, CNTHCTL_EL1PCEN << 10, set, clr);
+	assign_clear_set_bit(tpc, CNTHCTL_EL1PCTEN << 10, set, clr);
+
+	/* This only happens on VHE, so use the CNTKCTL_EL1 accessor */
+	sysreg_clear_set(cntkctl_el1, clr, set);
+}
+
 void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
 {
 	struct arch_timer_cpu *timer = vcpu_timer(vcpu);
···
 	get_timer_map(vcpu, &map);

 	if (static_branch_likely(&has_gic_active_state)) {
+		if (vcpu_has_nv(vcpu))
+			kvm_timer_vcpu_load_nested_switch(vcpu, &map);
+
 		kvm_timer_vcpu_load_gic(map.direct_vtimer);
 		if (map.direct_ptimer)
 			kvm_timer_vcpu_load_gic(map.direct_ptimer);
···
 	timer_restore_state(map.direct_vtimer);
 	if (map.direct_ptimer)
 		timer_restore_state(map.direct_ptimer);
-
+	if (map.emul_vtimer)
+		timer_emulate(map.emul_vtimer);
 	if (map.emul_ptimer)
 		timer_emulate(map.emul_ptimer);
+
+	timer_set_traps(vcpu, &map);
 }

 bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu)
···
 	 * In any case, we re-schedule the hrtimer for the physical timer when
 	 * coming back to the VCPU thread in kvm_timer_vcpu_load().
 	 */
+	if (map.emul_vtimer)
+		soft_timer_cancel(&map.emul_vtimer->hrtimer);
 	if (map.emul_ptimer)
 		soft_timer_cancel(&map.emul_ptimer->hrtimer);

···
 	 * resets the timer to be disabled and unmasked and is compliant with
 	 * the ARMv7 architecture.
 	 */
-	timer_set_ctl(vcpu_vtimer(vcpu), 0);
-	timer_set_ctl(vcpu_ptimer(vcpu), 0);
+	for (int i = 0; i < nr_timers(vcpu); i++)
+		timer_set_ctl(vcpu_get_timer(vcpu, i), 0);
+
+	/*
+	 * A vcpu running at EL2 is in charge of the offset applied to
+	 * the virtual timer, so use the physical VM offset, and point
+	 * the vcpu offset to CNTVOFF_EL2.
+	 */
+	if (vcpu_has_nv(vcpu)) {
+		struct arch_timer_offset *offs = &vcpu_vtimer(vcpu)->offset;
+
+		offs->vcpu_offset = &__vcpu_sys_reg(vcpu, CNTVOFF_EL2);
+		offs->vm_offset = &vcpu->kvm->arch.timer_data.poffset;
+	}

 	if (timer->enabled) {
-		kvm_timer_update_irq(vcpu, false, vcpu_vtimer(vcpu));
-		kvm_timer_update_irq(vcpu, false, vcpu_ptimer(vcpu));
+		for (int i = 0; i < nr_timers(vcpu); i++)
+			kvm_timer_update_irq(vcpu, false,
+					     vcpu_get_timer(vcpu, i));

 		if (irqchip_in_kernel(vcpu->kvm)) {
-			kvm_vgic_reset_mapped_irq(vcpu, map.direct_vtimer->irq.irq);
+			kvm_vgic_reset_mapped_irq(vcpu, timer_irq(map.direct_vtimer));
 			if (map.direct_ptimer)
-				kvm_vgic_reset_mapped_irq(vcpu, map.direct_ptimer->irq.irq);
+				kvm_vgic_reset_mapped_irq(vcpu, timer_irq(map.direct_ptimer));
 		}
 	}

+	if (map.emul_vtimer)
+		soft_timer_cancel(&map.emul_vtimer->hrtimer);
 	if (map.emul_ptimer)
 		soft_timer_cancel(&map.emul_ptimer->hrtimer);

 	return 0;
 }

+static void timer_context_init(struct kvm_vcpu *vcpu, int timerid)
+{
+	struct arch_timer_context *ctxt = vcpu_get_timer(vcpu, timerid);
+	struct kvm *kvm = vcpu->kvm;
+
+	ctxt->vcpu = vcpu;
+
+	if (timerid == TIMER_VTIMER)
+		ctxt->offset.vm_offset = &kvm->arch.timer_data.voffset;
+	else
+		ctxt->offset.vm_offset = &kvm->arch.timer_data.poffset;
+
+	hrtimer_init(&ctxt->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
+	ctxt->hrtimer.function = kvm_hrtimer_expire;
+
+	switch (timerid) {
+	case TIMER_PTIMER:
+	case TIMER_HPTIMER:
+		ctxt->host_timer_irq = host_ptimer_irq;
+		break;
+	case TIMER_VTIMER:
+	case TIMER_HVTIMER:
+		ctxt->host_timer_irq = host_vtimer_irq;
+		break;
+	}
+}
+
 void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
 {
 	struct arch_timer_cpu *timer = vcpu_timer(vcpu);
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
-	struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);

-	vtimer->vcpu = vcpu;
-	vtimer->offset.vm_offset = &vcpu->kvm->arch.timer_data.voffset;
-	ptimer->vcpu = vcpu;
+	for (int i = 0; i < NR_KVM_TIMERS; i++)
+		timer_context_init(vcpu, i);

-	/* Synchronize cntvoff across all vtimers of a VM. */
-	timer_set_offset(vtimer, kvm_phys_timer_read());
-	timer_set_offset(ptimer, 0);
+	/* Synchronize offsets across timers of a VM if not already provided */
+	if (!test_bit(KVM_ARCH_FLAG_VM_COUNTER_OFFSET, &vcpu->kvm->arch.flags)) {
+		timer_set_offset(vcpu_vtimer(vcpu), kvm_phys_timer_read());
+		timer_set_offset(vcpu_ptimer(vcpu), 0);
+	}

 	hrtimer_init(&timer->bg_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
 	timer->bg_timer.function = kvm_bg_timer_expire;
+}

-	hrtimer_init(&vtimer->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
-	hrtimer_init(&ptimer->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
-	vtimer->hrtimer.function = kvm_hrtimer_expire;
-	ptimer->hrtimer.function = kvm_hrtimer_expire;
-
-	vtimer->irq.irq = default_vtimer_irq.irq;
-	ptimer->irq.irq = default_ptimer_irq.irq;
-
-	vtimer->host_timer_irq = host_vtimer_irq;
-	ptimer->host_timer_irq = host_ptimer_irq;
-
-	vtimer->host_timer_irq_flags = host_vtimer_irq_flags;
-	ptimer->host_timer_irq_flags = host_ptimer_irq_flags;
+void kvm_timer_init_vm(struct kvm *kvm)
+{
+	for (int i = 0; i < NR_KVM_TIMERS; i++)
+		kvm->arch.timer_data.ppi[i] = default_ppi[i];
 }

 void kvm_timer_cpu_up(void)
···
 		kvm_arm_timer_write(vcpu, timer, TIMER_REG_CTL, value);
 		break;
 	case KVM_REG_ARM_TIMER_CNT:
-		timer = vcpu_vtimer(vcpu);
-		timer_set_offset(timer, kvm_phys_timer_read() - value);
+		if (!test_bit(KVM_ARCH_FLAG_VM_COUNTER_OFFSET,
+			      &vcpu->kvm->arch.flags)) {
+			timer = vcpu_vtimer(vcpu);
+			timer_set_offset(timer, kvm_phys_timer_read() - value);
+		}
 		break;
 	case KVM_REG_ARM_TIMER_CVAL:
 		timer = vcpu_vtimer(vcpu);
···
 	case KVM_REG_ARM_PTIMER_CTL:
 		timer = vcpu_ptimer(vcpu);
 		kvm_arm_timer_write(vcpu, timer, TIMER_REG_CTL, value);
+		break;
+	case KVM_REG_ARM_PTIMER_CNT:
+		if (!test_bit(KVM_ARCH_FLAG_VM_COUNTER_OFFSET,
+			      &vcpu->kvm->arch.flags)) {
+			timer = vcpu_ptimer(vcpu);
+			timer_set_offset(timer, kvm_phys_timer_read() - value);
+		}
 		break;
 	case KVM_REG_ARM_PTIMER_CVAL:
 		timer = vcpu_ptimer(vcpu);
···
 		val = kvm_phys_timer_read() - timer_get_offset(timer);
 		break;

+	case TIMER_REG_VOFF:
+		val = *timer->offset.vcpu_offset;
+		break;
+
 	default:
 		BUG();
 	}
···
 	get_timer_map(vcpu, &map);
 	timer = vcpu_get_timer(vcpu, tmr);

-	if (timer == map.emul_ptimer)
+	if (timer == map.emul_vtimer || timer == map.emul_ptimer)
 		return kvm_arm_timer_read(vcpu, timer, treg);

 	preempt_disable();
···
 		timer_set_cval(timer, val);
 		break;

+	case TIMER_REG_VOFF:
+		*timer->offset.vcpu_offset = val;
+		break;
+
 	default:
 		BUG();
 	}
···

 	get_timer_map(vcpu, &map);
 	timer = vcpu_get_timer(vcpu, tmr);
-	if (timer == map.emul_ptimer) {
+	if (timer == map.emul_vtimer || timer == map.emul_ptimer) {
 		soft_timer_cancel(&timer->hrtimer);
 		kvm_arm_timer_write(vcpu, timer, treg, val);
 		timer_emulate(timer);
···
 static const struct irq_domain_ops timer_domain_ops = {
 	.alloc	= timer_irq_domain_alloc,
 	.free	= timer_irq_domain_free,
-};
-
-static struct irq_ops arch_timer_irq_ops = {
-	.get_input_level = kvm_arch_timer_get_input_level,
 };

 static void kvm_irq_fixup_flags(unsigned int virq, u32 *flags)
···

 static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu)
 {
-	int vtimer_irq, ptimer_irq, ret;
-	unsigned long i;
+	u32 ppis = 0;
+	bool valid;

-	vtimer_irq = vcpu_vtimer(vcpu)->irq.irq;
-	ret = kvm_vgic_set_owner(vcpu, vtimer_irq, vcpu_vtimer(vcpu));
-	if (ret)
-		return false;
+	mutex_lock(&vcpu->kvm->arch.config_lock);

-	ptimer_irq = vcpu_ptimer(vcpu)->irq.irq;
-	ret = kvm_vgic_set_owner(vcpu, ptimer_irq, vcpu_ptimer(vcpu));
-	if (ret)
-		return false;
+	for (int i = 0; i < nr_timers(vcpu); i++) {
+		struct arch_timer_context *ctx;
+		int irq;

-	kvm_for_each_vcpu(i, vcpu, vcpu->kvm) {
-		if (vcpu_vtimer(vcpu)->irq.irq != vtimer_irq ||
-		    vcpu_ptimer(vcpu)->irq.irq != ptimer_irq)
-			return false;
+		ctx = vcpu_get_timer(vcpu, i);
+		irq = timer_irq(ctx);
+		if (kvm_vgic_set_owner(vcpu, irq, ctx))
+			break;
+
+		/*
+		 * We know by construction that we only have PPIs, so
+		 * all values are less than 32.
+		 */
+		ppis |= BIT(irq);
 	}

-	return true;
+	valid = hweight32(ppis) == nr_timers(vcpu);
+
+	if (valid)
+		set_bit(KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE, &vcpu->kvm->arch.flags);
+
+	mutex_unlock(&vcpu->kvm->arch.config_lock);
+
+	return valid;
 }

-bool kvm_arch_timer_get_input_level(int vintid)
+static bool kvm_arch_timer_get_input_level(int vintid)
 {
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-	struct arch_timer_context *timer;

 	if (WARN(!vcpu, "No vcpu context!\n"))
 		return false;

-	if (vintid == vcpu_vtimer(vcpu)->irq.irq)
-		timer = vcpu_vtimer(vcpu);
-	else if (vintid == vcpu_ptimer(vcpu)->irq.irq)
-		timer = vcpu_ptimer(vcpu);
-	else
-		BUG();
+	for (int i = 0; i < nr_timers(vcpu); i++) {
+		struct arch_timer_context *ctx;

-	return kvm_timer_should_fire(timer);
+		ctx = vcpu_get_timer(vcpu, i);
+		if (timer_irq(ctx) == vintid)
+			return kvm_timer_should_fire(ctx);
+	}
+
+	/* A timer IRQ has
fired, but no matching timer was found? */ 1242 + WARN_RATELIMIT(1, "timer INTID%d unknown\n", vintid); 1243 + 1244 + return false; 1498 1245 } 1499 1246 1500 1247 int kvm_timer_enable(struct kvm_vcpu *vcpu) ··· 1535 1258 1536 1259 ret = kvm_vgic_map_phys_irq(vcpu, 1537 1260 map.direct_vtimer->host_timer_irq, 1538 - map.direct_vtimer->irq.irq, 1261 + timer_irq(map.direct_vtimer), 1539 1262 &arch_timer_irq_ops); 1540 1263 if (ret) 1541 1264 return ret; ··· 1543 1266 if (map.direct_ptimer) { 1544 1267 ret = kvm_vgic_map_phys_irq(vcpu, 1545 1268 map.direct_ptimer->host_timer_irq, 1546 - map.direct_ptimer->irq.irq, 1269 + timer_irq(map.direct_ptimer), 1547 1270 &arch_timer_irq_ops); 1548 1271 } 1549 1272 ··· 1555 1278 return 0; 1556 1279 } 1557 1280 1558 - /* 1559 - * On VHE system, we only need to configure the EL2 timer trap register once, 1560 - * not for every world switch. 1561 - * The host kernel runs at EL2 with HCR_EL2.TGE == 1, 1562 - * and this makes those bits have no effect for the host kernel execution. 1563 - */ 1281 + /* If we have CNTPOFF, permanently set ECV to enable it */ 1564 1282 void kvm_timer_init_vhe(void) 1565 1283 { 1566 - /* When HCR_EL2.E2H ==1, EL1PCEN and EL1PCTEN are shifted by 10 */ 1567 - u32 cnthctl_shift = 10; 1568 - u64 val; 1569 - 1570 - /* 1571 - * VHE systems allow the guest direct access to the EL1 physical 1572 - * timer/counter. 
1573 - */ 1574 - val = read_sysreg(cnthctl_el2); 1575 - val |= (CNTHCTL_EL1PCEN << cnthctl_shift); 1576 - val |= (CNTHCTL_EL1PCTEN << cnthctl_shift); 1577 - write_sysreg(val, cnthctl_el2); 1578 - } 1579 - 1580 - static void set_timer_irqs(struct kvm *kvm, int vtimer_irq, int ptimer_irq) 1581 - { 1582 - struct kvm_vcpu *vcpu; 1583 - unsigned long i; 1584 - 1585 - kvm_for_each_vcpu(i, vcpu, kvm) { 1586 - vcpu_vtimer(vcpu)->irq.irq = vtimer_irq; 1587 - vcpu_ptimer(vcpu)->irq.irq = ptimer_irq; 1588 - } 1284 + if (cpus_have_final_cap(ARM64_HAS_ECV_CNTPOFF)) 1285 + sysreg_clear_set(cntkctl_el1, 0, CNTHCTL_ECV); 1589 1286 } 1590 1287 1591 1288 int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) 1592 1289 { 1593 1290 int __user *uaddr = (int __user *)(long)attr->addr; 1594 - struct arch_timer_context *vtimer = vcpu_vtimer(vcpu); 1595 - struct arch_timer_context *ptimer = vcpu_ptimer(vcpu); 1596 - int irq; 1291 + int irq, idx, ret = 0; 1597 1292 1598 1293 if (!irqchip_in_kernel(vcpu->kvm)) 1599 1294 return -EINVAL; ··· 1576 1327 if (!(irq_is_ppi(irq))) 1577 1328 return -EINVAL; 1578 1329 1579 - if (vcpu->arch.timer_cpu.enabled) 1580 - return -EBUSY; 1330 + mutex_lock(&vcpu->kvm->arch.config_lock); 1331 + 1332 + if (test_bit(KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE, 1333 + &vcpu->kvm->arch.flags)) { 1334 + ret = -EBUSY; 1335 + goto out; 1336 + } 1581 1337 1582 1338 switch (attr->attr) { 1583 1339 case KVM_ARM_VCPU_TIMER_IRQ_VTIMER: 1584 - set_timer_irqs(vcpu->kvm, irq, ptimer->irq.irq); 1340 + idx = TIMER_VTIMER; 1585 1341 break; 1586 1342 case KVM_ARM_VCPU_TIMER_IRQ_PTIMER: 1587 - set_timer_irqs(vcpu->kvm, vtimer->irq.irq, irq); 1343 + idx = TIMER_PTIMER; 1344 + break; 1345 + case KVM_ARM_VCPU_TIMER_IRQ_HVTIMER: 1346 + idx = TIMER_HVTIMER; 1347 + break; 1348 + case KVM_ARM_VCPU_TIMER_IRQ_HPTIMER: 1349 + idx = TIMER_HPTIMER; 1588 1350 break; 1589 1351 default: 1590 - return -ENXIO; 1352 + ret = -ENXIO; 1353 + goto out; 1591 1354 } 1592 1355 1593 - 
return 0; 1356 + /* 1357 + * We cannot validate the IRQ unicity before we run, so take it at 1358 + * face value. The verdict will be given on first vcpu run, for each 1359 + * vcpu. Yes this is late. Blame it on the stupid API. 1360 + */ 1361 + vcpu->kvm->arch.timer_data.ppi[idx] = irq; 1362 + 1363 + out: 1364 + mutex_unlock(&vcpu->kvm->arch.config_lock); 1365 + return ret; 1594 1366 } 1595 1367 1596 1368 int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) ··· 1627 1357 case KVM_ARM_VCPU_TIMER_IRQ_PTIMER: 1628 1358 timer = vcpu_ptimer(vcpu); 1629 1359 break; 1360 + case KVM_ARM_VCPU_TIMER_IRQ_HVTIMER: 1361 + timer = vcpu_hvtimer(vcpu); 1362 + break; 1363 + case KVM_ARM_VCPU_TIMER_IRQ_HPTIMER: 1364 + timer = vcpu_hptimer(vcpu); 1365 + break; 1630 1366 default: 1631 1367 return -ENXIO; 1632 1368 } 1633 1369 1634 - irq = timer->irq.irq; 1370 + irq = timer_irq(timer); 1635 1371 return put_user(irq, uaddr); 1636 1372 } 1637 1373 ··· 1646 1370 switch (attr->attr) { 1647 1371 case KVM_ARM_VCPU_TIMER_IRQ_VTIMER: 1648 1372 case KVM_ARM_VCPU_TIMER_IRQ_PTIMER: 1373 + case KVM_ARM_VCPU_TIMER_IRQ_HVTIMER: 1374 + case KVM_ARM_VCPU_TIMER_IRQ_HPTIMER: 1649 1375 return 0; 1650 1376 } 1651 1377 1652 1378 return -ENXIO; 1379 + } 1380 + 1381 + int kvm_vm_ioctl_set_counter_offset(struct kvm *kvm, 1382 + struct kvm_arm_counter_offset *offset) 1383 + { 1384 + int ret = 0; 1385 + 1386 + if (offset->reserved) 1387 + return -EINVAL; 1388 + 1389 + mutex_lock(&kvm->lock); 1390 + 1391 + if (lock_all_vcpus(kvm)) { 1392 + set_bit(KVM_ARCH_FLAG_VM_COUNTER_OFFSET, &kvm->arch.flags); 1393 + 1394 + /* 1395 + * If userspace decides to set the offset using this 1396 + * API rather than merely restoring the counter 1397 + * values, the offset applies to both the virtual and 1398 + * physical views. 
1399 + */ 1400 + kvm->arch.timer_data.voffset = offset->counter_offset; 1401 + kvm->arch.timer_data.poffset = offset->counter_offset; 1402 + 1403 + unlock_all_vcpus(kvm); 1404 + } else { 1405 + ret = -EBUSY; 1406 + } 1407 + 1408 + mutex_unlock(&kvm->lock); 1409 + 1410 + return ret; 1653 1411 }
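Reviewer note on the counter-offset hunks above: the offset is a number of counter cycles subtracted from what the guest sees, and userspace is expected to compute it, for example from a previously saved guest counter value. The arithmetic can be sketched as follows. This is a minimal model only; `guest_counter_view` and `offset_for_resume` are hypothetical names that do not exist in the kernel or in any KVM header.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the counter value a guest observes, given the raw
 * host counter and the VM-wide offset. Unsigned arithmetic wraps modulo
 * 2^64, which matches a free-running 64-bit counter.
 */
static inline uint64_t guest_counter_view(uint64_t host_cnt, uint64_t offset)
{
	return host_cnt - offset;
}

/*
 * Hypothetical helper: the offset that makes the guest resume at
 * saved_guest_cnt when the host counter currently reads host_cnt.
 */
static inline uint64_t offset_for_resume(uint64_t host_cnt,
					 uint64_t saved_guest_cnt)
{
	return host_cnt - saved_guest_cnt;
}
```

Under these assumptions, a VMM restoring a guest would pass the result of `offset_for_resume()` as `counter_offset` and leave `reserved` at 0, since the handler above rejects a nonzero `reserved` with -EINVAL.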

+136 -14

arch/arm64/kvm/arm.c

··· 126 126 { 127 127 int ret; 128 128 129 + mutex_init(&kvm->arch.config_lock); 130 + 131 + #ifdef CONFIG_LOCKDEP 132 + /* Clue in lockdep that the config_lock must be taken inside kvm->lock */ 133 + mutex_lock(&kvm->lock); 134 + mutex_lock(&kvm->arch.config_lock); 135 + mutex_unlock(&kvm->arch.config_lock); 136 + mutex_unlock(&kvm->lock); 137 + #endif 138 + 129 139 ret = kvm_share_hyp(kvm, kvm + 1); 130 140 if (ret) 131 141 return ret; ··· 155 145 goto err_free_cpumask; 156 146 157 147 kvm_vgic_early_init(kvm); 148 + 149 + kvm_timer_init_vm(kvm); 158 150 159 151 /* The maximum number of VCPUs is limited by the host's GIC model */ 160 152 kvm->max_vcpus = kvm_arm_default_max_vcpus(); ··· 202 190 kvm_destroy_vcpus(kvm); 203 191 204 192 kvm_unshare_hyp(kvm, kvm + 1); 193 + 194 + kvm_arm_teardown_hypercalls(kvm); 205 195 } 206 196 207 197 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) ··· 233 219 case KVM_CAP_PTP_KVM: 234 220 case KVM_CAP_ARM_SYSTEM_SUSPEND: 235 221 case KVM_CAP_IRQFD_RESAMPLE: 222 + case KVM_CAP_COUNTER_OFFSET: 236 223 r = 1; 237 224 break; 238 225 case KVM_CAP_SET_GUEST_DEBUG2: ··· 339 324 int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) 340 325 { 341 326 int err; 327 + 328 + spin_lock_init(&vcpu->arch.mp_state_lock); 329 + 330 + #ifdef CONFIG_LOCKDEP 331 + /* Inform lockdep that the config_lock is acquired after vcpu->mutex */ 332 + mutex_lock(&vcpu->mutex); 333 + mutex_lock(&vcpu->kvm->arch.config_lock); 334 + mutex_unlock(&vcpu->kvm->arch.config_lock); 335 + mutex_unlock(&vcpu->mutex); 336 + #endif 342 337 343 338 /* Force users to call KVM_ARM_VCPU_INIT */ 344 339 vcpu->arch.target = -1; ··· 467 442 vcpu->cpu = -1; 468 443 } 469 444 470 - void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu) 445 + static void __kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu) 471 446 { 472 - vcpu->arch.mp_state.mp_state = KVM_MP_STATE_STOPPED; 447 + WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_STOPPED); 473 448 kvm_make_request(KVM_REQ_SLEEP, 
vcpu); 474 449 kvm_vcpu_kick(vcpu); 475 450 } 476 451 452 + void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu) 453 + { 454 + spin_lock(&vcpu->arch.mp_state_lock); 455 + __kvm_arm_vcpu_power_off(vcpu); 456 + spin_unlock(&vcpu->arch.mp_state_lock); 457 + } 458 + 477 459 bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu) 478 460 { 479 - return vcpu->arch.mp_state.mp_state == KVM_MP_STATE_STOPPED; 461 + return READ_ONCE(vcpu->arch.mp_state.mp_state) == KVM_MP_STATE_STOPPED; 480 462 } 481 463 482 464 static void kvm_arm_vcpu_suspend(struct kvm_vcpu *vcpu) 483 465 { 484 - vcpu->arch.mp_state.mp_state = KVM_MP_STATE_SUSPENDED; 466 + WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_SUSPENDED); 485 467 kvm_make_request(KVM_REQ_SUSPEND, vcpu); 486 468 kvm_vcpu_kick(vcpu); 487 469 } 488 470 489 471 static bool kvm_arm_vcpu_suspended(struct kvm_vcpu *vcpu) 490 472 { 491 - return vcpu->arch.mp_state.mp_state == KVM_MP_STATE_SUSPENDED; 473 + return READ_ONCE(vcpu->arch.mp_state.mp_state) == KVM_MP_STATE_SUSPENDED; 492 474 } 493 475 494 476 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu, 495 477 struct kvm_mp_state *mp_state) 496 478 { 497 - *mp_state = vcpu->arch.mp_state; 479 + *mp_state = READ_ONCE(vcpu->arch.mp_state); 498 480 499 481 return 0; 500 482 } ··· 511 479 { 512 480 int ret = 0; 513 481 482 + spin_lock(&vcpu->arch.mp_state_lock); 483 + 514 484 switch (mp_state->mp_state) { 515 485 case KVM_MP_STATE_RUNNABLE: 516 - vcpu->arch.mp_state = *mp_state; 486 + WRITE_ONCE(vcpu->arch.mp_state, *mp_state); 517 487 break; 518 488 case KVM_MP_STATE_STOPPED: 519 - kvm_arm_vcpu_power_off(vcpu); 489 + __kvm_arm_vcpu_power_off(vcpu); 520 490 break; 521 491 case KVM_MP_STATE_SUSPENDED: 522 492 kvm_arm_vcpu_suspend(vcpu); ··· 526 492 default: 527 493 ret = -EINVAL; 528 494 } 495 + 496 + spin_unlock(&vcpu->arch.mp_state_lock); 529 497 530 498 return ret; 531 499 } ··· 628 592 if (kvm_vm_is_protected(kvm)) 629 593 kvm_call_hyp_nvhe(__pkvm_vcpu_init_traps, vcpu); 630 594 
631 - mutex_lock(&kvm->lock); 595 + mutex_lock(&kvm->arch.config_lock); 632 596 set_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags); 633 - mutex_unlock(&kvm->lock); 597 + mutex_unlock(&kvm->arch.config_lock); 634 598 635 599 return ret; 636 600 } ··· 1245 1209 /* 1246 1210 * Handle the "start in power-off" case. 1247 1211 */ 1212 + spin_lock(&vcpu->arch.mp_state_lock); 1213 + 1248 1214 if (test_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features)) 1249 - kvm_arm_vcpu_power_off(vcpu); 1215 + __kvm_arm_vcpu_power_off(vcpu); 1250 1216 else 1251 - vcpu->arch.mp_state.mp_state = KVM_MP_STATE_RUNNABLE; 1217 + WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_RUNNABLE); 1218 + 1219 + spin_unlock(&vcpu->arch.mp_state_lock); 1252 1220 1253 1221 return 0; 1254 1222 } ··· 1478 1438 } 1479 1439 } 1480 1440 1481 - long kvm_arch_vm_ioctl(struct file *filp, 1482 - unsigned int ioctl, unsigned long arg) 1441 + static int kvm_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr) 1442 + { 1443 + switch (attr->group) { 1444 + case KVM_ARM_VM_SMCCC_CTRL: 1445 + return kvm_vm_smccc_has_attr(kvm, attr); 1446 + default: 1447 + return -ENXIO; 1448 + } 1449 + } 1450 + 1451 + static int kvm_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr) 1452 + { 1453 + switch (attr->group) { 1454 + case KVM_ARM_VM_SMCCC_CTRL: 1455 + return kvm_vm_smccc_set_attr(kvm, attr); 1456 + default: 1457 + return -ENXIO; 1458 + } 1459 + } 1460 + 1461 + int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 1483 1462 { 1484 1463 struct kvm *kvm = filp->private_data; 1485 1464 void __user *argp = (void __user *)arg; 1465 + struct kvm_device_attr attr; 1486 1466 1487 1467 switch (ioctl) { 1488 1468 case KVM_CREATE_IRQCHIP: { ··· 1538 1478 return -EFAULT; 1539 1479 return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags); 1540 1480 } 1481 + case KVM_ARM_SET_COUNTER_OFFSET: { 1482 + struct kvm_arm_counter_offset offset; 1483 + 1484 + if (copy_from_user(&offset, argp, sizeof(offset))) 
1485 + return -EFAULT; 1486 + return kvm_vm_ioctl_set_counter_offset(kvm, &offset); 1487 + } 1488 + case KVM_HAS_DEVICE_ATTR: { 1489 + if (copy_from_user(&attr, argp, sizeof(attr))) 1490 + return -EFAULT; 1491 + 1492 + return kvm_vm_has_attr(kvm, &attr); 1493 + } 1494 + case KVM_SET_DEVICE_ATTR: { 1495 + if (copy_from_user(&attr, argp, sizeof(attr))) 1496 + return -EFAULT; 1497 + 1498 + return kvm_vm_set_attr(kvm, &attr); 1499 + } 1541 1500 default: 1542 1501 return -EINVAL; 1543 1502 } 1503 + } 1504 + 1505 + /* unlocks vcpus from @vcpu_lock_idx and smaller */ 1506 + static void unlock_vcpus(struct kvm *kvm, int vcpu_lock_idx) 1507 + { 1508 + struct kvm_vcpu *tmp_vcpu; 1509 + 1510 + for (; vcpu_lock_idx >= 0; vcpu_lock_idx--) { 1511 + tmp_vcpu = kvm_get_vcpu(kvm, vcpu_lock_idx); 1512 + mutex_unlock(&tmp_vcpu->mutex); 1513 + } 1514 + } 1515 + 1516 + void unlock_all_vcpus(struct kvm *kvm) 1517 + { 1518 + lockdep_assert_held(&kvm->lock); 1519 + 1520 + unlock_vcpus(kvm, atomic_read(&kvm->online_vcpus) - 1); 1521 + } 1522 + 1523 + /* Returns true if all vcpus were locked, false otherwise */ 1524 + bool lock_all_vcpus(struct kvm *kvm) 1525 + { 1526 + struct kvm_vcpu *tmp_vcpu; 1527 + unsigned long c; 1528 + 1529 + lockdep_assert_held(&kvm->lock); 1530 + 1531 + /* 1532 + * Any time a vcpu is in an ioctl (including running), the 1533 + * core KVM code tries to grab the vcpu->mutex. 1534 + * 1535 + * By grabbing the vcpu->mutex of all VCPUs we ensure that no 1536 + * other VCPUs can fiddle with the state while we access it. 1537 + */ 1538 + kvm_for_each_vcpu(c, tmp_vcpu, kvm) { 1539 + if (!mutex_trylock(&tmp_vcpu->mutex)) { 1540 + unlock_vcpus(kvm, c - 1); 1541 + return false; 1542 + } 1543 + } 1544 + 1545 + return true; 1544 1546 } 1545 1547 1546 1548 static unsigned long nvhe_percpu_size(void)
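Reviewer note on `lock_all_vcpus()` / `unlock_vcpus()` above: the pattern is "take every lock without blocking, and on the first failure release the ones already taken". A standalone model of that all-or-nothing acquisition might look like the sketch below. The `toy_*` names are hypothetical, and a plain flag stands in for `vcpu->mutex`, so nothing here is thread-safe; it only illustrates the control flow.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vcpu mutex: "held" is true while owned. */
struct toy_lock {
	bool held;
};

/* Non-blocking acquire: returns true on success, false if already held. */
static bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_unlock(struct toy_lock *l)
{
	l->held = false;
}

/* Release locks [0, count) in reverse order of acquisition. */
static void toy_unlock_range(struct toy_lock *locks, size_t count)
{
	while (count > 0)
		toy_unlock(&locks[--count]);
}

/* Take every lock or none: roll back on the first contended lock. */
static bool toy_lock_all(struct toy_lock *locks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!toy_trylock(&locks[i])) {
			toy_unlock_range(locks, i);
			return false;
		}
	}
	return true;
}
```

Because the acquisition never blocks, a caller that gets `false` can report a busy condition instead of waiting, which is how the counter-offset ioctl ends up returning -EBUSY.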

+25 -14

arch/arm64/kvm/guest.c

··· 590 590 return copy_core_reg_indices(vcpu, NULL); 591 591 } 592 592 593 - /** 594 - * ARM64 versions of the TIMER registers, always available on arm64 595 - */ 593 + static const u64 timer_reg_list[] = { 594 + KVM_REG_ARM_TIMER_CTL, 595 + KVM_REG_ARM_TIMER_CNT, 596 + KVM_REG_ARM_TIMER_CVAL, 597 + KVM_REG_ARM_PTIMER_CTL, 598 + KVM_REG_ARM_PTIMER_CNT, 599 + KVM_REG_ARM_PTIMER_CVAL, 600 + }; 596 601 597 - #define NUM_TIMER_REGS 3 602 + #define NUM_TIMER_REGS ARRAY_SIZE(timer_reg_list) 598 603 599 604 static bool is_timer_reg(u64 index) 600 605 { ··· 607 602 case KVM_REG_ARM_TIMER_CTL: 608 603 case KVM_REG_ARM_TIMER_CNT: 609 604 case KVM_REG_ARM_TIMER_CVAL: 605 + case KVM_REG_ARM_PTIMER_CTL: 606 + case KVM_REG_ARM_PTIMER_CNT: 607 + case KVM_REG_ARM_PTIMER_CVAL: 610 608 return true; 611 609 } 612 610 return false; ··· 617 609 618 610 static int copy_timer_indices(struct kvm_vcpu *vcpu, u64 __user *uindices) 619 611 { 620 - if (put_user(KVM_REG_ARM_TIMER_CTL, uindices)) 621 - return -EFAULT; 622 - uindices++; 623 - if (put_user(KVM_REG_ARM_TIMER_CNT, uindices)) 624 - return -EFAULT; 625 - uindices++; 626 - if (put_user(KVM_REG_ARM_TIMER_CVAL, uindices)) 627 - return -EFAULT; 612 + for (int i = 0; i < NUM_TIMER_REGS; i++) { 613 + if (put_user(timer_reg_list[i], uindices)) 614 + return -EFAULT; 615 + uindices++; 616 + } 628 617 629 618 return 0; 630 619 } ··· 962 957 963 958 switch (attr->group) { 964 959 case KVM_ARM_VCPU_PMU_V3_CTRL: 960 + mutex_lock(&vcpu->kvm->arch.config_lock); 965 961 ret = kvm_arm_pmu_v3_set_attr(vcpu, attr); 962 + mutex_unlock(&vcpu->kvm->arch.config_lock); 966 963 break; 967 964 case KVM_ARM_VCPU_TIMER_CTRL: 968 965 ret = kvm_arm_timer_set_attr(vcpu, attr); ··· 1026 1019 return ret; 1027 1020 } 1028 1021 1029 - long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm, 1030 - struct kvm_arm_copy_mte_tags *copy_tags) 1022 + int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm, 1023 + struct kvm_arm_copy_mte_tags *copy_tags) 1031 1024 { 1032 1025 gpa_t 
guest_ipa = copy_tags->guest_ipa; 1033 1026 size_t length = copy_tags->length; ··· 1046 1039 return -EINVAL; 1047 1040 1048 1041 if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK) 1042 + return -EINVAL; 1043 + 1044 + /* Lengths above INT_MAX cannot be represented in the return value */ 1045 + if (length > INT_MAX) 1049 1046 return -EINVAL; 1050 1047 1051 1048 gfn = gpa_to_gfn(guest_ipa);
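Reviewer note on the new length check above: combined with the existing alignment test, a page-aligned `length` that fits in an `int` can be at most 2^31 - PAGE_SIZE, which is the limit the documentation change states. A userspace-side pre-check could be sketched as below. The function name and the 4 KiB page size are assumptions for illustration; the kernel function performs further checks that are not modelled here.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages (arm64 also supports 16/64 KiB). */
#define TOY_PAGE_SIZE 4096ULL
#define TOY_PAGE_MASK (~(TOY_PAGE_SIZE - 1))

/* Hypothetical pre-check mirroring the two tests visible in the hunk. */
static int toy_check_copy_tags_args(uint64_t guest_ipa, uint64_t length)
{
	/* Both the address and the length must be page aligned. */
	if ((length & ~TOY_PAGE_MASK) || (guest_ipa & ~TOY_PAGE_MASK))
		return -EINVAL;

	/* The ioctl reports progress through an int return value. */
	if (length > INT_MAX)
		return -EINVAL;

	return 0;
}
```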

+12 -24

arch/arm64/kvm/handle_exit.c

··· 36 36 37 37 static int handle_hvc(struct kvm_vcpu *vcpu) 38 38 { 39 - int ret; 40 - 41 39 trace_kvm_hvc_arm64(*vcpu_pc(vcpu), vcpu_get_reg(vcpu, 0), 42 40 kvm_vcpu_hvc_get_imm(vcpu)); 43 41 vcpu->stat.hvc_exit_stat++; ··· 50 52 return 1; 51 53 } 52 54 53 - ret = kvm_hvc_call_handler(vcpu); 54 - if (ret < 0) { 55 - vcpu_set_reg(vcpu, 0, ~0UL); 56 - return 1; 57 - } 58 - 59 - return ret; 55 + return kvm_smccc_call_handler(vcpu); 60 56 } 61 57 62 58 static int handle_smc(struct kvm_vcpu *vcpu) 63 59 { 64 - int ret; 65 - 66 60 /* 67 61 * "If an SMC instruction executed at Non-secure EL1 is 68 62 * trapped to EL2 because HCR_EL2.TSC is 1, the exception is a 69 63 * Trap exception, not a Secure Monitor Call exception [...]" 70 64 * 71 65 * We need to advance the PC after the trap, as it would 72 - * otherwise return to the same address... 73 - * 74 - * Only handle SMCs from the virtual EL2 with an immediate of zero and 75 - * skip it otherwise. 66 + * otherwise return to the same address. Furthermore, pre-incrementing 67 + * the PC before potentially exiting to userspace maintains the same 68 + * abstraction for both SMCs and HVCs. 76 69 */ 77 - if (!vcpu_is_el2(vcpu) || kvm_vcpu_hvc_get_imm(vcpu)) { 70 + kvm_incr_pc(vcpu); 71 + 72 + /* 73 + * SMCs with a nonzero immediate are reserved according to DEN0028E 2.9 74 + * "SMC and HVC immediate value". 75 + */ 76 + if (kvm_vcpu_hvc_get_imm(vcpu)) { 78 77 vcpu_set_reg(vcpu, 0, ~0UL); 79 - kvm_incr_pc(vcpu); 80 78 return 1; 81 79 } 82 80 ··· 83 89 * at Non-secure EL1 is trapped to EL2 if HCR_EL2.TSC==1, rather than 84 90 * being treated as UNDEFINED. 85 91 */ 86 - ret = kvm_hvc_call_handler(vcpu); 87 - if (ret < 0) 88 - vcpu_set_reg(vcpu, 0, ~0UL); 89 - 90 - kvm_incr_pc(vcpu); 91 - 92 - return ret; 92 + return kvm_smccc_call_handler(vcpu); 93 93 } 94 94 95 95 /*

+53

arch/arm64/kvm/hyp/include/hyp/switch.h

··· 26 26 #include <asm/kvm_emulate.h> 27 27 #include <asm/kvm_hyp.h> 28 28 #include <asm/kvm_mmu.h> 29 + #include <asm/kvm_nested.h> 29 30 #include <asm/fpsimd.h> 30 31 #include <asm/debug-monitors.h> 31 32 #include <asm/processor.h> ··· 327 326 return true; 328 327 } 329 328 329 + static bool kvm_hyp_handle_cntpct(struct kvm_vcpu *vcpu) 330 + { 331 + struct arch_timer_context *ctxt; 332 + u32 sysreg; 333 + u64 val; 334 + 335 + /* 336 + * We only get here for 64bit guests, 32bit guests will hit 337 + * the long and winding road all the way to the standard 338 + * handling. Yes, it sucks to be irrelevant. 339 + */ 340 + sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu)); 341 + 342 + switch (sysreg) { 343 + case SYS_CNTPCT_EL0: 344 + case SYS_CNTPCTSS_EL0: 345 + if (vcpu_has_nv(vcpu)) { 346 + if (is_hyp_ctxt(vcpu)) { 347 + ctxt = vcpu_hptimer(vcpu); 348 + break; 349 + } 350 + 351 + /* Check for guest hypervisor trapping */ 352 + val = __vcpu_sys_reg(vcpu, CNTHCTL_EL2); 353 + if (!vcpu_el2_e2h_is_set(vcpu)) 354 + val = (val & CNTHCTL_EL1PCTEN) << 10; 355 + 356 + if (!(val & (CNTHCTL_EL1PCTEN << 10))) 357 + return false; 358 + } 359 + 360 + ctxt = vcpu_ptimer(vcpu); 361 + break; 362 + default: 363 + return false; 364 + } 365 + 366 + val = arch_timer_read_cntpct_el0(); 367 + 368 + if (ctxt->offset.vm_offset) 369 + val -= *kern_hyp_va(ctxt->offset.vm_offset); 370 + if (ctxt->offset.vcpu_offset) 371 + val -= *kern_hyp_va(ctxt->offset.vcpu_offset); 372 + 373 + vcpu_set_reg(vcpu, kvm_vcpu_sys_get_rt(vcpu), val); 374 + __kvm_skip_instr(vcpu); 375 + return true; 376 + } 377 + 330 378 static bool kvm_hyp_handle_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code) 331 379 { 332 380 if (cpus_have_final_cap(ARM64_WORKAROUND_CAVIUM_TX2_219_TVM) && ··· 388 338 389 339 if (esr_is_ptrauth_trap(kvm_vcpu_get_esr(vcpu))) 390 340 return kvm_hyp_handle_ptrauth(vcpu, exit_code); 341 + 342 + if (kvm_hyp_handle_cntpct(vcpu)) 343 + return true; 391 344 392 345 return false; 393 346 }
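Reviewer note on `kvm_hyp_handle_cntpct()` above: the handler first normalises the guest hypervisor's CNTHCTL_EL2 trap bit to its E2H position (shifted by 10), then subtracts the VM-wide and per-vcpu offsets from the raw counter. The sketch below models both steps in plain C. The `toy_*` names are hypothetical, and the bit position of EL1PCTEN (bit 0 in the non-E2H layout) is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed non-E2H position of the EL1 physical counter enable bit. */
#define TOY_CNTHCTL_EL1PCTEN (1ULL << 0)

/* True if the guest hypervisor lets EL1 read the counter without a trap. */
static bool toy_guest_hyp_allows_cntpct(uint64_t cnthctl, bool e2h)
{
	/* Without E2H the bit lives 10 positions lower: move it up. */
	if (!e2h)
		cnthctl = (cnthctl & TOY_CNTHCTL_EL1PCTEN) << 10;

	return (cnthctl & (TOY_CNTHCTL_EL1PCTEN << 10)) != 0;
}

/* Apply whichever offsets are present; a NULL pointer means "no offset". */
static uint64_t toy_emulated_cntpct(uint64_t raw, const uint64_t *vm_offset,
				    const uint64_t *vcpu_offset)
{
	if (vm_offset)
		raw -= *vm_offset;
	if (vcpu_offset)
		raw -= *vcpu_offset;
	return raw;
}
```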

-2

arch/arm64/kvm/hyp/nvhe/debug-sr.c

··· 37 37 38 38 /* Now drain all buffered data to memory */ 39 39 psb_csync(); 40 - dsb(nsh); 41 40 } 42 41 43 42 static void __debug_restore_spe(u64 pmscr_el1) ··· 68 69 isb(); 69 70 /* Drain the trace buffer to memory */ 70 71 tsb_csync(); 71 - dsb(nsh); 72 72 } 73 73 74 74 static void __debug_restore_trace(u64 trfcr_el1)

+7

arch/arm64/kvm/hyp/nvhe/mem_protect.c

··· 297 297 params->vttbr = kvm_get_vttbr(mmu); 298 298 params->vtcr = host_mmu.arch.vtcr; 299 299 params->hcr_el2 |= HCR_VM; 300 + 301 + /* 302 + * The CMO below not only cleans the updated params to the 303 + * PoC, but also provides the DSB that ensures ongoing 304 + * page-table walks that have started before we trapped to EL2 305 + * have completed. 306 + */ 300 307 kvm_flush_dcache_to_poc(params, sizeof(*params)); 301 308 302 309 write_sysreg(params->hcr_el2, hcr_el2);

+18

arch/arm64/kvm/hyp/nvhe/switch.c

··· 272 272 */ 273 273 __debug_save_host_buffers_nvhe(vcpu); 274 274 275 + /* 276 + * We're about to restore some new MMU state. Make sure 277 + * ongoing page-table walks that have started before we 278 + * trapped to EL2 have completed. This also synchronises the 279 + * above disabling of SPE and TRBE. 280 + * 281 + * See DDI0487I.a D8.1.5 "Out-of-context translation regimes", 282 + * rule R_LFHQG and subsequent information statements. 283 + */ 284 + dsb(nsh); 285 + 275 286 __kvm_adjust_pc(vcpu); 276 287 277 288 /* ··· 316 305 __sysreg32_save_state(vcpu); 317 306 __timer_disable_traps(vcpu); 318 307 __hyp_vgic_save_state(vcpu); 308 + 309 + /* 310 + * Same thing as before the guest run: we're about to switch 311 + * the MMU context, so let's make sure we don't have any 312 + * ongoing EL1&0 translations. 313 + */ 314 + dsb(nsh); 319 315 320 316 __deactivate_traps(vcpu); 321 317 __load_host_stage2();

+12 -6

arch/arm64/kvm/hyp/nvhe/timer-sr.c

··· 9 9 #include <linux/kvm_host.h> 10 10 11 11 #include <asm/kvm_hyp.h> 12 + #include <asm/kvm_mmu.h> 12 13 13 14 void __kvm_timer_set_cntvoff(u64 cntvoff) 14 15 { ··· 36 35 */ 37 36 void __timer_enable_traps(struct kvm_vcpu *vcpu) 38 37 { 39 - u64 val; 38 + u64 clr = 0, set = 0; 40 39 41 40 /* 42 41 * Disallow physical timer access for the guest 43 - * Physical counter access is allowed 42 + * Physical counter access is allowed if no offset is enforced 43 + * or running protected (we don't offset anything in this case). 44 44 */ 45 - val = read_sysreg(cnthctl_el2); 46 - val &= ~CNTHCTL_EL1PCEN; 47 - val |= CNTHCTL_EL1PCTEN; 48 - write_sysreg(val, cnthctl_el2); 45 + clr = CNTHCTL_EL1PCEN; 46 + if (is_protected_kvm_enabled() || 47 + !kern_hyp_va(vcpu->kvm)->arch.timer_data.poffset) 48 + set |= CNTHCTL_EL1PCTEN; 49 + else 50 + clr |= CNTHCTL_EL1PCTEN; 51 + 52 + sysreg_clear_set(cnthctl_el2, clr, set); 49 53 }
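Reviewer note on `__timer_enable_traps()` above: the function now builds a clear mask and a set mask and applies them in one update. Physical timer access is always disallowed, while physical counter access stays untrapped only when running protected or when no physical offset is in force. A plain C model of that decision follows. The `toy_*` names and the bit positions are assumptions for illustration, and the read-modify-write helper only models the effect of a clear-then-set register update.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions assumed for illustration (non-E2H layout). */
#define TOY_CNTHCTL_EL1PCTEN (1ULL << 0) /* EL1 physical counter access */
#define TOY_CNTHCTL_EL1PCEN  (1ULL << 1) /* EL1 physical timer access */

/* Model of a clear-then-set register update. */
static uint64_t toy_clear_set(uint64_t old, uint64_t clr, uint64_t set)
{
	return (old & ~clr) | set;
}

/* Returns the register value the trap setup would produce. */
static uint64_t toy_timer_enable_traps(uint64_t cnthctl, bool protected_mode,
				       uint64_t poffset)
{
	/* Timer access is always taken away from the guest. */
	uint64_t clr = TOY_CNTHCTL_EL1PCEN, set = 0;

	/* Counter access is direct unless an offset must be emulated. */
	if (protected_mode || !poffset)
		set |= TOY_CNTHCTL_EL1PCTEN;
	else
		clr |= TOY_CNTHCTL_EL1PCTEN;

	return toy_clear_set(cnthctl, clr, set);
}
```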

+29 -9

arch/arm64/kvm/hyp/nvhe/tlb.c

··· 15 15 }; 16 16 17 17 static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu, 18 - struct tlb_inv_context *cxt) 18 + struct tlb_inv_context *cxt, 19 + bool nsh) 19 20 { 21 + /* 22 + * We have two requirements: 23 + * 24 + * - ensure that the page table updates are visible to all 25 + * CPUs, for which a dsb(DOMAIN-st) is what we need, DOMAIN 26 + * being either ish or nsh, depending on the invalidation 27 + * type. 28 + * 29 + * - complete any speculative page table walk started before 30 + * we trapped to EL2 so that we can mess with the MM 31 + * registers out of context, for which dsb(nsh) is enough 32 + * 33 + * The composition of these two barriers is a dsb(DOMAIN), and 34 + * the 'nsh' parameter tracks the distinction between 35 + * Inner-Shareable and Non-Shareable, as specified by the 36 + * callers. 37 + */ 38 + if (nsh) 39 + dsb(nsh); 40 + else 41 + dsb(ish); 42 + 20 43 if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) { 21 44 u64 val; 22 45 ··· 83 60 { 84 61 struct tlb_inv_context cxt; 85 62 86 - dsb(ishst); 87 - 88 63 /* Switch to requested VMID */ 89 - __tlb_switch_to_guest(mmu, &cxt); 64 + __tlb_switch_to_guest(mmu, &cxt, false); 90 65 91 66 /* 92 67 * We could do so much better if we had the VA as well. ··· 134 113 { 135 114 struct tlb_inv_context cxt; 136 115 137 - dsb(ishst); 138 - 139 116 /* Switch to requested VMID */ 140 - __tlb_switch_to_guest(mmu, &cxt); 117 + __tlb_switch_to_guest(mmu, &cxt, false); 141 118 142 119 __tlbi(vmalls12e1is); 143 120 dsb(ish); ··· 149 130 struct tlb_inv_context cxt; 150 131 151 132 /* Switch to requested VMID */ 152 - __tlb_switch_to_guest(mmu, &cxt); 133 + __tlb_switch_to_guest(mmu, &cxt, false); 153 134 154 135 __tlbi(vmalle1); 155 136 asm volatile("ic iallu"); ··· 161 142 162 143 void __kvm_flush_vm_context(void) 163 144 { 164 - dsb(ishst); 145 + /* Same remark as in __tlb_switch_to_guest() */ 146 + dsb(ish); 165 147 __tlbi(alle1is); 166 148 167 149 /*

+3 -4

arch/arm64/kvm/hyp/vhe/switch.c

··· 227 227 228 228 /* 229 229 * When we exit from the guest we change a number of CPU configuration 230 - * parameters, such as traps. Make sure these changes take effect 231 - * before running the host or additional guests. 230 + * parameters, such as traps. We rely on the isb() in kvm_call_hyp*() 231 + * to make sure these changes take effect before running the host or 232 + * additional guests. 232 233 */ 233 - isb(); 234 - 235 234 return ret; 236 235 } 237 236

+12

arch/arm64/kvm/hyp/vhe/sysreg-sr.c

··· 13 13 #include <asm/kvm_asm.h> 14 14 #include <asm/kvm_emulate.h> 15 15 #include <asm/kvm_hyp.h> 16 + #include <asm/kvm_nested.h> 16 17 17 18 /* 18 19 * VHE: Host and guest must save mdscr_el1 and sp_el0 (and the PC and ··· 69 68 70 69 host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt; 71 70 __sysreg_save_user_state(host_ctxt); 71 + 72 + /* 73 + * When running a normal EL1 guest, we only load a new vcpu 74 + * after a context switch, which involves a DSB, so all 75 + * speculative EL1&0 walks will have already completed. 76 + * If running NV, the vcpu may transition between vEL1 and 77 + * vEL2 without a context switch, so make sure we complete 78 + * those walks before loading a new context. 79 + */ 80 + if (vcpu_has_nv(vcpu)) 81 + dsb(nsh); 72 82 73 83 /* 74 84 * Load guest EL1 and user state

+179 -10

arch/arm64/kvm/hypercalls.c

··· 47 47 cycles = systime_snapshot.cycles - vcpu->kvm->arch.timer_data.voffset; 48 48 break; 49 49 case KVM_PTP_PHYS_COUNTER: 50 - cycles = systime_snapshot.cycles; 50 + cycles = systime_snapshot.cycles - vcpu->kvm->arch.timer_data.poffset; 51 51 break; 52 52 default: 53 53 return; ··· 65 65 val[3] = lower_32_bits(cycles); 66 66 } 67 67 68 - static bool kvm_hvc_call_default_allowed(u32 func_id) 68 + static bool kvm_smccc_default_allowed(u32 func_id) 69 69 { 70 70 switch (func_id) { 71 71 /* ··· 93 93 } 94 94 } 95 95 96 - static bool kvm_hvc_call_allowed(struct kvm_vcpu *vcpu, u32 func_id) 96 + static bool kvm_smccc_test_fw_bmap(struct kvm_vcpu *vcpu, u32 func_id) 97 97 { 98 98 struct kvm_smccc_features *smccc_feat = &vcpu->kvm->arch.smccc_feat; 99 99 ··· 117 117 return test_bit(KVM_REG_ARM_VENDOR_HYP_BIT_PTP, 118 118 &smccc_feat->vendor_hyp_bmap); 119 119 default: 120 - return kvm_hvc_call_default_allowed(func_id); 120 + return false; 121 121 } 122 122 } 123 123 124 - int kvm_hvc_call_handler(struct kvm_vcpu *vcpu) 124 + #define SMC32_ARCH_RANGE_BEGIN ARM_SMCCC_VERSION_FUNC_ID 125 + #define SMC32_ARCH_RANGE_END ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 126 + ARM_SMCCC_SMC_32, \ 127 + 0, ARM_SMCCC_FUNC_MASK) 128 + 129 + #define SMC64_ARCH_RANGE_BEGIN ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 130 + ARM_SMCCC_SMC_64, \ 131 + 0, 0) 132 + #define SMC64_ARCH_RANGE_END ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ 133 + ARM_SMCCC_SMC_64, \ 134 + 0, ARM_SMCCC_FUNC_MASK) 135 + 136 + static void init_smccc_filter(struct kvm *kvm) 137 + { 138 + int r; 139 + 140 + mt_init(&kvm->arch.smccc_filter); 141 + 142 + /* 143 + * Prevent userspace from handling any SMCCC calls in the architecture 144 + * range, avoiding the risk of misrepresenting Spectre mitigation status 145 + * to the guest. 
146 + */ 147 + r = mtree_insert_range(&kvm->arch.smccc_filter, 148 + SMC32_ARCH_RANGE_BEGIN, SMC32_ARCH_RANGE_END, 149 + xa_mk_value(KVM_SMCCC_FILTER_HANDLE), 150 + GFP_KERNEL_ACCOUNT); 151 + WARN_ON_ONCE(r); 152 + 153 + r = mtree_insert_range(&kvm->arch.smccc_filter, 154 + SMC64_ARCH_RANGE_BEGIN, SMC64_ARCH_RANGE_END, 155 + xa_mk_value(KVM_SMCCC_FILTER_HANDLE), 156 + GFP_KERNEL_ACCOUNT); 157 + WARN_ON_ONCE(r); 158 + 159 + } 160 + 161 + static int kvm_smccc_set_filter(struct kvm *kvm, struct kvm_smccc_filter __user *uaddr) 162 + { 163 + const void *zero_page = page_to_virt(ZERO_PAGE(0)); 164 + struct kvm_smccc_filter filter; 165 + u32 start, end; 166 + int r; 167 + 168 + if (copy_from_user(&filter, uaddr, sizeof(filter))) 169 + return -EFAULT; 170 + 171 + if (memcmp(filter.pad, zero_page, sizeof(filter.pad))) 172 + return -EINVAL; 173 + 174 + start = filter.base; 175 + end = start + filter.nr_functions - 1; 176 + 177 + if (end < start || filter.action >= NR_SMCCC_FILTER_ACTIONS) 178 + return -EINVAL; 179 + 180 + mutex_lock(&kvm->arch.config_lock); 181 + 182 + if (kvm_vm_has_ran_once(kvm)) { 183 + r = -EBUSY; 184 + goto out_unlock; 185 + } 186 + 187 + r = mtree_insert_range(&kvm->arch.smccc_filter, start, end, 188 + xa_mk_value(filter.action), GFP_KERNEL_ACCOUNT); 189 + if (r) 190 + goto out_unlock; 191 + 192 + set_bit(KVM_ARCH_FLAG_SMCCC_FILTER_CONFIGURED, &kvm->arch.flags); 193 + 194 + out_unlock: 195 + mutex_unlock(&kvm->arch.config_lock); 196 + return r; 197 + } 198 + 199 + static u8 kvm_smccc_filter_get_action(struct kvm *kvm, u32 func_id) 200 + { 201 + unsigned long idx = func_id; 202 + void *val; 203 + 204 + if (!test_bit(KVM_ARCH_FLAG_SMCCC_FILTER_CONFIGURED, &kvm->arch.flags)) 205 + return KVM_SMCCC_FILTER_HANDLE; 206 + 207 + /* 208 + * But where's the error handling, you say? 209 + * 210 + * mt_find() returns NULL if no entry was found, which just so happens 211 + * to match KVM_SMCCC_FILTER_HANDLE. 
212 + */ 213 + val = mt_find(&kvm->arch.smccc_filter, &idx, idx); 214 + return xa_to_value(val); 215 + } 216 + 217 + static u8 kvm_smccc_get_action(struct kvm_vcpu *vcpu, u32 func_id) 218 + { 219 + /* 220 + * Intervening actions in the SMCCC filter take precedence over the 221 + * pseudo-firmware register bitmaps. 222 + */ 223 + u8 action = kvm_smccc_filter_get_action(vcpu->kvm, func_id); 224 + if (action != KVM_SMCCC_FILTER_HANDLE) 225 + return action; 226 + 227 + if (kvm_smccc_test_fw_bmap(vcpu, func_id) || 228 + kvm_smccc_default_allowed(func_id)) 229 + return KVM_SMCCC_FILTER_HANDLE; 230 + 231 + return KVM_SMCCC_FILTER_DENY; 232 + } 233 + 234 + static void kvm_prepare_hypercall_exit(struct kvm_vcpu *vcpu, u32 func_id) 235 + { 236 + u8 ec = ESR_ELx_EC(kvm_vcpu_get_esr(vcpu)); 237 + struct kvm_run *run = vcpu->run; 238 + u64 flags = 0; 239 + 240 + if (ec == ESR_ELx_EC_SMC32 || ec == ESR_ELx_EC_SMC64) 241 + flags |= KVM_HYPERCALL_EXIT_SMC; 242 + 243 + if (!kvm_vcpu_trap_il_is32bit(vcpu)) 244 + flags |= KVM_HYPERCALL_EXIT_16BIT; 245 + 246 + run->exit_reason = KVM_EXIT_HYPERCALL; 247 + run->hypercall = (typeof(run->hypercall)) { 248 + .nr = func_id, 249 + .flags = flags, 250 + }; 251 + } 252 + 253 + int kvm_smccc_call_handler(struct kvm_vcpu *vcpu) 125 254 { 126 255 struct kvm_smccc_features *smccc_feat = &vcpu->kvm->arch.smccc_feat; 127 256 u32 func_id = smccc_get_function(vcpu); 128 257 u64 val[4] = {SMCCC_RET_NOT_SUPPORTED}; 129 258 u32 feature; 259 + u8 action; 130 260 gpa_t gpa; 131 261 132 - if (!kvm_hvc_call_allowed(vcpu, func_id)) 262 + action = kvm_smccc_get_action(vcpu, func_id); 263 + switch (action) { 264 + case KVM_SMCCC_FILTER_HANDLE: 265 + break; 266 + case KVM_SMCCC_FILTER_DENY: 133 267 goto out; 268 + case KVM_SMCCC_FILTER_FWD_TO_USER: 269 + kvm_prepare_hypercall_exit(vcpu, func_id); 270 + return 0; 271 + default: 272 + WARN_RATELIMIT(1, "Unhandled SMCCC filter action: %d\n", action); 273 + goto out; 274 + } 134 275 135 276 switch (func_id) { 136 
277 case ARM_SMCCC_VERSION_FUNC_ID: ··· 386 245 smccc_feat->std_bmap = KVM_ARM_SMCCC_STD_FEATURES; 387 246 smccc_feat->std_hyp_bmap = KVM_ARM_SMCCC_STD_HYP_FEATURES; 388 247 smccc_feat->vendor_hyp_bmap = KVM_ARM_SMCCC_VENDOR_HYP_FEATURES; 248 + 249 + init_smccc_filter(kvm); 250 + } 251 + 252 + void kvm_arm_teardown_hypercalls(struct kvm *kvm) 253 + { 254 + mtree_destroy(&kvm->arch.smccc_filter); 389 255 } 390 256 391 257 int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu) ··· 525 377 if (val & ~fw_reg_features) 526 378 return -EINVAL; 527 379 528 - mutex_lock(&kvm->lock); 380 + mutex_lock(&kvm->arch.config_lock); 529 381 530 - if (test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags) && 531 - val != *fw_reg_bmap) { 382 + if (kvm_vm_has_ran_once(kvm) && val != *fw_reg_bmap) { 532 383 ret = -EBUSY; 533 384 goto out; 534 385 } 535 386 536 387 WRITE_ONCE(*fw_reg_bmap, val); 537 388 out: 538 - mutex_unlock(&kvm->lock); 389 + mutex_unlock(&kvm->arch.config_lock); 539 390 return ret; 540 391 } 541 392 ··· 627 480 } 628 481 629 482 return -EINVAL; 483 + } 484 + 485 + int kvm_vm_smccc_has_attr(struct kvm *kvm, struct kvm_device_attr *attr) 486 + { 487 + switch (attr->attr) { 488 + case KVM_ARM_VM_SMCCC_FILTER: 489 + return 0; 490 + default: 491 + return -ENXIO; 492 + } 493 + } 494 + 495 + int kvm_vm_smccc_set_attr(struct kvm *kvm, struct kvm_device_attr *attr) 496 + { 497 + void __user *uaddr = (void __user *)attr->addr; 498 + 499 + switch (attr->attr) { 500 + case KVM_ARM_VM_SMCCC_FILTER: 501 + return kvm_smccc_set_filter(kvm, uaddr); 502 + default: 503 + return -ENXIO; 504 + } 630 505 }
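
Review note: the range validation in kvm_smccc_set_filter() relies on unsigned 32-bit wraparound, which is easy to misread in the flattened hunk. The following is a minimal user-space sketch of that check only. The struct, the function name, and SKETCH_NR_FILTER_ACTIONS are hypothetical stand-ins, not the kernel's definitions, and the maple tree insertion is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for NR_SMCCC_FILTER_ACTIONS; the value 3 is an assumption. */
#define SKETCH_NR_FILTER_ACTIONS 3

/* Hypothetical, simplified mirror of the fields the check reads. */
struct sketch_filter {
	uint32_t base;
	uint32_t nr_functions;
	uint8_t action;
};

/*
 * Compute the inclusive [start, end] range the way the hunk does and
 * reject it when the 32-bit arithmetic wraps or the action is unknown.
 */
static int sketch_filter_range(const struct sketch_filter *f,
			       uint32_t *start, uint32_t *end)
{
	uint32_t s, e;

	assert(f && start && end);

	s = f->base;
	e = s + f->nr_functions - 1;	/* wraps modulo 2^32 */

	if (e < s || f->action >= SKETCH_NR_FILTER_ACTIONS)
		return -EINVAL;

	*start = s;
	*end = e;
	return 0;
}
```

With a base of 0x84000000 and four functions the sketch yields the inclusive range 0x84000000..0x84000003. A range that runs past 2^32 produces an end below the start and is rejected, as is an empty range at a non-zero base.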

+7 -18

arch/arm64/kvm/pmu-emul.c

··· 876 876 struct arm_pmu *arm_pmu; 877 877 int ret = -ENXIO; 878 878 879 - mutex_lock(&kvm->lock); 879 + lockdep_assert_held(&kvm->arch.config_lock); 880 880 mutex_lock(&arm_pmus_lock); 881 881 882 882 list_for_each_entry(entry, &arm_pmus, entry) { 883 883 arm_pmu = entry->arm_pmu; 884 884 if (arm_pmu->pmu.type == pmu_id) { 885 - if (test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags) || 885 + if (kvm_vm_has_ran_once(kvm) || 886 886 (kvm->arch.pmu_filter && kvm->arch.arm_pmu != arm_pmu)) { 887 887 ret = -EBUSY; 888 888 break; ··· 896 896 } 897 897 898 898 mutex_unlock(&arm_pmus_lock); 899 - mutex_unlock(&kvm->lock); 900 899 return ret; 901 900 } 902 901 ··· 903 904 { 904 905 struct kvm *kvm = vcpu->kvm; 905 906 907 + lockdep_assert_held(&kvm->arch.config_lock); 908 + 906 909 if (!kvm_vcpu_has_pmu(vcpu)) 907 910 return -ENODEV; 908 911 909 912 if (vcpu->arch.pmu.created) 910 913 return -EBUSY; 911 914 912 - mutex_lock(&kvm->lock); 913 915 if (!kvm->arch.arm_pmu) { 914 916 /* No PMU set, get the default one */ 915 917 kvm->arch.arm_pmu = kvm_pmu_probe_armpmu(); 916 - if (!kvm->arch.arm_pmu) { 917 - mutex_unlock(&kvm->lock); 918 + if (!kvm->arch.arm_pmu) 918 919 return -ENODEV; 919 - } 920 920 } 921 - mutex_unlock(&kvm->lock); 922 921 923 922 switch (attr->attr) { 924 923 case KVM_ARM_VCPU_PMU_V3_IRQ: { ··· 960 963 filter.action != KVM_PMU_EVENT_DENY)) 961 964 return -EINVAL; 962 965 963 - mutex_lock(&kvm->lock); 964 - 965 - if (test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags)) { 966 - mutex_unlock(&kvm->lock); 966 + if (kvm_vm_has_ran_once(kvm)) 967 967 return -EBUSY; 968 - } 969 968 970 969 if (!kvm->arch.pmu_filter) { 971 970 kvm->arch.pmu_filter = bitmap_alloc(nr_events, GFP_KERNEL_ACCOUNT); 972 - if (!kvm->arch.pmu_filter) { 973 - mutex_unlock(&kvm->lock); 971 + if (!kvm->arch.pmu_filter) 974 972 return -ENOMEM; 975 - } 976 973 977 974 /* 978 975 * The default depends on the first applied filter. 
··· 984 993 bitmap_set(kvm->arch.pmu_filter, filter.base_event, filter.nevents); 985 994 else 986 995 bitmap_clear(kvm->arch.pmu_filter, filter.base_event, filter.nevents); 987 - 988 - mutex_unlock(&kvm->lock); 989 996 990 997 return 0; 991 998 }

+22 -15

arch/arm64/kvm/psci.c

··· 62 62 struct vcpu_reset_state *reset_state; 63 63 struct kvm *kvm = source_vcpu->kvm; 64 64 struct kvm_vcpu *vcpu = NULL; 65 + int ret = PSCI_RET_SUCCESS; 65 66 unsigned long cpu_id; 66 67 67 68 cpu_id = smccc_get_arg1(source_vcpu); ··· 77 76 */ 78 77 if (!vcpu) 79 78 return PSCI_RET_INVALID_PARAMS; 79 + 80 + spin_lock(&vcpu->arch.mp_state_lock); 80 81 if (!kvm_arm_vcpu_stopped(vcpu)) { 81 82 if (kvm_psci_version(source_vcpu) != KVM_ARM_PSCI_0_1) 82 - return PSCI_RET_ALREADY_ON; 83 + ret = PSCI_RET_ALREADY_ON; 83 84 else 84 - return PSCI_RET_INVALID_PARAMS; 85 + ret = PSCI_RET_INVALID_PARAMS; 86 + 87 + goto out_unlock; 85 88 } 86 89 87 90 reset_state = &vcpu->arch.reset_state; ··· 101 96 */ 102 97 reset_state->r0 = smccc_get_arg3(source_vcpu); 103 98 104 - WRITE_ONCE(reset_state->reset, true); 99 + reset_state->reset = true; 105 100 kvm_make_request(KVM_REQ_VCPU_RESET, vcpu); 106 101 107 102 /* ··· 110 105 */ 111 106 smp_wmb(); 112 107 113 - vcpu->arch.mp_state.mp_state = KVM_MP_STATE_RUNNABLE; 108 + WRITE_ONCE(vcpu->arch.mp_state.mp_state, KVM_MP_STATE_RUNNABLE); 114 109 kvm_vcpu_wake_up(vcpu); 115 110 116 - return PSCI_RET_SUCCESS; 111 + out_unlock: 112 + spin_unlock(&vcpu->arch.mp_state_lock); 113 + return ret; 117 114 } 118 115 119 116 static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu) ··· 175 168 * after this call is handled and before the VCPUs have been 176 169 * re-initialized. 
177 170 */ 178 - kvm_for_each_vcpu(i, tmp, vcpu->kvm) 179 - tmp->arch.mp_state.mp_state = KVM_MP_STATE_STOPPED; 171 + kvm_for_each_vcpu(i, tmp, vcpu->kvm) { 172 + spin_lock(&tmp->arch.mp_state_lock); 173 + WRITE_ONCE(tmp->arch.mp_state.mp_state, KVM_MP_STATE_STOPPED); 174 + spin_unlock(&tmp->arch.mp_state_lock); 175 + } 180 176 kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP); 181 177 182 178 memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event)); ··· 239 229 240 230 static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu) 241 231 { 242 - struct kvm *kvm = vcpu->kvm; 243 232 u32 psci_fn = smccc_get_function(vcpu); 244 233 unsigned long val; 245 234 int ret = 1; ··· 263 254 kvm_psci_narrow_to_32bit(vcpu); 264 255 fallthrough; 265 256 case PSCI_0_2_FN64_CPU_ON: 266 - mutex_lock(&kvm->lock); 267 257 val = kvm_psci_vcpu_on(vcpu); 268 - mutex_unlock(&kvm->lock); 269 258 break; 270 259 case PSCI_0_2_FN_AFFINITY_INFO: 271 260 kvm_psci_narrow_to_32bit(vcpu); ··· 402 395 403 396 static int kvm_psci_0_1_call(struct kvm_vcpu *vcpu) 404 397 { 405 - struct kvm *kvm = vcpu->kvm; 406 398 u32 psci_fn = smccc_get_function(vcpu); 407 399 unsigned long val; 408 400 ··· 411 405 val = PSCI_RET_SUCCESS; 412 406 break; 413 407 case KVM_PSCI_FN_CPU_ON: 414 - mutex_lock(&kvm->lock); 415 408 val = kvm_psci_vcpu_on(vcpu); 416 - mutex_unlock(&kvm->lock); 417 409 break; 418 410 default: 419 411 val = PSCI_RET_NOT_SUPPORTED; ··· 439 435 int kvm_psci_call(struct kvm_vcpu *vcpu) 440 436 { 441 437 u32 psci_fn = smccc_get_function(vcpu); 438 + int version = kvm_psci_version(vcpu); 442 439 unsigned long val; 443 440 444 441 val = kvm_psci_check_allowed_function(vcpu, psci_fn); ··· 448 443 return 1; 449 444 } 450 445 451 - switch (kvm_psci_version(vcpu)) { 446 + switch (version) { 452 447 case KVM_ARM_PSCI_1_1: 453 448 return kvm_psci_1_x_call(vcpu, 1); 454 449 case KVM_ARM_PSCI_1_0: ··· 458 453 case KVM_ARM_PSCI_0_1: 459 454 return kvm_psci_0_1_call(vcpu); 460 455 default: 461 - return 
-EINVAL; 456 + WARN_ONCE(1, "Unknown PSCI version %d", version); 457 + smccc_set_retval(vcpu, SMCCC_RET_NOT_SUPPORTED, 0, 0, 0); 458 + return 1; 462 459 } 463 460 }
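
Review note: kvm_psci_vcpu_on() now takes the target's mp_state_lock and funnels every path through a single out_unlock label instead of returning early. A minimal user-space sketch of that shape follows. The atomic_flag spinlock, the struct, and all sketch_ names are assumptions made for illustration; only the return codes come from the PSCI specification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* PSCI return codes as defined by the PSCI specification. */
#define SKETCH_PSCI_RET_SUCCESS		0
#define SKETCH_PSCI_RET_INVALID_PARAMS	(-2)
#define SKETCH_PSCI_RET_ALREADY_ON	(-4)

/* Hypothetical, heavily reduced vCPU state. */
struct sketch_vcpu {
	atomic_flag mp_state_lock;	/* stand-in for the kernel spinlock */
	bool stopped;
	bool reset_pending;
};

static void sketch_vcpu_init(struct sketch_vcpu *v, bool stopped)
{
	atomic_flag_clear(&v->mp_state_lock);
	v->stopped = stopped;
	v->reset_pending = false;
}

static void sketch_mp_lock(atomic_flag *l)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;
}

static void sketch_mp_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Turn on a stopped vCPU; every path leaves through out_unlock. */
static int sketch_vcpu_on(struct sketch_vcpu *target, bool caller_is_psci_0_1)
{
	int ret = SKETCH_PSCI_RET_SUCCESS;

	sketch_mp_lock(&target->mp_state_lock);

	if (!target->stopped) {
		/* PSCI 0.1 has no ALREADY_ON code, as in the hunk. */
		ret = caller_is_psci_0_1 ? SKETCH_PSCI_RET_INVALID_PARAMS
					 : SKETCH_PSCI_RET_ALREADY_ON;
		goto out_unlock;
	}

	target->reset_pending = true;
	target->stopped = false;

out_unlock:
	sketch_mp_unlock(&target->mp_state_lock);
	return ret;
}
```

The single exit matters because the state check and the state update have to happen under the same lock acquisition; an early return inside the critical section would leave the lock held.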

+8 -7

arch/arm64/kvm/reset.c

··· 205 205 206 206 is32bit = vcpu_has_feature(vcpu, KVM_ARM_VCPU_EL1_32BIT); 207 207 208 - lockdep_assert_held(&kvm->lock); 208 + lockdep_assert_held(&kvm->arch.config_lock); 209 209 210 210 if (test_bit(KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED, &kvm->arch.flags)) { 211 211 /* ··· 262 262 bool loaded; 263 263 u32 pstate; 264 264 265 - mutex_lock(&vcpu->kvm->lock); 265 + mutex_lock(&vcpu->kvm->arch.config_lock); 266 266 ret = kvm_set_vm_width(vcpu); 267 - if (!ret) { 268 - reset_state = vcpu->arch.reset_state; 269 - WRITE_ONCE(vcpu->arch.reset_state.reset, false); 270 - } 271 - mutex_unlock(&vcpu->kvm->lock); 267 + mutex_unlock(&vcpu->kvm->arch.config_lock); 272 268 273 269 if (ret) 274 270 return ret; 271 + 272 + spin_lock(&vcpu->arch.mp_state_lock); 273 + reset_state = vcpu->arch.reset_state; 274 + vcpu->arch.reset_state.reset = false; 275 + spin_unlock(&vcpu->arch.mp_state_lock); 275 276 276 277 /* Reset PMU outside of the non-preemptible section */ 277 278 kvm_pmu_vcpu_reset(vcpu);

+10

arch/arm64/kvm/sys_regs.c

··· 1154 1154 tmr = TIMER_PTIMER; 1155 1155 treg = TIMER_REG_CVAL; 1156 1156 break; 1157 + case SYS_CNTPCT_EL0: 1158 + case SYS_CNTPCTSS_EL0: 1159 + case SYS_AARCH32_CNTPCT: 1160 + tmr = TIMER_PTIMER; 1161 + treg = TIMER_REG_CNT; 1162 + break; 1157 1163 default: 1158 1164 print_sys_reg_msg(p, "%s", "Unhandled trapped timer register"); 1159 1165 kvm_inject_undefined(vcpu); ··· 2097 2091 AMU_AMEVTYPER1_EL0(14), 2098 2092 AMU_AMEVTYPER1_EL0(15), 2099 2093 2094 + { SYS_DESC(SYS_CNTPCT_EL0), access_arch_timer }, 2095 + { SYS_DESC(SYS_CNTPCTSS_EL0), access_arch_timer }, 2100 2096 { SYS_DESC(SYS_CNTP_TVAL_EL0), access_arch_timer }, 2101 2097 { SYS_DESC(SYS_CNTP_CTL_EL0), access_arch_timer }, 2102 2098 { SYS_DESC(SYS_CNTP_CVAL_EL0), access_arch_timer }, ··· 2549 2541 { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, TTBR0_EL1 }, 2550 2542 { CP15_PMU_SYS_REG(DIRECT, 0, 0, 9, 0), .access = access_pmu_evcntr }, 2551 2543 { Op1( 0), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, /* ICC_SGI1R */ 2544 + { SYS_DESC(SYS_AARCH32_CNTPCT), access_arch_timer }, 2552 2545 { Op1( 1), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, TTBR1_EL1 }, 2553 2546 { Op1( 1), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, /* ICC_ASGI1R */ 2554 2547 { Op1( 2), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, /* ICC_SGI0R */ 2555 2548 { SYS_DESC(SYS_AARCH32_CNTP_CVAL), access_arch_timer }, 2549 + { SYS_DESC(SYS_AARCH32_CNTPCTSS), access_arch_timer }, 2556 2550 }; 2557 2551 2558 2552 static bool check_sysreg_table(const struct sys_reg_desc *table, unsigned int n,

+5 -1

arch/arm64/kvm/trace_arm.h

··· 206 206 __field( unsigned long, vcpu_id ) 207 207 __field( int, direct_vtimer ) 208 208 __field( int, direct_ptimer ) 209 + __field( int, emul_vtimer ) 209 210 __field( int, emul_ptimer ) 210 211 ), 211 212 ··· 215 214 __entry->direct_vtimer = arch_timer_ctx_index(map->direct_vtimer); 216 215 __entry->direct_ptimer = 217 216 (map->direct_ptimer) ? arch_timer_ctx_index(map->direct_ptimer) : -1; 217 + __entry->emul_vtimer = 218 + (map->emul_vtimer) ? arch_timer_ctx_index(map->emul_vtimer) : -1; 218 219 __entry->emul_ptimer = 219 220 (map->emul_ptimer) ? arch_timer_ctx_index(map->emul_ptimer) : -1; 220 221 ), 221 222 222 - TP_printk("VCPU: %ld, dv: %d, dp: %d, ep: %d", 223 + TP_printk("VCPU: %ld, dv: %d, dp: %d, ev: %d, ep: %d", 223 224 __entry->vcpu_id, 224 225 __entry->direct_vtimer, 225 226 __entry->direct_ptimer, 227 + __entry->emul_vtimer, 226 228 __entry->emul_ptimer) 227 229 ); 228 230

+4 -4

arch/arm64/kvm/vgic/vgic-debug.c

··· 85 85 struct kvm *kvm = s->private; 86 86 struct vgic_state_iter *iter; 87 87 88 - mutex_lock(&kvm->lock); 88 + mutex_lock(&kvm->arch.config_lock); 89 89 iter = kvm->arch.vgic.iter; 90 90 if (iter) { 91 91 iter = ERR_PTR(-EBUSY); ··· 104 104 if (end_of_vgic(iter)) 105 105 iter = NULL; 106 106 out: 107 - mutex_unlock(&kvm->lock); 107 + mutex_unlock(&kvm->arch.config_lock); 108 108 return iter; 109 109 } 110 110 ··· 132 132 if (IS_ERR(v)) 133 133 return; 134 134 135 - mutex_lock(&kvm->lock); 135 + mutex_lock(&kvm->arch.config_lock); 136 136 iter = kvm->arch.vgic.iter; 137 137 kfree(iter->lpi_array); 138 138 kfree(iter); 139 139 kvm->arch.vgic.iter = NULL; 140 - mutex_unlock(&kvm->lock); 140 + mutex_unlock(&kvm->arch.config_lock); 141 141 } 142 142 143 143 static void print_dist_state(struct seq_file *s, struct vgic_dist *dist)

+23 -13

arch/arm64/kvm/vgic/vgic-init.c

··· 74 74 unsigned long i; 75 75 int ret; 76 76 77 - if (irqchip_in_kernel(kvm)) 78 - return -EEXIST; 79 - 80 77 /* 81 78 * This function is also called by the KVM_CREATE_IRQCHIP handler, 82 79 * which had no chance yet to check the availability of the GICv2 ··· 84 87 !kvm_vgic_global_state.can_emulate_gicv2) 85 88 return -ENODEV; 86 89 90 + /* Must be held to avoid race with vCPU creation */ 91 + lockdep_assert_held(&kvm->lock); 92 + 87 93 ret = -EBUSY; 88 94 if (!lock_all_vcpus(kvm)) 89 95 return ret; 96 + 97 + mutex_lock(&kvm->arch.config_lock); 98 + 99 + if (irqchip_in_kernel(kvm)) { 100 + ret = -EEXIST; 101 + goto out_unlock; 102 + } 90 103 91 104 kvm_for_each_vcpu(i, vcpu, kvm) { 92 105 if (vcpu_has_run_once(vcpu)) ··· 125 118 INIT_LIST_HEAD(&kvm->arch.vgic.rd_regions); 126 119 127 120 out_unlock: 121 + mutex_unlock(&kvm->arch.config_lock); 128 122 unlock_all_vcpus(kvm); 129 123 return ret; 130 124 } ··· 235 227 * KVM io device for the redistributor that belongs to this VCPU. 236 228 */ 237 229 if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) { 238 - mutex_lock(&vcpu->kvm->lock); 230 + mutex_lock(&vcpu->kvm->arch.config_lock); 239 231 ret = vgic_register_redist_iodev(vcpu); 240 - mutex_unlock(&vcpu->kvm->lock); 232 + mutex_unlock(&vcpu->kvm->arch.config_lock); 241 233 } 242 234 return ret; 243 235 } ··· 258 250 * The function is generally called when nr_spis has been explicitly set 259 251 * by the guest through the KVM DEVICE API. If not nr_spis is set to 256. 260 252 * vgic_initialized() returns true when this function has succeeded. 261 - * Must be called with kvm->lock held! 
262 253 */ 263 254 int vgic_init(struct kvm *kvm) 264 255 { ··· 265 258 struct kvm_vcpu *vcpu; 266 259 int ret = 0, i; 267 260 unsigned long idx; 261 + 262 + lockdep_assert_held(&kvm->arch.config_lock); 268 263 269 264 if (vgic_initialized(kvm)) 270 265 return 0; ··· 382 373 vgic_cpu->rd_iodev.base_addr = VGIC_ADDR_UNDEF; 383 374 } 384 375 385 - /* To be called with kvm->lock held */ 386 376 static void __kvm_vgic_destroy(struct kvm *kvm) 387 377 { 388 378 struct kvm_vcpu *vcpu; 389 379 unsigned long i; 380 + 381 + lockdep_assert_held(&kvm->arch.config_lock); 390 382 391 383 vgic_debug_destroy(kvm); 392 384 ··· 399 389 400 390 void kvm_vgic_destroy(struct kvm *kvm) 401 391 { 402 - mutex_lock(&kvm->lock); 392 + mutex_lock(&kvm->arch.config_lock); 403 393 __kvm_vgic_destroy(kvm); 404 - mutex_unlock(&kvm->lock); 394 + mutex_unlock(&kvm->arch.config_lock); 405 395 } 406 396 407 397 /** ··· 424 414 if (kvm->arch.vgic.vgic_model != KVM_DEV_TYPE_ARM_VGIC_V2) 425 415 return -EBUSY; 426 416 427 - mutex_lock(&kvm->lock); 417 + mutex_lock(&kvm->arch.config_lock); 428 418 ret = vgic_init(kvm); 429 - mutex_unlock(&kvm->lock); 419 + mutex_unlock(&kvm->arch.config_lock); 430 420 } 431 421 432 422 return ret; ··· 451 441 if (likely(vgic_ready(kvm))) 452 442 return 0; 453 443 454 - mutex_lock(&kvm->lock); 444 + mutex_lock(&kvm->arch.config_lock); 455 445 if (vgic_ready(kvm)) 456 446 goto out; 457 447 ··· 469 459 dist->ready = true; 470 460 471 461 out: 472 - mutex_unlock(&kvm->lock); 462 + mutex_unlock(&kvm->arch.config_lock); 473 463 return ret; 474 464 } 475 465

+24 -9

arch/arm64/kvm/vgic/vgic-its.c

··· 1958 1958 mutex_init(&its->its_lock); 1959 1959 mutex_init(&its->cmd_lock); 1960 1960 1961 + /* Yep, even more trickery for lock ordering... */ 1962 + #ifdef CONFIG_LOCKDEP 1963 + mutex_lock(&dev->kvm->arch.config_lock); 1964 + mutex_lock(&its->cmd_lock); 1965 + mutex_lock(&its->its_lock); 1966 + mutex_unlock(&its->its_lock); 1967 + mutex_unlock(&its->cmd_lock); 1968 + mutex_unlock(&dev->kvm->arch.config_lock); 1969 + #endif 1970 + 1961 1971 its->vgic_its_base = VGIC_ADDR_UNDEF; 1962 1972 1963 1973 INIT_LIST_HEAD(&its->device_list); ··· 2055 2045 2056 2046 mutex_lock(&dev->kvm->lock); 2057 2047 2048 + if (!lock_all_vcpus(dev->kvm)) { 2049 + mutex_unlock(&dev->kvm->lock); 2050 + return -EBUSY; 2051 + } 2052 + 2053 + mutex_lock(&dev->kvm->arch.config_lock); 2054 + 2058 2055 if (IS_VGIC_ADDR_UNDEF(its->vgic_its_base)) { 2059 2056 ret = -ENXIO; 2060 2057 goto out; ··· 2072 2055 offset); 2073 2056 if (!region) { 2074 2057 ret = -ENXIO; 2075 - goto out; 2076 - } 2077 - 2078 - if (!lock_all_vcpus(dev->kvm)) { 2079 - ret = -EBUSY; 2080 2058 goto out; 2081 2059 } 2082 2060 ··· 2088 2076 } else { 2089 2077 *reg = region->its_read(dev->kvm, its, addr, len); 2090 2078 } 2091 - unlock_all_vcpus(dev->kvm); 2092 2079 out: 2080 + mutex_unlock(&dev->kvm->arch.config_lock); 2081 + unlock_all_vcpus(dev->kvm); 2093 2082 mutex_unlock(&dev->kvm->lock); 2094 2083 return ret; 2095 2084 } ··· 2762 2749 return 0; 2763 2750 2764 2751 mutex_lock(&kvm->lock); 2765 - mutex_lock(&its->its_lock); 2766 2752 2767 2753 if (!lock_all_vcpus(kvm)) { 2768 - mutex_unlock(&its->its_lock); 2769 2754 mutex_unlock(&kvm->lock); 2770 2755 return -EBUSY; 2771 2756 } 2757 + 2758 + mutex_lock(&kvm->arch.config_lock); 2759 + mutex_lock(&its->its_lock); 2772 2760 2773 2761 switch (attr) { 2774 2762 case KVM_DEV_ARM_ITS_CTRL_RESET: ··· 2783 2769 break; 2784 2770 } 2785 2771 2786 - unlock_all_vcpus(kvm); 2787 2772 mutex_unlock(&its->its_lock); 2773 + mutex_unlock(&kvm->arch.config_lock); 2774 + 
unlock_all_vcpus(kvm); 2788 2775 mutex_unlock(&kvm->lock); 2789 2776 return ret; 2790 2777 }

+28 -57

arch/arm64/kvm/vgic/vgic-kvm-device.c

··· 46 46 struct vgic_dist *vgic = &kvm->arch.vgic; 47 47 int r; 48 48 49 - mutex_lock(&kvm->lock); 49 + mutex_lock(&kvm->arch.config_lock); 50 50 switch (FIELD_GET(KVM_ARM_DEVICE_TYPE_MASK, dev_addr->id)) { 51 51 case KVM_VGIC_V2_ADDR_TYPE_DIST: 52 52 r = vgic_check_type(kvm, KVM_DEV_TYPE_ARM_VGIC_V2); ··· 68 68 r = -ENODEV; 69 69 } 70 70 71 - mutex_unlock(&kvm->lock); 71 + mutex_unlock(&kvm->arch.config_lock); 72 72 73 73 return r; 74 74 } ··· 102 102 if (get_user(addr, uaddr)) 103 103 return -EFAULT; 104 104 105 - mutex_lock(&kvm->lock); 105 + mutex_lock(&kvm->arch.config_lock); 106 106 switch (attr->attr) { 107 107 case KVM_VGIC_V2_ADDR_TYPE_DIST: 108 108 r = vgic_check_type(kvm, KVM_DEV_TYPE_ARM_VGIC_V2); ··· 191 191 } 192 192 193 193 out: 194 - mutex_unlock(&kvm->lock); 194 + mutex_unlock(&kvm->arch.config_lock); 195 195 196 196 if (!r && !write) 197 197 r = put_user(addr, uaddr); ··· 227 227 (val & 31)) 228 228 return -EINVAL; 229 229 230 - mutex_lock(&dev->kvm->lock); 230 + mutex_lock(&dev->kvm->arch.config_lock); 231 231 232 232 if (vgic_ready(dev->kvm) || dev->kvm->arch.vgic.nr_spis) 233 233 ret = -EBUSY; ··· 235 235 dev->kvm->arch.vgic.nr_spis = 236 236 val - VGIC_NR_PRIVATE_IRQS; 237 237 238 - mutex_unlock(&dev->kvm->lock); 238 + mutex_unlock(&dev->kvm->arch.config_lock); 239 239 240 240 return ret; 241 241 } 242 242 case KVM_DEV_ARM_VGIC_GRP_CTRL: { 243 243 switch (attr->attr) { 244 244 case KVM_DEV_ARM_VGIC_CTRL_INIT: 245 - mutex_lock(&dev->kvm->lock); 245 + mutex_lock(&dev->kvm->arch.config_lock); 246 246 r = vgic_init(dev->kvm); 247 - mutex_unlock(&dev->kvm->lock); 247 + mutex_unlock(&dev->kvm->arch.config_lock); 248 248 return r; 249 249 case KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES: 250 250 /* ··· 260 260 mutex_unlock(&dev->kvm->lock); 261 261 return -EBUSY; 262 262 } 263 + 264 + mutex_lock(&dev->kvm->arch.config_lock); 263 265 r = vgic_v3_save_pending_tables(dev->kvm); 266 + mutex_unlock(&dev->kvm->arch.config_lock); 264 267 
unlock_all_vcpus(dev->kvm); 265 268 mutex_unlock(&dev->kvm->lock); 266 269 return r; ··· 345 342 return 0; 346 343 } 347 344 348 - /* unlocks vcpus from @vcpu_lock_idx and smaller */ 349 - static void unlock_vcpus(struct kvm *kvm, int vcpu_lock_idx) 350 - { 351 - struct kvm_vcpu *tmp_vcpu; 352 - 353 - for (; vcpu_lock_idx >= 0; vcpu_lock_idx--) { 354 - tmp_vcpu = kvm_get_vcpu(kvm, vcpu_lock_idx); 355 - mutex_unlock(&tmp_vcpu->mutex); 356 - } 357 - } 358 - 359 - void unlock_all_vcpus(struct kvm *kvm) 360 - { 361 - unlock_vcpus(kvm, atomic_read(&kvm->online_vcpus) - 1); 362 - } 363 - 364 - /* Returns true if all vcpus were locked, false otherwise */ 365 - bool lock_all_vcpus(struct kvm *kvm) 366 - { 367 - struct kvm_vcpu *tmp_vcpu; 368 - unsigned long c; 369 - 370 - /* 371 - * Any time a vcpu is run, vcpu_load is called which tries to grab the 372 - * vcpu->mutex. By grabbing the vcpu->mutex of all VCPUs we ensure 373 - * that no other VCPUs are run and fiddle with the vgic state while we 374 - * access it. 
375 - */ 376 - kvm_for_each_vcpu(c, tmp_vcpu, kvm) { 377 - if (!mutex_trylock(&tmp_vcpu->mutex)) { 378 - unlock_vcpus(kvm, c - 1); 379 - return false; 380 - } 381 - } 382 - 383 - return true; 384 - } 385 - 386 345 /** 387 346 * vgic_v2_attr_regs_access - allows user space to access VGIC v2 state 388 347 * ··· 376 411 377 412 mutex_lock(&dev->kvm->lock); 378 413 414 + if (!lock_all_vcpus(dev->kvm)) { 415 + mutex_unlock(&dev->kvm->lock); 416 + return -EBUSY; 417 + } 418 + 419 + mutex_lock(&dev->kvm->arch.config_lock); 420 + 379 421 ret = vgic_init(dev->kvm); 380 422 if (ret) 381 423 goto out; 382 - 383 - if (!lock_all_vcpus(dev->kvm)) { 384 - ret = -EBUSY; 385 - goto out; 386 - } 387 424 388 425 switch (attr->group) { 389 426 case KVM_DEV_ARM_VGIC_GRP_CPU_REGS: ··· 399 432 break; 400 433 } 401 434 402 - unlock_all_vcpus(dev->kvm); 403 435 out: 436 + mutex_unlock(&dev->kvm->arch.config_lock); 437 + unlock_all_vcpus(dev->kvm); 404 438 mutex_unlock(&dev->kvm->lock); 405 439 406 440 if (!ret && !is_write) ··· 537 569 538 570 mutex_lock(&dev->kvm->lock); 539 571 540 - if (unlikely(!vgic_initialized(dev->kvm))) { 541 - ret = -EBUSY; 542 - goto out; 572 + if (!lock_all_vcpus(dev->kvm)) { 573 + mutex_unlock(&dev->kvm->lock); 574 + return -EBUSY; 543 575 } 544 576 545 - if (!lock_all_vcpus(dev->kvm)) { 577 + mutex_lock(&dev->kvm->arch.config_lock); 578 + 579 + if (unlikely(!vgic_initialized(dev->kvm))) { 546 580 ret = -EBUSY; 547 581 goto out; 548 582 } ··· 579 609 break; 580 610 } 581 611 582 - unlock_all_vcpus(dev->kvm); 583 612 out: 613 + mutex_unlock(&dev->kvm->arch.config_lock); 614 + unlock_all_vcpus(dev->kvm); 584 615 mutex_unlock(&dev->kvm->lock); 585 616 586 617 if (!ret && uaccess && !is_write) {
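
Review note: the helper removed from this file implements "trylock every vCPU, roll back on the first failure", and the callers above still depend on that behaviour. The sketch below reproduces the pattern in user space. It assumes a C11 atomic_flag in place of vcpu->mutex and a plain array in place of the VM's vCPU list; all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU and its mutex. */
struct sketch_vcpu_lock {
	atomic_flag mutex;
};

/* Returns true if the lock was free and is now held by the caller. */
static bool sketch_vcpu_trylock(atomic_flag *m)
{
	return !atomic_flag_test_and_set_explicit(m, memory_order_acquire);
}

static void sketch_vcpu_unlock(atomic_flag *m)
{
	atomic_flag_clear_explicit(m, memory_order_release);
}

/* Unlock entries [0, count), highest index first. */
static void sketch_unlock_vcpus(struct sketch_vcpu_lock *v, size_t count)
{
	assert(v || count == 0);

	while (count--)
		sketch_vcpu_unlock(&v[count].mutex);
}

/* Returns true if all vCPUs were locked, false (with none held) otherwise. */
static bool sketch_lock_all_vcpus(struct sketch_vcpu_lock *v, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!sketch_vcpu_trylock(&v[i].mutex)) {
			/* Roll back only the locks taken so far. */
			sketch_unlock_vcpus(v, i);
			return false;
		}
	}

	return true;
}
```

Because the helper never blocks, a caller that finds one vCPU busy gets a failure it can turn into -EBUSY, which is what the reordered hunks above do before taking config_lock.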

+2 -2

arch/arm64/kvm/vgic/vgic-mmio-v3.c

··· 111 111 case GICD_CTLR: { 112 112 bool was_enabled, is_hwsgi; 113 113 114 - mutex_lock(&vcpu->kvm->lock); 114 + mutex_lock(&vcpu->kvm->arch.config_lock); 115 115 116 116 was_enabled = dist->enabled; 117 117 is_hwsgi = dist->nassgireq; ··· 139 139 else if (!was_enabled && dist->enabled) 140 140 vgic_kick_vcpus(vcpu->kvm); 141 141 142 - mutex_unlock(&vcpu->kvm->lock); 142 + mutex_unlock(&vcpu->kvm->arch.config_lock); 143 143 break; 144 144 } 145 145 case GICD_TYPER:

+6 -6

arch/arm64/kvm/vgic/vgic-mmio.c

··· 530 530 u32 intid = VGIC_ADDR_TO_INTID(addr, 1); 531 531 u32 val; 532 532 533 - mutex_lock(&vcpu->kvm->lock); 533 + mutex_lock(&vcpu->kvm->arch.config_lock); 534 534 vgic_access_active_prepare(vcpu, intid); 535 535 536 536 val = __vgic_mmio_read_active(vcpu, addr, len); 537 537 538 538 vgic_access_active_finish(vcpu, intid); 539 - mutex_unlock(&vcpu->kvm->lock); 539 + mutex_unlock(&vcpu->kvm->arch.config_lock); 540 540 541 541 return val; 542 542 } ··· 625 625 { 626 626 u32 intid = VGIC_ADDR_TO_INTID(addr, 1); 627 627 628 - mutex_lock(&vcpu->kvm->lock); 628 + mutex_lock(&vcpu->kvm->arch.config_lock); 629 629 vgic_access_active_prepare(vcpu, intid); 630 630 631 631 __vgic_mmio_write_cactive(vcpu, addr, len, val); 632 632 633 633 vgic_access_active_finish(vcpu, intid); 634 - mutex_unlock(&vcpu->kvm->lock); 634 + mutex_unlock(&vcpu->kvm->arch.config_lock); 635 635 } 636 636 637 637 int vgic_mmio_uaccess_write_cactive(struct kvm_vcpu *vcpu, ··· 662 662 { 663 663 u32 intid = VGIC_ADDR_TO_INTID(addr, 1); 664 664 665 - mutex_lock(&vcpu->kvm->lock); 665 + mutex_lock(&vcpu->kvm->arch.config_lock); 666 666 vgic_access_active_prepare(vcpu, intid); 667 667 668 668 __vgic_mmio_write_sactive(vcpu, addr, len, val); 669 669 670 670 vgic_access_active_finish(vcpu, intid); 671 - mutex_unlock(&vcpu->kvm->lock); 671 + mutex_unlock(&vcpu->kvm->arch.config_lock); 672 672 } 673 673 674 674 int vgic_mmio_uaccess_write_sactive(struct kvm_vcpu *vcpu,

+6 -5

arch/arm64/kvm/vgic/vgic-v4.c

··· 232 232 * @kvm: Pointer to the VM being initialized 233 233 * 234 234 * We may be called each time a vITS is created, or when the 235 - * vgic is initialized. This relies on kvm->lock to be 236 - * held. In both cases, the number of vcpus should now be 237 - * fixed. 235 + * vgic is initialized. In both cases, the number of vcpus 236 + * should now be fixed. 238 237 */ 239 238 int vgic_v4_init(struct kvm *kvm) 240 239 { ··· 241 242 struct kvm_vcpu *vcpu; 242 243 int nr_vcpus, ret; 243 244 unsigned long i; 245 + 246 + lockdep_assert_held(&kvm->arch.config_lock); 244 247 245 248 if (!kvm_vgic_global_state.has_gicv4) 246 249 return 0; /* Nothing to see here... move along. */ ··· 310 309 /** 311 310 * vgic_v4_teardown - Free the GICv4 data structures 312 311 * @kvm: Pointer to the VM being destroyed 313 - * 314 - * Relies on kvm->lock to be held. 315 312 */ 316 313 void vgic_v4_teardown(struct kvm *kvm) 317 314 { 318 315 struct its_vm *its_vm = &kvm->arch.vgic.its_vm; 319 316 int i; 317 + 318 + lockdep_assert_held(&kvm->arch.config_lock); 320 319 321 320 if (!its_vm->vpes) 322 321 return;

+22 -5

arch/arm64/kvm/vgic/vgic.c

··· 24 24 /* 25 25 * Locking order is always: 26 26 * kvm->lock (mutex) 27 - * its->cmd_lock (mutex) 28 - * its->its_lock (mutex) 29 - * vgic_cpu->ap_list_lock must be taken with IRQs disabled 30 - * kvm->lpi_list_lock must be taken with IRQs disabled 31 - * vgic_irq->irq_lock must be taken with IRQs disabled 27 + * vcpu->mutex (mutex) 28 + * kvm->arch.config_lock (mutex) 29 + * its->cmd_lock (mutex) 30 + * its->its_lock (mutex) 31 + * vgic_cpu->ap_list_lock must be taken with IRQs disabled 32 + * kvm->lpi_list_lock must be taken with IRQs disabled 33 + * vgic_irq->irq_lock must be taken with IRQs disabled 32 34 * 33 35 * As the ap_list_lock might be taken from the timer interrupt handler, 34 36 * we have to disable IRQs before taking this lock and everything lower ··· 573 571 vgic_put_irq(vcpu->kvm, irq); 574 572 575 573 return 0; 574 + } 575 + 576 + int kvm_vgic_get_map(struct kvm_vcpu *vcpu, unsigned int vintid) 577 + { 578 + struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, vintid); 579 + unsigned long flags; 580 + int ret = -1; 581 + 582 + raw_spin_lock_irqsave(&irq->irq_lock, flags); 583 + if (irq->hw) 584 + ret = irq->hwintid; 585 + raw_spin_unlock_irqrestore(&irq->irq_lock, flags); 586 + 587 + vgic_put_irq(vcpu->kvm, irq); 588 + return ret; 576 589 } 577 590 578 591 /**
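
Review note: the updated comment places vcpu->mutex and kvm->arch.config_lock between kvm->lock and the ITS locks. One way to reason about such an ordering is a rank check, sketched below. This is a toy model and not lockdep: the enum, the struct, and the functions are hypothetical, all vCPU mutexes share one rank, and the sketch insists on strict reverse-order release, which is stricter than the kernel requires.

```c
#include <assert.h>
#include <stdbool.h>

/* Ranks follow the order in the comment: the outermost lock has the lowest rank. */
enum sketch_lock_rank {
	SKETCH_RANK_KVM_LOCK,
	SKETCH_RANK_VCPU_MUTEX,
	SKETCH_RANK_CONFIG_LOCK,
	SKETCH_RANK_ITS_CMD_LOCK,
	SKETCH_RANK_ITS_LOCK,
	SKETCH_RANK_COUNT,
};

/* Locks currently held by one thread, innermost last. */
struct sketch_held {
	enum sketch_lock_rank stack[SKETCH_RANK_COUNT];
	int depth;
};

/* Acquiring is allowed only when the new rank is deeper than the innermost held lock. */
static bool sketch_acquire(struct sketch_held *h, enum sketch_lock_rank rank)
{
	assert(h);

	if (h->depth > 0 && h->stack[h->depth - 1] >= rank)
		return false;

	/* Ranks are strictly increasing, so depth never exceeds SKETCH_RANK_COUNT. */
	h->stack[h->depth++] = rank;
	return true;
}

/* Releasing must undo the most recent acquisition. */
static bool sketch_release(struct sketch_held *h, enum sketch_lock_rank rank)
{
	assert(h);

	if (h->depth == 0 || h->stack[h->depth - 1] != rank)
		return false;

	h->depth--;
	return true;
}
```

In this model the sequence used by the ITS attribute path in this merge (kvm->lock, then the vCPU mutexes, then config_lock, then its_lock) is accepted, while taking cmd_lock after its_lock is refused.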

-3

arch/arm64/kvm/vgic/vgic.h

··· 273 273 void vgic_debug_init(struct kvm *kvm); 274 274 void vgic_debug_destroy(struct kvm *kvm); 275 275 276 - bool lock_all_vcpus(struct kvm *kvm); 277 - void unlock_all_vcpus(struct kvm *kvm); 278 - 279 276 static inline int vgic_v3_max_apr_idx(struct kvm_vcpu *vcpu) 280 277 { 281 278 struct vgic_cpu *cpu_if = &vcpu->arch.vgic_cpu;

+1

arch/arm64/tools/cpucaps

··· 23 23 HAS_DIT 24 24 HAS_E0PD 25 25 HAS_ECV 26 + HAS_ECV_CNTPOFF 26 27 HAS_EPAN 27 28 HAS_GENERIC_AUTH 28 29 HAS_GENERIC_AUTH_ARCH_QARMA3

+4

arch/arm64/tools/sysreg

··· 2115 2115 Fields CONTEXTIDR_ELx 2116 2116 EndSysreg 2117 2117 2118 + Sysreg CNTPOFF_EL2 3 4 14 0 6 2119 + Field 63:0 PhysicalOffset 2120 + EndSysreg 2121 + 2118 2122 Sysreg CPACR_EL12 3 5 1 0 2 2119 2123 Fields CPACR_ELx 2120 2124 EndSysreg
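
Review note: the five numbers after the register name are Op0, Op1, CRn, CRm and Op2. The sketch below packs them using the field positions of the AArch64 MRS/MSR instruction encoding, which is also how the arm64 sys_reg() macro lays them out. The function names are hypothetical and the generated header itself is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Pack Op0/Op1/CRn/CRm/Op2 at their MRS/MSR instruction bit positions. */
static uint32_t sketch_sys_reg(uint32_t op0, uint32_t op1, uint32_t crn,
			       uint32_t crm, uint32_t op2)
{
	assert(op0 < 4 && op1 < 8 && crn < 16 && crm < 16 && op2 < 8);

	return (op0 << 19) | (op1 << 16) | (crn << 12) | (crm << 8) | (op2 << 5);
}

/* Recover the Op2 field from a packed encoding. */
static uint32_t sketch_sys_reg_op2(uint32_t encoding)
{
	return (encoding >> 5) & 0x7;
}
```

For the new entry, sketch_sys_reg(3, 4, 14, 0, 6) evaluates to 0x1CE0C0, corresponding to the generic name S3_4_C14_C0_6.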

+1 -1

arch/mips/include/asm/kvm_host.h

··· 757 757 int (*vcpu_run)(struct kvm_vcpu *vcpu); 758 758 void (*vcpu_reenter)(struct kvm_vcpu *vcpu); 759 759 }; 760 - extern struct kvm_mips_callbacks *kvm_mips_callbacks; 760 + extern const struct kvm_mips_callbacks * const kvm_mips_callbacks; 761 761 int kvm_mips_emulation_init(void); 762 762 763 763 /* Debug: dump vcpu state */

+2 -2

arch/mips/kvm/mips.c

··· 993 993 kvm_flush_remote_tlbs(kvm); 994 994 } 995 995 996 - long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 996 + int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 997 997 { 998 - long r; 998 + int r; 999 999 1000 1000 switch (ioctl) { 1001 1001 default:

+1 -1

arch/mips/kvm/vz.c

··· 3305 3305 }; 3306 3306 3307 3307 /* FIXME: Get rid of the callbacks now that trap-and-emulate is gone. */ 3308 - struct kvm_mips_callbacks *kvm_mips_callbacks = &kvm_vz_callbacks; 3308 + const struct kvm_mips_callbacks * const kvm_mips_callbacks = &kvm_vz_callbacks; 3309 3309 3310 3310 int kvm_mips_emulation_init(void) 3311 3311 {
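
Review note: the new declaration carries two const qualifiers with different jobs. The sketch below shows the same shape with a hypothetical one-entry callback table. Marking the table itself const is an assumption of the sketch; the hunk only shows the pointer declaration.

```c
#include <assert.h>

/* Hypothetical, reduced callback table. */
struct sketch_callbacks {
	int (*vcpu_run)(int vcpu_id);
};

static int sketch_vz_vcpu_run(int vcpu_id)
{
	return vcpu_id + 1;
}

static const struct sketch_callbacks sketch_vz_callbacks = {
	.vcpu_run = sketch_vz_vcpu_run,
};

/*
 * The first const makes the table read-only through this pointer; the
 * second const prevents the pointer from being retargeted after
 * initialization.
 */
static const struct sketch_callbacks * const sketch_active_callbacks =
	&sketch_vz_callbacks;

static int sketch_run(int vcpu_id)
{
	assert(sketch_active_callbacks->vcpu_run);

	return sketch_active_callbacks->vcpu_run(vcpu_id);
}
```

With both qualifiers in place, an assignment to sketch_active_callbacks or to sketch_active_callbacks->vcpu_run fails to compile, so the indirection can only ever resolve to the one table.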

+7 -7

arch/powerpc/include/asm/kvm_ppc.h

··· 167 167 168 168 extern int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order); 169 169 extern void kvmppc_set_hpt(struct kvm *kvm, struct kvm_hpt_info *info); 170 - extern long kvmppc_alloc_reset_hpt(struct kvm *kvm, int order); 170 + extern int kvmppc_alloc_reset_hpt(struct kvm *kvm, int order); 171 171 extern void kvmppc_free_hpt(struct kvm_hpt_info *info); 172 172 extern void kvmppc_rmap_reset(struct kvm *kvm); 173 173 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu, ··· 181 181 extern int kvmppc_switch_mmu_to_radix(struct kvm *kvm); 182 182 extern void kvmppc_setup_partition_table(struct kvm *kvm); 183 183 184 - extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, 184 + extern int kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, 185 185 struct kvm_create_spapr_tce_64 *args); 186 186 #define kvmppc_ioba_validate(stt, ioba, npages) \ 187 187 (iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \ ··· 222 222 extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu); 223 223 224 224 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *); 225 - extern long kvm_vm_ioctl_resize_hpt_prepare(struct kvm *kvm, 226 - struct kvm_ppc_resize_hpt *rhpt); 227 - extern long kvm_vm_ioctl_resize_hpt_commit(struct kvm *kvm, 225 + extern int kvm_vm_ioctl_resize_hpt_prepare(struct kvm *kvm, 228 226 struct kvm_ppc_resize_hpt *rhpt); 227 + extern int kvm_vm_ioctl_resize_hpt_commit(struct kvm *kvm, 228 + struct kvm_ppc_resize_hpt *rhpt); 229 229 230 230 int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq); 231 231 ··· 297 297 int (*emulate_mtspr)(struct kvm_vcpu *vcpu, int sprn, ulong spr_val); 298 298 int (*emulate_mfspr)(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val); 299 299 void (*fast_vcpu_kick)(struct kvm_vcpu *vcpu); 300 - long (*arch_vm_ioctl)(struct file *filp, unsigned int ioctl, 301 - unsigned long arg); 300 + int (*arch_vm_ioctl)(struct file *filp, unsigned int ioctl, 301 + unsigned long arg); 302 302 
int (*hcall_implemented)(unsigned long hcall); 303 303 int (*irq_bypass_add_producer)(struct irq_bypass_consumer *, 304 304 struct irq_bypass_producer *);

+7 -7

arch/powerpc/kvm/book3s_64_mmu_hv.c

··· 124 124 info->virt, (long)info->order, kvm->arch.lpid); 125 125 } 126 126 127 - long kvmppc_alloc_reset_hpt(struct kvm *kvm, int order) 127 + int kvmppc_alloc_reset_hpt(struct kvm *kvm, int order) 128 128 { 129 - long err = -EBUSY; 129 + int err = -EBUSY; 130 130 struct kvm_hpt_info info; 131 131 132 132 mutex_lock(&kvm->arch.mmu_setup_lock); ··· 1482 1482 mutex_unlock(&kvm->arch.mmu_setup_lock); 1483 1483 } 1484 1484 1485 - long kvm_vm_ioctl_resize_hpt_prepare(struct kvm *kvm, 1486 - struct kvm_ppc_resize_hpt *rhpt) 1485 + int kvm_vm_ioctl_resize_hpt_prepare(struct kvm *kvm, 1486 + struct kvm_ppc_resize_hpt *rhpt) 1487 1487 { 1488 1488 unsigned long flags = rhpt->flags; 1489 1489 unsigned long shift = rhpt->shift; ··· 1548 1548 /* Nothing to do, just force a KVM exit */ 1549 1549 } 1550 1550 1551 - long kvm_vm_ioctl_resize_hpt_commit(struct kvm *kvm, 1552 - struct kvm_ppc_resize_hpt *rhpt) 1551 + int kvm_vm_ioctl_resize_hpt_commit(struct kvm *kvm, 1552 + struct kvm_ppc_resize_hpt *rhpt) 1553 1553 { 1554 1554 unsigned long flags = rhpt->flags; 1555 1555 unsigned long shift = rhpt->shift; 1556 1556 struct kvm_resize_hpt *resize; 1557 - long ret; 1557 + int ret; 1558 1558 1559 1559 if (flags != 0 || kvm_is_radix(kvm)) 1560 1560 return -EINVAL;

+2 -2

arch/powerpc/kvm/book3s_64_vio.c

··· 288 288 .release = kvm_spapr_tce_release, 289 289 }; 290 290 291 - long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, 292 - struct kvm_create_spapr_tce_64 *args) 291 + int kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, 292 + struct kvm_create_spapr_tce_64 *args) 293 293 { 294 294 struct kvmppc_spapr_tce_table *stt = NULL; 295 295 struct kvmppc_spapr_tce_table *siter;

+3 -3

arch/powerpc/kvm/book3s_hv.c

··· 5799 5799 } 5800 5800 #endif 5801 5801 5802 - static long kvm_arch_vm_ioctl_hv(struct file *filp, 5803 - unsigned int ioctl, unsigned long arg) 5802 + static int kvm_arch_vm_ioctl_hv(struct file *filp, 5803 + unsigned int ioctl, unsigned long arg) 5804 5804 { 5805 5805 struct kvm *kvm __maybe_unused = filp->private_data; 5806 5806 void __user *argp = (void __user *)arg; 5807 - long r; 5807 + int r; 5808 5808 5809 5809 switch (ioctl) { 5810 5810

+2 -2

arch/powerpc/kvm/book3s_pr.c

··· 2044 2044 return 0; 2045 2045 } 2046 2046 2047 - static long kvm_arch_vm_ioctl_pr(struct file *filp, 2048 - unsigned int ioctl, unsigned long arg) 2047 + static int kvm_arch_vm_ioctl_pr(struct file *filp, 2048 + unsigned int ioctl, unsigned long arg) 2049 2049 { 2050 2050 return -ENOTTY; 2051 2051 }

+2 -3

arch/powerpc/kvm/powerpc.c

··· 2379 2379 } 2380 2380 #endif 2381 2381 2382 - long kvm_arch_vm_ioctl(struct file *filp, 2383 - unsigned int ioctl, unsigned long arg) 2382 + int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 2384 2383 { 2385 2384 struct kvm *kvm __maybe_unused = filp->private_data; 2386 2385 void __user *argp = (void __user *)arg; 2387 - long r; 2386 + int r; 2388 2387 2389 2388 switch (ioctl) { 2390 2389 case KVM_PPC_GET_PVINFO: {

+1 -2

arch/riscv/kvm/vm.c

··· 87 87 return r; 88 88 } 89 89 90 - long kvm_arch_vm_ioctl(struct file *filp, 91 - unsigned int ioctl, unsigned long arg) 90 + int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 92 91 { 93 92 return -EINVAL; 94 93 }

+2 -2

arch/s390/kvm/interrupt.c

··· 305 305 306 306 static inline int gisa_in_alert_list(struct kvm_s390_gisa *gisa) 307 307 { 308 - return READ_ONCE(gisa->next_alert) != (u32)(u64)gisa; 308 + return READ_ONCE(gisa->next_alert) != (u32)virt_to_phys(gisa); 309 309 } 310 310 311 311 static inline void gisa_set_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) ··· 3168 3168 hrtimer_init(&gi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); 3169 3169 gi->timer.function = gisa_vcpu_kicker; 3170 3170 memset(gi->origin, 0, sizeof(struct kvm_s390_gisa)); 3171 - gi->origin->next_alert = (u32)(u64)gi->origin; 3171 + gi->origin->next_alert = (u32)virt_to_phys(gi->origin); 3172 3172 VM_EVENT(kvm, 3, "gisa 0x%pK initialized", gi->origin); 3173 3173 } 3174 3174

+3 -4

arch/s390/kvm/kvm-s390.c

··· 1990 1990 return ret; 1991 1991 } 1992 1992 1993 - static long kvm_s390_get_skeys(struct kvm *kvm, struct kvm_s390_skeys *args) 1993 + static int kvm_s390_get_skeys(struct kvm *kvm, struct kvm_s390_skeys *args) 1994 1994 { 1995 1995 uint8_t *keys; 1996 1996 uint64_t hva; ··· 2038 2038 return r; 2039 2039 } 2040 2040 2041 - static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args) 2041 + static int kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args) 2042 2042 { 2043 2043 uint8_t *keys; 2044 2044 uint64_t hva; ··· 2899 2899 } 2900 2900 } 2901 2901 2902 - long kvm_arch_vm_ioctl(struct file *filp, 2903 - unsigned int ioctl, unsigned long arg) 2902 + int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 2904 2903 { 2905 2904 struct kvm *kvm = filp->private_data; 2906 2905 void __user *argp = (void __user *)arg;

+1 -1

arch/s390/kvm/pci.c

··· 112 112 return -EINVAL; 113 113 114 114 aift->sbv = zpci_aif_sbv; 115 - aift->gait = (struct zpci_gaite *)zpci_aipb->aipb.gait; 115 + aift->gait = phys_to_virt(zpci_aipb->aipb.gait); 116 116 117 117 return 0; 118 118 }

+29 -21

arch/s390/kvm/vsie.c

··· 138 138 } 139 139 /* Copy to APCB FORMAT1 from APCB FORMAT0 */ 140 140 static int setup_apcb10(struct kvm_vcpu *vcpu, struct kvm_s390_apcb1 *apcb_s, 141 - unsigned long apcb_o, struct kvm_s390_apcb1 *apcb_h) 141 + unsigned long crycb_gpa, struct kvm_s390_apcb1 *apcb_h) 142 142 { 143 143 struct kvm_s390_apcb0 tmp; 144 + unsigned long apcb_gpa; 144 145 145 - if (read_guest_real(vcpu, apcb_o, &tmp, sizeof(struct kvm_s390_apcb0))) 146 + apcb_gpa = crycb_gpa + offsetof(struct kvm_s390_crypto_cb, apcb0); 147 + 148 + if (read_guest_real(vcpu, apcb_gpa, &tmp, 149 + sizeof(struct kvm_s390_apcb0))) 146 150 return -EFAULT; 147 151 148 152 apcb_s->apm[0] = apcb_h->apm[0] & tmp.apm[0]; ··· 161 157 * setup_apcb00 - Copy to APCB FORMAT0 from APCB FORMAT0 162 158 * @vcpu: pointer to the virtual CPU 163 159 * @apcb_s: pointer to start of apcb in the shadow crycb 164 - * @apcb_o: pointer to start of original apcb in the guest2 160 + * @crycb_gpa: guest physical address to start of original guest crycb 165 161 * @apcb_h: pointer to start of apcb in the guest1 166 162 * 167 163 * Returns 0 and -EFAULT on error reading guest apcb 168 164 */ 169 165 static int setup_apcb00(struct kvm_vcpu *vcpu, unsigned long *apcb_s, 170 - unsigned long apcb_o, unsigned long *apcb_h) 166 + unsigned long crycb_gpa, unsigned long *apcb_h) 171 167 { 172 - if (read_guest_real(vcpu, apcb_o, apcb_s, 168 + unsigned long apcb_gpa; 169 + 170 + apcb_gpa = crycb_gpa + offsetof(struct kvm_s390_crypto_cb, apcb0); 171 + 172 + if (read_guest_real(vcpu, apcb_gpa, apcb_s, 173 173 sizeof(struct kvm_s390_apcb0))) 174 174 return -EFAULT; 175 175 ··· 186 178 * setup_apcb11 - Copy the FORMAT1 APCB from the guest to the shadow CRYCB 187 179 * @vcpu: pointer to the virtual CPU 188 180 * @apcb_s: pointer to start of apcb in the shadow crycb 189 - * @apcb_o: pointer to start of original guest apcb 181 + * @crycb_gpa: guest physical address to start of original guest crycb 190 182 * @apcb_h: pointer to start of apcb in the 
host 191 183 * 192 184 * Returns 0 and -EFAULT on error reading guest apcb 193 185 */ 194 186 static int setup_apcb11(struct kvm_vcpu *vcpu, unsigned long *apcb_s, 195 - unsigned long apcb_o, 187 + unsigned long crycb_gpa, 196 188 unsigned long *apcb_h) 197 189 { 198 - if (read_guest_real(vcpu, apcb_o, apcb_s, 190 + unsigned long apcb_gpa; 191 + 192 + apcb_gpa = crycb_gpa + offsetof(struct kvm_s390_crypto_cb, apcb1); 193 + 194 + if (read_guest_real(vcpu, apcb_gpa, apcb_s, 199 195 sizeof(struct kvm_s390_apcb1))) 200 196 return -EFAULT; 201 197 ··· 212 200 * setup_apcb - Create a shadow copy of the apcb. 213 201 * @vcpu: pointer to the virtual CPU 214 202 * @crycb_s: pointer to shadow crycb 215 - * @crycb_o: pointer to original guest crycb 203 + * @crycb_gpa: guest physical address of original guest crycb 216 204 * @crycb_h: pointer to the host crycb 217 205 * @fmt_o: format of the original guest crycb. 218 206 * @fmt_h: format of the host crycb. ··· 223 211 * Return 0 or an error number if the guest and host crycb are incompatible. 
224 212 */ 225 213 static int setup_apcb(struct kvm_vcpu *vcpu, struct kvm_s390_crypto_cb *crycb_s, 226 - const u32 crycb_o, 214 + const u32 crycb_gpa, 227 215 struct kvm_s390_crypto_cb *crycb_h, 228 216 int fmt_o, int fmt_h) 229 217 { 230 - struct kvm_s390_crypto_cb *crycb; 231 - 232 - crycb = (struct kvm_s390_crypto_cb *) (unsigned long)crycb_o; 233 - 234 218 switch (fmt_o) { 235 219 case CRYCB_FORMAT2: 236 - if ((crycb_o & PAGE_MASK) != ((crycb_o + 256) & PAGE_MASK)) 220 + if ((crycb_gpa & PAGE_MASK) != ((crycb_gpa + 256) & PAGE_MASK)) 237 221 return -EACCES; 238 222 if (fmt_h != CRYCB_FORMAT2) 239 223 return -EINVAL; 240 224 return setup_apcb11(vcpu, (unsigned long *)&crycb_s->apcb1, 241 - (unsigned long) &crycb->apcb1, 225 + crycb_gpa, 242 226 (unsigned long *)&crycb_h->apcb1); 243 227 case CRYCB_FORMAT1: 244 228 switch (fmt_h) { 245 229 case CRYCB_FORMAT2: 246 230 return setup_apcb10(vcpu, &crycb_s->apcb1, 247 - (unsigned long) &crycb->apcb0, 231 + crycb_gpa, 248 232 &crycb_h->apcb1); 249 233 case CRYCB_FORMAT1: 250 234 return setup_apcb00(vcpu, 251 235 (unsigned long *) &crycb_s->apcb0, 252 - (unsigned long) &crycb->apcb0, 236 + crycb_gpa, 253 237 (unsigned long *) &crycb_h->apcb0); 254 238 } 255 239 break; 256 240 case CRYCB_FORMAT0: 257 - if ((crycb_o & PAGE_MASK) != ((crycb_o + 32) & PAGE_MASK)) 241 + if ((crycb_gpa & PAGE_MASK) != ((crycb_gpa + 32) & PAGE_MASK)) 258 242 return -EACCES; 259 243 260 244 switch (fmt_h) { 261 245 case CRYCB_FORMAT2: 262 246 return setup_apcb10(vcpu, &crycb_s->apcb1, 263 - (unsigned long) &crycb->apcb0, 247 + crycb_gpa, 264 248 &crycb_h->apcb1); 265 249 case CRYCB_FORMAT1: 266 250 case CRYCB_FORMAT0: 267 251 return setup_apcb00(vcpu, 268 252 (unsigned long *) &crycb_s->apcb0, 269 - (unsigned long) &crycb->apcb0, 253 + crycb_gpa, 270 254 (unsigned long *) &crycb_h->apcb0); 271 255 } 272 256 }
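
Note on the vsie.c hunks above: the helpers no longer receive a value produced by casting a guest address to a host pointer and taking a field address. They receive the guest physical address of the CRYCB and add a field offset to it. A minimal sketch of that arithmetic in plain C follows. The `demo_*` types, their sizes and the 4 KiB page size are assumptions made up for illustration; they are not the real `struct kvm_s390_crypto_cb` layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/* Hypothetical control block layout, for illustration only. */
struct demo_apcb0 {
	uint64_t apm, aqm, adm, reserved;
};

struct demo_apcb1 {
	uint64_t apm[4], aqm[4], adm[4], reserved[4];
};

struct demo_crycb {
	uint8_t reserved[72];
	struct demo_apcb0 apcb0;
	struct demo_apcb1 apcb1;
};

/* Guest address of a field = guest address of the block + field offset. */
static uint64_t demo_apcb0_gpa(uint64_t crycb_gpa)
{
	return crycb_gpa + offsetof(struct demo_crycb, apcb0);
}

static uint64_t demo_apcb1_gpa(uint64_t crycb_gpa)
{
	return crycb_gpa + offsetof(struct demo_crycb, apcb1);
}

/* Same shape as the page-crossing test in setup_apcb(). */
static int demo_crosses_page(uint64_t gpa, uint64_t len)
{
	return (gpa & DEMO_PAGE_MASK) != ((gpa + len) & DEMO_PAGE_MASK);
}
```

The guest address is never dereferenced here, and it is only ever used as a number. In the diff, the resulting value is handed to `read_guest_real()`, which is the only place the guest memory is actually read.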

+5 -4

arch/x86/include/asm/cpufeatures.h

··· 226 226 227 227 /* Virtualization flags: Linux defined, word 8 */ 228 228 #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */ 229 - #define X86_FEATURE_VNMI ( 8*32+ 1) /* Intel Virtual NMI */ 230 - #define X86_FEATURE_FLEXPRIORITY ( 8*32+ 2) /* Intel FlexPriority */ 231 - #define X86_FEATURE_EPT ( 8*32+ 3) /* Intel Extended Page Table */ 232 - #define X86_FEATURE_VPID ( 8*32+ 4) /* Intel Virtual Processor ID */ 229 + #define X86_FEATURE_FLEXPRIORITY ( 8*32+ 1) /* Intel FlexPriority */ 230 + #define X86_FEATURE_EPT ( 8*32+ 2) /* Intel Extended Page Table */ 231 + #define X86_FEATURE_VPID ( 8*32+ 3) /* Intel Virtual Processor ID */ 233 232 234 233 #define X86_FEATURE_VMMCALL ( 8*32+15) /* Prefer VMMCALL to VMCALL */ 235 234 #define X86_FEATURE_XENPV ( 8*32+16) /* "" Xen paravirtual guest */ ··· 337 338 #define X86_FEATURE_VIRT_SSBD (13*32+25) /* Virtualized Speculative Store Bypass Disable */ 338 339 #define X86_FEATURE_AMD_SSB_NO (13*32+26) /* "" Speculative Store Bypass is fixed in hardware. */ 339 340 #define X86_FEATURE_CPPC (13*32+27) /* Collaborative Processor Performance Control */ 341 + #define X86_FEATURE_AMD_PSFD (13*32+28) /* "" Predictive Store Forwarding Disable */ 340 342 #define X86_FEATURE_BTC_NO (13*32+29) /* "" Not vulnerable to Branch Type Confusion */ 341 343 #define X86_FEATURE_BRS (13*32+31) /* Branch Sampling available */ 342 344 ··· 370 370 #define X86_FEATURE_VGIF (15*32+16) /* Virtual GIF */ 371 371 #define X86_FEATURE_X2AVIC (15*32+18) /* Virtual x2apic */ 372 372 #define X86_FEATURE_V_SPEC_CTRL (15*32+20) /* Virtual SPEC_CTRL */ 373 + #define X86_FEATURE_VNMI (15*32+25) /* Virtual NMI */ 373 374 #define X86_FEATURE_SVME_ADDR_CHK (15*32+28) /* "" SVME addr check */ 374 375 375 376 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ECX), word 16 */

+4 -2

arch/x86/include/asm/kvm-x86-ops.h

··· 54 54 KVM_X86_OP(get_if_flag) 55 55 KVM_X86_OP(flush_tlb_all) 56 56 KVM_X86_OP(flush_tlb_current) 57 - KVM_X86_OP_OPTIONAL(tlb_remote_flush) 58 - KVM_X86_OP_OPTIONAL(tlb_remote_flush_with_range) 57 + KVM_X86_OP_OPTIONAL(flush_remote_tlbs) 58 + KVM_X86_OP_OPTIONAL(flush_remote_tlbs_range) 59 59 KVM_X86_OP(flush_tlb_gva) 60 60 KVM_X86_OP(flush_tlb_guest) 61 61 KVM_X86_OP(vcpu_pre_run) ··· 68 68 KVM_X86_OP(patch_hypercall) 69 69 KVM_X86_OP(inject_irq) 70 70 KVM_X86_OP(inject_nmi) 71 + KVM_X86_OP_OPTIONAL_RET0(is_vnmi_pending) 72 + KVM_X86_OP_OPTIONAL_RET0(set_vnmi_pending) 71 73 KVM_X86_OP(inject_exception) 72 74 KVM_X86_OP(cancel_injection) 73 75 KVM_X86_OP(interrupt_allowed)
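
Note on the optional hooks above: `flush_remote_tlbs_range` is declared optional, and the mmu.c hunk later in this diff falls back to a full remote flush when the hook is absent or returns an error. A minimal user-space sketch of that "optional hook with fallback" shape, assuming a hypothetical `struct demo_ops` and a counter in place of the real flush:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table; only the optional ranged hook is modeled. */
struct demo_ops {
	int (*flush_remote_tlbs_range)(uint64_t start_gfn, uint64_t nr_pages);
};

/* Counts how often the fallback (full flush) was taken. */
static unsigned int demo_full_flushes;

static void demo_flush_all(void)
{
	demo_full_flushes++;
}

/* Example hook that succeeds. */
static int demo_hook_ok(uint64_t start_gfn, uint64_t nr_pages)
{
	(void)start_gfn;
	(void)nr_pages;
	return 0;
}

/* Example hook that fails, forcing the fallback. */
static int demo_hook_fail(uint64_t start_gfn, uint64_t nr_pages)
{
	(void)start_gfn;
	(void)nr_pages;
	return -EIO;
}

static void demo_flush_range(const struct demo_ops *ops,
			     uint64_t start_gfn, uint64_t nr_pages)
{
	int ret = -EOPNOTSUPP;

	/* A missing hook is treated exactly like a failed ranged flush. */
	if (ops->flush_remote_tlbs_range)
		ret = ops->flush_remote_tlbs_range(start_gfn, nr_pages);
	if (ret)
		demo_flush_all();
}
```

Initializing `ret` to an error means a single `if (ret)` covers both the "no hook" and the "hook failed" case.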

+51 -38

arch/x86/include/asm/kvm_host.h

··· 420 420 421 421 #define KVM_MMU_NUM_PREV_ROOTS 3 422 422 423 + #define KVM_MMU_ROOT_CURRENT BIT(0) 424 + #define KVM_MMU_ROOT_PREVIOUS(i) BIT(1+i) 425 + #define KVM_MMU_ROOTS_ALL (BIT(1 + KVM_MMU_NUM_PREV_ROOTS) - 1) 426 + 423 427 #define KVM_HAVE_MMU_RWLOCK 424 428 425 429 struct kvm_mmu_page; ··· 443 439 gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 444 440 gpa_t gva_or_gpa, u64 access, 445 441 struct x86_exception *exception); 446 - int (*sync_page)(struct kvm_vcpu *vcpu, 447 - struct kvm_mmu_page *sp); 448 - void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa); 442 + int (*sync_spte)(struct kvm_vcpu *vcpu, 443 + struct kvm_mmu_page *sp, int i); 449 444 struct kvm_mmu_root_info root; 450 445 union kvm_cpu_role cpu_role; 451 446 union kvm_mmu_page_role root_role; ··· 482 479 u64 pdptrs[4]; /* pae */ 483 480 }; 484 481 485 - struct kvm_tlb_range { 486 - u64 start_gfn; 487 - u64 pages; 488 - }; 489 - 490 482 enum pmc_type { 491 483 KVM_PMC_GP = 0, 492 484 KVM_PMC_FIXED, ··· 513 515 #define MSR_ARCH_PERFMON_FIXED_CTR_MAX (MSR_ARCH_PERFMON_FIXED_CTR0 + KVM_PMC_MAX_FIXED - 1) 514 516 #define KVM_AMD_PMC_MAX_GENERIC 6 515 517 struct kvm_pmu { 518 + u8 version; 516 519 unsigned nr_arch_gp_counters; 517 520 unsigned nr_arch_fixed_counters; 518 521 unsigned available_event_types; ··· 526 527 u64 global_ovf_ctrl_mask; 527 528 u64 reserved_bits; 528 529 u64 raw_event_mask; 529 - u8 version; 530 530 struct kvm_pmc gp_counters[KVM_INTEL_PMC_MAX_GENERIC]; 531 531 struct kvm_pmc fixed_counters[KVM_PMC_MAX_FIXED]; 532 532 struct irq_work irq_work; ··· 874 876 u64 tsc_scaling_ratio; /* current scaling ratio */ 875 877 876 878 atomic_t nmi_queued; /* unprocessed asynchronous NMIs */ 877 - unsigned nmi_pending; /* NMI queued after currently running handler */ 879 + /* Number of NMIs pending injection, not including hardware vNMIs. 
*/ 880 + unsigned int nmi_pending; 878 881 bool nmi_injected; /* Trying to inject an NMI this entry */ 879 882 bool smi_pending; /* SMI queued after currently running handler */ 880 883 u8 handling_intr_from_guest; ··· 945 946 } pv_eoi; 946 947 947 948 u64 msr_kvm_poll_control; 948 - 949 - /* 950 - * Indicates the guest is trying to write a gfn that contains one or 951 - * more of the PTEs used to translate the write itself, i.e. the access 952 - * is changing its own translation in the guest page tables. KVM exits 953 - * to userspace if emulation of the faulting instruction fails and this 954 - * flag is set, as KVM cannot make forward progress. 955 - * 956 - * If emulation fails for a write to guest page tables, KVM unprotects 957 - * (zaps) the shadow page for the target gfn and resumes the guest to 958 - * retry the non-emulatable instruction (on hardware). Unprotecting the 959 - * gfn doesn't allow forward progress for a self-changing access because 960 - * doing so also zaps the translation for the gfn, i.e. retrying the 961 - * instruction will hit a !PRESENT fault, which results in a new shadow 962 - * page and sends KVM back to square one. 963 - */ 964 - bool write_fault_to_shadow_pgtable; 965 949 966 950 /* set at EPT violation at this point */ 967 951 unsigned long exit_qualification; ··· 1584 1602 1585 1603 void (*flush_tlb_all)(struct kvm_vcpu *vcpu); 1586 1604 void (*flush_tlb_current)(struct kvm_vcpu *vcpu); 1587 - int (*tlb_remote_flush)(struct kvm *kvm); 1588 - int (*tlb_remote_flush_with_range)(struct kvm *kvm, 1589 - struct kvm_tlb_range *range); 1605 + int (*flush_remote_tlbs)(struct kvm *kvm); 1606 + int (*flush_remote_tlbs_range)(struct kvm *kvm, gfn_t gfn, 1607 + gfn_t nr_pages); 1590 1608 1591 1609 /* 1592 1610 * Flush any TLB entries associated with the given GVA. 
··· 1620 1638 int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection); 1621 1639 bool (*get_nmi_mask)(struct kvm_vcpu *vcpu); 1622 1640 void (*set_nmi_mask)(struct kvm_vcpu *vcpu, bool masked); 1641 + /* Whether or not a virtual NMI is pending in hardware. */ 1642 + bool (*is_vnmi_pending)(struct kvm_vcpu *vcpu); 1643 + /* 1644 + * Attempt to pend a virtual NMI in harware. Returns %true on success 1645 + * to allow using static_call_ret0 as the fallback. 1646 + */ 1647 + bool (*set_vnmi_pending)(struct kvm_vcpu *vcpu); 1623 1648 void (*enable_nmi_window)(struct kvm_vcpu *vcpu); 1624 1649 void (*enable_irq_window)(struct kvm_vcpu *vcpu); 1625 1650 void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr); ··· 1797 1808 #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB 1798 1809 static inline int kvm_arch_flush_remote_tlb(struct kvm *kvm) 1799 1810 { 1800 - if (kvm_x86_ops.tlb_remote_flush && 1801 - !static_call(kvm_x86_tlb_remote_flush)(kvm)) 1811 + if (kvm_x86_ops.flush_remote_tlbs && 1812 + !static_call(kvm_x86_flush_remote_tlbs)(kvm)) 1802 1813 return 0; 1803 1814 else 1804 1815 return -ENOTSUPP; ··· 1896 1907 * EMULTYPE_COMPLETE_USER_EXIT - Set when the emulator should update interruptibility 1897 1908 * state and inject single-step #DBs after skipping 1898 1909 * an instruction (after completing userspace I/O). 1910 + * 1911 + * EMULTYPE_WRITE_PF_TO_SP - Set when emulating an intercepted page fault that 1912 + * is attempting to write a gfn that contains one or 1913 + * more of the PTEs used to translate the write itself, 1914 + * and the owning page table is being shadowed by KVM. 1915 + * If emulation of the faulting instruction fails and 1916 + * this flag is set, KVM will exit to userspace instead 1917 + * of retrying emulation as KVM cannot make forward 1918 + * progress. 
1919 + * 1920 + * If emulation fails for a write to guest page tables, 1921 + * KVM unprotects (zaps) the shadow page for the target 1922 + * gfn and resumes the guest to retry the non-emulatable 1923 + * instruction (on hardware). Unprotecting the gfn 1924 + * doesn't allow forward progress for a self-changing 1925 + * access because doing so also zaps the translation for 1926 + * the gfn, i.e. retrying the instruction will hit a 1927 + * !PRESENT fault, which results in a new shadow page 1928 + * and sends KVM back to square one. 1899 1929 */ 1900 1930 #define EMULTYPE_NO_DECODE (1 << 0) 1901 1931 #define EMULTYPE_TRAP_UD (1 << 1) ··· 1924 1916 #define EMULTYPE_VMWARE_GP (1 << 5) 1925 1917 #define EMULTYPE_PF (1 << 6) 1926 1918 #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7) 1919 + #define EMULTYPE_WRITE_PF_TO_SP (1 << 8) 1927 1920 1928 1921 int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type); 1929 1922 int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu, ··· 2003 1994 return !!(*irq_state); 2004 1995 } 2005 1996 2006 - #define KVM_MMU_ROOT_CURRENT BIT(0) 2007 - #define KVM_MMU_ROOT_PREVIOUS(i) BIT(1+i) 2008 - #define KVM_MMU_ROOTS_ALL (~0UL) 2009 - 2010 1997 int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level); 2011 1998 void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); 2012 1999 2013 2000 void kvm_inject_nmi(struct kvm_vcpu *vcpu); 2001 + int kvm_get_nr_pending_nmis(struct kvm_vcpu *vcpu); 2014 2002 2015 2003 void kvm_update_dr7(struct kvm_vcpu *vcpu); 2016 2004 ··· 2047 2041 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, 2048 2042 void *insn, int insn_len); 2049 2043 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva); 2050 - void kvm_mmu_invalidate_gva(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 2051 - gva_t gva, hpa_t root_hpa); 2044 + void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 2045 + u64 addr, unsigned long roots); 2052 2046 
void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid); 2053 2047 void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd); 2054 2048 ··· 2209 2203 KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT | \ 2210 2204 KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \ 2211 2205 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS) 2206 + 2207 + /* 2208 + * KVM previously used a u32 field in kvm_run to indicate the hypercall was 2209 + * initiated from long mode. KVM now sets bit 0 to indicate long mode, but the 2210 + * remaining 31 lower bits must be 0 to preserve ABI. 2211 + */ 2212 + #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1) 2212 2213 2213 2214 #endif /* _ASM_X86_KVM_HOST_H */
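
Note on the root masks above: `KVM_MMU_ROOTS_ALL` changes from `~0UL` to `BIT(1 + KVM_MMU_NUM_PREV_ROOTS) - 1`, i.e. a mask with exactly one bit for the current root and one per previous root. A small sketch that checks the arithmetic, using `DEMO_*` copies of the macros (the names are stand-ins; the values mirror the diff):

```c
#include <assert.h>

#define DEMO_BIT(n)            (1UL << (n))
#define DEMO_NUM_PREV_ROOTS    3
#define DEMO_ROOT_CURRENT      DEMO_BIT(0)
#define DEMO_ROOT_PREVIOUS(i)  DEMO_BIT(1 + (i))
#define DEMO_ROOTS_ALL         (DEMO_BIT(1 + DEMO_NUM_PREV_ROOTS) - 1)

/* Build the mask the long way: current root plus every previous root. */
static unsigned long demo_all_roots(void)
{
	unsigned long mask = DEMO_ROOT_CURRENT;

	for (int i = 0; i < DEMO_NUM_PREV_ROOTS; i++)
		mask |= DEMO_ROOT_PREVIOUS(i);

	return mask;
}
```

With three previous roots both forms evaluate to `0xF`, so a caller that is handed the "all roots" mask only sees bits that correspond to roots that exist.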

+9 -1

arch/x86/include/asm/svm.h

··· 183 183 #define V_GIF_SHIFT 9 184 184 #define V_GIF_MASK (1 << V_GIF_SHIFT) 185 185 186 + #define V_NMI_PENDING_SHIFT 11 187 + #define V_NMI_PENDING_MASK (1 << V_NMI_PENDING_SHIFT) 188 + 189 + #define V_NMI_BLOCKING_SHIFT 12 190 + #define V_NMI_BLOCKING_MASK (1 << V_NMI_BLOCKING_SHIFT) 191 + 186 192 #define V_INTR_PRIO_SHIFT 16 187 193 #define V_INTR_PRIO_MASK (0x0f << V_INTR_PRIO_SHIFT) 188 194 ··· 202 196 203 197 #define V_GIF_ENABLE_SHIFT 25 204 198 #define V_GIF_ENABLE_MASK (1 << V_GIF_ENABLE_SHIFT) 199 + 200 + #define V_NMI_ENABLE_SHIFT 26 201 + #define V_NMI_ENABLE_MASK (1 << V_NMI_ENABLE_SHIFT) 205 202 206 203 #define AVIC_ENABLE_SHIFT 31 207 204 #define AVIC_ENABLE_MASK (1 << AVIC_ENABLE_SHIFT) ··· 287 278 static_assert((X2AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_MAX_PHYSICAL_ID); 288 279 289 280 #define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF) 290 - #define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL 291 281 292 282 293 283 struct vmcb_seg {

+3

arch/x86/include/uapi/asm/kvm.h

··· 559 559 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */ 560 560 #define KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */ 561 561 562 + /* x86-specific KVM_EXIT_HYPERCALL flags. */ 563 + #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0) 564 + 562 565 #endif /* _ASM_X86_KVM_H */
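
Note on the hypercall flags: together with `KVM_EXIT_HYPERCALL_MBZ` (`GENMASK_ULL(31, 1)`) from kvm_host.h earlier in this diff, bit 0 carries the long-mode indication and bits 31:1 must stay zero. A sketch of how such a flags word could be checked; the `demo_*` helpers and the 64-bit flags type are assumptions for illustration, not functions from the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local re-creation of GENMASK_ULL(h, l): bits l..h set, inclusive. */
#define DEMO_GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

#define DEMO_HYPERCALL_LONG_MODE (1ULL << 0)
#define DEMO_HYPERCALL_MBZ       DEMO_GENMASK_ULL(31, 1)

/* True if none of the must-be-zero bits are set. */
static bool demo_hypercall_flags_valid(uint64_t flags)
{
	return !(flags & DEMO_HYPERCALL_MBZ);
}

/* True if the long-mode bit is set. */
static bool demo_hypercall_is_long_mode(uint64_t flags)
{
	return !!(flags & DEMO_HYPERCALL_LONG_MODE);
}
```

The mask evaluates to `0xFFFFFFFE`, so the flag bit and the must-be-zero bits cannot overlap.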

+6 -12

arch/x86/kvm/cpuid.c

··· 60 60 return ret; 61 61 } 62 62 63 - /* 64 - * This one is tied to SSB in the user API, and not 65 - * visible in /proc/cpuinfo. 66 - */ 67 - #define KVM_X86_FEATURE_AMD_PSFD (13*32+28) /* Predictive Store Forwarding Disable */ 68 - 69 63 #define F feature_bit 70 64 71 65 /* Scattered Flag - For features that are scattered by cpufeatures.h. */ ··· 260 266 /* Update OSXSAVE bit */ 261 267 if (boot_cpu_has(X86_FEATURE_XSAVE)) 262 268 cpuid_entry_change(best, X86_FEATURE_OSXSAVE, 263 - kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)); 269 + kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)); 264 270 265 271 cpuid_entry_change(best, X86_FEATURE_APIC, 266 272 vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE); ··· 269 275 best = cpuid_entry2_find(entries, nent, 7, 0); 270 276 if (best && boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) 271 277 cpuid_entry_change(best, X86_FEATURE_OSPKE, 272 - kvm_read_cr4_bits(vcpu, X86_CR4_PKE)); 278 + kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)); 273 279 274 280 best = cpuid_entry2_find(entries, nent, 0xD, 0); 275 281 if (best) ··· 414 420 * KVM_SET_CPUID{,2} again. To support this legacy behavior, check 415 421 * whether the supplied CPUID data is equal to what's already set. 416 422 */ 417 - if (vcpu->arch.last_vmentry_cpu != -1) { 423 + if (kvm_vcpu_has_run(vcpu)) { 418 424 r = kvm_cpuid_check_equal(vcpu, e2, nent); 419 425 if (r) 420 426 return r; ··· 647 653 F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) | 648 654 F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) | 649 655 F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) | 650 - F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) 656 + F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) 651 657 ); 652 658 653 659 /* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. 
*/ ··· 709 715 F(CLZERO) | F(XSAVEERPTR) | 710 716 F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | F(VIRT_SSBD) | 711 717 F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON) | 712 - __feature_bit(KVM_X86_FEATURE_AMD_PSFD) 718 + F(AMD_PSFD) 713 719 ); 714 720 715 721 /* ··· 996 1002 entry->eax = entry->ebx = entry->ecx = 0; 997 1003 break; 998 1004 case 0xd: { 999 - u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm(); 1005 + u64 permitted_xcr0 = kvm_get_filtered_xcr0(); 1000 1006 u64 permitted_xss = kvm_caps.supported_xss; 1001 1007 1002 1008 entry->eax &= permitted_xcr0;

+8

arch/x86/kvm/emulate.c

··· 1640 1640 goto exception; 1641 1641 break; 1642 1642 case VCPU_SREG_CS: 1643 + /* 1644 + * KVM uses "none" when loading CS as part of emulating Real 1645 + * Mode exceptions and IRET (handled above). In all other 1646 + * cases, loading CS without a control transfer is a KVM bug. 1647 + */ 1648 + if (WARN_ON_ONCE(transfer == X86_TRANSFER_NONE)) 1649 + goto exception; 1650 + 1643 1651 if (!(seg_desc.type & 8)) 1644 1652 goto exception; 1645 1653

+17 -1

arch/x86/kvm/kvm_cache_regs.h

··· 4 4 5 5 #include <linux/kvm_host.h> 6 6 7 - #define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS 7 + #define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS | X86_CR0_WP) 8 8 #define KVM_POSSIBLE_CR4_GUEST_BITS \ 9 9 (X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \ 10 10 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE) ··· 157 157 return vcpu->arch.cr0 & mask; 158 158 } 159 159 160 + static __always_inline bool kvm_is_cr0_bit_set(struct kvm_vcpu *vcpu, 161 + unsigned long cr0_bit) 162 + { 163 + BUILD_BUG_ON(!is_power_of_2(cr0_bit)); 164 + 165 + return !!kvm_read_cr0_bits(vcpu, cr0_bit); 166 + } 167 + 160 168 static inline ulong kvm_read_cr0(struct kvm_vcpu *vcpu) 161 169 { 162 170 return kvm_read_cr0_bits(vcpu, ~0UL); ··· 177 169 !kvm_register_is_available(vcpu, VCPU_EXREG_CR4)) 178 170 static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR4); 179 171 return vcpu->arch.cr4 & mask; 172 + } 173 + 174 + static __always_inline bool kvm_is_cr4_bit_set(struct kvm_vcpu *vcpu, 175 + unsigned long cr4_bit) 176 + { 177 + BUILD_BUG_ON(!is_power_of_2(cr4_bit)); 178 + 179 + return !!kvm_read_cr4_bits(vcpu, cr4_bit); 180 180 } 181 181 182 182 static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
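
Note on the new helpers: `kvm_is_cr0_bit_set()` and `kvm_is_cr4_bit_set()` accept exactly one bit and return a `bool` via `!!`, where the older `kvm_read_cr4_bits()` returns the masked register value. A user-space sketch of the same pattern follows. `struct demo_vcpu` is a hypothetical stand-in, and the runtime `assert()` substitutes for the compile-time `BUILD_BUG_ON()` used in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cached register state. */
struct demo_vcpu {
	unsigned long cr4;
};

/* Returns the masked value, like kvm_read_cr4_bits(). */
static unsigned long demo_read_cr4_bits(const struct demo_vcpu *vcpu,
					unsigned long mask)
{
	return vcpu->cr4 & mask;
}

static bool demo_is_power_of_2(unsigned long x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Returns true/false for a single bit, like kvm_is_cr4_bit_set(). */
static bool demo_is_cr4_bit_set(const struct demo_vcpu *vcpu,
				unsigned long cr4_bit)
{
	/* Runtime stand-in for BUILD_BUG_ON(!is_power_of_2(cr4_bit)). */
	assert(demo_is_power_of_2(cr4_bit));

	return !!demo_read_cr4_bits(vcpu, cr4_bit);
}
```

One visible difference: the masked value for bit 18 is `1UL << 18`, which becomes 0 if it is ever narrowed to an 8-bit type, while the helper's result is always exactly `true` or `false`.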

+24 -9

arch/x86/kvm/kvm_onhyperv.c

··· 10 10 #include "hyperv.h" 11 11 #include "kvm_onhyperv.h" 12 12 13 + struct kvm_hv_tlb_range { 14 + u64 start_gfn; 15 + u64 pages; 16 + }; 17 + 13 18 static int kvm_fill_hv_flush_list_func(struct hv_guest_mapping_flush_list *flush, 14 19 void *data) 15 20 { 16 - struct kvm_tlb_range *range = data; 21 + struct kvm_hv_tlb_range *range = data; 17 22 18 23 return hyperv_fill_flush_guest_mapping_list(flush, range->start_gfn, 19 24 range->pages); 20 25 } 21 26 22 27 static inline int hv_remote_flush_root_tdp(hpa_t root_tdp, 23 - struct kvm_tlb_range *range) 28 + struct kvm_hv_tlb_range *range) 24 29 { 25 30 if (range) 26 31 return hyperv_flush_guest_mapping_range(root_tdp, ··· 34 29 return hyperv_flush_guest_mapping(root_tdp); 35 30 } 36 31 37 - int hv_remote_flush_tlb_with_range(struct kvm *kvm, 38 - struct kvm_tlb_range *range) 32 + static int __hv_flush_remote_tlbs_range(struct kvm *kvm, 33 + struct kvm_hv_tlb_range *range) 39 34 { 40 35 struct kvm_arch *kvm_arch = &kvm->arch; 41 36 struct kvm_vcpu *vcpu; ··· 91 86 spin_unlock(&kvm_arch->hv_root_tdp_lock); 92 87 return ret; 93 88 } 94 - EXPORT_SYMBOL_GPL(hv_remote_flush_tlb_with_range); 95 89 96 - int hv_remote_flush_tlb(struct kvm *kvm) 90 + int hv_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, gfn_t nr_pages) 97 91 { 98 - return hv_remote_flush_tlb_with_range(kvm, NULL); 92 + struct kvm_hv_tlb_range range = { 93 + .start_gfn = start_gfn, 94 + .pages = nr_pages, 95 + }; 96 + 97 + return __hv_flush_remote_tlbs_range(kvm, &range); 99 98 } 100 - EXPORT_SYMBOL_GPL(hv_remote_flush_tlb); 99 + EXPORT_SYMBOL_GPL(hv_flush_remote_tlbs_range); 100 + 101 + int hv_flush_remote_tlbs(struct kvm *kvm) 102 + { 103 + return __hv_flush_remote_tlbs_range(kvm, NULL); 104 + } 105 + EXPORT_SYMBOL_GPL(hv_flush_remote_tlbs); 101 106 102 107 void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp) 103 108 { 104 109 struct kvm_arch *kvm_arch = &vcpu->kvm->arch; 105 110 106 - if (kvm_x86_ops.tlb_remote_flush == 
hv_remote_flush_tlb) { 111 + if (kvm_x86_ops.flush_remote_tlbs == hv_flush_remote_tlbs) { 107 112 spin_lock(&kvm_arch->hv_root_tdp_lock); 108 113 vcpu->arch.hv_root_tdp = root_tdp; 109 114 if (root_tdp != kvm_arch->hv_root_tdp)
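
Note on the kvm_onhyperv.c refactor: the range struct becomes private to the file, the public entry points take plain `(start_gfn, nr_pages)` arguments, and both entry points funnel into one internal function where a `NULL` range means "flush everything". A sketch of that wrapper shape, with hypothetical `demo_*` names and a recorder in place of the real hypercalls:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* File-private range description, analogous to struct kvm_hv_tlb_range. */
struct demo_range {
	uint64_t start_gfn;
	uint64_t pages;
};

/* Records the most recent request so the behavior can be observed. */
static struct {
	int was_full;
	struct demo_range range;
} demo_last;

/* Shared implementation: NULL selects a full flush. */
static int demo_flush_impl(const struct demo_range *range)
{
	demo_last.was_full = (range == NULL);
	if (range)
		demo_last.range = *range;
	return 0;
}

/* Ranged entry point: packs its arguments into the private struct. */
static int demo_flush_range(uint64_t start_gfn, uint64_t nr_pages)
{
	struct demo_range range = {
		.start_gfn = start_gfn,
		.pages = nr_pages,
	};

	return demo_flush_impl(&range);
}

/* Full-flush entry point. */
static int demo_flush_all(void)
{
	return demo_flush_impl(NULL);
}
```

Keeping the struct out of the header means callers depend only on two scalar parameters, which matches the new prototypes in kvm_onhyperv.h.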

+3 -4

arch/x86/kvm/kvm_onhyperv.h

··· 7 7 #define __ARCH_X86_KVM_KVM_ONHYPERV_H__ 8 8 9 9 #if IS_ENABLED(CONFIG_HYPERV) 10 - int hv_remote_flush_tlb_with_range(struct kvm *kvm, 11 - struct kvm_tlb_range *range); 12 - int hv_remote_flush_tlb(struct kvm *kvm); 10 + int hv_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, gfn_t nr_pages); 11 + int hv_flush_remote_tlbs(struct kvm *kvm); 13 12 void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp); 14 13 #else /* !CONFIG_HYPERV */ 15 - static inline int hv_remote_flush_tlb(struct kvm *kvm) 14 + static inline int hv_flush_remote_tlbs(struct kvm *kvm) 16 15 { 17 16 return -EOPNOTSUPP; 18 17 }

+26 -2

arch/x86/kvm/mmu.h

··· 113 113 bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); 114 114 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, 115 115 u64 fault_address, char *insn, int insn_len); 116 + void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, 117 + struct kvm_mmu *mmu); 116 118 117 119 int kvm_mmu_load(struct kvm_vcpu *vcpu); 118 120 void kvm_mmu_unload(struct kvm_vcpu *vcpu); ··· 134 132 { 135 133 BUILD_BUG_ON((X86_CR3_PCID_MASK & PAGE_MASK) != 0); 136 134 137 - return kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE) 135 + return kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE) 138 136 ? cr3 & X86_CR3_PCID_MASK 139 137 : 0; 140 138 } ··· 153 151 154 152 static_call(kvm_x86_load_mmu_pgd)(vcpu, root_hpa, 155 153 vcpu->arch.mmu->root_role.level); 154 + } 155 + 156 + static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, 157 + struct kvm_mmu *mmu) 158 + { 159 + /* 160 + * When EPT is enabled, KVM may passthrough CR0.WP to the guest, i.e. 161 + * @mmu's snapshot of CR0.WP and thus all related paging metadata may 162 + * be stale. Refresh CR0.WP and the metadata on-demand when checking 163 + * for permission faults. Exempt nested MMUs, i.e. MMUs for shadowing 164 + * nEPT and nNPT, as CR0.WP is ignored in both cases. Note, KVM does 165 + * need to refresh nested_mmu, a.k.a. the walker used to translate L2 166 + * GVAs to GPAs, as that "MMU" needs to honor L2's CR0.WP. 
167 + */ 168 + if (!tdp_enabled || mmu == &vcpu->arch.guest_mmu) 169 + return; 170 + 171 + __kvm_mmu_refresh_passthrough_bits(vcpu, mmu); 156 172 } 157 173 158 174 /* ··· 204 184 u64 implicit_access = access & PFERR_IMPLICIT_ACCESS; 205 185 bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC; 206 186 int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1; 207 - bool fault = (mmu->permissions[index] >> pte_access) & 1; 208 187 u32 errcode = PFERR_PRESENT_MASK; 188 + bool fault; 189 + 190 + kvm_mmu_refresh_passthrough_bits(vcpu, mmu); 191 + 192 + fault = (mmu->permissions[index] >> pte_access) & 1; 209 193 210 194 WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK)); 211 195 if (unlikely(mmu->pkru_mask)) {

+316 -212

arch/x86/kvm/mmu/mmu.c

··· 125 125 #define PTE_LIST_EXT 14 126 126 127 127 /* 128 - * Slight optimization of cacheline layout, by putting `more' and `spte_count' 129 - * at the start; then accessing it will only use one single cacheline for 130 - * either full (entries==PTE_LIST_EXT) case or entries<=6. 128 + * struct pte_list_desc is the core data structure used to implement a custom 129 + * list for tracking a set of related SPTEs, e.g. all the SPTEs that map a 130 + * given GFN when used in the context of rmaps. Using a custom list allows KVM 131 + * to optimize for the common case where many GFNs will have at most a handful 132 + * of SPTEs pointing at them, i.e. allows packing multiple SPTEs into a small 133 + * memory footprint, which in turn improves runtime performance by exploiting 134 + * cache locality. 135 + * 136 + * A list is comprised of one or more pte_list_desc objects (descriptors). 137 + * Each individual descriptor stores up to PTE_LIST_EXT SPTEs. If a descriptor 138 + * is full and a new SPTEs needs to be added, a new descriptor is allocated and 139 + * becomes the head of the list. This means that by definitions, all tail 140 + * descriptors are full. 141 + * 142 + * Note, the meta data fields are deliberately placed at the start of the 143 + * structure to optimize the cacheline layout; accessing the descriptor will 144 + * touch only a single cacheline so long as @spte_count<=6 (or if only the 145 + * descriptors metadata is accessed). 131 146 */ 132 147 struct pte_list_desc { 133 148 struct pte_list_desc *more; 134 - /* 135 - * Stores number of entries stored in the pte_list_desc. No need to be 136 - * u64 but just for easier alignment. When PTE_LIST_EXT, means full. 137 - */ 138 - u64 spte_count; 149 + /* The number of PTEs stored in _this_ descriptor. */ 150 + u32 spte_count; 151 + /* The number of PTEs stored in all tails of this descriptor. 
*/ 152 + u32 tail_count; 139 153 u64 *sptes[PTE_LIST_EXT]; 140 154 }; 141 155 ··· 256 242 return regs; 257 243 } 258 244 259 - static inline bool kvm_available_flush_tlb_with_range(void) 245 + static unsigned long get_guest_cr3(struct kvm_vcpu *vcpu) 260 246 { 261 - return kvm_x86_ops.tlb_remote_flush_with_range; 247 + return kvm_read_cr3(vcpu); 262 248 } 263 249 264 - static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm, 265 - struct kvm_tlb_range *range) 250 + static inline unsigned long kvm_mmu_get_guest_pgd(struct kvm_vcpu *vcpu, 251 + struct kvm_mmu *mmu) 266 252 { 267 - int ret = -ENOTSUPP; 253 + if (IS_ENABLED(CONFIG_RETPOLINE) && mmu->get_guest_pgd == get_guest_cr3) 254 + return kvm_read_cr3(vcpu); 268 255 269 - if (range && kvm_x86_ops.tlb_remote_flush_with_range) 270 - ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range); 256 + return mmu->get_guest_pgd(vcpu); 257 + } 271 258 259 + static inline bool kvm_available_flush_remote_tlbs_range(void) 260 + { 261 + return kvm_x86_ops.flush_remote_tlbs_range; 262 + } 263 + 264 + void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, 265 + gfn_t nr_pages) 266 + { 267 + int ret = -EOPNOTSUPP; 268 + 269 + if (kvm_x86_ops.flush_remote_tlbs_range) 270 + ret = static_call(kvm_x86_flush_remote_tlbs_range)(kvm, start_gfn, 271 + nr_pages); 272 272 if (ret) 273 273 kvm_flush_remote_tlbs(kvm); 274 - } 275 - 276 - void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, 277 - u64 start_gfn, u64 pages) 278 - { 279 - struct kvm_tlb_range range; 280 - 281 - range.start_gfn = start_gfn; 282 - range.pages = pages; 283 - 284 - kvm_flush_remote_tlbs_with_range(kvm, &range); 285 274 } 286 275 287 276 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index); ··· 905 888 untrack_possible_nx_huge_page(kvm, sp); 906 889 } 907 890 908 - static struct kvm_memory_slot * 909 - gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn, 910 - bool no_dirty_log) 891 + static struct 
kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, 892 + gfn_t gfn, 893 + bool no_dirty_log) 911 894 { 912 895 struct kvm_memory_slot *slot; 913 896 ··· 946 929 desc->sptes[0] = (u64 *)rmap_head->val; 947 930 desc->sptes[1] = spte; 948 931 desc->spte_count = 2; 932 + desc->tail_count = 0; 949 933 rmap_head->val = (unsigned long)desc | 1; 950 934 ++count; 951 935 } else { 952 936 rmap_printk("%p %llx many->many\n", spte, *spte); 953 937 desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); 954 - while (desc->spte_count == PTE_LIST_EXT) { 955 - count += PTE_LIST_EXT; 956 - if (!desc->more) { 957 - desc->more = kvm_mmu_memory_cache_alloc(cache); 958 - desc = desc->more; 959 - desc->spte_count = 0; 960 - break; 961 - } 962 - desc = desc->more; 938 + count = desc->tail_count + desc->spte_count; 939 + 940 + /* 941 + * If the previous head is full, allocate a new head descriptor 942 + * as tail descriptors are always kept full. 943 + */ 944 + if (desc->spte_count == PTE_LIST_EXT) { 945 + desc = kvm_mmu_memory_cache_alloc(cache); 946 + desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul); 947 + desc->spte_count = 0; 948 + desc->tail_count = count; 949 + rmap_head->val = (unsigned long)desc | 1; 963 950 } 964 - count += desc->spte_count; 965 951 desc->sptes[desc->spte_count++] = spte; 966 952 } 967 953 return count; 968 954 } 969 955 970 - static void 971 - pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head, 972 - struct pte_list_desc *desc, int i, 973 - struct pte_list_desc *prev_desc) 956 + static void pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head, 957 + struct pte_list_desc *desc, int i) 974 958 { 975 - int j = desc->spte_count - 1; 959 + struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); 960 + int j = head_desc->spte_count - 1; 976 961 977 - desc->sptes[i] = desc->sptes[j]; 978 - desc->sptes[j] = NULL; 979 - desc->spte_count--; 980 - if (desc->spte_count) 962 + /* 963 + * The head descriptor 
should never be empty. A new head is added only 964 + * when adding an entry and the previous head is full, and heads are 965 + * removed (this flow) when they become empty. 966 + */ 967 + BUG_ON(j < 0); 968 + 969 + /* 970 + * Replace the to-be-freed SPTE with the last valid entry from the head 971 + * descriptor to ensure that tail descriptors are full at all times. 972 + * Note, this also means that tail_count is stable for each descriptor. 973 + */ 974 + desc->sptes[i] = head_desc->sptes[j]; 975 + head_desc->sptes[j] = NULL; 976 + head_desc->spte_count--; 977 + if (head_desc->spte_count) 981 978 return; 982 - if (!prev_desc && !desc->more) 979 + 980 + /* 981 + * The head descriptor is empty. If there are no tail descriptors, 982 + * nullify the rmap head to mark the list as emtpy, else point the rmap 983 + * head at the next descriptor, i.e. the new head. 984 + */ 985 + if (!head_desc->more) 983 986 rmap_head->val = 0; 984 987 else 985 - if (prev_desc) 986 - prev_desc->more = desc->more; 987 - else 988 - rmap_head->val = (unsigned long)desc->more | 1; 989 - mmu_free_pte_list_desc(desc); 988 + rmap_head->val = (unsigned long)head_desc->more | 1; 989 + mmu_free_pte_list_desc(head_desc); 990 990 } 991 991 992 992 static void pte_list_remove(u64 *spte, struct kvm_rmap_head *rmap_head) 993 993 { 994 994 struct pte_list_desc *desc; 995 - struct pte_list_desc *prev_desc; 996 995 int i; 997 996 998 997 if (!rmap_head->val) { ··· 1024 991 } else { 1025 992 rmap_printk("%p many->many\n", spte); 1026 993 desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); 1027 - prev_desc = NULL; 1028 994 while (desc) { 1029 995 for (i = 0; i < desc->spte_count; ++i) { 1030 996 if (desc->sptes[i] == spte) { 1031 - pte_list_desc_remove_entry(rmap_head, 1032 - desc, i, prev_desc); 997 + pte_list_desc_remove_entry(rmap_head, desc, i); 1033 998 return; 1034 999 } 1035 1000 } 1036 - prev_desc = desc; 1037 1001 desc = desc->more; 1038 1002 } 1039 1003 pr_err("%s: %p many->many\n", __func__, 
spte); ··· 1077 1047 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head) 1078 1048 { 1079 1049 struct pte_list_desc *desc; 1080 - unsigned int count = 0; 1081 1050 1082 1051 if (!rmap_head->val) 1083 1052 return 0; ··· 1084 1055 return 1; 1085 1056 1086 1057 desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); 1087 - 1088 - while (desc) { 1089 - count += desc->spte_count; 1090 - desc = desc->more; 1091 - } 1092 - 1093 - return count; 1058 + return desc->tail_count + desc->spte_count; 1094 1059 } 1095 1060 1096 1061 static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, ··· 1094 1071 1095 1072 idx = gfn_to_index(gfn, slot->base_gfn, level); 1096 1073 return &slot->arch.rmap[level - PG_LEVEL_4K][idx]; 1097 - } 1098 - 1099 - static bool rmap_can_add(struct kvm_vcpu *vcpu) 1100 - { 1101 - struct kvm_mmu_memory_cache *mc; 1102 - 1103 - mc = &vcpu->arch.mmu_pte_list_desc_cache; 1104 - return kvm_mmu_memory_cache_nr_free_objects(mc); 1105 1074 } 1106 1075 1107 1076 static void rmap_remove(struct kvm *kvm, u64 *spte) ··· 1494 1479 } 1495 1480 } 1496 1481 1497 - if (need_flush && kvm_available_flush_tlb_with_range()) { 1482 + if (need_flush && kvm_available_flush_remote_tlbs_range()) { 1498 1483 kvm_flush_remote_tlbs_gfn(kvm, gfn, level); 1499 1484 return false; 1500 1485 } ··· 1519 1504 struct kvm_rmap_head *end_rmap; 1520 1505 }; 1521 1506 1522 - static void 1523 - rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, int level) 1507 + static void rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator, 1508 + int level) 1524 1509 { 1525 1510 iterator->level = level; 1526 1511 iterator->gfn = iterator->start_gfn; ··· 1528 1513 iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot); 1529 1514 } 1530 1515 1531 - static void 1532 - slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator, 1533 - const struct kvm_memory_slot *slot, int start_level, 1534 - int end_level, gfn_t start_gfn, gfn_t end_gfn) 1516 + static 
void slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator, 1517 + const struct kvm_memory_slot *slot, 1518 + int start_level, int end_level, 1519 + gfn_t start_gfn, gfn_t end_gfn) 1535 1520 { 1536 1521 iterator->slot = slot; 1537 1522 iterator->start_level = start_level; ··· 1804 1789 kvm_mmu_mark_parents_unsync(sp); 1805 1790 } 1806 1791 1807 - static int nonpaging_sync_page(struct kvm_vcpu *vcpu, 1808 - struct kvm_mmu_page *sp) 1809 - { 1810 - return -1; 1811 - } 1812 - 1813 1792 #define KVM_PAGE_ARRAY_NR 16 1814 1793 1815 1794 struct kvm_mmu_pages { ··· 1923 1914 &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \ 1924 1915 if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else 1925 1916 1917 + static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) 1918 + { 1919 + union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role; 1920 + 1921 + /* 1922 + * Ignore various flags when verifying that it's safe to sync a shadow 1923 + * page using the current MMU context. 1924 + * 1925 + * - level: not part of the overall MMU role and will never match as the MMU's 1926 + * level tracks the root level 1927 + * - access: updated based on the new guest PTE 1928 + * - quadrant: not part of the overall MMU role (similar to level) 1929 + */ 1930 + const union kvm_mmu_page_role sync_role_ign = { 1931 + .level = 0xf, 1932 + .access = 0x7, 1933 + .quadrant = 0x3, 1934 + .passthrough = 0x1, 1935 + }; 1936 + 1937 + /* 1938 + * Direct pages can never be unsync, and KVM should never attempt to 1939 + * sync a shadow page for a different MMU context, e.g. if the role 1940 + * differs then the memslot lookup (SMM vs. non-SMM) will be bogus, the 1941 + * reserved bits checks will be wrong, etc... 
1942 + */ 1943 + if (WARN_ON_ONCE(sp->role.direct || !vcpu->arch.mmu->sync_spte || 1944 + (sp->role.word ^ root_role.word) & ~sync_role_ign.word)) 1945 + return false; 1946 + 1947 + return true; 1948 + } 1949 + 1950 + static int kvm_sync_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i) 1951 + { 1952 + if (!sp->spt[i]) 1953 + return 0; 1954 + 1955 + return vcpu->arch.mmu->sync_spte(vcpu, sp, i); 1956 + } 1957 + 1958 + static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) 1959 + { 1960 + int flush = 0; 1961 + int i; 1962 + 1963 + if (!kvm_sync_page_check(vcpu, sp)) 1964 + return -1; 1965 + 1966 + for (i = 0; i < SPTE_ENT_PER_PAGE; i++) { 1967 + int ret = kvm_sync_spte(vcpu, sp, i); 1968 + 1969 + if (ret < -1) 1970 + return -1; 1971 + flush |= ret; 1972 + } 1973 + 1974 + /* 1975 + * Note, any flush is purely for KVM's correctness, e.g. when dropping 1976 + * an existing SPTE or clearing W/A/D bits to ensure an mmu_notifier 1977 + * unmap or dirty logging event doesn't fail to flush. The guest is 1978 + * responsible for flushing the TLB to ensure any changes in protection 1979 + * bits are recognized, i.e. until the guest flushes or page faults on 1980 + * a relevant address, KVM is architecturally allowed to let vCPUs use 1981 + * cached translations with the old protection bits. 1982 + */ 1983 + return flush; 1984 + } 1985 + 1926 1986 static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, 1927 1987 struct list_head *invalid_list) 1928 1988 { 1929 - int ret = vcpu->arch.mmu->sync_page(vcpu, sp); 1989 + int ret = __kvm_sync_page(vcpu, sp); 1930 1990 1931 1991 if (ret < 0) 1932 1992 kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list); ··· 3382 3304 * Returns true if the SPTE was fixed successfully. Otherwise, 3383 3305 * someone else modified the SPTE from its original value. 
3384 3306 */ 3385 - static bool 3386 - fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, 3387 - u64 *sptep, u64 old_spte, u64 new_spte) 3307 + static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, 3308 + struct kvm_page_fault *fault, 3309 + u64 *sptep, u64 old_spte, u64 new_spte) 3388 3310 { 3389 3311 /* 3390 3312 * Theoretically we could also set dirty bit (and flush TLB) here in ··· 3590 3512 int i; 3591 3513 LIST_HEAD(invalid_list); 3592 3514 bool free_active_root; 3515 + 3516 + WARN_ON_ONCE(roots_to_free & ~KVM_MMU_ROOTS_ALL); 3593 3517 3594 3518 BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG); 3595 3519 ··· 3811 3731 int quadrant, i, r; 3812 3732 hpa_t root; 3813 3733 3814 - root_pgd = mmu->get_guest_pgd(vcpu); 3734 + root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu); 3815 3735 root_gfn = root_pgd >> PAGE_SHIFT; 3816 3736 3817 3737 if (mmu_check_root(vcpu, root_gfn)) ··· 4261 4181 arch.token = alloc_apf_token(vcpu); 4262 4182 arch.gfn = gfn; 4263 4183 arch.direct_map = vcpu->arch.mmu->root_role.direct; 4264 - arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu); 4184 + arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu); 4265 4185 4266 4186 return kvm_setup_async_pf(vcpu, cr2_or_gpa, 4267 4187 kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); ··· 4280 4200 return; 4281 4201 4282 4202 if (!vcpu->arch.mmu->root_role.direct && 4283 - work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu)) 4203 + work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu)) 4284 4204 return; 4285 4205 4286 - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); 4206 + kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL); 4287 4207 } 4288 4208 4289 4209 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) ··· 4549 4469 { 4550 4470 context->page_fault = nonpaging_page_fault; 4551 4471 context->gva_to_gpa = nonpaging_gva_to_gpa; 4552 - context->sync_page = nonpaging_sync_page; 4553 - context->invlpg = NULL; 4472 + 
context->sync_spte = NULL; 4554 4473 } 4555 4474 4556 4475 static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd, ··· 4683 4604 } 4684 4605 EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd); 4685 4606 4686 - static unsigned long get_cr3(struct kvm_vcpu *vcpu) 4687 - { 4688 - return kvm_read_cr3(vcpu); 4689 - } 4690 - 4691 4607 static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn, 4692 4608 unsigned int access) 4693 4609 { ··· 4712 4638 #include "paging_tmpl.h" 4713 4639 #undef PTTYPE 4714 4640 4715 - static void 4716 - __reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check, 4717 - u64 pa_bits_rsvd, int level, bool nx, bool gbpages, 4718 - bool pse, bool amd) 4641 + static void __reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check, 4642 + u64 pa_bits_rsvd, int level, bool nx, 4643 + bool gbpages, bool pse, bool amd) 4719 4644 { 4720 4645 u64 gbpages_bit_rsvd = 0; 4721 4646 u64 nonleaf_bit8_rsvd = 0; ··· 4827 4754 guest_cpuid_is_amd_or_hygon(vcpu)); 4828 4755 } 4829 4756 4830 - static void 4831 - __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check, 4832 - u64 pa_bits_rsvd, bool execonly, int huge_page_level) 4757 + static void __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check, 4758 + u64 pa_bits_rsvd, bool execonly, 4759 + int huge_page_level) 4833 4760 { 4834 4761 u64 high_bits_rsvd = pa_bits_rsvd & rsvd_bits(0, 51); 4835 4762 u64 large_1g_rsvd = 0, large_2m_rsvd = 0; ··· 4929 4856 * the direct page table on host, use as much mmu features as 4930 4857 * possible, however, kvm currently does not do execution-protection. 
4931 4858 */ 4932 - static void 4933 - reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context) 4859 + static void reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context) 4934 4860 { 4935 4861 struct rsvd_bits_validate *shadow_zero_check; 4936 4862 int i; ··· 5132 5060 { 5133 5061 context->page_fault = paging64_page_fault; 5134 5062 context->gva_to_gpa = paging64_gva_to_gpa; 5135 - context->sync_page = paging64_sync_page; 5136 - context->invlpg = paging64_invlpg; 5063 + context->sync_spte = paging64_sync_spte; 5137 5064 } 5138 5065 5139 5066 static void paging32_init_context(struct kvm_mmu *context) 5140 5067 { 5141 5068 context->page_fault = paging32_page_fault; 5142 5069 context->gva_to_gpa = paging32_gva_to_gpa; 5143 - context->sync_page = paging32_sync_page; 5144 - context->invlpg = paging32_invlpg; 5070 + context->sync_spte = paging32_sync_spte; 5145 5071 } 5146 5072 5147 - static union kvm_cpu_role 5148 - kvm_calc_cpu_role(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs) 5073 + static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu, 5074 + const struct kvm_mmu_role_regs *regs) 5149 5075 { 5150 5076 union kvm_cpu_role role = {0}; 5151 5077 ··· 5180 5110 role.ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs); 5181 5111 role.ext.efer_lma = ____is_efer_lma(regs); 5182 5112 return role; 5113 + } 5114 + 5115 + void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, 5116 + struct kvm_mmu *mmu) 5117 + { 5118 + const bool cr0_wp = kvm_is_cr0_bit_set(vcpu, X86_CR0_WP); 5119 + 5120 + BUILD_BUG_ON((KVM_MMU_CR0_ROLE_BITS & KVM_POSSIBLE_CR0_GUEST_BITS) != X86_CR0_WP); 5121 + BUILD_BUG_ON((KVM_MMU_CR4_ROLE_BITS & KVM_POSSIBLE_CR4_GUEST_BITS)); 5122 + 5123 + if (is_cr0_wp(mmu) == cr0_wp) 5124 + return; 5125 + 5126 + mmu->cpu_role.base.cr0_wp = cr0_wp; 5127 + reset_guest_paging_metadata(vcpu, mmu); 5183 5128 } 5184 5129 5185 5130 static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu) ··· 5242 5157 context->cpu_role.as_u64 = 
cpu_role.as_u64; 5243 5158 context->root_role.word = root_role.word; 5244 5159 context->page_fault = kvm_tdp_page_fault; 5245 - context->sync_page = nonpaging_sync_page; 5246 - context->invlpg = NULL; 5247 - context->get_guest_pgd = get_cr3; 5160 + context->sync_spte = NULL; 5161 + context->get_guest_pgd = get_guest_cr3; 5248 5162 context->get_pdptr = kvm_pdptr_read; 5249 5163 context->inject_page_fault = kvm_inject_page_fault; 5250 5164 ··· 5373 5289 5374 5290 context->page_fault = ept_page_fault; 5375 5291 context->gva_to_gpa = ept_gva_to_gpa; 5376 - context->sync_page = ept_sync_page; 5377 - context->invlpg = ept_invlpg; 5292 + context->sync_spte = ept_sync_spte; 5378 5293 5379 5294 update_permission_bitmask(context, true); 5380 5295 context->pkru_mask = 0; ··· 5392 5309 5393 5310 kvm_init_shadow_mmu(vcpu, cpu_role); 5394 5311 5395 - context->get_guest_pgd = get_cr3; 5312 + context->get_guest_pgd = get_guest_cr3; 5396 5313 context->get_pdptr = kvm_pdptr_read; 5397 5314 context->inject_page_fault = kvm_inject_page_fault; 5398 5315 } ··· 5406 5323 return; 5407 5324 5408 5325 g_context->cpu_role.as_u64 = new_mode.as_u64; 5409 - g_context->get_guest_pgd = get_cr3; 5326 + g_context->get_guest_pgd = get_guest_cr3; 5410 5327 g_context->get_pdptr = kvm_pdptr_read; 5411 5328 g_context->inject_page_fault = kvm_inject_page_fault; 5412 5329 ··· 5414 5331 * L2 page tables are never shadowed, so there is no need to sync 5415 5332 * SPTEs. 5416 5333 */ 5417 - g_context->invlpg = NULL; 5334 + g_context->sync_spte = NULL; 5418 5335 5419 5336 /* 5420 5337 * Note that arch.mmu->gva_to_gpa translates l2_gpa to l1_gpa using ··· 5476 5393 * Changing guest CPUID after KVM_RUN is forbidden, see the comment in 5477 5394 * kvm_arch_vcpu_ioctl(). 
5478 5395 */ 5479 - KVM_BUG_ON(vcpu->arch.last_vmentry_cpu != -1, vcpu->kvm); 5396 + KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm); 5480 5397 } 5481 5398 5482 5399 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu) ··· 5747 5664 5748 5665 if (r == RET_PF_INVALID) { 5749 5666 r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, 5750 - lower_32_bits(error_code), false); 5667 + lower_32_bits(error_code), false, 5668 + &emulation_type); 5751 5669 if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm)) 5752 5670 return -EIO; 5753 5671 } ··· 5790 5706 } 5791 5707 EXPORT_SYMBOL_GPL(kvm_mmu_page_fault); 5792 5708 5793 - void kvm_mmu_invalidate_gva(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 5794 - gva_t gva, hpa_t root_hpa) 5709 + static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 5710 + u64 addr, hpa_t root_hpa) 5711 + { 5712 + struct kvm_shadow_walk_iterator iterator; 5713 + 5714 + vcpu_clear_mmio_info(vcpu, addr); 5715 + 5716 + if (!VALID_PAGE(root_hpa)) 5717 + return; 5718 + 5719 + write_lock(&vcpu->kvm->mmu_lock); 5720 + for_each_shadow_entry_using_root(vcpu, root_hpa, addr, iterator) { 5721 + struct kvm_mmu_page *sp = sptep_to_sp(iterator.sptep); 5722 + 5723 + if (sp->unsync) { 5724 + int ret = kvm_sync_spte(vcpu, sp, iterator.index); 5725 + 5726 + if (ret < 0) 5727 + mmu_page_zap_pte(vcpu->kvm, sp, iterator.sptep, NULL); 5728 + if (ret) 5729 + kvm_flush_remote_tlbs_sptep(vcpu->kvm, iterator.sptep); 5730 + } 5731 + 5732 + if (!sp->unsync_children) 5733 + break; 5734 + } 5735 + write_unlock(&vcpu->kvm->mmu_lock); 5736 + } 5737 + 5738 + void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 5739 + u64 addr, unsigned long roots) 5795 5740 { 5796 5741 int i; 5742 + 5743 + WARN_ON_ONCE(roots & ~KVM_MMU_ROOTS_ALL); 5797 5744 5798 5745 /* It's actually a GPA for vcpu->arch.guest_mmu. */ 5799 5746 if (mmu != &vcpu->arch.guest_mmu) { 5800 5747 /* INVLPG on a non-canonical address is a NOP according to the SDM. 
*/ 5801 - if (is_noncanonical_address(gva, vcpu)) 5748 + if (is_noncanonical_address(addr, vcpu)) 5802 5749 return; 5803 5750 5804 - static_call(kvm_x86_flush_tlb_gva)(vcpu, gva); 5751 + static_call(kvm_x86_flush_tlb_gva)(vcpu, addr); 5805 5752 } 5806 5753 5807 - if (!mmu->invlpg) 5754 + if (!mmu->sync_spte) 5808 5755 return; 5809 5756 5810 - if (root_hpa == INVALID_PAGE) { 5811 - mmu->invlpg(vcpu, gva, mmu->root.hpa); 5757 + if (roots & KVM_MMU_ROOT_CURRENT) 5758 + __kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->root.hpa); 5812 5759 5813 - /* 5814 - * INVLPG is required to invalidate any global mappings for the VA, 5815 - * irrespective of PCID. Since it would take us roughly similar amount 5816 - * of work to determine whether any of the prev_root mappings of the VA 5817 - * is marked global, or to just sync it blindly, so we might as well 5818 - * just always sync it. 5819 - * 5820 - * Mappings not reachable via the current cr3 or the prev_roots will be 5821 - * synced when switching to that cr3, so nothing needs to be done here 5822 - * for them. 5823 - */ 5824 - for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) 5825 - if (VALID_PAGE(mmu->prev_roots[i].hpa)) 5826 - mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa); 5827 - } else { 5828 - mmu->invlpg(vcpu, gva, root_hpa); 5760 + for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { 5761 + if (roots & KVM_MMU_ROOT_PREVIOUS(i)) 5762 + __kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->prev_roots[i].hpa); 5829 5763 } 5830 5764 } 5765 + EXPORT_SYMBOL_GPL(kvm_mmu_invalidate_addr); 5831 5766 5832 5767 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva) 5833 5768 { 5834 - kvm_mmu_invalidate_gva(vcpu, vcpu->arch.walk_mmu, gva, INVALID_PAGE); 5769 + /* 5770 + * INVLPG is required to invalidate any global mappings for the VA, 5771 + * irrespective of PCID. Blindly sync all roots as it would take 5772 + * roughly the same amount of work/time to determine whether any of the 5773 + * previous roots have a global mapping. 
5774 + * 5775 + * Mappings not reachable via the current or previous cached roots will 5776 + * be synced when switching to that new cr3, so nothing needs to be 5777 + * done here for them. 5778 + */ 5779 + kvm_mmu_invalidate_addr(vcpu, vcpu->arch.walk_mmu, gva, KVM_MMU_ROOTS_ALL); 5835 5780 ++vcpu->stat.invlpg; 5836 5781 } 5837 5782 EXPORT_SYMBOL_GPL(kvm_mmu_invlpg); ··· 5869 5756 void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid) 5870 5757 { 5871 5758 struct kvm_mmu *mmu = vcpu->arch.mmu; 5872 - bool tlb_flush = false; 5759 + unsigned long roots = 0; 5873 5760 uint i; 5874 5761 5875 - if (pcid == kvm_get_active_pcid(vcpu)) { 5876 - if (mmu->invlpg) 5877 - mmu->invlpg(vcpu, gva, mmu->root.hpa); 5878 - tlb_flush = true; 5879 - } 5762 + if (pcid == kvm_get_active_pcid(vcpu)) 5763 + roots |= KVM_MMU_ROOT_CURRENT; 5880 5764 5881 5765 for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { 5882 5766 if (VALID_PAGE(mmu->prev_roots[i].hpa) && 5883 - pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd)) { 5884 - if (mmu->invlpg) 5885 - mmu->invlpg(vcpu, gva, mmu->prev_roots[i].hpa); 5886 - tlb_flush = true; 5887 - } 5767 + pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd)) 5768 + roots |= KVM_MMU_ROOT_PREVIOUS(i); 5888 5769 } 5889 5770 5890 - if (tlb_flush) 5891 - static_call(kvm_x86_flush_tlb_gva)(vcpu, gva); 5892 - 5771 + if (roots) 5772 + kvm_mmu_invalidate_addr(vcpu, mmu, gva, roots); 5893 5773 ++vcpu->stat.invlpg; 5894 5774 5895 5775 /* ··· 5919 5813 EXPORT_SYMBOL_GPL(kvm_configure_mmu); 5920 5814 5921 5815 /* The return value indicates if tlb flush on all vcpus is needed. */ 5922 - typedef bool (*slot_level_handler) (struct kvm *kvm, 5816 + typedef bool (*slot_rmaps_handler) (struct kvm *kvm, 5923 5817 struct kvm_rmap_head *rmap_head, 5924 5818 const struct kvm_memory_slot *slot); 5925 5819 5926 - /* The caller should hold mmu-lock before calling this function. 
*/ 5927 - static __always_inline bool 5928 - slot_handle_level_range(struct kvm *kvm, const struct kvm_memory_slot *memslot, 5929 - slot_level_handler fn, int start_level, int end_level, 5930 - gfn_t start_gfn, gfn_t end_gfn, bool flush_on_yield, 5931 - bool flush) 5820 + static __always_inline bool __walk_slot_rmaps(struct kvm *kvm, 5821 + const struct kvm_memory_slot *slot, 5822 + slot_rmaps_handler fn, 5823 + int start_level, int end_level, 5824 + gfn_t start_gfn, gfn_t end_gfn, 5825 + bool flush_on_yield, bool flush) 5932 5826 { 5933 5827 struct slot_rmap_walk_iterator iterator; 5934 5828 5935 - for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn, 5829 + lockdep_assert_held_write(&kvm->mmu_lock); 5830 + 5831 + for_each_slot_rmap_range(slot, start_level, end_level, start_gfn, 5936 5832 end_gfn, &iterator) { 5937 5833 if (iterator.rmap) 5938 - flush |= fn(kvm, iterator.rmap, memslot); 5834 + flush |= fn(kvm, iterator.rmap, slot); 5939 5835 5940 5836 if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { 5941 5837 if (flush && flush_on_yield) { 5942 - kvm_flush_remote_tlbs_with_address(kvm, 5943 - start_gfn, 5944 - iterator.gfn - start_gfn + 1); 5838 + kvm_flush_remote_tlbs_range(kvm, start_gfn, 5839 + iterator.gfn - start_gfn + 1); 5945 5840 flush = false; 5946 5841 } 5947 5842 cond_resched_rwlock_write(&kvm->mmu_lock); ··· 5952 5845 return flush; 5953 5846 } 5954 5847 5955 - static __always_inline bool 5956 - slot_handle_level(struct kvm *kvm, const struct kvm_memory_slot *memslot, 5957 - slot_level_handler fn, int start_level, int end_level, 5958 - bool flush_on_yield) 5848 + static __always_inline bool walk_slot_rmaps(struct kvm *kvm, 5849 + const struct kvm_memory_slot *slot, 5850 + slot_rmaps_handler fn, 5851 + int start_level, int end_level, 5852 + bool flush_on_yield) 5959 5853 { 5960 - return slot_handle_level_range(kvm, memslot, fn, start_level, 5961 - end_level, memslot->base_gfn, 5962 - memslot->base_gfn + memslot->npages - 1, 5963 - 
flush_on_yield, false); 5854 + return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level, 5855 + slot->base_gfn, slot->base_gfn + slot->npages - 1, 5856 + flush_on_yield, false); 5964 5857 } 5965 5858 5966 - static __always_inline bool 5967 - slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot, 5968 - slot_level_handler fn, bool flush_on_yield) 5859 + static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm, 5860 + const struct kvm_memory_slot *slot, 5861 + slot_rmaps_handler fn, 5862 + bool flush_on_yield) 5969 5863 { 5970 - return slot_handle_level(kvm, memslot, fn, PG_LEVEL_4K, 5971 - PG_LEVEL_4K, flush_on_yield); 5864 + return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield); 5972 5865 } 5973 5866 5974 5867 static void free_mmu_pages(struct kvm_mmu *mmu) ··· 6263 6156 if (WARN_ON_ONCE(start >= end)) 6264 6157 continue; 6265 6158 6266 - flush = slot_handle_level_range(kvm, memslot, __kvm_zap_rmap, 6267 - PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, 6268 - start, end - 1, true, flush); 6159 + flush = __walk_slot_rmaps(kvm, memslot, __kvm_zap_rmap, 6160 + PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, 6161 + start, end - 1, true, flush); 6269 6162 } 6270 6163 } 6271 6164 ··· 6297 6190 } 6298 6191 6299 6192 if (flush) 6300 - kvm_flush_remote_tlbs_with_address(kvm, gfn_start, 6301 - gfn_end - gfn_start); 6193 + kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start); 6302 6194 6303 6195 kvm_mmu_invalidate_end(kvm, 0, -1ul); 6304 6196 ··· 6317 6211 { 6318 6212 if (kvm_memslots_have_rmaps(kvm)) { 6319 6213 write_lock(&kvm->mmu_lock); 6320 - slot_handle_level(kvm, memslot, slot_rmap_write_protect, 6321 - start_level, KVM_MAX_HUGEPAGE_LEVEL, false); 6214 + walk_slot_rmaps(kvm, memslot, slot_rmap_write_protect, 6215 + start_level, KVM_MAX_HUGEPAGE_LEVEL, false); 6322 6216 write_unlock(&kvm->mmu_lock); 6323 6217 } 6324 6218 ··· 6553 6447 * all the way to the target level. 
There's no need to split pages 6554 6448 * already at the target level. 6555 6449 */ 6556 - for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) { 6557 - slot_handle_level_range(kvm, slot, shadow_mmu_try_split_huge_pages, 6558 - level, level, start, end - 1, true, false); 6559 - } 6450 + for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--) 6451 + __walk_slot_rmaps(kvm, slot, shadow_mmu_try_split_huge_pages, 6452 + level, level, start, end - 1, true, false); 6560 6453 } 6561 6454 6562 6455 /* Must be called with the mmu_lock held in write-mode. */ ··· 6634 6529 PG_LEVEL_NUM)) { 6635 6530 kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); 6636 6531 6637 - if (kvm_available_flush_tlb_with_range()) 6532 + if (kvm_available_flush_remote_tlbs_range()) 6638 6533 kvm_flush_remote_tlbs_sptep(kvm, sptep); 6639 6534 else 6640 6535 need_tlb_flush = 1; ··· 6653 6548 * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap 6654 6549 * pages that are already mapped at the maximum hugepage level. 6655 6550 */ 6656 - if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte, 6657 - PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true)) 6551 + if (walk_slot_rmaps(kvm, slot, kvm_mmu_zap_collapsible_spte, 6552 + PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true)) 6658 6553 kvm_arch_flush_remote_tlbs_memslot(kvm, slot); 6659 6554 } 6660 6555 ··· 6685 6580 * is observed by any other operation on the same memslot. 6686 6581 */ 6687 6582 lockdep_assert_held(&kvm->slots_lock); 6688 - kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn, 6689 - memslot->npages); 6583 + kvm_flush_remote_tlbs_range(kvm, memslot->base_gfn, memslot->npages); 6690 6584 } 6691 6585 6692 6586 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, ··· 6697 6593 * Clear dirty bits only on 4k SPTEs since the legacy MMU only 6698 6594 * support dirty logging at a 4k granularity. 
6699 6595 */ 6700 - slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false); 6596 + walk_slot_rmaps_4k(kvm, memslot, __rmap_clear_dirty, false); 6701 6597 write_unlock(&kvm->mmu_lock); 6702 6598 } 6703 6599 ··· 6767 6663 } 6768 6664 } 6769 6665 6770 - static unsigned long 6771 - mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) 6666 + static unsigned long mmu_shrink_scan(struct shrinker *shrink, 6667 + struct shrink_control *sc) 6772 6668 { 6773 6669 struct kvm *kvm; 6774 6670 int nr_to_scan = sc->nr_to_scan; ··· 6826 6722 return freed; 6827 6723 } 6828 6724 6829 - static unsigned long 6830 - mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) 6725 + static unsigned long mmu_shrink_count(struct shrinker *shrink, 6726 + struct shrink_control *sc) 6831 6727 { 6832 6728 return percpu_counter_read_positive(&kvm_total_used_mmu_pages); 6833 6729 }
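Note on the pte_list_desc changes in this file: new descriptors are pushed at the head, removals backfill from the head so every tail descriptor stays full, and tail_count lets pte_list_count() answer without walking the list. The following is a simplified userspace model of those three rules, not the kernel code. The toy_* names and the plain head pointer are assumptions for illustration; the kernel tags a pointer in rmap_head->val and stores a lone SPTE without any descriptor, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_LIST_EXT 14 /* mirrors PTE_LIST_EXT in the diff */

struct toy_desc {
	struct toy_desc *more;
	uint32_t spte_count; /* entries stored in this descriptor */
	uint32_t tail_count; /* entries stored in all descriptors after it */
	uint64_t *sptes[TOY_LIST_EXT];
};

/* Hypothetical list head; stands in for struct kvm_rmap_head. */
struct toy_list {
	struct toy_desc *head;
};

/* O(1) count: only the head descriptor is read. */
static unsigned int toy_count(const struct toy_list *list)
{
	return list->head ? list->head->tail_count + list->head->spte_count : 0;
}

/* Add an entry; returns the number of entries that were already present. */
static unsigned int toy_add(struct toy_list *list, uint64_t *spte)
{
	unsigned int count = toy_count(list);
	struct toy_desc *desc = list->head;

	/* A full (or missing) head means a new head; tails are never touched. */
	if (!desc || desc->spte_count == TOY_LIST_EXT) {
		desc = calloc(1, sizeof(*desc));
		assert(desc);
		desc->more = list->head;
		desc->tail_count = count;
		list->head = desc;
	}
	desc->sptes[desc->spte_count++] = spte;
	return count;
}

/* Remove an entry; returns 1 if it was found, 0 otherwise. */
static int toy_remove(struct toy_list *list, uint64_t *spte)
{
	struct toy_desc *head = list->head;
	struct toy_desc *desc;
	uint32_t i;

	for (desc = head; desc; desc = desc->more) {
		for (i = 0; i < desc->spte_count; i++) {
			if (desc->sptes[i] != spte)
				continue;

			/* Backfill from the head so every tail stays full. */
			desc->sptes[i] = head->sptes[head->spte_count - 1];
			head->sptes[--head->spte_count] = NULL;
			if (!head->spte_count) {
				list->head = head->more;
				free(head);
			}
			return 1;
		}
	}
	return 0;
}
```

Because tails are always full, tail_count is written once when a descriptor becomes the head and never needs updating afterwards, which is the property the comment in pte_list_desc_remove_entry() calls "stable".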

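Note on kvm_mmu_invalidate_addr() in the mmu.c diff above: callers now pass a bitmask of roots instead of a single root_hpa, and kvm_mmu_invpcid_gva() builds that mask from the PCIDs of the current and cached previous roots. The sketch below models only the mask construction. The bit layout (bit 0 for the current root, bit 1 + i for previous root i) and the value 3 for the number of previous roots are assumptions, since the diff does not show the macro definitions; the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_PREV_ROOTS   3 /* assumed value of KVM_MMU_NUM_PREV_ROOTS */
#define TOY_ROOT_CURRENT     (1UL << 0)
#define TOY_ROOT_PREVIOUS(i) (1UL << (1 + (i)))
#define TOY_ROOTS_ALL        ((1UL << (1 + TOY_NUM_PREV_ROOTS)) - 1)

struct toy_root {
	bool valid;         /* models VALID_PAGE(root.hpa) */
	unsigned long pcid; /* models kvm_get_pcid(vcpu, root.pgd) */
};

/* Collect every root whose PCID matches the one being invalidated. */
static unsigned long toy_roots_for_pcid(unsigned long active_pcid,
					const struct toy_root *prev,
					unsigned long pcid)
{
	unsigned long roots = 0;
	int i;

	if (pcid == active_pcid)
		roots |= TOY_ROOT_CURRENT;

	for (i = 0; i < TOY_NUM_PREV_ROOTS; i++) {
		if (prev[i].valid && prev[i].pcid == pcid)
			roots |= TOY_ROOT_PREVIOUS(i);
	}
	return roots;
}
```

An empty mask means nothing matched, which corresponds to the "if (roots)" guard in the diff; kvm_mmu_invlpg() instead passes the equivalent of TOY_ROOTS_ALL.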
+15 -5

arch/x86/kvm/mmu/mmu_internal.h

··· 170 170 struct kvm_memory_slot *slot, u64 gfn, 171 171 int min_level); 172 172 173 - void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, 174 - u64 start_gfn, u64 pages); 173 + void kvm_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, 174 + gfn_t nr_pages); 175 175 176 176 /* Flush the given page (huge or not) of guest memory. */ 177 177 static inline void kvm_flush_remote_tlbs_gfn(struct kvm *kvm, gfn_t gfn, int level) 178 178 { 179 - kvm_flush_remote_tlbs_with_address(kvm, gfn_round_for_level(gfn, level), 180 - KVM_PAGES_PER_HPAGE(level)); 179 + kvm_flush_remote_tlbs_range(kvm, gfn_round_for_level(gfn, level), 180 + KVM_PAGES_PER_HPAGE(level)); 181 181 } 182 182 183 183 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head); ··· 240 240 kvm_pfn_t pfn; 241 241 hva_t hva; 242 242 bool map_writable; 243 + 244 + /* 245 + * Indicates the guest is trying to write a gfn that contains one or 246 + * more of the PTEs used to translate the write itself, i.e. the access 247 + * is changing its own translation in the guest page tables. 248 + */ 249 + bool write_fault_to_shadow_pgtable; 243 250 }; 244 251 245 252 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); ··· 280 273 }; 281 274 282 275 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, 283 - u32 err, bool prefetch) 276 + u32 err, bool prefetch, int *emulation_type) 284 277 { 285 278 struct kvm_page_fault fault = { 286 279 .addr = cr2_or_gpa, ··· 318 311 r = kvm_tdp_page_fault(vcpu, &fault); 319 312 else 320 313 r = vcpu->arch.mmu->page_fault(vcpu, &fault); 314 + 315 + if (fault.write_fault_to_shadow_pgtable && emulation_type) 316 + *emulation_type |= EMULTYPE_WRITE_PF_TO_SP; 321 317 322 318 /* 323 319 * Similar to above, prefetch faults aren't truly spurious, and the
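Note on kvm_flush_remote_tlbs_gfn() in the hunk above: it turns a (gfn, level) pair into the (start_gfn, nr_pages) arguments of the renamed kvm_flush_remote_tlbs_range(). The sketch below shows that arithmetic in isolation. It assumes the usual x86 layout in which each paging level covers 9 more GFN bits (levels 1, 2, 3 for 4K, 2M, 1G); the toy_* helpers are hypothetical stand-ins for gfn_round_for_level() and KVM_PAGES_PER_HPAGE(), whose definitions are not part of this diff.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumed x86 layout: each level adds 9 GFN bits (4K, 2M, 1G). */
#define TOY_HPAGE_GFN_SHIFT(level) (((level) - 1) * 9)
#define TOY_PAGES_PER_HPAGE(level) ((gfn_t)1 << TOY_HPAGE_GFN_SHIFT(level))

struct toy_range {
	gfn_t start_gfn;
	gfn_t nr_pages;
};

/* Round gfn down to the start of the page that contains it at this level. */
static gfn_t toy_gfn_round_for_level(gfn_t gfn, int level)
{
	return gfn & ~(TOY_PAGES_PER_HPAGE(level) - 1);
}

/* Compute the range that a flush of one (possibly huge) page would cover. */
static struct toy_range toy_flush_range_for_gfn(gfn_t gfn, int level)
{
	struct toy_range r = {
		.start_gfn = toy_gfn_round_for_level(gfn, level),
		.nr_pages = TOY_PAGES_PER_HPAGE(level),
	};
	return r;
}
```

For example, under these assumptions gfn 0x12345 at level 2 yields a range starting at 0x12200 and spanning 512 pages, while at level 1 the range is the single page 0x12345.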

+69 -208

arch/x86/kvm/mmu/paging_tmpl.h

··· 324 324 trace_kvm_mmu_pagetable_walk(addr, access); 325 325 retry_walk: 326 326 walker->level = mmu->cpu_role.base.level; 327 - pte = mmu->get_guest_pgd(vcpu); 327 + pte = kvm_mmu_get_guest_pgd(vcpu, mmu); 328 328 have_ad = PT_HAVE_ACCESSED_DIRTY(mmu); 329 329 330 330 #if PTTYPE == 64 ··· 519 519 520 520 static bool 521 521 FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, 522 - u64 *spte, pt_element_t gpte, bool no_dirty_log) 522 + u64 *spte, pt_element_t gpte) 523 523 { 524 524 struct kvm_memory_slot *slot; 525 525 unsigned pte_access; ··· 535 535 pte_access = sp->role.access & FNAME(gpte_access)(gpte); 536 536 FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte); 537 537 538 - slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, 539 - no_dirty_log && (pte_access & ACC_WRITE_MASK)); 538 + slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, pte_access & ACC_WRITE_MASK); 540 539 if (!slot) 541 540 return false; 542 541 ··· 604 605 if (is_shadow_present_pte(*spte)) 605 606 continue; 606 607 607 - if (!FNAME(prefetch_gpte)(vcpu, sp, spte, gptep[i], true)) 608 + if (!FNAME(prefetch_gpte)(vcpu, sp, spte, gptep[i])) 608 609 break; 609 610 } 610 611 } ··· 684 685 685 686 if (sp != ERR_PTR(-EEXIST)) 686 687 link_shadow_page(vcpu, it.sptep, sp); 688 + 689 + if (fault->write && table_gfn == fault->gfn) 690 + fault->write_fault_to_shadow_pgtable = true; 687 691 } 688 692 693 + /* 694 + * Adjust the hugepage size _after_ resolving indirect shadow pages. 695 + * KVM doesn't support mapping hugepages into the guest for gfns that 696 + * are being shadowed by KVM, i.e. allocating a new shadow page may 697 + * affect the allowed hugepage size. 698 + */ 689 699 kvm_mmu_hugepage_adjust(vcpu, fault); 690 700 691 701 trace_kvm_mmu_spte_requested(fault); ··· 739 731 return RET_PF_RETRY; 740 732 } 741 733 742 - /* 743 - * To see whether the mapped gfn can write its page table in the current 744 - * mapping. 
745 - * 746 - * It is the helper function of FNAME(page_fault). When guest uses large page 747 - * size to map the writable gfn which is used as current page table, we should 748 - * force kvm to use small page size to map it because new shadow page will be 749 - * created when kvm establishes shadow page table that stop kvm using large 750 - * page size. Do it early can avoid unnecessary #PF and emulation. 751 - * 752 - * @write_fault_to_shadow_pgtable will return true if the fault gfn is 753 - * currently used as its page table. 754 - * 755 - * Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok 756 - * since the PDPT is always shadowed, that means, we can not use large page 757 - * size to map the gfn which is used as PDPT. 758 - */ 759 - static bool 760 - FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu, 761 - struct guest_walker *walker, bool user_fault, 762 - bool *write_fault_to_shadow_pgtable) 763 - { 764 - int level; 765 - gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker->level) - 1); 766 - bool self_changed = false; 767 - 768 - if (!(walker->pte_access & ACC_WRITE_MASK || 769 - (!is_cr0_wp(vcpu->arch.mmu) && !user_fault))) 770 - return false; 771 - 772 - for (level = walker->level; level <= walker->max_level; level++) { 773 - gfn_t gfn = walker->gfn ^ walker->table_gfn[level - 1]; 774 - 775 - self_changed |= !(gfn & mask); 776 - *write_fault_to_shadow_pgtable |= !gfn; 777 - } 778 - 779 - return self_changed; 780 - } 781 - 782 734 /* 783 735 * Page fault handler. 
There are several causes for a page fault: 784 736 * - there is no shadow pte for the guest pte ··· 757 789 { 758 790 struct guest_walker walker; 759 791 int r; 760 - bool is_self_change_mapping; 761 792 762 793 pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code); 763 794 WARN_ON_ONCE(fault->is_tdp); ··· 781 814 } 782 815 783 816 fault->gfn = walker.gfn; 817 + fault->max_level = walker.level; 784 818 fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn); 785 819 786 820 if (page_fault_handle_page_track(vcpu, fault)) { ··· 792 824 r = mmu_topup_memory_caches(vcpu, true); 793 825 if (r) 794 826 return r; 795 - 796 - vcpu->arch.write_fault_to_shadow_pgtable = false; 797 - 798 - is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu, 799 - &walker, fault->user, &vcpu->arch.write_fault_to_shadow_pgtable); 800 - 801 - if (is_self_change_mapping) 802 - fault->max_level = PG_LEVEL_4K; 803 - else 804 - fault->max_level = walker.level; 805 827 806 828 r = kvm_faultin_pfn(vcpu, fault, walker.pte_access); 807 829 if (r != RET_PF_CONTINUE) ··· 845 887 return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t); 846 888 } 847 889 848 - static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa) 849 - { 850 - struct kvm_shadow_walk_iterator iterator; 851 - struct kvm_mmu_page *sp; 852 - u64 old_spte; 853 - int level; 854 - u64 *sptep; 855 - 856 - vcpu_clear_mmio_info(vcpu, gva); 857 - 858 - /* 859 - * No need to check return value here, rmap_can_add() can 860 - * help us to skip pte prefetch later. 
861 - */ 862 - mmu_topup_memory_caches(vcpu, true); 863 - 864 - if (!VALID_PAGE(root_hpa)) { 865 - WARN_ON(1); 866 - return; 867 - } 868 - 869 - write_lock(&vcpu->kvm->mmu_lock); 870 - for_each_shadow_entry_using_root(vcpu, root_hpa, gva, iterator) { 871 - level = iterator.level; 872 - sptep = iterator.sptep; 873 - 874 - sp = sptep_to_sp(sptep); 875 - old_spte = *sptep; 876 - if (is_last_spte(old_spte, level)) { 877 - pt_element_t gpte; 878 - gpa_t pte_gpa; 879 - 880 - if (!sp->unsync) 881 - break; 882 - 883 - pte_gpa = FNAME(get_level1_sp_gpa)(sp); 884 - pte_gpa += spte_index(sptep) * sizeof(pt_element_t); 885 - 886 - mmu_page_zap_pte(vcpu->kvm, sp, sptep, NULL); 887 - if (is_shadow_present_pte(old_spte)) 888 - kvm_flush_remote_tlbs_sptep(vcpu->kvm, sptep); 889 - 890 - if (!rmap_can_add(vcpu)) 891 - break; 892 - 893 - if (kvm_vcpu_read_guest_atomic(vcpu, pte_gpa, &gpte, 894 - sizeof(pt_element_t))) 895 - break; 896 - 897 - FNAME(prefetch_gpte)(vcpu, sp, sptep, gpte, false); 898 - } 899 - 900 - if (!sp->unsync_children) 901 - break; 902 - } 903 - write_unlock(&vcpu->kvm->mmu_lock); 904 - } 905 - 906 890 /* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */ 907 891 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, 908 892 gpa_t addr, u64 access, ··· 877 977 * can't change unless all sptes pointing to it are nuked first. 
878 978 * 879 979 * Returns 880 - * < 0: the sp should be zapped 881 - * 0: the sp is synced and no tlb flushing is required 882 - * > 0: the sp is synced and tlb flushing is required 980 + * < 0: failed to sync spte 981 + * 0: the spte is synced and no tlb flushing is required 982 + * > 0: the spte is synced and tlb flushing is required 883 983 */ 884 - static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) 984 + static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i) 885 985 { 886 - union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role; 887 - int i; 888 986 bool host_writable; 889 987 gpa_t first_pte_gpa; 890 - bool flush = false; 988 + u64 *sptep, spte; 989 + struct kvm_memory_slot *slot; 990 + unsigned pte_access; 991 + pt_element_t gpte; 992 + gpa_t pte_gpa; 993 + gfn_t gfn; 891 994 892 - /* 893 - * Ignore various flags when verifying that it's safe to sync a shadow 894 - * page using the current MMU context. 895 - * 896 - * - level: not part of the overall MMU role and will never match as the MMU's 897 - * level tracks the root level 898 - * - access: updated based on the new guest PTE 899 - * - quadrant: not part of the overall MMU role (similar to level) 900 - */ 901 - const union kvm_mmu_page_role sync_role_ign = { 902 - .level = 0xf, 903 - .access = 0x7, 904 - .quadrant = 0x3, 905 - .passthrough = 0x1, 906 - }; 907 - 908 - /* 909 - * Direct pages can never be unsync, and KVM should never attempt to 910 - * sync a shadow page for a different MMU context, e.g. if the role 911 - * differs then the memslot lookup (SMM vs. non-SMM) will be bogus, the 912 - * reserved bits checks will be wrong, etc... 
913 - */ 914 - if (WARN_ON_ONCE(sp->role.direct || 915 - (sp->role.word ^ root_role.word) & ~sync_role_ign.word)) 916 - return -1; 995 + if (WARN_ON_ONCE(!sp->spt[i])) 996 + return 0; 917 997 918 998 first_pte_gpa = FNAME(get_level1_sp_gpa)(sp); 999 + pte_gpa = first_pte_gpa + i * sizeof(pt_element_t); 919 1000 920 - for (i = 0; i < SPTE_ENT_PER_PAGE; i++) { 921 - u64 *sptep, spte; 922 - struct kvm_memory_slot *slot; 923 - unsigned pte_access; 924 - pt_element_t gpte; 925 - gpa_t pte_gpa; 926 - gfn_t gfn; 1001 + if (kvm_vcpu_read_guest_atomic(vcpu, pte_gpa, &gpte, 1002 + sizeof(pt_element_t))) 1003 + return -1; 927 1004 928 - if (!sp->spt[i]) 929 - continue; 1005 + if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) 1006 + return 1; 930 1007 931 - pte_gpa = first_pte_gpa + i * sizeof(pt_element_t); 1008 + gfn = gpte_to_gfn(gpte); 1009 + pte_access = sp->role.access; 1010 + pte_access &= FNAME(gpte_access)(gpte); 1011 + FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte); 932 1012 933 - if (kvm_vcpu_read_guest_atomic(vcpu, pte_gpa, &gpte, 934 - sizeof(pt_element_t))) 935 - return -1; 936 - 937 - if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) { 938 - flush = true; 939 - continue; 940 - } 941 - 942 - gfn = gpte_to_gfn(gpte); 943 - pte_access = sp->role.access; 944 - pte_access &= FNAME(gpte_access)(gpte); 945 - FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte); 946 - 947 - if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access)) 948 - continue; 949 - 950 - /* 951 - * Drop the SPTE if the new protections would result in a RWX=0 952 - * SPTE or if the gfn is changing. The RWX=0 case only affects 953 - * EPT with execute-only support, i.e. EPT without an effective 954 - * "present" bit, as all other paging modes will create a 955 - * read-only SPTE if pte_access is zero. 
956 - */ 957 - if ((!pte_access && !shadow_present_mask) || 958 - gfn != kvm_mmu_page_get_gfn(sp, i)) { 959 - drop_spte(vcpu->kvm, &sp->spt[i]); 960 - flush = true; 961 - continue; 962 - } 963 - 964 - /* Update the shadowed access bits in case they changed. */ 965 - kvm_mmu_page_set_access(sp, i, pte_access); 966 - 967 - sptep = &sp->spt[i]; 968 - spte = *sptep; 969 - host_writable = spte & shadow_host_writable_mask; 970 - slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 971 - make_spte(vcpu, sp, slot, pte_access, gfn, 972 - spte_to_pfn(spte), spte, true, false, 973 - host_writable, &spte); 974 - 975 - flush |= mmu_spte_update(sptep, spte); 976 - } 1013 + if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access)) 1014 + return 0; 977 1015 978 1016 /* 979 - * Note, any flush is purely for KVM's correctness, e.g. when dropping 980 - * an existing SPTE or clearing W/A/D bits to ensure an mmu_notifier 981 - * unmap or dirty logging event doesn't fail to flush. The guest is 982 - * responsible for flushing the TLB to ensure any changes in protection 983 - * bits are recognized, i.e. until the guest flushes or page faults on 984 - * a relevant address, KVM is architecturally allowed to let vCPUs use 985 - * cached translations with the old protection bits. 1017 + * Drop the SPTE if the new protections would result in a RWX=0 1018 + * SPTE or if the gfn is changing. The RWX=0 case only affects 1019 + * EPT with execute-only support, i.e. EPT without an effective 1020 + * "present" bit, as all other paging modes will create a 1021 + * read-only SPTE if pte_access is zero. 986 1022 */ 987 - return flush; 1023 + if ((!pte_access && !shadow_present_mask) || 1024 + gfn != kvm_mmu_page_get_gfn(sp, i)) { 1025 + drop_spte(vcpu->kvm, &sp->spt[i]); 1026 + return 1; 1027 + } 1028 + /* 1029 + * Do nothing if the permissions are unchanged. The existing SPTE is 1030 + * still, and prefetch_invalid_gpte() has verified that the A/D bits 1031 + * are set in the "new" gPTE, i.e. 
there is no danger of missing an A/D 1032 + * update due to A/D bits being set in the SPTE but not the gPTE. 1033 + */ 1034 + if (kvm_mmu_page_get_access(sp, i) == pte_access) 1035 + return 0; 1036 + 1037 + /* Update the shadowed access bits in case they changed. */ 1038 + kvm_mmu_page_set_access(sp, i, pte_access); 1039 + 1040 + sptep = &sp->spt[i]; 1041 + spte = *sptep; 1042 + host_writable = spte & shadow_host_writable_mask; 1043 + slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 1044 + make_spte(vcpu, sp, slot, pte_access, gfn, 1045 + spte_to_pfn(spte), spte, true, false, 1046 + host_writable, &spte); 1047 + 1048 + return mmu_spte_update(sptep, spte); 988 1049 } 989 1050 990 1051 #undef pt_element_t
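
The new per-entry sync_spte() contract (< 0 on failure, 0 when synced without a flush, > 0 when synced and a TLB flush is needed) implies that a caller walking a whole shadow page has to fold the per-entry results together. The following is only a sketch of that folding, not the kernel's actual caller: demo_sync_all(), demo_sync_entry_fn and demo_sync_from_table() are hypothetical names, and the real code operates on struct kvm_mmu_page rather than on a callback and an opaque context.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-entry callback: <0 failure, 0 synced, >0 synced + flush. */
typedef int (*demo_sync_entry_fn)(void *ctx, size_t i);

/*
 * Fold per-entry results into one per-page result:
 *   < 0: some entry failed to sync (the first failure is returned)
 *     0: every entry synced and none asked for a TLB flush
 *     1: every entry synced and at least one asked for a TLB flush
 */
static int demo_sync_all(demo_sync_entry_fn sync, void *ctx, size_t nr_entries)
{
	bool flush = false;

	for (size_t i = 0; i < nr_entries; i++) {
		int ret = sync(ctx, i);

		if (ret < 0)
			return ret;	/* stop at the first failure */

		flush |= ret > 0;	/* remember that a flush was requested */
	}

	return flush;
}

/* Toy callback for experimentation: ctx is an array of canned results. */
static int demo_sync_from_table(void *ctx, size_t i)
{
	const int *results = ctx;

	return results[i];
}
```

Returning early on the first negative value mirrors the "failed to sync spte" case in the comment above; what the caller then does with the shadow page is outside the scope of this sketch.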

+1 -1

arch/x86/kvm/mmu/spte.c

···
164 164     /*
165 165      * For simplicity, enforce the NX huge page mitigation even if not
166 166      * strictly necessary. KVM could ignore the mitigation if paging is
167     -    * disabled in the guest, as the guest doesn't have an page tables to
    167 +    * disabled in the guest, as the guest doesn't have any page tables to
168 168      * abuse. But to safely ignore the mitigation, KVM would have to
169 169      * ensure a new MMU is loaded (or all shadow pages zapped) when CR0.PG
170 170      * is toggled on, and that's a net negative for performance when TDP is

+34 -14

arch/x86/kvm/mmu/tdp_iter.h

··· 29 29 WRITE_ONCE(*rcu_dereference(sptep), new_spte); 30 30 } 31 31 32 + /* 33 + * SPTEs must be modified atomically if they are shadow-present, leaf 34 + * SPTEs, and have volatile bits, i.e. has bits that can be set outside 35 + * of mmu_lock. The Writable bit can be set by KVM's fast page fault 36 + * handler, and Accessed and Dirty bits can be set by the CPU. 37 + * 38 + * Note, non-leaf SPTEs do have Accessed bits and those bits are 39 + * technically volatile, but KVM doesn't consume the Accessed bit of 40 + * non-leaf SPTEs, i.e. KVM doesn't care if it clobbers the bit. This 41 + * logic needs to be reassessed if KVM were to use non-leaf Accessed 42 + * bits, e.g. to skip stepping down into child SPTEs when aging SPTEs. 43 + */ 44 + static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level) 45 + { 46 + return is_shadow_present_pte(old_spte) && 47 + is_last_spte(old_spte, level) && 48 + spte_has_volatile_bits(old_spte); 49 + } 50 + 32 51 static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte, 33 52 u64 new_spte, int level) 34 53 { 35 - /* 36 - * Atomically write the SPTE if it is a shadow-present, leaf SPTE with 37 - * volatile bits, i.e. has bits that can be set outside of mmu_lock. 38 - * The Writable bit can be set by KVM's fast page fault handler, and 39 - * Accessed and Dirty bits can be set by the CPU. 40 - * 41 - * Note, non-leaf SPTEs do have Accessed bits and those bits are 42 - * technically volatile, but KVM doesn't consume the Accessed bit of 43 - * non-leaf SPTEs, i.e. KVM doesn't care if it clobbers the bit. This 44 - * logic needs to be reassessed if KVM were to use non-leaf Accessed 45 - * bits, e.g. to skip stepping down into child SPTEs when aging SPTEs. 
46 - */ 47 - if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) && 48 - spte_has_volatile_bits(old_spte)) 54 + if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) 49 55 return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte); 50 56 51 57 __kvm_tdp_mmu_write_spte(sptep, new_spte); 58 + return old_spte; 59 + } 60 + 61 + static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte, 62 + u64 mask, int level) 63 + { 64 + atomic64_t *sptep_atomic; 65 + 66 + if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) { 67 + sptep_atomic = (atomic64_t *)rcu_dereference(sptep); 68 + return (u64)atomic64_fetch_and(~mask, sptep_atomic); 69 + } 70 + 71 + __kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask); 52 72 return old_spte; 53 73 } 54 74
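
The reason tdp_mmu_clear_spte_bits() uses an atomic fetch-and for volatile SPTEs can be illustrated in userspace with C11 atomics. This is a minimal sketch under stated assumptions: the DEMO_* bit positions are invented and are not KVM's SPTE layout, demo_need_atomic_write() simply treats every present leaf entry as volatile (the kernel helper additionally checks spte_has_volatile_bits()), and the demo_* functions are hypothetical stand-ins, not kernel APIs.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented bit positions for illustration only; not KVM's SPTE layout. */
#define DEMO_PRESENT  (UINT64_C(1) << 0)
#define DEMO_ACCESSED (UINT64_C(1) << 5)
#define DEMO_DIRTY    (UINT64_C(1) << 6)

/* Simplified stand-in: every present leaf entry is assumed to be volatile. */
static bool demo_need_atomic_write(uint64_t old_spte, bool is_leaf)
{
	return (old_spte & DEMO_PRESENT) && is_leaf;
}

/*
 * Clear @mask in *sptep.  On the atomic path the return value is what was
 * really in memory, which may differ from the caller's @old_spte snapshot
 * if another agent (e.g. hardware) set a bit in the meantime.
 */
static uint64_t demo_clear_spte_bits(_Atomic uint64_t *sptep, uint64_t old_spte,
				     uint64_t mask, bool is_leaf)
{
	if (demo_need_atomic_write(old_spte, is_leaf))
		return atomic_fetch_and(sptep, ~mask);

	/* Non-volatile entry: the snapshot is trusted and simply overwritten. */
	atomic_store_explicit(sptep, old_spte & ~mask, memory_order_relaxed);
	return old_spte;
}
```

If the caller's snapshot has only the Accessed bit set but the entry in memory has since gained the Dirty bit, the fetch-and clears Accessed without overwriting Dirty and hands the up-to-date value back. A plain store of `old_spte & ~mask` would silently drop the Dirty bit in that situation.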

+72 -143

arch/x86/kvm/mmu/tdp_mmu.c

··· 334 334 u64 old_spte, u64 new_spte, int level, 335 335 bool shared); 336 336 337 - static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level) 338 - { 339 - if (!is_shadow_present_pte(old_spte) || !is_last_spte(old_spte, level)) 340 - return; 341 - 342 - if (is_accessed_spte(old_spte) && 343 - (!is_shadow_present_pte(new_spte) || !is_accessed_spte(new_spte) || 344 - spte_to_pfn(old_spte) != spte_to_pfn(new_spte))) 345 - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); 346 - } 347 - 348 - static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn, 349 - u64 old_spte, u64 new_spte, int level) 350 - { 351 - bool pfn_changed; 352 - struct kvm_memory_slot *slot; 353 - 354 - if (level > PG_LEVEL_4K) 355 - return; 356 - 357 - pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte); 358 - 359 - if ((!is_writable_pte(old_spte) || pfn_changed) && 360 - is_writable_pte(new_spte)) { 361 - slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn); 362 - mark_page_dirty_in_slot(kvm, slot, gfn); 363 - } 364 - } 365 - 366 337 static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) 367 338 { 368 339 kvm_account_pgtable_pages((void *)sp->spt, +1); ··· 476 505 } 477 506 478 507 /** 479 - * __handle_changed_spte - handle bookkeeping associated with an SPTE change 508 + * handle_changed_spte - handle bookkeeping associated with an SPTE change 480 509 * @kvm: kvm instance 481 510 * @as_id: the address space of the paging structure the SPTE was a part of 482 511 * @gfn: the base GFN that was mapped by the SPTE ··· 487 516 * the MMU lock and the operation must synchronize with other 488 517 * threads that might be modifying SPTEs. 489 518 * 490 - * Handle bookkeeping that might result from the modification of a SPTE. 491 - * This function must be called for all TDP SPTE modifications. 519 + * Handle bookkeeping that might result from the modification of a SPTE. 
Note, 520 + * dirty logging updates are handled in common code, not here (see make_spte() 521 + * and fast_pf_fix_direct_spte()). 492 522 */ 493 - static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, 494 - u64 old_spte, u64 new_spte, int level, 495 - bool shared) 523 + static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, 524 + u64 old_spte, u64 new_spte, int level, 525 + bool shared) 496 526 { 497 527 bool was_present = is_shadow_present_pte(old_spte); 498 528 bool is_present = is_shadow_present_pte(new_spte); ··· 577 605 if (was_present && !was_leaf && 578 606 (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) 579 607 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared); 580 - } 581 608 582 - static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn, 583 - u64 old_spte, u64 new_spte, int level, 584 - bool shared) 585 - { 586 - __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, 587 - shared); 588 - handle_changed_spte_acc_track(old_spte, new_spte, level); 589 - handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, 590 - new_spte, level); 609 + if (was_leaf && is_accessed_spte(old_spte) && 610 + (!is_present || !is_accessed_spte(new_spte) || pfn_changed)) 611 + kvm_set_pfn_accessed(spte_to_pfn(old_spte)); 591 612 } 592 613 593 614 /* ··· 623 658 if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte)) 624 659 return -EBUSY; 625 660 626 - __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, 627 - new_spte, iter->level, true); 628 - handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level); 661 + handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, 662 + new_spte, iter->level, true); 629 663 630 664 return 0; 631 665 } ··· 660 696 661 697 662 698 /* 663 - * __tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping 699 + * tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping 664 700 * @kvm: KVM instance 665 
701 * @as_id: Address space ID, i.e. regular vs. SMM 666 702 * @sptep: Pointer to the SPTE ··· 668 704 * @new_spte: The new value that will be set for the SPTE 669 705 * @gfn: The base GFN that was (or will be) mapped by the SPTE 670 706 * @level: The level _containing_ the SPTE (its parent PT's level) 671 - * @record_acc_track: Notify the MM subsystem of changes to the accessed state 672 - * of the page. Should be set unless handling an MMU 673 - * notifier for access tracking. Leaving record_acc_track 674 - * unset in that case prevents page accesses from being 675 - * double counted. 676 - * @record_dirty_log: Record the page as dirty in the dirty bitmap if 677 - * appropriate for the change being made. Should be set 678 - * unless performing certain dirty logging operations. 679 - * Leaving record_dirty_log unset in that case prevents page 680 - * writes from being double counted. 681 707 * 682 708 * Returns the old SPTE value, which _may_ be different than @old_spte if the 683 709 * SPTE had voldatile bits. 
684 710 */ 685 - static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, 686 - u64 old_spte, u64 new_spte, gfn_t gfn, int level, 687 - bool record_acc_track, bool record_dirty_log) 711 + static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, 712 + u64 old_spte, u64 new_spte, gfn_t gfn, int level) 688 713 { 689 714 lockdep_assert_held_write(&kvm->mmu_lock); 690 715 ··· 688 735 689 736 old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level); 690 737 691 - __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false); 692 - 693 - if (record_acc_track) 694 - handle_changed_spte_acc_track(old_spte, new_spte, level); 695 - if (record_dirty_log) 696 - handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, 697 - new_spte, level); 738 + handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false); 698 739 return old_spte; 699 740 } 700 741 701 - static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, 702 - u64 new_spte, bool record_acc_track, 703 - bool record_dirty_log) 742 + static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter, 743 + u64 new_spte) 704 744 { 705 745 WARN_ON_ONCE(iter->yielded); 706 - 707 - iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, 708 - iter->old_spte, new_spte, 709 - iter->gfn, iter->level, 710 - record_acc_track, record_dirty_log); 711 - } 712 - 713 - static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, 714 - u64 new_spte) 715 - { 716 - _tdp_mmu_set_spte(kvm, iter, new_spte, true, true); 717 - } 718 - 719 - static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm, 720 - struct tdp_iter *iter, 721 - u64 new_spte) 722 - { 723 - _tdp_mmu_set_spte(kvm, iter, new_spte, false, true); 724 - } 725 - 726 - static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, 727 - struct tdp_iter *iter, 728 - u64 new_spte) 729 - { 730 - _tdp_mmu_set_spte(kvm, iter, new_spte, true, false); 746 
+ iter->old_spte = tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, 747 + iter->old_spte, new_spte, 748 + iter->gfn, iter->level); 731 749 } 732 750 733 751 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \ ··· 790 866 continue; 791 867 792 868 if (!shared) 793 - tdp_mmu_set_spte(kvm, &iter, 0); 869 + tdp_mmu_iter_set_spte(kvm, &iter, 0); 794 870 else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0)) 795 871 goto retry; 796 872 } ··· 847 923 if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) 848 924 return false; 849 925 850 - __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0, 851 - sp->gfn, sp->role.level + 1, true, true); 926 + tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0, 927 + sp->gfn, sp->role.level + 1); 852 928 853 929 return true; 854 930 } ··· 882 958 !is_last_spte(iter.old_spte, iter.level)) 883 959 continue; 884 960 885 - tdp_mmu_set_spte(kvm, &iter, 0); 961 + tdp_mmu_iter_set_spte(kvm, &iter, 0); 886 962 flush = true; 887 963 } 888 964 ··· 1052 1128 if (ret) 1053 1129 return ret; 1054 1130 } else { 1055 - tdp_mmu_set_spte(kvm, iter, spte); 1131 + tdp_mmu_iter_set_spte(kvm, iter, spte); 1056 1132 } 1057 1133 1058 1134 tdp_account_mmu_page(kvm, sp); ··· 1186 1262 /* 1187 1263 * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero 1188 1264 * if any of the GFNs in the range have been accessed. 1265 + * 1266 + * No need to mark the corresponding PFN as accessed as this call is coming 1267 + * from the clear_young() or clear_flush_young() notifier, which uses the 1268 + * return value to determine if the page has been accessed. 1189 1269 */ 1190 1270 static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter, 1191 1271 struct kvm_gfn_range *range) 1192 1272 { 1193 - u64 new_spte = 0; 1273 + u64 new_spte; 1194 1274 1195 1275 /* If we have a non-accessed entry we don't need to change the pte. 
*/ 1196 1276 if (!is_accessed_spte(iter->old_spte)) 1197 1277 return false; 1198 1278 1199 - new_spte = iter->old_spte; 1200 - 1201 - if (spte_ad_enabled(new_spte)) { 1202 - new_spte &= ~shadow_accessed_mask; 1279 + if (spte_ad_enabled(iter->old_spte)) { 1280 + iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep, 1281 + iter->old_spte, 1282 + shadow_accessed_mask, 1283 + iter->level); 1284 + new_spte = iter->old_spte & ~shadow_accessed_mask; 1203 1285 } else { 1204 1286 /* 1205 1287 * Capture the dirty status of the page, so that it doesn't get 1206 1288 * lost when the SPTE is marked for access tracking. 1207 1289 */ 1208 - if (is_writable_pte(new_spte)) 1209 - kvm_set_pfn_dirty(spte_to_pfn(new_spte)); 1290 + if (is_writable_pte(iter->old_spte)) 1291 + kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte)); 1210 1292 1211 - new_spte = mark_spte_for_access_track(new_spte); 1293 + new_spte = mark_spte_for_access_track(iter->old_spte); 1294 + iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep, 1295 + iter->old_spte, new_spte, 1296 + iter->level); 1212 1297 } 1213 1298 1214 - tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte); 1215 - 1299 + trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level, 1300 + iter->old_spte, new_spte); 1216 1301 return true; 1217 1302 } 1218 1303 ··· 1257 1324 * Note, when changing a read-only SPTE, it's not strictly necessary to 1258 1325 * zero the SPTE before setting the new PFN, but doing so preserves the 1259 1326 * invariant that the PFN of a present * leaf SPTE can never change. 1260 - * See __handle_changed_spte(). 1327 + * See handle_changed_spte(). 
1261 1328 */ 1262 - tdp_mmu_set_spte(kvm, iter, 0); 1329 + tdp_mmu_iter_set_spte(kvm, iter, 0); 1263 1330 1264 1331 if (!pte_write(range->pte)) { 1265 1332 new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte, 1266 1333 pte_pfn(range->pte)); 1267 1334 1268 - tdp_mmu_set_spte(kvm, iter, new_spte); 1335 + tdp_mmu_iter_set_spte(kvm, iter, new_spte); 1269 1336 } 1270 1337 1271 1338 return true; ··· 1282 1349 /* 1283 1350 * No need to handle the remote TLB flush under RCU protection, the 1284 1351 * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a 1285 - * shadow page. See the WARN on pfn_changed in __handle_changed_spte(). 1352 + * shadow page. See the WARN on pfn_changed in handle_changed_spte(). 1286 1353 */ 1287 1354 return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn); 1288 1355 } ··· 1540 1607 static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, 1541 1608 gfn_t start, gfn_t end) 1542 1609 { 1610 + u64 dbit = kvm_ad_enabled() ? shadow_dirty_mask : PT_WRITABLE_MASK; 1543 1611 struct tdp_iter iter; 1544 - u64 new_spte; 1545 1612 bool spte_set = false; 1546 1613 1547 1614 rcu_read_lock(); ··· 1554 1621 if (!is_shadow_present_pte(iter.old_spte)) 1555 1622 continue; 1556 1623 1557 - if (spte_ad_need_write_protect(iter.old_spte)) { 1558 - if (is_writable_pte(iter.old_spte)) 1559 - new_spte = iter.old_spte & ~PT_WRITABLE_MASK; 1560 - else 1561 - continue; 1562 - } else { 1563 - if (iter.old_spte & shadow_dirty_mask) 1564 - new_spte = iter.old_spte & ~shadow_dirty_mask; 1565 - else 1566 - continue; 1567 - } 1624 + MMU_WARN_ON(kvm_ad_enabled() && 1625 + spte_ad_need_write_protect(iter.old_spte)); 1568 1626 1569 - if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) 1627 + if (!(iter.old_spte & dbit)) 1628 + continue; 1629 + 1630 + if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit)) 1570 1631 goto retry; 1571 1632 1572 1633 spte_set = true; ··· 1602 1675 static void clear_dirty_pt_masked(struct kvm *kvm, 
struct kvm_mmu_page *root, 1603 1676 gfn_t gfn, unsigned long mask, bool wrprot) 1604 1677 { 1678 + u64 dbit = (wrprot || !kvm_ad_enabled()) ? PT_WRITABLE_MASK : 1679 + shadow_dirty_mask; 1605 1680 struct tdp_iter iter; 1606 - u64 new_spte; 1607 1681 1608 1682 rcu_read_lock(); 1609 1683 ··· 1613 1685 if (!mask) 1614 1686 break; 1615 1687 1688 + MMU_WARN_ON(kvm_ad_enabled() && 1689 + spte_ad_need_write_protect(iter.old_spte)); 1690 + 1616 1691 if (iter.level > PG_LEVEL_4K || 1617 1692 !(mask & (1UL << (iter.gfn - gfn)))) 1618 1693 continue; 1619 1694 1620 1695 mask &= ~(1UL << (iter.gfn - gfn)); 1621 1696 1622 - if (wrprot || spte_ad_need_write_protect(iter.old_spte)) { 1623 - if (is_writable_pte(iter.old_spte)) 1624 - new_spte = iter.old_spte & ~PT_WRITABLE_MASK; 1625 - else 1626 - continue; 1627 - } else { 1628 - if (iter.old_spte & shadow_dirty_mask) 1629 - new_spte = iter.old_spte & ~shadow_dirty_mask; 1630 - else 1631 - continue; 1632 - } 1697 + if (!(iter.old_spte & dbit)) 1698 + continue; 1633 1699 1634 - tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); 1700 + iter.old_spte = tdp_mmu_clear_spte_bits(iter.sptep, 1701 + iter.old_spte, dbit, 1702 + iter.level); 1703 + 1704 + trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level, 1705 + iter.old_spte, 1706 + iter.old_spte & ~dbit); 1707 + kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte)); 1635 1708 } 1636 1709 1637 1710 rcu_read_unlock(); ··· 1750 1821 if (new_spte == iter.old_spte) 1751 1822 break; 1752 1823 1753 - tdp_mmu_set_spte(kvm, &iter, new_spte); 1824 + tdp_mmu_iter_set_spte(kvm, &iter, new_spte); 1754 1825 spte_set = true; 1755 1826 } 1756 1827
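
The reworked clear_dirty_pt_masked() picks a single "dirty bit" up front (the Writable bit when write-protecting or when A/D bits are disabled, otherwise the Dirty bit) and then clears it for every GFN selected by the mask. A simplified, non-atomic sketch of that control flow follows. The DEMO_* bit positions and the demo_* helpers are hypothetical, a flat array stands in for the TDP iterator, and the caller is assumed to pass an array that covers the highest bit set in the mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented bit positions for illustration only; not KVM's SPTE layout. */
#define DEMO_WRITABLE (UINT64_C(1) << 1)
#define DEMO_DIRTY    (UINT64_C(1) << 6)

/* Choose which bit represents "dirty" for this pass. */
static uint64_t demo_pick_dbit(bool wrprot, bool ad_enabled)
{
	return (wrprot || !ad_enabled) ? DEMO_WRITABLE : DEMO_DIRTY;
}

/*
 * sptes[i] backs the GFN at offset i from the base GFN.  Bit i of @mask
 * selects that GFN.  Returns the number of entries that were modified.
 */
static unsigned int demo_clear_dirty_masked(uint64_t *sptes, unsigned long mask,
					    bool wrprot, bool ad_enabled)
{
	uint64_t dbit = demo_pick_dbit(wrprot, ad_enabled);
	unsigned int changed = 0;

	for (unsigned int i = 0; mask; i++) {
		if (!(mask & (1UL << i)))
			continue;

		/* Consume the bit so the loop ends once all GFNs are handled. */
		mask &= ~(1UL << i);

		if (!(sptes[i] & dbit))
			continue;	/* already clean, nothing to do */

		sptes[i] &= ~dbit;
		changed++;
	}

	return changed;
}
```

The kernel version performs the clear through tdp_mmu_clear_spte_bits() so that concurrent hardware updates are not lost; the plain `&=` here is only acceptable because the sketch is single-threaded.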

+16 -9

arch/x86/kvm/pmu.c

··· 93 93 #undef __KVM_X86_PMU_OP 94 94 } 95 95 96 - static inline bool pmc_is_enabled(struct kvm_pmc *pmc) 96 + static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc) 97 97 { 98 98 return static_call(kvm_x86_pmu_pmc_is_enabled)(pmc); 99 99 } ··· 400 400 return is_fixed_event_allowed(filter, pmc->idx); 401 401 } 402 402 403 + static bool pmc_event_is_allowed(struct kvm_pmc *pmc) 404 + { 405 + return pmc_is_globally_enabled(pmc) && pmc_speculative_in_use(pmc) && 406 + check_pmu_event_filter(pmc); 407 + } 408 + 403 409 static void reprogram_counter(struct kvm_pmc *pmc) 404 410 { 405 411 struct kvm_pmu *pmu = pmc_to_pmu(pmc); ··· 415 409 416 410 pmc_pause_counter(pmc); 417 411 418 - if (!pmc_speculative_in_use(pmc) || !pmc_is_enabled(pmc)) 419 - goto reprogram_complete; 420 - 421 - if (!check_pmu_event_filter(pmc)) 412 + if (!pmc_event_is_allowed(pmc)) 422 413 goto reprogram_complete; 423 414 424 415 if (pmc->counter < pmc->prev_counter) ··· 543 540 if (!pmc) 544 541 return 1; 545 542 546 - if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) && 543 + if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_PCE) && 547 544 (static_call(kvm_x86_get_cpl)(vcpu) != 0) && 548 - (kvm_read_cr0(vcpu) & X86_CR0_PE)) 545 + kvm_is_cr0_bit_set(vcpu, X86_CR0_PE)) 549 546 return 1; 550 547 551 548 *data = pmc_read_counter(pmc) & mask; ··· 592 589 */ 593 590 void kvm_pmu_refresh(struct kvm_vcpu *vcpu) 594 591 { 592 + if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm)) 593 + return; 594 + 595 + bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX); 595 596 static_call(kvm_x86_pmu_refresh)(vcpu); 596 597 } 597 598 ··· 653 646 { 654 647 pmc->prev_counter = pmc->counter; 655 648 pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc); 656 - kvm_pmu_request_counter_reprogam(pmc); 649 + kvm_pmu_request_counter_reprogram(pmc); 657 650 } 658 651 659 652 static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc, ··· 691 684 for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) { 692 685 pmc 
= static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i); 693 686 694 - if (!pmc || !pmc_is_enabled(pmc) || !pmc_speculative_in_use(pmc)) 687 + if (!pmc || !pmc_event_is_allowed(pmc)) 695 688 continue; 696 689 697 690 /* Ignore checks for edge detect, pin control, invert and CMASK bits */

+1 -1

arch/x86/kvm/pmu.h

···
 					     KVM_PMC_MAX_FIXED);
 }

-static inline void kvm_pmu_request_counter_reprogam(struct kvm_pmc *pmc)
+static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
 {
 	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
 	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);

+75 -16

arch/x86/kvm/svm/nested.c

··· 139 139 140 140 if (g->int_ctl & V_INTR_MASKING_MASK) { 141 141 /* 142 - * Once running L2 with HF_VINTR_MASK, EFLAGS.IF and CR8 143 - * does not affect any interrupt we may want to inject; 144 - * therefore, writes to CR8 are irrelevant to L0, as are 145 - * interrupt window vmexits. 142 + * If L2 is active and V_INTR_MASKING is enabled in vmcb12, 143 + * disable intercept of CR8 writes as L2's CR8 does not affect 144 + * any interrupt KVM may want to inject. 145 + * 146 + * Similarly, disable intercept of virtual interrupts (used to 147 + * detect interrupt windows) if the saved RFLAGS.IF is '0', as 148 + * the effective RFLAGS.IF for L1 interrupts will never be set 149 + * while L2 is running (L2's RFLAGS.IF doesn't affect L1 IRQs). 146 150 */ 147 151 vmcb_clr_intercept(c, INTERCEPT_CR8_WRITE); 148 - vmcb_clr_intercept(c, INTERCEPT_VINTR); 152 + if (!(svm->vmcb01.ptr->save.rflags & X86_EFLAGS_IF)) 153 + vmcb_clr_intercept(c, INTERCEPT_VINTR); 149 154 } 150 155 151 156 /* ··· 280 275 281 276 if (CC(!nested_svm_check_tlb_ctl(vcpu, control->tlb_ctl))) 282 277 return false; 278 + 279 + if (CC((control->int_ctl & V_NMI_ENABLE_MASK) && 280 + !vmcb12_is_intercept(control, INTERCEPT_NMI))) { 281 + return false; 282 + } 283 283 284 284 return true; 285 285 } ··· 426 416 427 417 /* Only a few fields of int_ctl are written by the processor. */ 428 418 mask = V_IRQ_MASK | V_TPR_MASK; 429 - if (!(svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK) && 430 - svm_is_intercept(svm, INTERCEPT_VINTR)) { 431 - /* 432 - * In order to request an interrupt window, L0 is usurping 433 - * svm->vmcb->control.int_ctl and possibly setting V_IRQ 434 - * even if it was clear in L1's VMCB. Restoring it would be 435 - * wrong. However, in this case V_IRQ will remain true until 436 - * interrupt_window_interception calls svm_clear_vintr and 437 - * restores int_ctl. We can just leave it aside. 
438 - */ 419 + /* 420 + * Don't sync vmcb02 V_IRQ back to vmcb12 if KVM (L0) is intercepting 421 + * virtual interrupts in order to request an interrupt window, as KVM 422 + * has usurped vmcb02's int_ctl. If an interrupt window opens before 423 + * the next VM-Exit, svm_clear_vintr() will restore vmcb12's int_ctl. 424 + * If no window opens, V_IRQ will be correctly preserved in vmcb12's 425 + * int_ctl (because it was never recognized while L2 was running). 426 + */ 427 + if (svm_is_intercept(svm, INTERCEPT_VINTR) && 428 + !test_bit(INTERCEPT_VINTR, (unsigned long *)svm->nested.ctl.intercepts)) 439 429 mask &= ~V_IRQ_MASK; 440 - } 441 430 442 431 if (nested_vgif_enabled(svm)) 443 432 mask |= V_GIF_MASK; 433 + 434 + if (nested_vnmi_enabled(svm)) 435 + mask |= V_NMI_BLOCKING_MASK | V_NMI_PENDING_MASK; 444 436 445 437 svm->nested.ctl.int_ctl &= ~mask; 446 438 svm->nested.ctl.int_ctl |= svm->vmcb->control.int_ctl & mask; ··· 662 650 int_ctl_vmcb12_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK); 663 651 else 664 652 int_ctl_vmcb01_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK); 653 + 654 + if (vnmi) { 655 + if (vmcb01->control.int_ctl & V_NMI_PENDING_MASK) { 656 + svm->vcpu.arch.nmi_pending++; 657 + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); 658 + } 659 + if (nested_vnmi_enabled(svm)) 660 + int_ctl_vmcb12_bits |= (V_NMI_PENDING_MASK | 661 + V_NMI_ENABLE_MASK | 662 + V_NMI_BLOCKING_MASK); 663 + } 665 664 666 665 /* Copied from vmcb01. msrpm_base can be overwritten later. */ 667 666 vmcb02->control.nested_ctl = vmcb01->control.nested_ctl; ··· 1044 1021 1045 1022 svm_switch_vmcb(svm, &svm->vmcb01); 1046 1023 1024 + /* 1025 + * Rules for synchronizing int_ctl bits from vmcb02 to vmcb01: 1026 + * 1027 + * V_IRQ, V_IRQ_VECTOR, V_INTR_PRIO_MASK, V_IGN_TPR: If L1 doesn't 1028 + * intercept interrupts, then KVM will use vmcb02's V_IRQ (and related 1029 + * flags) to detect interrupt windows for L1 IRQs (even if L1 uses 1030 + * virtual interrupt masking). 
Raise KVM_REQ_EVENT to ensure that 1031 + * KVM re-requests an interrupt window if necessary, which implicitly 1032 + * copies this bits from vmcb02 to vmcb01. 1033 + * 1034 + * V_TPR: If L1 doesn't use virtual interrupt masking, then L1's vTPR 1035 + * is stored in vmcb02, but its value doesn't need to be copied from/to 1036 + * vmcb01 because it is copied from/to the virtual APIC's TPR register 1037 + * on each VM entry/exit. 1038 + * 1039 + * V_GIF: If nested vGIF is not used, KVM uses vmcb02's V_GIF for L1's 1040 + * V_GIF. However, GIF is architecturally clear on each VM exit, thus 1041 + * there is no need to copy V_GIF from vmcb02 to vmcb01. 1042 + */ 1043 + if (!nested_exit_on_intr(svm)) 1044 + kvm_make_request(KVM_REQ_EVENT, &svm->vcpu); 1045 + 1047 1046 if (unlikely(svm->lbrv_enabled && (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { 1048 1047 svm_copy_lbrs(vmcb12, vmcb02); 1049 1048 svm_update_lbrv(vcpu); 1050 1049 } else if (unlikely(vmcb01->control.virt_ext & LBR_CTL_ENABLE_MASK)) { 1051 1050 svm_copy_lbrs(vmcb01, vmcb02); 1052 1051 svm_update_lbrv(vcpu); 1052 + } 1053 + 1054 + if (vnmi) { 1055 + if (vmcb02->control.int_ctl & V_NMI_BLOCKING_MASK) 1056 + vmcb01->control.int_ctl |= V_NMI_BLOCKING_MASK; 1057 + else 1058 + vmcb01->control.int_ctl &= ~V_NMI_BLOCKING_MASK; 1059 + 1060 + if (vcpu->arch.nmi_pending) { 1061 + vcpu->arch.nmi_pending--; 1062 + vmcb01->control.int_ctl |= V_NMI_PENDING_MASK; 1063 + } else { 1064 + vmcb01->control.int_ctl &= ~V_NMI_PENDING_MASK; 1065 + } 1053 1066 } 1054 1067 1055 1068 /*
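
Note on the `int_ctl` sync hunk above: the code builds a mask of the bits the processor may have written, drops `V_IRQ` from it when KVM (L0) owns the VINTR intercept and L1 does not, adds the two vNMI bits when nested vNMI is in use, and then copies only the masked bits from vmcb02 into the cached vmcb12 control. The sketch below models just that masked merge in userspace. All `TOY_*` names and bit positions are illustrative assumptions rather than values taken from the kernel headers, and the vGIF bit is left out for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout only; not copied from the kernel headers. */
#define TOY_V_TPR_MASK		0x0fu
#define TOY_V_IRQ_MASK		(1u << 8)
#define TOY_V_NMI_PENDING_MASK	(1u << 11)
#define TOY_V_NMI_BLOCKING_MASK	(1u << 12)

/*
 * Build the set of bits to sync back. l0_owns_virq models "KVM intercepts
 * VINTR but L1 does not", in which case V_IRQ in vmcb02 belongs to KVM.
 */
static uint32_t toy_sync_mask(bool l0_owns_virq, bool nested_vnmi)
{
	uint32_t mask = TOY_V_IRQ_MASK | TOY_V_TPR_MASK;

	if (l0_owns_virq)
		mask &= ~TOY_V_IRQ_MASK;
	if (nested_vnmi)
		mask |= TOY_V_NMI_BLOCKING_MASK | TOY_V_NMI_PENDING_MASK;
	return mask;
}

/* Copy only the masked bits from src (vmcb02) into dst (cached vmcb12). */
static uint32_t toy_sync_int_ctl(uint32_t dst, uint32_t src, uint32_t mask)
{
	return (dst & ~mask) | (src & mask);
}
```

Bits outside the mask keep whatever L1 originally wrote, which is why a `V_IRQ` that KVM borrowed for its own interrupt window never leaks into vmcb12.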

+1 -1

arch/x86/kvm/svm/pmu.c

···
 		data &= ~pmu->reserved_bits;
 		if (data != pmc->eventsel) {
 			pmc->eventsel = data;
-			kvm_pmu_request_counter_reprogam(pmc);
+			kvm_pmu_request_counter_reprogram(pmc);
 		}
 		return 0;
 	}

+145 -56

arch/x86/kvm/svm/svm.c

··· 99 99 #endif 100 100 { .index = MSR_IA32_SPEC_CTRL, .always = false }, 101 101 { .index = MSR_IA32_PRED_CMD, .always = false }, 102 + { .index = MSR_IA32_FLUSH_CMD, .always = false }, 102 103 { .index = MSR_IA32_LASTBRANCHFROMIP, .always = false }, 103 104 { .index = MSR_IA32_LASTBRANCHTOIP, .always = false }, 104 105 { .index = MSR_IA32_LASTINTFROMIP, .always = false }, ··· 235 234 bool intercept_smi = true; 236 235 module_param(intercept_smi, bool, 0444); 237 236 237 + bool vnmi = true; 238 + module_param(vnmi, bool, 0444); 238 239 239 240 static bool svm_gp_erratum_intercept = true; 240 241 ··· 1318 1315 if (kvm_vcpu_apicv_active(vcpu)) 1319 1316 avic_init_vmcb(svm, vmcb); 1320 1317 1318 + if (vnmi) 1319 + svm->vmcb->control.int_ctl |= V_NMI_ENABLE_MASK; 1320 + 1321 1321 if (vgif) { 1322 1322 svm_clr_intercept(svm, INTERCEPT_STGI); 1323 1323 svm_clr_intercept(svm, INTERCEPT_CLGI); ··· 1592 1586 WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu)); 1593 1587 1594 1588 svm_set_intercept(svm, INTERCEPT_VINTR); 1589 + 1590 + /* 1591 + * Recalculating intercepts may have cleared the VINTR intercept. If 1592 + * V_INTR_MASKING is enabled in vmcb12, then the effective RFLAGS.IF 1593 + * for L1 physical interrupts is L1's RFLAGS.IF at the time of VMRUN. 1594 + * Requesting an interrupt window if save.RFLAGS.IF=0 is pointless as 1595 + * interrupts will never be unblocked while L2 is running. 1596 + */ 1597 + if (!svm_is_intercept(svm, INTERCEPT_VINTR)) 1598 + return; 1595 1599 1596 1600 /* 1597 1601 * This is just a dummy VINTR to actually cause a vmexit to happen. 
··· 2500 2484 has_error_code, error_code); 2501 2485 } 2502 2486 2487 + static void svm_clr_iret_intercept(struct vcpu_svm *svm) 2488 + { 2489 + if (!sev_es_guest(svm->vcpu.kvm)) 2490 + svm_clr_intercept(svm, INTERCEPT_IRET); 2491 + } 2492 + 2493 + static void svm_set_iret_intercept(struct vcpu_svm *svm) 2494 + { 2495 + if (!sev_es_guest(svm->vcpu.kvm)) 2496 + svm_set_intercept(svm, INTERCEPT_IRET); 2497 + } 2498 + 2503 2499 static int iret_interception(struct kvm_vcpu *vcpu) 2504 2500 { 2505 2501 struct vcpu_svm *svm = to_svm(vcpu); 2506 2502 2507 2503 ++vcpu->stat.nmi_window_exits; 2508 2504 svm->awaiting_iret_completion = true; 2509 - if (!sev_es_guest(vcpu->kvm)) { 2510 - svm_clr_intercept(svm, INTERCEPT_IRET); 2505 + 2506 + svm_clr_iret_intercept(svm); 2507 + if (!sev_es_guest(vcpu->kvm)) 2511 2508 svm->nmi_iret_rip = kvm_rip_read(vcpu); 2512 - } 2509 + 2513 2510 kvm_make_request(KVM_REQ_EVENT, vcpu); 2514 2511 return 1; 2515 2512 } ··· 2905 2876 static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) 2906 2877 { 2907 2878 struct vcpu_svm *svm = to_svm(vcpu); 2908 - int r; 2879 + int ret = 0; 2909 2880 2910 2881 u32 ecx = msr->index; 2911 2882 u64 data = msr->data; ··· 2975 2946 */ 2976 2947 set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1); 2977 2948 break; 2978 - case MSR_IA32_PRED_CMD: 2979 - if (!msr->host_initiated && 2980 - !guest_has_pred_cmd_msr(vcpu)) 2981 - return 1; 2982 - 2983 - if (data & ~PRED_CMD_IBPB) 2984 - return 1; 2985 - if (!boot_cpu_has(X86_FEATURE_IBPB)) 2986 - return 1; 2987 - if (!data) 2988 - break; 2989 - 2990 - wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB); 2991 - set_msr_interception(vcpu, svm->msrpm, MSR_IA32_PRED_CMD, 0, 1); 2992 - break; 2993 2949 case MSR_AMD64_VIRT_SPEC_CTRL: 2994 2950 if (!msr->host_initiated && 2995 2951 !guest_cpuid_has(vcpu, X86_FEATURE_VIRT_SSBD)) ··· 3027 3013 * guest via direct_access_msrs, and switch it via user return. 
3028 3014 */ 3029 3015 preempt_disable(); 3030 - r = kvm_set_user_return_msr(tsc_aux_uret_slot, data, -1ull); 3016 + ret = kvm_set_user_return_msr(tsc_aux_uret_slot, data, -1ull); 3031 3017 preempt_enable(); 3032 - if (r) 3033 - return 1; 3018 + if (ret) 3019 + break; 3034 3020 3035 3021 svm->tsc_aux = data; 3036 3022 break; ··· 3088 3074 default: 3089 3075 return kvm_set_msr_common(vcpu, msr); 3090 3076 } 3091 - return 0; 3077 + return ret; 3092 3078 } 3093 3079 3094 3080 static int msr_interception(struct kvm_vcpu *vcpu) ··· 3499 3485 return; 3500 3486 3501 3487 svm->nmi_masked = true; 3502 - if (!sev_es_guest(vcpu->kvm)) 3503 - svm_set_intercept(svm, INTERCEPT_IRET); 3488 + svm_set_iret_intercept(svm); 3504 3489 ++vcpu->stat.nmi_injections; 3490 + } 3491 + 3492 + static bool svm_is_vnmi_pending(struct kvm_vcpu *vcpu) 3493 + { 3494 + struct vcpu_svm *svm = to_svm(vcpu); 3495 + 3496 + if (!is_vnmi_enabled(svm)) 3497 + return false; 3498 + 3499 + return !!(svm->vmcb->control.int_ctl & V_NMI_BLOCKING_MASK); 3500 + } 3501 + 3502 + static bool svm_set_vnmi_pending(struct kvm_vcpu *vcpu) 3503 + { 3504 + struct vcpu_svm *svm = to_svm(vcpu); 3505 + 3506 + if (!is_vnmi_enabled(svm)) 3507 + return false; 3508 + 3509 + if (svm->vmcb->control.int_ctl & V_NMI_PENDING_MASK) 3510 + return false; 3511 + 3512 + svm->vmcb->control.int_ctl |= V_NMI_PENDING_MASK; 3513 + vmcb_mark_dirty(svm->vmcb, VMCB_INTR); 3514 + 3515 + /* 3516 + * Because the pending NMI is serviced by hardware, KVM can't know when 3517 + * the NMI is "injected", but for all intents and purposes, passing the 3518 + * NMI off to hardware counts as injection. 
3519 + */ 3520 + ++vcpu->stat.nmi_injections; 3521 + 3522 + return true; 3505 3523 } 3506 3524 3507 3525 static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected) ··· 3631 3585 svm_set_intercept(svm, INTERCEPT_CR8_WRITE); 3632 3586 } 3633 3587 3588 + static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu) 3589 + { 3590 + struct vcpu_svm *svm = to_svm(vcpu); 3591 + 3592 + if (is_vnmi_enabled(svm)) 3593 + return svm->vmcb->control.int_ctl & V_NMI_BLOCKING_MASK; 3594 + else 3595 + return svm->nmi_masked; 3596 + } 3597 + 3598 + static void svm_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked) 3599 + { 3600 + struct vcpu_svm *svm = to_svm(vcpu); 3601 + 3602 + if (is_vnmi_enabled(svm)) { 3603 + if (masked) 3604 + svm->vmcb->control.int_ctl |= V_NMI_BLOCKING_MASK; 3605 + else 3606 + svm->vmcb->control.int_ctl &= ~V_NMI_BLOCKING_MASK; 3607 + 3608 + } else { 3609 + svm->nmi_masked = masked; 3610 + if (masked) 3611 + svm_set_iret_intercept(svm); 3612 + else 3613 + svm_clr_iret_intercept(svm); 3614 + } 3615 + } 3616 + 3634 3617 bool svm_nmi_blocked(struct kvm_vcpu *vcpu) 3635 3618 { 3636 3619 struct vcpu_svm *svm = to_svm(vcpu); ··· 3671 3596 if (is_guest_mode(vcpu) && nested_exit_on_nmi(svm)) 3672 3597 return false; 3673 3598 3674 - return (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) || 3675 - svm->nmi_masked; 3599 + if (svm_get_nmi_mask(vcpu)) 3600 + return true; 3601 + 3602 + return vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK; 3676 3603 } 3677 3604 3678 3605 static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection) ··· 3690 3613 if (for_injection && is_guest_mode(vcpu) && nested_exit_on_nmi(svm)) 3691 3614 return -EBUSY; 3692 3615 return 1; 3693 - } 3694 - 3695 - static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu) 3696 - { 3697 - return to_svm(vcpu)->nmi_masked; 3698 - } 3699 - 3700 - static void svm_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked) 3701 - { 3702 - struct vcpu_svm *svm = to_svm(vcpu); 3703 - 3704 - if (masked) { 3705 - 
svm->nmi_masked = true; 3706 - if (!sev_es_guest(vcpu->kvm)) 3707 - svm_set_intercept(svm, INTERCEPT_IRET); 3708 - } else { 3709 - svm->nmi_masked = false; 3710 - if (!sev_es_guest(vcpu->kvm)) 3711 - svm_clr_intercept(svm, INTERCEPT_IRET); 3712 - } 3713 3616 } 3714 3617 3715 3618 bool svm_interrupt_blocked(struct kvm_vcpu *vcpu) ··· 3772 3715 { 3773 3716 struct vcpu_svm *svm = to_svm(vcpu); 3774 3717 3775 - if (svm->nmi_masked && !svm->awaiting_iret_completion) 3718 + /* 3719 + * KVM should never request an NMI window when vNMI is enabled, as KVM 3720 + * allows at most one to-be-injected NMI and one pending NMI, i.e. if 3721 + * two NMIs arrive simultaneously, KVM will inject one and set 3722 + * V_NMI_PENDING for the other. WARN, but continue with the standard 3723 + * single-step approach to try and salvage the pending NMI. 3724 + */ 3725 + WARN_ON_ONCE(is_vnmi_enabled(svm)); 3726 + 3727 + if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion) 3776 3728 return; /* IRET will cause a vm exit */ 3777 3729 3778 3730 if (!gif_set(svm)) { ··· 3843 3777 { 3844 3778 /* 3845 3779 * When running on Hyper-V with EnlightenedNptTlb enabled, remote TLB 3846 - * flushes should be routed to hv_remote_flush_tlb() without requesting 3780 + * flushes should be routed to hv_flush_remote_tlbs() without requesting 3847 3781 * a "regular" remote flush. Reaching this point means either there's 3848 - * a KVM bug or a prior hv_remote_flush_tlb() call failed, both of 3782 + * a KVM bug or a prior hv_flush_remote_tlbs() call failed, both of 3849 3783 * which might be fatal to the guest. Yell, but try to recover. 3850 3784 */ 3851 3785 if (WARN_ON_ONCE(svm_hv_is_enlightened_tlb_enabled(vcpu))) 3852 - hv_remote_flush_tlb(vcpu->kvm); 3786 + hv_flush_remote_tlbs(vcpu->kvm); 3853 3787 3854 3788 svm_flush_tlb_asid(vcpu); 3855 3789 } ··· 4208 4142 { 4209 4143 switch (index) { 4210 4144 case MSR_IA32_MCG_EXT_CTL: 4211 - case MSR_IA32_VMX_BASIC ... 
MSR_IA32_VMX_VMFUNC: 4145 + case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR: 4212 4146 return false; 4213 4147 case MSR_IA32_SMBASE: 4214 4148 if (!IS_ENABLED(CONFIG_KVM_SMM)) ··· 4250 4184 4251 4185 svm->vgif_enabled = vgif && guest_cpuid_has(vcpu, X86_FEATURE_VGIF); 4252 4186 4187 + svm->vnmi_enabled = vnmi && guest_cpuid_has(vcpu, X86_FEATURE_VNMI); 4188 + 4253 4189 svm_recalc_instruction_intercepts(vcpu, svm); 4190 + 4191 + if (boot_cpu_has(X86_FEATURE_IBPB)) 4192 + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_PRED_CMD, 0, 4193 + !!guest_has_pred_cmd_msr(vcpu)); 4194 + 4195 + if (boot_cpu_has(X86_FEATURE_FLUSH_L1D)) 4196 + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_FLUSH_CMD, 0, 4197 + !!guest_cpuid_has(vcpu, X86_FEATURE_FLUSH_L1D)); 4254 4198 4255 4199 /* For sev guests, the memory encryption bit is not reserved in CR3. */ 4256 4200 if (sev_guest(vcpu->kvm)) { ··· 4639 4563 void *insn, int insn_len) 4640 4564 { 4641 4565 bool smep, smap, is_user; 4642 - unsigned long cr4; 4643 4566 u64 error_code; 4644 4567 4645 4568 /* Emulation is always possible when KVM has access to all guest state. 
*/ ··· 4730 4655 if (error_code & (PFERR_GUEST_PAGE_MASK | PFERR_FETCH_MASK)) 4731 4656 goto resume_guest; 4732 4657 4733 - cr4 = kvm_read_cr4(vcpu); 4734 - smep = cr4 & X86_CR4_SMEP; 4735 - smap = cr4 & X86_CR4_SMAP; 4658 + smep = kvm_is_cr4_bit_set(vcpu, X86_CR4_SMEP); 4659 + smap = kvm_is_cr4_bit_set(vcpu, X86_CR4_SMAP); 4736 4660 is_user = svm_get_cpl(vcpu) == 3; 4737 4661 if (smap && (!smep || is_user)) { 4738 4662 pr_err_ratelimited("SEV Guest triggered AMD Erratum 1096\n"); ··· 4869 4795 .patch_hypercall = svm_patch_hypercall, 4870 4796 .inject_irq = svm_inject_irq, 4871 4797 .inject_nmi = svm_inject_nmi, 4798 + .is_vnmi_pending = svm_is_vnmi_pending, 4799 + .set_vnmi_pending = svm_set_vnmi_pending, 4872 4800 .inject_exception = svm_inject_exception, 4873 4801 .cancel_injection = svm_cancel_injection, 4874 4802 .interrupt_allowed = svm_interrupt_allowed, ··· 5012 4936 5013 4937 if (vgif) 5014 4938 kvm_cpu_cap_set(X86_FEATURE_VGIF); 4939 + 4940 + if (vnmi) 4941 + kvm_cpu_cap_set(X86_FEATURE_VNMI); 5015 4942 5016 4943 /* Nested VM can receive #VMEXIT instead of triggering #GP */ 5017 4944 kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK); ··· 5166 5087 else 5167 5088 pr_info("Virtual GIF supported\n"); 5168 5089 } 5090 + 5091 + vnmi = vgif && vnmi && boot_cpu_has(X86_FEATURE_VNMI); 5092 + if (vnmi) 5093 + pr_info("Virtual NMI enabled\n"); 5094 + 5095 + if (!vnmi) { 5096 + svm_x86_ops.is_vnmi_pending = NULL; 5097 + svm_x86_ops.set_vnmi_pending = NULL; 5098 + } 5099 + 5169 5100 5170 5101 if (lbrv) { 5171 5102 if (!boot_cpu_has(X86_FEATURE_LBRV))
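
Note on `svm_set_vnmi_pending()` in the svm.c hunks above: the function refuses the request when vNMI is not enabled or when a pending NMI is already latched, and otherwise sets the pending bit and counts the hand-off as an injection. A minimal userspace model of that "single pending slot" behaviour follows. `toy_vcpu`, `toy_set_vnmi_pending()` and the bit position are hypothetical, and the real function also marks the VMCB dirty, which is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_V_NMI_PENDING_MASK	(1u << 11)	/* illustrative position */

/* Hypothetical model of the vCPU state this path touches. */
struct toy_vcpu {
	bool vnmi_enabled;		/* models is_vnmi_enabled() */
	uint32_t int_ctl;		/* models vmcb->control.int_ctl */
	unsigned long nmi_injections;	/* models vcpu->stat.nmi_injections */
};

/* Returns true only if the NMI was handed to the (modelled) hardware. */
static bool toy_set_vnmi_pending(struct toy_vcpu *v)
{
	if (!v->vnmi_enabled)
		return false;
	if (v->int_ctl & TOY_V_NMI_PENDING_MASK)
		return false;	/* the single pending slot is occupied */

	v->int_ctl |= TOY_V_NMI_PENDING_MASK;
	v->nmi_injections++;	/* handing off counts as injection */
	return true;
}
```

Calling it twice in a row succeeds once and then fails, so the caller must keep track of any NMI that could not be latched.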

+29

arch/x86/kvm/svm/svm.h

···
 extern int vgif;
 extern bool intercept_smi;
 extern bool x2avic_enabled;
+extern bool vnmi;

 /*
  * Clean bits in VMCB.
···
 	bool pause_filter_enabled : 1;
 	bool pause_threshold_enabled : 1;
 	bool vgif_enabled : 1;
+	bool vnmi_enabled : 1;

 	u32 ldr_reg;
 	u32 dfr_reg;
···
 	return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
 }

+static inline bool nested_vnmi_enabled(struct vcpu_svm *svm)
+{
+	return svm->vnmi_enabled &&
+	       (svm->nested.ctl.int_ctl & V_NMI_ENABLE_MASK);
+}
+
 static inline bool is_x2apic_msrpm_offset(u32 offset)
 {
 	/* 4 msrs per u8, and 4 u8 in u32 */
···

 	return (msr >= APIC_BASE_MSR) &&
 	       (msr < (APIC_BASE_MSR + 0x100));
+}
+
+static inline struct vmcb *get_vnmi_vmcb_l1(struct vcpu_svm *svm)
+{
+	if (!vnmi)
+		return NULL;
+
+	if (is_guest_mode(&svm->vcpu))
+		return NULL;
+	else
+		return svm->vmcb01.ptr;
+}
+
+static inline bool is_vnmi_enabled(struct vcpu_svm *svm)
+{
+	struct vmcb *vmcb = get_vnmi_vmcb_l1(svm);
+
+	if (vmcb)
+		return !!(vmcb->control.int_ctl & V_NMI_ENABLE_MASK);
+	else
+		return false;
 }

 /* svm.c */
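
Note on `get_vnmi_vmcb_l1()` and `is_vnmi_enabled()` in the svm.h hunks above: read together, they report vNMI as enabled only when the module parameter is on, the vCPU is not in guest mode (so vmcb01 is the active VMCB), and vmcb01 has the enable bit set. The sketch below restates that gating in userspace. Every `toy_*` name and the bit position are assumptions made for illustration, with plain fields standing in for the kernel helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_V_NMI_ENABLE_MASK	(1u << 26)	/* illustrative position */

struct toy_vmcb {
	uint32_t int_ctl;
};

/* Hypothetical model of the state the two helpers look at. */
struct toy_svm {
	bool module_vnmi;	/* models the vnmi module parameter */
	bool guest_mode;	/* models is_guest_mode() */
	struct toy_vmcb vmcb01;
};

/* NULL means "vNMI state of L1 is not reachable right now". */
static struct toy_vmcb *toy_get_vnmi_vmcb_l1(struct toy_svm *svm)
{
	if (!svm->module_vnmi || svm->guest_mode)
		return NULL;
	return &svm->vmcb01;
}

static bool toy_is_vnmi_enabled(struct toy_svm *svm)
{
	struct toy_vmcb *vmcb = toy_get_vnmi_vmcb_l1(svm);

	return vmcb && (vmcb->int_ctl & TOY_V_NMI_ENABLE_MASK);
}
```

One consequence worth keeping in mind when reading the svm.c changes is that the helper answers false for the whole time L2 is running, regardless of what vmcb01 contains.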

+2 -3

arch/x86/kvm/svm/svm_onhyperv.h

···
 	if (npt_enabled &&
 	    ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
 		pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
-		svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
-		svm_x86_ops.tlb_remote_flush_with_range =
-				hv_remote_flush_tlb_with_range;
+		svm_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
+		svm_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
 	}

 	if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH) {

+106 -1

arch/x86/kvm/vmx/hyperv.c

··· 13 13 14 14 #define CC KVM_NESTED_VMENTER_CONSISTENCY_CHECK 15 15 16 - DEFINE_STATIC_KEY_FALSE(enable_evmcs); 16 + /* 17 + * Enlightened VMCSv1 doesn't support these: 18 + * 19 + * POSTED_INTR_NV = 0x00000002, 20 + * GUEST_INTR_STATUS = 0x00000810, 21 + * APIC_ACCESS_ADDR = 0x00002014, 22 + * POSTED_INTR_DESC_ADDR = 0x00002016, 23 + * EOI_EXIT_BITMAP0 = 0x0000201c, 24 + * EOI_EXIT_BITMAP1 = 0x0000201e, 25 + * EOI_EXIT_BITMAP2 = 0x00002020, 26 + * EOI_EXIT_BITMAP3 = 0x00002022, 27 + * GUEST_PML_INDEX = 0x00000812, 28 + * PML_ADDRESS = 0x0000200e, 29 + * VM_FUNCTION_CONTROL = 0x00002018, 30 + * EPTP_LIST_ADDRESS = 0x00002024, 31 + * VMREAD_BITMAP = 0x00002026, 32 + * VMWRITE_BITMAP = 0x00002028, 33 + * 34 + * TSC_MULTIPLIER = 0x00002032, 35 + * PLE_GAP = 0x00004020, 36 + * PLE_WINDOW = 0x00004022, 37 + * VMX_PREEMPTION_TIMER_VALUE = 0x0000482E, 38 + * 39 + * Currently unsupported in KVM: 40 + * GUEST_IA32_RTIT_CTL = 0x00002814, 41 + */ 42 + #define EVMCS1_SUPPORTED_PINCTRL \ 43 + (PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | \ 44 + PIN_BASED_EXT_INTR_MASK | \ 45 + PIN_BASED_NMI_EXITING | \ 46 + PIN_BASED_VIRTUAL_NMIS) 47 + 48 + #define EVMCS1_SUPPORTED_EXEC_CTRL \ 49 + (CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR | \ 50 + CPU_BASED_HLT_EXITING | \ 51 + CPU_BASED_CR3_LOAD_EXITING | \ 52 + CPU_BASED_CR3_STORE_EXITING | \ 53 + CPU_BASED_UNCOND_IO_EXITING | \ 54 + CPU_BASED_MOV_DR_EXITING | \ 55 + CPU_BASED_USE_TSC_OFFSETTING | \ 56 + CPU_BASED_MWAIT_EXITING | \ 57 + CPU_BASED_MONITOR_EXITING | \ 58 + CPU_BASED_INVLPG_EXITING | \ 59 + CPU_BASED_RDPMC_EXITING | \ 60 + CPU_BASED_INTR_WINDOW_EXITING | \ 61 + CPU_BASED_CR8_LOAD_EXITING | \ 62 + CPU_BASED_CR8_STORE_EXITING | \ 63 + CPU_BASED_RDTSC_EXITING | \ 64 + CPU_BASED_TPR_SHADOW | \ 65 + CPU_BASED_USE_IO_BITMAPS | \ 66 + CPU_BASED_MONITOR_TRAP_FLAG | \ 67 + CPU_BASED_USE_MSR_BITMAPS | \ 68 + CPU_BASED_NMI_WINDOW_EXITING | \ 69 + CPU_BASED_PAUSE_EXITING | \ 70 + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) 71 + 72 + #define 
EVMCS1_SUPPORTED_2NDEXEC \ 73 + (SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | \ 74 + SECONDARY_EXEC_WBINVD_EXITING | \ 75 + SECONDARY_EXEC_ENABLE_VPID | \ 76 + SECONDARY_EXEC_ENABLE_EPT | \ 77 + SECONDARY_EXEC_UNRESTRICTED_GUEST | \ 78 + SECONDARY_EXEC_DESC | \ 79 + SECONDARY_EXEC_ENABLE_RDTSCP | \ 80 + SECONDARY_EXEC_ENABLE_INVPCID | \ 81 + SECONDARY_EXEC_XSAVES | \ 82 + SECONDARY_EXEC_RDSEED_EXITING | \ 83 + SECONDARY_EXEC_RDRAND_EXITING | \ 84 + SECONDARY_EXEC_TSC_SCALING | \ 85 + SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE | \ 86 + SECONDARY_EXEC_PT_USE_GPA | \ 87 + SECONDARY_EXEC_PT_CONCEAL_VMX | \ 88 + SECONDARY_EXEC_BUS_LOCK_DETECTION | \ 89 + SECONDARY_EXEC_NOTIFY_VM_EXITING | \ 90 + SECONDARY_EXEC_ENCLS_EXITING) 91 + 92 + #define EVMCS1_SUPPORTED_3RDEXEC (0ULL) 93 + 94 + #define EVMCS1_SUPPORTED_VMEXIT_CTRL \ 95 + (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | \ 96 + VM_EXIT_SAVE_DEBUG_CONTROLS | \ 97 + VM_EXIT_ACK_INTR_ON_EXIT | \ 98 + VM_EXIT_HOST_ADDR_SPACE_SIZE | \ 99 + VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | \ 100 + VM_EXIT_SAVE_IA32_PAT | \ 101 + VM_EXIT_LOAD_IA32_PAT | \ 102 + VM_EXIT_SAVE_IA32_EFER | \ 103 + VM_EXIT_LOAD_IA32_EFER | \ 104 + VM_EXIT_CLEAR_BNDCFGS | \ 105 + VM_EXIT_PT_CONCEAL_PIP | \ 106 + VM_EXIT_CLEAR_IA32_RTIT_CTL) 107 + 108 + #define EVMCS1_SUPPORTED_VMENTRY_CTRL \ 109 + (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | \ 110 + VM_ENTRY_LOAD_DEBUG_CONTROLS | \ 111 + VM_ENTRY_IA32E_MODE | \ 112 + VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | \ 113 + VM_ENTRY_LOAD_IA32_PAT | \ 114 + VM_ENTRY_LOAD_IA32_EFER | \ 115 + VM_ENTRY_LOAD_BNDCFGS | \ 116 + VM_ENTRY_PT_CONCEAL_PIP | \ 117 + VM_ENTRY_LOAD_IA32_RTIT_CTL) 118 + 119 + #define EVMCS1_SUPPORTED_VMFUNC (0) 17 120 18 121 #define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x) 19 122 #define EVMCS1_FIELD(number, name, clean_field)[ROL16(number, 6)] = \ ··· 609 506 } 610 507 611 508 #if IS_ENABLED(CONFIG_HYPERV) 509 + DEFINE_STATIC_KEY_FALSE(__kvm_is_using_evmcs); 510 + 612 511 /* 613 512 * KVM on Hyper-V always uses 
the latest known eVMCSv1 revision, the assumption 614 513 * is: in case a feature has corresponding fields in eVMCS described and it was

+8 -107

arch/x86/kvm/vmx/hyperv.h

··· 16 16 17 17 struct vmcs_config; 18 18 19 - DECLARE_STATIC_KEY_FALSE(enable_evmcs); 20 - 21 19 #define current_evmcs ((struct hv_enlightened_vmcs *)this_cpu_read(current_vmcs)) 22 20 23 21 #define KVM_EVMCS_VERSION 1 24 - 25 - /* 26 - * Enlightened VMCSv1 doesn't support these: 27 - * 28 - * POSTED_INTR_NV = 0x00000002, 29 - * GUEST_INTR_STATUS = 0x00000810, 30 - * APIC_ACCESS_ADDR = 0x00002014, 31 - * POSTED_INTR_DESC_ADDR = 0x00002016, 32 - * EOI_EXIT_BITMAP0 = 0x0000201c, 33 - * EOI_EXIT_BITMAP1 = 0x0000201e, 34 - * EOI_EXIT_BITMAP2 = 0x00002020, 35 - * EOI_EXIT_BITMAP3 = 0x00002022, 36 - * GUEST_PML_INDEX = 0x00000812, 37 - * PML_ADDRESS = 0x0000200e, 38 - * VM_FUNCTION_CONTROL = 0x00002018, 39 - * EPTP_LIST_ADDRESS = 0x00002024, 40 - * VMREAD_BITMAP = 0x00002026, 41 - * VMWRITE_BITMAP = 0x00002028, 42 - * 43 - * TSC_MULTIPLIER = 0x00002032, 44 - * PLE_GAP = 0x00004020, 45 - * PLE_WINDOW = 0x00004022, 46 - * VMX_PREEMPTION_TIMER_VALUE = 0x0000482E, 47 - * 48 - * Currently unsupported in KVM: 49 - * GUEST_IA32_RTIT_CTL = 0x00002814, 50 - */ 51 - #define EVMCS1_SUPPORTED_PINCTRL \ 52 - (PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | \ 53 - PIN_BASED_EXT_INTR_MASK | \ 54 - PIN_BASED_NMI_EXITING | \ 55 - PIN_BASED_VIRTUAL_NMIS) 56 - 57 - #define EVMCS1_SUPPORTED_EXEC_CTRL \ 58 - (CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR | \ 59 - CPU_BASED_HLT_EXITING | \ 60 - CPU_BASED_CR3_LOAD_EXITING | \ 61 - CPU_BASED_CR3_STORE_EXITING | \ 62 - CPU_BASED_UNCOND_IO_EXITING | \ 63 - CPU_BASED_MOV_DR_EXITING | \ 64 - CPU_BASED_USE_TSC_OFFSETTING | \ 65 - CPU_BASED_MWAIT_EXITING | \ 66 - CPU_BASED_MONITOR_EXITING | \ 67 - CPU_BASED_INVLPG_EXITING | \ 68 - CPU_BASED_RDPMC_EXITING | \ 69 - CPU_BASED_INTR_WINDOW_EXITING | \ 70 - CPU_BASED_CR8_LOAD_EXITING | \ 71 - CPU_BASED_CR8_STORE_EXITING | \ 72 - CPU_BASED_RDTSC_EXITING | \ 73 - CPU_BASED_TPR_SHADOW | \ 74 - CPU_BASED_USE_IO_BITMAPS | \ 75 - CPU_BASED_MONITOR_TRAP_FLAG | \ 76 - CPU_BASED_USE_MSR_BITMAPS | \ 77 - CPU_BASED_NMI_WINDOW_EXITING | 
\ 78 - CPU_BASED_PAUSE_EXITING | \ 79 - CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) 80 - 81 - #define EVMCS1_SUPPORTED_2NDEXEC \ 82 - (SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | \ 83 - SECONDARY_EXEC_WBINVD_EXITING | \ 84 - SECONDARY_EXEC_ENABLE_VPID | \ 85 - SECONDARY_EXEC_ENABLE_EPT | \ 86 - SECONDARY_EXEC_UNRESTRICTED_GUEST | \ 87 - SECONDARY_EXEC_DESC | \ 88 - SECONDARY_EXEC_ENABLE_RDTSCP | \ 89 - SECONDARY_EXEC_ENABLE_INVPCID | \ 90 - SECONDARY_EXEC_XSAVES | \ 91 - SECONDARY_EXEC_RDSEED_EXITING | \ 92 - SECONDARY_EXEC_RDRAND_EXITING | \ 93 - SECONDARY_EXEC_TSC_SCALING | \ 94 - SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE | \ 95 - SECONDARY_EXEC_PT_USE_GPA | \ 96 - SECONDARY_EXEC_PT_CONCEAL_VMX | \ 97 - SECONDARY_EXEC_BUS_LOCK_DETECTION | \ 98 - SECONDARY_EXEC_NOTIFY_VM_EXITING | \ 99 - SECONDARY_EXEC_ENCLS_EXITING) 100 - 101 - #define EVMCS1_SUPPORTED_3RDEXEC (0ULL) 102 - 103 - #define EVMCS1_SUPPORTED_VMEXIT_CTRL \ 104 - (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | \ 105 - VM_EXIT_SAVE_DEBUG_CONTROLS | \ 106 - VM_EXIT_ACK_INTR_ON_EXIT | \ 107 - VM_EXIT_HOST_ADDR_SPACE_SIZE | \ 108 - VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | \ 109 - VM_EXIT_SAVE_IA32_PAT | \ 110 - VM_EXIT_LOAD_IA32_PAT | \ 111 - VM_EXIT_SAVE_IA32_EFER | \ 112 - VM_EXIT_LOAD_IA32_EFER | \ 113 - VM_EXIT_CLEAR_BNDCFGS | \ 114 - VM_EXIT_PT_CONCEAL_PIP | \ 115 - VM_EXIT_CLEAR_IA32_RTIT_CTL) 116 - 117 - #define EVMCS1_SUPPORTED_VMENTRY_CTRL \ 118 - (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | \ 119 - VM_ENTRY_LOAD_DEBUG_CONTROLS | \ 120 - VM_ENTRY_IA32E_MODE | \ 121 - VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | \ 122 - VM_ENTRY_LOAD_IA32_PAT | \ 123 - VM_ENTRY_LOAD_IA32_EFER | \ 124 - VM_ENTRY_LOAD_BNDCFGS | \ 125 - VM_ENTRY_PT_CONCEAL_PIP | \ 126 - VM_ENTRY_LOAD_IA32_RTIT_CTL) 127 - 128 - #define EVMCS1_SUPPORTED_VMFUNC (0) 129 22 130 23 struct evmcs_field { 131 24 u16 offset; ··· 66 173 } 67 174 68 175 #if IS_ENABLED(CONFIG_HYPERV) 176 + 177 + DECLARE_STATIC_KEY_FALSE(__kvm_is_using_evmcs); 178 + 179 + static __always_inline bool 
kvm_is_using_evmcs(void) 180 + { 181 + return static_branch_unlikely(&__kvm_is_using_evmcs); 182 + } 69 183 70 184 static __always_inline int get_evmcs_offset(unsigned long field, 71 185 u16 *clean_field) ··· 163 263 164 264 void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf); 165 265 #else /* !IS_ENABLED(CONFIG_HYPERV) */ 266 + static __always_inline bool kvm_is_using_evmcs(void) { return false; } 166 267 static __always_inline void evmcs_write64(unsigned long field, u64 value) {} 167 268 static __always_inline void evmcs_write32(unsigned long field, u32 value) {} 168 269 static __always_inline void evmcs_write16(unsigned long field, u16 value) {}

+84 -42

arch/x86/kvm/vmx/nested.c

··· 358 358 static void nested_ept_invalidate_addr(struct kvm_vcpu *vcpu, gpa_t eptp, 359 359 gpa_t addr) 360 360 { 361 + unsigned long roots = 0; 361 362 uint i; 362 363 struct kvm_mmu_root_info *cached_root; 363 364 ··· 369 368 370 369 if (nested_ept_root_matches(cached_root->hpa, cached_root->pgd, 371 370 eptp)) 372 - vcpu->arch.mmu->invlpg(vcpu, addr, cached_root->hpa); 371 + roots |= KVM_MMU_ROOT_PREVIOUS(i); 373 372 } 373 + if (roots) 374 + kvm_mmu_invalidate_addr(vcpu, vcpu->arch.mmu, addr, roots); 374 375 } 375 376 376 377 static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu, ··· 656 653 657 654 nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, 658 655 MSR_IA32_PRED_CMD, MSR_TYPE_W); 656 + 657 + nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, 658 + MSR_IA32_FLUSH_CMD, MSR_TYPE_W); 659 659 660 660 kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false); 661 661 ··· 4489 4483 * CR0_GUEST_HOST_MASK is already set in the original vmcs01 4490 4484 * (KVM doesn't change it); 4491 4485 */ 4492 - vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS; 4486 + vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits(); 4493 4487 vmx_set_cr0(vcpu, vmcs12->host_cr0); 4494 4488 4495 4489 /* Same as above - no reason to call set_cr4_guest_host_mask(). */ ··· 4640 4634 */ 4641 4635 vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx)); 4642 4636 4643 - vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS; 4637 + vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits(); 4644 4638 vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW)); 4645 4639 4646 4640 vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK); ··· 5162 5156 * does force CR0.PE=1, but only to also force VM86 in order to emulate 5163 5157 * Real Mode, and so there's no need to check CR0.PE manually. 
5164 5158 */ 5165 - if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE)) { 5159 + if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_VMXE)) { 5166 5160 kvm_queue_exception(vcpu, UD_VECTOR); 5167 5161 return 1; 5168 5162 } ··· 6761 6755 return (u64)max_idx << VMCS_FIELD_INDEX_SHIFT; 6762 6756 } 6763 6757 6764 - /* 6765 - * nested_vmx_setup_ctls_msrs() sets up variables containing the values to be 6766 - * returned for the various VMX controls MSRs when nested VMX is enabled. 6767 - * The same values should also be used to verify that vmcs12 control fields are 6768 - * valid during nested entry from L1 to L2. 6769 - * Each of these control msrs has a low and high 32-bit half: A low bit is on 6770 - * if the corresponding bit in the (32-bit) control field *must* be on, and a 6771 - * bit in the high half is on if the corresponding bit in the control field 6772 - * may be on. See also vmx_control_verify(). 6773 - */ 6774 - void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps) 6758 + static void nested_vmx_setup_pinbased_ctls(struct vmcs_config *vmcs_conf, 6759 + struct nested_vmx_msrs *msrs) 6775 6760 { 6776 - struct nested_vmx_msrs *msrs = &vmcs_conf->nested; 6777 - 6778 - /* 6779 - * Note that as a general rule, the high half of the MSRs (bits in 6780 - * the control fields which may be 1) should be initialized by the 6781 - * intersection of the underlying hardware's MSR (i.e., features which 6782 - * can be supported) and the list of features we want to expose - 6783 - * because they are known to be properly supported in our code. 6784 - * Also, usually, the low half of the MSRs (bits which must be 1) can 6785 - * be set to 0, meaning that L1 may turn off any of these bits. The 6786 - * reason is that if one of these bits is necessary, it will appear 6787 - * in vmcs01 and prepare_vmcs02, when it bitwise-or's the control 6788 - * fields of vmcs01 and vmcs02, will turn these bits off - and 6789 - * nested_vmx_l1_wants_exit() will not pass related exits to L1. 
6790 - * These rules have exceptions below. 6791 - */ 6792 - 6793 - /* pin-based controls */ 6794 6761 msrs->pinbased_ctls_low = 6795 6762 PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; 6796 6763 ··· 6776 6797 msrs->pinbased_ctls_high |= 6777 6798 PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | 6778 6799 PIN_BASED_VMX_PREEMPTION_TIMER; 6800 + } 6779 6801 6780 - /* exit controls */ 6802 + static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf, 6803 + struct nested_vmx_msrs *msrs) 6804 + { 6781 6805 msrs->exit_ctls_low = 6782 6806 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; 6783 6807 ··· 6799 6817 6800 6818 /* We support free control of debug control saving. */ 6801 6819 msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS; 6820 + } 6802 6821 6803 - /* entry controls */ 6822 + static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf, 6823 + struct nested_vmx_msrs *msrs) 6824 + { 6804 6825 msrs->entry_ctls_low = 6805 6826 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; 6806 6827 ··· 6819 6834 6820 6835 /* We support free control of debug control loading. */ 6821 6836 msrs->entry_ctls_low &= ~VM_ENTRY_LOAD_DEBUG_CONTROLS; 6837 + } 6822 6838 6823 - /* cpu-based controls */ 6839 + static void nested_vmx_setup_cpubased_ctls(struct vmcs_config *vmcs_conf, 6840 + struct nested_vmx_msrs *msrs) 6841 + { 6824 6842 msrs->procbased_ctls_low = 6825 6843 CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR; 6826 6844 ··· 6855 6867 /* We support free control of CR3 access interception. */ 6856 6868 msrs->procbased_ctls_low &= 6857 6869 ~(CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING); 6870 + } 6858 6871 6859 - /* 6860 - * secondary cpu-based controls. Do not include those that 6861 - * depend on CPUID bits, they are added later by 6862 - * vmx_vcpu_after_set_cpuid. 
6863 - */ 6872 + static void nested_vmx_setup_secondary_ctls(u32 ept_caps, 6873 + struct vmcs_config *vmcs_conf, 6874 + struct nested_vmx_msrs *msrs) 6875 + { 6864 6876 msrs->secondary_ctls_low = 0; 6865 6877 6866 6878 msrs->secondary_ctls_high = vmcs_conf->cpu_based_2nd_exec_ctrl; ··· 6938 6950 6939 6951 if (enable_sgx) 6940 6952 msrs->secondary_ctls_high |= SECONDARY_EXEC_ENCLS_EXITING; 6953 + } 6941 6954 6942 - /* miscellaneous data */ 6955 + static void nested_vmx_setup_misc_data(struct vmcs_config *vmcs_conf, 6956 + struct nested_vmx_msrs *msrs) 6957 + { 6943 6958 msrs->misc_low = (u32)vmcs_conf->misc & VMX_MISC_SAVE_EFER_LMA; 6944 6959 msrs->misc_low |= 6945 6960 MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS | ··· 6950 6959 VMX_MISC_ACTIVITY_HLT | 6951 6960 VMX_MISC_ACTIVITY_WAIT_SIPI; 6952 6961 msrs->misc_high = 0; 6962 + } 6953 6963 6964 + static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs) 6965 + { 6954 6966 /* 6955 6967 * This MSR reports some information about VMX support. We 6956 6968 * should return information about the VMX we emulate for the ··· 6968 6974 6969 6975 if (cpu_has_vmx_basic_inout()) 6970 6976 msrs->basic |= VMX_BASIC_INOUT; 6977 + } 6971 6978 6979 + static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs) 6980 + { 6972 6981 /* 6973 6982 * These MSRs specify bits which the guest must keep fixed on 6974 6983 * while L1 is in VMXON mode (in L1's root mode, or running an L2). ··· 6988 6991 6989 6992 if (vmx_umip_emulated()) 6990 6993 msrs->cr4_fixed1 |= X86_CR4_UMIP; 6994 + } 6995 + 6996 + /* 6997 + * nested_vmx_setup_ctls_msrs() sets up variables containing the values to be 6998 + * returned for the various VMX controls MSRs when nested VMX is enabled. 6999 + * The same values should also be used to verify that vmcs12 control fields are 7000 + * valid during nested entry from L1 to L2. 
7001 + * Each of these control msrs has a low and high 32-bit half: A low bit is on 7002 + * if the corresponding bit in the (32-bit) control field *must* be on, and a 7003 + * bit in the high half is on if the corresponding bit in the control field 7004 + * may be on. See also vmx_control_verify(). 7005 + */ 7006 + void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps) 7007 + { 7008 + struct nested_vmx_msrs *msrs = &vmcs_conf->nested; 7009 + 7010 + /* 7011 + * Note that as a general rule, the high half of the MSRs (bits in 7012 + * the control fields which may be 1) should be initialized by the 7013 + * intersection of the underlying hardware's MSR (i.e., features which 7014 + * can be supported) and the list of features we want to expose - 7015 + * because they are known to be properly supported in our code. 7016 + * Also, usually, the low half of the MSRs (bits which must be 1) can 7017 + * be set to 0, meaning that L1 may turn off any of these bits. The 7018 + * reason is that if one of these bits is necessary, it will appear 7019 + * in vmcs01 and prepare_vmcs02, when it bitwise-or's the control 7020 + * fields of vmcs01 and vmcs02, will turn these bits off - and 7021 + * nested_vmx_l1_wants_exit() will not pass related exits to L1. 7022 + * These rules have exceptions below. 7023 + */ 7024 + nested_vmx_setup_pinbased_ctls(vmcs_conf, msrs); 7025 + 7026 + nested_vmx_setup_exit_ctls(vmcs_conf, msrs); 7027 + 7028 + nested_vmx_setup_entry_ctls(vmcs_conf, msrs); 7029 + 7030 + nested_vmx_setup_cpubased_ctls(vmcs_conf, msrs); 7031 + 7032 + nested_vmx_setup_secondary_ctls(ept_caps, vmcs_conf, msrs); 7033 + 7034 + nested_vmx_setup_misc_data(vmcs_conf, msrs); 7035 + 7036 + nested_vmx_setup_basic(msrs); 7037 + 7038 + nested_vmx_setup_cr_fixed(msrs); 6991 7039 6992 7040 msrs->vmcs_enum = nested_vmx_calc_vmcs_enum_msr(); 6993 7041 }
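The low/high convention described in the comment on nested_vmx_setup_ctls_msrs() can be illustrated with a small userspace sketch. This is not the kernel's vmx_control_verify(); the helper name toy_ctl_value_is_valid() is hypothetical and only shows how a "must be 1" half and a "may be 1" half constrain a 32-bit control field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: check a 32-bit control value against the two halves
 * of a VMX control MSR. Bits set in 'low' must be 1 in the control field;
 * bits clear in 'high' must be 0 in the control field.
 */
static inline bool toy_ctl_value_is_valid(uint32_t control, uint32_t low,
					  uint32_t high)
{
	/* A bit that must be 1 but may not be 1 would be contradictory. */
	assert((low & ~high) == 0);

	/*
	 * Masking with 'high' drops bits that are not allowed, OR-ing in
	 * 'low' adds bits that are required. The value is valid only if
	 * neither step changes it.
	 */
	return ((control & high) | low) == control;
}
```

With low = 0x16 and high = 0xff, for example, 0x1e passes, 0x06 fails because a required bit is clear, and 0x116 fails because a disallowed bit is set.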

+75 -60

arch/x86/kvm/vmx/pmu_intel.c

··· 57 57 pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i); 58 58 59 59 __set_bit(INTEL_PMC_IDX_FIXED + i, pmu->pmc_in_use); 60 - kvm_pmu_request_counter_reprogam(pmc); 60 + kvm_pmu_request_counter_reprogram(pmc); 61 61 } 62 62 } 63 63 ··· 76 76 static void reprogram_counters(struct kvm_pmu *pmu, u64 diff) 77 77 { 78 78 int bit; 79 - struct kvm_pmc *pmc; 80 79 81 - for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX) { 82 - pmc = intel_pmc_idx_to_pmc(pmu, bit); 83 - if (pmc) 84 - kvm_pmu_request_counter_reprogam(pmc); 85 - } 80 + if (!diff) 81 + return; 82 + 83 + for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX) 84 + set_bit(bit, pmu->reprogram_pmi); 85 + kvm_make_request(KVM_REQ_PMU, pmu_to_vcpu(pmu)); 86 86 } 87 87 88 88 static bool intel_hw_event_available(struct kvm_pmc *pmc) ··· 351 351 switch (msr) { 352 352 case MSR_CORE_PERF_FIXED_CTR_CTRL: 353 353 msr_info->data = pmu->fixed_ctr_ctrl; 354 - return 0; 354 + break; 355 355 case MSR_CORE_PERF_GLOBAL_STATUS: 356 356 msr_info->data = pmu->global_status; 357 - return 0; 357 + break; 358 358 case MSR_CORE_PERF_GLOBAL_CTRL: 359 359 msr_info->data = pmu->global_ctrl; 360 - return 0; 360 + break; 361 361 case MSR_CORE_PERF_GLOBAL_OVF_CTRL: 362 362 msr_info->data = 0; 363 - return 0; 363 + break; 364 364 case MSR_IA32_PEBS_ENABLE: 365 365 msr_info->data = pmu->pebs_enable; 366 - return 0; 366 + break; 367 367 case MSR_IA32_DS_AREA: 368 368 msr_info->data = pmu->ds_area; 369 - return 0; 369 + break; 370 370 case MSR_PEBS_DATA_CFG: 371 371 msr_info->data = pmu->pebs_data_cfg; 372 - return 0; 372 + break; 373 373 default: 374 374 if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || 375 375 (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { 376 376 u64 val = pmc_read_counter(pmc); 377 377 msr_info->data = 378 378 val & pmu->counter_bitmask[KVM_PMC_GP]; 379 - return 0; 379 + break; 380 380 } else if ((pmc = get_fixed_pmc(pmu, msr))) { 381 381 u64 val = pmc_read_counter(pmc); 382 382 
msr_info->data = 383 383 val & pmu->counter_bitmask[KVM_PMC_FIXED]; 384 - return 0; 384 + break; 385 385 } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) { 386 386 msr_info->data = pmc->eventsel; 387 - return 0; 388 - } else if (intel_pmu_handle_lbr_msrs_access(vcpu, msr_info, true)) 389 - return 0; 387 + break; 388 + } else if (intel_pmu_handle_lbr_msrs_access(vcpu, msr_info, true)) { 389 + break; 390 + } 391 + return 1; 390 392 } 391 393 392 - return 1; 394 + return 0; 393 395 } 394 396 395 397 static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) ··· 404 402 405 403 switch (msr) { 406 404 case MSR_CORE_PERF_FIXED_CTR_CTRL: 407 - if (pmu->fixed_ctr_ctrl == data) 408 - return 0; 409 - if (!(data & pmu->fixed_ctr_ctrl_mask)) { 405 + if (data & pmu->fixed_ctr_ctrl_mask) 406 + return 1; 407 + 408 + if (pmu->fixed_ctr_ctrl != data) 410 409 reprogram_fixed_counters(pmu, data); 411 - return 0; 412 - } 413 410 break; 414 411 case MSR_CORE_PERF_GLOBAL_STATUS: 415 - if (msr_info->host_initiated) { 416 - pmu->global_status = data; 417 - return 0; 418 - } 419 - break; /* RO MSR */ 412 + if (!msr_info->host_initiated) 413 + return 1; /* RO MSR */ 414 + 415 + pmu->global_status = data; 416 + break; 420 417 case MSR_CORE_PERF_GLOBAL_CTRL: 421 - if (pmu->global_ctrl == data) 422 - return 0; 423 - if (kvm_valid_perf_global_ctrl(pmu, data)) { 418 + if (!kvm_valid_perf_global_ctrl(pmu, data)) 419 + return 1; 420 + 421 + if (pmu->global_ctrl != data) { 424 422 diff = pmu->global_ctrl ^ data; 425 423 pmu->global_ctrl = data; 426 424 reprogram_counters(pmu, diff); 427 - return 0; 428 425 } 429 426 break; 430 427 case MSR_CORE_PERF_GLOBAL_OVF_CTRL: 431 - if (!(data & pmu->global_ovf_ctrl_mask)) { 432 - if (!msr_info->host_initiated) 433 - pmu->global_status &= ~data; 434 - return 0; 435 - } 428 + if (data & pmu->global_ovf_ctrl_mask) 429 + return 1; 430 + 431 + if (!msr_info->host_initiated) 432 + pmu->global_status &= ~data; 436 433 break; 437 434 
case MSR_IA32_PEBS_ENABLE: 438 - if (pmu->pebs_enable == data) 439 - return 0; 440 - if (!(data & pmu->pebs_enable_mask)) { 435 + if (data & pmu->pebs_enable_mask) 436 + return 1; 437 + 438 + if (pmu->pebs_enable != data) { 441 439 diff = pmu->pebs_enable ^ data; 442 440 pmu->pebs_enable = data; 443 441 reprogram_counters(pmu, diff); 444 - return 0; 445 442 } 446 443 break; 447 444 case MSR_IA32_DS_AREA: ··· 448 447 return 1; 449 448 if (is_noncanonical_address(data, vcpu)) 450 449 return 1; 450 + 451 451 pmu->ds_area = data; 452 - return 0; 452 + break; 453 453 case MSR_PEBS_DATA_CFG: 454 - if (pmu->pebs_data_cfg == data) 455 - return 0; 456 - if (!(data & pmu->pebs_data_cfg_mask)) { 457 - pmu->pebs_data_cfg = data; 458 - return 0; 459 - } 454 + if (data & pmu->pebs_data_cfg_mask) 455 + return 1; 456 + 457 + pmu->pebs_data_cfg = data; 460 458 break; 461 459 default: 462 460 if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || ··· 463 463 if ((msr & MSR_PMC_FULL_WIDTH_BIT) && 464 464 (data & ~pmu->counter_bitmask[KVM_PMC_GP])) 465 465 return 1; 466 + 466 467 if (!msr_info->host_initiated && 467 468 !(msr & MSR_PMC_FULL_WIDTH_BIT)) 468 469 data = (s64)(s32)data; 469 470 pmc->counter += data - pmc_read_counter(pmc); 470 471 pmc_update_sample_period(pmc); 471 - return 0; 472 + break; 472 473 } else if ((pmc = get_fixed_pmc(pmu, msr))) { 473 474 pmc->counter += data - pmc_read_counter(pmc); 474 475 pmc_update_sample_period(pmc); 475 - return 0; 476 + break; 476 477 } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) { 477 - if (data == pmc->eventsel) 478 - return 0; 479 478 reserved_bits = pmu->reserved_bits; 480 479 if ((pmc->idx == 2) && 481 480 (pmu->raw_event_mask & HSW_IN_TX_CHECKPOINTED)) 482 481 reserved_bits ^= HSW_IN_TX_CHECKPOINTED; 483 - if (!(data & reserved_bits)) { 482 + if (data & reserved_bits) 483 + return 1; 484 + 485 + if (data != pmc->eventsel) { 484 486 pmc->eventsel = data; 485 - kvm_pmu_request_counter_reprogam(pmc); 486 - return 0; 487 + 
kvm_pmu_request_counter_reprogram(pmc); 487 488 } 488 - } else if (intel_pmu_handle_lbr_msrs_access(vcpu, msr_info, false)) 489 - return 0; 489 + break; 490 + } else if (intel_pmu_handle_lbr_msrs_access(vcpu, msr_info, false)) { 491 + break; 492 + } 493 + /* Not a known PMU MSR. */ 494 + return 1; 490 495 } 491 496 492 - return 1; 497 + return 0; 493 498 } 494 499 495 500 static void setup_fixed_pmc_eventsel(struct kvm_pmu *pmu) ··· 535 530 pmu->fixed_ctr_ctrl_mask = ~0ull; 536 531 pmu->pebs_enable_mask = ~0ull; 537 532 pmu->pebs_data_cfg_mask = ~0ull; 533 + 534 + memset(&lbr_desc->records, 0, sizeof(lbr_desc->records)); 535 + 536 + /* 537 + * Setting passthrough of LBR MSRs is done only in the VM-Entry loop, 538 + * and PMU refresh is disallowed after the vCPU has run, i.e. this code 539 + * should never be reached while KVM is passing through MSRs. 540 + */ 541 + if (KVM_BUG_ON(lbr_desc->msr_passthrough, vcpu->kvm)) 542 + return; 538 543 539 544 entry = kvm_find_cpuid_entry(vcpu, 0xa); 540 545 if (!entry || !vcpu->kvm->arch.enable_pmu)
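The reworked intel_pmu_set_msr() cases above share one shape: reject reserved bits first, then do the update work only if the value actually changes. A minimal sketch of that shape follows, assuming a simplified stand-in for the PMU state; struct toy_pmu, its fields, and both functions are hypothetical and are not KVM's types or API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the PMU state touched by the diff. */
struct toy_pmu {
	uint64_t global_ctrl;
	uint64_t global_ctrl_rsvd;	/* bits the guest may not set */
	uint64_t reprogram_pending;	/* one bit per counter awaiting reprogramming */
};

/* Mark every counter whose enable bit changed; an empty diff is a no-op. */
static inline void toy_reprogram_counters(struct toy_pmu *pmu, uint64_t diff)
{
	if (!diff)
		return;

	pmu->reprogram_pending |= diff;
}

/* Returns 0 on success and 1 if the write must be rejected. */
static inline int toy_set_global_ctrl(struct toy_pmu *pmu, uint64_t data)
{
	assert(pmu);

	/* Validate first, regardless of the current value. */
	if (data & pmu->global_ctrl_rsvd)
		return 1;

	/* Only touch state and request reprogramming on a real change. */
	if (pmu->global_ctrl != data) {
		uint64_t diff = pmu->global_ctrl ^ data;

		pmu->global_ctrl = data;
		toy_reprogram_counters(pmu, diff);
	}

	return 0;
}
```

Putting the validity check ahead of the "unchanged" check keeps every case to a single early-return path for errors and a single success path at the end, which is what lets the diff replace the scattered "return 0" statements with "break".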

+2 -2

arch/x86/kvm/vmx/sgx.c

···

 	/* Skip vmcs.GUEST_DS retrieval for 64-bit mode to avoid VMREADs. */
 	*gva = offset;
-	if (!is_long_mode(vcpu)) {
+	if (!is_64_bit_mode(vcpu)) {
 		vmx_get_segment(vcpu, &s, VCPU_SREG_DS);
 		*gva += s.base;
 	}

 	if (!IS_ALIGNED(*gva, alignment)) {
 		fault = true;
-	} else if (likely(is_long_mode(vcpu))) {
+	} else if (likely(is_64_bit_mode(vcpu))) {
 		fault = is_noncanonical_address(*gva, vcpu);
 	} else {
 		*gva &= 0xffffffff;
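The switch from is_long_mode() to is_64_bit_mode() in the sgx.c hunk matters for compatibility mode, where long mode is active but the code segment is not 64-bit, so the DS base still applies and addresses are 32 bits wide. The sketch below is a simplified illustration under that assumption; struct toy_vcpu and the toy_* helpers are hypothetical, and the alignment and canonical-address checks from the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the guest state the address calculation needs. */
struct toy_vcpu {
	bool efer_lma;		/* long mode active (64-bit or compatibility) */
	bool cs_l;		/* CS.L: the current code segment is 64-bit */
	uint64_t ds_base;
};

static inline bool toy_is_long_mode(const struct toy_vcpu *v)
{
	return v->efer_lma;
}

static inline bool toy_is_64_bit_mode(const struct toy_vcpu *v)
{
	return v->efer_lma && v->cs_l;
}

/* Compute the linear address of an operand at 'offset' in DS. */
static inline uint64_t toy_sgx_gva(const struct toy_vcpu *v, uint64_t offset)
{
	uint64_t gva = offset;

	assert(v);

	/* Outside 64-bit mode the DS base applies and addresses wrap at 4 GiB. */
	if (!toy_is_64_bit_mode(v)) {
		gva += v->ds_base;
		gva &= 0xffffffff;
	}

	return gva;
}
```

For a compatibility-mode vCPU, toy_is_long_mode() is true while toy_is_64_bit_mode() is false, which is exactly the case where keying the logic on long mode would skip the DS base.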

+46 -50

arch/x86/kvm/vmx/vmx.c

··· 164 164 static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = { 165 165 MSR_IA32_SPEC_CTRL, 166 166 MSR_IA32_PRED_CMD, 167 + MSR_IA32_FLUSH_CMD, 167 168 MSR_IA32_TSC, 168 169 #ifdef CONFIG_X86_64 169 170 MSR_FS_BASE, ··· 580 579 581 580 if (enlightened_vmcs) { 582 581 pr_info("Using Hyper-V Enlightened VMCS\n"); 583 - static_branch_enable(&enable_evmcs); 582 + static_branch_enable(&__kvm_is_using_evmcs); 584 583 } 585 584 586 585 if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH) ··· 596 595 { 597 596 struct hv_vp_assist_page *vp_ap; 598 597 599 - if (!static_branch_unlikely(&enable_evmcs)) 598 + if (!kvm_is_using_evmcs()) 600 599 return; 601 600 602 601 /* ··· 1946 1945 static int vmx_get_msr_feature(struct kvm_msr_entry *msr) 1947 1946 { 1948 1947 switch (msr->index) { 1949 - case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: 1948 + case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR: 1950 1949 if (!nested) 1951 1950 return 1; 1952 1951 return vmx_get_vmx_msr(&vmcs_config.nested, msr->index, &msr->data); ··· 2031 2030 msr_info->data = to_vmx(vcpu)->msr_ia32_sgxlepubkeyhash 2032 2031 [msr_info->index - MSR_IA32_SGXLEPUBKEYHASH0]; 2033 2032 break; 2034 - case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: 2033 + case KVM_FIRST_EMULATED_VMX_MSR ... 
KVM_LAST_EMULATED_VMX_MSR: 2035 2034 if (!nested_vmx_allowed(vcpu)) 2036 2035 return 1; 2037 2036 if (vmx_get_vmx_msr(&vmx->nested.msrs, msr_info->index, ··· 2286 2285 if (data & ~(TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR)) 2287 2286 return 1; 2288 2287 goto find_uret_msr; 2289 - case MSR_IA32_PRED_CMD: 2290 - if (!msr_info->host_initiated && 2291 - !guest_has_pred_cmd_msr(vcpu)) 2292 - return 1; 2293 - 2294 - if (data & ~PRED_CMD_IBPB) 2295 - return 1; 2296 - if (!boot_cpu_has(X86_FEATURE_IBPB)) 2297 - return 1; 2298 - if (!data) 2299 - break; 2300 - 2301 - wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB); 2302 - 2303 - /* 2304 - * For non-nested: 2305 - * When it's written (to non-zero) for the first time, pass 2306 - * it through. 2307 - * 2308 - * For nested: 2309 - * The handling of the MSR bitmap for L2 guests is done in 2310 - * nested_vmx_prepare_msr_bitmap. We should not touch the 2311 - * vmcs02.msr_bitmap here since it gets completely overwritten 2312 - * in the merging. 2313 - */ 2314 - vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W); 2315 - break; 2316 2288 case MSR_IA32_CR_PAT: 2317 2289 if (!kvm_pat_valid(data)) 2318 2290 return 1; ··· 2340 2366 vmx->msr_ia32_sgxlepubkeyhash 2341 2367 [msr_index - MSR_IA32_SGXLEPUBKEYHASH0] = data; 2342 2368 break; 2343 - case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: 2369 + case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR: 2344 2370 if (!msr_info->host_initiated) 2345 2371 return 1; /* they are read-only */ 2346 2372 if (!nested_vmx_allowed(vcpu)) ··· 2790 2816 * This can happen if we hot-added a CPU but failed to allocate 2791 2817 * VP assist page for it. 
2792 2818 */ 2793 - if (static_branch_unlikely(&enable_evmcs) && 2794 - !hv_get_vp_assist_page(cpu)) 2819 + if (kvm_is_using_evmcs() && !hv_get_vp_assist_page(cpu)) 2795 2820 return -EFAULT; 2796 2821 2797 2822 intel_pt_handle_vmx(1); ··· 2842 2869 memset(vmcs, 0, vmcs_config.size); 2843 2870 2844 2871 /* KVM supports Enlightened VMCS v1 only */ 2845 - if (static_branch_unlikely(&enable_evmcs)) 2872 + if (kvm_is_using_evmcs()) 2846 2873 vmcs->hdr.revision_id = KVM_EVMCS_VERSION; 2847 2874 else 2848 2875 vmcs->hdr.revision_id = vmcs_config.revision_id; ··· 2937 2964 * still be marked with revision_id reported by 2938 2965 * physical CPU. 2939 2966 */ 2940 - if (static_branch_unlikely(&enable_evmcs)) 2967 + if (kvm_is_using_evmcs()) 2941 2968 vmcs->hdr.revision_id = vmcs_config.revision_id; 2942 2969 2943 2970 per_cpu(vmxarea, cpu) = vmcs; ··· 3904 3931 * 'Enlightened MSR Bitmap' feature L0 needs to know that MSR 3905 3932 * bitmap has changed. 3906 3933 */ 3907 - if (IS_ENABLED(CONFIG_HYPERV) && static_branch_unlikely(&enable_evmcs)) { 3934 + if (kvm_is_using_evmcs()) { 3908 3935 struct hv_enlightened_vmcs *evmcs = (void *)vmx->vmcs01.vmcs; 3909 3936 3910 3937 if (evmcs->hv_enlightenments_control.msr_bitmap) ··· 4746 4773 /* 22.2.1, 20.8.1 */ 4747 4774 vm_entry_controls_set(vmx, vmx_vmentry_ctrl()); 4748 4775 4749 - vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS; 4776 + vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits(); 4750 4777 vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits); 4751 4778 4752 4779 set_cr4_guest_host_mask(vmx); ··· 5136 5163 if (!boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)) 5137 5164 return true; 5138 5165 5139 - return vmx_get_cpl(vcpu) == 3 && kvm_read_cr0_bits(vcpu, X86_CR0_AM) && 5166 + return vmx_get_cpl(vcpu) == 3 && kvm_is_cr0_bit_set(vcpu, X86_CR0_AM) && 5140 5167 (kvm_get_rflags(vcpu) & X86_EFLAGS_AC); 5141 5168 } 5142 5169 ··· 5473 5500 break; 5474 5501 case 3: /* lmsw */ 5475 5502 
val = (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f; 5476 - trace_kvm_cr_write(0, (kvm_read_cr0(vcpu) & ~0xful) | val); 5503 + trace_kvm_cr_write(0, (kvm_read_cr0_bits(vcpu, ~0xful) | val)); 5477 5504 kvm_lmsw(vcpu, val); 5478 5505 5479 5506 return kvm_skip_emulated_instruction(vcpu); ··· 6930 6957 * real mode. 6931 6958 */ 6932 6959 return enable_unrestricted_guest || emulate_invalid_guest_state; 6933 - case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: 6960 + case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR: 6934 6961 return nested; 6935 6962 case MSR_AMD64_VIRT_SPEC_CTRL: 6936 6963 case MSR_AMD64_TSC_RATIO: ··· 7283 7310 vmx_vcpu_enter_exit(vcpu, __vmx_vcpu_run_flags(vmx)); 7284 7311 7285 7312 /* All fields are clean at this point */ 7286 - if (static_branch_unlikely(&enable_evmcs)) { 7313 + if (kvm_is_using_evmcs()) { 7287 7314 current_evmcs->hv_clean_fields |= 7288 7315 HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; 7289 7316 ··· 7413 7440 * feature only for vmcs01, KVM currently isn't equipped to realize any 7414 7441 * performance benefits from enabling it for vmcs02. 
7415 7442 */ 7416 - if (IS_ENABLED(CONFIG_HYPERV) && static_branch_unlikely(&enable_evmcs) && 7443 + if (kvm_is_using_evmcs() && 7417 7444 (ms_hyperv.nested_features & HV_X64_NESTED_MSR_BITMAP)) { 7418 7445 struct hv_enlightened_vmcs *evmcs = (void *)vmx->vmcs01.vmcs; 7419 7446 ··· 7531 7558 if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) 7532 7559 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; 7533 7560 7534 - if (kvm_read_cr0(vcpu) & X86_CR0_CD) { 7561 + if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) { 7535 7562 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) 7536 7563 cache = MTRR_TYPE_WRBACK; 7537 7564 else ··· 7717 7744 vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_R, 7718 7745 !guest_cpuid_has(vcpu, X86_FEATURE_XFD)); 7719 7746 7747 + if (boot_cpu_has(X86_FEATURE_IBPB)) 7748 + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W, 7749 + !guest_has_pred_cmd_msr(vcpu)); 7750 + 7751 + if (boot_cpu_has(X86_FEATURE_FLUSH_L1D)) 7752 + vmx_set_intercept_for_msr(vcpu, MSR_IA32_FLUSH_CMD, MSR_TYPE_W, 7753 + !guest_cpuid_has(vcpu, X86_FEATURE_FLUSH_L1D)); 7720 7754 7721 7755 set_cr4_guest_host_mask(vmx); 7722 7756 ··· 7756 7776 if (boot_cpu_has(X86_FEATURE_PDCM)) 7757 7777 rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap); 7758 7778 7759 - x86_perf_get_lbr(&lbr); 7760 - if (lbr.nr) 7761 - perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT; 7779 + if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) { 7780 + x86_perf_get_lbr(&lbr); 7781 + if (lbr.nr) 7782 + perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT; 7783 + } 7762 7784 7763 7785 if (vmx_pebs_supported()) { 7764 7786 perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK; ··· 7898 7916 return X86EMUL_CONTINUE; 7899 7917 7900 7918 /* FIXME: produce nested vmexit and return X86EMUL_INTERCEPTED. */ 7919 + break; 7920 + 7921 + case x86_intercept_pause: 7922 + /* 7923 + * PAUSE is a single-byte NOP with a REPE prefix, i.e. collides 7924 + * with vanilla NOPs in the emulator. 
Apply the interception 7925 + * check only to actual PAUSE instructions. Don't check 7926 + * PAUSE-loop-exiting, software can't expect a given PAUSE to 7927 + * exit, i.e. KVM is within its rights to allow L2 to execute 7928 + * the PAUSE. 7929 + */ 7930 + if ((info->rep_prefix != REPE_PREFIX) || 7931 + !nested_cpu_has2(vmcs12, CPU_BASED_PAUSE_EXITING)) 7932 + return X86EMUL_CONTINUE; 7933 + 7901 7934 break; 7902 7935 7903 7936 /* TODO: check more intercepts... */ ··· 8412 8415 #if IS_ENABLED(CONFIG_HYPERV) 8413 8416 if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH 8414 8417 && enable_ept) { 8415 - vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb; 8416 - vmx_x86_ops.tlb_remote_flush_with_range = 8417 - hv_remote_flush_tlb_with_range; 8418 + vmx_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs; 8419 + vmx_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range; 8418 8420 } 8419 8421 #endif 8420 8422
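In the vmx.c changes above, the write handling for MSR_IA32_PRED_CMD leaves vmx_set_msr(), and the pass-through decision is instead made when the guest CPUID is updated: if the host has the feature, writes are intercepted exactly when the guest lacks it. A minimal sketch of that decision follows; struct toy_vcpu_msrs and toy_update_pred_cmd_intercept() are hypothetical names, not KVM's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vCPU intercept state for a single write-only MSR. */
struct toy_vcpu_msrs {
	bool pred_cmd_write_intercepted;
};

/*
 * Recompute the intercept when the guest's CPUID changes. Without host
 * support the current state (intercepted by default) is left untouched.
 */
static inline void toy_update_pred_cmd_intercept(struct toy_vcpu_msrs *m,
						 bool host_has_ibpb,
						 bool guest_has_pred_cmd)
{
	assert(m);

	if (host_has_ibpb)
		m->pred_cmd_write_intercepted = !guest_has_pred_cmd;
}
```

The same shape applies to MSR_IA32_FLUSH_CMD in the hunk, with the host and guest FLUSH_L1D feature bits taking the place of the IBPB checks.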

+19 -1

arch/x86/kvm/vmx/vmx.h

···
 	struct lbr_desc lbr_desc;

 	/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS	15
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS	16
 	struct {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
···
 				(1 << VCPU_EXREG_CR4) | \
 				(1 << VCPU_EXREG_EXIT_INFO_1) | \
 				(1 << VCPU_EXREG_EXIT_INFO_2))
+
+static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
+{
+	unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+
+	/*
+	 * CR0.WP needs to be intercepted when KVM is shadowing legacy paging
+	 * in order to construct shadow PTEs with the correct protections.
+	 * Note! CR0.WP technically can be passed through to the guest if
+	 * paging is disabled, but checking CR0.PG would generate a cyclical
+	 * dependency of sorts due to forcing the caller to ensure CR0 holds
+	 * the correct value prior to determining which CR0 bits can be owned
+	 * by L1. Keep it simple and limit the optimization to EPT.
+	 */
+	if (!enable_ept)
+		bits &= ~X86_CR0_WP;
+	return bits;
+}

 static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
 {
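The effect of vmx_l1_guest_owned_cr0_bits() on the CR0 guest/host mask can be sketched in userspace as follows. The bit positions of CR0.TS (bit 3) and CR0.WP (bit 16) are architectural; the assumption that the set of possible guest-owned bits is exactly TS and WP, the ept_enabled parameter, and the toy_* names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

#define X86_CR0_TS	(1UL << 3)
#define X86_CR0_WP	(1UL << 16)

/* Assumption for this sketch: TS and WP are the only candidate guest bits. */
#define TOY_POSSIBLE_CR0_GUEST_BITS	(X86_CR0_TS | X86_CR0_WP)

/* Bits L1 may change without a VM-Exit; WP is withheld when EPT is off. */
static inline unsigned long toy_l1_guest_owned_cr0_bits(bool ept_enabled)
{
	unsigned long bits = TOY_POSSIBLE_CR0_GUEST_BITS;

	if (!ept_enabled)
		bits &= ~X86_CR0_WP;

	return bits;
}

/* The guest/host mask marks host-owned bits, i.e. the complement. */
static inline unsigned long toy_cr0_guest_host_mask(bool ept_enabled)
{
	unsigned long owned = toy_l1_guest_owned_cr0_bits(ept_enabled);

	/* Owned bits can only ever be a subset of the candidate bits. */
	assert((owned & ~TOY_POSSIBLE_CR0_GUEST_BITS) == 0);

	return ~owned;
}
```

With shadow paging (EPT off) the mask keeps CR0.WP host-owned, so guest writes to it trap and the shadow PTEs can be rebuilt with the right protections; with EPT on, the bit is left to the guest.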

+11 -11

arch/x86/kvm/vmx/vmx_ops.h

··· 147 147 static __always_inline u16 vmcs_read16(unsigned long field) 148 148 { 149 149 vmcs_check16(field); 150 - if (static_branch_unlikely(&enable_evmcs)) 150 + if (kvm_is_using_evmcs()) 151 151 return evmcs_read16(field); 152 152 return __vmcs_readl(field); 153 153 } ··· 155 155 static __always_inline u32 vmcs_read32(unsigned long field) 156 156 { 157 157 vmcs_check32(field); 158 - if (static_branch_unlikely(&enable_evmcs)) 158 + if (kvm_is_using_evmcs()) 159 159 return evmcs_read32(field); 160 160 return __vmcs_readl(field); 161 161 } ··· 163 163 static __always_inline u64 vmcs_read64(unsigned long field) 164 164 { 165 165 vmcs_check64(field); 166 - if (static_branch_unlikely(&enable_evmcs)) 166 + if (kvm_is_using_evmcs()) 167 167 return evmcs_read64(field); 168 168 #ifdef CONFIG_X86_64 169 169 return __vmcs_readl(field); ··· 175 175 static __always_inline unsigned long vmcs_readl(unsigned long field) 176 176 { 177 177 vmcs_checkl(field); 178 - if (static_branch_unlikely(&enable_evmcs)) 178 + if (kvm_is_using_evmcs()) 179 179 return evmcs_read64(field); 180 180 return __vmcs_readl(field); 181 181 } ··· 222 222 static __always_inline void vmcs_write16(unsigned long field, u16 value) 223 223 { 224 224 vmcs_check16(field); 225 - if (static_branch_unlikely(&enable_evmcs)) 225 + if (kvm_is_using_evmcs()) 226 226 return evmcs_write16(field, value); 227 227 228 228 __vmcs_writel(field, value); ··· 231 231 static __always_inline void vmcs_write32(unsigned long field, u32 value) 232 232 { 233 233 vmcs_check32(field); 234 - if (static_branch_unlikely(&enable_evmcs)) 234 + if (kvm_is_using_evmcs()) 235 235 return evmcs_write32(field, value); 236 236 237 237 __vmcs_writel(field, value); ··· 240 240 static __always_inline void vmcs_write64(unsigned long field, u64 value) 241 241 { 242 242 vmcs_check64(field); 243 - if (static_branch_unlikely(&enable_evmcs)) 243 + if (kvm_is_using_evmcs()) 244 244 return evmcs_write64(field, value); 245 245 246 246 __vmcs_writel(field, 
value); ··· 252 252 static __always_inline void vmcs_writel(unsigned long field, unsigned long value) 253 253 { 254 254 vmcs_checkl(field); 255 - if (static_branch_unlikely(&enable_evmcs)) 255 + if (kvm_is_using_evmcs()) 256 256 return evmcs_write64(field, value); 257 257 258 258 __vmcs_writel(field, value); ··· 262 262 { 263 263 BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0x2000, 264 264 "vmcs_clear_bits does not support 64-bit fields"); 265 - if (static_branch_unlikely(&enable_evmcs)) 265 + if (kvm_is_using_evmcs()) 266 266 return evmcs_write32(field, evmcs_read32(field) & ~mask); 267 267 268 268 __vmcs_writel(field, __vmcs_readl(field) & ~mask); ··· 272 272 { 273 273 BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0x2000, 274 274 "vmcs_set_bits does not support 64-bit fields"); 275 - if (static_branch_unlikely(&enable_evmcs)) 275 + if (kvm_is_using_evmcs()) 276 276 return evmcs_write32(field, evmcs_read32(field) | mask); 277 277 278 278 __vmcs_writel(field, __vmcs_readl(field) | mask); ··· 289 289 { 290 290 u64 phys_addr = __pa(vmcs); 291 291 292 - if (static_branch_unlikely(&enable_evmcs)) 292 + if (kvm_is_using_evmcs()) 293 293 return evmcs_load(phys_addr); 294 294 295 295 vmx_asm1(vmptrld, "m"(phys_addr), vmcs, phys_addr);

+169 -87

arch/x86/kvm/x86.c

··· 196 196 module_param(eager_page_split, bool, 0644); 197 197 198 198 /* Enable/disable SMT_RSB bug mitigation */ 199 - bool __read_mostly mitigate_smt_rsb; 199 + static bool __read_mostly mitigate_smt_rsb; 200 200 module_param(mitigate_smt_rsb, bool, 0444); 201 201 202 202 /* ··· 804 804 */ 805 805 if ((fault->error_code & PFERR_PRESENT_MASK) && 806 806 !(fault->error_code & PFERR_RSVD_MASK)) 807 - kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address, 808 - fault_mmu->root.hpa); 807 + kvm_mmu_invalidate_addr(vcpu, fault_mmu, fault->address, 808 + KVM_MMU_ROOT_CURRENT); 809 809 810 810 fault_mmu->inject_page_fault(vcpu, fault); 811 811 } ··· 843 843 844 844 bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr) 845 845 { 846 - if ((dr != 4 && dr != 5) || !kvm_read_cr4_bits(vcpu, X86_CR4_DE)) 846 + if ((dr != 4 && dr != 5) || !kvm_is_cr4_bit_set(vcpu, X86_CR4_DE)) 847 847 return true; 848 848 849 849 kvm_queue_exception(vcpu, UD_VECTOR); ··· 908 908 909 909 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0) 910 910 { 911 + /* 912 + * CR0.WP is incorporated into the MMU role, but only for non-nested, 913 + * indirect shadow MMUs. If paging is disabled, no updates are needed 914 + * as there are no permission bits to emulate. If TDP is enabled, the 915 + * MMU's metadata needs to be updated, e.g. so that emulating guest 916 + * translations does the right thing, but there's no need to unload the 917 + * root as CR0.WP doesn't affect SPTEs. 
918 + */ 919 + if ((cr0 ^ old_cr0) == X86_CR0_WP) { 920 + if (!(cr0 & X86_CR0_PG)) 921 + return; 922 + 923 + if (tdp_enabled) { 924 + kvm_init_mmu(vcpu); 925 + return; 926 + } 927 + } 928 + 911 929 if ((cr0 ^ old_cr0) & X86_CR0_PG) { 912 930 kvm_clear_async_pf_completion_queue(vcpu); 913 931 kvm_async_pf_hash_reset(vcpu); ··· 985 967 return 1; 986 968 987 969 if (!(cr0 & X86_CR0_PG) && 988 - (is_64_bit_mode(vcpu) || kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))) 970 + (is_64_bit_mode(vcpu) || kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE))) 989 971 return 1; 990 972 991 973 static_call(kvm_x86_set_cr0)(vcpu, cr0); ··· 1007 989 if (vcpu->arch.guest_state_protected) 1008 990 return; 1009 991 1010 - if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) { 992 + if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) { 1011 993 1012 994 if (vcpu->arch.xcr0 != host_xcr0) 1013 995 xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0); ··· 1021 1003 if (static_cpu_has(X86_FEATURE_PKU) && 1022 1004 vcpu->arch.pkru != vcpu->arch.host_pkru && 1023 1005 ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) || 1024 - kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) 1006 + kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) 1025 1007 write_pkru(vcpu->arch.pkru); 1026 1008 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */ 1027 1009 } ··· 1035 1017 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS 1036 1018 if (static_cpu_has(X86_FEATURE_PKU) && 1037 1019 ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) || 1038 - kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) { 1020 + kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) { 1039 1021 vcpu->arch.pkru = rdpkru(); 1040 1022 if (vcpu->arch.pkru != vcpu->arch.host_pkru) 1041 1023 write_pkru(vcpu->arch.host_pkru); 1042 1024 } 1043 1025 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */ 1044 1026 1045 - if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) { 1027 + if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) { 1046 1028 1047 1029 if (vcpu->arch.xcr0 != host_xcr0) 1048 1030 xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0); ··· 1198 1180 return 
1; 1199 1181 1200 1182 if ((cr4 & X86_CR4_PCIDE) && !(old_cr4 & X86_CR4_PCIDE)) { 1201 - if (!guest_cpuid_has(vcpu, X86_FEATURE_PCID)) 1202 - return 1; 1203 - 1204 1183 /* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */ 1205 1184 if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu)) 1206 1185 return 1; ··· 1244 1229 * PCIDs for them are also 0, because MOV to CR3 always flushes the TLB 1245 1230 * with PCIDE=0. 1246 1231 */ 1247 - if (!kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE)) 1232 + if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)) 1248 1233 return; 1249 1234 1250 1235 for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) ··· 1259 1244 bool skip_tlb_flush = false; 1260 1245 unsigned long pcid = 0; 1261 1246 #ifdef CONFIG_X86_64 1262 - bool pcid_enabled = kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE); 1263 - 1264 - if (pcid_enabled) { 1247 + if (kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)) { 1265 1248 skip_tlb_flush = cr3 & X86_CR3_PCID_NOFLUSH; 1266 1249 cr3 &= ~X86_CR3_PCID_NOFLUSH; 1267 1250 pcid = cr3 & X86_CR3_PCID_MASK; ··· 1558 1545 static unsigned num_emulated_msrs; 1559 1546 1560 1547 /* 1561 - * List of msr numbers which are used to expose MSR-based features that 1562 - * can be used by a hypervisor to validate requested CPU features. 1548 + * List of MSRs that control the existence of MSR-based features, i.e. MSRs 1549 + * that are effectively CPUID leafs. VMX MSRs are also included in the set of 1550 + * feature MSRs, but are handled separately to allow expedited lookups. 
1563 1551 */ 1564 - static const u32 msr_based_features_all[] = { 1565 - MSR_IA32_VMX_BASIC, 1566 - MSR_IA32_VMX_TRUE_PINBASED_CTLS, 1567 - MSR_IA32_VMX_PINBASED_CTLS, 1568 - MSR_IA32_VMX_TRUE_PROCBASED_CTLS, 1569 - MSR_IA32_VMX_PROCBASED_CTLS, 1570 - MSR_IA32_VMX_TRUE_EXIT_CTLS, 1571 - MSR_IA32_VMX_EXIT_CTLS, 1572 - MSR_IA32_VMX_TRUE_ENTRY_CTLS, 1573 - MSR_IA32_VMX_ENTRY_CTLS, 1574 - MSR_IA32_VMX_MISC, 1575 - MSR_IA32_VMX_CR0_FIXED0, 1576 - MSR_IA32_VMX_CR0_FIXED1, 1577 - MSR_IA32_VMX_CR4_FIXED0, 1578 - MSR_IA32_VMX_CR4_FIXED1, 1579 - MSR_IA32_VMX_VMCS_ENUM, 1580 - MSR_IA32_VMX_PROCBASED_CTLS2, 1581 - MSR_IA32_VMX_EPT_VPID_CAP, 1582 - MSR_IA32_VMX_VMFUNC, 1583 - 1552 + static const u32 msr_based_features_all_except_vmx[] = { 1584 1553 MSR_AMD64_DE_CFG, 1585 1554 MSR_IA32_UCODE_REV, 1586 1555 MSR_IA32_ARCH_CAPABILITIES, 1587 1556 MSR_IA32_PERF_CAPABILITIES, 1588 1557 }; 1589 1558 1590 - static u32 msr_based_features[ARRAY_SIZE(msr_based_features_all)]; 1559 + static u32 msr_based_features[ARRAY_SIZE(msr_based_features_all_except_vmx) + 1560 + (KVM_LAST_EMULATED_VMX_MSR - KVM_FIRST_EMULATED_VMX_MSR + 1)]; 1591 1561 static unsigned int num_msr_based_features; 1562 + 1563 + /* 1564 + * All feature MSRs except uCode revID, which tracks the currently loaded uCode 1565 + * patch, are immutable once the vCPU model is defined. 
1566 + */ 1567 + static bool kvm_is_immutable_feature_msr(u32 msr) 1568 + { 1569 + int i; 1570 + 1571 + if (msr >= KVM_FIRST_EMULATED_VMX_MSR && msr <= KVM_LAST_EMULATED_VMX_MSR) 1572 + return true; 1573 + 1574 + for (i = 0; i < ARRAY_SIZE(msr_based_features_all_except_vmx); i++) { 1575 + if (msr == msr_based_features_all_except_vmx[i]) 1576 + return msr != MSR_IA32_UCODE_REV; 1577 + } 1578 + 1579 + return false; 1580 + } 1592 1581 1593 1582 /* 1594 1583 * Some IA32_ARCH_CAPABILITIES bits have dependencies on MSRs that KVM ··· 2209 2194 2210 2195 static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data) 2211 2196 { 2197 + u64 val; 2198 + 2199 + /* 2200 + * Disallow writes to immutable feature MSRs after KVM_RUN. KVM does 2201 + * not support modifying the guest vCPU model on the fly, e.g. changing 2202 + * the nVMX capabilities while L2 is running is nonsensical. Ignore 2203 + * writes of the same value, e.g. to allow userspace to blindly stuff 2204 + * all MSRs when emulating RESET. 2205 + */ 2206 + if (kvm_vcpu_has_run(vcpu) && kvm_is_immutable_feature_msr(index)) { 2207 + if (do_get_msr(vcpu, index, &val) || *data != val) 2208 + return -EINVAL; 2209 + 2210 + return 0; 2211 + } 2212 + 2212 2213 return kvm_set_msr_ignored_check(vcpu, index, *data, true); 2213 2214 } 2214 2215 ··· 3647 3616 if (data & ~kvm_caps.supported_perf_cap) 3648 3617 return 1; 3649 3618 3619 + /* 3620 + * Note, this is not just a performance optimization! KVM 3621 + * disallows changing feature MSRs after the vCPU has run; PMU 3622 + * refresh will bug the VM if called after the vCPU has run. 
3623 + */ 3624 + if (vcpu->arch.perf_capabilities == data) 3625 + break; 3626 + 3650 3627 vcpu->arch.perf_capabilities = data; 3651 3628 kvm_pmu_refresh(vcpu); 3652 - return 0; 3629 + break; 3630 + case MSR_IA32_PRED_CMD: 3631 + if (!msr_info->host_initiated && !guest_has_pred_cmd_msr(vcpu)) 3632 + return 1; 3633 + 3634 + if (!boot_cpu_has(X86_FEATURE_IBPB) || (data & ~PRED_CMD_IBPB)) 3635 + return 1; 3636 + if (!data) 3637 + break; 3638 + 3639 + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB); 3640 + break; 3641 + case MSR_IA32_FLUSH_CMD: 3642 + if (!msr_info->host_initiated && 3643 + !guest_cpuid_has(vcpu, X86_FEATURE_FLUSH_L1D)) 3644 + return 1; 3645 + 3646 + if (!boot_cpu_has(X86_FEATURE_FLUSH_L1D) || (data & ~L1D_FLUSH)) 3647 + return 1; 3648 + if (!data) 3649 + break; 3650 + 3651 + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH); 3652 + break; 3653 3653 case MSR_EFER: 3654 3654 return set_efer(vcpu, msr_info); 3655 3655 case MSR_K7_HWCR: ··· 4596 4534 r = 0; 4597 4535 break; 4598 4536 case KVM_CAP_XSAVE2: { 4599 - u64 guest_perm = xstate_get_guest_group_perm(); 4600 - 4601 - r = xstate_required_size(kvm_caps.supported_xcr0 & guest_perm, false); 4537 + r = xstate_required_size(kvm_get_filtered_xcr0(), false); 4602 4538 if (r < sizeof(struct kvm_xsave)) 4603 4539 r = sizeof(struct kvm_xsave); 4604 4540 break; ··· 5096 5036 return 0; 5097 5037 if (mce->status & MCI_STATUS_UC) { 5098 5038 if ((vcpu->arch.mcg_status & MCG_STATUS_MCIP) || 5099 - !kvm_read_cr4_bits(vcpu, X86_CR4_MCE)) { 5039 + !kvm_is_cr4_bit_set(vcpu, X86_CR4_MCE)) { 5100 5040 kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); 5101 5041 return 0; 5102 5042 } ··· 5188 5128 events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu); 5189 5129 5190 5130 events->nmi.injected = vcpu->arch.nmi_injected; 5191 - events->nmi.pending = vcpu->arch.nmi_pending != 0; 5131 + events->nmi.pending = kvm_get_nr_pending_nmis(vcpu); 5192 5132 events->nmi.masked = static_call(kvm_x86_get_nmi_mask)(vcpu); 5193 5133 5194 5134 
/* events->sipi_vector is never valid when reporting to user space */ ··· 5275 5215 events->interrupt.shadow); 5276 5216 5277 5217 vcpu->arch.nmi_injected = events->nmi.injected; 5278 - if (events->flags & KVM_VCPUEVENT_VALID_NMI_PENDING) 5279 - vcpu->arch.nmi_pending = events->nmi.pending; 5218 + if (events->flags & KVM_VCPUEVENT_VALID_NMI_PENDING) { 5219 + vcpu->arch.nmi_pending = 0; 5220 + atomic_set(&vcpu->arch.nmi_queued, events->nmi.pending); 5221 + kvm_make_request(KVM_REQ_NMI, vcpu); 5222 + } 5280 5223 static_call(kvm_x86_set_nmi_mask)(vcpu, events->nmi.masked); 5281 5224 5282 5225 if (events->flags & KVM_VCPUEVENT_VALID_SIPI_VECTOR && ··· 6087 6024 return 0; 6088 6025 } 6089 6026 6090 - static unsigned long kvm_vm_ioctl_get_nr_mmu_pages(struct kvm *kvm) 6091 - { 6092 - return kvm->arch.n_max_mmu_pages; 6093 - } 6094 - 6095 6027 static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip) 6096 6028 { 6097 6029 struct kvm_pic *pic = kvm->arch.vpic; ··· 6733 6675 return 0; 6734 6676 } 6735 6677 6736 - long kvm_arch_vm_ioctl(struct file *filp, 6737 - unsigned int ioctl, unsigned long arg) 6678 + int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 6738 6679 { 6739 6680 struct kvm *kvm = filp->private_data; 6740 6681 void __user *argp = (void __user *)arg; ··· 6770 6713 } 6771 6714 case KVM_SET_NR_MMU_PAGES: 6772 6715 r = kvm_vm_ioctl_set_nr_mmu_pages(kvm, arg); 6773 - break; 6774 - case KVM_GET_NR_MMU_PAGES: 6775 - r = kvm_vm_ioctl_get_nr_mmu_pages(kvm); 6776 6716 break; 6777 6717 case KVM_CREATE_IRQCHIP: { 6778 6718 mutex_lock(&kvm->lock); ··· 7075 7021 return r; 7076 7022 } 7077 7023 7024 + static void kvm_probe_feature_msr(u32 msr_index) 7025 + { 7026 + struct kvm_msr_entry msr = { 7027 + .index = msr_index, 7028 + }; 7029 + 7030 + if (kvm_get_msr_feature(&msr)) 7031 + return; 7032 + 7033 + msr_based_features[num_msr_based_features++] = msr_index; 7034 + } 7035 + 7078 7036 static void kvm_probe_msr_to_save(u32 
msr_index) 7079 7037 { 7080 7038 u32 dummy[2]; ··· 7162 7096 msrs_to_save[num_msrs_to_save++] = msr_index; 7163 7097 } 7164 7098 7165 - static void kvm_init_msr_list(void) 7099 + static void kvm_init_msr_lists(void) 7166 7100 { 7167 7101 unsigned i; 7168 7102 ··· 7188 7122 emulated_msrs[num_emulated_msrs++] = emulated_msrs_all[i]; 7189 7123 } 7190 7124 7191 - for (i = 0; i < ARRAY_SIZE(msr_based_features_all); i++) { 7192 - struct kvm_msr_entry msr; 7125 + for (i = KVM_FIRST_EMULATED_VMX_MSR; i <= KVM_LAST_EMULATED_VMX_MSR; i++) 7126 + kvm_probe_feature_msr(i); 7193 7127 7194 - msr.index = msr_based_features_all[i]; 7195 - if (kvm_get_msr_feature(&msr)) 7196 - continue; 7197 - 7198 - msr_based_features[num_msr_based_features++] = msr_based_features_all[i]; 7199 - } 7128 + for (i = 0; i < ARRAY_SIZE(msr_based_features_all_except_vmx); i++) 7129 + kvm_probe_feature_msr(msr_based_features_all_except_vmx[i]); 7200 7130 } 7201 7131 7202 7132 static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len, ··· 8528 8466 } 8529 8467 8530 8468 static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, 8531 - bool write_fault_to_shadow_pgtable, 8532 8469 int emulation_type) 8533 8470 { 8534 8471 gpa_t gpa = cr2_or_gpa; ··· 8598 8537 * be fixed by unprotecting shadow page and it should 8599 8538 * be reported to userspace. 8600 8539 */ 8601 - return !write_fault_to_shadow_pgtable; 8540 + return !(emulation_type & EMULTYPE_WRITE_PF_TO_SP); 8602 8541 } 8603 8542 8604 8543 static bool retry_instruction(struct x86_emulate_ctxt *ctxt, ··· 8846 8785 int r; 8847 8786 struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt; 8848 8787 bool writeback = true; 8849 - bool write_fault_to_spt; 8850 8788 8851 8789 if (unlikely(!kvm_can_emulate_insn(vcpu, emulation_type, insn, insn_len))) 8852 8790 return 1; 8853 8791 8854 8792 vcpu->arch.l1tf_flush_l1d = true; 8855 - 8856 - /* 8857 - * Clear write_fault_to_shadow_pgtable here to ensure it is 8858 - * never reused. 
8859 - */ 8860 - write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable; 8861 - vcpu->arch.write_fault_to_shadow_pgtable = false; 8862 8793 8863 8794 if (!(emulation_type & EMULTYPE_NO_DECODE)) { 8864 8795 kvm_clear_exception_queue(vcpu); ··· 8872 8819 return 1; 8873 8820 } 8874 8821 if (reexecute_instruction(vcpu, cr2_or_gpa, 8875 - write_fault_to_spt, 8876 8822 emulation_type)) 8877 8823 return 1; 8878 8824 ··· 8950 8898 return 1; 8951 8899 8952 8900 if (r == EMULATION_FAILED) { 8953 - if (reexecute_instruction(vcpu, cr2_or_gpa, write_fault_to_spt, 8954 - emulation_type)) 8901 + if (reexecute_instruction(vcpu, cr2_or_gpa, emulation_type)) 8955 8902 return 1; 8956 8903 8957 8904 return handle_emulation_failure(vcpu, emulation_type); ··· 9528 9477 kvm_caps.max_guest_tsc_khz = max; 9529 9478 } 9530 9479 kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits; 9531 - kvm_init_msr_list(); 9480 + kvm_init_msr_lists(); 9532 9481 return 0; 9533 9482 9534 9483 out_unwind_ops: ··· 9859 9808 vcpu->run->hypercall.args[0] = gpa; 9860 9809 vcpu->run->hypercall.args[1] = npages; 9861 9810 vcpu->run->hypercall.args[2] = attrs; 9862 - vcpu->run->hypercall.longmode = op_64_bit; 9811 + vcpu->run->hypercall.flags = 0; 9812 + if (op_64_bit) 9813 + vcpu->run->hypercall.flags |= KVM_EXIT_HYPERCALL_LONG_MODE; 9814 + 9815 + WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ); 9863 9816 vcpu->arch.complete_userspace_io = complete_hypercall_exit; 9864 9817 return 0; 9865 9818 } ··· 10225 10170 10226 10171 static void process_nmi(struct kvm_vcpu *vcpu) 10227 10172 { 10228 - unsigned limit = 2; 10173 + unsigned int limit; 10229 10174 10230 10175 /* 10231 - * x86 is limited to one NMI running, and one NMI pending after it. 10232 - * If an NMI is already in progress, limit further NMIs to just one. 10233 - * Otherwise, allow two (and we'll inject the first one immediately). 
10176 + * x86 is limited to one NMI pending, but because KVM can't react to 10177 + * incoming NMIs as quickly as bare metal, e.g. if the vCPU is 10178 + * scheduled out, KVM needs to play nice with two queued NMIs showing 10179 + * up at the same time. To handle this scenario, allow two NMIs to be 10180 + * (temporarily) pending so long as NMIs are not blocked and KVM is not 10181 + * waiting for a previous NMI injection to complete (which effectively 10182 + * blocks NMIs). KVM will immediately inject one of the two NMIs, and 10183 + * will request an NMI window to handle the second NMI. 10234 10184 */ 10235 10185 if (static_call(kvm_x86_get_nmi_mask)(vcpu) || vcpu->arch.nmi_injected) 10236 10186 limit = 1; 10187 + else 10188 + limit = 2; 10189 + 10190 + /* 10191 + * Adjust the limit to account for pending virtual NMIs, which aren't 10192 + * tracked in vcpu->arch.nmi_pending. 10193 + */ 10194 + if (static_call(kvm_x86_is_vnmi_pending)(vcpu)) 10195 + limit--; 10237 10196 10238 10197 vcpu->arch.nmi_pending += atomic_xchg(&vcpu->arch.nmi_queued, 0); 10239 10198 vcpu->arch.nmi_pending = min(vcpu->arch.nmi_pending, limit); 10240 - kvm_make_request(KVM_REQ_EVENT, vcpu); 10199 + 10200 + if (vcpu->arch.nmi_pending && 10201 + (static_call(kvm_x86_set_vnmi_pending)(vcpu))) 10202 + vcpu->arch.nmi_pending--; 10203 + 10204 + if (vcpu->arch.nmi_pending) 10205 + kvm_make_request(KVM_REQ_EVENT, vcpu); 10206 + } 10207 + 10208 + /* Return total number of NMIs pending injection to the VM */ 10209 + int kvm_get_nr_pending_nmis(struct kvm_vcpu *vcpu) 10210 + { 10211 + return vcpu->arch.nmi_pending + 10212 + static_call(kvm_x86_is_vnmi_pending)(vcpu); 10241 10213 } 10242 10214 10243 10215 void kvm_make_scan_ioapic_request_mask(struct kvm *kvm, ··· 13350 13268 return 1; 13351 13269 } 13352 13270 13353 - pcid_enabled = kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE); 13271 + pcid_enabled = kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE); 13354 13272 13355 13273 switch (type) { 13356 13274 case 
INVPCID_TYPE_INDIV_ADDR:
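
Reviewer's sketch (not part of the patch): the reworked process_nmi() limit logic above can be modeled in plain C. Everything named model_* is hypothetical and stands in for vCPU state. In particular, the condition under which the set_vnmi_pending hook succeeds (virtual NMI supported and none already pending) is an assumption, since the hook body is not shown in this hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-vCPU NMI state. */
struct model_nmi {
	unsigned int nmi_queued;   /* NMIs queued, not yet folded into nmi_pending */
	unsigned int nmi_pending;  /* NMIs tracked in software */
	bool nmi_masked;           /* NMIs are currently blocked */
	bool nmi_injected;         /* a previous injection has not completed */
	bool vnmi_supported;       /* assumption: hardware virtual NMI is usable */
	bool vnmi_pending;         /* one NMI is already pending in hardware */
};

/* Returns true if the model would raise an event request. */
static bool model_process_nmi(struct model_nmi *m)
{
	unsigned int limit;

	/* One NMI if blocked or mid-injection, otherwise two. */
	limit = (m->nmi_masked || m->nmi_injected) ? 1 : 2;

	/* A pending virtual NMI is not counted in nmi_pending, so reserve a slot. */
	if (m->vnmi_pending)
		limit--;

	m->nmi_pending += m->nmi_queued;
	m->nmi_queued = 0;
	if (m->nmi_pending > limit)
		m->nmi_pending = limit;
	assert(m->nmi_pending <= 2);

	/* Assumed hook behaviour: hand one NMI to hardware when possible. */
	if (m->nmi_pending && m->vnmi_supported && !m->vnmi_pending) {
		m->vnmi_pending = true;
		m->nmi_pending--;
	}

	return m->nmi_pending != 0;
}

/* Mirrors kvm_get_nr_pending_nmis(): software count plus the hardware one. */
static unsigned int model_nr_pending_nmis(const struct model_nmi *m)
{
	return m->nmi_pending + (m->vnmi_pending ? 1u : 0u);
}
```

With three queued NMIs and nothing blocked, the model keeps two. With virtual NMI available, one of those two moves into the hardware slot and the total reported to userspace is still two.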

+53 -11

arch/x86/kvm/x86.h

··· 3 3 #define ARCH_X86_KVM_X86_H 4 4 5 5 #include <linux/kvm_host.h> 6 + #include <asm/fpu/xstate.h> 6 7 #include <asm/mce.h> 7 8 #include <asm/pvclock.h> 8 9 #include "kvm_cache_regs.h" ··· 40 39 trace_kvm_nested_vmenter_failed(#consistency_check, 0); \ 41 40 failed; \ 42 41 }) 42 + 43 + /* 44 + * The first...last VMX feature MSRs that are emulated by KVM. This may or may 45 + * not cover all known VMX MSRs, as KVM doesn't emulate an MSR until there's an 46 + * associated feature that KVM supports for nested virtualization. 47 + */ 48 + #define KVM_FIRST_EMULATED_VMX_MSR MSR_IA32_VMX_BASIC 49 + #define KVM_LAST_EMULATED_VMX_MSR MSR_IA32_VMX_VMFUNC 43 50 44 51 #define KVM_DEFAULT_PLE_GAP 128 45 52 #define KVM_VMX_DEFAULT_PLE_WINDOW 4096 ··· 92 83 void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu); 93 84 int kvm_check_nested_events(struct kvm_vcpu *vcpu); 94 85 86 + static inline bool kvm_vcpu_has_run(struct kvm_vcpu *vcpu) 87 + { 88 + return vcpu->arch.last_vmentry_cpu != -1; 89 + } 90 + 95 91 static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu) 96 92 { 97 93 return vcpu->arch.exception.pending || ··· 137 123 138 124 static inline bool is_protmode(struct kvm_vcpu *vcpu) 139 125 { 140 - return kvm_read_cr0_bits(vcpu, X86_CR0_PE); 126 + return kvm_is_cr0_bit_set(vcpu, X86_CR0_PE); 141 127 } 142 128 143 - static inline int is_long_mode(struct kvm_vcpu *vcpu) 129 + static inline bool is_long_mode(struct kvm_vcpu *vcpu) 144 130 { 145 131 #ifdef CONFIG_X86_64 146 - return vcpu->arch.efer & EFER_LMA; 132 + return !!(vcpu->arch.efer & EFER_LMA); 147 133 #else 148 - return 0; 134 + return false; 149 135 #endif 150 136 } 151 137 ··· 185 171 return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu; 186 172 } 187 173 188 - static inline int is_pae(struct kvm_vcpu *vcpu) 174 + static inline bool is_pae(struct kvm_vcpu *vcpu) 189 175 { 190 - return kvm_read_cr4_bits(vcpu, X86_CR4_PAE); 176 + return kvm_is_cr4_bit_set(vcpu, X86_CR4_PAE); 191 177 } 192 
178 193 - static inline int is_pse(struct kvm_vcpu *vcpu) 179 + static inline bool is_pse(struct kvm_vcpu *vcpu) 194 180 { 195 - return kvm_read_cr4_bits(vcpu, X86_CR4_PSE); 181 + return kvm_is_cr4_bit_set(vcpu, X86_CR4_PSE); 196 182 } 197 183 198 - static inline int is_paging(struct kvm_vcpu *vcpu) 184 + static inline bool is_paging(struct kvm_vcpu *vcpu) 199 185 { 200 - return likely(kvm_read_cr0_bits(vcpu, X86_CR0_PG)); 186 + return likely(kvm_is_cr0_bit_set(vcpu, X86_CR0_PG)); 201 187 } 202 188 203 189 static inline bool is_pae_paging(struct kvm_vcpu *vcpu) ··· 207 193 208 194 static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu) 209 195 { 210 - return kvm_read_cr4_bits(vcpu, X86_CR4_LA57) ? 57 : 48; 196 + return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48; 211 197 } 212 198 213 199 static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu) ··· 328 314 extern struct kvm_caps kvm_caps; 329 315 330 316 extern bool enable_pmu; 317 + 318 + /* 319 + * Get a filtered version of KVM's supported XCR0 that strips out dynamic 320 + * features for which the current process doesn't (yet) have permission to use. 321 + * This is intended to be used only when enumerating support to userspace, 322 + * e.g. in KVM_GET_SUPPORTED_CPUID and KVM_CAP_XSAVE2, it does NOT need to be 323 + * used to check/restrict guest behavior as KVM rejects KVM_SET_CPUID{2} if 324 + * userspace attempts to enable unpermitted features. 325 + */ 326 + static inline u64 kvm_get_filtered_xcr0(void) 327 + { 328 + u64 permitted_xcr0 = kvm_caps.supported_xcr0; 329 + 330 + BUILD_BUG_ON(XFEATURE_MASK_USER_DYNAMIC != XFEATURE_MASK_XTILE_DATA); 331 + 332 + if (permitted_xcr0 & XFEATURE_MASK_USER_DYNAMIC) { 333 + permitted_xcr0 &= xstate_get_guest_group_perm(); 334 + 335 + /* 336 + * Treat XTILE_CFG as unsupported if the current process isn't 337 + * allowed to use XTILE_DATA, as attempting to set XTILE_CFG in 338 + * XCR0 without setting XTILE_DATA is architecturally illegal. 
339 + */ 340 + if (!(permitted_xcr0 & XFEATURE_MASK_XTILE_DATA)) 341 + permitted_xcr0 &= ~XFEATURE_MASK_XTILE_CFG; 342 + } 343 + return permitted_xcr0; 344 + } 331 345 332 346 static inline bool kvm_mpx_supported(void) 333 347 {
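
Reviewer's sketch (not part of the patch): the filtering done by kvm_get_filtered_xcr0() above, restated as a pure function so the XTILE_CFG/XTILE_DATA coupling is easy to check. The bit positions (17 and 18) are my understanding of the x86 XSAVE feature numbering and should be treated as assumptions. The two parameters replace kvm_caps.supported_xcr0 and xstate_get_guest_group_perm(), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions for the AMX tile state components. */
#define MODEL_XFEATURE_MASK_XTILE_CFG    (1ULL << 17)
#define MODEL_XFEATURE_MASK_XTILE_DATA   (1ULL << 18)
#define MODEL_XFEATURE_MASK_USER_DYNAMIC MODEL_XFEATURE_MASK_XTILE_DATA

/*
 * supported_xcr0: what the hypervisor supports in general.
 * guest_perm:     what the current process has been permitted to use.
 */
static uint64_t model_filtered_xcr0(uint64_t supported_xcr0, uint64_t guest_perm)
{
	uint64_t permitted = supported_xcr0;

	/* Permissions only matter when a dynamic feature is supported at all. */
	if (permitted & MODEL_XFEATURE_MASK_USER_DYNAMIC) {
		permitted &= guest_perm;

		/* XTILE_CFG without XTILE_DATA is illegal in XCR0, so drop both. */
		if (!(permitted & MODEL_XFEATURE_MASK_XTILE_DATA))
			permitted &= ~MODEL_XFEATURE_MASK_XTILE_CFG;
	}

	/* Filtering may only remove bits, never add them. */
	assert((permitted & ~supported_xcr0) == 0);
	return permitted;
}
```

For example, if the supported mask includes both tile components but the process was only granted XTILE_CFG, the result contains neither.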

+1

include/clocksource/arm_arch_timer.h

··· 21 21 #define CNTHCTL_EVNTEN (1 << 2) 22 22 #define CNTHCTL_EVNTDIR (1 << 3) 23 23 #define CNTHCTL_EVNTI (0xF << 4) 24 + #define CNTHCTL_ECV (1 << 12) 24 25 25 26 enum arch_timer_reg { 26 27 ARCH_TIMER_REG_CTRL,

+28 -6

include/kvm/arm_arch_timer.h

··· 13 13 enum kvm_arch_timers { 14 14 TIMER_PTIMER, 15 15 TIMER_VTIMER, 16 + NR_KVM_EL0_TIMERS, 17 + TIMER_HVTIMER = NR_KVM_EL0_TIMERS, 18 + TIMER_HPTIMER, 16 19 NR_KVM_TIMERS 17 20 }; 18 21 ··· 24 21 TIMER_REG_CVAL, 25 22 TIMER_REG_TVAL, 26 23 TIMER_REG_CTL, 24 + TIMER_REG_VOFF, 27 25 }; 28 26 29 27 struct arch_timer_offset { ··· 33 29 * structure. If NULL, assume a zero offset. 34 30 */ 35 31 u64 *vm_offset; 32 + /* 33 + * If set, pointer to one of the offsets in the vcpu's sysreg 34 + * array. If NULL, assume a zero offset. 35 + */ 36 + u64 *vcpu_offset; 36 37 }; 37 38 38 39 struct arch_timer_vm_data { 39 40 /* Offset applied to the virtual timer/counter */ 40 41 u64 voffset; 42 + /* Offset applied to the physical timer/counter */ 43 + u64 poffset; 44 + 45 + /* The PPI for each timer, global to the VM */ 46 + u8 ppi[NR_KVM_TIMERS]; 41 47 }; 42 48 43 49 struct arch_timer_context { 44 50 struct kvm_vcpu *vcpu; 45 51 46 - /* Timer IRQ */ 47 - struct kvm_irq_level irq; 48 - 49 52 /* Emulated Timer (may be unused) */ 50 53 struct hrtimer hrtimer; 54 + u64 ns_frac; 51 55 52 56 /* Offset for this counter/timer */ 53 57 struct arch_timer_offset offset; ··· 66 54 */ 67 55 bool loaded; 68 56 57 + /* Output level of the timer IRQ */ 58 + struct { 59 + bool level; 60 + } irq; 61 + 69 62 /* Duplicated state from arch_timer.c for convenience */ 70 63 u32 host_timer_irq; 71 - u32 host_timer_irq_flags; 72 64 }; 73 65 74 66 struct timer_map { 75 67 struct arch_timer_context *direct_vtimer; 76 68 struct arch_timer_context *direct_ptimer; 69 + struct arch_timer_context *emul_vtimer; 77 70 struct arch_timer_context *emul_ptimer; 78 71 }; 79 72 ··· 101 84 void kvm_timer_update_run(struct kvm_vcpu *vcpu); 102 85 void kvm_timer_vcpu_terminate(struct kvm_vcpu *vcpu); 103 86 87 + void kvm_timer_init_vm(struct kvm *kvm); 88 + 104 89 u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid); 105 90 int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value); 106 91 ··· 117 98 118 99 
void kvm_timer_init_vhe(void); 119 100 120 - bool kvm_arch_timer_get_input_level(int vintid); 121 - 122 101 #define vcpu_timer(v) (&(v)->arch.timer_cpu) 123 102 #define vcpu_get_timer(v,t) (&vcpu_timer(v)->timers[(t)]) 124 103 #define vcpu_vtimer(v) (&(v)->arch.timer_cpu.timers[TIMER_VTIMER]) 125 104 #define vcpu_ptimer(v) (&(v)->arch.timer_cpu.timers[TIMER_PTIMER]) 105 + #define vcpu_hvtimer(v) (&(v)->arch.timer_cpu.timers[TIMER_HVTIMER]) 106 + #define vcpu_hptimer(v) (&(v)->arch.timer_cpu.timers[TIMER_HPTIMER]) 126 107 127 108 #define arch_timer_ctx_index(ctx) ((ctx) - vcpu_timer((ctx)->vcpu)->timers) 109 + 110 + #define timer_vm_data(ctx) (&(ctx)->vcpu->kvm->arch.timer_data) 111 + #define timer_irq(ctx) (timer_vm_data(ctx)->ppi[arch_timer_ctx_index(ctx)]) 128 112 129 113 u64 kvm_arm_timer_read_sysreg(struct kvm_vcpu *vcpu, 130 114 enum kvm_arch_timers tmr,
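
Reviewer's sketch (not part of the patch): the new timer_irq() macro above combines two lookups, the context's index within its vCPU's timer array (pointer subtraction, as in arch_timer_ctx_index()) and the per-VM PPI table. The model below uses hypothetical, heavily trimmed structures to show that composition; it is not the kernel layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same ordering as enum kvm_arch_timers in the hunk above. */
enum model_timers {
	MODEL_TIMER_PTIMER,
	MODEL_TIMER_VTIMER,
	MODEL_NR_EL0_TIMERS,
	MODEL_TIMER_HVTIMER = MODEL_NR_EL0_TIMERS,
	MODEL_TIMER_HPTIMER,
	MODEL_NR_TIMERS
};

struct model_vcpu;

/* Hypothetical trimmed context: only the back-pointer is needed here. */
struct model_timer_ctx {
	struct model_vcpu *vcpu;
};

/* Per-VM data: one PPI number per timer, shared by all vCPUs. */
struct model_vm {
	uint8_t ppi[MODEL_NR_TIMERS];
};

struct model_vcpu {
	struct model_vm *vm;
	struct model_timer_ctx timers[MODEL_NR_TIMERS];
};

static void model_vcpu_init(struct model_vcpu *vcpu, struct model_vm *vm)
{
	vcpu->vm = vm;
	for (int i = 0; i < MODEL_NR_TIMERS; i++)
		vcpu->timers[i].vcpu = vcpu;
}

/* Index of a context within its owning vCPU's array. */
static ptrdiff_t model_ctx_index(const struct model_timer_ctx *ctx)
{
	ptrdiff_t idx = ctx - ctx->vcpu->timers;

	assert(idx >= 0 && idx < MODEL_NR_TIMERS);
	return idx;
}

/* Interrupt number for a context, looked up in the VM-wide table. */
static uint8_t model_timer_irq(const struct model_timer_ctx *ctx)
{
	return ctx->vcpu->vm->ppi[model_ctx_index(ctx)];
}
```

Because the PPI now lives in VM-wide data, every vCPU of a VM resolves the same interrupt number for a given timer, which appears to be the point of dropping the per-context irq number.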

+5 -1

include/kvm/arm_hypercalls.h

··· 6 6 7 7 #include <asm/kvm_emulate.h> 8 8 9 - int kvm_hvc_call_handler(struct kvm_vcpu *vcpu); 9 + int kvm_smccc_call_handler(struct kvm_vcpu *vcpu); 10 10 11 11 static inline u32 smccc_get_function(struct kvm_vcpu *vcpu) 12 12 { ··· 43 43 struct kvm_one_reg; 44 44 45 45 void kvm_arm_init_hypercalls(struct kvm *kvm); 46 + void kvm_arm_teardown_hypercalls(struct kvm *kvm); 46 47 int kvm_arm_get_fw_num_regs(struct kvm_vcpu *vcpu); 47 48 int kvm_arm_copy_fw_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices); 48 49 int kvm_arm_get_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg); 49 50 int kvm_arm_set_fw_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg); 51 + 52 + int kvm_vm_smccc_has_attr(struct kvm *kvm, struct kvm_device_attr *attr); 53 + int kvm_vm_smccc_set_attr(struct kvm *kvm, struct kvm_device_attr *attr); 50 54 51 55 #endif

+1

include/kvm/arm_vgic.h

··· 380 380 int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, 381 381 u32 vintid, struct irq_ops *ops); 382 382 int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, unsigned int vintid); 383 + int kvm_vgic_get_map(struct kvm_vcpu *vcpu, unsigned int vintid); 383 384 bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int vintid); 384 385 385 386 int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu);

+3 -4

include/linux/kvm_host.h

··· 58 58 59 59 /* 60 60 * Bit 63 of the memslot generation number is an "update in-progress flag", 61 - * e.g. is temporarily set for the duration of install_new_memslots(). 61 + * e.g. is temporarily set for the duration of kvm_swap_active_memslots(). 62 62 * This flag effectively creates a unique generation number that is used to 63 63 * mark cached memslot data, e.g. MMIO accesses, as potentially being stale, 64 64 * i.e. may (or may not) have come from the previous memslots generation. ··· 713 713 * use by the VM. To be used under the slots_lock (above) or in a 714 714 * kvm->srcu critical section where acquiring the slots_lock would 715 715 * lead to deadlock with the synchronize_srcu in 716 - * install_new_memslots. 716 + * kvm_swap_active_memslots(). 717 717 */ 718 718 struct mutex slots_arch_lock; 719 719 struct mm_struct *mm; /* userspace tied to this vm */ ··· 1398 1398 bool line_status); 1399 1399 int kvm_vm_ioctl_enable_cap(struct kvm *kvm, 1400 1400 struct kvm_enable_cap *cap); 1401 - long kvm_arch_vm_ioctl(struct file *filp, 1402 - unsigned int ioctl, unsigned long arg); 1401 + int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); 1403 1402 long kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl, 1404 1403 unsigned long arg); 1405 1404

+1 -1

include/linux/kvm_types.h

··· 91 91 * is topped up (__kvm_mmu_topup_memory_cache()). 92 92 */ 93 93 struct kvm_mmu_memory_cache { 94 - int nobjs; 95 94 gfp_t gfp_zero; 96 95 gfp_t gfp_custom; 97 96 struct kmem_cache *kmem_cache; 98 97 int capacity; 98 + int nobjs; 99 99 void **objects; 100 100 }; 101 101 #endif

+11 -3

include/uapi/linux/kvm.h

··· 341 341 __u64 nr; 342 342 __u64 args[6]; 343 343 __u64 ret; 344 - __u32 longmode; 345 - __u32 pad; 344 + 345 + union { 346 + #ifndef __KERNEL__ 347 + __u32 longmode; 348 + #endif 349 + __u64 flags; 350 + }; 346 351 } hypercall; 347 352 /* KVM_EXIT_TPR_ACCESS */ 348 353 struct { ··· 1189 1184 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224 1190 1185 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225 1191 1186 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226 1187 + #define KVM_CAP_COUNTER_OFFSET 227 1192 1188 1193 1189 #ifdef KVM_CAP_IRQ_ROUTING 1194 1190 ··· 1457 1451 #define KVM_CREATE_VCPU _IO(KVMIO, 0x41) 1458 1452 #define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log) 1459 1453 #define KVM_SET_NR_MMU_PAGES _IO(KVMIO, 0x44) 1460 - #define KVM_GET_NR_MMU_PAGES _IO(KVMIO, 0x45) 1454 + #define KVM_GET_NR_MMU_PAGES _IO(KVMIO, 0x45) /* deprecated */ 1461 1455 #define KVM_SET_USER_MEMORY_REGION _IOW(KVMIO, 0x46, \ 1462 1456 struct kvm_userspace_memory_region) 1463 1457 #define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47) ··· 1549 1543 #define KVM_SET_PMU_EVENT_FILTER _IOW(KVMIO, 0xb2, struct kvm_pmu_event_filter) 1550 1544 #define KVM_PPC_SVM_OFF _IO(KVMIO, 0xb3) 1551 1545 #define KVM_ARM_MTE_COPY_TAGS _IOR(KVMIO, 0xb4, struct kvm_arm_copy_mte_tags) 1546 + /* Available with KVM_CAP_COUNTER_OFFSET */ 1547 + #define KVM_ARM_SET_COUNTER_OFFSET _IOW(KVMIO, 0xb5, struct kvm_arm_counter_offset) 1552 1548 1553 1549 /* ioctl for vm fd */ 1554 1550 #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
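
Reviewer's sketch (not part of the patch): a small userspace-style model of the semantics documented for KVM_ARM_SET_COUNTER_OFFSET, namely that a nonzero "reserved" field is rejected and that one VM-wide offset is subtracted from both counter views. The model_* names are hypothetical, and applying the offset to both voffset and poffset is an assumption drawn from the documentation text rather than from the ioctl handler, which is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Same shape as struct kvm_arm_counter_offset in the documentation hunk. */
struct model_counter_offset {
	uint64_t counter_offset;
	uint64_t reserved;
};

/* Hypothetical VM-wide timer data, after struct arch_timer_vm_data. */
struct model_vm_timer_data {
	uint64_t voffset;
	uint64_t poffset;
};

/* Returns 0 on success, -EINVAL if the reserved field is not zero. */
static int model_set_counter_offset(struct model_vm_timer_data *data,
				    const struct model_counter_offset *off)
{
	assert(data && off);

	if (off->reserved)
		return -EINVAL;

	/* Assumption: a single offset covers both virtual and physical views. */
	data->voffset = off->counter_offset;
	data->poffset = off->counter_offset;
	return 0;
}

/* Guest-visible count: host count minus offset, with unsigned wraparound. */
static uint64_t model_guest_counter(uint64_t host_count, uint64_t offset)
{
	return host_count - offset;
}
```

A VMM restoring a guest would typically pick the offset as the current host count minus the saved guest count, so that the guest counter resumes where it stopped.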

+1 -1

tools/include/uapi/linux/kvm.h

··· 1451 1451 #define KVM_CREATE_VCPU _IO(KVMIO, 0x41) 1452 1452 #define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log) 1453 1453 #define KVM_SET_NR_MMU_PAGES _IO(KVMIO, 0x44) 1454 - #define KVM_GET_NR_MMU_PAGES _IO(KVMIO, 0x45) 1454 + #define KVM_GET_NR_MMU_PAGES _IO(KVMIO, 0x45) /* deprecated */ 1455 1455 #define KVM_SET_USER_MEMORY_REGION _IOW(KVMIO, 0x46, \ 1456 1456 struct kvm_userspace_memory_region) 1457 1457 #define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47)

+2

tools/testing/selftests/kvm/Makefile

··· 105 105 TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test 106 106 TEST_GEN_PROGS_x86_64 += x86_64/xapic_ipi_test 107 107 TEST_GEN_PROGS_x86_64 += x86_64/xapic_state_test 108 + TEST_GEN_PROGS_x86_64 += x86_64/xcr0_cpuid_test 108 109 TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test 109 110 TEST_GEN_PROGS_x86_64 += x86_64/debug_regs 110 111 TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test ··· 142 141 TEST_GEN_PROGS_aarch64 += aarch64/hypercalls 143 142 TEST_GEN_PROGS_aarch64 += aarch64/page_fault_test 144 143 TEST_GEN_PROGS_aarch64 += aarch64/psci_test 144 + TEST_GEN_PROGS_aarch64 += aarch64/smccc_filter 145 145 TEST_GEN_PROGS_aarch64 += aarch64/vcpu_width_config 146 146 TEST_GEN_PROGS_aarch64 += aarch64/vgic_init 147 147 TEST_GEN_PROGS_aarch64 += aarch64/vgic_irq

+40 -16

tools/testing/selftests/kvm/aarch64/arch_timer.c

··· 47 47 int nr_iter; 48 48 int timer_period_ms; 49 49 int migration_freq_ms; 50 + struct kvm_arm_counter_offset offset; 50 51 }; 51 52 52 53 static struct test_args test_args = { ··· 55 54 .nr_iter = NR_TEST_ITERS_DEF, 56 55 .timer_period_ms = TIMER_TEST_PERIOD_MS_DEF, 57 56 .migration_freq_ms = TIMER_TEST_MIGRATION_FREQ_MS, 57 + .offset = { .reserved = 1 }, 58 58 }; 59 59 60 60 #define msecs_to_usecs(msec) ((msec) * 1000LL) ··· 123 121 uint64_t xcnt = 0, xcnt_diff_us, cval = 0; 124 122 unsigned long xctl = 0; 125 123 unsigned int timer_irq = 0; 124 + unsigned int accessor; 126 125 127 - if (stage == GUEST_STAGE_VTIMER_CVAL || 128 - stage == GUEST_STAGE_VTIMER_TVAL) { 129 - xctl = timer_get_ctl(VIRTUAL); 130 - timer_set_ctl(VIRTUAL, CTL_IMASK); 131 - xcnt = timer_get_cntct(VIRTUAL); 132 - cval = timer_get_cval(VIRTUAL); 126 + if (intid == IAR_SPURIOUS) 127 + return; 128 + 129 + switch (stage) { 130 + case GUEST_STAGE_VTIMER_CVAL: 131 + case GUEST_STAGE_VTIMER_TVAL: 132 + accessor = VIRTUAL; 133 133 timer_irq = vtimer_irq; 134 - } else if (stage == GUEST_STAGE_PTIMER_CVAL || 135 - stage == GUEST_STAGE_PTIMER_TVAL) { 136 - xctl = timer_get_ctl(PHYSICAL); 137 - timer_set_ctl(PHYSICAL, CTL_IMASK); 138 - xcnt = timer_get_cntct(PHYSICAL); 139 - cval = timer_get_cval(PHYSICAL); 134 + break; 135 + case GUEST_STAGE_PTIMER_CVAL: 136 + case GUEST_STAGE_PTIMER_TVAL: 137 + accessor = PHYSICAL; 140 138 timer_irq = ptimer_irq; 141 - } else { 139 + break; 140 + default: 142 141 GUEST_ASSERT(0); 142 + return; 143 143 } 144 + 145 + xctl = timer_get_ctl(accessor); 146 + if ((xctl & CTL_IMASK) || !(xctl & CTL_ENABLE)) 147 + return; 148 + 149 + timer_set_ctl(accessor, CTL_IMASK); 150 + xcnt = timer_get_cntct(accessor); 151 + cval = timer_get_cval(accessor); 144 152 145 153 xcnt_diff_us = cycles_to_usec(xcnt - shared_data->xcnt); 146 154 ··· 160 148 /* Basic 'timer condition met' check */ 161 149 GUEST_ASSERT_3(xcnt >= cval, xcnt, cval, xcnt_diff_us); 162 150 GUEST_ASSERT_1(xctl & 
CTL_ISTATUS, xctl); 151 + 152 + WRITE_ONCE(shared_data->nr_iter, shared_data->nr_iter + 1); 163 153 } 164 154 165 155 static void guest_irq_handler(struct ex_regs *regs) ··· 171 157 struct test_vcpu_shared_data *shared_data = &vcpu_shared_data[cpu]; 172 158 173 159 guest_validate_irq(intid, shared_data); 174 - 175 - WRITE_ONCE(shared_data->nr_iter, shared_data->nr_iter + 1); 176 160 177 161 gic_set_eoi(intid); 178 162 } ··· 384 372 vm_init_descriptor_tables(vm); 385 373 vm_install_exception_handler(vm, VECTOR_IRQ_CURRENT, guest_irq_handler); 386 374 375 + if (!test_args.offset.reserved) { 376 + if (kvm_has_cap(KVM_CAP_COUNTER_OFFSET)) 377 + vm_ioctl(vm, KVM_ARM_SET_COUNTER_OFFSET, &test_args.offset); 378 + else 379 + TEST_FAIL("no support for global offset\n"); 380 + } 381 + 387 382 for (i = 0; i < nr_vcpus; i++) 388 383 vcpu_init_descriptor_tables(vcpus[i]); 389 384 ··· 422 403 TIMER_TEST_PERIOD_MS_DEF); 423 404 pr_info("\t-m: Frequency (in ms) of vCPUs to migrate to different pCPU. 0 to turn off (default: %u)\n", 424 405 TIMER_TEST_MIGRATION_FREQ_MS); 406 + pr_info("\t-o: Counter offset (in counter cycles, default: 0)\n"); 425 407 pr_info("\t-h: print this help screen\n"); 426 408 } 427 409 ··· 430 410 { 431 411 int opt; 432 412 433 - while ((opt = getopt(argc, argv, "hn:i:p:m:")) != -1) { 413 + while ((opt = getopt(argc, argv, "hn:i:p:m:o:")) != -1) { 434 414 switch (opt) { 435 415 case 'n': 436 416 test_args.nr_vcpus = atoi_positive("Number of vCPUs", optarg); ··· 448 428 break; 449 429 case 'm': 450 430 test_args.migration_freq_ms = atoi_non_negative("Frequency", optarg); 431 + break; 432 + case 'o': 433 + test_args.offset.counter_offset = strtol(optarg, NULL, 0); 434 + test_args.offset.reserved = 0; 451 435 break; 452 436 case 'h': 453 437 default:

+9 -6

tools/testing/selftests/kvm/aarch64/get-reg-list.c

··· 651 651 * The current blessed list was primed with the output of kernel version 652 652 * v4.15 with --core-reg-fixup and then later updated with new registers. 653 653 * 654 - * The blessed list is up to date with kernel version v5.13-rc3 654 + * The blessed list is up to date with kernel version v6.4 (or so we hope) 655 655 */ 656 656 static __u64 base_regs[] = { 657 657 KVM_REG_ARM64 | KVM_REG_SIZE_U64 | KVM_REG_ARM_CORE | KVM_REG_ARM_CORE_REG(regs.regs[0]), ··· 807 807 ARM64_SYS_REG(3, 0, 0, 3, 7), 808 808 ARM64_SYS_REG(3, 0, 0, 4, 0), /* ID_AA64PFR0_EL1 */ 809 809 ARM64_SYS_REG(3, 0, 0, 4, 1), /* ID_AA64PFR1_EL1 */ 810 - ARM64_SYS_REG(3, 0, 0, 4, 2), 810 + ARM64_SYS_REG(3, 0, 0, 4, 2), /* ID_AA64PFR2_EL1 */ 811 811 ARM64_SYS_REG(3, 0, 0, 4, 3), 812 812 ARM64_SYS_REG(3, 0, 0, 4, 4), /* ID_AA64ZFR0_EL1 */ 813 - ARM64_SYS_REG(3, 0, 0, 4, 5), 813 + ARM64_SYS_REG(3, 0, 0, 4, 5), /* ID_AA64SMFR0_EL1 */ 814 814 ARM64_SYS_REG(3, 0, 0, 4, 6), 815 815 ARM64_SYS_REG(3, 0, 0, 4, 7), 816 816 ARM64_SYS_REG(3, 0, 0, 5, 0), /* ID_AA64DFR0_EL1 */ ··· 823 823 ARM64_SYS_REG(3, 0, 0, 5, 7), 824 824 ARM64_SYS_REG(3, 0, 0, 6, 0), /* ID_AA64ISAR0_EL1 */ 825 825 ARM64_SYS_REG(3, 0, 0, 6, 1), /* ID_AA64ISAR1_EL1 */ 826 - ARM64_SYS_REG(3, 0, 0, 6, 2), 826 + ARM64_SYS_REG(3, 0, 0, 6, 2), /* ID_AA64ISAR2_EL1 */ 827 827 ARM64_SYS_REG(3, 0, 0, 6, 3), 828 828 ARM64_SYS_REG(3, 0, 0, 6, 4), 829 829 ARM64_SYS_REG(3, 0, 0, 6, 5), ··· 832 832 ARM64_SYS_REG(3, 0, 0, 7, 0), /* ID_AA64MMFR0_EL1 */ 833 833 ARM64_SYS_REG(3, 0, 0, 7, 1), /* ID_AA64MMFR1_EL1 */ 834 834 ARM64_SYS_REG(3, 0, 0, 7, 2), /* ID_AA64MMFR2_EL1 */ 835 - ARM64_SYS_REG(3, 0, 0, 7, 3), 836 - ARM64_SYS_REG(3, 0, 0, 7, 4), 835 + ARM64_SYS_REG(3, 0, 0, 7, 3), /* ID_AA64MMFR3_EL1 */ 836 + ARM64_SYS_REG(3, 0, 0, 7, 4), /* ID_AA64MMFR4_EL1 */ 837 837 ARM64_SYS_REG(3, 0, 0, 7, 5), 838 838 ARM64_SYS_REG(3, 0, 0, 7, 6), 839 839 ARM64_SYS_REG(3, 0, 0, 7, 7), ··· 858 858 ARM64_SYS_REG(3, 2, 0, 0, 0), /* CSSELR_EL1 */ 859 859 
ARM64_SYS_REG(3, 3, 13, 0, 2), /* TPIDR_EL0 */ 860 860 ARM64_SYS_REG(3, 3, 13, 0, 3), /* TPIDRRO_EL0 */ 861 + ARM64_SYS_REG(3, 3, 14, 0, 1), /* CNTPCT_EL0 */ 862 + ARM64_SYS_REG(3, 3, 14, 2, 1), /* CNTP_CTL_EL0 */ 863 + ARM64_SYS_REG(3, 3, 14, 2, 2), /* CNTP_CVAL_EL0 */ 861 864 ARM64_SYS_REG(3, 4, 3, 0, 0), /* DACR32_EL2 */ 862 865 ARM64_SYS_REG(3, 4, 5, 0, 1), /* IFSR32_EL2 */ 863 866 ARM64_SYS_REG(3, 4, 5, 3, 0), /* FPEXC32_EL2 */
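
Reviewer's sketch (not part of the patch): how an entry such as ARM64_SYS_REG(3, 3, 14, 0, 1) in the blessed list above maps to a 64-bit register ID. The constants and shift amounts reflect my reading of the arm64 uapi header and are assumptions here; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values of the corresponding uapi constants. */
#define MODEL_KVM_REG_ARM64         0x6000000000000000ULL
#define MODEL_KVM_REG_SIZE_U64      0x0030000000000000ULL
#define MODEL_KVM_REG_ARM64_SYSREG  (0x0013ULL << 16)

/* Pack the five system register encoding fields into a register ID. */
static uint64_t model_arm64_sys_reg(unsigned int op0, unsigned int op1,
				    unsigned int crn, unsigned int crm,
				    unsigned int op2)
{
	/* Field widths: op0 2 bits, op1 3 bits, CRn 4, CRm 4, op2 3. */
	assert(op0 < 4 && op1 < 8 && crn < 16 && crm < 16 && op2 < 8);

	return MODEL_KVM_REG_ARM64 | MODEL_KVM_REG_SIZE_U64 |
	       MODEL_KVM_REG_ARM64_SYSREG |
	       ((uint64_t)op0 << 14) | ((uint64_t)op1 << 11) |
	       ((uint64_t)crn << 7) | ((uint64_t)crm << 3) | (uint64_t)op2;
}
```

Under these assumptions the three physical timer registers added to the list differ only in their low encoding bits, so the resulting IDs sit next to each other.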

+268

tools/testing/selftests/kvm/aarch64/smccc_filter.c

···
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * smccc_filter - Tests for the SMCCC filter UAPI.
+ *
+ * Copyright (c) 2023 Google LLC
+ *
+ * This test includes:
+ *  - Tests that the UAPI constraints are upheld by KVM. For example, userspace
+ *    is prevented from filtering the architecture range of SMCCC calls.
+ *  - Test that the filter actions (DENIED, FWD_TO_USER) work as intended.
+ */
+
+#include <linux/arm-smccc.h>
+#include <linux/psci.h>
+#include <stdint.h>
+
+#include "processor.h"
+#include "test_util.h"
+
+enum smccc_conduit {
+	HVC_INSN,
+	SMC_INSN,
+};
+
+#define for_each_conduit(conduit)					\
+	for (conduit = HVC_INSN; conduit <= SMC_INSN; conduit++)
+
+static void guest_main(uint32_t func_id, enum smccc_conduit conduit)
+{
+	struct arm_smccc_res res;
+
+	if (conduit == SMC_INSN)
+		smccc_smc(func_id, 0, 0, 0, 0, 0, 0, 0, &res);
+	else
+		smccc_hvc(func_id, 0, 0, 0, 0, 0, 0, 0, &res);
+
+	GUEST_SYNC(res.a0);
+}
+
+static int __set_smccc_filter(struct kvm_vm *vm, uint32_t start, uint32_t nr_functions,
+			      enum kvm_smccc_filter_action action)
+{
+	struct kvm_smccc_filter filter = {
+		.base = start,
+		.nr_functions = nr_functions,
+		.action = action,
+	};
+
+	return __kvm_device_attr_set(vm->fd, KVM_ARM_VM_SMCCC_CTRL,
+				     KVM_ARM_VM_SMCCC_FILTER, &filter);
+}
+
+static void set_smccc_filter(struct kvm_vm *vm, uint32_t start, uint32_t nr_functions,
+			     enum kvm_smccc_filter_action action)
+{
+	int ret = __set_smccc_filter(vm, start, nr_functions, action);
+
+	TEST_ASSERT(!ret, "failed to configure SMCCC filter: %d", ret);
+}
+
+static struct kvm_vm *setup_vm(struct kvm_vcpu **vcpu)
+{
+	struct kvm_vcpu_init init;
+	struct kvm_vm *vm;
+
+	vm = vm_create(1);
+	vm_ioctl(vm, KVM_ARM_PREFERRED_TARGET, &init);
+
+	/*
+	 * Enable in-kernel emulation of PSCI to ensure that calls are denied
+	 * due to the SMCCC filter, not because of KVM.
+	 */
+	init.features[0] |= (1 << KVM_ARM_VCPU_PSCI_0_2);
+
+	*vcpu = aarch64_vcpu_add(vm, 0, &init, guest_main);
+	return vm;
+}
+
+static void test_pad_must_be_zero(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm = setup_vm(&vcpu);
+	struct kvm_smccc_filter filter = {
+		.base = PSCI_0_2_FN_PSCI_VERSION,
+		.nr_functions = 1,
+		.action = KVM_SMCCC_FILTER_DENY,
+		.pad = { -1 },
+	};
+	int r;
+
+	r = __kvm_device_attr_set(vm->fd, KVM_ARM_VM_SMCCC_CTRL,
+				  KVM_ARM_VM_SMCCC_FILTER, &filter);
+	TEST_ASSERT(r < 0 && errno == EINVAL,
+		    "Setting filter with nonzero padding should return EINVAL");
+}
+
+/* Ensure that userspace cannot filter the Arm Architecture SMCCC range */
+static void test_filter_reserved_range(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm = setup_vm(&vcpu);
+	uint32_t smc64_fn;
+	int r;
+
+	r = __set_smccc_filter(vm, ARM_SMCCC_ARCH_WORKAROUND_1,
+			       1, KVM_SMCCC_FILTER_DENY);
+	TEST_ASSERT(r < 0 && errno == EEXIST,
+		    "Attempt to filter reserved range should return EEXIST");
+
+	smc64_fn = ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, ARM_SMCCC_SMC_64,
+				      0, 0);
+
+	r = __set_smccc_filter(vm, smc64_fn, 1, KVM_SMCCC_FILTER_DENY);
+	TEST_ASSERT(r < 0 && errno == EEXIST,
+		    "Attempt to filter reserved range should return EEXIST");
+
+	kvm_vm_free(vm);
+}
+
+static void test_invalid_nr_functions(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm = setup_vm(&vcpu);
+	int r;
+
+	r = __set_smccc_filter(vm, PSCI_0_2_FN64_CPU_ON, 0, KVM_SMCCC_FILTER_DENY);
+	TEST_ASSERT(r < 0 && errno == EINVAL,
+		    "Attempt to filter 0 functions should return EINVAL");
+
+	kvm_vm_free(vm);
+}
+
+static void test_overflow_nr_functions(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm = setup_vm(&vcpu);
+	int r;
+
+	r = __set_smccc_filter(vm, ~0, ~0, KVM_SMCCC_FILTER_DENY);
+	TEST_ASSERT(r < 0 && errno == EINVAL,
+		    "Attempt to overflow filter range should return EINVAL");
+
+	kvm_vm_free(vm);
+}
+
+static void test_reserved_action(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm = setup_vm(&vcpu);
+	int r;
+
+	r = __set_smccc_filter(vm, PSCI_0_2_FN64_CPU_ON, 1, -1);
+	TEST_ASSERT(r < 0 && errno == EINVAL,
+		    "Attempt to use reserved filter action should return EINVAL");
+
+	kvm_vm_free(vm);
+}
+
+
+/* Test that overlapping configurations of the SMCCC filter are rejected */
+static void test_filter_overlap(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm = setup_vm(&vcpu);
+	int r;
+
+	set_smccc_filter(vm, PSCI_0_2_FN64_CPU_ON, 1, KVM_SMCCC_FILTER_DENY);
+
+	r = __set_smccc_filter(vm, PSCI_0_2_FN64_CPU_ON, 1, KVM_SMCCC_FILTER_DENY);
+	TEST_ASSERT(r < 0 && errno == EEXIST,
+		    "Attempt to filter already configured range should return EEXIST");
+
+	kvm_vm_free(vm);
+}
+
+static void expect_call_denied(struct kvm_vcpu *vcpu)
+{
+	struct ucall uc;
+
+	if (get_ucall(vcpu, &uc) != UCALL_SYNC)
+		TEST_FAIL("Unexpected ucall: %lu\n", uc.cmd);
+
+	TEST_ASSERT(uc.args[1] == SMCCC_RET_NOT_SUPPORTED,
+		    "Unexpected SMCCC return code: %lu", uc.args[1]);
+}
+
+/* Denied SMCCC calls have a return code of SMCCC_RET_NOT_SUPPORTED */
+static void test_filter_denied(void)
+{
+	enum smccc_conduit conduit;
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	for_each_conduit(conduit) {
+		vm = setup_vm(&vcpu);
+
+		set_smccc_filter(vm, PSCI_0_2_FN_PSCI_VERSION, 1, KVM_SMCCC_FILTER_DENY);
+		vcpu_args_set(vcpu, 2, PSCI_0_2_FN_PSCI_VERSION, conduit);
+
+		vcpu_run(vcpu);
+		expect_call_denied(vcpu);
+
+		kvm_vm_free(vm);
+	}
+}
+
+static void expect_call_fwd_to_user(struct kvm_vcpu *vcpu, uint32_t func_id,
+				    enum smccc_conduit conduit)
+{
+	struct kvm_run *run = vcpu->run;
+
+	TEST_ASSERT(run->exit_reason == KVM_EXIT_HYPERCALL,
+		    "Unexpected exit reason: %u", run->exit_reason);
+	TEST_ASSERT(run->hypercall.nr == func_id,
+		    "Unexpected SMCCC function: %llu", run->hypercall.nr);
+
+	if (conduit == SMC_INSN)
+		TEST_ASSERT(run->hypercall.flags & KVM_HYPERCALL_EXIT_SMC,
+			    "KVM_HYPERCALL_EXIT_SMC is not set");
+	else
+		TEST_ASSERT(!(run->hypercall.flags & KVM_HYPERCALL_EXIT_SMC),
+			    "KVM_HYPERCALL_EXIT_SMC is set");
+}
+
+/* SMCCC calls forwarded to userspace cause KVM_EXIT_HYPERCALL exits */
+static void test_filter_fwd_to_user(void)
+{
+	enum smccc_conduit conduit;
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	for_each_conduit(conduit) {
+		vm = setup_vm(&vcpu);
+
+		set_smccc_filter(vm, PSCI_0_2_FN_PSCI_VERSION, 1, KVM_SMCCC_FILTER_FWD_TO_USER);
+		vcpu_args_set(vcpu, 2, PSCI_0_2_FN_PSCI_VERSION, conduit);
+
+		vcpu_run(vcpu);
+		expect_call_fwd_to_user(vcpu, PSCI_0_2_FN_PSCI_VERSION, conduit);
+
+		kvm_vm_free(vm);
+	}
+}
+
+static bool kvm_supports_smccc_filter(void)
+{
+	struct kvm_vm *vm = vm_create_barebones();
+	int r;
+
+	r = __kvm_has_device_attr(vm->fd, KVM_ARM_VM_SMCCC_CTRL, KVM_ARM_VM_SMCCC_FILTER);
+
+	kvm_vm_free(vm);
+	return !r;
+}
+
+int main(void)
+{
+	TEST_REQUIRE(kvm_supports_smccc_filter());
+
+	test_pad_must_be_zero();
+	test_invalid_nr_functions();
+	test_overflow_nr_functions();
+	test_reserved_action();
+	test_filter_reserved_range();
+	test_filter_overlap();
+	test_filter_denied();
+	test_filter_fwd_to_user();
+}
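
Note on the function IDs used by the test above: the filter ranges are expressed as 32-bit SMCCC function IDs. The following is a minimal sketch of how such an ID is composed, assuming the field layout from the Arm SMC Calling Convention (bit 31 fast call, bit 30 SMC64, bits 29:24 owner, bits 15:0 function number). The `sketch_` names are hypothetical and are not part of the selftest or of the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ARM_SMCCC_CALL_VAL(); layout assumed from the SMCCC spec. */
static uint32_t sketch_smccc_call_val(uint32_t fast, uint32_t smc64,
				      uint32_t owner, uint32_t func_num)
{
	/* Both flags are single bits. */
	assert(fast <= 1 && smc64 <= 1);

	return (fast << 31) |			/* 1 = fast call, 0 = yielding call */
	       (smc64 << 30) |			/* 1 = SMC64, 0 = SMC32 */
	       ((owner & 0x3Fu) << 24) |	/* owning entity, e.g. 0 = Arm architecture */
	       (func_num & 0xFFFFu);		/* function number within the owner's range */
}

/* Extract the owner field, which decides which service range an ID falls in. */
static uint32_t sketch_smccc_owner(uint32_t func_id)
{
	return (func_id >> 24) & 0x3Fu;
}
```

Under that layout, owner 0 with the fast and SMC64 bits set gives the `smc64_fn` value that `test_filter_reserved_range()` expects KVM to refuse, and owner 4 (standard secure services) gives the PSCI IDs used by the other tests.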

+1

tools/testing/selftests/kvm/config

···
 CONFIG_KVM_INTEL=y
 CONFIG_KVM_AMD=y
 CONFIG_USERFAULTFD=y
+CONFIG_IDLE_PAGE_TRACKING=y

+1 -1

tools/testing/selftests/kvm/demand_paging_test.c

···
 		ts_diff.tv_sec, ts_diff.tv_nsec);
 	pr_info("Overall demand paging rate: %f pgs/sec\n",
 		memstress_args.vcpu_args[0].pages * nr_vcpus /
-		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / 100000000.0));
+		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / NSEC_PER_SEC));

 	memstress_destroy_vm(vm);

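
Note on the hunk above: the old divisor was 10^8 rather than 10^9, so the nanosecond part of the elapsed time was overstated tenfold. A minimal sketch of the corrected conversion follows, assuming a normalized `struct timespec`. The `sketch_` helpers and `SKETCH_NSEC_PER_SEC` are hypothetical names, not the selftest's own.

```c
#include <assert.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L	/* 10^9 nanoseconds per second */

/* Convert a normalized timespec to fractional seconds. */
static double sketch_timespec_to_sec(struct timespec ts)
{
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < SKETCH_NSEC_PER_SEC);

	return (double)ts.tv_sec + (double)ts.tv_nsec / SKETCH_NSEC_PER_SEC;
}

/* Pages per second over the elapsed interval; 0.0 if no time has elapsed. */
static double sketch_pages_per_sec(unsigned long pages, struct timespec elapsed)
{
	double sec = sketch_timespec_to_sec(elapsed);

	return sec > 0.0 ? (double)pages / sec : 0.0;
}
```

For an elapsed time of 2 s plus 500000000 ns this yields 2.5 s, whereas dividing the nanoseconds by 10^8 would have produced 7 s and a correspondingly lower reported rate.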

+13

tools/testing/selftests/kvm/include/aarch64/processor.h

···
 	       uint64_t arg2, uint64_t arg3, uint64_t arg4, uint64_t arg5,
 	       uint64_t arg6, struct arm_smccc_res *res);

+/**
+ * smccc_smc - Invoke a SMCCC function using the smc conduit
+ * @function_id: the SMCCC function to be called
+ * @arg0-arg6: SMCCC function arguments, corresponding to registers x1-x7
+ * @res: pointer to write the return values from registers x0-x3
+ *
+ */
+void smccc_smc(uint32_t function_id, uint64_t arg0, uint64_t arg1,
+	       uint64_t arg2, uint64_t arg3, uint64_t arg4, uint64_t arg5,
+	       uint64_t arg6, struct arm_smccc_res *res);
+
+
+
 uint32_t guest_get_vcpuid(void);

 #endif /* SELFTEST_KVM_PROCESSOR_H */

+1

tools/testing/selftests/kvm/include/kvm_util_base.h

···
 int open_path_or_exit(const char *path, int flags);
 int open_kvm_dev_path_or_exit(void);

+bool get_kvm_param_bool(const char *param);
 bool get_kvm_intel_param_bool(const char *param);
 bool get_kvm_amd_param_bool(const char *param);


+108 -16

tools/testing/selftests/kvm/include/x86_64/processor.h

···
 #define X86_CR4_SMAP		(1ul << 21)
 #define X86_CR4_PKE		(1ul << 22)

+struct xstate_header {
+	u64 xstate_bv;
+	u64 xcomp_bv;
+	u64 reserved[6];
+} __attribute__((packed));
+
+struct xstate {
+	u8 i387[512];
+	struct xstate_header header;
+	u8 extended_state_area[0];
+} __attribute__ ((packed, aligned (64)));
+
+#define XFEATURE_MASK_FP		BIT_ULL(0)
+#define XFEATURE_MASK_SSE		BIT_ULL(1)
+#define XFEATURE_MASK_YMM		BIT_ULL(2)
+#define XFEATURE_MASK_BNDREGS		BIT_ULL(3)
+#define XFEATURE_MASK_BNDCSR		BIT_ULL(4)
+#define XFEATURE_MASK_OPMASK		BIT_ULL(5)
+#define XFEATURE_MASK_ZMM_Hi256		BIT_ULL(6)
+#define XFEATURE_MASK_Hi16_ZMM		BIT_ULL(7)
+#define XFEATURE_MASK_XTILE_CFG		BIT_ULL(17)
+#define XFEATURE_MASK_XTILE_DATA	BIT_ULL(18)
+
+#define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK | \
+					 XFEATURE_MASK_ZMM_Hi256 | \
+					 XFEATURE_MASK_Hi16_ZMM)
+#define XFEATURE_MASK_XTILE		(XFEATURE_MASK_XTILE_DATA | \
+					 XFEATURE_MASK_XTILE_CFG)
+
 /* Note, these are ordered alphabetically to match kvm_cpuid_entry2. Eww. */
 enum cpuid_output_regs {
 	KVM_CPUID_EAX,
···
 #define X86_FEATURE_XTILEDATA		KVM_X86_CPU_FEATURE(0xD, 0, EAX, 18)
 #define X86_FEATURE_XSAVES		KVM_X86_CPU_FEATURE(0xD, 1, EAX, 3)
 #define X86_FEATURE_XFD			KVM_X86_CPU_FEATURE(0xD, 1, EAX, 4)
+#define X86_FEATURE_XTILEDATA_XFD	KVM_X86_CPU_FEATURE(0xD, 18, ECX, 2)

 /*
  * Extended Leafs, a.k.a. AMD defined
···
 #define X86_PROPERTY_PMU_NR_GP_COUNTERS		KVM_X86_CPU_PROPERTY(0xa, 0, EAX, 8, 15)
 #define X86_PROPERTY_PMU_EBX_BIT_VECTOR_LENGTH	KVM_X86_CPU_PROPERTY(0xa, 0, EAX, 24, 31)

+#define X86_PROPERTY_SUPPORTED_XCR0_LO		KVM_X86_CPU_PROPERTY(0xd, 0, EAX, 0, 31)
 #define X86_PROPERTY_XSTATE_MAX_SIZE_XCR0	KVM_X86_CPU_PROPERTY(0xd, 0, EBX, 0, 31)
 #define X86_PROPERTY_XSTATE_MAX_SIZE		KVM_X86_CPU_PROPERTY(0xd, 0, ECX, 0, 31)
+#define X86_PROPERTY_SUPPORTED_XCR0_HI		KVM_X86_CPU_PROPERTY(0xd, 0, EDX, 0, 31)
+
 #define X86_PROPERTY_XSTATE_TILE_SIZE		KVM_X86_CPU_PROPERTY(0xd, 18, EAX, 0, 31)
 #define X86_PROPERTY_XSTATE_TILE_OFFSET		KVM_X86_CPU_PROPERTY(0xd, 18, EBX, 0, 31)
+#define X86_PROPERTY_AMX_MAX_PALETTE_TABLES	KVM_X86_CPU_PROPERTY(0x1d, 0, EAX, 0, 31)
 #define X86_PROPERTY_AMX_TOTAL_TILE_BYTES	KVM_X86_CPU_PROPERTY(0x1d, 1, EAX, 0, 15)
 #define X86_PROPERTY_AMX_BYTES_PER_TILE		KVM_X86_CPU_PROPERTY(0x1d, 1, EAX, 16, 31)
 #define X86_PROPERTY_AMX_BYTES_PER_ROW		KVM_X86_CPU_PROPERTY(0x1d, 1, EBX, 0, 15)
···
 	__asm__ __volatile__("mov %0, %%cr4" : : "r" (val) : "memory");
 }

+static inline u64 xgetbv(u32 index)
+{
+	u32 eax, edx;
+
+	__asm__ __volatile__("xgetbv;"
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (index));
+	return eax | ((u64)edx << 32);
+}
+
+static inline void xsetbv(u32 index, u64 value)
+{
+	u32 eax = value;
+	u32 edx = value >> 32;
+
+	__asm__ __volatile__("xsetbv" :: "a" (eax), "d" (edx), "c" (index));
+}
+
 static inline struct desc_ptr get_gdt(void)
 {
 	struct desc_ptr gdt;
···

 	return nr_bits > feature.anti_feature.bit &&
 	       !this_cpu_has(feature.anti_feature);
+}
+
+static __always_inline uint64_t this_cpu_supported_xcr0(void)
+{
+	if (!this_cpu_has_p(X86_PROPERTY_SUPPORTED_XCR0_LO))
+		return 0;
+
+	return this_cpu_property(X86_PROPERTY_SUPPORTED_XCR0_LO) |
+	       ((uint64_t)this_cpu_property(X86_PROPERTY_SUPPORTED_XCR0_HI) << 32);
 }

 typedef u32 __attribute__((vector_size(16))) sse128_t;
···
 uint64_t vcpu_get_msr(struct kvm_vcpu *vcpu, uint64_t msr_index);
 int _vcpu_set_msr(struct kvm_vcpu *vcpu, uint64_t msr_index, uint64_t msr_value);

-static inline void vcpu_set_msr(struct kvm_vcpu *vcpu, uint64_t msr_index,
-				uint64_t msr_value)
-{
-	int r = _vcpu_set_msr(vcpu, msr_index, msr_value);
+/*
+ * Assert on an MSR access(es) and pretty print the MSR name when possible.
+ * Note, the caller provides the stringified name so that the name of macro is
+ * printed, not the value the macro resolves to (due to macro expansion).
+ */
+#define TEST_ASSERT_MSR(cond, fmt, msr, str, args...)				\
+do {										\
+	if (__builtin_constant_p(msr)) {					\
+		TEST_ASSERT(cond, fmt, str, args);				\
+	} else if (!(cond)) {							\
+		char buf[16];							\
+										\
+		snprintf(buf, sizeof(buf), "MSR 0x%x", msr);			\
+		TEST_ASSERT(cond, fmt, buf, args);				\
+	}									\
+} while (0)

-	TEST_ASSERT(r == 1, KVM_IOCTL_ERROR(KVM_SET_MSRS, r));
+/*
+ * Returns true if KVM should return the last written value when reading an MSR
+ * from userspace, e.g. the MSR isn't a command MSR, doesn't emulate state that
+ * is changing, etc. This is NOT an exhaustive list! The intent is to filter
+ * out MSRs that are not durable _and_ that a selftest wants to write.
+ */
+static inline bool is_durable_msr(uint32_t msr)
+{
+	return msr != MSR_IA32_TSC;
 }

+#define vcpu_set_msr(vcpu, msr, val)							\
+do {											\
+	uint64_t r, v = val;								\
+											\
+	TEST_ASSERT_MSR(_vcpu_set_msr(vcpu, msr, v) == 1,				\
+			"KVM_SET_MSRS failed on %s, value = 0x%lx", msr, #msr, v);	\
+	if (!is_durable_msr(msr))							\
+		break;									\
+	r = vcpu_get_msr(vcpu, msr);							\
+	TEST_ASSERT_MSR(r == v, "Set %s to '0x%lx', got back '0x%lx'", msr, #msr, v, r);\
+} while (0)

 void kvm_get_cpu_address_width(unsigned int *pa_bits, unsigned int *va_bits);
 bool vm_is_unrestricted_guest(struct kvm_vm *vm);
···
 	return kvm_asm_safe("wrmsr", "a"(val & -1u), "d"(val >> 32), "c"(msr));
 }

+static inline uint8_t xsetbv_safe(uint32_t index, uint64_t value)
+{
+	u32 eax = value;
+	u32 edx = value >> 32;
+
+	return kvm_asm_safe("xsetbv", "a" (eax), "d" (edx), "c" (index));
+}
+
 bool kvm_is_tdp_enabled(void);

 uint64_t *__vm_get_page_table_entry(struct kvm_vm *vm, uint64_t vaddr,
···
 uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 void xen_hypercall(uint64_t nr, uint64_t a0, void *a1);

-void __vm_xsave_require_permission(int bit, const char *name);
+void __vm_xsave_require_permission(uint64_t xfeature, const char *name);

-#define vm_xsave_require_permission(perm)	\
-	__vm_xsave_require_permission(perm, #perm)
+#define vm_xsave_require_permission(xfeature)	\
+	__vm_xsave_require_permission(xfeature, #xfeature)

 enum pg_level {
 	PG_LEVEL_NONE,
···
 #define X86_CR0_NW	(1UL<<29) /* Not Write-through */
 #define X86_CR0_CD	(1UL<<30) /* Cache Disable */
 #define X86_CR0_PG	(1UL<<31) /* Paging */
-
-#define XSTATE_XTILE_CFG_BIT		17
-#define XSTATE_XTILE_DATA_BIT		18
-
-#define XSTATE_XTILE_CFG_MASK		(1ULL << XSTATE_XTILE_CFG_BIT)
-#define XSTATE_XTILE_DATA_MASK		(1ULL << XSTATE_XTILE_DATA_BIT)
-#define XFEATURE_XTILE_MASK		(XSTATE_XTILE_CFG_MASK | \
-					XSTATE_XTILE_DATA_MASK)

 #define PFERR_PRESENT_BIT 0
 #define PFERR_WRITE_BIT 1
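
Note on `xgetbv()`, `xsetbv()` and `this_cpu_supported_xcr0()` above: all three move a 64-bit XCR0-style value through a pair of 32-bit halves (EAX/EDX, or the LO/HI CPUID properties). A minimal sketch of that split and join without any inline assembly follows. The `sketch_` names are hypothetical, and the two bit positions are taken from the `XFEATURE_MASK_XTILE_*` definitions in the hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BIT_ULL(n)	(1ULL << (n))
/* Same bit positions as XFEATURE_MASK_XTILE_CFG / XFEATURE_MASK_XTILE_DATA above. */
#define SKETCH_XTILE_CFG	SKETCH_BIT_ULL(17)
#define SKETCH_XTILE_DATA	SKETCH_BIT_ULL(18)
#define SKETCH_XTILE		(SKETCH_XTILE_CFG | SKETCH_XTILE_DATA)

/* Join low and high halves; the cast must happen before the shift. */
static uint64_t sketch_join_halves(uint32_t lo, uint32_t hi)
{
	return lo | ((uint64_t)hi << 32);
}

/* Split a 64-bit value into the halves an XSETBV-style write would load. */
static void sketch_split_halves(uint64_t value, uint32_t *lo, uint32_t *hi)
{
	assert(lo != NULL && hi != NULL);

	*lo = (uint32_t)value;
	*hi = (uint32_t)(value >> 32);
}
```

Because the two halves occupy disjoint bits, joining them with `|` gives the same result as the `+` used by the helper that this series removes from amx_test.c; `|` states the intent more directly.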

+62 -29

tools/testing/selftests/kvm/lib/aarch64/processor.c

···
 	return (gva >> vm->page_shift) & mask;
 }

-static uint64_t pte_addr(struct kvm_vm *vm, uint64_t entry)
+static uint64_t addr_pte(struct kvm_vm *vm, uint64_t pa, uint64_t attrs)
 {
-	uint64_t mask = ((1UL << (vm->va_bits - vm->page_shift)) - 1) << vm->page_shift;
-	return entry & mask;
+	uint64_t pte;
+
+	pte = pa & GENMASK(47, vm->page_shift);
+	if (vm->page_shift == 16)
+		pte |= FIELD_GET(GENMASK(51, 48), pa) << 12;
+	pte |= attrs;
+
+	return pte;
+}
+
+static uint64_t pte_addr(struct kvm_vm *vm, uint64_t pte)
+{
+	uint64_t pa;
+
+	pa = pte & GENMASK(47, vm->page_shift);
+	if (vm->page_shift == 16)
+		pa |= FIELD_GET(GENMASK(15, 12), pte) << 48;
+
+	return pa;
 }

 static uint64_t ptrs_per_pgd(struct kvm_vm *vm)
···

 	ptep = addr_gpa2hva(vm, vm->pgd) + pgd_index(vm, vaddr) * 8;
 	if (!*ptep)
-		*ptep = vm_alloc_page_table(vm) | 3;
+		*ptep = addr_pte(vm, vm_alloc_page_table(vm), 3);

 	switch (vm->pgtable_levels) {
 	case 4:
 		ptep = addr_gpa2hva(vm, pte_addr(vm, *ptep)) + pud_index(vm, vaddr) * 8;
 		if (!*ptep)
-			*ptep = vm_alloc_page_table(vm) | 3;
+			*ptep = addr_pte(vm, vm_alloc_page_table(vm), 3);
 		/* fall through */
 	case 3:
 		ptep = addr_gpa2hva(vm, pte_addr(vm, *ptep)) + pmd_index(vm, vaddr) * 8;
 		if (!*ptep)
-			*ptep = vm_alloc_page_table(vm) | 3;
+			*ptep = addr_pte(vm, vm_alloc_page_table(vm), 3);
 		/* fall through */
 	case 2:
 		ptep = addr_gpa2hva(vm, pte_addr(vm, *ptep)) + pte_index(vm, vaddr) * 8;
···
 		TEST_FAIL("Page table levels must be 2, 3, or 4");
 	}

-	*ptep = paddr | 3;
-	*ptep |= (attr_idx << 2) | (1 << 10) /* Access Flag */;
+	*ptep = addr_pte(vm, paddr, (attr_idx << 2) | (1 << 10) | 3);  /* AF */
 }

 void virt_arch_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr)
···
 {
 	struct kvm_vcpu_init default_init = { .target = -1, };
 	struct kvm_vm *vm = vcpu->vm;
-	uint64_t sctlr_el1, tcr_el1;
+	uint64_t sctlr_el1, tcr_el1, ttbr0_el1;

 	if (!init)
 		init = &default_init;
···
 		TEST_FAIL("Unknown guest mode, mode: 0x%x", vm->mode);
 	}

+	ttbr0_el1 = vm->pgd & GENMASK(47, vm->page_shift);
+
 	/* Configure output size */
 	switch (vm->mode) {
 	case VM_MODE_P52V48_64K:
 		tcr_el1 |= 6ul << 32; /* IPS = 52 bits */
+		ttbr0_el1 |= FIELD_GET(GENMASK(51, 48), vm->pgd) << 2;
 		break;
 	case VM_MODE_P48V48_4K:
 	case VM_MODE_P48V48_16K:
···
 	vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_SCTLR_EL1), sctlr_el1);
 	vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_TCR_EL1), tcr_el1);
 	vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_MAIR_EL1), DEFAULT_MAIR_EL1);
-	vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_TTBR0_EL1), vm->pgd);
+	vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_TTBR0_EL1), ttbr0_el1);
 	vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_TPIDR_EL1), vcpu->id);
 }

···
 	close(kvm_fd);
 }

+#define __smccc_call(insn, function_id, arg0, arg1, arg2, arg3, arg4, arg5,	\
+		     arg6, res)							\
+	asm volatile("mov   w0, %w[function_id]\n"				\
+		     "mov   x1, %[arg0]\n"					\
+		     "mov   x2, %[arg1]\n"					\
+		     "mov   x3, %[arg2]\n"					\
+		     "mov   x4, %[arg3]\n"					\
+		     "mov   x5, %[arg4]\n"					\
+		     "mov   x6, %[arg5]\n"					\
+		     "mov   x7, %[arg6]\n"					\
+		     #insn  "#0\n"						\
+		     "mov   %[res0], x0\n"					\
+		     "mov   %[res1], x1\n"					\
+		     "mov   %[res2], x2\n"					\
+		     "mov   %[res3], x3\n"					\
+		     : [res0] "=r"(res->a0), [res1] "=r"(res->a1),		\
+		       [res2] "=r"(res->a2), [res3] "=r"(res->a3)		\
+		     : [function_id] "r"(function_id), [arg0] "r"(arg0),	\
+		       [arg1] "r"(arg1), [arg2] "r"(arg2), [arg3] "r"(arg3),	\
+		       [arg4] "r"(arg4), [arg5] "r"(arg5), [arg6] "r"(arg6)	\
+		     : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7")
+
+
 void smccc_hvc(uint32_t function_id, uint64_t arg0, uint64_t arg1,
 	       uint64_t arg2, uint64_t arg3, uint64_t arg4, uint64_t arg5,
 	       uint64_t arg6, struct arm_smccc_res *res)
 {
-	asm volatile("mov   w0, %w[function_id]\n"
-		     "mov   x1, %[arg0]\n"
-		     "mov   x2, %[arg1]\n"
-		     "mov   x3, %[arg2]\n"
-		     "mov   x4, %[arg3]\n"
-		     "mov   x5, %[arg4]\n"
-		     "mov   x6, %[arg5]\n"
-		     "mov   x7, %[arg6]\n"
-		     "hvc   #0\n"
-		     "mov   %[res0], x0\n"
-		     "mov   %[res1], x1\n"
-		     "mov   %[res2], x2\n"
-		     "mov   %[res3], x3\n"
-		     : [res0] "=r"(res->a0), [res1] "=r"(res->a1),
-		       [res2] "=r"(res->a2), [res3] "=r"(res->a3)
-		     : [function_id] "r"(function_id), [arg0] "r"(arg0),
-		       [arg1] "r"(arg1), [arg2] "r"(arg2), [arg3] "r"(arg3),
-		       [arg4] "r"(arg4), [arg5] "r"(arg5), [arg6] "r"(arg6)
-		     : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7");
+	__smccc_call(hvc, function_id, arg0, arg1, arg2, arg3, arg4, arg5,
+		     arg6, res);
+}
+
+void smccc_smc(uint32_t function_id, uint64_t arg0, uint64_t arg1,
+	       uint64_t arg2, uint64_t arg3, uint64_t arg4, uint64_t arg5,
+	       uint64_t arg6, struct arm_smccc_res *res)
+{
+	__smccc_call(smc, function_id, arg0, arg1, arg2, arg3, arg4, arg5,
+		     arg6, res);
 }

 void kvm_selftest_arch_init(void)

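Note on `addr_pte()` and `pte_addr()` in the aarch64 hunk above: with 64K pages and 52-bit physical addresses, PA bits 51:48 do not fit in descriptor bits 47:16 and are stored in descriptor bits 15:12 instead. A minimal, self-contained sketch of that encoding and its inverse follows. `sketch_genmask()` is a hypothetical stand-in for the kernel's `GENMASK()`, and the functions take `page_shift` directly rather than a `struct kvm_vm`.

```c
#include <assert.h>
#include <stdint.h>

/* Bits h..l set, inclusive (illustrative replacement for GENMASK()). */
static uint64_t sketch_genmask(unsigned int h, unsigned int l)
{
	assert(h < 64 && l <= h);

	return (~UINT64_C(0) >> (63 - h)) & (~UINT64_C(0) << l);
}

/* Build a descriptor from a physical address and attribute bits. */
static uint64_t sketch_addr_pte(unsigned int page_shift, uint64_t pa, uint64_t attrs)
{
	uint64_t pte = pa & sketch_genmask(47, page_shift);

	/* 64K granule: PA[51:48] is carried in descriptor bits [15:12]. */
	if (page_shift == 16)
		pte |= ((pa & sketch_genmask(51, 48)) >> 48) << 12;

	return pte | attrs;
}

/* Recover the physical address from a descriptor. */
static uint64_t sketch_pte_addr(unsigned int page_shift, uint64_t pte)
{
	uint64_t pa = pte & sketch_genmask(47, page_shift);

	if (page_shift == 16)
		pa |= ((pte & sketch_genmask(15, 12)) >> 12) << 48;

	return pa;
}
```

The round trip only holds when the attribute bits stay clear of bits 15:12 for 64K pages, which is the case for the values used in the hunk (valid/table bits 1:0, the attribute index at bits 4:2, and the Access Flag at bit 10).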
+5

tools/testing/selftests/kvm/lib/kvm_util.c

···
 	TEST_FAIL("Unrecognized value '%c' for boolean module param", value);
 }

+bool get_kvm_param_bool(const char *param)
+{
+	return get_module_param_bool("kvm", param);
+}
+
 bool get_kvm_intel_param_bool(const char *param)
 {
 	return get_module_param_bool("kvm_intel", param);

+28 -8

tools/testing/selftests/kvm/lib/x86_64/processor.c

···
  * Copyright (C) 2018, Google LLC.
  */

+#include "linux/bitmap.h"
 #include "test_util.h"
 #include "kvm_util.h"
 #include "processor.h"
···
 				       DEFAULT_GUEST_STACK_VADDR_MIN,
 				       MEM_REGION_DATA);

+	stack_vaddr += DEFAULT_STACK_PGS * getpagesize();
+
+	/*
+	 * Align stack to match calling sequence requirements in section "The
+	 * Stack Frame" of the System V ABI AMD64 Architecture Processor
+	 * Supplement, which requires the value (%rsp + 8) to be a multiple of
+	 * 16 when control is transferred to the function entry point.
+	 *
+	 * If this code is ever used to launch a vCPU with 32-bit entry point it
+	 * may need to subtract 4 bytes instead of 8 bytes.
+	 */
+	TEST_ASSERT(IS_ALIGNED(stack_vaddr, PAGE_SIZE),
+		    "__vm_vaddr_alloc() did not provide a page-aligned address");
+	stack_vaddr -= 8;
+
 	vcpu = __vm_vcpu_add(vm, vcpu_id);
 	vcpu_init_cpuid(vcpu, kvm_get_supported_cpuid());
 	vcpu_setup(vm, vcpu);
···
 	/* Setup guest general purpose registers */
 	vcpu_regs_get(vcpu, &regs);
 	regs.rflags = regs.rflags | 0x2;
-	regs.rsp = stack_vaddr + (DEFAULT_STACK_PGS * getpagesize());
+	regs.rsp = stack_vaddr;
 	regs.rip = (unsigned long) guest_code;
 	vcpu_regs_set(vcpu, &regs);

···
 	return buffer.entry.data;
 }

-void __vm_xsave_require_permission(int bit, const char *name)
+void __vm_xsave_require_permission(uint64_t xfeature, const char *name)
 {
 	int kvm_fd;
 	u64 bitmask;
···
 	struct kvm_device_attr attr = {
 		.group = 0,
 		.attr = KVM_X86_XCOMP_GUEST_SUPP,
-		.addr = (unsigned long) &bitmask
+		.addr = (unsigned long) &bitmask,
 	};

 	TEST_ASSERT(!kvm_supported_cpuid,
 		    "kvm_get_supported_cpuid() cannot be used before ARCH_REQ_XCOMP_GUEST_PERM");
+
+	TEST_ASSERT(is_power_of_2(xfeature),
+		    "Dynamic XFeatures must be enabled one at a time");

 	kvm_fd = open_kvm_dev_path_or_exit();
 	rc = __kvm_ioctl(kvm_fd, KVM_GET_DEVICE_ATTR, &attr);
···

 	TEST_ASSERT(rc == 0, "KVM_GET_DEVICE_ATTR(0, KVM_X86_XCOMP_GUEST_SUPP) error: %ld", rc);

-	__TEST_REQUIRE(bitmask & (1ULL << bit),
+	__TEST_REQUIRE(bitmask & xfeature,
 		       "Required XSAVE feature '%s' not supported", name);

-	TEST_REQUIRE(!syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM, bit));
+	TEST_REQUIRE(!syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM, ilog2(xfeature)));

 	rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_GUEST_PERM, &bitmask);
 	TEST_ASSERT(rc == 0, "prctl(ARCH_GET_XCOMP_GUEST_PERM) error: %ld", rc);
-	TEST_ASSERT(bitmask & (1ULL << bit),
-		    "prctl(ARCH_REQ_XCOMP_GUEST_PERM) failure bitmask=0x%lx",
-		    bitmask);
+	TEST_ASSERT(bitmask & xfeature,
+		    "'%s' (0x%lx) not permitted after prctl(ARCH_REQ_XCOMP_GUEST_PERM) permitted=0x%lx",
+		    name, xfeature, bitmask);
 }

 void vcpu_init_cpuid(struct kvm_vcpu *vcpu, const struct kvm_cpuid2 *cpuid)
···
 	vcpu_run_complete_io(vcpu);

 	state = malloc(sizeof(*state) + msr_list->nmsrs * sizeof(state->msrs.entries[0]));
+	TEST_ASSERT(state, "-ENOMEM when allocating kvm state");

 	vcpu_events_get(vcpu, &state->events);
 	vcpu_mp_state_get(vcpu, &state->mp_state);

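Note on `__vm_xsave_require_permission()` above: the helper now takes a feature mask instead of a bit number, checks that exactly one bit is set, and converts the mask back to a bit index for `arch_prctl()`. A minimal sketch of that check and conversion follows, without the syscall. The `sketch_` names are hypothetical stand-ins for `is_power_of_2()` and `ilog2()`, and returning -1 is an illustrative choice where the selftest asserts instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if exactly one bit is set. */
static bool sketch_is_power_of_2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Index of the highest set bit; v must be nonzero. */
static unsigned int sketch_ilog2(uint64_t v)
{
	unsigned int bit = 0;

	assert(v != 0);
	while (v >>= 1)
		bit++;

	return bit;
}

/*
 * Map a single-feature mask to the bit number a permission request would
 * take, or -1 if the mask names zero or several features.
 */
static int sketch_xfeature_to_bit(uint64_t xfeature)
{
	if (!sketch_is_power_of_2(xfeature))
		return -1;

	return (int)sketch_ilog2(xfeature);
}
```

Passing a combined mask such as XTILE_CFG | XTILE_DATA is rejected, which matches the "one at a time" assertion added in the hunk.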
+47 -71

tools/testing/selftests/kvm/x86_64/amx_test.c

···
 #define XSAVE_SIZE			((NUM_TILES * TILE_SIZE) + PAGE_SIZE)

 /* Tile configuration associated: */
+#define PALETTE_TABLE_INDEX		1
 #define MAX_TILES			16
 #define RESERVED_BYTES			14

-#define XFEATURE_XTILECFG		17
-#define XFEATURE_XTILEDATA		18
-#define XFEATURE_MASK_XTILECFG		(1 << XFEATURE_XTILECFG)
-#define XFEATURE_MASK_XTILEDATA		(1 << XFEATURE_XTILEDATA)
-#define XFEATURE_MASK_XTILE		(XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)
-
 #define XSAVE_HDR_OFFSET		512
-
-struct xsave_data {
-	u8 area[XSAVE_SIZE];
-} __aligned(64);

 struct tile_config {
 	u8  palette_id;
···

 static struct xtile_info xtile;

-static inline u64 __xgetbv(u32 index)
-{
-	u32 eax, edx;
-
-	asm volatile("xgetbv;"
-		     : "=a" (eax), "=d" (edx)
-		     : "c" (index));
-	return eax + ((u64)edx << 32);
-}
-
-static inline void __xsetbv(u32 index, u64 value)
-{
-	u32 eax = value;
-	u32 edx = value >> 32;
-
-	asm volatile("xsetbv" :: "a" (eax), "d" (edx), "c" (index));
-}
-
 static inline void __ldtilecfg(void *cfg)
 {
 	asm volatile(".byte 0xc4,0xe2,0x78,0x49,0x00"
···
 	asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0" ::);
 }

-static inline void __xsavec(struct xsave_data *data, uint64_t rfbm)
+static inline void __xsavec(struct xstate *xstate, uint64_t rfbm)
 {
 	uint32_t rfbm_lo = rfbm;
 	uint32_t rfbm_hi = rfbm >> 32;

 	asm volatile("xsavec (%%rdi)"
-		     : : "D" (data), "a" (rfbm_lo), "d" (rfbm_hi)
+		     : : "D" (xstate), "a" (rfbm_lo), "d" (rfbm_hi)
 		     : "memory");
-}
-
-static inline void check_cpuid_xsave(void)
-{
-	GUEST_ASSERT(this_cpu_has(X86_FEATURE_XSAVE));
-	GUEST_ASSERT(this_cpu_has(X86_FEATURE_OSXSAVE));
-}
-
-static bool check_xsave_supports_xtile(void)
-{
-	return __xgetbv(0) & XFEATURE_MASK_XTILE;
 }

 static void check_xtile_info(void)
···
 	xtile.xsave_size = this_cpu_property(X86_PROPERTY_XSTATE_TILE_SIZE);
 	GUEST_ASSERT(xtile.xsave_size == 8192);
 	GUEST_ASSERT(sizeof(struct tile_data) >= xtile.xsave_size);
+
+	GUEST_ASSERT(this_cpu_has_p(X86_PROPERTY_AMX_MAX_PALETTE_TABLES));
+	GUEST_ASSERT(this_cpu_property(X86_PROPERTY_AMX_MAX_PALETTE_TABLES) >=
+		     PALETTE_TABLE_INDEX);

 	GUEST_ASSERT(this_cpu_has_p(X86_PROPERTY_AMX_NR_TILE_REGS));
 	xtile.max_names = this_cpu_property(X86_PROPERTY_AMX_NR_TILE_REGS);
···
 	}
 }

-static void set_xstatebv(void *data, uint64_t bv)
-{
-	*(uint64_t *)(data + XSAVE_HDR_OFFSET) = bv;
-}
-
-static u64 get_xstatebv(void *data)
-{
-	return *(u64 *)(data + XSAVE_HDR_OFFSET);
-}
-
 static void init_regs(void)
 {
 	uint64_t cr4, xcr0;
+
+	GUEST_ASSERT(this_cpu_has(X86_FEATURE_XSAVE));

 	/* turn on CR4.OSXSAVE */
 	cr4 = get_cr4();
 	cr4 |= X86_CR4_OSXSAVE;
 	set_cr4(cr4);
+	GUEST_ASSERT(this_cpu_has(X86_FEATURE_OSXSAVE));

-	xcr0 = __xgetbv(0);
+	xcr0 = xgetbv(0);
 	xcr0 |= XFEATURE_MASK_XTILE;
-	__xsetbv(0x0, xcr0);
+	xsetbv(0x0, xcr0);
+	GUEST_ASSERT((xgetbv(0) & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE);
 }

 static void __attribute__((__flatten__)) guest_code(struct tile_config *amx_cfg,
 						    struct tile_data *tiledata,
-						    struct xsave_data *xsave_data)
+						    struct xstate *xstate)
 {
 	init_regs();
-	check_cpuid_xsave();
-	check_xsave_supports_xtile();
 	check_xtile_info();
 	GUEST_SYNC(1);

···
 	GUEST_SYNC(4);
 	__tilerelease();
 	GUEST_SYNC(5);
-	/* bit 18 not in the XCOMP_BV after xsavec() */
-	set_xstatebv(xsave_data, XFEATURE_MASK_XTILEDATA);
-	__xsavec(xsave_data, XFEATURE_MASK_XTILEDATA);
-	GUEST_ASSERT((get_xstatebv(xsave_data) & XFEATURE_MASK_XTILEDATA) == 0);
+	/*
+	 * After XSAVEC, XTILEDATA is cleared in the xstate_bv but is set in
+	 * the xcomp_bv.
+	 */
+	xstate->header.xstate_bv = XFEATURE_MASK_XTILE_DATA;
+	__xsavec(xstate, XFEATURE_MASK_XTILE_DATA);
+	GUEST_ASSERT(!(xstate->header.xstate_bv & XFEATURE_MASK_XTILE_DATA));
+	GUEST_ASSERT(xstate->header.xcomp_bv & XFEATURE_MASK_XTILE_DATA);

 	/* xfd=0x40000, disable amx tiledata */
-	wrmsr(MSR_IA32_XFD, XFEATURE_MASK_XTILEDATA);
+	wrmsr(MSR_IA32_XFD, XFEATURE_MASK_XTILE_DATA);
+
+	/*
+	 * XTILEDATA is cleared in xstate_bv but set in xcomp_bv, this property
+	 * remains the same even when amx tiledata is disabled by IA32_XFD.
+	 */
+	xstate->header.xstate_bv = XFEATURE_MASK_XTILE_DATA;
+	__xsavec(xstate, XFEATURE_MASK_XTILE_DATA);
+	GUEST_ASSERT(!(xstate->header.xstate_bv & XFEATURE_MASK_XTILE_DATA));
+	GUEST_ASSERT((xstate->header.xcomp_bv & XFEATURE_MASK_XTILE_DATA));
+
 	GUEST_SYNC(6);
-	GUEST_ASSERT(rdmsr(MSR_IA32_XFD) == XFEATURE_MASK_XTILEDATA);
+	GUEST_ASSERT(rdmsr(MSR_IA32_XFD) == XFEATURE_MASK_XTILE_DATA);
 	set_tilecfg(amx_cfg);
 	__ldtilecfg(amx_cfg);
 	/* Trigger #NM exception */
···

 void guest_nm_handler(struct ex_regs *regs)
 {
-	/* Check if #NM is triggered by XFEATURE_MASK_XTILEDATA */
+	/* Check if #NM is triggered by XFEATURE_MASK_XTILE_DATA */
 	GUEST_SYNC(7);
-	GUEST_ASSERT(rdmsr(MSR_IA32_XFD_ERR) == XFEATURE_MASK_XTILEDATA);
+	GUEST_ASSERT(!(get_cr0() & X86_CR0_TS));
+	GUEST_ASSERT(rdmsr(MSR_IA32_XFD_ERR) == XFEATURE_MASK_XTILE_DATA);
+	GUEST_ASSERT(rdmsr(MSR_IA32_XFD) == XFEATURE_MASK_XTILE_DATA);
 	GUEST_SYNC(8);
-	GUEST_ASSERT(rdmsr(MSR_IA32_XFD_ERR) == XFEATURE_MASK_XTILEDATA);
+	GUEST_ASSERT(rdmsr(MSR_IA32_XFD_ERR) == XFEATURE_MASK_XTILE_DATA);
+	GUEST_ASSERT(rdmsr(MSR_IA32_XFD) == XFEATURE_MASK_XTILE_DATA);
 	/* Clear xfd_err */
 	wrmsr(MSR_IA32_XFD_ERR, 0);
 	/* xfd=0, enable amx */
···
 	struct kvm_vm *vm;
 	struct kvm_x86_state *state;
 	int xsave_restore_size;
-	vm_vaddr_t amx_cfg, tiledata, xsavedata;
+	vm_vaddr_t amx_cfg, tiledata, xstate;
 	struct ucall uc;
 	u32 amx_offset;
 	int stage, ret;
···
 	 * Note, all off-by-default features must be enabled before anything
 	 * caches KVM_GET_SUPPORTED_CPUID, e.g. before using kvm_cpu_has().
 	 */
-	vm_xsave_require_permission(XSTATE_XTILE_DATA_BIT);
+	vm_xsave_require_permission(XFEATURE_MASK_XTILE_DATA);

 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_XFD));
 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_XSAVE));
 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_AMX_TILE));
 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_XTILECFG));
 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_XTILEDATA));
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_XTILEDATA_XFD));

 	/* Create VM */
 	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
···
 	tiledata = vm_vaddr_alloc_pages(vm, 2);
 	memset(addr_gva2hva(vm, tiledata), rand() | 1, 2 * getpagesize());

-	/* xsave data for guest_code */
-	xsavedata = vm_vaddr_alloc_pages(vm, 3);
-	memset(addr_gva2hva(vm, xsavedata), 0, 3 * getpagesize());
-	vcpu_args_set(vcpu, 3, amx_cfg, tiledata, xsavedata);
+	/* XSAVE state for guest_code */
+	xstate = vm_vaddr_alloc_pages(vm, DIV_ROUND_UP(XSAVE_SIZE, PAGE_SIZE));
+	memset(addr_gva2hva(vm, xstate), 0, PAGE_SIZE * DIV_ROUND_UP(XSAVE_SIZE, PAGE_SIZE));
+	vcpu_args_set(vcpu, 3, amx_cfg, tiledata, xstate);

 	for (stage = 1; ; stage++) {
 		vcpu_run(vcpu);

+141 -112

tools/testing/selftests/kvm/x86_64/pmu_event_filter_test.c

··· 54 54 55 55 #define AMD_ZEN_BR_RETIRED EVENT(0xc2, 0) 56 56 57 + 58 + /* 59 + * "Retired instructions", from Processor Programming Reference 60 + * (PPR) for AMD Family 17h Model 01h, Revision B1 Processors, 61 + * Preliminary Processor Programming Reference (PPR) for AMD Family 62 + * 17h Model 31h, Revision B0 Processors, and Preliminary Processor 63 + * Programming Reference (PPR) for AMD Family 19h Model 01h, Revision 64 + * B1 Processors Volume 1 of 2. 65 + * --- and --- 66 + * "Instructions retired", from the Intel SDM, volume 3, 67 + * "Pre-defined Architectural Performance Events." 68 + */ 69 + 70 + #define INST_RETIRED EVENT(0xc0, 0) 71 + 57 72 /* 58 73 * This event list comprises Intel's eight architectural events plus 59 74 * AMD's "retired branch instructions" for Zen[123] (and possibly ··· 76 61 */ 77 62 static const uint64_t event_list[] = { 78 63 EVENT(0x3c, 0), 79 - EVENT(0xc0, 0), 64 + INST_RETIRED, 80 65 EVENT(0x3c, 1), 81 66 EVENT(0x2e, 0x4f), 82 67 EVENT(0x2e, 0x41), ··· 86 71 AMD_ZEN_BR_RETIRED, 87 72 }; 88 73 74 + struct { 75 + uint64_t loads; 76 + uint64_t stores; 77 + uint64_t loads_stores; 78 + uint64_t branches_retired; 79 + uint64_t instructions_retired; 80 + } pmc_results; 81 + 89 82 /* 90 83 * If we encounter a #GP during the guest PMU sanity check, then the guest 91 84 * PMU is not functional. Inform the hypervisor via GUEST_SYNC(0). 
92 85 */ 93 86 static void guest_gp_handler(struct ex_regs *regs) 94 87 { 95 - GUEST_SYNC(0); 88 + GUEST_SYNC(-EFAULT); 96 89 } 97 90 98 91 /* ··· 115 92 116 93 wrmsr(msr, v); 117 94 if (rdmsr(msr) != v) 118 - GUEST_SYNC(0); 95 + GUEST_SYNC(-EIO); 119 96 120 97 v ^= bits_to_flip; 121 98 wrmsr(msr, v); 122 99 if (rdmsr(msr) != v) 123 - GUEST_SYNC(0); 100 + GUEST_SYNC(-EIO); 101 + } 102 + 103 + static void run_and_measure_loop(uint32_t msr_base) 104 + { 105 + const uint64_t branches_retired = rdmsr(msr_base + 0); 106 + const uint64_t insn_retired = rdmsr(msr_base + 1); 107 + 108 + __asm__ __volatile__("loop ." : "+c"((int){NUM_BRANCHES})); 109 + 110 + pmc_results.branches_retired = rdmsr(msr_base + 0) - branches_retired; 111 + pmc_results.instructions_retired = rdmsr(msr_base + 1) - insn_retired; 124 112 } 125 113 126 114 static void intel_guest_code(void) ··· 139 105 check_msr(MSR_CORE_PERF_GLOBAL_CTRL, 1); 140 106 check_msr(MSR_P6_EVNTSEL0, 0xffff); 141 107 check_msr(MSR_IA32_PMC0, 0xffff); 142 - GUEST_SYNC(1); 108 + GUEST_SYNC(0); 143 109 144 110 for (;;) { 145 - uint64_t br0, br1; 146 - 147 111 wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0); 148 112 wrmsr(MSR_P6_EVNTSEL0, ARCH_PERFMON_EVENTSEL_ENABLE | 149 113 ARCH_PERFMON_EVENTSEL_OS | INTEL_BR_RETIRED); 150 - wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 1); 151 - br0 = rdmsr(MSR_IA32_PMC0); 152 - __asm__ __volatile__("loop ." 
: "+c"((int){NUM_BRANCHES})); 153 - br1 = rdmsr(MSR_IA32_PMC0); 154 - GUEST_SYNC(br1 - br0); 114 + wrmsr(MSR_P6_EVNTSEL1, ARCH_PERFMON_EVENTSEL_ENABLE | 115 + ARCH_PERFMON_EVENTSEL_OS | INST_RETIRED); 116 + wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0x3); 117 + 118 + run_and_measure_loop(MSR_IA32_PMC0); 119 + GUEST_SYNC(0); 155 120 } 156 121 } 157 122 ··· 163 130 { 164 131 check_msr(MSR_K7_EVNTSEL0, 0xffff); 165 132 check_msr(MSR_K7_PERFCTR0, 0xffff); 166 - GUEST_SYNC(1); 133 + GUEST_SYNC(0); 167 134 168 135 for (;;) { 169 - uint64_t br0, br1; 170 - 171 136 wrmsr(MSR_K7_EVNTSEL0, 0); 172 137 wrmsr(MSR_K7_EVNTSEL0, ARCH_PERFMON_EVENTSEL_ENABLE | 173 138 ARCH_PERFMON_EVENTSEL_OS | AMD_ZEN_BR_RETIRED); 174 - br0 = rdmsr(MSR_K7_PERFCTR0); 175 - __asm__ __volatile__("loop ." : "+c"((int){NUM_BRANCHES})); 176 - br1 = rdmsr(MSR_K7_PERFCTR0); 177 - GUEST_SYNC(br1 - br0); 139 + wrmsr(MSR_K7_EVNTSEL1, ARCH_PERFMON_EVENTSEL_ENABLE | 140 + ARCH_PERFMON_EVENTSEL_OS | INST_RETIRED); 141 + 142 + run_and_measure_loop(MSR_K7_PERFCTR0); 143 + GUEST_SYNC(0); 178 144 } 179 145 } 180 146 ··· 193 161 return uc.args[1]; 194 162 } 195 163 164 + static void run_vcpu_and_sync_pmc_results(struct kvm_vcpu *vcpu) 165 + { 166 + uint64_t r; 167 + 168 + memset(&pmc_results, 0, sizeof(pmc_results)); 169 + sync_global_to_guest(vcpu->vm, pmc_results); 170 + 171 + r = run_vcpu_to_sync(vcpu); 172 + TEST_ASSERT(!r, "Unexpected sync value: 0x%lx", r); 173 + 174 + sync_global_from_guest(vcpu->vm, pmc_results); 175 + } 176 + 196 177 /* 197 178 * In a nested environment or if the vPMU is disabled, the guest PMU 198 179 * might not work as architected (accessing the PMU MSRs may raise ··· 216 171 */ 217 172 static bool sanity_check_pmu(struct kvm_vcpu *vcpu) 218 173 { 219 - bool success; 174 + uint64_t r; 220 175 221 176 vm_install_exception_handler(vcpu->vm, GP_VECTOR, guest_gp_handler); 222 - success = run_vcpu_to_sync(vcpu); 177 + r = run_vcpu_to_sync(vcpu); 223 178 vm_install_exception_handler(vcpu->vm, 
GP_VECTOR, NULL); 224 179 225 - return success; 180 + return !r; 226 181 } 227 182 228 183 static struct kvm_pmu_event_filter *alloc_pmu_event_filter(uint32_t nevents) ··· 282 237 return f; 283 238 } 284 239 240 + #define ASSERT_PMC_COUNTING_INSTRUCTIONS() \ 241 + do { \ 242 + uint64_t br = pmc_results.branches_retired; \ 243 + uint64_t ir = pmc_results.instructions_retired; \ 244 + \ 245 + if (br && br != NUM_BRANCHES) \ 246 + pr_info("%s: Branch instructions retired = %lu (expected %u)\n", \ 247 + __func__, br, NUM_BRANCHES); \ 248 + TEST_ASSERT(br, "%s: Branch instructions retired = %lu (expected > 0)", \ 249 + __func__, br); \ 250 + TEST_ASSERT(ir, "%s: Instructions retired = %lu (expected > 0)", \ 251 + __func__, ir); \ 252 + } while (0) 253 + 254 + #define ASSERT_PMC_NOT_COUNTING_INSTRUCTIONS() \ 255 + do { \ 256 + uint64_t br = pmc_results.branches_retired; \ 257 + uint64_t ir = pmc_results.instructions_retired; \ 258 + \ 259 + TEST_ASSERT(!br, "%s: Branch instructions retired = %lu (expected 0)", \ 260 + __func__, br); \ 261 + TEST_ASSERT(!ir, "%s: Instructions retired = %lu (expected 0)", \ 262 + __func__, ir); \ 263 + } while (0) 264 + 285 265 static void test_without_filter(struct kvm_vcpu *vcpu) 286 266 { 287 - uint64_t count = run_vcpu_to_sync(vcpu); 267 + run_vcpu_and_sync_pmc_results(vcpu); 288 268 289 - if (count != NUM_BRANCHES) 290 - pr_info("%s: Branch instructions retired = %lu (expected %u)\n", 291 - __func__, count, NUM_BRANCHES); 292 - TEST_ASSERT(count, "Allowed PMU event is not counting"); 269 + ASSERT_PMC_COUNTING_INSTRUCTIONS(); 293 270 } 294 271 295 - static uint64_t test_with_filter(struct kvm_vcpu *vcpu, 296 - struct kvm_pmu_event_filter *f) 272 + static void test_with_filter(struct kvm_vcpu *vcpu, 273 + struct kvm_pmu_event_filter *f) 297 274 { 298 275 vm_ioctl(vcpu->vm, KVM_SET_PMU_EVENT_FILTER, f); 299 - return run_vcpu_to_sync(vcpu); 276 + run_vcpu_and_sync_pmc_results(vcpu); 300 277 } 301 278 302 279 static void 
test_amd_deny_list(struct kvm_vcpu *vcpu) 303 280 { 304 281 uint64_t event = EVENT(0x1C2, 0); 305 282 struct kvm_pmu_event_filter *f; 306 - uint64_t count; 307 283 308 284 f = create_pmu_event_filter(&event, 1, KVM_PMU_EVENT_DENY, 0); 309 - count = test_with_filter(vcpu, f); 310 - 285 + test_with_filter(vcpu, f); 311 286 free(f); 312 - if (count != NUM_BRANCHES) 313 - pr_info("%s: Branch instructions retired = %lu (expected %u)\n", 314 - __func__, count, NUM_BRANCHES); 315 - TEST_ASSERT(count, "Allowed PMU event is not counting"); 287 + 288 + ASSERT_PMC_COUNTING_INSTRUCTIONS(); 316 289 } 317 290 318 291 static void test_member_deny_list(struct kvm_vcpu *vcpu) 319 292 { 320 293 struct kvm_pmu_event_filter *f = event_filter(KVM_PMU_EVENT_DENY); 321 - uint64_t count = test_with_filter(vcpu, f); 322 294 295 + test_with_filter(vcpu, f); 323 296 free(f); 324 - if (count) 325 - pr_info("%s: Branch instructions retired = %lu (expected 0)\n", 326 - __func__, count); 327 - TEST_ASSERT(!count, "Disallowed PMU Event is counting"); 297 + 298 + ASSERT_PMC_NOT_COUNTING_INSTRUCTIONS(); 328 299 } 329 300 330 301 static void test_member_allow_list(struct kvm_vcpu *vcpu) 331 302 { 332 303 struct kvm_pmu_event_filter *f = event_filter(KVM_PMU_EVENT_ALLOW); 333 - uint64_t count = test_with_filter(vcpu, f); 334 304 305 + test_with_filter(vcpu, f); 335 306 free(f); 336 - if (count != NUM_BRANCHES) 337 - pr_info("%s: Branch instructions retired = %lu (expected %u)\n", 338 - __func__, count, NUM_BRANCHES); 339 - TEST_ASSERT(count, "Allowed PMU event is not counting"); 307 + 308 + ASSERT_PMC_COUNTING_INSTRUCTIONS(); 340 309 } 341 310 342 311 static void test_not_member_deny_list(struct kvm_vcpu *vcpu) 343 312 { 344 313 struct kvm_pmu_event_filter *f = event_filter(KVM_PMU_EVENT_DENY); 345 - uint64_t count; 346 314 315 + remove_event(f, INST_RETIRED); 347 316 remove_event(f, INTEL_BR_RETIRED); 348 317 remove_event(f, AMD_ZEN_BR_RETIRED); 349 - count = test_with_filter(vcpu, f); 318 + 
test_with_filter(vcpu, f); 350 319 free(f); 351 - if (count != NUM_BRANCHES) 352 - pr_info("%s: Branch instructions retired = %lu (expected %u)\n", 353 - __func__, count, NUM_BRANCHES); 354 - TEST_ASSERT(count, "Allowed PMU event is not counting"); 320 + 321 + ASSERT_PMC_COUNTING_INSTRUCTIONS(); 355 322 } 356 323 357 324 static void test_not_member_allow_list(struct kvm_vcpu *vcpu) 358 325 { 359 326 struct kvm_pmu_event_filter *f = event_filter(KVM_PMU_EVENT_ALLOW); 360 - uint64_t count; 361 327 328 + remove_event(f, INST_RETIRED); 362 329 remove_event(f, INTEL_BR_RETIRED); 363 330 remove_event(f, AMD_ZEN_BR_RETIRED); 364 - count = test_with_filter(vcpu, f); 331 + test_with_filter(vcpu, f); 365 332 free(f); 366 - if (count) 367 - pr_info("%s: Branch instructions retired = %lu (expected 0)\n", 368 - __func__, count); 369 - TEST_ASSERT(!count, "Disallowed PMU Event is counting"); 333 + 334 + ASSERT_PMC_NOT_COUNTING_INSTRUCTIONS(); 370 335 } 371 336 372 337 /* ··· 505 450 #define EXCLUDE_MASKED_ENTRY(event_select, mask, match) \ 506 451 KVM_PMU_ENCODE_MASKED_ENTRY(event_select, mask, match, true) 507 452 508 - struct perf_counter { 509 - union { 510 - uint64_t raw; 511 - struct { 512 - uint64_t loads:22; 513 - uint64_t stores:22; 514 - uint64_t loads_stores:20; 515 - }; 516 - }; 517 - }; 518 - 519 - static uint64_t masked_events_guest_test(uint32_t msr_base) 453 + static void masked_events_guest_test(uint32_t msr_base) 520 454 { 521 - uint64_t ld0, ld1, st0, st1, ls0, ls1; 522 - struct perf_counter c; 523 - int val; 524 - 525 455 /* 526 - * The acutal value of the counters don't determine the outcome of 456 + * The actual value of the counters don't determine the outcome of 527 457 * the test. Only that they are zero or non-zero. 
528 458 */ 529 - ld0 = rdmsr(msr_base + 0); 530 - st0 = rdmsr(msr_base + 1); 531 - ls0 = rdmsr(msr_base + 2); 459 + const uint64_t loads = rdmsr(msr_base + 0); 460 + const uint64_t stores = rdmsr(msr_base + 1); 461 + const uint64_t loads_stores = rdmsr(msr_base + 2); 462 + int val; 463 + 532 464 533 465 __asm__ __volatile__("movl $0, %[v];" 534 466 "movl %[v], %%eax;" 535 467 "incl %[v];" 536 468 : [v]"+m"(val) :: "eax"); 537 469 538 - ld1 = rdmsr(msr_base + 0); 539 - st1 = rdmsr(msr_base + 1); 540 - ls1 = rdmsr(msr_base + 2); 541 - 542 - c.loads = ld1 - ld0; 543 - c.stores = st1 - st0; 544 - c.loads_stores = ls1 - ls0; 545 - 546 - return c.raw; 470 + pmc_results.loads = rdmsr(msr_base + 0) - loads; 471 + pmc_results.stores = rdmsr(msr_base + 1) - stores; 472 + pmc_results.loads_stores = rdmsr(msr_base + 2) - loads_stores; 547 473 } 548 474 549 475 static void intel_masked_events_guest_code(void) 550 476 { 551 - uint64_t r; 552 - 553 477 for (;;) { 554 478 wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0); 555 479 ··· 541 507 542 508 wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0x7); 543 509 544 - r = masked_events_guest_test(MSR_IA32_PMC0); 545 - 546 - GUEST_SYNC(r); 510 + masked_events_guest_test(MSR_IA32_PMC0); 511 + GUEST_SYNC(0); 547 512 } 548 513 } 549 514 550 515 static void amd_masked_events_guest_code(void) 551 516 { 552 - uint64_t r; 553 - 554 517 for (;;) { 555 518 wrmsr(MSR_K7_EVNTSEL0, 0); 556 519 wrmsr(MSR_K7_EVNTSEL1, 0); ··· 560 529 wrmsr(MSR_K7_EVNTSEL2, ARCH_PERFMON_EVENTSEL_ENABLE | 561 530 ARCH_PERFMON_EVENTSEL_OS | LS_DISPATCH_LOAD_STORE); 562 531 563 - r = masked_events_guest_test(MSR_K7_PERFCTR0); 564 - 565 - GUEST_SYNC(r); 532 + masked_events_guest_test(MSR_K7_PERFCTR0); 533 + GUEST_SYNC(0); 566 534 } 567 535 } 568 536 569 - static struct perf_counter run_masked_events_test(struct kvm_vcpu *vcpu, 570 - const uint64_t masked_events[], 571 - const int nmasked_events) 537 + static void run_masked_events_test(struct kvm_vcpu *vcpu, 538 + const uint64_t masked_events[], 
539 + const int nmasked_events) 572 540 { 573 541 struct kvm_pmu_event_filter *f; 574 - struct perf_counter r; 575 542 576 543 f = create_pmu_event_filter(masked_events, nmasked_events, 577 544 KVM_PMU_EVENT_ALLOW, 578 545 KVM_PMU_EVENT_FLAG_MASKED_EVENTS); 579 - r.raw = test_with_filter(vcpu, f); 546 + test_with_filter(vcpu, f); 580 547 free(f); 581 - 582 - return r; 583 548 } 584 549 585 550 /* Matches KVM_PMU_EVENT_FILTER_MAX_EVENTS in pmu.c */ ··· 700 673 int nevents) 701 674 { 702 675 int ntests = ARRAY_SIZE(test_cases); 703 - struct perf_counter c; 704 676 int i, n; 705 677 706 678 for (i = 0; i < ntests; i++) { ··· 711 685 712 686 n = append_test_events(test, events, nevents); 713 687 714 - c = run_masked_events_test(vcpu, events, n); 715 - TEST_ASSERT(bool_eq(c.loads, test->flags & ALLOW_LOADS) && 716 - bool_eq(c.stores, test->flags & ALLOW_STORES) && 717 - bool_eq(c.loads_stores, 688 + run_masked_events_test(vcpu, events, n); 689 + 690 + TEST_ASSERT(bool_eq(pmc_results.loads, test->flags & ALLOW_LOADS) && 691 + bool_eq(pmc_results.stores, test->flags & ALLOW_STORES) && 692 + bool_eq(pmc_results.loads_stores, 718 693 test->flags & ALLOW_LOADS_STORES), 719 - "%s loads: %u, stores: %u, loads + stores: %u", 720 - test->msg, c.loads, c.stores, c.loads_stores); 694 + "%s loads: %lu, stores: %lu, loads + stores: %lu", 695 + test->msg, pmc_results.loads, pmc_results.stores, 696 + pmc_results.loads_stores); 721 697 } 722 698 } 723 699 ··· 792 764 struct kvm_vcpu *vcpu, *vcpu2 = NULL; 793 765 struct kvm_vm *vm; 794 766 767 + TEST_REQUIRE(get_kvm_param_bool("enable_pmu")); 795 768 TEST_REQUIRE(kvm_has_cap(KVM_CAP_PMU_EVENT_FILTER)); 796 769 TEST_REQUIRE(kvm_has_cap(KVM_CAP_PMU_EVENT_MASKED_EVENTS)); 797 770
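
The expectations encoded by the `test_member_*` and `test_not_member_*` cases above can be summarized as a small truth table: an event is expected to count when it is on an allow list or absent from a deny list. The following is a minimal sketch of that expectation only, not KVM's filter implementation; `filter_action`, `event_in_list` and `event_should_count` are hypothetical names, and events are treated as opaque 64-bit values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for KVM_PMU_EVENT_ALLOW / KVM_PMU_EVENT_DENY. */
enum filter_action { FILTER_ALLOW, FILTER_DENY };

/* Linear search; events are opaque 64-bit encodings in this sketch. */
static bool event_in_list(const uint64_t *events, size_t n, uint64_t event)
{
	for (size_t i = 0; i < n; i++) {
		if (events[i] == event)
			return true;
	}
	return false;
}

/*
 * Expected outcome as asserted by the tests:
 *   allow list: count only if the event is a member
 *   deny list:  count only if the event is not a member
 */
static bool event_should_count(enum filter_action action,
			       const uint64_t *events, size_t n,
			       uint64_t event)
{
	bool member = event_in_list(events, n, event);

	return action == FILTER_ALLOW ? member : !member;
}
```

Under this model, `ASSERT_PMC_COUNTING_INSTRUCTIONS()` corresponds to the `true` outcomes and `ASSERT_PMC_NOT_COUNTING_INSTRUCTIONS()` to the `false` ones.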

+6 -2

tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c

··· 126 126 goto skip_test; 127 127 128 128 if (fgets(buf, sizeof(buf), fp) == NULL) 129 - goto skip_test; 129 + goto close_fp; 130 130 131 131 if (strncmp(buf, "tsc", sizeof(buf))) 132 - goto skip_test; 132 + goto close_fp; 133 133 134 + fclose(fp); 134 135 return; 136 + 137 + close_fp: 138 + fclose(fp); 135 139 skip_test: 136 140 print_skip("Kernel does not use TSC clocksource - assuming that host TSC is not stable"); 137 141 exit(KSFT_SKIP);
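
The change above closes `fp` on every path that is reached after a successful `fopen()`, using a second label so that the open-failure path still skips the `fclose()`. A minimal standalone sketch of the same layered-label pattern follows; `clocksource_is_tsc` and its `path` parameter are hypothetical, and the 4-byte buffer is an assumption chosen so that `fgets()` reads at most the three characters of "tsc".

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the file at `path` starts with "tsc",
 * 0 otherwise (including when the file cannot be opened or read).
 */
static int clocksource_is_tsc(const char *path)
{
	char buf[4];	/* "tsc" plus the terminating NUL */
	int ret = 0;
	FILE *fp;

	fp = fopen(path, "r");
	if (!fp)
		goto out;	/* nothing to close yet */

	if (!fgets(buf, sizeof(buf), fp))
		goto close_fp;

	if (strncmp(buf, "tsc", sizeof(buf)))
		goto close_fp;

	ret = 1;

close_fp:
	fclose(fp);	/* reached on success and on every post-open failure */
out:
	return ret;
}
```

Ordering the labels in reverse order of acquisition means each failure jumps to exactly the cleanup it needs and falls through the rest.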

+205 -44

tools/testing/selftests/kvm/x86_64/vmx_pmu_caps_test.c

··· 14 14 #define _GNU_SOURCE /* for program_invocation_short_name */ 15 15 #include <sys/ioctl.h> 16 16 17 + #include <linux/bitmap.h> 18 + 17 19 #include "kvm_util.h" 18 20 #include "vmx.h" 19 - 20 - #define PMU_CAP_FW_WRITES (1ULL << 13) 21 - #define PMU_CAP_LBR_FMT 0x3f 22 21 23 22 union perf_capabilities { 24 23 struct { ··· 35 36 u64 capabilities; 36 37 }; 37 38 38 - static void guest_code(void) 39 + /* 40 + * The LBR format and most PEBS features are immutable, all other features are 41 + * fungible (if supported by the host and KVM). 42 + */ 43 + static const union perf_capabilities immutable_caps = { 44 + .lbr_format = -1, 45 + .pebs_trap = 1, 46 + .pebs_arch_reg = 1, 47 + .pebs_format = -1, 48 + .pebs_baseline = 1, 49 + }; 50 + 51 + static const union perf_capabilities format_caps = { 52 + .lbr_format = -1, 53 + .pebs_format = -1, 54 + }; 55 + 56 + static void guest_code(uint64_t current_val) 39 57 { 40 - wrmsr(MSR_IA32_PERF_CAPABILITIES, PMU_CAP_LBR_FMT); 58 + uint8_t vector; 59 + int i; 60 + 61 + vector = wrmsr_safe(MSR_IA32_PERF_CAPABILITIES, current_val); 62 + GUEST_ASSERT_2(vector == GP_VECTOR, current_val, vector); 63 + 64 + vector = wrmsr_safe(MSR_IA32_PERF_CAPABILITIES, 0); 65 + GUEST_ASSERT_2(vector == GP_VECTOR, 0, vector); 66 + 67 + for (i = 0; i < 64; i++) { 68 + vector = wrmsr_safe(MSR_IA32_PERF_CAPABILITIES, 69 + current_val ^ BIT_ULL(i)); 70 + GUEST_ASSERT_2(vector == GP_VECTOR, 71 + current_val ^ BIT_ULL(i), vector); 72 + } 73 + 74 + GUEST_DONE(); 75 + } 76 + 77 + /* 78 + * Verify that guest WRMSRs to PERF_CAPABILITIES #GP regardless of the value 79 + * written, that the guest always sees the userspace controlled value, and that 80 + * PERF_CAPABILITIES is immutable after KVM_RUN. 
81 + */ 82 + static void test_guest_wrmsr_perf_capabilities(union perf_capabilities host_cap) 83 + { 84 + struct kvm_vcpu *vcpu; 85 + struct kvm_vm *vm = vm_create_with_one_vcpu(&vcpu, guest_code); 86 + struct ucall uc; 87 + int r, i; 88 + 89 + vm_init_descriptor_tables(vm); 90 + vcpu_init_descriptor_tables(vcpu); 91 + 92 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, host_cap.capabilities); 93 + 94 + vcpu_args_set(vcpu, 1, host_cap.capabilities); 95 + vcpu_run(vcpu); 96 + 97 + switch (get_ucall(vcpu, &uc)) { 98 + case UCALL_ABORT: 99 + REPORT_GUEST_ASSERT_2(uc, "val = 0x%lx, vector = %lu"); 100 + break; 101 + case UCALL_DONE: 102 + break; 103 + default: 104 + TEST_FAIL("Unexpected ucall: %lu", uc.cmd); 105 + } 106 + 107 + ASSERT_EQ(vcpu_get_msr(vcpu, MSR_IA32_PERF_CAPABILITIES), host_cap.capabilities); 108 + 109 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, host_cap.capabilities); 110 + 111 + r = _vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, 0); 112 + TEST_ASSERT(!r, "Post-KVM_RUN write '0' didn't fail"); 113 + 114 + for (i = 0; i < 64; i++) { 115 + r = _vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, 116 + host_cap.capabilities ^ BIT_ULL(i)); 117 + TEST_ASSERT(!r, "Post-KVM_RUN write '0x%llx'didn't fail", 118 + host_cap.capabilities ^ BIT_ULL(i)); 119 + } 120 + 121 + kvm_vm_free(vm); 122 + } 123 + 124 + /* 125 + * Verify KVM allows writing PERF_CAPABILITIES with all KVM-supported features 126 + * enabled, as well as '0' (to disable all features). 
127 + */ 128 + static void test_basic_perf_capabilities(union perf_capabilities host_cap) 129 + { 130 + struct kvm_vcpu *vcpu; 131 + struct kvm_vm *vm = vm_create_with_one_vcpu(&vcpu, NULL); 132 + 133 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, 0); 134 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, host_cap.capabilities); 135 + 136 + kvm_vm_free(vm); 137 + } 138 + 139 + static void test_fungible_perf_capabilities(union perf_capabilities host_cap) 140 + { 141 + const uint64_t fungible_caps = host_cap.capabilities & ~immutable_caps.capabilities; 142 + 143 + struct kvm_vcpu *vcpu; 144 + struct kvm_vm *vm = vm_create_with_one_vcpu(&vcpu, NULL); 145 + int bit; 146 + 147 + for_each_set_bit(bit, &fungible_caps, 64) { 148 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, BIT_ULL(bit)); 149 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, 150 + host_cap.capabilities & ~BIT_ULL(bit)); 151 + } 152 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, host_cap.capabilities); 153 + 154 + kvm_vm_free(vm); 155 + } 156 + 157 + /* 158 + * Verify KVM rejects attempts to set unsupported and/or immutable features in 159 + * PERF_CAPABILITIES. Note, LBR format and PEBS format need to be validated 160 + * separately as they are multi-bit values, e.g. toggling or setting a single 161 + * bit can generate a false positive without dedicated safeguards. 
162 + */ 163 + static void test_immutable_perf_capabilities(union perf_capabilities host_cap) 164 + { 165 + const uint64_t reserved_caps = (~host_cap.capabilities | 166 + immutable_caps.capabilities) & 167 + ~format_caps.capabilities; 168 + 169 + struct kvm_vcpu *vcpu; 170 + struct kvm_vm *vm = vm_create_with_one_vcpu(&vcpu, NULL); 171 + union perf_capabilities val = host_cap; 172 + int r, bit; 173 + 174 + for_each_set_bit(bit, &reserved_caps, 64) { 175 + r = _vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, 176 + host_cap.capabilities ^ BIT_ULL(bit)); 177 + TEST_ASSERT(!r, "%s immutable feature 0x%llx (bit %d) didn't fail", 178 + host_cap.capabilities & BIT_ULL(bit) ? "Setting" : "Clearing", 179 + BIT_ULL(bit), bit); 180 + } 181 + 182 + /* 183 + * KVM only supports the host's native LBR format, as well as '0' (to 184 + * disable LBR support). Verify KVM rejects all other LBR formats. 185 + */ 186 + for (val.lbr_format = 1; val.lbr_format; val.lbr_format++) { 187 + if (val.lbr_format == host_cap.lbr_format) 188 + continue; 189 + 190 + r = _vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, val.capabilities); 191 + TEST_ASSERT(!r, "Bad LBR FMT = 0x%x didn't fail, host = 0x%x", 192 + val.lbr_format, host_cap.lbr_format); 193 + } 194 + 195 + /* Ditto for the PEBS format. */ 196 + for (val.pebs_format = 1; val.pebs_format; val.pebs_format++) { 197 + if (val.pebs_format == host_cap.pebs_format) 198 + continue; 199 + 200 + r = _vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, val.capabilities); 201 + TEST_ASSERT(!r, "Bad PEBS FMT = 0x%x didn't fail, host = 0x%x", 202 + val.pebs_format, host_cap.pebs_format); 203 + } 204 + 205 + kvm_vm_free(vm); 206 + } 207 + 208 + /* 209 + * Test that LBR MSRs are writable when LBRs are enabled, and then verify that 210 + * disabling the vPMU via CPUID also disables LBR support. Set bits 2:0 of 211 + * LBR_TOS as those bits are writable across all uarch implementations (arch 212 + * LBRs will need to poke a different MSR). 
213 + */ 214 + static void test_lbr_perf_capabilities(union perf_capabilities host_cap) 215 + { 216 + struct kvm_vcpu *vcpu; 217 + struct kvm_vm *vm; 218 + int r; 219 + 220 + if (!host_cap.lbr_format) 221 + return; 222 + 223 + vm = vm_create_with_one_vcpu(&vcpu, NULL); 224 + 225 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, host_cap.capabilities); 226 + vcpu_set_msr(vcpu, MSR_LBR_TOS, 7); 227 + 228 + vcpu_clear_cpuid_entry(vcpu, X86_PROPERTY_PMU_VERSION.function); 229 + 230 + r = _vcpu_set_msr(vcpu, MSR_LBR_TOS, 7); 231 + TEST_ASSERT(!r, "Writing LBR_TOS should fail after disabling vPMU"); 232 + 233 + kvm_vm_free(vm); 41 234 } 42 235 43 236 int main(int argc, char *argv[]) 44 237 { 45 - struct kvm_vm *vm; 46 - struct kvm_vcpu *vcpu; 47 - int ret; 48 238 union perf_capabilities host_cap; 49 - uint64_t val; 50 239 51 - host_cap.capabilities = kvm_get_feature_msr(MSR_IA32_PERF_CAPABILITIES); 52 - host_cap.capabilities &= (PMU_CAP_FW_WRITES | PMU_CAP_LBR_FMT); 53 - 54 - /* Create VM */ 55 - vm = vm_create_with_one_vcpu(&vcpu, guest_code); 56 - 240 + TEST_REQUIRE(get_kvm_param_bool("enable_pmu")); 57 241 TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_PDCM)); 58 242 59 243 TEST_REQUIRE(kvm_cpu_has_p(X86_PROPERTY_PMU_VERSION)); 60 244 TEST_REQUIRE(kvm_cpu_property(X86_PROPERTY_PMU_VERSION) > 0); 61 245 62 - /* testcase 1, set capabilities when we have PDCM bit */ 63 - vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, PMU_CAP_FW_WRITES); 246 + host_cap.capabilities = kvm_get_feature_msr(MSR_IA32_PERF_CAPABILITIES); 64 247 65 - /* check capabilities can be retrieved with KVM_GET_MSR */ 66 - ASSERT_EQ(vcpu_get_msr(vcpu, MSR_IA32_PERF_CAPABILITIES), PMU_CAP_FW_WRITES); 248 + TEST_ASSERT(host_cap.full_width_write, 249 + "Full-width writes should always be supported"); 67 250 68 - /* check whatever we write with KVM_SET_MSR is _not_ modified */ 69 - vcpu_run(vcpu); 70 - ASSERT_EQ(vcpu_get_msr(vcpu, MSR_IA32_PERF_CAPABILITIES), PMU_CAP_FW_WRITES); 71 - 72 - /* testcase 2, check valid LBR 
formats are accepted */ 73 - vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, 0); 74 - ASSERT_EQ(vcpu_get_msr(vcpu, MSR_IA32_PERF_CAPABILITIES), 0); 75 - 76 - vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, host_cap.lbr_format); 77 - ASSERT_EQ(vcpu_get_msr(vcpu, MSR_IA32_PERF_CAPABILITIES), (u64)host_cap.lbr_format); 78 - 79 - /* 80 - * Testcase 3, check that an "invalid" LBR format is rejected. Only an 81 - * exact match of the host's format (and 0/disabled) is allowed. 82 - */ 83 - for (val = 1; val <= PMU_CAP_LBR_FMT; val++) { 84 - if (val == (host_cap.capabilities & PMU_CAP_LBR_FMT)) 85 - continue; 86 - 87 - ret = _vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, val); 88 - TEST_ASSERT(!ret, "Bad LBR FMT = 0x%lx didn't fail", val); 89 - } 90 - 91 - printf("Completed perf capability tests.\n"); 92 - kvm_vm_free(vm); 251 + test_basic_perf_capabilities(host_cap); 252 + test_fungible_perf_capabilities(host_cap); 253 + test_immutable_perf_capabilities(host_cap); 254 + test_guest_wrmsr_perf_capabilities(host_cap); 255 + test_lbr_perf_capabilities(host_cap); 93 256 }
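
The loops `for (val.lbr_format = 1; val.lbr_format; val.lbr_format++)` above rely on an unsigned bit-field wrapping to zero once it passes its maximum value, which ends the loop after every non-zero format has been visited. The sketch below illustrates that idiom in isolation. The `union caps` layout is an assumption for illustration (a 6-bit field matching the removed `PMU_CAP_LBR_FMT 0x3f` mask, with the remaining bits lumped together), `count_foreign_formats` is a hypothetical helper, and 64-bit bit-fields are an implementation-defined extension that GCC and Clang accept.

```c
#include <stdint.h>

/* Illustrative layout only; not the selftest's union perf_capabilities. */
union caps {
	struct {
		uint64_t lbr_format:6;	/* assumed width, cf. mask 0x3f */
		uint64_t rest:58;
	};
	uint64_t raw;
};

/*
 * Count every non-zero format value that differs from the host's.
 * The increment wraps the 6-bit field from 63 back to 0, which makes
 * the loop condition false and terminates the walk.
 */
static unsigned int count_foreign_formats(union caps host)
{
	union caps val = host;
	unsigned int n = 0;

	for (val.lbr_format = 1; val.lbr_format; val.lbr_format++) {
		if (val.lbr_format == host.lbr_format)
			continue;
		n++;
	}

	return n;
}
```

Iterating through the bit-field rather than through a separate integer keeps the other capability bits in `val` at their host values, so each attempted write differs from the host configuration only in the format field.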

+132

tools/testing/selftests/kvm/x86_64/xcr0_cpuid_test.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * XCR0 cpuid test 4 + * 5 + * Copyright (C) 2022, Google LLC. 6 + */ 7 + 8 + #include <fcntl.h> 9 + #include <stdio.h> 10 + #include <stdlib.h> 11 + #include <string.h> 12 + #include <sys/ioctl.h> 13 + 14 + #include "test_util.h" 15 + 16 + #include "kvm_util.h" 17 + #include "processor.h" 18 + 19 + /* 20 + * Assert that architectural dependency rules are satisfied, e.g. that AVX is 21 + * supported if and only if SSE is supported. 22 + */ 23 + #define ASSERT_XFEATURE_DEPENDENCIES(supported_xcr0, xfeatures, dependencies) \ 24 + do { \ 25 + uint64_t __supported = (supported_xcr0) & ((xfeatures) | (dependencies)); \ 26 + \ 27 + GUEST_ASSERT_3((__supported & (xfeatures)) != (xfeatures) || \ 28 + __supported == ((xfeatures) | (dependencies)), \ 29 + __supported, (xfeatures), (dependencies)); \ 30 + } while (0) 31 + 32 + /* 33 + * Assert that KVM reports a sane, usable as-is XCR0. Architecturally, a CPU 34 + * isn't strictly required to _support_ all XFeatures related to a feature, but 35 + * at the same time XSETBV will #GP if bundled XFeatures aren't enabled and 36 + * disabled coherently. E.g. a CPU can technically enumerate supported for 37 + * XTILE_CFG but not XTILE_DATA, but attempting to enable XTILE_CFG without 38 + * XTILE_DATA will #GP. 
39 + */ 40 + #define ASSERT_ALL_OR_NONE_XFEATURE(supported_xcr0, xfeatures) \ 41 + do { \ 42 + uint64_t __supported = (supported_xcr0) & (xfeatures); \ 43 + \ 44 + GUEST_ASSERT_2(!__supported || __supported == (xfeatures), \ 45 + __supported, (xfeatures)); \ 46 + } while (0) 47 + 48 + static void guest_code(void) 49 + { 50 + uint64_t xcr0_reset; 51 + uint64_t supported_xcr0; 52 + int i, vector; 53 + 54 + set_cr4(get_cr4() | X86_CR4_OSXSAVE); 55 + 56 + xcr0_reset = xgetbv(0); 57 + supported_xcr0 = this_cpu_supported_xcr0(); 58 + 59 + GUEST_ASSERT(xcr0_reset == XFEATURE_MASK_FP); 60 + 61 + /* Check AVX */ 62 + ASSERT_XFEATURE_DEPENDENCIES(supported_xcr0, 63 + XFEATURE_MASK_YMM, 64 + XFEATURE_MASK_SSE); 65 + 66 + /* Check MPX */ 67 + ASSERT_ALL_OR_NONE_XFEATURE(supported_xcr0, 68 + XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); 69 + 70 + /* Check AVX-512 */ 71 + ASSERT_XFEATURE_DEPENDENCIES(supported_xcr0, 72 + XFEATURE_MASK_AVX512, 73 + XFEATURE_MASK_SSE | XFEATURE_MASK_YMM); 74 + ASSERT_ALL_OR_NONE_XFEATURE(supported_xcr0, 75 + XFEATURE_MASK_AVX512); 76 + 77 + /* Check AMX */ 78 + ASSERT_ALL_OR_NONE_XFEATURE(supported_xcr0, 79 + XFEATURE_MASK_XTILE); 80 + 81 + vector = xsetbv_safe(0, supported_xcr0); 82 + GUEST_ASSERT_2(!vector, supported_xcr0, vector); 83 + 84 + for (i = 0; i < 64; i++) { 85 + if (supported_xcr0 & BIT_ULL(i)) 86 + continue; 87 + 88 + vector = xsetbv_safe(0, supported_xcr0 | BIT_ULL(i)); 89 + GUEST_ASSERT_3(vector == GP_VECTOR, supported_xcr0, vector, BIT_ULL(i)); 90 + } 91 + 92 + GUEST_DONE(); 93 + } 94 + 95 + int main(int argc, char *argv[]) 96 + { 97 + struct kvm_vcpu *vcpu; 98 + struct kvm_run *run; 99 + struct kvm_vm *vm; 100 + struct ucall uc; 101 + 102 + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_XSAVE)); 103 + 104 + vm = vm_create_with_one_vcpu(&vcpu, guest_code); 105 + run = vcpu->run; 106 + 107 + vm_init_descriptor_tables(vm); 108 + vcpu_init_descriptor_tables(vcpu); 109 + 110 + while (1) { 111 + vcpu_run(vcpu); 112 + 113 + 
TEST_ASSERT(run->exit_reason == KVM_EXIT_IO, 114 + "Unexpected exit reason: %u (%s),\n", 115 + run->exit_reason, 116 + exit_reason_str(run->exit_reason)); 117 + 118 + switch (get_ucall(vcpu, &uc)) { 119 + case UCALL_ABORT: 120 + REPORT_GUEST_ASSERT_3(uc, "0x%lx 0x%lx 0x%lx"); 121 + break; 122 + case UCALL_DONE: 123 + goto done; 124 + default: 125 + TEST_FAIL("Unknown ucall %lu", uc.cmd); 126 + } 127 + } 128 + 129 + done: 130 + kvm_vm_free(vm); 131 + return 0; 132 + }
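
The two assertion macros in the new test reduce to simple bitmask predicates. The sketch below restates them as plain functions so the logic can be checked outside a guest; the function names and the `XF_*` constants are hypothetical, with bit positions assumed from the architectural XCR0 layout (x87 = 0, SSE = 1, AVX/YMM = 2, XTILE_CFG = 17, XTILE_DATA = 18). Note that, as written, the dependency check is one-directional: it fails only when all of `xfeatures` are supported and some dependency is missing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed XCR0 bit positions, for illustration. */
#define XF_FP		(1ULL << 0)
#define XF_SSE		(1ULL << 1)
#define XF_YMM		(1ULL << 2)
#define XF_XTILE_CFG	(1ULL << 17)
#define XF_XTILE_DATA	(1ULL << 18)

/* Mirrors ASSERT_XFEATURE_DEPENDENCIES: features imply their dependencies. */
static bool xfeature_deps_ok(uint64_t supported_xcr0, uint64_t xfeatures,
			     uint64_t dependencies)
{
	uint64_t supported = supported_xcr0 & (xfeatures | dependencies);

	return (supported & xfeatures) != xfeatures ||
	       supported == (xfeatures | dependencies);
}

/* Mirrors ASSERT_ALL_OR_NONE_XFEATURE: bundled bits are all set or all clear. */
static bool xfeature_all_or_none(uint64_t supported_xcr0, uint64_t xfeatures)
{
	uint64_t supported = supported_xcr0 & xfeatures;

	return !supported || supported == xfeatures;
}
```

For example, a mask with YMM but without SSE fails the dependency check, and a mask with XTILE_CFG but without XTILE_DATA fails the all-or-none check, matching the XSETBV #GP cases the test comment describes.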

+14 -16

virt/kvm/kvm_main.c

··· 1301 1301 * At this point, pending calls to invalidate_range_start() 1302 1302 * have completed but no more MMU notifiers will run, so 1303 1303 * mn_active_invalidate_count may remain unbalanced. 1304 - * No threads can be waiting in install_new_memslots as the 1304 + * No threads can be waiting in kvm_swap_active_memslots() as the 1305 1305 * last reference on KVM has been dropped, but freeing 1306 1306 * memslots would deadlock without this manual intervention. 1307 1307 */ ··· 1745 1745 kvm_arch_flush_shadow_memslot(kvm, old); 1746 1746 kvm_arch_guest_memory_reclaimed(kvm); 1747 1747 1748 - /* Was released by kvm_swap_active_memslots, reacquire. */ 1748 + /* Was released by kvm_swap_active_memslots(), reacquire. */ 1749 1749 mutex_lock(&kvm->slots_arch_lock); 1750 1750 1751 1751 /* 1752 1752 * Copy the arch-specific field of the newly-installed slot back to the 1753 1753 * old slot as the arch data could have changed between releasing 1754 - * slots_arch_lock in install_new_memslots() and re-acquiring the lock 1754 + * slots_arch_lock in kvm_swap_active_memslots() and re-acquiring the lock 1755 1755 * above. Writers are required to retrieve memslots *after* acquiring 1756 1756 * slots_arch_lock, thus the active slot's data is guaranteed to be fresh. 1757 1757 */ ··· 1813 1813 int r; 1814 1814 1815 1815 /* 1816 - * Released in kvm_swap_active_memslots. 1816 + * Released in kvm_swap_active_memslots(). 1817 1817 * 1818 - * Must be held from before the current memslots are copied until 1819 - * after the new memslots are installed with rcu_assign_pointer, 1820 - * then released before the synchronize srcu in kvm_swap_active_memslots. 1818 + * Must be held from before the current memslots are copied until after 1819 + * the new memslots are installed with rcu_assign_pointer, then 1820 + * released before the synchronize srcu in kvm_swap_active_memslots(). 
1821 1821 * 1822 1822 * When modifying memslots outside of the slots_lock, must be held 1823 1823 * before reading the pointer to the current memslots until after all ··· 3869 3869 #ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS 3870 3870 static int vcpu_get_pid(void *data, u64 *val) 3871 3871 { 3872 - struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data; 3872 + struct kvm_vcpu *vcpu = data; 3873 3873 *val = pid_nr(rcu_access_pointer(vcpu->pid)); 3874 3874 return 0; 3875 3875 } ··· 4470 4470 return 0; 4471 4471 } 4472 4472 4473 - static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) 4473 + static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) 4474 4474 { 4475 4475 switch (arg) { 4476 4476 case KVM_CAP_USER_MEMORY: ··· 5047 5047 static long kvm_dev_ioctl(struct file *filp, 5048 5048 unsigned int ioctl, unsigned long arg) 5049 5049 { 5050 - long r = -EINVAL; 5050 + int r = -EINVAL; 5051 5051 5052 5052 switch (ioctl) { 5053 5053 case KVM_GET_API_VERSION: ··· 5574 5574 const char *fmt) 5575 5575 { 5576 5576 int ret; 5577 - struct kvm_stat_data *stat_data = (struct kvm_stat_data *) 5578 - inode->i_private; 5577 + struct kvm_stat_data *stat_data = inode->i_private; 5579 5578 5580 5579 /* 5581 5580 * The debugfs files are a reference to the kvm struct which ··· 5595 5596 5596 5597 static int kvm_debugfs_release(struct inode *inode, struct file *file) 5597 5598 { 5598 - struct kvm_stat_data *stat_data = (struct kvm_stat_data *) 5599 - inode->i_private; 5599 + struct kvm_stat_data *stat_data = inode->i_private; 5600 5600 5601 5601 simple_attr_release(inode, file); 5602 5602 kvm_put_kvm(stat_data->kvm); ··· 5644 5646 static int kvm_stat_data_get(void *data, u64 *val) 5645 5647 { 5646 5648 int r = -EFAULT; 5647 - struct kvm_stat_data *stat_data = (struct kvm_stat_data *)data; 5649 + struct kvm_stat_data *stat_data = data; 5648 5650 5649 5651 switch (stat_data->kind) { 5650 5652 case KVM_STAT_VM: ··· 5663 5665 static int kvm_stat_data_clear(void 
*data, u64 val) 5664 5666 { 5665 5667 int r = -EFAULT; 5666 - struct kvm_stat_data *stat_data = (struct kvm_stat_data *)data; 5668 + struct kvm_stat_data *stat_data = data; 5667 5669 5668 5670 if (val) 5669 5671 return -EINVAL;
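
Several hunks above drop explicit casts such as `(struct kvm_stat_data *)data`. In C, a `void *` converts implicitly to any object pointer type, so the cast adds nothing. A minimal sketch of the same callback shape, with a hypothetical `struct stat_data` and `stat_data_get` standing in for the kernel types:

```c
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_stat_data. */
struct stat_data {
	int kind;
	uint64_t value;
};

/* Callback taking an opaque pointer, as debugfs-style getters do. */
static int stat_data_get(void *data, uint64_t *val)
{
	struct stat_data *stat = data;	/* implicit void * conversion, no cast */

	*val = stat->value;
	return 0;
}
```

This holds for C only; C++ requires an explicit cast for the same assignment.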
