Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
"ARM:

- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner

- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ

- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU

- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM

- Minor fixes to KVM and selftests

LoongArch:

- Get VM PMU capability from HW GCFG register

- Add AVEC basic support

- Use 64-bit register definition for EIOINTC

- Add KVM timer test cases for tools/selftests

RISC-V:

- SBI message passing (MPXY) support for KVM guest

- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file

- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks

- Fix guest page fault within HLV* instructions

- Flush VS-stage TLB after VCPU migration for Andes cores

s390:

- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is an increased number of exits
(and worse performance) on z10 and earlier processors; ESCA was
introduced with z114/z196 in 2010

- VIRT_XFER_TO_GUEST_WORK support

- Operation exception forwarding support

- Cleanups

x86:

- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap

- Relocate a misplaced export

- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM

- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down

- Use the checked version of {get,put}_user()

- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host

- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections

- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS

- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper over a bug in the core #MC code, and that has long since been
fixed

- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions

x86 (AMD):

- Fix a few missing "VMCB dirty" bugs

- Fix the worst of KVM's lack of EFER.LMSLE emulation

- Add AVIC support for addressing 4k vCPUs in x2AVIC mode

- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions

- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT

- Fix a bug where KVM would corrupt the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3

- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM

x86 (Intel):

- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush

- Add a few missing nested consistency checks

- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform

- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter

- Misc cleanups

- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertently trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)

- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through

- Fix a few sparse warnings (bad annotation, 0 != NULL)

- Use struct_size() to simplify copying TDX capabilities to userspace

- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected

Selftests:

- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM

- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line

- Extend a bunch of nested VMX tests to validate nested SVM as well

- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not

- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT

guest_memfd:

- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way

- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references

- Enhance KVM selftests to make it easier to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors

- Misc cleanups

Generic:

- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup

- Fix a goof in the dirty ring documentation

- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...

+6303 -2635
+66 -4
Documentation/virt/kvm/api.rst
··· 7286 7286
 it will enter with output fields already valid; in the common case, the
 ``unknown.ret`` field of the union will be ``TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED``.
 Userspace need not do anything if it does not wish to support a TDVMCALL.
+
+::
+
+        /* KVM_EXIT_ARM_SEA */
+        struct {
+#define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID    (1ULL << 0)
+            __u64 flags;
+            __u64 esr;
+            __u64 gva;
+            __u64 gpa;
+        } arm_sea;
+
+Used on arm64 systems. When the VM capability ``KVM_CAP_ARM_SEA_TO_USER`` is
+enabled, a KVM exits to userspace if a guest access causes a synchronous
+external abort (SEA) and the host APEI fails to handle the SEA.
+
+``esr`` is set to a sanitized value of ESR_EL2 from the exception taken to KVM,
+consisting of the following fields:
+
+- ``ESR_EL2.EC``
+- ``ESR_EL2.IL``
+- ``ESR_EL2.FnV``
+- ``ESR_EL2.EA``
+- ``ESR_EL2.CM``
+- ``ESR_EL2.WNR``
+- ``ESR_EL2.FSC``
+- ``ESR_EL2.SET`` (when FEAT_RAS is implemented for the VM)
+
+``gva`` is set to the value of FAR_EL2 from the exception taken to KVM when
+``ESR_EL2.FnV == 0``. Otherwise, the value of ``gva`` is unknown.
+
+``gpa`` is set to the faulting IPA from the exception taken to KVM when
+the ``KVM_EXIT_ARM_SEA_FLAG_GPA_VALID`` flag is set. Otherwise, the value of
+``gpa`` is unknown.
+
 ::
 
     /* Fix the size of the union. */
··· 7855 7820
 :Architectures: s390
 :Parameters: none
 
-With this capability enabled, all illegal instructions 0x0000 (2 bytes) will
+With this capability enabled, the illegal instruction 0x0000 (2 bytes) will
 be intercepted and forwarded to user space. User space can use this
 mechanism e.g. to realize 2-byte software breakpoints. The kernel will
 not inject an operating exception for these instructions, user space has
··· 8063 8028
 dirty logging can be enabled gradually in small chunks on the first call
 to KVM_CLEAR_DIRTY_LOG. KVM_DIRTY_LOG_INITIALLY_SET depends on
 KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (it is also only available on
-x86 and arm64 for now).
+x86, arm64 and riscv for now).
 
 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
··· 8559 8524
 the dirty pages.
 
 The dirty ring can get full. When it happens, the KVM_RUN of the
-vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.
+vcpu will return with exit reason KVM_EXIT_DIRTY_RING_FULL.
 
 The dirty ring interface has a major difference comparing to the
 KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
··· 8727 8692
 When this capability is enabled, KVM resets the VCPU when setting
 MP_STATE_INIT_RECEIVED through IOCTL. The original MP_STATE is preserved.
 
-7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
+7.44 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED
 -------------------------------------------
 
 :Architectures: arm64
··· 8737 8702
 This capability indicate to the userspace whether a PFNMAP memory region
 can be safely mapped as cacheable. This relies on the presence of
 force write back (FWB) feature support on the hardware.
+
+7.45 KVM_CAP_ARM_SEA_TO_USER
+----------------------------
+
+:Architecture: arm64
+:Target: VM
+:Parameters: none
+:Returns: 0 on success, -EINVAL if unsupported.
+
+When this capability is enabled, KVM may exit to userspace for SEAs taken to
+EL2 resulting from a guest access. See ``KVM_EXIT_ARM_SEA`` for more
+information.
+
+7.46 KVM_CAP_S390_USER_OPEREXEC
+-------------------------------
+
+:Architectures: s390
+:Parameters: none
+
+When this capability is enabled KVM forwards all operation exceptions
+that it doesn't handle itself to user space. This also includes the
+0x0000 instructions managed by KVM_CAP_S390_USER_INSTR0. This is
+helpful if user space wants to emulate instructions which are not
+(yet) implemented in hardware.
+
+This capability can be enabled dynamically even if VCPUs were already
+created and are running.
 
 8. Other capabilities.
 ======================
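As an editor's illustration of the new API above, here is a minimal userspace
sketch of how a VMM might consume KVM_EXIT_ARM_SEA, assuming up-to-date kernel
headers. handle_sea_at_gpa() is a hypothetical VMM callback, not part of KVM;
everything else (KVM_ENABLE_CAP, kvm_run, the new arm_sea fields) comes from
the documentation above.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Hypothetical VMM callback: poison/unmap the page, or inject an abort. */
    void handle_sea_at_gpa(uint64_t gpa, uint64_t esr, uint64_t gva);

    static int enable_sea_exits(int vm_fd)
    {
        struct kvm_enable_cap cap;

        memset(&cap, 0, sizeof(cap));
        cap.cap = KVM_CAP_ARM_SEA_TO_USER;  /* VM-scoped, takes no arguments */
        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }

    /* Called from the vCPU run loop after KVM_RUN returns 0. */
    static void handle_exit(struct kvm_run *run)
    {
        if (run->exit_reason != KVM_EXIT_ARM_SEA)
            return;

        /* esr is already sanitized to the documented field mask;
         * gva is only meaningful when ESR_EL2.FnV == 0. */
        if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GPA_VALID)
            handle_sea_at_gpa(run->arm_sea.gpa, run->arm_sea.esr,
                              run->arm_sea.gva);
    }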
+8 -1
Documentation/virt/kvm/x86/errata.rst
··· 48 48
 Nested virtualization features
 ------------------------------
 
-TBD
+On AMD CPUs, when GIF is cleared, #DB exceptions or traps due to a breakpoint
+register match are ignored and discarded by the CPU. The CPU relies on the VMM
+to fully virtualize this behavior, even when vGIF is enabled for the guest
+(i.e. vGIF=0 does not cause the CPU to drop #DBs when the guest is running).
+KVM does not virtualize this behavior as the complexity is unjustified given
+the rarity of the use case. One way to handle this would be for KVM to
+intercept the #DB, temporarily disable the breakpoint, single-step over the
+instruction, then re-enable the breakpoint.
 
 x2APIC
 ------
+1
arch/arm64/include/asm/kvm_arm.h
··· 111 111
 #define TCR_EL2_DS        (1UL << 32)
 #define TCR_EL2_RES1      ((1U << 31) | (1 << 23))
 #define TCR_EL2_HPD       (1 << 24)
+#define TCR_EL2_HA        (1 << 21)
 #define TCR_EL2_TBI       (1 << 20)
 #define TCR_EL2_PS_SHIFT  16
 #define TCR_EL2_PS_MASK   (7 << TCR_EL2_PS_SHIFT)
+4 -4
arch/arm64/include/asm/kvm_asm.h
··· 79 79
     __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_range,
     __KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context,
     __KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
-    __KVM_HOST_SMCCC_FUNC___vgic_v3_save_vmcr_aprs,
+    __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs,
     __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs,
     __KVM_HOST_SMCCC_FUNC___pkvm_reserve_vm,
     __KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm,
··· 246 246
 extern int __kvm_tlbi_s1e2(struct kvm_s2_mmu *mmu, u64 va, u64 sys_encoding);
 
 extern void __kvm_timer_set_cntvoff(u64 cntvoff);
-extern void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
-extern void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
-extern void __kvm_at_s12(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
+extern int __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
+extern int __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
+extern int __kvm_at_s12(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
 
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
+3
arch/arm64/include/asm/kvm_host.h
··· 54 54
 #define KVM_REQ_NESTED_S2_UNMAP        KVM_ARCH_REQ(8)
 #define KVM_REQ_GUEST_HYP_IRQ_PENDING  KVM_ARCH_REQ(9)
 #define KVM_REQ_MAP_L1_VNCR_EL2        KVM_ARCH_REQ(10)
+#define KVM_REQ_VGIC_PROCESS_UPDATE    KVM_ARCH_REQ(11)
 
 #define KVM_DIRTY_LOG_MANUAL_CAPS  (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
                                     KVM_DIRTY_LOG_INITIALLY_SET)
··· 351 350
 #define KVM_ARCH_FLAG_GUEST_HAS_SVE             9
     /* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
 #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS      10
+    /* Unhandled SEAs are taken to userspace */
+#define KVM_ARCH_FLAG_EXIT_SEA                  11
     unsigned long flags;
 
     /* VM-wide vCPU feature set */
+2 -1
arch/arm64/include/asm/kvm_hyp.h
··· 77 77
 int __vgic_v2_perform_cpuif_access(struct kvm_vcpu *vcpu);
 
 u64 __gic_v3_get_lr(unsigned int lr);
+void __gic_v3_set_lr(u64 val, int lr);
 
 void __vgic_v3_save_state(struct vgic_v3_cpu_if *cpu_if);
 void __vgic_v3_restore_state(struct vgic_v3_cpu_if *cpu_if);
 void __vgic_v3_activate_traps(struct vgic_v3_cpu_if *cpu_if);
 void __vgic_v3_deactivate_traps(struct vgic_v3_cpu_if *cpu_if);
-void __vgic_v3_save_vmcr_aprs(struct vgic_v3_cpu_if *cpu_if);
+void __vgic_v3_save_aprs(struct vgic_v3_cpu_if *cpu_if);
 void __vgic_v3_restore_vmcr_aprs(struct vgic_v3_cpu_if *cpu_if);
 int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu);
+38 -2
arch/arm64/include/asm/kvm_nested.h
··· 120 120
     return trans->writable;
 }
 
-static inline bool kvm_s2_trans_executable(struct kvm_s2_trans *trans)
+static inline bool kvm_has_xnx(struct kvm *kvm)
 {
-    return !(trans->desc & BIT(54));
+    return cpus_have_final_cap(ARM64_HAS_XNX) &&
+           kvm_has_feat(kvm, ID_AA64MMFR1_EL1, XNX, IMP);
+}
+
+static inline bool kvm_s2_trans_exec_el0(struct kvm *kvm, struct kvm_s2_trans *trans)
+{
+    u8 xn = FIELD_GET(KVM_PTE_LEAF_ATTR_HI_S2_XN, trans->desc);
+
+    if (!kvm_has_xnx(kvm))
+        xn &= FIELD_PREP(KVM_PTE_LEAF_ATTR_HI_S2_XN, 0b10);
+
+    switch (xn) {
+    case 0b00:
+    case 0b01:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static inline bool kvm_s2_trans_exec_el1(struct kvm *kvm, struct kvm_s2_trans *trans)
+{
+    u8 xn = FIELD_GET(KVM_PTE_LEAF_ATTR_HI_S2_XN, trans->desc);
+
+    if (!kvm_has_xnx(kvm))
+        xn &= FIELD_PREP(KVM_PTE_LEAF_ATTR_HI_S2_XN, 0b10);
+
+    switch (xn) {
+    case 0b00:
+    case 0b11:
+        return true;
+    default:
+        return false;
+    }
 }
 
 extern int kvm_walk_nested_s2(struct kvm_vcpu *vcpu, phys_addr_t gipa,
··· 353 320
     bool be;
     bool s2;
     bool pa52bit;
+    bool ha;
 };
 
 struct s1_walk_result {
··· 403 369
     BUG_ON(__c >= NR_CPUS);            \
     (FIX_VNCR - __c);                  \
 })
+
+int __kvm_at_swap_desc(struct kvm *kvm, gpa_t ipa, u64 old, u64 new);
 
 #endif /* __ARM64_KVM_NESTED_H */
+42 -7
arch/arm64/include/asm/kvm_pgtable.h
··· 89 89
 
 #define KVM_PTE_LEAF_ATTR_HI_S1_XN    BIT(54)
 
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN    BIT(54)
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN    GENMASK(54, 53)
 
 #define KVM_PTE_LEAF_ATTR_HI_S1_GP    BIT(50)
··· 240 240
 
 /**
  * enum kvm_pgtable_prot - Page-table permissions and attributes.
- * @KVM_PGTABLE_PROT_X:        Execute permission.
+ * @KVM_PGTABLE_PROT_UX:       Unprivileged execute permission.
+ * @KVM_PGTABLE_PROT_PX:       Privileged execute permission.
+ * @KVM_PGTABLE_PROT_X:        Privileged and unprivileged execute permission.
  * @KVM_PGTABLE_PROT_W:        Write permission.
  * @KVM_PGTABLE_PROT_R:        Read permission.
  * @KVM_PGTABLE_PROT_DEVICE:   Device attributes.
··· 253 251
  * @KVM_PGTABLE_PROT_SW3:      Software bit 3.
  */
 enum kvm_pgtable_prot {
-    KVM_PGTABLE_PROT_X          = BIT(0),
-    KVM_PGTABLE_PROT_W          = BIT(1),
-    KVM_PGTABLE_PROT_R          = BIT(2),
+    KVM_PGTABLE_PROT_PX         = BIT(0),
+    KVM_PGTABLE_PROT_UX         = BIT(1),
+    KVM_PGTABLE_PROT_X          = KVM_PGTABLE_PROT_PX |
+                                  KVM_PGTABLE_PROT_UX,
+    KVM_PGTABLE_PROT_W          = BIT(2),
+    KVM_PGTABLE_PROT_R          = BIT(3),
 
-    KVM_PGTABLE_PROT_DEVICE     = BIT(3),
-    KVM_PGTABLE_PROT_NORMAL_NC  = BIT(4),
+    KVM_PGTABLE_PROT_DEVICE     = BIT(4),
+    KVM_PGTABLE_PROT_NORMAL_NC  = BIT(5),
 
     KVM_PGTABLE_PROT_SW0        = BIT(55),
     KVM_PGTABLE_PROT_SW1        = BIT(56),
··· 360 355
     return pteref;
 }
 
+static inline kvm_pte_t *kvm_dereference_pteref_raw(kvm_pteref_t pteref)
+{
+    return pteref;
+}
+
 static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
 {
     /*
··· 392 382
                        kvm_pteref_t pteref)
 {
     return rcu_dereference_check(pteref, !(walker->flags & KVM_PGTABLE_WALK_SHARED));
+}
+
+static inline kvm_pte_t *kvm_dereference_pteref_raw(kvm_pteref_t pteref)
+{
+    return rcu_dereference_raw(pteref);
 }
 
 static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
··· 565 550
  * to freeing and therefore no TLB invalidation is performed.
  */
 void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
+
+/**
+ * kvm_pgtable_stage2_destroy_range() - Destroy the unlinked range of addresses.
+ * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @addr:  Intermediate physical address at which to place the mapping.
+ * @size:  Size of the mapping.
+ *
+ * The page-table is assumed to be unreachable by any hardware walkers prior
+ * to freeing and therefore no TLB invalidation is performed.
+ */
+void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
+                                      u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_destroy_pgd() - Destroy the PGD of guest stage-2 page-table.
+ * @pgt:   Page-table structure initialised by kvm_pgtable_stage2_init*().
+ *
+ * It is assumed that the rest of the page-table is freed before this operation.
+ */
+void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt);
 
 /**
  * kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
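For reference, the stage-2 XN[1:0] encodings implied by the helpers in this
series (with FEAT_XNX; without it only the top bit is programmed, so EL1 and
EL0 execute permission cannot differ):

  XN[1:0]   EL1 exec   EL0 exec   kvm_pgtable_prot
  0b00      yes        yes        KVM_PGTABLE_PROT_PX | KVM_PGTABLE_PROT_UX (PROT_X)
  0b01      no         yes        KVM_PGTABLE_PROT_UX
  0b10      no         no         (none)
  0b11      yes        no         KVM_PGTABLE_PROT_PX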
+3 -1
arch/arm64/include/asm/kvm_pkvm.h
··· 180 180
 
 int pkvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
                              struct kvm_pgtable_mm_ops *mm_ops);
-void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
+void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
+                                       u64 addr, u64 size);
+void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt);
 int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
                             enum kvm_pgtable_prot prot, void *mc,
                             enum kvm_pgtable_walk_flags flags);
+6 -1
arch/arm64/include/asm/virt.h
··· 40 40
  */
 #define HVC_FINALISE_EL2    3
 
+/*
+ * HVC_GET_ICH_VTR_EL2 - Retrieve the ICH_VTR_EL2 value
+ */
+#define HVC_GET_ICH_VTR_EL2    4
+
 /* Max number of HYP stub hypercalls */
-#define HVC_STUB_HCALL_NR    4
+#define HVC_STUB_HCALL_NR    5
 
 /* Error returned when an invalid stub number is passed into x0 */
 #define HVC_STUB_ERR    0xbadca11
+59
arch/arm64/kernel/cpufeature.c
··· 2304 2304
 }
 #endif
 
+static bool can_trap_icv_dir_el1(const struct arm64_cpu_capabilities *entry,
+                                 int scope)
+{
+    static const struct midr_range has_vgic_v3[] = {
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M1_ICESTORM),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M1_FIRESTORM),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M1_ICESTORM_PRO),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M1_FIRESTORM_PRO),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M1_ICESTORM_MAX),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M1_FIRESTORM_MAX),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M2_BLIZZARD),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M2_AVALANCHE),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M2_BLIZZARD_PRO),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M2_AVALANCHE_PRO),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M2_BLIZZARD_MAX),
+        MIDR_ALL_VERSIONS(MIDR_APPLE_M2_AVALANCHE_MAX),
+        {},
+    };
+    struct arm_smccc_res res = {};
+
+    BUILD_BUG_ON(ARM64_HAS_ICH_HCR_EL2_TDIR <= ARM64_HAS_GICV3_CPUIF);
+    BUILD_BUG_ON(ARM64_HAS_ICH_HCR_EL2_TDIR <= ARM64_HAS_GICV5_LEGACY);
+    if (!this_cpu_has_cap(ARM64_HAS_GICV3_CPUIF) &&
+        !is_midr_in_range_list(has_vgic_v3))
+        return false;
+
+    if (!is_hyp_mode_available())
+        return false;
+
+    if (this_cpu_has_cap(ARM64_HAS_GICV5_LEGACY))
+        return true;
+
+    if (is_kernel_in_hyp_mode())
+        res.a1 = read_sysreg_s(SYS_ICH_VTR_EL2);
+    else
+        arm_smccc_1_1_hvc(HVC_GET_ICH_VTR_EL2, &res);
+
+    if (res.a0 == HVC_STUB_ERR)
+        return false;
+
+    return res.a1 & ICH_VTR_EL2_TDS;
+}
+
 #ifdef CONFIG_ARM64_BTI
 static void bti_enable(const struct arm64_cpu_capabilities *__unused)
 {
··· 2858 2815
         .matches = has_gic_prio_relaxed_sync,
     },
 #endif
+    {
+        /*
+         * Depends on having GICv3
+         */
+        .desc = "ICV_DIR_EL1 trapping",
+        .capability = ARM64_HAS_ICH_HCR_EL2_TDIR,
+        .type = ARM64_CPUCAP_EARLY_LOCAL_CPU_FEATURE,
+        .matches = can_trap_icv_dir_el1,
+    },
 #ifdef CONFIG_ARM64_E0PD
     {
         .desc = "E0PD",
··· 3140 3088
         .type = ARM64_CPUCAP_EARLY_LOCAL_CPU_FEATURE,
         .capability = ARM64_HAS_GICV5_LEGACY,
         .matches = test_has_gicv5_legacy,
+    },
+    {
+        .desc = "XNX",
+        .capability = ARM64_HAS_XNX,
+        .type = ARM64_CPUCAP_SYSTEM_FEATURE,
+        .matches = has_cpuid_feature,
+        ARM64_CPUID_FIELDS(ID_AA64MMFR1_EL1, XNX, IMP)
     },
     {},
 };
+5
arch/arm64/kernel/hyp-stub.S
··· 54 54
 1:  cmp     x0, #HVC_FINALISE_EL2
     b.eq    __finalise_el2
 
+    cmp     x0, #HVC_GET_ICH_VTR_EL2
+    b.ne    2f
+    mrs_s   x1, SYS_ICH_VTR_EL2
+    b       9f
+
 2:  cmp     x0, #HVC_SOFT_RESTART
     b.ne    3f
     mov     x0, x2
+1
arch/arm64/kernel/image-vars.h
··· 91 91
 KVM_NVHE_ALIAS(spectre_bhb_patch_wa3);
 KVM_NVHE_ALIAS(spectre_bhb_patch_clearbhb);
 KVM_NVHE_ALIAS(alt_cb_patch_nops);
+KVM_NVHE_ALIAS(kvm_compute_ich_hcr_trap_bits);
 
 /* Global kernel state accessed by nVHE hyp code. */
 KVM_NVHE_ALIAS(kvm_vgic_global_state);
+17 -3
arch/arm64/kvm/arm.c
··· 132 132
         }
         mutex_unlock(&kvm->lock);
         break;
+    case KVM_CAP_ARM_SEA_TO_USER:
+        r = 0;
+        set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
+        break;
     default:
         break;
     }
··· 331 327
     case KVM_CAP_IRQFD_RESAMPLE:
     case KVM_CAP_COUNTER_OFFSET:
     case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
+    case KVM_CAP_ARM_SEA_TO_USER:
         r = 1;
         break;
     case KVM_CAP_SET_GUEST_DEBUG2:
··· 445 440
     if (!has_vhe())
         return kzalloc(sz, GFP_KERNEL_ACCOUNT);
 
-    return __vmalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM | __GFP_ZERO);
+    return kvzalloc(sz, GFP_KERNEL_ACCOUNT);
 }
 
 int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
··· 664 659
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
     if (is_protected_kvm_enabled()) {
-        kvm_call_hyp(__vgic_v3_save_vmcr_aprs,
-                     &vcpu->arch.vgic_cpu.vgic_v3);
+        kvm_call_hyp(__vgic_v3_save_aprs, &vcpu->arch.vgic_cpu.vgic_v3);
         kvm_call_hyp_nvhe(__pkvm_vcpu_put);
     }
··· 1045 1041
          * that a VCPU sees new virtual interrupts.
          */
         kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
+
+        /* Process interrupts deactivated through a trap */
+        if (kvm_check_request(KVM_REQ_VGIC_PROCESS_UPDATE, vcpu))
+            kvm_vgic_process_async_update(vcpu);
 
         if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
             kvm_update_stolen_time(vcpu);
··· 1841 1833
     }
 
     return r;
+}
+
+long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,
+                                  unsigned long arg)
+{
+    return -ENOIOCTLCMD;
 }
 
 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
+176 -20
arch/arm64/kvm/at.c
··· 346 346
 
     wi->baddr &= GENMASK_ULL(wi->max_oa_bits - 1, x);
 
+    wi->ha = kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HAFDBS, AF);
+    wi->ha &= (wi->regime == TR_EL2 ?
+               FIELD_GET(TCR_EL2_HA, tcr) :
+               FIELD_GET(TCR_HA, tcr));
+
     return 0;
 
 addrsz:
··· 367 362
     return -EFAULT;
 }
 
+static int kvm_read_s1_desc(struct kvm_vcpu *vcpu, u64 pa, u64 *desc,
+                            struct s1_walk_info *wi)
+{
+    u64 val;
+    int r;
+
+    r = kvm_read_guest(vcpu->kvm, pa, &val, sizeof(val));
+    if (r)
+        return r;
+
+    if (wi->be)
+        *desc = be64_to_cpu((__force __be64)val);
+    else
+        *desc = le64_to_cpu((__force __le64)val);
+
+    return 0;
+}
+
+static int kvm_swap_s1_desc(struct kvm_vcpu *vcpu, u64 pa, u64 old, u64 new,
+                            struct s1_walk_info *wi)
+{
+    if (wi->be) {
+        old = (__force u64)cpu_to_be64(old);
+        new = (__force u64)cpu_to_be64(new);
+    } else {
+        old = (__force u64)cpu_to_le64(old);
+        new = (__force u64)cpu_to_le64(new);
+    }
+
+    return __kvm_at_swap_desc(vcpu->kvm, pa, old, new);
+}
+
 static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
                    struct s1_walk_result *wr, u64 va)
 {
-    u64 va_top, va_bottom, baddr, desc;
+    u64 va_top, va_bottom, baddr, desc, new_desc, ipa;
     int level, stride, ret;
 
     level = wi->sl;
··· 412 375
     va_top = get_ia_size(wi) - 1;
 
     while (1) {
-        u64 index, ipa;
+        u64 index;
 
         va_bottom = (3 - level) * stride + wi->pgshift;
         index = (va & GENMASK_ULL(va_top, va_bottom)) >> (va_bottom - 3);
··· 451 414
             return ret;
         }
 
-        ret = kvm_read_guest(vcpu->kvm, ipa, &desc, sizeof(desc));
+        ret = kvm_read_s1_desc(vcpu, ipa, &desc, wi);
         if (ret) {
             fail_s1_walk(wr, ESR_ELx_FSC_SEA_TTW(level), false);
             return ret;
         }
 
-        if (wi->be)
-            desc = be64_to_cpu((__force __be64)desc);
-        else
-            desc = le64_to_cpu((__force __le64)desc);
+        new_desc = desc;
 
         /* Invalid descriptor */
         if (!(desc & BIT(0)))
··· 510 476
     baddr = desc_to_oa(wi, desc);
     if (check_output_size(baddr & GENMASK(52, va_bottom), wi))
         goto addrsz;
+
+    if (wi->ha)
+        new_desc |= PTE_AF;
+
+    if (new_desc != desc) {
+        ret = kvm_swap_s1_desc(vcpu, ipa, desc, new_desc, wi);
+        if (ret)
+            return ret;
+
+        desc = new_desc;
+    }
 
     if (!(desc & PTE_AF)) {
         fail_s1_walk(wr, ESR_ELx_FSC_ACCESS_L(level), false);
··· 1266 1221
     wr->pr &= !pan;
 }
 
-static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
+static int handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr, u64 *par)
 {
     struct s1_walk_result wr = {};
     struct s1_walk_info wi = {};
··· 1291 1246
 
     srcu_read_unlock(&vcpu->kvm->srcu, idx);
 
+    /*
+     * Race to update a descriptor -- restart the walk.
+     */
+    if (ret == -EAGAIN)
+        return ret;
     if (ret)
         goto compute_par;
··· 1329 1279
         fail_s1_walk(&wr, ESR_ELx_FSC_PERM_L(wr.level), false);
 
 compute_par:
-    return compute_par_s1(vcpu, &wi, &wr);
+    *par = compute_par_s1(vcpu, &wi, &wr);
+    return 0;
 }
 
 /*
··· 1458 1407
             !(par & SYS_PAR_EL1_S));
 }
 
-void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
+int __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 {
     u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
+    int ret;
 
     /*
      * If PAR_EL1 reports that AT failed on a S1 permission or access
··· 1473 1421
      */
     if ((par & SYS_PAR_EL1_F) &&
         !par_check_s1_perm_fault(par) &&
-        !par_check_s1_access_fault(par))
-        par = handle_at_slow(vcpu, op, vaddr);
+        !par_check_s1_access_fault(par)) {
+        ret = handle_at_slow(vcpu, op, vaddr, &par);
+        if (ret)
+            return ret;
+    }
 
     vcpu_write_sys_reg(vcpu, par, PAR_EL1);
+    return 0;
 }
 
-void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
+int __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 {
     u64 par;
+    int ret;
 
     /*
      * We've trapped, so everything is live on the CPU. As we will be
··· 1533 1476
     }
 
     /* We failed the translation, let's replay it in slow motion */
-    if ((par & SYS_PAR_EL1_F) && !par_check_s1_perm_fault(par))
-        par = handle_at_slow(vcpu, op, vaddr);
+    if ((par & SYS_PAR_EL1_F) && !par_check_s1_perm_fault(par)) {
+        ret = handle_at_slow(vcpu, op, vaddr, &par);
+        if (ret)
+            return ret;
+    }
 
     vcpu_write_sys_reg(vcpu, par, PAR_EL1);
+    return 0;
 }
 
-void __kvm_at_s12(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
+int __kvm_at_s12(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 {
     struct kvm_s2_trans out = {};
     u64 ipa, par;
··· 1570 1509
         break;
     default:
         WARN_ON_ONCE(1);
-        return;
+        return 0;
     }
 
     __kvm_at_s1e01(vcpu, op, vaddr);
     par = vcpu_read_sys_reg(vcpu, PAR_EL1);
     if (par & SYS_PAR_EL1_F)
-        return;
+        return 0;
 
     /*
      * If we only have a single stage of translation (EL2&0), exit
··· 1584 1523
      */
     if (compute_translation_regime(vcpu, op) == TR_EL20 ||
         !(vcpu_read_sys_reg(vcpu, HCR_EL2) & (HCR_VM | HCR_DC)))
-        return;
+        return 0;
 
     /* Do the stage-2 translation */
     ipa = (par & GENMASK_ULL(47, 12)) | (vaddr & GENMASK_ULL(11, 0));
     out.esr = 0;
     ret = kvm_walk_nested_s2(vcpu, ipa, &out);
     if (ret < 0)
-        return;
+        return ret;
 
     /* Check the access permission */
     if (!out.esr &&
··· 1600 1539
 
     par = compute_par_s12(vcpu, par, &out);
     vcpu_write_sys_reg(vcpu, par, PAR_EL1);
+    return 0;
 }
 
 /*
··· 1698 1636
         /* Any other error... */
         return ret;
     }
+}
+
+#ifdef CONFIG_ARM64_LSE_ATOMICS
+static int __lse_swap_desc(u64 __user *ptep, u64 old, u64 new)
+{
+    u64 tmp = old;
+    int ret = 0;
+
+    uaccess_enable_privileged();
+
+    asm volatile(__LSE_PREAMBLE
+                 "1: cas %[old], %[new], %[addr]\n"
+                 "2:\n"
+                 _ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[ret])
+                 : [old] "+r" (old), [addr] "+Q" (*ptep), [ret] "+r" (ret)
+                 : [new] "r" (new)
+                 : "memory");
+
+    uaccess_disable_privileged();
+
+    if (ret)
+        return ret;
+    if (tmp != old)
+        return -EAGAIN;
+
+    return ret;
+}
+#else
+static int __lse_swap_desc(u64 __user *ptep, u64 old, u64 new)
+{
+    return -EINVAL;
+}
+#endif
+
+static int __llsc_swap_desc(u64 __user *ptep, u64 old, u64 new)
+{
+    int ret = 1;
+    u64 tmp;
+
+    uaccess_enable_privileged();
+
+    asm volatile("prfm pstl1strm, %[addr]\n"
+                 "1: ldxr %[tmp], %[addr]\n"
+                 "   sub  %[tmp], %[tmp], %[old]\n"
+                 "   cbnz %[tmp], 3f\n"
+                 "2: stlxr %w[ret], %[new], %[addr]\n"
+                 "3:\n"
+                 _ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w[ret])
+                 _ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w[ret])
+                 : [ret] "+r" (ret), [addr] "+Q" (*ptep), [tmp] "=&r" (tmp)
+                 : [old] "r" (old), [new] "r" (new)
+                 : "memory");
+
+    uaccess_disable_privileged();
+
+    /* STLXR didn't update the descriptor, or the compare failed */
+    if (ret == 1)
+        return -EAGAIN;
+
+    return ret;
+}
+
+int __kvm_at_swap_desc(struct kvm *kvm, gpa_t ipa, u64 old, u64 new)
+{
+    struct kvm_memory_slot *slot;
+    unsigned long hva;
+    u64 __user *ptep;
+    bool writable;
+    int offset;
+    gfn_t gfn;
+    int r;
+
+    lockdep_assert(srcu_read_lock_held(&kvm->srcu));
+
+    gfn = ipa >> PAGE_SHIFT;
+    offset = offset_in_page(ipa);
+    slot = gfn_to_memslot(kvm, gfn);
+    hva = gfn_to_hva_memslot_prot(slot, gfn, &writable);
+    if (kvm_is_error_hva(hva))
+        return -EINVAL;
+    if (!writable)
+        return -EPERM;
+
+    ptep = (u64 __user *)hva + offset;
+    if (cpus_have_final_cap(ARM64_HAS_LSE_ATOMICS))
+        r = __lse_swap_desc(ptep, old, new);
+    else
+        r = __llsc_swap_desc(ptep, old, new);
+
+    if (r < 0)
+        return r;
+
+    mark_page_dirty_in_slot(kvm, slot, gfn);
+    return 0;
 }
+4 -3
arch/arm64/kvm/hyp/nvhe/hyp-main.c
··· 157 157
     host_vcpu->arch.iflags = hyp_vcpu->vcpu.arch.iflags;
 
     host_cpu_if->vgic_hcr = hyp_cpu_if->vgic_hcr;
+    host_cpu_if->vgic_vmcr = hyp_cpu_if->vgic_vmcr;
     for (i = 0; i < hyp_cpu_if->used_lrs; ++i)
         host_cpu_if->vgic_lr[i] = hyp_cpu_if->vgic_lr[i];
 }
··· 465 464
     __vgic_v3_init_lrs();
 }
 
-static void handle___vgic_v3_save_vmcr_aprs(struct kvm_cpu_context *host_ctxt)
+static void handle___vgic_v3_save_aprs(struct kvm_cpu_context *host_ctxt)
 {
     DECLARE_REG(struct vgic_v3_cpu_if *, cpu_if, host_ctxt, 1);
 
-    __vgic_v3_save_vmcr_aprs(kern_hyp_va(cpu_if));
+    __vgic_v3_save_aprs(kern_hyp_va(cpu_if));
 }
 
 static void handle___vgic_v3_restore_vmcr_aprs(struct kvm_cpu_context *host_ctxt)
··· 617 616
     HANDLE_FUNC(__kvm_tlb_flush_vmid_range),
     HANDLE_FUNC(__kvm_flush_cpu_context),
     HANDLE_FUNC(__kvm_timer_set_cntvoff),
-    HANDLE_FUNC(__vgic_v3_save_vmcr_aprs),
+    HANDLE_FUNC(__vgic_v3_save_aprs),
     HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs),
     HANDLE_FUNC(__pkvm_reserve_vm),
     HANDLE_FUNC(__pkvm_unreserve_vm),
+3
arch/arm64/kvm/hyp/nvhe/pkvm.c
··· 337 337
     /* CTR_EL0 is always under host control, even for protected VMs. */
     hyp_vm->kvm.arch.ctr_el0 = host_kvm->arch.ctr_el0;
 
+    /* Preserve the vgic model so that GICv3 emulation works */
+    hyp_vm->kvm.arch.vgic.vgic_model = host_kvm->arch.vgic.vgic_model;
+
     if (test_bit(KVM_ARCH_FLAG_MTE_ENABLED, &host_kvm->arch.flags))
         set_bit(KVM_ARCH_FLAG_MTE_ENABLED, &kvm->arch.flags);
+5
arch/arm64/kvm/hyp/nvhe/sys_regs.c
··· 444 444
 
     /* Scalable Vector Registers are restricted. */
 
+    HOST_HANDLED(SYS_ICC_PMR_EL1),
+
     RAZ_WI(SYS_ERRIDR_EL1),
     RAZ_WI(SYS_ERRSELR_EL1),
     RAZ_WI(SYS_ERXFR_EL1),
··· 459 457
 
     /* Limited Ordering Regions Registers are restricted. */
 
+    HOST_HANDLED(SYS_ICC_DIR_EL1),
+    HOST_HANDLED(SYS_ICC_RPR_EL1),
     HOST_HANDLED(SYS_ICC_SGI1R_EL1),
     HOST_HANDLED(SYS_ICC_ASGI1R_EL1),
     HOST_HANDLED(SYS_ICC_SGI0R_EL1),
+    HOST_HANDLED(SYS_ICC_CTLR_EL1),
     { SYS_DESC(SYS_ICC_SRE_EL1), .access = pvm_gic_read_sre, },
 
     HOST_HANDLED(SYS_CCSIDR_EL1),
+107 -21
arch/arm64/kvm/hyp/pgtable.c
··· 661 661
 
 #define KVM_S2_MEMATTR(pgt, attr) PAGE_S2_MEMATTR(attr, stage2_has_fwb(pgt))
 
+static int stage2_set_xn_attr(enum kvm_pgtable_prot prot, kvm_pte_t *attr)
+{
+    bool px, ux;
+    u8 xn;
+
+    px = prot & KVM_PGTABLE_PROT_PX;
+    ux = prot & KVM_PGTABLE_PROT_UX;
+
+    if (!cpus_have_final_cap(ARM64_HAS_XNX) && px != ux)
+        return -EINVAL;
+
+    if (px && ux)
+        xn = 0b00;
+    else if (!px && ux)
+        xn = 0b01;
+    else if (!px && !ux)
+        xn = 0b10;
+    else
+        xn = 0b11;
+
+    *attr &= ~KVM_PTE_LEAF_ATTR_HI_S2_XN;
+    *attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_HI_S2_XN, xn);
+    return 0;
+}
+
 static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot prot,
                                 kvm_pte_t *ptep)
 {
     kvm_pte_t attr;
     u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
+    int r;
 
     switch (prot & (KVM_PGTABLE_PROT_DEVICE |
                     KVM_PGTABLE_PROT_NORMAL_NC)) {
··· 711 685
         attr = KVM_S2_MEMATTR(pgt, NORMAL);
     }
 
-    if (!(prot & KVM_PGTABLE_PROT_X))
-        attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
+    r = stage2_set_xn_attr(prot, &attr);
+    if (r)
+        return r;
 
     if (prot & KVM_PGTABLE_PROT_R)
         attr |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R;
··· 742 715
         prot |= KVM_PGTABLE_PROT_R;
     if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W)
         prot |= KVM_PGTABLE_PROT_W;
-    if (!(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN))
-        prot |= KVM_PGTABLE_PROT_X;
+
+    switch (FIELD_GET(KVM_PTE_LEAF_ATTR_HI_S2_XN, pte)) {
+    case 0b00:
+        prot |= KVM_PGTABLE_PROT_PX | KVM_PGTABLE_PROT_UX;
+        break;
+    case 0b01:
+        prot |= KVM_PGTABLE_PROT_UX;
+        break;
+    case 0b11:
+        prot |= KVM_PGTABLE_PROT_PX;
+        break;
+    default:
+        break;
+    }
 
     return prot;
 }
··· 1329 1290
 int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
                                    enum kvm_pgtable_prot prot, enum kvm_pgtable_walk_flags flags)
 {
-    int ret;
+    kvm_pte_t xn = 0, set = 0, clr = 0;
     s8 level;
-    kvm_pte_t set = 0, clr = 0;
+    int ret;
 
     if (prot & KVM_PTE_LEAF_ATTR_HI_SW)
         return -EINVAL;
··· 1342 1303
     if (prot & KVM_PGTABLE_PROT_W)
         set |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
 
-    if (prot & KVM_PGTABLE_PROT_X)
-        clr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
+    ret = stage2_set_xn_attr(prot, &xn);
+    if (ret)
+        return ret;
+
+    set |= xn & KVM_PTE_LEAF_ATTR_HI_S2_XN;
+    clr |= ~xn & KVM_PTE_LEAF_ATTR_HI_S2_XN;
 
     ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level, flags);
     if (!ret || ret == -EAGAIN)
··· 1578 1535
     return kvm_pgd_pages(ia_bits, start_level) * PAGE_SIZE;
 }
 
-static int stage2_free_walker(const struct kvm_pgtable_visit_ctx *ctx,
-                              enum kvm_pgtable_walk_flags visit)
+static int stage2_free_leaf(const struct kvm_pgtable_visit_ctx *ctx)
 {
     struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
 
-    if (!stage2_pte_is_counted(ctx->old))
-        return 0;
-
     mm_ops->put_page(ctx->ptep);
-
-    if (kvm_pte_table(ctx->old, ctx->level))
-        mm_ops->put_page(kvm_pte_follow(ctx->old, mm_ops));
-
     return 0;
 }
 
-void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+static int stage2_free_table_post(const struct kvm_pgtable_visit_ctx *ctx)
 {
-    size_t pgd_sz;
+    struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+    kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
+
+    if (mm_ops->page_count(childp) != 1)
+        return 0;
+
+    /*
+     * Drop references and clear the now stale PTE to avoid rewalking the
+     * freed page table.
+     */
+    mm_ops->put_page(ctx->ptep);
+    mm_ops->put_page(childp);
+    kvm_clear_pte(ctx->ptep);
+    return 0;
+}
+
+static int stage2_free_walker(const struct kvm_pgtable_visit_ctx *ctx,
+                              enum kvm_pgtable_walk_flags visit)
+{
+    if (!stage2_pte_is_counted(ctx->old))
+        return 0;
+
+    switch (visit) {
+    case KVM_PGTABLE_WALK_LEAF:
+        return stage2_free_leaf(ctx);
+    case KVM_PGTABLE_WALK_TABLE_POST:
+        return stage2_free_table_post(ctx);
+    default:
+        return -EINVAL;
+    }
+}
+
+void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
+                                      u64 addr, u64 size)
+{
     struct kvm_pgtable_walker walker = {
         .cb    = stage2_free_walker,
         .flags = KVM_PGTABLE_WALK_LEAF |
                  KVM_PGTABLE_WALK_TABLE_POST,
     };
 
-    WARN_ON(kvm_pgtable_walk(pgt, 0, BIT(pgt->ia_bits), &walker));
+    WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
+}
+
+void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
+{
+    size_t pgd_sz;
+
     pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE;
-    pgt->mm_ops->free_pages_exact(kvm_dereference_pteref(&walker, pgt->pgd), pgd_sz);
+
+    /*
+     * Since the pgtable is unlinked at this point, and not shared with
+     * other walkers, safely deference pgd with kvm_dereference_pteref_raw()
+     */
+    pgt->mm_ops->free_pages_exact(kvm_dereference_pteref_raw(pgt->pgd), pgd_sz);
     pgt->pgd = NULL;
+}
+
+void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+{
+    kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
+    kvm_pgtable_stage2_destroy_pgd(pgt);
 }
 
 void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, s8 level)
+4
arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c
··· 63 63
         return -1;
     }
 
+    /* Handle deactivation as a normal exit */
+    if ((fault_ipa - vgic->vgic_cpu_base) >= GIC_CPU_DEACTIVATE)
+        return 0;
+
     rd = kvm_vcpu_dabt_get_rd(vcpu);
     addr  = kvm_vgic_global_state.vcpu_hyp_va;
     addr += fault_ipa - vgic->vgic_cpu_base;
+63 -33
arch/arm64/kvm/hyp/vgic-v3-sr.c
··· 14 14
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 
+#include "../../vgic/vgic.h"
+
 #define vtr_to_max_lr_idx(v)    ((v) & 0xf)
 #define vtr_to_nr_pre_bits(v)   ((((u32)(v) >> 26) & 7) + 1)
 #define vtr_to_nr_apr_regs(v)   (1 << (vtr_to_nr_pre_bits(v) - 5))
··· 60 58
     unreachable();
 }
 
-static void __gic_v3_set_lr(u64 val, int lr)
+void __gic_v3_set_lr(u64 val, int lr)
 {
     switch (lr & 0xf) {
     case 0:
··· 198 196
     return val;
 }
 
+static u64 compute_ich_hcr(struct vgic_v3_cpu_if *cpu_if)
+{
+    return cpu_if->vgic_hcr | vgic_ich_hcr_trap_bits();
+}
+
 void __vgic_v3_save_state(struct vgic_v3_cpu_if *cpu_if)
 {
     u64 used_lrs = cpu_if->used_lrs;
··· 219 212
         }
     }
 
-    if (used_lrs || cpu_if->its_vpe.its_vm) {
+    if (used_lrs) {
         int i;
         u32 elrsr;
 
         elrsr = read_gicreg(ICH_ELRSR_EL2);
-
-        write_gicreg(cpu_if->vgic_hcr & ~ICH_HCR_EL2_En, ICH_HCR_EL2);
 
         for (i = 0; i < used_lrs; i++) {
             if (elrsr & (1 << i))
··· 234 229
                 __gic_v3_set_lr(0, i);
         }
     }
+
+    cpu_if->vgic_vmcr = read_gicreg(ICH_VMCR_EL2);
+
+    if (cpu_if->vgic_hcr & ICH_HCR_EL2_LRENPIE) {
+        u64 val = read_gicreg(ICH_HCR_EL2);
+        cpu_if->vgic_hcr &= ~ICH_HCR_EL2_EOIcount;
+        cpu_if->vgic_hcr |= val & ICH_HCR_EL2_EOIcount;
+    }
+
+    write_gicreg(0, ICH_HCR_EL2);
+
+    /*
+     * Hack alert: On NV, this results in a trap so that the above write
+     * actually takes effect... No synchronisation is necessary, as we
+     * only care about the effects when this traps.
+     */
+    read_gicreg(ICH_MISR_EL2);
 }
 
 void __vgic_v3_restore_state(struct vgic_v3_cpu_if *cpu_if)
··· 258 236
     u64 used_lrs = cpu_if->used_lrs;
     int i;
 
-    if (used_lrs || cpu_if->its_vpe.its_vm) {
-        write_gicreg(cpu_if->vgic_hcr, ICH_HCR_EL2);
+    write_gicreg(compute_ich_hcr(cpu_if), ICH_HCR_EL2);
 
-        for (i = 0; i < used_lrs; i++)
-            __gic_v3_set_lr(cpu_if->vgic_lr[i], i);
-    }
+    for (i = 0; i < used_lrs; i++)
+        __gic_v3_set_lr(cpu_if->vgic_lr[i], i);
 
     /*
      * Ensure that writes to the LRs, and on non-VHE systems ensure that
··· 327 307
     }
 
     /*
-     * If we need to trap system registers, we must write
-     * ICH_HCR_EL2 anyway, even if no interrupts are being
-     * injected. Note that this also applies if we don't expect
-     * any system register access (no vgic at all).
+     * If we need to trap system registers, we must write ICH_HCR_EL2
+     * anyway, even if no interrupts are being injected. Note that this
+     * also applies if we don't expect any system register access (no
+     * vgic at all). In any case, no need to provide MI configuration.
      */
     if (static_branch_unlikely(&vgic_v3_cpuif_trap) ||
         cpu_if->its_vpe.its_vm || !cpu_if->vgic_sre)
-        write_gicreg(cpu_if->vgic_hcr, ICH_HCR_EL2);
+        write_gicreg(vgic_ich_hcr_trap_bits() | ICH_HCR_EL2_En, ICH_HCR_EL2);
 }
 
 void __vgic_v3_deactivate_traps(struct vgic_v3_cpu_if *cpu_if)
 {
     u64 val;
-
-    if (!cpu_if->vgic_sre) {
-        cpu_if->vgic_vmcr = read_gicreg(ICH_VMCR_EL2);
-    }
 
     /* Only restore SRE if the host implements the GICv2 interface */
     if (static_branch_unlikely(&vgic_v3_has_v2_compat)) {
··· 362 346
     write_gicreg(0, ICH_HCR_EL2);
 }
 
-static void __vgic_v3_save_aprs(struct vgic_v3_cpu_if *cpu_if)
+void __vgic_v3_save_aprs(struct vgic_v3_cpu_if *cpu_if)
 {
     u64 val;
     u32 nr_pre_bits;
··· 521 505
 static void __vgic_v3_write_vmcr(u32 vmcr)
 {
     write_gicreg(vmcr, ICH_VMCR_EL2);
-}
-
-void __vgic_v3_save_vmcr_aprs(struct vgic_v3_cpu_if *cpu_if)
-{
-    __vgic_v3_save_aprs(cpu_if);
-    if (cpu_if->vgic_sre)
-        cpu_if->vgic_vmcr = __vgic_v3_read_vmcr();
 }
 
 void __vgic_v3_restore_vmcr_aprs(struct vgic_v3_cpu_if *cpu_if)
··· 799 790
     write_gicreg(hcr, ICH_HCR_EL2);
 }
 
-static void __vgic_v3_write_dir(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
+static int ___vgic_v3_write_dir(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
 {
     u32 vid = vcpu_get_reg(vcpu, rt);
     u64 lr_val;
··· 807 798
 
     /* EOImode == 0, nothing to be done here */
     if (!(vmcr & ICH_VMCR_EOIM_MASK))
-        return;
+        return 1;
 
     /* No deactivate to be performed on an LPI */
     if (vid >= VGIC_MIN_LPI)
-        return;
+        return 1;
 
     lr = __vgic_v3_find_active_lr(vcpu, vid, &lr_val);
-    if (lr == -1) {
-        __vgic_v3_bump_eoicount();
-        return;
+    if (lr != -1) {
+        __vgic_v3_clear_active_lr(lr, lr_val);
+        return 1;
     }
 
-    __vgic_v3_clear_active_lr(lr, lr_val);
+    return 0;
+}
+
+static void __vgic_v3_write_dir(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
+{
+    if (!___vgic_v3_write_dir(vcpu, vmcr, rt))
+        __vgic_v3_bump_eoicount();
 }
 
 static void __vgic_v3_write_eoir(struct kvm_vcpu *vcpu, u32 vmcr, int rt)
··· 1260 1245
     case SYS_ICC_DIR_EL1:
         if (unlikely(is_read))
             return 0;
+        /*
+         * Full exit if required to handle overflow deactivation,
+         * unless we can emulate it in the LRs (likely the majority
+         * of the cases).
+         */
+        if (vcpu->arch.vgic_cpu.vgic_v3.vgic_hcr & ICH_HCR_EL2_TDIR) {
+            int ret;
+
+            ret = ___vgic_v3_write_dir(vcpu, __vgic_v3_read_vmcr(),
+                                       kvm_vcpu_sys_get_rt(vcpu));
+            if (ret)
+                __kvm_skip_instr(vcpu);
+
+            return ret;
+        }
         fn = __vgic_v3_write_dir;
         break;
     case SYS_ICC_RPR_EL1:
+124 -8
arch/arm64/kvm/mmu.c
··· 904 904
     return 0;
 }
 
+/*
+ * Assume that @pgt is valid and unlinked from the KVM MMU to free the
+ * page-table without taking the kvm_mmu_lock and without performing any
+ * TLB invalidations.
+ *
+ * Also, the range of addresses can be large enough to cause need_resched
+ * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
+ * cond_resched() periodically to prevent hogging the CPU for a long time
+ * and schedule something else, if required.
+ */
+static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
+                                 phys_addr_t end)
+{
+    u64 next;
+
+    do {
+        next = stage2_range_addr_end(addr, end);
+        KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, addr,
+                                                     next - addr);
+        if (next != end)
+            cond_resched();
+    } while (addr = next, addr != end);
+}
+
+static void kvm_stage2_destroy(struct kvm_pgtable *pgt)
+{
+    unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr);
+
+    stage2_destroy_range(pgt, 0, BIT(ia_bits));
+    KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
+}
+
 /**
  * kvm_init_stage2_mmu - Initialise a S2 MMU structure
  * @kvm: The pointer to the KVM structure
··· 1012 980
     return 0;
 
 out_destroy_pgtable:
-    KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+    kvm_stage2_destroy(pgt);
 out_free_pgtable:
     kfree(pgt);
     return err;
··· 1113 1081
     write_unlock(&kvm->mmu_lock);
 
     if (pgt) {
-        KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+        kvm_stage2_destroy(pgt);
         kfree(pgt);
     }
 }
··· 1553 1521
         *prot |= kvm_encode_nested_level(nested);
 }
 
+static void adjust_nested_exec_perms(struct kvm *kvm,
+                                     struct kvm_s2_trans *nested,
+                                     enum kvm_pgtable_prot *prot)
+{
+    if (!kvm_s2_trans_exec_el0(kvm, nested))
+        *prot &= ~KVM_PGTABLE_PROT_UX;
+    if (!kvm_s2_trans_exec_el1(kvm, nested))
+        *prot &= ~KVM_PGTABLE_PROT_PX;
+}
+
 #define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
 
 static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
··· 1614 1572
     if (writable)
         prot |= KVM_PGTABLE_PROT_W;
 
-    if (exec_fault ||
-        (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
-         (!nested || kvm_s2_trans_executable(nested))))
+    if (exec_fault || cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
         prot |= KVM_PGTABLE_PROT_X;
+
+    if (nested)
+        adjust_nested_exec_perms(kvm, nested, &prot);
 
     kvm_fault_lock(kvm);
     if (mmu_invalidate_retry(kvm, mmu_seq)) {
··· 1894 1851
             prot |= KVM_PGTABLE_PROT_NORMAL_NC;
         else
             prot |= KVM_PGTABLE_PROT_DEVICE;
-    } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
-               (!nested || kvm_s2_trans_executable(nested))) {
+    } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) {
         prot |= KVM_PGTABLE_PROT_X;
     }
+
+    if (nested)
+        adjust_nested_exec_perms(kvm, nested, &prot);
 
     /*
      * Under the premise of getting a FSC_PERM fault, we just need to relax
··· 1944 1899
     read_unlock(&vcpu->kvm->mmu_lock);
 }
 
+/*
+ * Returns true if the SEA should be handled locally within KVM if the abort
+ * is caused by a kernel memory allocation (e.g. stage-2 table memory).
+ */
+static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
+{
+    /*
+     * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external abort
+     * taken from a guest EL to EL2 is due to a host-imposed access (e.g.
+     * stage-2 PTW).
+     */
+    if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
+        return true;
+
+    /* KVM owns the VNCR when the vCPU isn't in a nested context. */
+    if (is_hyp_ctxt(vcpu) && !kvm_vcpu_trap_is_iabt(vcpu) && (esr & ESR_ELx_VNCR))
+        return true;
+
+    /*
+     * Determining if an external abort during a table walk happened at
+     * stage-2 is only possible with S1PTW is set. Otherwise, since KVM
+     * sets HCR_EL2.TEA, SEAs due to a stage-1 walk (i.e. accessing the
+     * PA of the stage-1 descriptor) can reach here and are reported
+     * with a TTW ESR value.
+     */
+    return (esr_fsc_is_sea_ttw(esr) && (esr & ESR_ELx_S1PTW));
+}
+
 int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
 {
+    struct kvm *kvm = vcpu->kvm;
+    struct kvm_run *run = vcpu->run;
+    u64 esr = kvm_vcpu_get_esr(vcpu);
+    u64 esr_mask = ESR_ELx_EC_MASK |
+                   ESR_ELx_IL |
+                   ESR_ELx_FnV |
+                   ESR_ELx_EA |
+                   ESR_ELx_CM |
+                   ESR_ELx_WNR |
+                   ESR_ELx_FSC;
+    u64 ipa;
+
     /*
      * Give APEI the opportunity to claim the abort before handling it
      * within KVM. apei_claim_sea() expects to be called with IRQs enabled.
··· 1994 1909
     if (apei_claim_sea(NULL) == 0)
         return 1;
 
-    return kvm_inject_serror(vcpu);
+    if (host_owns_sea(vcpu, esr) ||
+        !test_bit(KVM_ARCH_FLAG_EXIT_SEA, &vcpu->kvm->arch.flags))
+        return kvm_inject_serror(vcpu);
+
+    /* ESR_ELx.SET is RES0 when FEAT_RAS isn't implemented. */
+    if (kvm_has_ras(kvm))
+        esr_mask |= ESR_ELx_SET_MASK;
+
+    /*
+     * Exit to userspace, and provide faulting guest virtual and physical
+     * addresses in case userspace wants to emulate SEA to guest by
+     * writing to FAR_ELx and HPFAR_ELx registers.
+     */
+    memset(&run->arm_sea, 0, sizeof(run->arm_sea));
+    run->exit_reason = KVM_EXIT_ARM_SEA;
+    run->arm_sea.esr = esr & esr_mask;
+
+    if (!(esr & ESR_ELx_FnV))
+        run->arm_sea.gva = kvm_vcpu_get_hfar(vcpu);
+
+    ipa = kvm_vcpu_get_fault_ipa(vcpu);
+    if (ipa != INVALID_GPA) {
+        run->arm_sea.flags |= KVM_EXIT_ARM_SEA_FLAG_GPA_VALID;
+        run->arm_sea.gpa = ipa;
+    }
+
+    return 0;
 }
 
 /**
··· 2110 1999
         u32 esr;
 
         ret = kvm_walk_nested_s2(vcpu, fault_ipa, &nested_trans);
+        if (ret == -EAGAIN) {
+            ret = 1;
+            goto out_unlock;
+        }
+
         if (ret) {
             esr = kvm_s2_trans_esr(&nested_trans);
             kvm_inject_s2_fault(vcpu, esr);
+84 -39
arch/arm64/kvm/nested.c
··· 124 124 } 125 125 126 126 struct s2_walk_info { 127 - int (*read_desc)(phys_addr_t pa, u64 *desc, void *data); 128 - void *data; 129 - u64 baddr; 130 - unsigned int max_oa_bits; 131 - unsigned int pgshift; 132 - unsigned int sl; 133 - unsigned int t0sz; 134 - bool be; 127 + u64 baddr; 128 + unsigned int max_oa_bits; 129 + unsigned int pgshift; 130 + unsigned int sl; 131 + unsigned int t0sz; 132 + bool be; 133 + bool ha; 135 134 }; 136 135 137 136 static u32 compute_fsc(int level, u32 fsc) ··· 198 199 return 0; 199 200 } 200 201 202 + static int read_guest_s2_desc(struct kvm_vcpu *vcpu, phys_addr_t pa, u64 *desc, 203 + struct s2_walk_info *wi) 204 + { 205 + u64 val; 206 + int r; 207 + 208 + r = kvm_read_guest(vcpu->kvm, pa, &val, sizeof(val)); 209 + if (r) 210 + return r; 211 + 212 + /* 213 + * Handle reversed descriptors if endianness differs between the 214 + * host and the guest hypervisor. 215 + */ 216 + if (wi->be) 217 + *desc = be64_to_cpu((__force __be64)val); 218 + else 219 + *desc = le64_to_cpu((__force __le64)val); 220 + 221 + return 0; 222 + } 223 + 224 + static int swap_guest_s2_desc(struct kvm_vcpu *vcpu, phys_addr_t pa, u64 old, u64 new, 225 + struct s2_walk_info *wi) 226 + { 227 + if (wi->be) { 228 + old = (__force u64)cpu_to_be64(old); 229 + new = (__force u64)cpu_to_be64(new); 230 + } else { 231 + old = (__force u64)cpu_to_le64(old); 232 + new = (__force u64)cpu_to_le64(new); 233 + } 234 + 235 + return __kvm_at_swap_desc(vcpu->kvm, pa, old, new); 236 + } 237 + 201 238 /* 202 239 * This is essentially a C-version of the pseudo code from the ARM ARM 203 240 * AArch64.TranslationTableWalk function. I strongly recommend looking at ··· 241 206 * 242 207 * Must be called with the kvm->srcu read lock held 243 208 */ 244 - static int walk_nested_s2_pgd(phys_addr_t ipa, 209 + static int walk_nested_s2_pgd(struct kvm_vcpu *vcpu, phys_addr_t ipa, 245 210 struct s2_walk_info *wi, struct kvm_s2_trans *out) 246 211 { 247 212 int first_block_level, level, stride, input_size, base_lower_bound; 248 213 phys_addr_t base_addr; 249 214 unsigned int addr_top, addr_bottom; 250 - u64 desc; /* page table entry */ 215 + u64 desc, new_desc; /* page table entry */ 251 216 int ret; 252 217 phys_addr_t paddr; 253 218 ··· 292 257 >> (addr_bottom - 3); 293 258 294 259 paddr = base_addr | index; 295 - ret = wi->read_desc(paddr, &desc, wi->data); 260 + ret = read_guest_s2_desc(vcpu, paddr, &desc, wi); 296 261 if (ret < 0) 297 262 return ret; 298 263 299 - /* 300 - * Handle reversedescriptors if endianness differs between the 301 - * host and the guest hypervisor.
302 - */ 303 - if (wi->be) 304 - desc = be64_to_cpu((__force __be64)desc); 305 - else 306 - desc = le64_to_cpu((__force __le64)desc); 264 + new_desc = desc; 307 265 308 266 /* Check for valid descriptor at this point */ 309 - if (!(desc & 1) || ((desc & 3) == 1 && level == 3)) { 267 + if (!(desc & KVM_PTE_VALID)) { 310 268 out->esr = compute_fsc(level, ESR_ELx_FSC_FAULT); 311 269 out->desc = desc; 312 270 return 1; 313 271 } 314 272 315 - /* We're at the final level or block translation level */ 316 - if ((desc & 3) == 1 || level == 3) 273 + if (FIELD_GET(KVM_PTE_TYPE, desc) == KVM_PTE_TYPE_BLOCK) { 274 + if (level < 3) 275 + break; 276 + 277 + out->esr = compute_fsc(level, ESR_ELx_FSC_FAULT); 278 + out->desc = desc; 279 + return 1; 280 + } 281 + 282 + /* We're at the final level */ 283 + if (level == 3) 317 284 break; 318 285 319 286 if (check_output_size(wi, desc)) { ··· 342 305 return 1; 343 306 } 344 307 345 - if (!(desc & BIT(10))) { 308 + if (wi->ha) 309 + new_desc |= KVM_PTE_LEAF_ATTR_LO_S2_AF; 310 + 311 + if (new_desc != desc) { 312 + ret = swap_guest_s2_desc(vcpu, paddr, desc, new_desc, wi); 313 + if (ret) 314 + return ret; 315 + 316 + desc = new_desc; 317 + } 318 + 319 + if (!(desc & KVM_PTE_LEAF_ATTR_LO_S2_AF)) { 346 320 out->esr = compute_fsc(level, ESR_ELx_FSC_ACCESS); 347 321 out->desc = desc; 348 322 return 1; ··· 366 318 (ipa & GENMASK_ULL(addr_bottom - 1, 0)); 367 319 out->output = paddr; 368 320 out->block_size = 1UL << ((3 - level) * stride + wi->pgshift); 369 - out->readable = desc & (0b01 << 6); 370 - out->writable = desc & (0b10 << 6); 321 + out->readable = desc & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R; 322 + out->writable = desc & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W; 371 323 out->level = level; 372 324 out->desc = desc; 373 325 return 0; 374 - } 375 - 376 - static int read_guest_s2_desc(phys_addr_t pa, u64 *desc, void *data) 377 - { 378 - struct kvm_vcpu *vcpu = data; 379 - 380 - return kvm_read_guest(vcpu->kvm, pa, desc, sizeof(*desc)); 381 326 } 382 327 383 328 static void vtcr_to_walk_info(u64 vtcr, struct s2_walk_info *wi) ··· 391 350 /* Global limit for now, should eventually be per-VM */ 392 351 wi->max_oa_bits = min(get_kvm_ipa_limit(), 393 352 ps_to_output_size(FIELD_GET(VTCR_EL2_PS_MASK, vtcr), false)); 353 + 354 + wi->ha = vtcr & VTCR_EL2_HA; 394 355 } 395 356 396 357 int kvm_walk_nested_s2(struct kvm_vcpu *vcpu, phys_addr_t gipa, ··· 407 364 if (!vcpu_has_nv(vcpu)) 408 365 return 0; 409 366 410 - wi.read_desc = read_guest_s2_desc; 411 - wi.data = vcpu; 412 367 wi.baddr = vcpu_read_sys_reg(vcpu, VTTBR_EL2); 413 368 414 369 vtcr_to_walk_info(vtcr, &wi); 415 370 416 371 wi.be = vcpu_read_sys_reg(vcpu, SCTLR_EL2) & SCTLR_ELx_EE; 417 372 418 - ret = walk_nested_s2_pgd(gipa, &wi, result); 373 + ret = walk_nested_s2_pgd(vcpu, gipa, &wi, result); 419 374 if (ret) 420 375 result->esr |= (kvm_vcpu_get_esr(vcpu) & ~ESR_ELx_FSC); 421 376 ··· 829 788 return 0; 830 789 831 790 if (kvm_vcpu_trap_is_iabt(vcpu)) { 832 - forward_fault = !kvm_s2_trans_executable(trans); 791 + if (vcpu_mode_priv(vcpu)) 792 + forward_fault = !kvm_s2_trans_exec_el1(vcpu->kvm, trans); 793 + else 794 + forward_fault = !kvm_s2_trans_exec_el0(vcpu->kvm, trans); 833 795 } else { 834 796 bool write_fault = kvm_is_write_fault(vcpu); 835 797 ··· 1599 1555 case SYS_ID_AA64MMFR1_EL1: 1600 1556 val &= ~(ID_AA64MMFR1_EL1_CMOW | 1601 1557 ID_AA64MMFR1_EL1_nTLBPA | 1602 - ID_AA64MMFR1_EL1_ETS | 1603 - ID_AA64MMFR1_EL1_XNX | 1604 - ID_AA64MMFR1_EL1_HAFDBS); 1558 + ID_AA64MMFR1_EL1_ETS); 1559 + 1605 1560 /* FEAT_E2H0 
implies no VHE */ 1606 1561 if (test_bit(KVM_ARM_VCPU_HAS_EL2_E2H0, kvm->arch.vcpu_features)) 1607 1562 val &= ~ID_AA64MMFR1_EL1_VH; 1563 + 1564 + val = ID_REG_LIMIT_FIELD_ENUM(val, ID_AA64MMFR1_EL1, HAFDBS, AF); 1608 1565 break; 1609 1566 1610 1567 case SYS_ID_AA64MMFR2_EL1:
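The walker now emulates FEAT_HAF: when the guest hypervisor sets VTCR_EL2.HA, a clear Access flag is updated via compare-and-swap instead of being reported as a fault. Condensed decision logic for a single leaf descriptor (a sketch only; the real walker folds this into new_desc and swap_guest_s2_desc() above):

/* Sketch: AF handling for one stage-2 descriptor, per the walk above. */
static int handle_s2_access_flag(u64 *desc, bool hw_af, int level, u32 *esr)
{
	if (*desc & KVM_PTE_LEAF_ATTR_LO_S2_AF)
		return 0;		/* already marked accessed */

	if (hw_af) {
		/*
		 * FEAT_HAF: set the AF instead of faulting; the caller
		 * must CAS the new value back into guest memory.
		 */
		*desc |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
		return 0;
	}

	/* No HW update: report an Access-flag fault at this level. */
	*esr = compute_fsc(level, ESR_ELx_FSC_ACCESS);
	return 1;
}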
+9 -2
arch/arm64/kvm/pkvm.c
··· 344 344 return 0; 345 345 } 346 346 347 - void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt) 347 + void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, 348 + u64 addr, u64 size) 348 349 { 349 - __pkvm_pgtable_stage2_unmap(pgt, 0, ~(0ULL)); 350 + __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); 351 + } 352 + 353 + void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) 354 + { 355 + /* Expected to be called after all pKVM mappings have been released. */ 356 + WARN_ON_ONCE(!RB_EMPTY_ROOT(&pgt->pkvm_mappings.rb_root)); 350 357 } 351 358 352 359 int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
+27 -8
arch/arm64/kvm/ptdump.c
··· 31 31 .val = PTE_VALID, 32 32 .set = " ", 33 33 .clear = "F", 34 - }, { 34 + }, 35 + { 35 36 .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R, 36 37 .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R, 37 38 .set = "R", 38 39 .clear = " ", 39 - }, { 40 + }, 41 + { 40 42 .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W, 41 43 .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W, 42 44 .set = "W", 43 45 .clear = " ", 44 - }, { 46 + }, 47 + { 45 48 .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN, 46 - .val = KVM_PTE_LEAF_ATTR_HI_S2_XN, 47 - .set = "NX", 48 - .clear = "x ", 49 - }, { 49 + .val = 0b00UL << __bf_shf(KVM_PTE_LEAF_ATTR_HI_S2_XN), 50 + .set = "px ux ", 51 + }, 52 + { 53 + .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN, 54 + .val = 0b01UL << __bf_shf(KVM_PTE_LEAF_ATTR_HI_S2_XN), 55 + .set = "PXNux ", 56 + }, 57 + { 58 + .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN, 59 + .val = 0b10UL << __bf_shf(KVM_PTE_LEAF_ATTR_HI_S2_XN), 60 + .set = "PXNUXN", 61 + }, 62 + { 63 + .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN, 64 + .val = 0b11UL << __bf_shf(KVM_PTE_LEAF_ATTR_HI_S2_XN), 65 + .set = "px UXN", 66 + }, 67 + { 50 68 .mask = KVM_PTE_LEAF_ATTR_LO_S2_AF, 51 69 .val = KVM_PTE_LEAF_ATTR_LO_S2_AF, 52 70 .set = "AF", 53 71 .clear = " ", 54 - }, { 72 + }, 73 + { 55 74 .mask = PMD_TYPE_MASK, 56 75 .val = PMD_TYPE_SECT, 57 76 .set = "BLK",
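The ptdump table now decodes all four FEAT_XNX encodings of the two-bit stage-2 XN field, rather than treating XN as a single execute-never bit. The mapping from encoding to the printed per-EL permissions, as a sketch:

/* Sketch: the four stage-2 XN[1:0] encodings the table above prints. */
static void decode_s2_xn(u64 pte, bool *pxn, bool *uxn)
{
	switch (FIELD_GET(KVM_PTE_LEAF_ATTR_HI_S2_XN, pte)) {
	case 0b00: *pxn = false; *uxn = false; break;	/* "px ux " */
	case 0b01: *pxn = true;  *uxn = false; break;	/* "PXNux " */
	case 0b10: *pxn = true;  *uxn = true;  break;	/* "PXNUXN" */
	case 0b11: *pxn = false; *uxn = true;  break;	/* "px UXN" */
	}
}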
+23 -5
arch/arm64/kvm/sys_regs.c
··· 666 666 return true; 667 667 } 668 668 669 + static bool access_gic_dir(struct kvm_vcpu *vcpu, 670 + struct sys_reg_params *p, 671 + const struct sys_reg_desc *r) 672 + { 673 + if (!kvm_has_gicv3(vcpu->kvm)) 674 + return undef_access(vcpu, p, r); 675 + 676 + if (!p->is_write) 677 + return undef_access(vcpu, p, r); 678 + 679 + vgic_v3_deactivate(vcpu, p->regval); 680 + 681 + return true; 682 + } 683 + 669 684 static bool trap_raz_wi(struct kvm_vcpu *vcpu, 670 685 struct sys_reg_params *p, 671 686 const struct sys_reg_desc *r) ··· 3388 3373 { SYS_DESC(SYS_ICC_AP1R1_EL1), undef_access }, 3389 3374 { SYS_DESC(SYS_ICC_AP1R2_EL1), undef_access }, 3390 3375 { SYS_DESC(SYS_ICC_AP1R3_EL1), undef_access }, 3391 - { SYS_DESC(SYS_ICC_DIR_EL1), undef_access }, 3376 + { SYS_DESC(SYS_ICC_DIR_EL1), access_gic_dir }, 3392 3377 { SYS_DESC(SYS_ICC_RPR_EL1), undef_access }, 3393 3378 { SYS_DESC(SYS_ICC_SGI1R_EL1), access_gic_sgi }, 3394 3379 { SYS_DESC(SYS_ICC_ASGI1R_EL1), access_gic_sgi }, ··· 3785 3770 { 3786 3771 u32 op = sys_insn(p->Op0, p->Op1, p->CRn, p->CRm, p->Op2); 3787 3772 3788 - __kvm_at_s1e01(vcpu, op, p->regval); 3773 + if (__kvm_at_s1e01(vcpu, op, p->regval)) 3774 + return false; 3789 3775 3790 3776 return true; 3791 3777 } ··· 3803 3787 return false; 3804 3788 } 3805 3789 3806 - __kvm_at_s1e2(vcpu, op, p->regval); 3790 + if (__kvm_at_s1e2(vcpu, op, p->regval)) 3791 + return false; 3807 3792 3808 3793 return true; 3809 3794 } ··· 3814 3797 { 3815 3798 u32 op = sys_insn(p->Op0, p->Op1, p->CRn, p->CRm, p->Op2); 3816 3799 3817 - __kvm_at_s12(vcpu, op, p->regval); 3800 + if (__kvm_at_s12(vcpu, op, p->regval)) 3801 + return false; 3818 3802 3819 3803 return true; 3820 3804 } ··· 4516 4498 { CP15_SYS_DESC(SYS_ICC_AP1R1_EL1), undef_access }, 4517 4499 { CP15_SYS_DESC(SYS_ICC_AP1R2_EL1), undef_access }, 4518 4500 { CP15_SYS_DESC(SYS_ICC_AP1R3_EL1), undef_access }, 4519 - { CP15_SYS_DESC(SYS_ICC_DIR_EL1), undef_access }, 4501 + { CP15_SYS_DESC(SYS_ICC_DIR_EL1), access_gic_dir }, 4520 4502 { CP15_SYS_DESC(SYS_ICC_RPR_EL1), undef_access }, 4521 4503 { CP15_SYS_DESC(SYS_ICC_IAR1_EL1), undef_access }, 4522 4504 { CP15_SYS_DESC(SYS_ICC_EOIR1_EL1), undef_access },
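ICC_DIR_EL1 only needs this handling because EOImode==1 decouples priority drop from deactivation, and the deactivating write may even land on a different vCPU than the one that acked the interrupt. Roughly the guest-side sequence the new trap handler has to support (a sketch of guest code, not of KVM's):

/* Guest-side EOImode==1 flow (sketch); the DIR write is what KVM traps. */
static void guest_handle_irq(void)
{
	u64 intid = read_sysreg_s(SYS_ICC_IAR1_EL1);	/* ack */

	write_sysreg_s(intid, SYS_ICC_EOIR1_EL1);	/* priority drop only */

	/* ... handling may migrate to another CPU here ... */

	write_sysreg_s(intid, SYS_ICC_DIR_EL1);		/* deactivate */
}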
+5 -4
arch/arm64/kvm/vgic/vgic-init.c
··· 198 198 struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0); 199 199 int i; 200 200 201 + dist->active_spis = (atomic_t)ATOMIC_INIT(0); 201 202 dist->spis = kcalloc(nr_spis, sizeof(struct vgic_irq), GFP_KERNEL_ACCOUNT); 202 203 if (!dist->spis) 203 204 return -ENOMEM; ··· 364 363 return ret; 365 364 } 366 365 367 - static void kvm_vgic_vcpu_enable(struct kvm_vcpu *vcpu) 366 + static void kvm_vgic_vcpu_reset(struct kvm_vcpu *vcpu) 368 367 { 369 368 if (kvm_vgic_global_state.type == VGIC_V2) 370 - vgic_v2_enable(vcpu); 369 + vgic_v2_reset(vcpu); 371 370 else 372 - vgic_v3_enable(vcpu); 371 + vgic_v3_reset(vcpu); 373 372 } 374 373 375 374 /* ··· 416 415 } 417 416 418 417 kvm_for_each_vcpu(idx, vcpu, kvm) 419 - kvm_vgic_vcpu_enable(vcpu); 418 + kvm_vgic_vcpu_reset(vcpu); 420 419 421 420 ret = kvm_vgic_setup_default_irq_routing(kvm); 422 421 if (ret)
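The active_spis counter initialised here underpins the DIR-trap decision made elsewhere in the series: it counts SPIs sitting on any ap_list, so each vCPU knows whether VM-wide deactivation tracking is required. The accounting protocol, as used in vgic.c and vgic-v3.c below (a simplified sketch; kick_all_vcpus() is a hypothetical stand-in for the request broadcast):

/* Sketch of the VM-wide active-SPI accounting used by the DIR logic. */
static void spi_queued(struct vgic_dist *dist)
{
	/* 0 -> 1: every vCPU must start trapping DIR, so kick them all. */
	if (atomic_fetch_inc(&dist->active_spis) == 0)
		kick_all_vcpus();
}

static void spi_deactivated(struct vgic_dist *dist)
{
	/* Saturating decrement; resets may race with deactivation. */
	atomic_dec_if_positive(&dist->active_spis);
}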
+24
arch/arm64/kvm/vgic/vgic-mmio-v2.c
··· 359 359 vgic_set_vmcr(vcpu, &vmcr); 360 360 } 361 361 362 + static void vgic_mmio_write_dir(struct kvm_vcpu *vcpu, 363 + gpa_t addr, unsigned int len, 364 + unsigned long val) 365 + { 366 + if (kvm_vgic_global_state.type == VGIC_V2) 367 + vgic_v2_deactivate(vcpu, val); 368 + else 369 + vgic_v3_deactivate(vcpu, val); 370 + } 371 + 362 372 static unsigned long vgic_mmio_read_apr(struct kvm_vcpu *vcpu, 363 373 gpa_t addr, unsigned int len) 364 374 { ··· 492 482 REGISTER_DESC_WITH_LENGTH(GIC_CPU_IDENT, 493 483 vgic_mmio_read_vcpuif, vgic_mmio_write_vcpuif, 4, 494 484 VGIC_ACCESS_32bit), 485 + REGISTER_DESC_WITH_LENGTH_UACCESS(GIC_CPU_DEACTIVATE, 486 + vgic_mmio_read_raz, vgic_mmio_write_dir, 487 + vgic_mmio_read_raz, vgic_mmio_uaccess_write_wi, 488 + 4, VGIC_ACCESS_32bit), 495 489 }; 496 490 497 491 unsigned int vgic_v2_init_dist_iodev(struct vgic_io_device *dev) ··· 506 492 kvm_iodevice_init(&dev->dev, &kvm_io_gic_ops); 507 493 508 494 return SZ_4K; 495 + } 496 + 497 + unsigned int vgic_v2_init_cpuif_iodev(struct vgic_io_device *dev) 498 + { 499 + dev->regions = vgic_v2_cpu_registers; 500 + dev->nr_regions = ARRAY_SIZE(vgic_v2_cpu_registers); 501 + 502 + kvm_iodevice_init(&dev->dev, &kvm_io_gic_ops); 503 + 504 + return KVM_VGIC_V2_CPU_SIZE; 509 505 } 510 506 511 507 int vgic_v2_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr *attr)
+1
arch/arm64/kvm/vgic/vgic-mmio.h
··· 213 213 const u32 val); 214 214 215 215 unsigned int vgic_v2_init_dist_iodev(struct vgic_io_device *dev); 216 + unsigned int vgic_v2_init_cpuif_iodev(struct vgic_io_device *dev); 216 217 217 218 unsigned int vgic_v3_init_dist_iodev(struct vgic_io_device *dev); 218 219
+222 -73
arch/arm64/kvm/vgic/vgic-v2.c
··· 9 9 #include <kvm/arm_vgic.h> 10 10 #include <asm/kvm_mmu.h> 11 11 12 + #include "vgic-mmio.h" 12 13 #include "vgic.h" 13 14 14 15 static inline void vgic_v2_write_lr(int lr, u32 val) ··· 27 26 vgic_v2_write_lr(i, 0); 28 27 } 29 28 30 - void vgic_v2_set_underflow(struct kvm_vcpu *vcpu) 29 + void vgic_v2_configure_hcr(struct kvm_vcpu *vcpu, 30 + struct ap_list_summary *als) 31 31 { 32 32 struct vgic_v2_cpu_if *cpuif = &vcpu->arch.vgic_cpu.vgic_v2; 33 33 34 - cpuif->vgic_hcr |= GICH_HCR_UIE; 34 + cpuif->vgic_hcr = GICH_HCR_EN; 35 + 36 + if (irqs_pending_outside_lrs(als)) 37 + cpuif->vgic_hcr |= GICH_HCR_NPIE; 38 + if (irqs_active_outside_lrs(als)) 39 + cpuif->vgic_hcr |= GICH_HCR_LRENPIE; 40 + if (irqs_outside_lrs(als)) 41 + cpuif->vgic_hcr |= GICH_HCR_UIE; 42 + 43 + cpuif->vgic_hcr |= (cpuif->vgic_vmcr & GICH_VMCR_ENABLE_GRP0_MASK) ? 44 + GICH_HCR_VGrp0DIE : GICH_HCR_VGrp0EIE; 45 + cpuif->vgic_hcr |= (cpuif->vgic_vmcr & GICH_VMCR_ENABLE_GRP1_MASK) ? 46 + GICH_HCR_VGrp1DIE : GICH_HCR_VGrp1EIE; 35 47 } 36 48 37 49 static bool lr_signals_eoi_mi(u32 lr_val) ··· 53 39 !(lr_val & GICH_LR_HW); 54 40 } 55 41 56 - /* 57 - * transfer the content of the LRs back into the corresponding ap_list: 58 - * - active bit is transferred as is 59 - * - pending bit is 60 - * - transferred as is in case of edge sensitive IRQs 61 - * - set to the line-level (resample time) for level sensitive IRQs 62 - */ 63 - void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu) 42 + static void vgic_v2_fold_lr(struct kvm_vcpu *vcpu, u32 val) 64 43 { 65 - struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 66 - struct vgic_v2_cpu_if *cpuif = &vgic_cpu->vgic_v2; 67 - int lr; 44 + u32 cpuid, intid = val & GICH_LR_VIRTUALID; 45 + struct vgic_irq *irq; 46 + bool deactivated; 68 47 69 - DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); 48 + /* Extract the source vCPU id from the LR */ 49 + cpuid = FIELD_GET(GICH_LR_PHYSID_CPUID, val) & 7; 70 50 71 - cpuif->vgic_hcr &= ~GICH_HCR_UIE; 51 + /* Notify fds when the guest EOI'ed a level-triggered SPI */ 52 + if (lr_signals_eoi_mi(val) && vgic_valid_spi(vcpu->kvm, intid)) 53 + kvm_notify_acked_irq(vcpu->kvm, 0, 54 + intid - VGIC_NR_PRIVATE_IRQS); 72 55 73 - for (lr = 0; lr < vgic_cpu->vgic_v2.used_lrs; lr++) { 74 - u32 val = cpuif->vgic_lr[lr]; 75 - u32 cpuid, intid = val & GICH_LR_VIRTUALID; 76 - struct vgic_irq *irq; 77 - bool deactivated; 56 + irq = vgic_get_vcpu_irq(vcpu, intid); 78 57 79 - /* Extract the source vCPU id from the LR */ 80 - cpuid = val & GICH_LR_PHYSID_CPUID; 81 - cpuid >>= GICH_LR_PHYSID_CPUID_SHIFT; 82 - cpuid &= 7; 83 - 84 - /* Notify fds when the guest EOI'ed a level-triggered SPI */ 85 - if (lr_signals_eoi_mi(val) && vgic_valid_spi(vcpu->kvm, intid)) 86 - kvm_notify_acked_irq(vcpu->kvm, 0, 87 - intid - VGIC_NR_PRIVATE_IRQS); 88 - 89 - irq = vgic_get_vcpu_irq(vcpu, intid); 90 - 91 - raw_spin_lock(&irq->irq_lock); 92 - 58 + scoped_guard(raw_spinlock, &irq->irq_lock) { 93 59 /* Always preserve the active bit, note deactivation */ 94 60 deactivated = irq->active && !(val & GICH_LR_ACTIVE_BIT); 95 61 irq->active = !!(val & GICH_LR_ACTIVE_BIT); ··· 95 101 /* Handle resampling for mapped interrupts if required */ 96 102 vgic_irq_handle_resampling(irq, deactivated, val & GICH_LR_PENDING_BIT); 97 103 98 - raw_spin_unlock(&irq->irq_lock); 99 - vgic_put_irq(vcpu->kvm, irq); 104 + irq->on_lr = false; 105 + } 106 + 107 + vgic_put_irq(vcpu->kvm, irq); 108 + } 109 + 110 + static u32 vgic_v2_compute_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq); 111 + 112 + /* 113 + * transfer the content of 
the LRs back into the corresponding ap_list: 114 + * - active bit is transferred as is 115 + * - pending bit is 116 + * - transferred as is in case of edge sensitive IRQs 117 + * - set to the line-level (resample time) for level sensitive IRQs 118 + */ 119 + void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu) 120 + { 121 + struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 122 + struct vgic_v2_cpu_if *cpuif = &vgic_cpu->vgic_v2; 123 + u32 eoicount = FIELD_GET(GICH_HCR_EOICOUNT, cpuif->vgic_hcr); 124 + struct vgic_irq *irq; 125 + 126 + DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); 127 + 128 + for (int lr = 0; lr < vgic_cpu->vgic_v2.used_lrs; lr++) 129 + vgic_v2_fold_lr(vcpu, cpuif->vgic_lr[lr]); 130 + 131 + /* See the GICv3 equivalent for the EOIcount handling rationale */ 132 + list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) { 133 + u32 lr; 134 + 135 + if (!eoicount) { 136 + break; 137 + } else { 138 + guard(raw_spinlock)(&irq->irq_lock); 139 + 140 + if (!(likely(vgic_target_oracle(irq) == vcpu) && 141 + irq->active)) 142 + continue; 143 + 144 + lr = vgic_v2_compute_lr(vcpu, irq) & ~GICH_LR_ACTIVE_BIT; 145 + } 146 + 147 + if (lr & GICH_LR_HW) 148 + writel_relaxed(FIELD_GET(GICH_LR_PHYSID_CPUID, lr), 149 + kvm_vgic_global_state.gicc_base + GIC_CPU_DEACTIVATE); 150 + vgic_v2_fold_lr(vcpu, lr); 151 + eoicount--; 100 152 } 101 153 102 154 cpuif->used_lrs = 0; 103 155 } 104 156 105 - /* 106 - * Populates the particular LR with the state of a given IRQ: 107 - * - for an edge sensitive IRQ the pending state is cleared in struct vgic_irq 108 - * - for a level sensitive IRQ the pending state value is unchanged; 109 - * it is dictated directly by the input level 110 - * 111 - * If @irq describes an SGI with multiple sources, we choose the 112 - * lowest-numbered source VCPU and clear that bit in the source bitmap. 113 - * 114 - * The irq_lock must be held by the caller. 115 - */ 116 - void vgic_v2_populate_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq, int lr) 157 + void vgic_v2_deactivate(struct kvm_vcpu *vcpu, u32 val) 158 + { 159 + struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 160 + struct vgic_v2_cpu_if *cpuif = &vgic_cpu->vgic_v2; 161 + struct kvm_vcpu *target_vcpu = NULL; 162 + bool mmio = false; 163 + struct vgic_irq *irq; 164 + unsigned long flags; 165 + u64 lr = 0; 166 + u8 cpuid; 167 + 168 + /* Snapshot CPUID, and remove it from the INTID */ 169 + cpuid = FIELD_GET(GENMASK_ULL(12, 10), val); 170 + val &= ~GENMASK_ULL(12, 10); 171 + 172 + /* We only deal with DIR when EOIMode==1 */ 173 + if (!(cpuif->vgic_vmcr & GICH_VMCR_EOI_MODE_MASK)) 174 + return; 175 + 176 + /* Make sure we're in the same context as LR handling */ 177 + local_irq_save(flags); 178 + 179 + irq = vgic_get_vcpu_irq(vcpu, val); 180 + if (WARN_ON_ONCE(!irq)) 181 + goto out; 182 + 183 + /* See the corresponding v3 code for the rationale */ 184 + scoped_guard(raw_spinlock, &irq->irq_lock) { 185 + target_vcpu = irq->vcpu; 186 + 187 + /* Not on any ap_list? */ 188 + if (!target_vcpu) 189 + goto put; 190 + 191 + /* 192 + * Urgh. We're deactivating something that we cannot 193 + * observe yet... Big hammer time. 194 + */ 195 + if (irq->on_lr) { 196 + mmio = true; 197 + goto put; 198 + } 199 + 200 + /* SGI: check that the cpuid matches */ 201 + if (val < VGIC_NR_SGIS && irq->active_source != cpuid) { 202 + target_vcpu = NULL; 203 + goto put; 204 + } 205 + 206 + /* (with a Dalek voice) DEACTIVATE!!!! 
*/ 207 + lr = vgic_v2_compute_lr(vcpu, irq) & ~GICH_LR_ACTIVE_BIT; 208 + } 209 + 210 + if (lr & GICH_LR_HW) 211 + writel_relaxed(FIELD_GET(GICH_LR_PHYSID_CPUID, lr), 212 + kvm_vgic_global_state.gicc_base + GIC_CPU_DEACTIVATE); 213 + 214 + vgic_v2_fold_lr(vcpu, lr); 215 + 216 + put: 217 + vgic_put_irq(vcpu->kvm, irq); 218 + 219 + out: 220 + local_irq_restore(flags); 221 + 222 + if (mmio) 223 + vgic_mmio_write_cactive(vcpu, (val / 32) * 4, 4, BIT(val % 32)); 224 + 225 + /* Force the ap_list to be pruned */ 226 + if (target_vcpu) 227 + kvm_make_request(KVM_REQ_VGIC_PROCESS_UPDATE, target_vcpu); 228 + } 229 + 230 + static u32 vgic_v2_compute_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq) 117 231 { 118 232 u32 val = irq->intid; 119 233 bool allow_pending = true; 234 + 235 + WARN_ON(irq->on_lr); 120 236 121 237 if (irq->active) { 122 238 val |= GICH_LR_ACTIVE_BIT; ··· 267 163 if (allow_pending && irq_is_pending(irq)) { 268 164 val |= GICH_LR_PENDING_BIT; 269 165 166 + if (vgic_irq_is_sgi(irq->intid)) { 167 + u32 src = ffs(irq->source); 168 + 169 + if (WARN_RATELIMIT(!src, "No SGI source for INTID %d\n", 170 + irq->intid)) 171 + return 0; 172 + 173 + val |= (src - 1) << GICH_LR_PHYSID_CPUID_SHIFT; 174 + if (irq->source & ~BIT(src - 1)) 175 + val |= GICH_LR_EOI; 176 + } 177 + } 178 + 179 + /* The GICv2 LR only holds five bits of priority. */ 180 + val |= (irq->priority >> 3) << GICH_LR_PRIORITY_SHIFT; 181 + 182 + return val; 183 + } 184 + 185 + /* 186 + * Populates the particular LR with the state of a given IRQ: 187 + * - for an edge sensitive IRQ the pending state is cleared in struct vgic_irq 188 + * - for a level sensitive IRQ the pending state value is unchanged; 189 + * it is dictated directly by the input level 190 + * 191 + * If @irq describes an SGI with multiple sources, we choose the 192 + * lowest-numbered source VCPU and clear that bit in the source bitmap. 193 + * 194 + * The irq_lock must be held by the caller. 195 + */ 196 + void vgic_v2_populate_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq, int lr) 197 + { 198 + u32 val = vgic_v2_compute_lr(vcpu, irq); 199 + 200 + vcpu->arch.vgic_cpu.vgic_v2.vgic_lr[lr] = val; 201 + 202 + if (val & GICH_LR_PENDING_BIT) { 270 203 if (irq->config == VGIC_CONFIG_EDGE) 271 204 irq->pending_latch = false; 272 205 273 206 if (vgic_irq_is_sgi(irq->intid)) { 274 207 u32 src = ffs(irq->source); 275 208 276 - if (WARN_RATELIMIT(!src, "No SGI source for INTID %d\n", 277 - irq->intid)) 278 - return; 279 - 280 - val |= (src - 1) << GICH_LR_PHYSID_CPUID_SHIFT; 281 - irq->source &= ~(1 << (src - 1)); 282 - if (irq->source) { 209 + irq->source &= ~BIT(src - 1); 210 + if (irq->source) 283 211 irq->pending_latch = true; 284 - val |= GICH_LR_EOI; 285 - } 286 212 } 287 213 } 288 214 ··· 328 194 /* The GICv2 LR only holds five bits of priority. */ 329 195 val |= (irq->priority >> 3) << GICH_LR_PRIORITY_SHIFT; 330 196 331 - vcpu->arch.vgic_cpu.vgic_v2.vgic_lr[lr] = val; 197 + irq->on_lr = true; 332 198 } 333 199 334 200 void vgic_v2_clear_lr(struct kvm_vcpu *vcpu, int lr) ··· 391 257 GICH_VMCR_PRIMASK_SHIFT) << GICV_PMR_PRIORITY_SHIFT; 392 258 } 393 259 394 - void vgic_v2_enable(struct kvm_vcpu *vcpu) 260 + void vgic_v2_reset(struct kvm_vcpu *vcpu) 395 261 { 396 262 /* 397 263 * By forcing VMCR to zero, the GIC will restore the binary ··· 399 265 * anyway. 400 266 */ 401 267 vcpu->arch.vgic_cpu.vgic_v2.vgic_vmcr = 0; 402 - 403 - /* Get the show on the road... 
*/ 404 - vcpu->arch.vgic_cpu.vgic_v2.vgic_hcr = GICH_HCR_EN; 405 268 } 406 269 407 270 /* check for overlapping regions and for regions crossing the end of memory */ ··· 420 289 int vgic_v2_map_resources(struct kvm *kvm) 421 290 { 422 291 struct vgic_dist *dist = &kvm->arch.vgic; 292 + unsigned int len; 423 293 int ret = 0; 424 294 425 295 if (IS_VGIC_ADDR_UNDEF(dist->vgic_dist_base) || ··· 444 312 return ret; 445 313 } 446 314 315 + len = vgic_v2_init_cpuif_iodev(&dist->cpuif_iodev); 316 + dist->cpuif_iodev.base_addr = dist->vgic_cpu_base; 317 + dist->cpuif_iodev.iodev_type = IODEV_CPUIF; 318 + dist->cpuif_iodev.redist_vcpu = NULL; 319 + 320 + ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, dist->vgic_cpu_base, 321 + len, &dist->cpuif_iodev.dev); 322 + if (ret) 323 + return ret; 324 + 447 325 if (!static_branch_unlikely(&vgic_v2_cpuif_trap)) { 448 326 ret = kvm_phys_addr_ioremap(kvm, dist->vgic_cpu_base, 449 327 kvm_vgic_global_state.vcpu_base, 450 - KVM_VGIC_V2_CPU_SIZE, true); 328 + KVM_VGIC_V2_CPU_SIZE - SZ_4K, true); 451 329 if (ret) { 452 330 kvm_err("Unable to remap VGIC CPU to VCPU\n"); 453 331 return ret; ··· 527 385 528 386 kvm_vgic_global_state.can_emulate_gicv2 = true; 529 387 kvm_vgic_global_state.vcpu_base = info->vcpu.start; 388 + kvm_vgic_global_state.gicc_base = info->gicc_base; 530 389 kvm_vgic_global_state.type = VGIC_V2; 531 390 kvm_vgic_global_state.max_gic_vcpus = VGIC_V2_MAX_CPUS; 532 391 ··· 566 423 567 424 void vgic_v2_save_state(struct kvm_vcpu *vcpu) 568 425 { 426 + struct vgic_v2_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v2; 569 427 void __iomem *base = kvm_vgic_global_state.vctrl_base; 570 428 u64 used_lrs = vcpu->arch.vgic_cpu.vgic_v2.used_lrs; 571 429 572 430 if (!base) 573 431 return; 574 432 575 - if (used_lrs) { 433 + cpu_if->vgic_vmcr = readl_relaxed(kvm_vgic_global_state.vctrl_base + GICH_VMCR); 434 + 435 + if (used_lrs) 576 436 save_lrs(vcpu, base); 577 - writel_relaxed(0, base + GICH_HCR); 437 + 438 + if (cpu_if->vgic_hcr & GICH_HCR_LRENPIE) { 439 + u32 val = readl_relaxed(base + GICH_HCR); 440 + 441 + cpu_if->vgic_hcr &= ~GICH_HCR_EOICOUNT; 442 + cpu_if->vgic_hcr |= val & GICH_HCR_EOICOUNT; 578 443 } 444 + 445 + writel_relaxed(0, base + GICH_HCR); 579 446 } 580 447 581 448 void vgic_v2_restore_state(struct kvm_vcpu *vcpu) ··· 598 445 if (!base) 599 446 return; 600 447 601 - if (used_lrs) { 602 - writel_relaxed(cpu_if->vgic_hcr, base + GICH_HCR); 603 - for (i = 0; i < used_lrs; i++) { 604 - writel_relaxed(cpu_if->vgic_lr[i], 605 - base + GICH_LR0 + (i * 4)); 606 - } 607 - } 448 + writel_relaxed(cpu_if->vgic_hcr, base + GICH_HCR); 449 + 450 + for (i = 0; i < used_lrs; i++) 451 + writel_relaxed(cpu_if->vgic_lr[i], base + GICH_LR0 + (i * 4)); 608 452 } 609 453 610 454 void vgic_v2_load(struct kvm_vcpu *vcpu) ··· 618 468 { 619 469 struct vgic_v2_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v2; 620 470 621 - cpu_if->vgic_vmcr = readl_relaxed(kvm_vgic_global_state.vctrl_base + GICH_VMCR); 622 471 cpu_if->vgic_apr = readl_relaxed(kvm_vgic_global_state.vctrl_base + GICH_APR); 623 472 }
+53 -53
arch/arm64/kvm/vgic/vgic-v3-nested.c
··· 70 70 * - on L2 put: perform the inverse transformation, so that the result of L2 71 71 * running becomes visible to L1 in the VNCR-accessible registers. 72 72 * 73 - * - there is nothing to do on L2 entry, as everything will have happened 74 - * on load. However, this is the point where we detect that an interrupt 75 - * targeting L1 and prepare the grand switcheroo. 73 + * - there is nothing to do on L2 entry apart from enabling the vgic, as 74 + * everything will have happened on load. However, this is the point where 75 + * we detect that an interrupt is targeting L1 and prepare the grand 76 + * switcheroo. 76 77 * 77 - * - on L2 exit: emulate the HW bit, and deactivate corresponding the L1 78 - * interrupt. The L0 active state will be cleared by the HW if the L1 79 - * interrupt was itself backed by a HW interrupt. 78 + * - on L2 exit: resync the LRs and VMCR, emulate the HW bit, and deactivate 79 + * the corresponding L1 interrupt. The L0 active state will be cleared by 80 + * the HW if the L1 interrupt was itself backed by a HW interrupt. 80 81 * 81 82 * Maintenance Interrupt (MI) management: 82 83 * ··· 94 93 * 95 94 * - because most of the ICH_*_EL2 registers live in the VNCR page, the 96 95 * quality of emulation is poor: L1 can setup the vgic so that an MI would 97 96 * immediately fire, and not observe anything until the next exit. 98 + * Similarly, a pending MI is not immediately disabled by clearing 99 + * ICH_HCR_EL2.En. Trying to read ICH_MISR_EL2 would do the trick, for 100 + * example. 99 100 * 100 101 * System register emulation: 101 102 * ··· 268 265 s_cpu_if->used_lrs = hweight16(shadow_if->lr_map); 269 266 } 270 267 268 + void vgic_v3_flush_nested(struct kvm_vcpu *vcpu) 269 + { 270 + u64 val = __vcpu_sys_reg(vcpu, ICH_HCR_EL2); 271 + 272 + write_sysreg_s(val | vgic_ich_hcr_trap_bits(), SYS_ICH_HCR_EL2); 273 + } 274 + 271 275 void vgic_v3_sync_nested(struct kvm_vcpu *vcpu) 272 276 { 273 277 struct shadow_if *shadow_if = get_shadow_if(); 274 278 int i; 275 279 276 280 for_each_set_bit(i, &shadow_if->lr_map, kvm_vgic_global_state.nr_lr) { 277 - u64 lr = __vcpu_sys_reg(vcpu, ICH_LRN(i)); 278 - struct vgic_irq *irq; 281 + u64 val, host_lr, lr; 279 282 280 - if (!(lr & ICH_LR_HW) || !(lr & ICH_LR_STATE)) 283 + host_lr = __gic_v3_get_lr(lr_map_idx_to_shadow_idx(shadow_if, i)); 284 + 285 + /* Propagate the new LR state */ 286 + lr = __vcpu_sys_reg(vcpu, ICH_LRN(i)); 287 + val = lr & ~ICH_LR_STATE; 288 + val |= host_lr & ICH_LR_STATE; 289 + __vcpu_assign_sys_reg(vcpu, ICH_LRN(i), val); 290 + 291 + /* 292 + * Deactivation of a HW interrupt: the LR must have the HW 293 + * bit set, have been in a non-invalid state before the run, 294 + * and now be in an invalid state. If any of that doesn't 295 + * hold, we're done with this LR. 296 + */ 297 + if (!((lr & ICH_LR_HW) && (lr & ICH_LR_STATE) && 298 + !(host_lr & ICH_LR_STATE))) 281 299 continue; 282 301 /* ··· 306 282 * need to emulate the HW effect between the guest hypervisor
308 284 */ 309 - irq = vgic_get_vcpu_irq(vcpu, FIELD_GET(ICH_LR_PHYS_ID_MASK, lr)); 310 - if (WARN_ON(!irq)) /* Shouldn't happen as we check on load */ 311 - continue; 312 - 313 - lr = __gic_v3_get_lr(lr_map_idx_to_shadow_idx(shadow_if, i)); 314 - if (!(lr & ICH_LR_STATE)) 315 - irq->active = false; 316 - 317 - vgic_put_irq(vcpu->kvm, irq); 285 + vgic_v3_deactivate(vcpu, FIELD_GET(ICH_LR_PHYS_ID_MASK, lr)); 318 286 } 287 + 288 + /* We need these to be synchronised to generate the MI */ 289 + __vcpu_assign_sys_reg(vcpu, ICH_VMCR_EL2, read_sysreg_s(SYS_ICH_VMCR_EL2)); 290 + __vcpu_rmw_sys_reg(vcpu, ICH_HCR_EL2, &=, ~ICH_HCR_EL2_EOIcount); 291 + __vcpu_rmw_sys_reg(vcpu, ICH_HCR_EL2, |=, read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EL2_EOIcount); 292 + 293 + write_sysreg_s(0, SYS_ICH_HCR_EL2); 294 + isb(); 295 + 296 + vgic_v3_nested_update_mi(vcpu); 319 297 } 320 298 321 299 static void vgic_v3_create_shadow_state(struct kvm_vcpu *vcpu, 322 300 struct vgic_v3_cpu_if *s_cpu_if) 323 301 { 324 302 struct vgic_v3_cpu_if *host_if = &vcpu->arch.vgic_cpu.vgic_v3; 325 - u64 val = 0; 326 303 int i; 327 304 328 - /* 329 - * If we're on a system with a broken vgic that requires 330 - * trapping, propagate the trapping requirements. 331 - * 332 - * Ah, the smell of rotten fruits... 333 - */ 334 - if (static_branch_unlikely(&vgic_v3_cpuif_trap)) 335 - val = host_if->vgic_hcr & (ICH_HCR_EL2_TALL0 | ICH_HCR_EL2_TALL1 | 336 - ICH_HCR_EL2_TC | ICH_HCR_EL2_TDIR); 337 - s_cpu_if->vgic_hcr = __vcpu_sys_reg(vcpu, ICH_HCR_EL2) | val; 305 + s_cpu_if->vgic_hcr = __vcpu_sys_reg(vcpu, ICH_HCR_EL2); 338 306 s_cpu_if->vgic_vmcr = __vcpu_sys_reg(vcpu, ICH_VMCR_EL2); 339 307 s_cpu_if->vgic_sre = host_if->vgic_sre; 340 308 ··· 350 334 __vgic_v3_restore_vmcr_aprs(cpu_if); 351 335 __vgic_v3_activate_traps(cpu_if); 352 336 353 - __vgic_v3_restore_state(cpu_if); 337 + for (int i = 0; i < cpu_if->used_lrs; i++) 338 + __gic_v3_set_lr(cpu_if->vgic_lr[i], i); 354 339 355 340 /* 356 341 * Propagate the number of used LRs for the benefit of the HYP ··· 364 347 { 365 348 struct shadow_if *shadow_if = get_shadow_if(); 366 349 struct vgic_v3_cpu_if *s_cpu_if = &shadow_if->cpuif; 367 - u64 val; 368 350 int i; 369 351 370 - __vgic_v3_save_vmcr_aprs(s_cpu_if); 371 - __vgic_v3_deactivate_traps(s_cpu_if); 372 - __vgic_v3_save_state(s_cpu_if); 373 - 374 - /* 375 - * Translate the shadow state HW fields back to the virtual ones 376 - * before copying the shadow struct back to the nested one. 377 - */ 378 - val = __vcpu_sys_reg(vcpu, ICH_HCR_EL2); 379 - val &= ~ICH_HCR_EL2_EOIcount_MASK; 380 - val |= (s_cpu_if->vgic_hcr & ICH_HCR_EL2_EOIcount_MASK); 381 - __vcpu_assign_sys_reg(vcpu, ICH_HCR_EL2, val); 382 - __vcpu_assign_sys_reg(vcpu, ICH_VMCR_EL2, s_cpu_if->vgic_vmcr); 352 + __vgic_v3_save_aprs(s_cpu_if); 383 353 384 354 for (i = 0; i < 4; i++) { 385 355 __vcpu_assign_sys_reg(vcpu, ICH_AP0RN(i), s_cpu_if->vgic_ap0r[i]); 386 356 __vcpu_assign_sys_reg(vcpu, ICH_AP1RN(i), s_cpu_if->vgic_ap1r[i]); 387 357 } 388 358 389 - for_each_set_bit(i, &shadow_if->lr_map, kvm_vgic_global_state.nr_lr) { 390 - val = __vcpu_sys_reg(vcpu, ICH_LRN(i)); 359 + for (i = 0; i < s_cpu_if->used_lrs; i++) 360 + __gic_v3_set_lr(0, i); 391 361 392 - val &= ~ICH_LR_STATE; 393 - val |= s_cpu_if->vgic_lr[lr_map_idx_to_shadow_idx(shadow_if, i)] & ICH_LR_STATE; 394 - 395 - __vcpu_assign_sys_reg(vcpu, ICH_LRN(i), val); 396 - } 362 + __vgic_v3_deactivate_traps(s_cpu_if); 397 363 398 364 vcpu->arch.vgic_cpu.vgic_v3.used_lrs = 0; 399 365 }
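On the nested side, L1's deactivation of a HW-backed interrupt is inferred purely from how the shadow LR changed across the run, per the comment in the loop above. The predicate, pulled out on its own (a sketch):

/* Sketch: did L1 deactivate the HW interrupt held in this LR? */
static bool lr_hw_deactivated(u64 before, u64 after)
{
	return (before & ICH_LR_HW) &&		/* backed by a HW IRQ */
	       (before & ICH_LR_STATE) &&	/* was pending/active */
	       !(after & ICH_LR_STATE);		/* now invalid */
}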
+330 -100
arch/arm64/kvm/vgic/vgic-v3.c
··· 12 12 #include <asm/kvm_mmu.h> 13 13 #include <asm/kvm_asm.h> 14 14 15 + #include "vgic-mmio.h" 15 16 #include "vgic.h" 16 17 17 18 static bool group0_trap; ··· 21 20 static bool dir_trap; 22 21 static bool gicv4_enable; 23 22 24 - void vgic_v3_set_underflow(struct kvm_vcpu *vcpu) 23 + void vgic_v3_configure_hcr(struct kvm_vcpu *vcpu, 24 + struct ap_list_summary *als) 25 25 { 26 26 struct vgic_v3_cpu_if *cpuif = &vcpu->arch.vgic_cpu.vgic_v3; 27 27 28 - cpuif->vgic_hcr |= ICH_HCR_EL2_UIE; 28 + if (!irqchip_in_kernel(vcpu->kvm)) 29 + return; 30 + 31 + cpuif->vgic_hcr = ICH_HCR_EL2_En; 32 + 33 + if (irqs_pending_outside_lrs(als)) 34 + cpuif->vgic_hcr |= ICH_HCR_EL2_NPIE; 35 + if (irqs_active_outside_lrs(als)) 36 + cpuif->vgic_hcr |= ICH_HCR_EL2_LRENPIE; 37 + if (irqs_outside_lrs(als)) 38 + cpuif->vgic_hcr |= ICH_HCR_EL2_UIE; 39 + 40 + if (!als->nr_sgi) 41 + cpuif->vgic_hcr |= ICH_HCR_EL2_vSGIEOICount; 42 + 43 + cpuif->vgic_hcr |= (cpuif->vgic_vmcr & ICH_VMCR_ENG0_MASK) ? 44 + ICH_HCR_EL2_VGrp0DIE : ICH_HCR_EL2_VGrp0EIE; 45 + cpuif->vgic_hcr |= (cpuif->vgic_vmcr & ICH_VMCR_ENG1_MASK) ? 46 + ICH_HCR_EL2_VGrp1DIE : ICH_HCR_EL2_VGrp1EIE; 47 + 48 + /* 49 + * Dealing with EOImode=1 is a massive source of headache. Not 50 + * only do we need to track that we have active interrupts 51 + * outside of the LRs and force DIR to be trapped, we also 52 + * need to deal with SPIs that can be deactivated on another 53 + * CPU. 54 + * 55 + * On systems that do not implement TDIR, force the bit in the 56 + * shadow state anyway to avoid IPI-ing on these poor sods. 57 + * 58 + * Note that we set the trap irrespective of EOIMode, as that 59 + * can change behind our back without any warning... 60 + */ 61 + if (!cpus_have_final_cap(ARM64_HAS_ICH_HCR_EL2_TDIR) || 62 + irqs_active_outside_lrs(als) || 63 + atomic_read(&vcpu->kvm->arch.vgic.active_spis)) 64 + cpuif->vgic_hcr |= ICH_HCR_EL2_TDIR; 29 65 } 30 66 31 67 static bool lr_signals_eoi_mi(u64 lr_val) ··· 71 33 !(lr_val & ICH_LR_HW); 72 34 } 73 35 74 - void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu) 36 + static void vgic_v3_fold_lr(struct kvm_vcpu *vcpu, u64 val) 75 37 { 76 - struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 77 - struct vgic_v3_cpu_if *cpuif = &vgic_cpu->vgic_v3; 78 - u32 model = vcpu->kvm->arch.vgic.vgic_model; 79 - int lr; 38 + struct vgic_irq *irq; 39 + bool is_v2_sgi = false; 40 + bool deactivated; 41 + u32 intid; 80 42 81 - DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); 43 + if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) { 44 + intid = val & ICH_LR_VIRTUAL_ID_MASK; 45 + } else { 46 + intid = val & GICH_LR_VIRTUALID; 47 + is_v2_sgi = vgic_irq_is_sgi(intid); 48 + } 82 49 83 - cpuif->vgic_hcr &= ~ICH_HCR_EL2_UIE; 50 + irq = vgic_get_vcpu_irq(vcpu, intid); 51 + if (!irq) /* An LPI could have been unmapped. 
*/ 52 + return; 84 53 85 - for (lr = 0; lr < cpuif->used_lrs; lr++) { 86 - u64 val = cpuif->vgic_lr[lr]; 87 - u32 intid, cpuid; 88 - struct vgic_irq *irq; 89 - bool is_v2_sgi = false; 90 - bool deactivated; 91 - 92 - cpuid = val & GICH_LR_PHYSID_CPUID; 93 - cpuid >>= GICH_LR_PHYSID_CPUID_SHIFT; 94 - 95 - if (model == KVM_DEV_TYPE_ARM_VGIC_V3) { 96 - intid = val & ICH_LR_VIRTUAL_ID_MASK; 97 - } else { 98 - intid = val & GICH_LR_VIRTUALID; 99 - is_v2_sgi = vgic_irq_is_sgi(intid); 100 - } 101 - 102 - /* Notify fds when the guest EOI'ed a level-triggered IRQ */ 103 - if (lr_signals_eoi_mi(val) && vgic_valid_spi(vcpu->kvm, intid)) 104 - kvm_notify_acked_irq(vcpu->kvm, 0, 105 - intid - VGIC_NR_PRIVATE_IRQS); 106 - 107 - irq = vgic_get_vcpu_irq(vcpu, intid); 108 - if (!irq) /* An LPI could have been unmapped. */ 109 - continue; 110 - 111 - raw_spin_lock(&irq->irq_lock); 112 - 113 - /* Always preserve the active bit, note deactivation */ 54 + scoped_guard(raw_spinlock, &irq->irq_lock) { 55 + /* Always preserve the active bit for !LPIs, note deactivation */ 56 + if (irq->intid >= VGIC_MIN_LPI) 57 + val &= ~ICH_LR_ACTIVE_BIT; 114 58 deactivated = irq->active && !(val & ICH_LR_ACTIVE_BIT); 115 59 irq->active = !!(val & ICH_LR_ACTIVE_BIT); 116 60 117 - if (irq->active && is_v2_sgi) 118 - irq->active_source = cpuid; 119 - 120 61 /* Edge is the only case where we preserve the pending bit */ 121 62 if (irq->config == VGIC_CONFIG_EDGE && 122 - (val & ICH_LR_PENDING_BIT)) { 63 + (val & ICH_LR_PENDING_BIT)) 123 64 irq->pending_latch = true; 124 - 125 - if (is_v2_sgi) 126 - irq->source |= (1 << cpuid); 127 - } 128 65 129 66 /* 130 67 * Clear soft pending state when level irqs have been acked. ··· 107 94 if (irq->config == VGIC_CONFIG_LEVEL && !(val & ICH_LR_STATE)) 108 95 irq->pending_latch = false; 109 96 97 + if (is_v2_sgi) { 98 + u8 cpuid = FIELD_GET(GICH_LR_PHYSID_CPUID, val); 99 + 100 + if (irq->active) 101 + irq->active_source = cpuid; 102 + 103 + if (val & ICH_LR_PENDING_BIT) 104 + irq->source |= BIT(cpuid); 105 + } 106 + 110 107 /* Handle resampling for mapped interrupts if required */ 111 108 vgic_irq_handle_resampling(irq, deactivated, val & ICH_LR_PENDING_BIT); 112 109 113 - raw_spin_unlock(&irq->irq_lock); 114 - vgic_put_irq(vcpu->kvm, irq); 110 + irq->on_lr = false; 111 + } 112 + 113 + /* Notify fds when the guest EOI'ed a level-triggered SPI, and drop the refcount */ 114 + if (deactivated && lr_signals_eoi_mi(val) && vgic_valid_spi(vcpu->kvm, intid)) { 115 + kvm_notify_acked_irq(vcpu->kvm, 0, 116 + intid - VGIC_NR_PRIVATE_IRQS); 117 + atomic_dec_if_positive(&vcpu->kvm->arch.vgic.active_spis); 118 + } 119 + 120 + vgic_put_irq(vcpu->kvm, irq); 121 + } 122 + 123 + static u64 vgic_v3_compute_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq); 124 + 125 + static void vgic_v3_deactivate_phys(u32 intid) 126 + { 127 + if (cpus_have_final_cap(ARM64_HAS_GICV5_LEGACY)) 128 + gic_insn(intid | FIELD_PREP(GICV5_GIC_CDDI_TYPE_MASK, 1), CDDI); 129 + else 130 + gic_write_dir(intid); 131 + } 132 + 133 + void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu) 134 + { 135 + struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 136 + struct vgic_v3_cpu_if *cpuif = &vgic_cpu->vgic_v3; 137 + u32 eoicount = FIELD_GET(ICH_HCR_EL2_EOIcount, cpuif->vgic_hcr); 138 + struct vgic_irq *irq; 139 + 140 + DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); 141 + 142 + for (int lr = 0; lr < cpuif->used_lrs; lr++) 143 + vgic_v3_fold_lr(vcpu, cpuif->vgic_lr[lr]); 144 + 145 + /* 146 + * EOIMode=0: use EOIcount to emulate deactivation. 
We are 147 + * guaranteed to deactivate in reverse order of the activation, so 148 + * just pick one active interrupt after the other in the ap_list, 149 + * and replay the deactivation as if the CPU was doing it. We also 150 + * rely on priority drop to have taken place, and the list to be 151 + * sorted by priority. 152 + */ 153 + list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) { 154 + u64 lr; 155 + 156 + /* 157 + * I would have loved to write this using a scoped_guard(), 158 + * but using 'continue' here is a total train wreck. 159 + */ 160 + if (!eoicount) { 161 + break; 162 + } else { 163 + guard(raw_spinlock)(&irq->irq_lock); 164 + 165 + if (!(likely(vgic_target_oracle(irq) == vcpu) && 166 + irq->active)) 167 + continue; 168 + 169 + lr = vgic_v3_compute_lr(vcpu, irq) & ~ICH_LR_ACTIVE_BIT; 170 + } 171 + 172 + if (lr & ICH_LR_HW) 173 + vgic_v3_deactivate_phys(FIELD_GET(ICH_LR_PHYS_ID_MASK, lr)); 174 + 175 + vgic_v3_fold_lr(vcpu, lr); 176 + eoicount--; 115 177 } 116 178 117 179 cpuif->used_lrs = 0; 118 180 } 119 181 182 + void vgic_v3_deactivate(struct kvm_vcpu *vcpu, u64 val) 183 + { 184 + struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 185 + struct vgic_v3_cpu_if *cpuif = &vgic_cpu->vgic_v3; 186 + u32 model = vcpu->kvm->arch.vgic.vgic_model; 187 + struct kvm_vcpu *target_vcpu = NULL; 188 + bool mmio = false, is_v2_sgi; 189 + struct vgic_irq *irq; 190 + unsigned long flags; 191 + u64 lr = 0; 192 + u8 cpuid; 193 + 194 + /* Snapshot CPUID, and remove it from the INTID */ 195 + cpuid = FIELD_GET(GENMASK_ULL(12, 10), val); 196 + val &= ~GENMASK_ULL(12, 10); 197 + 198 + is_v2_sgi = (model == KVM_DEV_TYPE_ARM_VGIC_V2 && 199 + val < VGIC_NR_SGIS); 200 + 201 + /* 202 + * We only deal with DIR when EOIMode==1, and only for SGI, 203 + * PPI or SPI. 204 + */ 205 + if (!(cpuif->vgic_vmcr & ICH_VMCR_EOIM_MASK) || 206 + val >= vcpu->kvm->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS) 207 + return; 208 + 209 + /* Make sure we're in the same context as LR handling */ 210 + local_irq_save(flags); 211 + 212 + irq = vgic_get_vcpu_irq(vcpu, val); 213 + if (WARN_ON_ONCE(!irq)) 214 + goto out; 215 + 216 + /* 217 + * EOIMode=1: we must rely on traps to handle deactivation of 218 + * overflowing interrupts, as there is no ordering guarantee and 219 + * EOIcount isn't being incremented. Priority drop will have taken 220 + * place, as ICV_EOIxR_EL1 only affects the APRs and not the LRs. 221 + * 222 + * Three possibilities: 223 + * 224 + * - The irq is not queued on any CPU, and there is nothing to 225 + * do, 226 + * 227 + * - Or the irq is in an LR, meaning that its state is not 228 + * directly observable. Treat it bluntly by acting as if 229 + * this were a write to GICD_ICACTIVER, which will force an 230 + * exit on all vcpus. If it hurts, don't do that. 231 + * 232 + * - Or the irq is active, but not in an LR, and we can 233 + * directly deactivate it by building a pseudo-LR, folding it, 234 + * and queueing a request to prune the resulting ap_list. 235 + * 236 + * Special care must be taken to match the source CPUID when 237 + * deactivating a GICv2 SGI. 238 + */ 239 + scoped_guard(raw_spinlock, &irq->irq_lock) { 240 + target_vcpu = irq->vcpu; 241 + 242 + /* Not on any ap_list? */ 243 + if (!target_vcpu) 244 + goto put; 245 + 246 + /* 247 + * Urgh. We're deactivating something that we cannot 248 + * observe yet... Big hammer time. 
249 + */ 250 + if (irq->on_lr) { 251 + mmio = true; 252 + goto put; 253 + } 254 + 255 + /* GICv2 SGI: check that the cpuid matches */ 256 + if (is_v2_sgi && irq->active_source != cpuid) { 257 + target_vcpu = NULL; 258 + goto put; 259 + } 260 + 261 + /* (with a Dalek voice) DEACTIVATE!!!! */ 262 + lr = vgic_v3_compute_lr(vcpu, irq) & ~ICH_LR_ACTIVE_BIT; 263 + } 264 + 265 + if (lr & ICH_LR_HW) 266 + vgic_v3_deactivate_phys(FIELD_GET(ICH_LR_PHYS_ID_MASK, lr)); 267 + 268 + vgic_v3_fold_lr(vcpu, lr); 269 + 270 + put: 271 + vgic_put_irq(vcpu->kvm, irq); 272 + 273 + out: 274 + local_irq_restore(flags); 275 + 276 + if (mmio) 277 + vgic_mmio_write_cactive(vcpu, (val / 32) * 4, 4, BIT(val % 32)); 278 + 279 + /* Force the ap_list to be pruned */ 280 + if (target_vcpu) 281 + kvm_make_request(KVM_REQ_VGIC_PROCESS_UPDATE, target_vcpu); 282 + } 283 + 120 284 /* Requires the irq to be locked already */ 121 - void vgic_v3_populate_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq, int lr) 285 + static u64 vgic_v3_compute_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq) 122 286 { 123 287 u32 model = vcpu->kvm->arch.vgic.vgic_model; 124 288 u64 val = irq->intid; 125 289 bool allow_pending = true, is_v2_sgi; 290 + 291 + WARN_ON(irq->on_lr); 126 292 127 293 is_v2_sgi = (vgic_irq_is_sgi(irq->intid) && 128 294 model == KVM_DEV_TYPE_ARM_VGIC_V2); ··· 342 150 if (allow_pending && irq_is_pending(irq)) { 343 151 val |= ICH_LR_PENDING_BIT; 344 152 153 + if (is_v2_sgi) { 154 + u32 src = ffs(irq->source); 155 + 156 + if (WARN_RATELIMIT(!src, "No SGI source for INTID %d\n", 157 + irq->intid)) 158 + return 0; 159 + 160 + val |= (src - 1) << GICH_LR_PHYSID_CPUID_SHIFT; 161 + if (irq->source & ~BIT(src - 1)) 162 + val |= ICH_LR_EOI; 163 + } 164 + } 165 + 166 + if (irq->group) 167 + val |= ICH_LR_GROUP; 168 + 169 + val |= (u64)irq->priority << ICH_LR_PRIORITY_SHIFT; 170 + 171 + return val; 172 + } 173 + 174 + void vgic_v3_populate_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq, int lr) 175 + { 176 + u32 model = vcpu->kvm->arch.vgic.vgic_model; 177 + u64 val = vgic_v3_compute_lr(vcpu, irq); 178 + 179 + vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[lr] = val; 180 + 181 + if (val & ICH_LR_PENDING_BIT) { 345 182 if (irq->config == VGIC_CONFIG_EDGE) 346 183 irq->pending_latch = false; 347 184 ··· 378 157 model == KVM_DEV_TYPE_ARM_VGIC_V2) { 379 158 u32 src = ffs(irq->source); 380 159 381 - if (WARN_RATELIMIT(!src, "No SGI source for INTID %d\n", 382 - irq->intid)) 383 - return; 384 - 385 - val |= (src - 1) << GICH_LR_PHYSID_CPUID_SHIFT; 386 - irq->source &= ~(1 << (src - 1)); 387 - if (irq->source) { 160 + irq->source &= ~BIT(src - 1); 161 + if (irq->source) 388 162 irq->pending_latch = true; 389 - val |= ICH_LR_EOI; 390 - } 391 163 } 392 164 } 393 165 ··· 393 179 if (vgic_irq_is_mapped_level(irq) && (val & ICH_LR_PENDING_BIT)) 394 180 irq->line_level = false; 395 181 396 - if (irq->group) 397 - val |= ICH_LR_GROUP; 398 - 399 - val |= (u64)irq->priority << ICH_LR_PRIORITY_SHIFT; 400 - 401 - vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[lr] = val; 182 + irq->on_lr = true; 402 183 } 403 184 404 185 void vgic_v3_clear_lr(struct kvm_vcpu *vcpu, int lr) ··· 467 258 GIC_BASER_CACHEABILITY(GICR_PENDBASER, OUTER, SameAsInner) | \ 468 259 GIC_BASER_SHAREABILITY(GICR_PENDBASER, InnerShareable)) 469 260 470 - void vgic_v3_enable(struct kvm_vcpu *vcpu) 261 + void vgic_v3_reset(struct kvm_vcpu *vcpu) 471 262 { 472 263 struct vgic_v3_cpu_if *vgic_v3 = &vcpu->arch.vgic_cpu.vgic_v3; 473 264 ··· 497 288 kvm_vgic_global_state.ich_vtr_el2); 498 289 
vcpu->arch.vgic_cpu.num_pri_bits = FIELD_GET(ICH_VTR_EL2_PRIbits, 499 290 kvm_vgic_global_state.ich_vtr_el2) + 1; 500 - 501 - /* Get the show on the road... */ 502 - vgic_v3->vgic_hcr = ICH_HCR_EL2_En; 503 291 } 504 292 505 293 void vcpu_set_ich_hcr(struct kvm_vcpu *vcpu) ··· 508 302 509 303 /* Hide GICv3 sysreg if necessary */ 510 304 if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V2 || 511 - !irqchip_in_kernel(vcpu->kvm)) { 305 + !irqchip_in_kernel(vcpu->kvm)) 512 306 vgic_v3->vgic_hcr |= (ICH_HCR_EL2_TALL0 | ICH_HCR_EL2_TALL1 | 513 307 ICH_HCR_EL2_TC); 514 - return; 515 - } 516 - 517 - if (group0_trap) 518 - vgic_v3->vgic_hcr |= ICH_HCR_EL2_TALL0; 519 - if (group1_trap) 520 - vgic_v3->vgic_hcr |= ICH_HCR_EL2_TALL1; 521 - if (common_trap) 522 - vgic_v3->vgic_hcr |= ICH_HCR_EL2_TC; 523 - if (dir_trap) 524 - vgic_v3->vgic_hcr |= ICH_HCR_EL2_TDIR; 525 308 } 526 309 527 310 int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq) ··· 831 636 832 637 static bool vgic_v3_broken_seis(void) 833 638 { 834 - return ((kvm_vgic_global_state.ich_vtr_el2 & ICH_VTR_EL2_SEIS) && 835 - is_midr_in_range_list(broken_seis)); 639 + return (is_kernel_in_hyp_mode() && 640 + is_midr_in_range_list(broken_seis) && 641 + (read_sysreg_s(SYS_ICH_VTR_EL2) & ICH_VTR_EL2_SEIS)); 642 + } 643 + 644 + void noinstr kvm_compute_ich_hcr_trap_bits(struct alt_instr *alt, 645 + __le32 *origptr, __le32 *updptr, 646 + int nr_inst) 647 + { 648 + u32 insn, oinsn, rd; 649 + u64 hcr = 0; 650 + 651 + if (cpus_have_cap(ARM64_WORKAROUND_CAVIUM_30115)) { 652 + group0_trap = true; 653 + group1_trap = true; 654 + } 655 + 656 + if (vgic_v3_broken_seis()) { 657 + /* We know that these machines have ICH_HCR_EL2.TDIR */ 658 + group0_trap = true; 659 + group1_trap = true; 660 + dir_trap = true; 661 + } 662 + 663 + if (!cpus_have_cap(ARM64_HAS_ICH_HCR_EL2_TDIR)) 664 + common_trap = true; 665 + 666 + if (group0_trap) 667 + hcr |= ICH_HCR_EL2_TALL0; 668 + if (group1_trap) 669 + hcr |= ICH_HCR_EL2_TALL1; 670 + if (common_trap) 671 + hcr |= ICH_HCR_EL2_TC; 672 + if (dir_trap) 673 + hcr |= ICH_HCR_EL2_TDIR; 674 + 675 + /* Compute target register */ 676 + oinsn = le32_to_cpu(*origptr); 677 + rd = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RD, oinsn); 678 + 679 + /* movz rd, #(val & 0xffff) */ 680 + insn = aarch64_insn_gen_movewide(rd, 681 + (u16)hcr, 682 + 0, 683 + AARCH64_INSN_VARIANT_64BIT, 684 + AARCH64_INSN_MOVEWIDE_ZERO); 685 + *updptr = cpu_to_le32(insn); 836 686 } 837 687 838 688 /** ··· 891 651 { 892 652 u64 ich_vtr_el2 = kvm_call_hyp_ret(__vgic_v3_get_gic_config); 893 653 bool has_v2; 654 + u64 traps; 894 655 int ret; 895 656 896 657 has_v2 = ich_vtr_el2 >> 63; ··· 950 709 if (has_v2) 951 710 static_branch_enable(&vgic_v3_has_v2_compat); 952 711 953 - if (cpus_have_final_cap(ARM64_WORKAROUND_CAVIUM_30115)) { 954 - group0_trap = true; 955 - group1_trap = true; 956 - } 957 - 958 712 if (vgic_v3_broken_seis()) { 959 713 kvm_info("GICv3 with broken locally generated SEI\n"); 960 - 961 714 kvm_vgic_global_state.ich_vtr_el2 &= ~ICH_VTR_EL2_SEIS; 962 - group0_trap = true; 963 - group1_trap = true; 964 - if (ich_vtr_el2 & ICH_VTR_EL2_TDS) 965 - dir_trap = true; 966 - else 967 - common_trap = true; 968 715 } 969 716 970 - if (group0_trap || group1_trap || common_trap | dir_trap) { 717 + traps = vgic_ich_hcr_trap_bits(); 718 + if (traps) { 971 719 kvm_info("GICv3 sysreg trapping enabled ([%s%s%s%s], reduced performance)\n", 972 - group0_trap ? "G0" : "", 973 - group1_trap ? "G1" : "", 974 - common_trap ? 
"C" : "", 975 - dir_trap ? "D" : ""); 720 + (traps & ICH_HCR_EL2_TALL0) ? "G0" : "", 721 + (traps & ICH_HCR_EL2_TALL1) ? "G1" : "", 722 + (traps & ICH_HCR_EL2_TC) ? "C" : "", 723 + (traps & ICH_HCR_EL2_TDIR) ? "D" : ""); 976 724 static_branch_enable(&vgic_v3_cpuif_trap); 977 725 } 978 726 ··· 1001 771 } 1002 772 1003 773 if (likely(!is_protected_kvm_enabled())) 1004 - kvm_call_hyp(__vgic_v3_save_vmcr_aprs, cpu_if); 774 + kvm_call_hyp(__vgic_v3_save_aprs, cpu_if); 1005 775 WARN_ON(vgic_v4_put(vcpu)); 1006 776 1007 777 if (has_vhe())
+4 -1
arch/arm64/kvm/vgic/vgic-v4.c
··· 163 163 struct vgic_irq *irq = vgic_get_vcpu_irq(vcpu, i); 164 164 struct irq_desc *desc; 165 165 unsigned long flags; 166 + bool pending; 166 167 int ret; 167 168 168 169 raw_spin_lock_irqsave(&irq->irq_lock, flags); ··· 174 173 irq->hw = false; 175 174 ret = irq_get_irqchip_state(irq->host_irq, 176 175 IRQCHIP_STATE_PENDING, 177 - &irq->pending_latch); 176 + &pending); 178 177 WARN_ON(ret); 178 + 179 + irq->pending_latch = pending; 179 180 180 181 desc = irq_to_desc(irq->host_irq); 181 182 irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
+189 -115
arch/arm64/kvm/vgic/vgic.c
··· 244 244 * 245 245 * Requires the IRQ lock to be held. 246 246 */ 247 - static struct kvm_vcpu *vgic_target_oracle(struct vgic_irq *irq) 247 + struct kvm_vcpu *vgic_target_oracle(struct vgic_irq *irq) 248 248 { 249 249 lockdep_assert_held(&irq->irq_lock); 250 250 ··· 272 272 return NULL; 273 273 } 274 274 275 + struct vgic_sort_info { 276 + struct kvm_vcpu *vcpu; 277 + struct vgic_vmcr vmcr; 278 + }; 279 + 275 280 /* 276 281 * The order of items in the ap_lists defines how we'll pack things in LRs as 277 282 * well, the first items in the list being the first things populated in the 278 283 * LRs. 279 284 * 280 - * A hard rule is that active interrupts can never be pushed out of the LRs 281 - * (and therefore take priority) since we cannot reliably trap on deactivation 282 - * of IRQs and therefore they have to be present in the LRs. 283 - * 285 + * Pending, non-active interrupts must be placed at the head of the list. 284 286 * Otherwise things should be sorted by the priority field and the GIC 285 287 * hardware support will take care of preemption of priority groups etc. 288 + * Interrupts that are not deliverable should be at the end of the list. 286 289 * 287 290 * Return negative if "a" sorts before "b", 0 to preserve order, and positive 288 291 * to sort "b" before "a". ··· 295 292 { 296 293 struct vgic_irq *irqa = container_of(a, struct vgic_irq, ap_list); 297 294 struct vgic_irq *irqb = container_of(b, struct vgic_irq, ap_list); 295 + struct vgic_sort_info *info = priv; 296 + struct kvm_vcpu *vcpu = info->vcpu; 298 297 bool penda, pendb; 299 298 int ret; 300 299 ··· 310 305 raw_spin_lock(&irqa->irq_lock); 311 306 raw_spin_lock_nested(&irqb->irq_lock, SINGLE_DEPTH_NESTING); 312 307 313 - if (irqa->active || irqb->active) { 314 - ret = (int)irqb->active - (int)irqa->active; 308 + /* Undeliverable interrupts should be last */ 309 + ret = (int)(vgic_target_oracle(irqb) == vcpu) - (int)(vgic_target_oracle(irqa) == vcpu); 310 + if (ret) 315 311 goto out; 316 - } 317 312 318 - penda = irqa->enabled && irq_is_pending(irqa); 319 - pendb = irqb->enabled && irq_is_pending(irqb); 320 - 321 - if (!penda || !pendb) { 322 - ret = (int)pendb - (int)penda; 313 + /* Same thing for interrupts targeting a disabled group */ 314 + ret = (int)(irqb->group ? info->vmcr.grpen1 : info->vmcr.grpen0); 315 + ret -= (int)(irqa->group ? 
info->vmcr.grpen1 : info->vmcr.grpen0); 316 + if (ret) 323 317 goto out; 324 - } 325 318 326 - /* Both pending and enabled, sort by priority */ 327 - ret = irqa->priority - irqb->priority; 319 + penda = irqa->enabled && irq_is_pending(irqa) && !irqa->active; 320 + pendb = irqb->enabled && irq_is_pending(irqb) && !irqb->active; 321 + 322 + ret = (int)pendb - (int)penda; 323 + if (ret) 324 + goto out; 325 + 326 + /* Both pending and enabled, sort by priority (lower number first) */ 327 + ret = (int)irqa->priority - (int)irqb->priority; 328 + if (ret) 329 + goto out; 330 + 331 + /* Finally, HW bit active interrupts have priority over non-HW ones */ 332 + ret = (int)irqb->hw - (int)irqa->hw; 333 + 328 334 out: 329 335 raw_spin_unlock(&irqb->irq_lock); 330 336 raw_spin_unlock(&irqa->irq_lock); ··· 346 330 static void vgic_sort_ap_list(struct kvm_vcpu *vcpu) 347 331 { 348 332 struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 333 + struct vgic_sort_info info = { .vcpu = vcpu, }; 349 334 350 335 lockdep_assert_held(&vgic_cpu->ap_list_lock); 351 336 352 - list_sort(NULL, &vgic_cpu->ap_list_head, vgic_irq_cmp); 337 + vgic_get_vmcr(vcpu, &info.vmcr); 338 + list_sort(&info, &vgic_cpu->ap_list_head, vgic_irq_cmp); 353 339 } 354 340 355 341 /* ··· 374 356 return false; 375 357 } 376 358 359 + static bool vgic_model_needs_bcst_kick(struct kvm *kvm) 360 + { 361 + /* 362 + * A GICv3 (or GICv3-like) system exposing a GICv3 to the guest 363 + * needs a broadcast kick to set TDIR globally. 364 + * 365 + * For systems that do not have TDIR (ARM's own v8.0 CPUs), the 366 + * shadow TDIR bit is always set, and so is the register's TC bit, 367 + * so no need to kick the CPUs. 368 + */ 369 + return (cpus_have_final_cap(ARM64_HAS_ICH_HCR_EL2_TDIR) && 370 + kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3); 371 + } 372 + 377 373 /* 378 374 * Check whether an IRQ needs to (and can) be queued to a VCPU's ap list. 379 375 * Do the queuing if necessary, taking the right locks in the right order. ··· 400 368 unsigned long flags) __releases(&irq->irq_lock) 401 369 { 402 370 struct kvm_vcpu *vcpu; 371 + bool bcast; 403 372 404 373 lockdep_assert_held(&irq->irq_lock); 405 374 ··· 475 442 list_add_tail(&irq->ap_list, &vcpu->arch.vgic_cpu.ap_list_head); 476 443 irq->vcpu = vcpu; 477 444 445 + /* A new SPI may result in deactivation trapping on all vcpus */ 446 + bcast = (vgic_model_needs_bcst_kick(vcpu->kvm) && 447 + vgic_valid_spi(vcpu->kvm, irq->intid) && 448 + atomic_fetch_inc(&vcpu->kvm->arch.vgic.active_spis) == 0); 449 + 478 450 raw_spin_unlock(&irq->irq_lock); 479 451 raw_spin_unlock_irqrestore(&vcpu->arch.vgic_cpu.ap_list_lock, flags); 480 452 481 - kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu); 482 - kvm_vcpu_kick(vcpu); 453 + if (!bcast) { 454 + kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu); 455 + kvm_vcpu_kick(vcpu); 456 + } else { 457 + kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_IRQ_PENDING); 458 + } 483 459 484 460 return true; 485 461 } ··· 840 798 vgic_v3_clear_lr(vcpu, lr); 841 799 } 842 800 843 - static inline void vgic_set_underflow(struct kvm_vcpu *vcpu) 844 - { 845 - if (kvm_vgic_global_state.type == VGIC_V2) 846 - vgic_v2_set_underflow(vcpu); 847 - else 848 - vgic_v3_set_underflow(vcpu); 849 - } 850 - 851 - /* Requires the ap_list_lock to be held. 
*/ 852 - static int compute_ap_list_depth(struct kvm_vcpu *vcpu, 853 - bool *multi_sgi) 801 + static void summarize_ap_list(struct kvm_vcpu *vcpu, 802 + struct ap_list_summary *als) 854 803 { 855 804 struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 856 805 struct vgic_irq *irq; 857 - int count = 0; 858 - 859 - *multi_sgi = false; 860 806 861 807 lockdep_assert_held(&vgic_cpu->ap_list_lock); 862 808 809 + *als = (typeof(*als)){}; 810 + 863 811 list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) { 864 - int w; 812 + guard(raw_spinlock)(&irq->irq_lock); 865 813 866 - raw_spin_lock(&irq->irq_lock); 867 - /* GICv2 SGIs can count for more than one... */ 868 - w = vgic_irq_get_lr_count(irq); 869 - raw_spin_unlock(&irq->irq_lock); 814 + if (unlikely(vgic_target_oracle(irq) != vcpu)) 815 + continue; 870 816 871 - count += w; 872 - *multi_sgi |= (w > 1); 817 + if (!irq->active) 818 + als->nr_pend++; 819 + else 820 + als->nr_act++; 821 + 822 + if (irq->intid < VGIC_NR_SGIS) 823 + als->nr_sgi++; 873 824 } 874 - return count; 875 825 } 876 826 877 - /* Requires the VCPU's ap_list_lock to be held. */ 827 + /* 828 + * Dealing with LR overflow is close to black magic -- dress accordingly. 829 + * 830 + * We have to present an almost infinite number of interrupts through a very 831 + * limited number of registers. Therefore crucial decisions must be made to 832 + * ensure we feed the most relevant interrupts into the LRs, and yet have 833 + * some facilities to let the guest interact with those that are not there. 834 + * 835 + * All considerations below are in the context of interrupts targeting a 836 + * single vcpu with non-idle state (either pending, active, or both), 837 + * colloquially called the ap_list: 838 + * 839 + * - Pending interrupts must have priority over active interrupts. This also 840 + * excludes pending+active interrupts. This ensures that a guest can 841 + * perform priority drops on any number of interrupts, and yet be 842 + * presented the next pending one. 843 + * 844 + * - Deactivation of interrupts outside of the LRs must be tracked by using 845 + * either the EOIcount-driven maintenance interrupt, and sometimes by 846 + * trapping the DIR register. 847 + * 848 + * - For EOImode=0, a non-zero EOIcount means walking the ap_list past the 849 + * point that made it into the LRs, and deactivate interrupts that would 850 + * have made it onto the LRs if we had the space. 851 + * 852 + * - The MI-generation bits must be used to try and force an exit when the 853 + * guest has done enough changes to the LRs that we want to reevaluate the 854 + * situation: 855 + * 856 + * - if the total number of pending interrupts exceeds the number of 857 + * LR, NPIE must be set in order to exit once no pending interrupts 858 + * are present in the LRs, allowing us to populate the next batch. 859 + * 860 + * - if there are active interrupts outside of the LRs, then LRENPIE 861 + * must be set so that we exit on deactivation of one of these, and 862 + * work out which one is to be deactivated. Note that this is not 863 + * enough to deal with EOImode=1, see below. 864 + * 865 + * - if the overall number of interrupts exceeds the number of LRs, 866 + * then UIE must be set to allow refilling of the LRs once the 867 + * majority of them has been processed. 868 + * 869 + * - as usual, MI triggers are only an optimisation, since we cannot 870 + * rely on the MI being delivered in timely manner... 
871 + * 872 + * - EOImode=1 creates some additional problems: 873 + * 874 + * - deactivation can happen in any order, and we cannot rely on 875 + * EOImode=0's coupling of priority-drop and deactivation which 876 + * imposes strict reverse Ack order. This means that DIR must 877 + * trap if we have active interrupts outside of the LRs. 878 + * 879 + * - deactivation of SPIs can occur on any CPU, while the SPI is only 880 + * present in the ap_list of the CPU that actually ack-ed it. In that 881 + * case, EOIcount doesn't provide enough information, and we must 882 + * resort to trapping DIR even if we don't overflow the LRs. Bonus 883 + * point for not trapping DIR when no SPIs are pending or active in 884 + * the whole VM. 885 + * 886 + * - LPIs do not suffer the same problem as SPIs on deactivation, as we 887 + * have to essentially discard the active state, see below. 888 + * 889 + * - Virtual LPIs have an active state (surprise!), which gets removed on 890 + * priority drop (EOI). However, EOIcount doesn't get bumped when the LPI 891 + * is not present in the LR (surprise again!). Special care must therefore 892 + * be taken to remove the active state from any activated LPI when exiting 893 + * from the guest. This is in a way no different from what happens on the 894 + * physical side. We still rely on the running priority to have been 895 + * removed from the APRs, irrespective of the LPI being present in the LRs 896 + * or not. 897 + * 898 + * - Virtual SGIs directly injected via GICv4.1 must not affect EOIcount, as 899 + * they are not managed in SW and don't have a true active state. So only 900 + * set vSGIEOICount when no SGIs are in the ap_list. 901 + * 902 + * - GICv2 SGIs with multiple sources are injected one source at a time, as 903 + * if they were made pending sequentially. This may mean that we don't 904 + * always present the HPPI if other interrupts with lower priority are 905 + * pending in the LRs. Big deal. 906 + */ 878 907 static void vgic_flush_lr_state(struct kvm_vcpu *vcpu) 879 908 { 880 909 struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 910 + struct ap_list_summary als; 881 911 struct vgic_irq *irq; 882 - int count; 883 - bool multi_sgi; 884 - u8 prio = 0xff; 885 - int i = 0; 912 + int count = 0; 886 913 887 914 lockdep_assert_held(&vgic_cpu->ap_list_lock); 888 915 889 - count = compute_ap_list_depth(vcpu, &multi_sgi); 890 - if (count > kvm_vgic_global_state.nr_lr || multi_sgi) 916 + summarize_ap_list(vcpu, &als); 917 + 918 + if (irqs_outside_lrs(&als)) 891 919 vgic_sort_ap_list(vcpu); 892 920 893 - count = 0; 894 - 895 921 list_for_each_entry(irq, &vgic_cpu->ap_list_head, ap_list) { 896 - raw_spin_lock(&irq->irq_lock); 922 + scoped_guard(raw_spinlock, &irq->irq_lock) { 923 + if (likely(vgic_target_oracle(irq) == vcpu)) { 924 + vgic_populate_lr(vcpu, irq, count++); 925 + } 926 + } 897 927 898 - /* 899 - * If we have multi-SGIs in the pipeline, we need to 900 - * guarantee that they are all seen before any IRQ of 901 - * lower priority. In that case, we need to filter out 902 - * these interrupts by exiting early. This is easy as 903 - * the AP list has been sorted already. 
904 - */ 905 - if (multi_sgi && irq->priority > prio) { 906 - raw_spin_unlock(&irq->irq_lock); 928 + if (count == kvm_vgic_global_state.nr_lr) 907 929 break; 908 - } 909 - 910 - if (likely(vgic_target_oracle(irq) == vcpu)) { 911 - vgic_populate_lr(vcpu, irq, count++); 912 - 913 - if (irq->source) 914 - prio = irq->priority; 915 - } 916 - 917 - raw_spin_unlock(&irq->irq_lock); 918 - 919 - if (count == kvm_vgic_global_state.nr_lr) { 920 - if (!list_is_last(&irq->ap_list, 921 - &vgic_cpu->ap_list_head)) 922 - vgic_set_underflow(vcpu); 923 - break; 924 - } 925 930 } 926 931 927 932 /* Nuke remaining LRs */ 928 - for (i = count ; i < kvm_vgic_global_state.nr_lr; i++) 933 + for (int i = count ; i < kvm_vgic_global_state.nr_lr; i++) 929 934 vgic_clear_lr(vcpu, i); 930 935 931 - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 936 + if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) { 932 937 vcpu->arch.vgic_cpu.vgic_v2.used_lrs = count; 933 - else 938 + vgic_v2_configure_hcr(vcpu, &als); 939 + } else { 934 940 vcpu->arch.vgic_cpu.vgic_v3.used_lrs = count; 941 + vgic_v3_configure_hcr(vcpu, &als); 942 + } 935 943 } 936 944 937 945 static inline bool can_access_vgic_from_kernel(void) ··· 1005 913 /* Sync back the hardware VGIC state into our emulation after a guest's run. */ 1006 914 void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu) 1007 915 { 1008 - int used_lrs; 1009 - 1010 916 /* If nesting, emulate the HW effect from L0 to L1 */ 1011 917 if (vgic_state_is_nested(vcpu)) { 1012 918 vgic_v3_sync_nested(vcpu); ··· 1014 924 if (vcpu_has_nv(vcpu)) 1015 925 vgic_v3_nested_update_mi(vcpu); 1016 926 1017 - /* An empty ap_list_head implies used_lrs == 0 */ 1018 - if (list_empty(&vcpu->arch.vgic_cpu.ap_list_head)) 1019 - return; 1020 - 1021 927 if (can_access_vgic_from_kernel()) 1022 928 vgic_save_state(vcpu); 1023 929 1024 - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1025 - used_lrs = vcpu->arch.vgic_cpu.vgic_v2.used_lrs; 1026 - else 1027 - used_lrs = vcpu->arch.vgic_cpu.vgic_v3.used_lrs; 1028 - 1029 - if (used_lrs) 1030 - vgic_fold_lr_state(vcpu); 930 + vgic_fold_lr_state(vcpu); 1031 931 vgic_prune_ap_list(vcpu); 932 + } 933 + 934 + /* Sync interrupts that were deactivated through a DIR trap */ 935 + void kvm_vgic_process_async_update(struct kvm_vcpu *vcpu) 936 + { 937 + unsigned long flags; 938 + 939 + /* Make sure we're in the same context as LR handling */ 940 + local_irq_save(flags); 941 + vgic_prune_ap_list(vcpu); 942 + local_irq_restore(flags); 1032 943 } 1033 944 1034 945 static inline void vgic_restore_state(struct kvm_vcpu *vcpu) ··· 1056 965 * abort the entry procedure and inject the exception at the 1057 966 * beginning of the run loop. 1058 967 * 1059 - * - Otherwise, do exactly *NOTHING*. The guest state is 1060 - * already loaded, and we can carry on with running it. 968 + * - Otherwise, do exactly *NOTHING* apart from enabling the virtual 969 + * CPU interface. The guest state is already loaded, and we can 970 + * carry on with running it. 1061 971 * 1062 972 * If we have NV, but are not in a nested state, compute the 1063 973 * maintenance interrupt state, as it may fire. 
··· 1067 975 if (kvm_vgic_vcpu_pending_irq(vcpu)) 1068 976 kvm_make_request(KVM_REQ_GUEST_HYP_IRQ_PENDING, vcpu); 1069 977 978 + vgic_v3_flush_nested(vcpu); 1070 979 return; 1071 980 } 1072 981 1073 982 if (vcpu_has_nv(vcpu)) 1074 983 vgic_v3_nested_update_mi(vcpu); 1075 984 1076 - /* 1077 - * If there are no virtual interrupts active or pending for this 1078 - * VCPU, then there is no work to do and we can bail out without 1079 - * taking any lock. There is a potential race with someone injecting 1080 - * interrupts to the VCPU, but it is a benign race as the VCPU will 1081 - * either observe the new interrupt before or after doing this check, 1082 - * and introducing additional synchronization mechanism doesn't change 1083 - * this. 1084 - * 1085 - * Note that we still need to go through the whole thing if anything 1086 - * can be directly injected (GICv4). 1087 - */ 1088 - if (list_empty(&vcpu->arch.vgic_cpu.ap_list_head) && 1089 - !vgic_supports_direct_irqs(vcpu->kvm)) 1090 - return; 1091 - 1092 985 DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); 1093 986 1094 - if (!list_empty(&vcpu->arch.vgic_cpu.ap_list_head)) { 1095 - raw_spin_lock(&vcpu->arch.vgic_cpu.ap_list_lock); 987 + scoped_guard(raw_spinlock, &vcpu->arch.vgic_cpu.ap_list_lock) 1096 988 vgic_flush_lr_state(vcpu); 1097 - raw_spin_unlock(&vcpu->arch.vgic_cpu.ap_list_lock); 1098 - } 1099 989 1100 990 if (can_access_vgic_from_kernel()) 1101 991 vgic_restore_state(vcpu);
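The new vgic_irq_cmp() replaces the old active/pending/priority ordering with a five-key sort: deliverable before undeliverable, enabled group before disabled, pending-not-active first, then ascending priority, then HW-backed before software-only. A stand-alone model of the resulting order, using a toy struct and qsort() rather than the kernel's struct vgic_irq and list_sort() (all field names here are illustrative, not the kernel's):

/* Toy model of the new ap_list ordering; cc -o sort sort.c && ./sort */
#include <stdio.h>
#include <stdlib.h>

struct toy_irq {
	int deliverable;   /* vgic_target_oracle(irq) == vcpu */
	int grp_enabled;   /* grpen0/grpen1 depending on irq->group */
	int pending;       /* enabled && pending && !active */
	int priority;      /* lower value = higher priority */
	int hw;            /* backed by a HW interrupt */
};

static int toy_irq_cmp(const void *pa, const void *pb)
{
	const struct toy_irq *a = pa, *b = pb;
	int ret;

	ret = b->deliverable - a->deliverable;  /* undeliverable last */
	if (ret)
		return ret;
	ret = b->grp_enabled - a->grp_enabled;  /* disabled group next-to-last */
	if (ret)
		return ret;
	ret = b->pending - a->pending;          /* pending-not-active first */
	if (ret)
		return ret;
	ret = a->priority - b->priority;        /* lower number first */
	if (ret)
		return ret;
	return b->hw - a->hw;                   /* HW-backed before SW-only */
}

int main(void)
{
	struct toy_irq irqs[] = {
		{ 1, 1, 0, 0xa0, 0 },  /* active, software */
		{ 1, 1, 1, 0xa0, 0 },  /* pending, low priority */
		{ 0, 1, 1, 0x00, 1 },  /* not deliverable to this vcpu */
		{ 1, 1, 1, 0x10, 1 },  /* pending, high priority */
	};

	/* Sorts to: 0x10 pending, 0xa0 pending, active, undeliverable */
	qsort(irqs, 4, sizeof(irqs[0]), toy_irq_cmp);
	for (int i = 0; i < 4; i++)
		printf("prio=%#04x pending=%d deliverable=%d\n",
		       irqs[i].priority, irqs[i].pending, irqs[i].deliverable);
	return 0;
}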
+39 -4
arch/arm64/kvm/vgic/vgic.h
··· 164 164 return ret; 165 165 } 166 166 167 + void kvm_compute_ich_hcr_trap_bits(struct alt_instr *alt, 168 + __le32 *origptr, __le32 *updptr, int nr_inst); 169 + 170 + static inline u64 vgic_ich_hcr_trap_bits(void) 171 + { 172 + u64 hcr; 173 + 174 + /* All the traps are in the bottom 16bits */ 175 + asm volatile(ALTERNATIVE_CB("movz %0, #0\n", 176 + ARM64_ALWAYS_SYSTEM, 177 + kvm_compute_ich_hcr_trap_bits) 178 + : "=r" (hcr)); 179 + 180 + return hcr; 181 + } 182 + 167 183 /* 168 184 * This struct provides an intermediate representation of the fields contained 169 185 * in the GICH_VMCR and ICH_VMCR registers, such that code exporting the GIC ··· 236 220 u32 event_id; 237 221 }; 238 222 223 + struct ap_list_summary { 224 + unsigned int nr_pend; /* purely pending, not active */ 225 + unsigned int nr_act; /* active, or active+pending */ 226 + unsigned int nr_sgi; /* any SGI */ 227 + }; 228 + 229 + #define irqs_outside_lrs(s) \ 230 + (((s)->nr_pend + (s)->nr_act) > kvm_vgic_global_state.nr_lr) 231 + 232 + #define irqs_pending_outside_lrs(s) \ 233 + ((s)->nr_pend > kvm_vgic_global_state.nr_lr) 234 + 235 + #define irqs_active_outside_lrs(s) \ 236 + ((s)->nr_act && irqs_outside_lrs(s)) 237 + 239 238 int vgic_v3_parse_attr(struct kvm_device *dev, struct kvm_device_attr *attr, 240 239 struct vgic_reg_attr *reg_attr); 241 240 int vgic_v2_parse_attr(struct kvm_device *dev, struct kvm_device_attr *attr, ··· 261 230 struct vgic_irq *vgic_get_irq(struct kvm *kvm, u32 intid); 262 231 struct vgic_irq *vgic_get_vcpu_irq(struct kvm_vcpu *vcpu, u32 intid); 263 232 void vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq); 233 + struct kvm_vcpu *vgic_target_oracle(struct vgic_irq *irq); 264 234 bool vgic_get_phys_line_level(struct vgic_irq *irq); 265 235 void vgic_irq_set_phys_pending(struct vgic_irq *irq, bool pending); 266 236 void vgic_irq_set_phys_active(struct vgic_irq *irq, bool active); ··· 277 245 278 246 void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu); 279 247 void vgic_v2_populate_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq, int lr); 248 + void vgic_v2_deactivate(struct kvm_vcpu *vcpu, u32 val); 280 249 void vgic_v2_clear_lr(struct kvm_vcpu *vcpu, int lr); 281 - void vgic_v2_set_underflow(struct kvm_vcpu *vcpu); 250 + void vgic_v2_configure_hcr(struct kvm_vcpu *vcpu, struct ap_list_summary *als); 282 251 int vgic_v2_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr *attr); 283 252 int vgic_v2_dist_uaccess(struct kvm_vcpu *vcpu, bool is_write, 284 253 int offset, u32 *val); ··· 287 254 int offset, u32 *val); 288 255 void vgic_v2_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); 289 256 void vgic_v2_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); 290 - void vgic_v2_enable(struct kvm_vcpu *vcpu); 257 + void vgic_v2_reset(struct kvm_vcpu *vcpu); 291 258 int vgic_v2_probe(const struct gic_kvm_info *info); 292 259 int vgic_v2_map_resources(struct kvm *kvm); 293 260 int vgic_register_dist_iodev(struct kvm *kvm, gpa_t dist_base_address, ··· 319 286 void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu); 320 287 void vgic_v3_populate_lr(struct kvm_vcpu *vcpu, struct vgic_irq *irq, int lr); 321 288 void vgic_v3_clear_lr(struct kvm_vcpu *vcpu, int lr); 322 - void vgic_v3_set_underflow(struct kvm_vcpu *vcpu); 289 + void vgic_v3_deactivate(struct kvm_vcpu *vcpu, u64 val); 290 + void vgic_v3_configure_hcr(struct kvm_vcpu *vcpu, struct ap_list_summary *als); 323 291 void vgic_v3_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); 324 292 void vgic_v3_get_vmcr(struct kvm_vcpu 
*vcpu, struct vgic_vmcr *vmcr); 325 - void vgic_v3_enable(struct kvm_vcpu *vcpu); 293 + void vgic_v3_reset(struct kvm_vcpu *vcpu); 326 294 int vgic_v3_probe(const struct gic_kvm_info *info); 327 295 int vgic_v3_map_resources(struct kvm *kvm); 328 296 int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq); ··· 446 412 return kvm_has_feat(kvm, ID_AA64PFR0_EL1, GIC, IMP); 447 413 } 448 414 415 + void vgic_v3_flush_nested(struct kvm_vcpu *vcpu); 449 416 void vgic_v3_sync_nested(struct kvm_vcpu *vcpu); 450 417 void vgic_v3_load_nested(struct kvm_vcpu *vcpu); 451 418 void vgic_v3_put_nested(struct kvm_vcpu *vcpu);
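The three irqs_*_outside_lrs() helpers above drive both the ap_list sort and the new HCR trap configuration. A quick stand-alone check of their semantics, with NR_LR standing in for kvm_vgic_global_state.nr_lr (4 is a common GIC implementation choice, not something the diff guarantees):

#include <stdio.h>

#define NR_LR 4  /* stands in for kvm_vgic_global_state.nr_lr */

struct ap_list_summary {
	unsigned int nr_pend;  /* purely pending, not active */
	unsigned int nr_act;   /* active, or active+pending */
	unsigned int nr_sgi;   /* any SGI */
};

#define irqs_outside_lrs(s) \
	(((s)->nr_pend + (s)->nr_act) > NR_LR)
#define irqs_pending_outside_lrs(s) \
	((s)->nr_pend > NR_LR)
#define irqs_active_outside_lrs(s) \
	((s)->nr_act && irqs_outside_lrs(s))

int main(void)
{
	/* 3 pending + 2 active for 4 LRs: overflow, but all pending IRQs fit */
	struct ap_list_summary als = { .nr_pend = 3, .nr_act = 2, .nr_sgi = 0 };

	printf("outside=%d pending_outside=%d active_outside=%d\n",
	       irqs_outside_lrs(&als),
	       irqs_pending_outside_lrs(&als),
	       irqs_active_outside_lrs(&als));
	return 0;
}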
+2
arch/arm64/tools/cpucaps
··· 40 40 HAS_GICV5_LEGACY 41 41 HAS_GIC_PRIO_MASKING 42 42 HAS_GIC_PRIO_RELAXED_SYNC 43 + HAS_ICH_HCR_EL2_TDIR 43 44 HAS_HCR_NV1 44 45 HAS_HCX 45 46 HAS_LDAPR ··· 65 64 HAS_VA52 66 65 HAS_VIRT_HOST_EXTN 67 66 HAS_WFXT 67 + HAS_XNX 68 68 HAFT 69 69 HW_DBM 70 70 KVM_HVHE
+8 -47
arch/loongarch/include/asm/kvm_eiointc.h
··· 10 10 11 11 #define EIOINTC_IRQS 256 12 12 #define EIOINTC_ROUTE_MAX_VCPUS 256 13 - #define EIOINTC_IRQS_U8_NUMS (EIOINTC_IRQS / 8) 14 - #define EIOINTC_IRQS_U16_NUMS (EIOINTC_IRQS_U8_NUMS / 2) 15 - #define EIOINTC_IRQS_U32_NUMS (EIOINTC_IRQS_U8_NUMS / 4) 16 - #define EIOINTC_IRQS_U64_NUMS (EIOINTC_IRQS_U8_NUMS / 8) 13 + #define EIOINTC_IRQS_U64_NUMS (EIOINTC_IRQS / 64) 17 14 /* map to ipnum per 32 irqs */ 18 15 #define EIOINTC_IRQS_NODETYPE_COUNT 16 19 16 ··· 61 64 uint32_t status; 62 65 63 66 /* hardware state */ 64 - union nodetype { 65 - u64 reg_u64[EIOINTC_IRQS_NODETYPE_COUNT / 4]; 66 - u32 reg_u32[EIOINTC_IRQS_NODETYPE_COUNT / 2]; 67 - u16 reg_u16[EIOINTC_IRQS_NODETYPE_COUNT]; 68 - u8 reg_u8[EIOINTC_IRQS_NODETYPE_COUNT * 2]; 69 - } nodetype; 67 + u64 nodetype[EIOINTC_IRQS_NODETYPE_COUNT / 4]; 70 68 71 69 /* one bit shows the state of one irq */ 72 - union bounce { 73 - u64 reg_u64[EIOINTC_IRQS_U64_NUMS]; 74 - u32 reg_u32[EIOINTC_IRQS_U32_NUMS]; 75 - u16 reg_u16[EIOINTC_IRQS_U16_NUMS]; 76 - u8 reg_u8[EIOINTC_IRQS_U8_NUMS]; 77 - } bounce; 78 - 79 - union isr { 80 - u64 reg_u64[EIOINTC_IRQS_U64_NUMS]; 81 - u32 reg_u32[EIOINTC_IRQS_U32_NUMS]; 82 - u16 reg_u16[EIOINTC_IRQS_U16_NUMS]; 83 - u8 reg_u8[EIOINTC_IRQS_U8_NUMS]; 84 - } isr; 85 - union coreisr { 86 - u64 reg_u64[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U64_NUMS]; 87 - u32 reg_u32[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U32_NUMS]; 88 - u16 reg_u16[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U16_NUMS]; 89 - u8 reg_u8[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U8_NUMS]; 90 - } coreisr; 91 - union enable { 92 - u64 reg_u64[EIOINTC_IRQS_U64_NUMS]; 93 - u32 reg_u32[EIOINTC_IRQS_U32_NUMS]; 94 - u16 reg_u16[EIOINTC_IRQS_U16_NUMS]; 95 - u8 reg_u8[EIOINTC_IRQS_U8_NUMS]; 96 - } enable; 70 + u64 bounce[EIOINTC_IRQS_U64_NUMS]; 71 + u64 isr[EIOINTC_IRQS_U64_NUMS]; 72 + u64 coreisr[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U64_NUMS]; 73 + u64 enable[EIOINTC_IRQS_U64_NUMS]; 97 74 98 75 /* use one byte to config ipmap for 32 irqs at once */ 99 - union ipmap { 100 - u64 reg_u64; 101 - u32 reg_u32[EIOINTC_IRQS_U32_NUMS / 4]; 102 - u16 reg_u16[EIOINTC_IRQS_U16_NUMS / 4]; 103 - u8 reg_u8[EIOINTC_IRQS_U8_NUMS / 4]; 104 - } ipmap; 76 + u64 ipmap; 105 77 /* use one byte to config coremap for one irq */ 106 - union coremap { 107 - u64 reg_u64[EIOINTC_IRQS / 8]; 108 - u32 reg_u32[EIOINTC_IRQS / 4]; 109 - u16 reg_u16[EIOINTC_IRQS / 2]; 110 - u8 reg_u8[EIOINTC_IRQS]; 111 - } coremap; 78 + u64 coremap[EIOINTC_IRQS / 8]; 112 79 113 80 DECLARE_BITMAP(sw_coreisr[EIOINTC_ROUTE_MAX_VCPUS][LOONGSON_IP_NUM], EIOINTC_IRQS); 114 81 uint8_t sw_coremap[EIOINTC_IRQS];
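The reg_u8/reg_u16/reg_u32 views could be dropped because, on a little-endian machine (LoongArch is little-endian), reading byte i of a u64 array and a shift-and-mask read are the same thing. A minimal check of the equivalence the eiointc.c changes below rely on:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t ipmap = 0x0807060504030201ULL;
	int irq = 100;  /* any of the 256 eiointc interrupts */

	uint8_t via_bytes = ((uint8_t *)&ipmap)[irq / 32];     /* old reg_u8 view */
	uint8_t via_shift = (ipmap >> (irq / 32 * 8)) & 0xff;  /* new code */

	printf("byte view: %#x, shift view: %#x\n", via_bytes, via_shift);
	return 0;
}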
+8
arch/loongarch/include/asm/kvm_host.h
··· 126 126 struct kvm_phyid_map *phyid_map; 127 127 /* Enabled PV features */ 128 128 unsigned long pv_features; 129 + /* Supported KVM features */ 130 + unsigned long kvm_features; 129 131 130 132 s64 time_offset; 131 133 struct kvm_context __percpu *vmcs; ··· 293 291 static inline int kvm_get_pmu_num(struct kvm_vcpu_arch *arch) 294 292 { 295 293 return (arch->cpucfg[6] & CPUCFG6_PMNUM) >> CPUCFG6_PMNUM_SHIFT; 294 + } 295 + 296 + /* Check whether KVM support this feature (VMM may disable it) */ 297 + static inline bool kvm_vm_support(struct kvm_arch *arch, int feature) 298 + { 299 + return !!(arch->kvm_features & BIT_ULL(feature)); 296 300 } 297 301 298 302 bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu);
+1
arch/loongarch/include/asm/kvm_vcpu.h
··· 15 15 #define CPU_PMU (_ULCAST_(1) << 10) 16 16 #define CPU_TIMER (_ULCAST_(1) << 11) 17 17 #define CPU_IPI (_ULCAST_(1) << 12) 18 + #define CPU_AVEC (_ULCAST_(1) << 14) 18 19 19 20 /* Controlled by 0x52 guest exception VIP aligned to estat bit 5~12 */ 20 21 #define CPU_IP0 (_ULCAST_(1))
+2
arch/loongarch/include/asm/loongarch.h
··· 511 511 #define CSR_GCFG_GPERF_SHIFT 24 512 512 #define CSR_GCFG_GPERF_WIDTH 3 513 513 #define CSR_GCFG_GPERF (_ULCAST_(0x7) << CSR_GCFG_GPERF_SHIFT) 514 + #define CSR_GCFG_GPMP_SHIFT 23 515 + #define CSR_GCFG_GPMP (_ULCAST_(0x1) << CSR_GCFG_GPMP_SHIFT) 514 516 #define CSR_GCFG_GCI_SHIFT 20 515 517 #define CSR_GCFG_GCI_WIDTH 2 516 518 #define CSR_GCFG_GCI (_ULCAST_(0x3) << CSR_GCFG_GCI_SHIFT)
+1
arch/loongarch/include/uapi/asm/kvm.h
··· 104 104 #define KVM_LOONGARCH_VM_FEAT_PV_IPI 6 105 105 #define KVM_LOONGARCH_VM_FEAT_PV_STEALTIME 7 106 106 #define KVM_LOONGARCH_VM_FEAT_PTW 8 107 + #define KVM_LOONGARCH_VM_FEAT_MSGINT 9 107 108 108 109 /* Device Control API on vcpu fd */ 109 110 #define KVM_LOONGARCH_VCPU_CPUCFG 0
-1
arch/loongarch/kvm/Kconfig
··· 25 25 select HAVE_KVM_IRQCHIP 26 26 select HAVE_KVM_MSI 27 27 select HAVE_KVM_READONLY_MEM 28 - select HAVE_KVM_VCPU_ASYNC_IOCTL 29 28 select KVM_COMMON 30 29 select KVM_GENERIC_DIRTYLOG_READ_PROTECT 31 30 select KVM_GENERIC_HARDWARE_ENABLING
+40 -40
arch/loongarch/kvm/intc/eiointc.c
··· 13 13 struct kvm_vcpu *vcpu; 14 14 15 15 for (irq = 0; irq < EIOINTC_IRQS; irq++) { 16 - ipnum = s->ipmap.reg_u8[irq / 32]; 16 + ipnum = (s->ipmap >> (irq / 32 * 8)) & 0xff; 17 17 if (!(s->status & BIT(EIOINTC_ENABLE_INT_ENCODE))) { 18 18 ipnum = count_trailing_zeros(ipnum); 19 19 ipnum = (ipnum >= 0 && ipnum < 4) ? ipnum : 0; 20 20 } 21 21 22 - cpuid = s->coremap.reg_u8[irq]; 22 + cpuid = ((u8 *)s->coremap)[irq]; 23 23 vcpu = kvm_get_vcpu_by_cpuid(s->kvm, cpuid); 24 24 if (!vcpu) 25 25 continue; 26 26 27 27 cpu = vcpu->vcpu_id; 28 - if (test_bit(irq, (unsigned long *)s->coreisr.reg_u32[cpu])) 28 + if (test_bit(irq, (unsigned long *)s->coreisr[cpu])) 29 29 __set_bit(irq, s->sw_coreisr[cpu][ipnum]); 30 30 else 31 31 __clear_bit(irq, s->sw_coreisr[cpu][ipnum]); ··· 38 38 struct kvm_vcpu *vcpu; 39 39 struct kvm_interrupt vcpu_irq; 40 40 41 - ipnum = s->ipmap.reg_u8[irq / 32]; 41 + ipnum = (s->ipmap >> (irq / 32 * 8)) & 0xff; 42 42 if (!(s->status & BIT(EIOINTC_ENABLE_INT_ENCODE))) { 43 43 ipnum = count_trailing_zeros(ipnum); 44 44 ipnum = (ipnum >= 0 && ipnum < 4) ? ipnum : 0; ··· 53 53 54 54 if (level) { 55 55 /* if not enable return false */ 56 - if (!test_bit(irq, (unsigned long *)s->enable.reg_u32)) 56 + if (!test_bit(irq, (unsigned long *)s->enable)) 57 57 return; 58 - __set_bit(irq, (unsigned long *)s->coreisr.reg_u32[cpu]); 58 + __set_bit(irq, (unsigned long *)s->coreisr[cpu]); 59 59 found = find_first_bit(s->sw_coreisr[cpu][ipnum], EIOINTC_IRQS); 60 60 __set_bit(irq, s->sw_coreisr[cpu][ipnum]); 61 61 } else { 62 - __clear_bit(irq, (unsigned long *)s->coreisr.reg_u32[cpu]); 62 + __clear_bit(irq, (unsigned long *)s->coreisr[cpu]); 63 63 __clear_bit(irq, s->sw_coreisr[cpu][ipnum]); 64 64 found = find_first_bit(s->sw_coreisr[cpu][ipnum], EIOINTC_IRQS); 65 65 } ··· 94 94 if (s->sw_coremap[irq + i] == cpu) 95 95 continue; 96 96 97 - if (notify && test_bit(irq + i, (unsigned long *)s->isr.reg_u8)) { 97 + if (notify && test_bit(irq + i, (unsigned long *)s->isr)) { 98 98 /* lower irq at old cpu and raise irq at new cpu */ 99 99 eiointc_update_irq(s, irq + i, 0); 100 100 s->sw_coremap[irq + i] = cpu; ··· 108 108 void eiointc_set_irq(struct loongarch_eiointc *s, int irq, int level) 109 109 { 110 110 unsigned long flags; 111 - unsigned long *isr = (unsigned long *)s->isr.reg_u8; 111 + unsigned long *isr = (unsigned long *)s->isr; 112 112 113 113 spin_lock_irqsave(&s->lock, flags); 114 114 level ? __set_bit(irq, isr) : __clear_bit(irq, isr); ··· 127 127 switch (offset) { 128 128 case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 129 129 index = (offset - EIOINTC_NODETYPE_START) >> 3; 130 - data = s->nodetype.reg_u64[index]; 130 + data = s->nodetype[index]; 131 131 break; 132 132 case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 133 133 index = (offset - EIOINTC_IPMAP_START) >> 3; 134 - data = s->ipmap.reg_u64; 134 + data = s->ipmap; 135 135 break; 136 136 case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 137 137 index = (offset - EIOINTC_ENABLE_START) >> 3; 138 - data = s->enable.reg_u64[index]; 138 + data = s->enable[index]; 139 139 break; 140 140 case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 141 141 index = (offset - EIOINTC_BOUNCE_START) >> 3; 142 - data = s->bounce.reg_u64[index]; 142 + data = s->bounce[index]; 143 143 break; 144 144 case EIOINTC_COREISR_START ... 
EIOINTC_COREISR_END: 145 145 index = (offset - EIOINTC_COREISR_START) >> 3; 146 - data = s->coreisr.reg_u64[vcpu->vcpu_id][index]; 146 + data = s->coreisr[vcpu->vcpu_id][index]; 147 147 break; 148 148 case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 149 149 index = (offset - EIOINTC_COREMAP_START) >> 3; 150 - data = s->coremap.reg_u64[index]; 150 + data = s->coremap[index]; 151 151 break; 152 152 default: 153 153 ret = -EINVAL; ··· 223 223 switch (offset) { 224 224 case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 225 225 index = (offset - EIOINTC_NODETYPE_START) >> 3; 226 - old = s->nodetype.reg_u64[index]; 227 - s->nodetype.reg_u64[index] = (old & ~mask) | data; 226 + old = s->nodetype[index]; 227 + s->nodetype[index] = (old & ~mask) | data; 228 228 break; 229 229 case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 230 230 /* 231 231 * ipmap cannot be set at runtime, can be set only at the beginning 232 232 * of irqchip driver, need not update upper irq level 233 233 */ 234 - old = s->ipmap.reg_u64; 235 - s->ipmap.reg_u64 = (old & ~mask) | data; 234 + old = s->ipmap; 235 + s->ipmap = (old & ~mask) | data; 236 236 break; 237 237 case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 238 238 index = (offset - EIOINTC_ENABLE_START) >> 3; 239 - old = s->enable.reg_u64[index]; 240 - s->enable.reg_u64[index] = (old & ~mask) | data; 239 + old = s->enable[index]; 240 + s->enable[index] = (old & ~mask) | data; 241 241 /* 242 242 * 1: enable irq. 243 243 * update irq when isr is set. 244 244 */ 245 - data = s->enable.reg_u64[index] & ~old & s->isr.reg_u64[index]; 245 + data = s->enable[index] & ~old & s->isr[index]; 246 246 while (data) { 247 247 irq = __ffs(data); 248 248 eiointc_update_irq(s, irq + index * 64, 1); ··· 252 252 * 0: disable irq. 253 253 * update irq when isr is set. 254 254 */ 255 - data = ~s->enable.reg_u64[index] & old & s->isr.reg_u64[index]; 255 + data = ~s->enable[index] & old & s->isr[index]; 256 256 while (data) { 257 257 irq = __ffs(data); 258 258 eiointc_update_irq(s, irq + index * 64, 0); ··· 262 262 case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 263 263 /* do not emulate hw bounced irq routing */ 264 264 index = (offset - EIOINTC_BOUNCE_START) >> 3; 265 - old = s->bounce.reg_u64[index]; 266 - s->bounce.reg_u64[index] = (old & ~mask) | data; 265 + old = s->bounce[index]; 266 + s->bounce[index] = (old & ~mask) | data; 267 267 break; 268 268 case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 269 269 index = (offset - EIOINTC_COREISR_START) >> 3; 270 270 /* use attrs to get current cpu index */ 271 271 cpu = vcpu->vcpu_id; 272 - old = s->coreisr.reg_u64[cpu][index]; 272 + old = s->coreisr[cpu][index]; 273 273 /* write 1 to clear interrupt */ 274 - s->coreisr.reg_u64[cpu][index] = old & ~data; 274 + s->coreisr[cpu][index] = old & ~data; 275 275 data &= old; 276 276 while (data) { 277 277 irq = __ffs(data); ··· 281 281 break; 282 282 case EIOINTC_COREMAP_START ... 
EIOINTC_COREMAP_END: 283 283 index = (offset - EIOINTC_COREMAP_START) >> 3; 284 - old = s->coremap.reg_u64[index]; 285 - s->coremap.reg_u64[index] = (old & ~mask) | data; 286 - data = s->coremap.reg_u64[index]; 284 + old = s->coremap[index]; 285 + s->coremap[index] = (old & ~mask) | data; 286 + data = s->coremap[index]; 287 287 eiointc_update_sw_coremap(s, index * 8, data, sizeof(data), true); 288 288 break; 289 289 default: ··· 451 451 break; 452 452 case KVM_DEV_LOONGARCH_EXTIOI_CTRL_LOAD_FINISHED: 453 453 eiointc_set_sw_coreisr(s); 454 - for (i = 0; i < (EIOINTC_IRQS / 4); i++) { 455 - start_irq = i * 4; 454 + for (i = 0; i < (EIOINTC_IRQS / 8); i++) { 455 + start_irq = i * 8; 456 456 eiointc_update_sw_coremap(s, start_irq, 457 - s->coremap.reg_u32[i], sizeof(u32), false); 457 + s->coremap[i], sizeof(u64), false); 458 458 } 459 459 break; 460 460 default: ··· 481 481 switch (addr) { 482 482 case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 483 483 offset = (addr - EIOINTC_NODETYPE_START) / 4; 484 - p = &s->nodetype.reg_u32[offset]; 484 + p = s->nodetype + offset * 4; 485 485 break; 486 486 case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 487 487 offset = (addr - EIOINTC_IPMAP_START) / 4; 488 - p = &s->ipmap.reg_u32[offset]; 488 + p = &s->ipmap + offset * 4; 489 489 break; 490 490 case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 491 491 offset = (addr - EIOINTC_ENABLE_START) / 4; 492 - p = &s->enable.reg_u32[offset]; 492 + p = s->enable + offset * 4; 493 493 break; 494 494 case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 495 495 offset = (addr - EIOINTC_BOUNCE_START) / 4; 496 - p = &s->bounce.reg_u32[offset]; 496 + p = s->bounce + offset * 4; 497 497 break; 498 498 case EIOINTC_ISR_START ... EIOINTC_ISR_END: 499 499 offset = (addr - EIOINTC_ISR_START) / 4; 500 - p = &s->isr.reg_u32[offset]; 500 + p = s->isr + offset * 4; 501 501 break; 502 502 case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 503 503 if (cpu >= s->num_cpu) 504 504 return -EINVAL; 505 505 506 506 offset = (addr - EIOINTC_COREISR_START) / 4; 507 - p = &s->coreisr.reg_u32[cpu][offset]; 507 + p = s->coreisr[cpu] + offset * 4; 508 508 break; 509 509 case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 510 510 offset = (addr - EIOINTC_COREMAP_START) / 4; 511 - p = &s->coremap.reg_u32[offset]; 511 + p = s->coremap + offset * 4; 512 512 break; 513 513 default: 514 514 kvm_err("%s: unknown eiointc register, addr = %d\n", __func__, addr);
+13 -2
arch/loongarch/kvm/interrupt.c
··· 21 21 [INT_HWI5] = CPU_IP5, 22 22 [INT_HWI6] = CPU_IP6, 23 23 [INT_HWI7] = CPU_IP7, 24 + [INT_AVEC] = CPU_AVEC, 24 25 }; 25 26 26 27 static int kvm_irq_deliver(struct kvm_vcpu *vcpu, unsigned int priority) ··· 31 30 clear_bit(priority, &vcpu->arch.irq_pending); 32 31 if (priority < EXCCODE_INT_NUM) 33 32 irq = priority_to_irq[priority]; 33 + 34 + if (cpu_has_msgint && (priority == INT_AVEC)) { 35 + set_gcsr_estat(irq); 36 + return 1; 37 + } 34 38 35 39 switch (priority) { 36 40 case INT_TI: ··· 64 58 if (priority < EXCCODE_INT_NUM) 65 59 irq = priority_to_irq[priority]; 66 60 61 + if (cpu_has_msgint && (priority == INT_AVEC)) { 62 + clear_gcsr_estat(irq); 63 + return 1; 64 + } 65 + 67 66 switch (priority) { 68 67 case INT_TI: 69 68 case INT_IPI: ··· 94 83 unsigned long *pending = &vcpu->arch.irq_pending; 95 84 unsigned long *pending_clr = &vcpu->arch.irq_clear; 96 85 97 - for_each_set_bit(priority, pending_clr, INT_IPI + 1) 86 + for_each_set_bit(priority, pending_clr, EXCCODE_INT_NUM) 98 87 kvm_irq_clear(vcpu, priority); 99 88 100 - for_each_set_bit(priority, pending, INT_IPI + 1) 89 + for_each_set_bit(priority, pending, EXCCODE_INT_NUM) 101 90 kvm_irq_deliver(vcpu, priority); 102 91 } 103 92
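The widened loop bound is what actually makes AVEC deliverable: INT_AVEC sits above INT_IPI in the pending bitmap, so the old INT_IPI + 1 bound never reached it. A stand-alone illustration — the INT_AVEC and EXCCODE_INT_NUM values below are assumptions for the demo, not taken from the LoongArch headers:

#include <stdio.h>

#define INT_IPI		12  /* consistent with CPU_IPI = 1 << 12 above */
#define INT_AVEC	14  /* assumed for the demo (CPU_AVEC = 1 << 14) */
#define EXCCODE_INT_NUM	15  /* assumed for the demo */

int main(void)
{
	unsigned long pending = 1UL << INT_AVEC;

	for (int bit = 0; bit < INT_IPI + 1; bit++)      /* old bound */
		if (pending & (1UL << bit))
			printf("old bound delivers bit %d\n", bit);

	for (int bit = 0; bit < EXCCODE_INT_NUM; bit++)  /* new bound */
		if (pending & (1UL << bit))
			printf("new bound delivers bit %d\n", bit);

	return 0;  /* only the second loop ever sees INT_AVEC */
}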
+19 -4
arch/loongarch/kvm/vcpu.c
··· 659 659 *v = GENMASK(31, 0); 660 660 return 0; 661 661 case LOONGARCH_CPUCFG1: 662 - /* CPUCFG1_MSGINT is not supported by KVM */ 663 - *v = GENMASK(25, 0); 662 + *v = GENMASK(26, 0); 664 663 return 0; 665 664 case LOONGARCH_CPUCFG2: 666 665 /* CPUCFG2 features unconditionally supported by KVM */ ··· 727 728 return -EINVAL; 728 729 729 730 switch (id) { 731 + case LOONGARCH_CPUCFG1: 732 + if ((val & CPUCFG1_MSGINT) && !cpu_has_msgint) 733 + return -EINVAL; 734 + return 0; 730 735 case LOONGARCH_CPUCFG2: 731 736 if (!(val & CPUCFG2_LLFTP)) 732 737 /* Guests must have a constant timer */ ··· 1476 1473 return 0; 1477 1474 } 1478 1475 1479 - long kvm_arch_vcpu_async_ioctl(struct file *filp, 1480 - unsigned int ioctl, unsigned long arg) 1476 + long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl, 1477 + unsigned long arg) 1481 1478 { 1482 1479 void __user *argp = (void __user *)arg; 1483 1480 struct kvm_vcpu *vcpu = filp->private_data; ··· 1660 1657 kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_DMWIN2); 1661 1658 kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_DMWIN3); 1662 1659 kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_LLBCTL); 1660 + if (cpu_has_msgint) { 1661 + kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_ISR0); 1662 + kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_ISR1); 1663 + kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_ISR2); 1664 + kvm_restore_hw_gcsr(csr, LOONGARCH_CSR_ISR3); 1665 + } 1663 1666 1664 1667 /* Restore Root.GINTC from unused Guest.GINTC register */ 1665 1668 write_csr_gintc(csr->csrs[LOONGARCH_CSR_GINTC]); ··· 1755 1746 kvm_save_hw_gcsr(csr, LOONGARCH_CSR_DMWIN1); 1756 1747 kvm_save_hw_gcsr(csr, LOONGARCH_CSR_DMWIN2); 1757 1748 kvm_save_hw_gcsr(csr, LOONGARCH_CSR_DMWIN3); 1749 + if (cpu_has_msgint) { 1750 + kvm_save_hw_gcsr(csr, LOONGARCH_CSR_ISR0); 1751 + kvm_save_hw_gcsr(csr, LOONGARCH_CSR_ISR1); 1752 + kvm_save_hw_gcsr(csr, LOONGARCH_CSR_ISR2); 1753 + kvm_save_hw_gcsr(csr, LOONGARCH_CSR_ISR3); 1754 + } 1758 1755 1759 1756 vcpu->arch.aux_inuse |= KVM_LARCH_SWCSR_LATEST; 1760 1757
+29 -15
arch/loongarch/kvm/vm.c
··· 6 6 #include <linux/kvm_host.h> 7 7 #include <asm/kvm_mmu.h> 8 8 #include <asm/kvm_vcpu.h> 9 + #include <asm/kvm_csr.h> 9 10 #include <asm/kvm_eiointc.h> 10 11 #include <asm/kvm_pch_pic.h> 11 12 ··· 24 23 .data_offset = sizeof(struct kvm_stats_header) + KVM_STATS_NAME_SIZE + 25 24 sizeof(kvm_vm_stats_desc), 26 25 }; 26 + 27 + static void kvm_vm_init_features(struct kvm *kvm) 28 + { 29 + unsigned long val; 30 + 31 + val = read_csr_gcfg(); 32 + if (val & CSR_GCFG_GPMP) 33 + kvm->arch.kvm_features |= BIT(KVM_LOONGARCH_VM_FEAT_PMU); 34 + 35 + /* Enable all PV features by default */ 36 + kvm->arch.pv_features = BIT(KVM_FEATURE_IPI); 37 + kvm->arch.kvm_features = BIT(KVM_LOONGARCH_VM_FEAT_PV_IPI); 38 + if (kvm_pvtime_supported()) { 39 + kvm->arch.pv_features |= BIT(KVM_FEATURE_STEAL_TIME); 40 + kvm->arch.kvm_features |= BIT(KVM_LOONGARCH_VM_FEAT_PV_STEALTIME); 41 + } 42 + } 27 43 28 44 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) 29 45 { ··· 60 42 spin_lock_init(&kvm->arch.phyid_map_lock); 61 43 62 44 kvm_init_vmcs(kvm); 63 - 64 - /* Enable all PV features by default */ 65 - kvm->arch.pv_features = BIT(KVM_FEATURE_IPI); 66 - if (kvm_pvtime_supported()) 67 - kvm->arch.pv_features |= BIT(KVM_FEATURE_STEAL_TIME); 45 + kvm_vm_init_features(kvm); 68 46 69 47 /* 70 48 * cpu_vabits means user address space only (a half of total). ··· 150 136 if (cpu_has_lbt_mips) 151 137 return 0; 152 138 return -ENXIO; 153 - case KVM_LOONGARCH_VM_FEAT_PMU: 154 - if (cpu_has_pmp) 155 - return 0; 156 - return -ENXIO; 157 - case KVM_LOONGARCH_VM_FEAT_PV_IPI: 158 - return 0; 159 - case KVM_LOONGARCH_VM_FEAT_PV_STEALTIME: 160 - if (kvm_pvtime_supported()) 161 - return 0; 162 - return -ENXIO; 163 139 case KVM_LOONGARCH_VM_FEAT_PTW: 164 140 if (cpu_has_ptw) 141 + return 0; 142 + return -ENXIO; 143 + case KVM_LOONGARCH_VM_FEAT_MSGINT: 144 + if (cpu_has_msgint) 145 + return 0; 146 + return -ENXIO; 147 + case KVM_LOONGARCH_VM_FEAT_PMU: 148 + case KVM_LOONGARCH_VM_FEAT_PV_IPI: 149 + case KVM_LOONGARCH_VM_FEAT_PV_STEALTIME: 150 + if (kvm_vm_support(&kvm->arch, attr->attr)) 165 151 return 0; 166 152 return -ENXIO; 167 153 default:
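The rework moves per-feature probing to VM creation time: supported features are latched once into kvm->arch.kvm_features, and later KVM_HAS_DEVICE_ATTR-style queries become a plain bit test via kvm_vm_support(). A stand-alone model of that pattern (the PV_IPI/PV_STEALTIME numbers match the uapi excerpt above; the PMU value is illustrative):

#include <stdio.h>
#include <stdint.h>

#define FEAT_PMU		5  /* illustrative */
#define FEAT_PV_IPI		6
#define FEAT_PV_STEALTIME	7

static uint64_t kvm_features;

static void vm_init_features(int hw_has_guest_pmu, int pvtime_supported)
{
	kvm_features |= UINT64_C(1) << FEAT_PV_IPI;      /* always available */
	if (hw_has_guest_pmu)                            /* CSR_GCFG_GPMP set */
		kvm_features |= UINT64_C(1) << FEAT_PMU;
	if (pvtime_supported)
		kvm_features |= UINT64_C(1) << FEAT_PV_STEALTIME;
}

static int vm_supports(int feature)
{
	return !!(kvm_features & (UINT64_C(1) << feature));
}

int main(void)
{
	vm_init_features(0, 1);  /* pretend the HW lacks the guest PMU */
	printf("PMU=%d PV_IPI=%d STEALTIME=%d\n",
	       vm_supports(FEAT_PMU), vm_supports(FEAT_PV_IPI),
	       vm_supports(FEAT_PV_STEALTIME));
	return 0;
}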
-1
arch/mips/kvm/Kconfig
··· 22 22 select EXPORT_UASM 23 23 select KVM_COMMON 24 24 select KVM_GENERIC_DIRTYLOG_READ_PROTECT 25 - select HAVE_KVM_VCPU_ASYNC_IOCTL 26 25 select KVM_MMIO 27 26 select KVM_GENERIC_MMU_NOTIFIER 28 27 select KVM_GENERIC_HARDWARE_ENABLING
+2 -2
arch/mips/kvm/mips.c
··· 895 895 return r; 896 896 } 897 897 898 - long kvm_arch_vcpu_async_ioctl(struct file *filp, unsigned int ioctl, 899 - unsigned long arg) 898 + long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl, 899 + unsigned long arg) 900 900 { 901 901 struct kvm_vcpu *vcpu = filp->private_data; 902 902 void __user *argp = (void __user *)arg;
-1
arch/powerpc/kvm/Kconfig
··· 20 20 config KVM 21 21 bool 22 22 select KVM_COMMON 23 - select HAVE_KVM_VCPU_ASYNC_IOCTL 24 23 select KVM_VFIO 25 24 select HAVE_KVM_IRQ_BYPASS 26 25
+2 -2
arch/powerpc/kvm/powerpc.c
··· 2028 2028 return -EINVAL; 2029 2029 } 2030 2030 2031 - long kvm_arch_vcpu_async_ioctl(struct file *filp, 2032 - unsigned int ioctl, unsigned long arg) 2031 + long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl, 2032 + unsigned long arg) 2033 2033 { 2034 2034 struct kvm_vcpu *vcpu = filp->private_data; 2035 2035 void __user *argp = (void __user *)arg;
+6
arch/riscv/include/asm/kvm_host.h
··· 59 59 BIT(IRQ_VS_TIMER) | \ 60 60 BIT(IRQ_VS_EXT)) 61 61 62 + #define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \ 63 + KVM_DIRTY_LOG_INITIALLY_SET) 64 + 62 65 struct kvm_vm_stat { 63 66 struct kvm_vm_stat_generic generic; 64 67 }; ··· 329 326 bool kvm_riscv_vcpu_stopped(struct kvm_vcpu *vcpu); 330 327 331 328 void kvm_riscv_vcpu_record_steal_time(struct kvm_vcpu *vcpu); 329 + 330 + /* Flags representing implementation specific details */ 331 + DECLARE_STATIC_KEY_FALSE(kvm_riscv_vsstage_tlb_no_gpa); 332 332 333 333 #endif /* __RISCV_KVM_HOST_H__ */
+1
arch/riscv/include/asm/kvm_tlb.h
··· 49 49 unsigned long gva, unsigned long gvsz, 50 50 unsigned long order); 51 51 void kvm_riscv_local_hfence_vvma_all(unsigned long vmid); 52 + void kvm_riscv_local_tlb_sanitize(struct kvm_vcpu *vcpu); 52 53 53 54 void kvm_riscv_tlb_flush_process(struct kvm_vcpu *vcpu); 54 55
+4 -1
arch/riscv/include/asm/kvm_vcpu_sbi.h
··· 69 69 unsigned long reg_size, const void *reg_val); 70 70 }; 71 71 72 - void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu, struct kvm_run *run); 72 + int kvm_riscv_vcpu_sbi_forward_handler(struct kvm_vcpu *vcpu, 73 + struct kvm_run *run, 74 + struct kvm_vcpu_sbi_return *retdata); 73 75 void kvm_riscv_vcpu_sbi_system_reset(struct kvm_vcpu *vcpu, 74 76 struct kvm_run *run, 75 77 u32 type, u64 flags); ··· 107 105 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_susp; 108 106 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_sta; 109 107 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_fwft; 108 + extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_mpxy; 110 109 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental; 111 110 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor; 112 111
-1
arch/riscv/include/asm/kvm_vmid.h
··· 22 22 int kvm_riscv_gstage_vmid_init(struct kvm *kvm); 23 23 bool kvm_riscv_gstage_vmid_ver_changed(struct kvm_vmid *vmid); 24 24 void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu); 25 - void kvm_riscv_gstage_vmid_sanitize(struct kvm_vcpu *vcpu); 26 25 27 26 #endif
+3
arch/riscv/include/uapi/asm/kvm.h
··· 23 23 #define KVM_INTERRUPT_SET -1U 24 24 #define KVM_INTERRUPT_UNSET -2U 25 25 26 + #define KVM_EXIT_FAIL_ENTRY_NO_VSFILE (1ULL << 0) 27 + 26 28 /* for KVM_GET_REGS and KVM_SET_REGS */ 27 29 struct kvm_regs { 28 30 }; ··· 213 211 KVM_RISCV_SBI_EXT_STA, 214 212 KVM_RISCV_SBI_EXT_SUSP, 215 213 KVM_RISCV_SBI_EXT_FWFT, 214 + KVM_RISCV_SBI_EXT_MPXY, 216 215 KVM_RISCV_SBI_EXT_MAX, 217 216 }; 218 217
-1
arch/riscv/kvm/Kconfig
··· 23 23 select HAVE_KVM_IRQCHIP 24 24 select HAVE_KVM_IRQ_ROUTING 25 25 select HAVE_KVM_MSI 26 - select HAVE_KVM_VCPU_ASYNC_IOCTL 27 26 select HAVE_KVM_READONLY_MEM 28 27 select HAVE_KVM_DIRTY_RING_ACQ_REL 29 28 select KVM_COMMON
+1
arch/riscv/kvm/Makefile
··· 27 27 kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_pmu.o 28 28 kvm-y += vcpu_sbi.o 29 29 kvm-y += vcpu_sbi_base.o 30 + kvm-y += vcpu_sbi_forward.o 30 31 kvm-y += vcpu_sbi_fwft.o 31 32 kvm-y += vcpu_sbi_hsm.o 32 33 kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_sbi_pmu.o
+1 -1
arch/riscv/kvm/aia_imsic.c
··· 814 814 /* For HW acceleration mode, we can't continue */ 815 815 if (kvm->arch.aia.mode == KVM_DEV_RISCV_AIA_MODE_HWACCEL) { 816 816 run->fail_entry.hardware_entry_failure_reason = 817 - CSR_HSTATUS; 817 + KVM_EXIT_FAIL_ENTRY_NO_VSFILE; 818 818 run->fail_entry.cpu = vcpu->cpu; 819 819 run->exit_reason = KVM_EXIT_FAIL_ENTRY; 820 820 return 0;
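With this, a VMM can tell "ran out of IMSIC VS-files" apart from other entry failures instead of matching on a CSR number. A hedged sketch of checking the new subcode (KVM_EXIT_FAIL_ENTRY_NO_VSFILE comes from the RISC-V uapi change above; error reporting is up to the VMM):

#include <linux/kvm.h>
#include <stdio.h>

static void report_entry_failure(struct kvm_run *run, unsigned int vcpu_id)
{
	if (run->exit_reason != KVM_EXIT_FAIL_ENTRY)
		return;

	if (run->fail_entry.hardware_entry_failure_reason &
	    KVM_EXIT_FAIL_ENTRY_NO_VSFILE)
		fprintf(stderr, "vcpu %u: no IMSIC VS-file left (cpu %u)\n",
			vcpu_id, run->fail_entry.cpu);
}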
+14
arch/riscv/kvm/main.c
··· 15 15 #include <asm/kvm_nacl.h> 16 16 #include <asm/sbi.h> 17 17 18 + DEFINE_STATIC_KEY_FALSE(kvm_riscv_vsstage_tlb_no_gpa); 19 + 20 + static void kvm_riscv_setup_vendor_features(void) 21 + { 22 + /* Andes AX66: split two-stage TLBs */ 23 + if (riscv_cached_mvendorid(0) == ANDES_VENDOR_ID && 24 + (riscv_cached_marchid(0) & 0xFFFF) == 0x8A66) { 25 + static_branch_enable(&kvm_riscv_vsstage_tlb_no_gpa); 26 + kvm_info("VS-stage TLB does not cache guest physical address and VMID\n"); 27 + } 28 + } 29 + 18 30 long kvm_arch_dev_ioctl(struct file *filp, 19 31 unsigned int ioctl, unsigned long arg) 20 32 { ··· 171 159 if (kvm_riscv_aia_available()) 172 160 kvm_info("AIA available with %d guest external interrupts\n", 173 161 kvm_riscv_aia_nr_hgei); 162 + 163 + kvm_riscv_setup_vendor_features(); 174 164 175 165 kvm_register_perf_callbacks(NULL); 176 166
+4 -1
arch/riscv/kvm/mmu.c
··· 161 161 * allocated dirty_bitmap[], dirty pages will be tracked while 162 162 * the memory slot is write protected. 163 163 */ 164 - if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) 164 + if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) { 165 + if (kvm_dirty_log_manual_protect_and_init_set(kvm)) 166 + return; 165 167 mmu_wp_memory_region(kvm, new->id); 168 + } 166 169 } 167 170 168 171 int kvm_arch_prepare_memory_region(struct kvm *kvm,
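On the VMM side, KVM_DIRTY_LOG_INITIALLY_SET means a new slot's dirty bitmap starts all-ones and write protection is applied lazily as userspace clears bits, instead of write-protecting the whole slot up front. A hedged sketch of opting in (error handling elided; compiles against the uapi headers):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int enable_lazy_dirty_log(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
		/* bit 0: manual clear, bit 1: bitmap initially all set */
		.args[0] = KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
			   KVM_DIRTY_LOG_INITIALLY_SET,
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}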
+30
arch/riscv/kvm/tlb.c
··· 158 158 csr_write(CSR_HGATP, hgatp); 159 159 } 160 160 161 + void kvm_riscv_local_tlb_sanitize(struct kvm_vcpu *vcpu) 162 + { 163 + unsigned long vmid; 164 + 165 + if (!kvm_riscv_gstage_vmid_bits() || 166 + vcpu->arch.last_exit_cpu == vcpu->cpu) 167 + return; 168 + 169 + /* 170 + * On RISC-V platforms with hardware VMID support, we share the same 171 + * VMID for all VCPUs of a particular Guest/VM. This means we might 172 + * have stale G-stage TLB entries on the current Host CPU due to 173 + * some other VCPU of the same Guest which ran previously on the 174 + * current Host CPU. 175 + * 176 + * To clean up stale TLB entries, we simply flush all G-stage TLB 177 + * entries by VMID whenever the underlying Host CPU changes for a VCPU. 178 + */ 179 + 180 + vmid = READ_ONCE(vcpu->kvm->arch.vmid.vmid); 181 + kvm_riscv_local_hfence_gvma_vmid_all(vmid); 182 + 183 + /* 184 + * Flush VS-stage TLB entries for implementations where the VS-stage 185 + * TLB does not cache the guest physical address and VMID. 186 + */ 187 + if (static_branch_unlikely(&kvm_riscv_vsstage_tlb_no_gpa)) 188 + kvm_riscv_local_hfence_vvma_all(vmid); 189 + } 190 + 161 191 void kvm_riscv_fence_i_process(struct kvm_vcpu *vcpu) 162 192 { 163 193 kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_FENCE_I_RCVD);

+3 -3
arch/riscv/kvm/vcpu.c
··· 238 238 return VM_FAULT_SIGBUS; 239 239 } 240 240 241 - long kvm_arch_vcpu_async_ioctl(struct file *filp, 242 - unsigned int ioctl, unsigned long arg) 241 + long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl, 242 + unsigned long arg) 243 243 { 244 244 struct kvm_vcpu *vcpu = filp->private_data; 245 245 void __user *argp = (void __user *)arg; ··· 968 968 * Note: This should be done after G-stage VMID has been 969 969 * updated using kvm_riscv_gstage_vmid_ver_changed() 970 970 */ 971 - kvm_riscv_gstage_vmid_sanitize(vcpu); 971 + kvm_riscv_local_tlb_sanitize(vcpu); 972 972 973 973 trace_kvm_entry(vcpu); 974 974
+22
arch/riscv/kvm/vcpu_insn.c
··· 298 298 return (rc <= 0) ? rc : 1; 299 299 } 300 300 301 + static bool is_load_guest_page_fault(unsigned long scause) 302 + { 303 + /** 304 + * If a g-stage page fault occurs, the direct approach 305 + * is to let the g-stage page fault handler handle it 306 + * naturally, however, calling the g-stage page fault 307 + * handler here seems rather strange. 308 + * Considering this is a corner case, we can directly 309 + * return to the guest and re-execute the same PC, this 310 + * will trigger a g-stage page fault again and then the 311 + * regular g-stage page fault handler will populate 312 + * g-stage page table. 313 + */ 314 + return (scause == EXC_LOAD_GUEST_PAGE_FAULT); 315 + } 316 + 301 317 /** 302 318 * kvm_riscv_vcpu_virtual_insn -- Handle virtual instruction trap 303 319 * ··· 339 323 ct->sepc, 340 324 &utrap); 341 325 if (utrap.scause) { 326 + if (is_load_guest_page_fault(utrap.scause)) 327 + return 1; 342 328 utrap.sepc = ct->sepc; 343 329 kvm_riscv_vcpu_trap_redirect(vcpu, &utrap); 344 330 return 1; ··· 396 378 insn = kvm_riscv_vcpu_unpriv_read(vcpu, true, ct->sepc, 397 379 &utrap); 398 380 if (utrap.scause) { 381 + if (is_load_guest_page_fault(utrap.scause)) 382 + return 1; 399 383 /* Redirect trap if we failed to read instruction */ 400 384 utrap.sepc = ct->sepc; 401 385 kvm_riscv_vcpu_trap_redirect(vcpu, &utrap); ··· 524 504 insn = kvm_riscv_vcpu_unpriv_read(vcpu, true, ct->sepc, 525 505 &utrap); 526 506 if (utrap.scause) { 507 + if (is_load_guest_page_fault(utrap.scause)) 508 + return 1; 527 509 /* Redirect trap if we failed to read instruction */ 528 510 utrap.sepc = ct->sepc; 529 511 kvm_riscv_vcpu_trap_redirect(vcpu, &utrap);
+9 -1
arch/riscv/kvm/vcpu_sbi.c
··· 83 83 .ext_ptr = &vcpu_sbi_ext_fwft, 84 84 }, 85 85 { 86 + .ext_idx = KVM_RISCV_SBI_EXT_MPXY, 87 + .ext_ptr = &vcpu_sbi_ext_mpxy, 88 + }, 89 + { 86 90 .ext_idx = KVM_RISCV_SBI_EXT_EXPERIMENTAL, 87 91 .ext_ptr = &vcpu_sbi_ext_experimental, 88 92 }, ··· 124 120 return sext && scontext->ext_status[sext->ext_idx] != KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE; 125 121 } 126 122 127 - void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu, struct kvm_run *run) 123 + int kvm_riscv_vcpu_sbi_forward_handler(struct kvm_vcpu *vcpu, 124 + struct kvm_run *run, 125 + struct kvm_vcpu_sbi_return *retdata) 128 126 { 129 127 struct kvm_cpu_context *cp = &vcpu->arch.guest_context; 130 128 ··· 143 137 run->riscv_sbi.args[5] = cp->a5; 144 138 run->riscv_sbi.ret[0] = SBI_ERR_NOT_SUPPORTED; 145 139 run->riscv_sbi.ret[1] = 0; 140 + retdata->uexit = true; 141 + return 0; 146 142 } 147 143 148 144 void kvm_riscv_vcpu_sbi_system_reset(struct kvm_vcpu *vcpu,
+1 -27
arch/riscv/kvm/vcpu_sbi_base.c
··· 41 41 * For experimental/vendor extensions 42 42 * forward it to the userspace 43 43 */ 44 - kvm_riscv_vcpu_sbi_forward(vcpu, run); 45 - retdata->uexit = true; 44 + return kvm_riscv_vcpu_sbi_forward_handler(vcpu, run, retdata); 46 45 } else { 47 46 sbi_ext = kvm_vcpu_sbi_find_ext(vcpu, cp->a0); 48 47 *out_val = sbi_ext && sbi_ext->probe ? ··· 69 70 .extid_start = SBI_EXT_BASE, 70 71 .extid_end = SBI_EXT_BASE, 71 72 .handler = kvm_sbi_ext_base_handler, 72 - }; 73 - 74 - static int kvm_sbi_ext_forward_handler(struct kvm_vcpu *vcpu, 75 - struct kvm_run *run, 76 - struct kvm_vcpu_sbi_return *retdata) 77 - { 78 - /* 79 - * Both SBI experimental and vendor extensions are 80 - * unconditionally forwarded to userspace. 81 - */ 82 - kvm_riscv_vcpu_sbi_forward(vcpu, run); 83 - retdata->uexit = true; 84 - return 0; 85 - } 86 - 87 - const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental = { 88 - .extid_start = SBI_EXT_EXPERIMENTAL_START, 89 - .extid_end = SBI_EXT_EXPERIMENTAL_END, 90 - .handler = kvm_sbi_ext_forward_handler, 91 - }; 92 - 93 - const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor = { 94 - .extid_start = SBI_EXT_VENDOR_START, 95 - .extid_end = SBI_EXT_VENDOR_END, 96 - .handler = kvm_sbi_ext_forward_handler, 97 73 };
+34
arch/riscv/kvm/vcpu_sbi_forward.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (c) 2025 Ventana Micro Systems Inc. 4 + */ 5 + 6 + #include <linux/kvm_host.h> 7 + #include <asm/kvm_vcpu_sbi.h> 8 + #include <asm/sbi.h> 9 + 10 + const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental = { 11 + .extid_start = SBI_EXT_EXPERIMENTAL_START, 12 + .extid_end = SBI_EXT_EXPERIMENTAL_END, 13 + .handler = kvm_riscv_vcpu_sbi_forward_handler, 14 + }; 15 + 16 + const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor = { 17 + .extid_start = SBI_EXT_VENDOR_START, 18 + .extid_end = SBI_EXT_VENDOR_END, 19 + .handler = kvm_riscv_vcpu_sbi_forward_handler, 20 + }; 21 + 22 + const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_dbcn = { 23 + .extid_start = SBI_EXT_DBCN, 24 + .extid_end = SBI_EXT_DBCN, 25 + .default_disabled = true, 26 + .handler = kvm_riscv_vcpu_sbi_forward_handler, 27 + }; 28 + 29 + const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_mpxy = { 30 + .extid_start = SBI_EXT_MPXY, 31 + .extid_end = SBI_EXT_MPXY, 32 + .default_disabled = true, 33 + .handler = kvm_riscv_vcpu_sbi_forward_handler, 34 + };
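All four extensions now funnel through one handler that exits to userspace with KVM_EXIT_RISCV_SBI. A hedged sketch of the VMM side of that contract, using the kvm_run::riscv_sbi layout from the uapi (the VMM is free to implement the call instead of returning "not supported"):

#include <linux/kvm.h>
#include <stdio.h>

static void handle_sbi_exit(struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_RISCV_SBI)
		return;

	fprintf(stderr, "SBI ext %#lx func %#lx a0=%#lx\n",
		run->riscv_sbi.extension_id, run->riscv_sbi.function_id,
		run->riscv_sbi.args[0]);

	/* Default reply if the VMM does not implement the call */
	run->riscv_sbi.ret[0] = -2;  /* SBI_ERR_NOT_SUPPORTED */
	run->riscv_sbi.ret[1] = 0;
}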
-32
arch/riscv/kvm/vcpu_sbi_replace.c
··· 185 185 .extid_end = SBI_EXT_SRST, 186 186 .handler = kvm_sbi_ext_srst_handler, 187 187 }; 188 - 189 - static int kvm_sbi_ext_dbcn_handler(struct kvm_vcpu *vcpu, 190 - struct kvm_run *run, 191 - struct kvm_vcpu_sbi_return *retdata) 192 - { 193 - struct kvm_cpu_context *cp = &vcpu->arch.guest_context; 194 - unsigned long funcid = cp->a6; 195 - 196 - switch (funcid) { 197 - case SBI_EXT_DBCN_CONSOLE_WRITE: 198 - case SBI_EXT_DBCN_CONSOLE_READ: 199 - case SBI_EXT_DBCN_CONSOLE_WRITE_BYTE: 200 - /* 201 - * The SBI debug console functions are unconditionally 202 - * forwarded to the userspace. 203 - */ 204 - kvm_riscv_vcpu_sbi_forward(vcpu, run); 205 - retdata->uexit = true; 206 - break; 207 - default: 208 - retdata->err_val = SBI_ERR_NOT_SUPPORTED; 209 - } 210 - 211 - return 0; 212 - } 213 - 214 - const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_dbcn = { 215 - .extid_start = SBI_EXT_DBCN, 216 - .extid_end = SBI_EXT_DBCN, 217 - .default_disabled = true, 218 - .handler = kvm_sbi_ext_dbcn_handler, 219 - };
+1 -3
arch/riscv/kvm/vcpu_sbi_system.c
··· 47 47 kvm_riscv_vcpu_sbi_request_reset(vcpu, cp->a1, cp->a2); 48 48 49 49 /* userspace provides the suspend implementation */ 50 - kvm_riscv_vcpu_sbi_forward(vcpu, run); 51 - retdata->uexit = true; 52 - break; 50 + return kvm_riscv_vcpu_sbi_forward_handler(vcpu, run, retdata); 53 51 default: 54 52 retdata->err_val = SBI_ERR_NOT_SUPPORTED; 55 53 break;
+1 -2
arch/riscv/kvm/vcpu_sbi_v01.c
··· 32 32 * The CONSOLE_GETCHAR/CONSOLE_PUTCHAR SBI calls cannot be 33 33 * handled in kernel so we forward these to user-space 34 34 */ 35 - kvm_riscv_vcpu_sbi_forward(vcpu, run); 36 - retdata->uexit = true; 35 + ret = kvm_riscv_vcpu_sbi_forward_handler(vcpu, run, retdata); 37 36 break; 38 37 case SBI_EXT_0_1_SET_TIMER: 39 38 #if __riscv_xlen == 32
-23
arch/riscv/kvm/vmid.c
··· 122 122 kvm_for_each_vcpu(i, v, vcpu->kvm) 123 123 kvm_make_request(KVM_REQ_UPDATE_HGATP, v); 124 124 } 125 - 126 - void kvm_riscv_gstage_vmid_sanitize(struct kvm_vcpu *vcpu) 127 - { 128 - unsigned long vmid; 129 - 130 - if (!kvm_riscv_gstage_vmid_bits() || 131 - vcpu->arch.last_exit_cpu == vcpu->cpu) 132 - return; 133 - 134 - /* 135 - * On RISC-V platforms with hardware VMID support, we share same 136 - * VMID for all VCPUs of a particular Guest/VM. This means we might 137 - * have stale G-stage TLB entries on the current Host CPU due to 138 - * some other VCPU of the same Guest which ran previously on the 139 - * current Host CPU. 140 - * 141 - * To cleanup stale TLB entries, we simply flush all G-stage TLB 142 - * entries by VMID whenever underlying Host CPU changes for a VCPU. 143 - */ 144 - 145 - vmid = READ_ONCE(vcpu->kvm->arch.vmid.vmid); 146 - kvm_riscv_local_hfence_gvma_vmid_all(vmid); 147 - }
+4 -4
arch/s390/include/asm/kvm_host.h
··· 146 146 u64 instruction_diagnose_500; 147 147 u64 instruction_diagnose_other; 148 148 u64 pfault_sync; 149 + u64 signal_exits; 149 150 }; 150 151 151 152 #define PGM_OPERATION 0x01 ··· 632 631 struct mmu_notifier mmu_notifier; 633 632 }; 634 633 635 - struct kvm_arch{ 636 - void *sca; 637 - int use_esca; 638 - rwlock_t sca_lock; 634 + struct kvm_arch { 635 + struct esca_block *sca; 639 636 debug_info_t *dbf; 640 637 struct kvm_s390_float_interrupt float_int; 641 638 struct kvm_device *flic; ··· 649 650 int user_sigp; 650 651 int user_stsi; 651 652 int user_instr0; 653 + int user_operexec; 652 654 struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS]; 653 655 wait_queue_head_t ipte_wq; 654 656 int ipte_lock_count;
+1
arch/s390/include/asm/stacktrace.h
··· 66 66 unsigned long sie_flags; 67 67 unsigned long sie_control_block_phys; 68 68 unsigned long sie_guest_asce; 69 + unsigned long sie_irq; 69 70 }; 70 71 }; 71 72 unsigned long gprs[10];
+1
arch/s390/kernel/asm-offsets.c
··· 67 67 OFFSET(__SF_SIE_FLAGS, stack_frame, sie_flags); 68 68 OFFSET(__SF_SIE_CONTROL_PHYS, stack_frame, sie_control_block_phys); 69 69 OFFSET(__SF_SIE_GUEST_ASCE, stack_frame, sie_guest_asce); 70 + OFFSET(__SF_SIE_IRQ, stack_frame, sie_irq); 70 71 DEFINE(STACK_FRAME_OVERHEAD, sizeof(struct stack_frame)); 71 72 BLANK(); 72 73 OFFSET(__SFUSER_BACKCHAIN, stack_frame_user, back_chain);
+2
arch/s390/kernel/entry.S
··· 193 193 mvc __SF_SIE_FLAGS(8,%r15),__TI_flags(%r14) # copy thread flags 194 194 lmg %r0,%r13,0(%r4) # load guest gprs 0-13 195 195 mvi __TI_sie(%r14),1 196 + stosm __SF_SIE_IRQ(%r15),0x03 # enable interrupts 196 197 lctlg %c1,%c1,__SF_SIE_GUEST_ASCE(%r15) # load primary asce 197 198 lg %r14,__SF_SIE_CONTROL(%r15) # get control block pointer 198 199 oi __SIE_PROG0C+3(%r14),1 # we are going into SIE now ··· 217 216 lg %r14,__LC_CURRENT(%r14) 218 217 mvi __TI_sie(%r14),0 219 218 SYM_INNER_LABEL(sie_exit, SYM_L_GLOBAL) 219 + stnsm __SF_SIE_IRQ(%r15),0xfc # disable interrupts 220 220 lg %r14,__SF_SIE_SAVEAREA(%r15) # load guest register save area 221 221 stmg %r0,%r13,0(%r14) # save guest gprs 0-13 222 222 xgr %r0,%r0 # clear guest registers to
+1 -1
arch/s390/kvm/Kconfig
··· 20 20 def_tristate y 21 21 prompt "Kernel-based Virtual Machine (KVM) support" 22 22 select HAVE_KVM_CPU_RELAX_INTERCEPT 23 - select HAVE_KVM_VCPU_ASYNC_IOCTL 24 23 select KVM_ASYNC_PF 25 24 select KVM_ASYNC_PF_SYNC 26 25 select KVM_COMMON ··· 29 30 select HAVE_KVM_NO_POLL 30 31 select KVM_VFIO 31 32 select MMU_NOTIFIER 33 + select VIRT_XFER_TO_GUEST_WORK 32 34 help 33 35 Support hosting paravirtualized guest machines using the SIE 34 36 virtualization capability on the mainframe. This should work
+6 -21
arch/s390/kvm/gaccess.c
··· 109 109 110 110 int ipte_lock_held(struct kvm *kvm) 111 111 { 112 - if (sclp.has_siif) { 113 - int rc; 112 + if (sclp.has_siif) 113 + return kvm->arch.sca->ipte_control.kh != 0; 114 114 115 - read_lock(&kvm->arch.sca_lock); 116 - rc = kvm_s390_get_ipte_control(kvm)->kh != 0; 117 - read_unlock(&kvm->arch.sca_lock); 118 - return rc; 119 - } 120 115 return kvm->arch.ipte_lock_count != 0; 121 116 } 122 117 ··· 124 129 if (kvm->arch.ipte_lock_count > 1) 125 130 goto out; 126 131 retry: 127 - read_lock(&kvm->arch.sca_lock); 128 - ic = kvm_s390_get_ipte_control(kvm); 132 + ic = &kvm->arch.sca->ipte_control; 129 133 old = READ_ONCE(*ic); 130 134 do { 131 135 if (old.k) { 132 - read_unlock(&kvm->arch.sca_lock); 133 136 cond_resched(); 134 137 goto retry; 135 138 } 136 139 new = old; 137 140 new.k = 1; 138 141 } while (!try_cmpxchg(&ic->val, &old.val, new.val)); 139 - read_unlock(&kvm->arch.sca_lock); 140 142 out: 141 143 mutex_unlock(&kvm->arch.ipte_mutex); 142 144 } ··· 146 154 kvm->arch.ipte_lock_count--; 147 155 if (kvm->arch.ipte_lock_count) 148 156 goto out; 149 - read_lock(&kvm->arch.sca_lock); 150 - ic = kvm_s390_get_ipte_control(kvm); 157 + ic = &kvm->arch.sca->ipte_control; 151 158 old = READ_ONCE(*ic); 152 159 do { 153 160 new = old; 154 161 new.k = 0; 155 162 } while (!try_cmpxchg(&ic->val, &old.val, new.val)); 156 - read_unlock(&kvm->arch.sca_lock); 157 163 wake_up(&kvm->arch.ipte_wq); 158 164 out: 159 165 mutex_unlock(&kvm->arch.ipte_mutex); ··· 162 172 union ipte_control old, new, *ic; 163 173 164 174 retry: 165 - read_lock(&kvm->arch.sca_lock); 166 - ic = kvm_s390_get_ipte_control(kvm); 175 + ic = &kvm->arch.sca->ipte_control; 167 176 old = READ_ONCE(*ic); 168 177 do { 169 178 if (old.kg) { 170 - read_unlock(&kvm->arch.sca_lock); 171 179 cond_resched(); 172 180 goto retry; 173 181 } ··· 173 185 new.k = 1; 174 186 new.kh++; 175 187 } while (!try_cmpxchg(&ic->val, &old.val, new.val)); 176 - read_unlock(&kvm->arch.sca_lock); 177 188 } 178 189 179 190 static void ipte_unlock_siif(struct kvm *kvm) 180 191 { 181 192 union ipte_control old, new, *ic; 182 193 183 - read_lock(&kvm->arch.sca_lock); 184 - ic = kvm_s390_get_ipte_control(kvm); 194 + ic = &kvm->arch.sca->ipte_control; 185 195 old = READ_ONCE(*ic); 186 196 do { 187 197 new = old; ··· 187 201 if (!new.kh) 188 202 new.k = 0; 189 203 } while (!try_cmpxchg(&ic->val, &old.val, new.val)); 190 - read_unlock(&kvm->arch.sca_lock); 191 204 if (!new.kh) 192 205 wake_up(&kvm->arch.ipte_wq); 193 206 }
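With the SCA now fixed at ESCA for the VM's lifetime, the rwlock that guarded against a bsca-to-esca switch is gone and the ipte lock boils down to a compare-and-swap retry loop on the ipte_control word. A stand-alone model of that pattern using C11 atomics in place of the kernel's try_cmpxchg() (only the exclusive k bit is modeled; the real word also carries the kh holder count):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define IC_K	(UINT64_C(1) << 0)	/* "exclusively locked" bit */

static _Atomic uint64_t ic;	/* stands in for sca->ipte_control */

static void ipte_lock(void)
{
	uint64_t old;

	for (;;) {
		old = atomic_load(&ic);
		if (old & IC_K)
			continue;	/* contended; the kernel cond_resched()s */
		if (atomic_compare_exchange_strong(&ic, &old, old | IC_K))
			return;
	}
}

static void ipte_unlock(void)
{
	uint64_t old;

	do {
		old = atomic_load(&ic);
	} while (!atomic_compare_exchange_strong(&ic, &old, old & ~IC_K));
}

int main(void)
{
	ipte_lock();
	printf("locked: k=%u\n", (unsigned)(atomic_load(&ic) & IC_K));
	ipte_unlock();
	return 0;
}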
+3
arch/s390/kvm/intercept.c
··· 471 471 if (vcpu->arch.sie_block->ipa == 0xb256) 472 472 return handle_sthyi(vcpu); 473 473 474 + if (vcpu->kvm->arch.user_operexec) 475 + return -EOPNOTSUPP; 476 + 474 477 if (vcpu->arch.sie_block->ipa == 0 && vcpu->kvm->arch.user_instr0) 475 478 return -EOPNOTSUPP; 476 479 rc = read_guest_lc(vcpu, __LC_PGM_NEW_PSW, &newpsw, sizeof(psw_t));
+17 -63
arch/s390/kvm/interrupt.c
··· 44 44 /* handle external calls via sigp interpretation facility */ 45 45 static int sca_ext_call_pending(struct kvm_vcpu *vcpu, int *src_id) 46 46 { 47 - int c, scn; 47 + struct esca_block *sca = vcpu->kvm->arch.sca; 48 + union esca_sigp_ctrl sigp_ctrl = sca->cpu[vcpu->vcpu_id].sigp_ctrl; 48 49 49 50 if (!kvm_s390_test_cpuflags(vcpu, CPUSTAT_ECALL_PEND)) 50 51 return 0; 51 52 52 53 BUG_ON(!kvm_s390_use_sca_entries()); 53 - read_lock(&vcpu->kvm->arch.sca_lock); 54 - if (vcpu->kvm->arch.use_esca) { 55 - struct esca_block *sca = vcpu->kvm->arch.sca; 56 - union esca_sigp_ctrl sigp_ctrl = 57 - sca->cpu[vcpu->vcpu_id].sigp_ctrl; 58 - 59 - c = sigp_ctrl.c; 60 - scn = sigp_ctrl.scn; 61 - } else { 62 - struct bsca_block *sca = vcpu->kvm->arch.sca; 63 - union bsca_sigp_ctrl sigp_ctrl = 64 - sca->cpu[vcpu->vcpu_id].sigp_ctrl; 65 - 66 - c = sigp_ctrl.c; 67 - scn = sigp_ctrl.scn; 68 - } 69 - read_unlock(&vcpu->kvm->arch.sca_lock); 70 54 71 55 if (src_id) 72 - *src_id = scn; 56 + *src_id = sigp_ctrl.scn; 73 57 74 - return c; 58 + return sigp_ctrl.c; 75 59 } 76 60 77 61 static int sca_inject_ext_call(struct kvm_vcpu *vcpu, int src_id) 78 62 { 63 + struct esca_block *sca = vcpu->kvm->arch.sca; 64 + union esca_sigp_ctrl *sigp_ctrl = &sca->cpu[vcpu->vcpu_id].sigp_ctrl; 65 + union esca_sigp_ctrl old_val, new_val = {.scn = src_id, .c = 1}; 79 66 int expect, rc; 80 67 81 68 BUG_ON(!kvm_s390_use_sca_entries()); 82 - read_lock(&vcpu->kvm->arch.sca_lock); 83 - if (vcpu->kvm->arch.use_esca) { 84 - struct esca_block *sca = vcpu->kvm->arch.sca; 85 - union esca_sigp_ctrl *sigp_ctrl = 86 - &(sca->cpu[vcpu->vcpu_id].sigp_ctrl); 87 - union esca_sigp_ctrl new_val = {0}, old_val; 88 69 89 - old_val = READ_ONCE(*sigp_ctrl); 90 - new_val.scn = src_id; 91 - new_val.c = 1; 92 - old_val.c = 0; 70 + old_val = READ_ONCE(*sigp_ctrl); 71 + old_val.c = 0; 93 72 94 - expect = old_val.value; 95 - rc = cmpxchg(&sigp_ctrl->value, old_val.value, new_val.value); 96 - } else { 97 - struct bsca_block *sca = vcpu->kvm->arch.sca; 98 - union bsca_sigp_ctrl *sigp_ctrl = 99 - &(sca->cpu[vcpu->vcpu_id].sigp_ctrl); 100 - union bsca_sigp_ctrl new_val = {0}, old_val; 101 - 102 - old_val = READ_ONCE(*sigp_ctrl); 103 - new_val.scn = src_id; 104 - new_val.c = 1; 105 - old_val.c = 0; 106 - 107 - expect = old_val.value; 108 - rc = cmpxchg(&sigp_ctrl->value, old_val.value, new_val.value); 109 - } 110 - read_unlock(&vcpu->kvm->arch.sca_lock); 73 + expect = old_val.value; 74 + rc = cmpxchg(&sigp_ctrl->value, old_val.value, new_val.value); 111 75 112 76 if (rc != expect) { 113 77 /* another external call is pending */ ··· 83 119 84 120 static void sca_clear_ext_call(struct kvm_vcpu *vcpu) 85 121 { 122 + struct esca_block *sca = vcpu->kvm->arch.sca; 123 + union esca_sigp_ctrl *sigp_ctrl = &sca->cpu[vcpu->vcpu_id].sigp_ctrl; 124 + 86 125 if (!kvm_s390_use_sca_entries()) 87 126 return; 88 127 kvm_s390_clear_cpuflags(vcpu, CPUSTAT_ECALL_PEND); 89 - read_lock(&vcpu->kvm->arch.sca_lock); 90 - if (vcpu->kvm->arch.use_esca) { 91 - struct esca_block *sca = vcpu->kvm->arch.sca; 92 - union esca_sigp_ctrl *sigp_ctrl = 93 - &(sca->cpu[vcpu->vcpu_id].sigp_ctrl); 94 128 95 - WRITE_ONCE(sigp_ctrl->value, 0); 96 - } else { 97 - struct bsca_block *sca = vcpu->kvm->arch.sca; 98 - union bsca_sigp_ctrl *sigp_ctrl = 99 - &(sca->cpu[vcpu->vcpu_id].sigp_ctrl); 100 - 101 - WRITE_ONCE(sigp_ctrl->value, 0); 102 - } 103 - read_unlock(&vcpu->kvm->arch.sca_lock); 129 + WRITE_ONCE(sigp_ctrl->value, 0); 104 130 } 105 131 106 132 int psw_extint_disabled(struct kvm_vcpu *vcpu) ··· 1177 
1223 { 1178 1224 struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 1179 1225 1180 - if (!sclp.has_sigpif) 1226 + if (!kvm_s390_use_sca_entries()) 1181 1227 return test_bit(IRQ_PEND_EXT_EXTERNAL, &li->pending_irqs); 1182 1228 1183 1229 return sca_ext_call_pending(vcpu, NULL); ··· 1502 1548 if (kvm_get_vcpu_by_id(vcpu->kvm, src_id) == NULL) 1503 1549 return -EINVAL; 1504 1550 1505 - if (sclp.has_sigpif && !kvm_s390_pv_cpu_get_handle(vcpu)) 1551 + if (kvm_s390_use_sca_entries() && !kvm_s390_pv_cpu_get_handle(vcpu)) 1506 1552 return sca_inject_ext_call(vcpu, src_id); 1507 1553 1508 1554 if (test_and_set_bit(IRQ_PEND_EXT_EXTERNAL, &li->pending_irqs))
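With the basic SCA gone, the ESCA entry handling above becomes a single-word protocol: the pending bit (c) and source CPU number (scn) live together in one sigp_ctrl word, so injection can claim the slot with a plain cmpxchg and clearing is a WRITE_ONCE of zero, with no sca_lock. A minimal standalone sketch of the claim step (the union layout and field widths here are illustrative, not the architected SIGP control format, and inject_ext_call() is a hypothetical stand-in for sca_inject_ext_call()):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Illustrative layout: one pending bit 'c' plus the source CPU
     * number 'scn', packed into a single 32-bit control word. */
    union sigp_ctrl {
        struct { uint32_t c : 1; uint32_t unused : 15; uint32_t scn : 16; };
        uint32_t value;
    };

    /* Mirror of the cmpxchg in sca_inject_ext_call(): force c=0 in the
     * expected value so the exchange fails, and the caller backs off,
     * whenever another external call is already pending. */
    static int inject_ext_call(_Atomic uint32_t *ctrl, uint16_t src_id)
    {
        union sigp_ctrl old = { .value = atomic_load(ctrl) };
        union sigp_ctrl new = { .c = 1, .scn = src_id };

        old.c = 0;
        return atomic_compare_exchange_strong(ctrl, &old.value, new.value)
               ? 0 : -1 /* another external call pending (-EBUSY) */;
    }

Because reader and writer touch the same word, the lock-free read in sca_ext_call_pending() stays coherent without any extra synchronization.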
+64 -169
arch/s390/kvm/kvm-s390.c
··· 13 13 #define pr_fmt(fmt) "kvm-s390: " fmt 14 14 15 15 #include <linux/compiler.h> 16 + #include <linux/entry-virt.h> 16 17 #include <linux/export.h> 17 18 #include <linux/err.h> 18 19 #include <linux/fs.h> ··· 185 184 STATS_DESC_COUNTER(VCPU, instruction_diagnose_308), 186 185 STATS_DESC_COUNTER(VCPU, instruction_diagnose_500), 187 186 STATS_DESC_COUNTER(VCPU, instruction_diagnose_other), 188 - STATS_DESC_COUNTER(VCPU, pfault_sync) 187 + STATS_DESC_COUNTER(VCPU, pfault_sync), 188 + STATS_DESC_COUNTER(VCPU, signal_exits) 189 189 }; 190 190 191 191 const struct kvm_stats_header kvm_vcpu_stats_header = { ··· 273 271 /* forward declarations */ 274 272 static void kvm_gmap_notifier(struct gmap *gmap, unsigned long start, 275 273 unsigned long end); 276 - static int sca_switch_to_extended(struct kvm *kvm); 277 274 278 275 static void kvm_clock_sync_scb(struct kvm_s390_sie_block *scb, u64 delta) 279 276 { ··· 607 606 case KVM_CAP_SET_GUEST_DEBUG: 608 607 case KVM_CAP_S390_DIAG318: 609 608 case KVM_CAP_IRQFD_RESAMPLE: 609 + case KVM_CAP_S390_USER_OPEREXEC: 610 610 r = 1; 611 611 break; 612 612 case KVM_CAP_SET_GUEST_DEBUG2: ··· 633 631 case KVM_CAP_NR_VCPUS: 634 632 case KVM_CAP_MAX_VCPUS: 635 633 case KVM_CAP_MAX_VCPU_ID: 636 - r = KVM_S390_BSCA_CPU_SLOTS; 634 + /* 635 + * Return the same value for KVM_CAP_MAX_VCPUS and 636 + * KVM_CAP_MAX_VCPU_ID to conform with the KVM API. 637 + */ 638 + r = KVM_S390_ESCA_CPU_SLOTS; 637 639 if (!kvm_s390_use_sca_entries()) 638 640 r = KVM_MAX_VCPUS; 639 - else if (sclp.has_esca && sclp.has_64bscao) 640 - r = KVM_S390_ESCA_CPU_SLOTS; 641 641 if (ext == KVM_CAP_NR_VCPUS) 642 642 r = min_t(unsigned int, num_online_cpus(), r); 643 643 break; ··· 922 918 mutex_unlock(&kvm->lock); 923 919 VM_EVENT(kvm, 3, "ENABLE: CAP_S390_CPU_TOPOLOGY %s", 924 920 r ? "(not available)" : "(success)"); 921 + break; 922 + case KVM_CAP_S390_USER_OPEREXEC: 923 + VM_EVENT(kvm, 3, "%s", "ENABLE: CAP_S390_USER_OPEREXEC"); 924 + kvm->arch.user_operexec = 1; 925 + icpt_operexc_on_all_vcpus(kvm); 926 + r = 0; 925 927 break; 926 928 default: 927 929 r = -EINVAL; ··· 1940 1930 * Updates the Multiprocessor Topology-Change-Report bit to signal 1941 1931 * the guest with a topology change. 1942 1932 * This is only relevant if the topology facility is present. 1943 - * 1944 - * The SCA version, bsca or esca, doesn't matter as offset is the same. 1945 1933 */ 1946 1934 static void kvm_s390_update_topology_change_report(struct kvm *kvm, bool val) 1947 1935 { 1948 1936 union sca_utility new, old; 1949 - struct bsca_block *sca; 1937 + struct esca_block *sca; 1950 1938 1951 - read_lock(&kvm->arch.sca_lock); 1952 1939 sca = kvm->arch.sca; 1953 1940 old = READ_ONCE(sca->utility); 1954 1941 do { 1955 1942 new = old; 1956 1943 new.mtcr = val; 1957 1944 } while (!try_cmpxchg(&sca->utility.val, &old.val, new.val)); 1958 - read_unlock(&kvm->arch.sca_lock); 1959 1945 } 1960 1946 1961 1947 static int kvm_s390_set_topo_change_indication(struct kvm *kvm, ··· 1972 1966 if (!test_kvm_facility(kvm, 11)) 1973 1967 return -ENXIO; 1974 1968 1975 - read_lock(&kvm->arch.sca_lock); 1976 - topo = ((struct bsca_block *)kvm->arch.sca)->utility.mtcr; 1977 - read_unlock(&kvm->arch.sca_lock); 1969 + topo = kvm->arch.sca->utility.mtcr; 1978 1970 1979 1971 return put_user(topo, (u8 __user *)attr->addr); 1980 1972 } ··· 2670 2666 if (kvm_s390_pv_is_protected(kvm)) 2671 2667 break; 2672 2668 2673 - /* 2674 - * FMT 4 SIE needs esca. 
As we never switch back to bsca from 2675 - * esca, we need no cleanup in the error cases below 2676 - */ 2677 - r = sca_switch_to_extended(kvm); 2678 - if (r) 2679 - break; 2680 - 2681 2669 mmap_write_lock(kvm->mm); 2682 2670 r = gmap_helper_disable_cow_sharing(); 2683 2671 mmap_write_unlock(kvm->mm); ··· 3312 3316 3313 3317 static void sca_dispose(struct kvm *kvm) 3314 3318 { 3315 - if (kvm->arch.use_esca) 3316 - free_pages_exact(kvm->arch.sca, sizeof(struct esca_block)); 3317 - else 3318 - free_page((unsigned long)(kvm->arch.sca)); 3319 + free_pages_exact(kvm->arch.sca, sizeof(*kvm->arch.sca)); 3319 3320 kvm->arch.sca = NULL; 3320 3321 } 3321 3322 ··· 3326 3333 3327 3334 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) 3328 3335 { 3329 - gfp_t alloc_flags = GFP_KERNEL_ACCOUNT; 3330 - int i, rc; 3336 + gfp_t alloc_flags = GFP_KERNEL_ACCOUNT | __GFP_ZERO; 3331 3337 char debug_name[16]; 3332 - static unsigned long sca_offset; 3338 + int i, rc; 3333 3339 3334 3340 rc = -EINVAL; 3335 3341 #ifdef CONFIG_KVM_S390_UCONTROL ··· 3349 3357 3350 3358 if (!sclp.has_64bscao) 3351 3359 alloc_flags |= GFP_DMA; 3352 - rwlock_init(&kvm->arch.sca_lock); 3353 - /* start with basic SCA */ 3354 - kvm->arch.sca = (struct bsca_block *) get_zeroed_page(alloc_flags); 3360 + mutex_lock(&kvm_lock); 3361 + 3362 + kvm->arch.sca = alloc_pages_exact(sizeof(*kvm->arch.sca), alloc_flags); 3363 + mutex_unlock(&kvm_lock); 3355 3364 if (!kvm->arch.sca) 3356 3365 goto out_err; 3357 - mutex_lock(&kvm_lock); 3358 - sca_offset += 16; 3359 - if (sca_offset + sizeof(struct bsca_block) > PAGE_SIZE) 3360 - sca_offset = 0; 3361 - kvm->arch.sca = (struct bsca_block *) 3362 - ((char *) kvm->arch.sca + sca_offset); 3363 - mutex_unlock(&kvm_lock); 3364 3366 3365 - sprintf(debug_name, "kvm-%u", current->pid); 3367 + snprintf(debug_name, sizeof(debug_name), "kvm-%u", current->pid); 3366 3368 3367 3369 kvm->arch.dbf = debug_register(debug_name, 32, 1, 7 * sizeof(long)); 3368 3370 if (!kvm->arch.dbf) ··· 3533 3547 3534 3548 static void sca_del_vcpu(struct kvm_vcpu *vcpu) 3535 3549 { 3550 + struct esca_block *sca = vcpu->kvm->arch.sca; 3551 + 3536 3552 if (!kvm_s390_use_sca_entries()) 3537 3553 return; 3538 - read_lock(&vcpu->kvm->arch.sca_lock); 3539 - if (vcpu->kvm->arch.use_esca) { 3540 - struct esca_block *sca = vcpu->kvm->arch.sca; 3541 3554 3542 - clear_bit_inv(vcpu->vcpu_id, (unsigned long *) sca->mcn); 3543 - sca->cpu[vcpu->vcpu_id].sda = 0; 3544 - } else { 3545 - struct bsca_block *sca = vcpu->kvm->arch.sca; 3546 - 3547 - clear_bit_inv(vcpu->vcpu_id, (unsigned long *) &sca->mcn); 3548 - sca->cpu[vcpu->vcpu_id].sda = 0; 3549 - } 3550 - read_unlock(&vcpu->kvm->arch.sca_lock); 3555 + clear_bit_inv(vcpu->vcpu_id, (unsigned long *)sca->mcn); 3556 + sca->cpu[vcpu->vcpu_id].sda = 0; 3551 3557 } 3552 3558 3553 3559 static void sca_add_vcpu(struct kvm_vcpu *vcpu) 3554 3560 { 3555 - if (!kvm_s390_use_sca_entries()) { 3556 - phys_addr_t sca_phys = virt_to_phys(vcpu->kvm->arch.sca); 3561 + struct esca_block *sca = vcpu->kvm->arch.sca; 3562 + phys_addr_t sca_phys = virt_to_phys(sca); 3557 3563 3558 - /* we still need the basic sca for the ipte control */ 3559 - vcpu->arch.sie_block->scaoh = sca_phys >> 32; 3560 - vcpu->arch.sie_block->scaol = sca_phys; 3564 + /* we still need the sca header for the ipte control */ 3565 + vcpu->arch.sie_block->scaoh = sca_phys >> 32; 3566 + vcpu->arch.sie_block->scaol = sca_phys & ESCA_SCAOL_MASK; 3567 + vcpu->arch.sie_block->ecb2 |= ECB2_ESCA; 3568 + 3569 + if (!kvm_s390_use_sca_entries()) 3561 3570 
return; 3562 - } 3563 - read_lock(&vcpu->kvm->arch.sca_lock); 3564 - if (vcpu->kvm->arch.use_esca) { 3565 - struct esca_block *sca = vcpu->kvm->arch.sca; 3566 - phys_addr_t sca_phys = virt_to_phys(sca); 3567 3571 3568 - sca->cpu[vcpu->vcpu_id].sda = virt_to_phys(vcpu->arch.sie_block); 3569 - vcpu->arch.sie_block->scaoh = sca_phys >> 32; 3570 - vcpu->arch.sie_block->scaol = sca_phys & ESCA_SCAOL_MASK; 3571 - vcpu->arch.sie_block->ecb2 |= ECB2_ESCA; 3572 - set_bit_inv(vcpu->vcpu_id, (unsigned long *) sca->mcn); 3573 - } else { 3574 - struct bsca_block *sca = vcpu->kvm->arch.sca; 3575 - phys_addr_t sca_phys = virt_to_phys(sca); 3576 - 3577 - sca->cpu[vcpu->vcpu_id].sda = virt_to_phys(vcpu->arch.sie_block); 3578 - vcpu->arch.sie_block->scaoh = sca_phys >> 32; 3579 - vcpu->arch.sie_block->scaol = sca_phys; 3580 - set_bit_inv(vcpu->vcpu_id, (unsigned long *) &sca->mcn); 3581 - } 3582 - read_unlock(&vcpu->kvm->arch.sca_lock); 3583 - } 3584 - 3585 - /* Basic SCA to Extended SCA data copy routines */ 3586 - static inline void sca_copy_entry(struct esca_entry *d, struct bsca_entry *s) 3587 - { 3588 - d->sda = s->sda; 3589 - d->sigp_ctrl.c = s->sigp_ctrl.c; 3590 - d->sigp_ctrl.scn = s->sigp_ctrl.scn; 3591 - } 3592 - 3593 - static void sca_copy_b_to_e(struct esca_block *d, struct bsca_block *s) 3594 - { 3595 - int i; 3596 - 3597 - d->ipte_control = s->ipte_control; 3598 - d->mcn[0] = s->mcn; 3599 - for (i = 0; i < KVM_S390_BSCA_CPU_SLOTS; i++) 3600 - sca_copy_entry(&d->cpu[i], &s->cpu[i]); 3601 - } 3602 - 3603 - static int sca_switch_to_extended(struct kvm *kvm) 3604 - { 3605 - struct bsca_block *old_sca = kvm->arch.sca; 3606 - struct esca_block *new_sca; 3607 - struct kvm_vcpu *vcpu; 3608 - unsigned long vcpu_idx; 3609 - u32 scaol, scaoh; 3610 - phys_addr_t new_sca_phys; 3611 - 3612 - if (kvm->arch.use_esca) 3613 - return 0; 3614 - 3615 - new_sca = alloc_pages_exact(sizeof(*new_sca), GFP_KERNEL_ACCOUNT | __GFP_ZERO); 3616 - if (!new_sca) 3617 - return -ENOMEM; 3618 - 3619 - new_sca_phys = virt_to_phys(new_sca); 3620 - scaoh = new_sca_phys >> 32; 3621 - scaol = new_sca_phys & ESCA_SCAOL_MASK; 3622 - 3623 - kvm_s390_vcpu_block_all(kvm); 3624 - write_lock(&kvm->arch.sca_lock); 3625 - 3626 - sca_copy_b_to_e(new_sca, old_sca); 3627 - 3628 - kvm_for_each_vcpu(vcpu_idx, vcpu, kvm) { 3629 - vcpu->arch.sie_block->scaoh = scaoh; 3630 - vcpu->arch.sie_block->scaol = scaol; 3631 - vcpu->arch.sie_block->ecb2 |= ECB2_ESCA; 3632 - } 3633 - kvm->arch.sca = new_sca; 3634 - kvm->arch.use_esca = 1; 3635 - 3636 - write_unlock(&kvm->arch.sca_lock); 3637 - kvm_s390_vcpu_unblock_all(kvm); 3638 - 3639 - free_page((unsigned long)old_sca); 3640 - 3641 - VM_EVENT(kvm, 2, "Switched to ESCA (0x%p -> 0x%p)", 3642 - old_sca, kvm->arch.sca); 3643 - return 0; 3572 + set_bit_inv(vcpu->vcpu_id, (unsigned long *)sca->mcn); 3573 + sca->cpu[vcpu->vcpu_id].sda = virt_to_phys(vcpu->arch.sie_block); 3644 3574 } 3645 3575 3646 3576 static int sca_can_add_vcpu(struct kvm *kvm, unsigned int id) 3647 3577 { 3648 - int rc; 3578 + if (!kvm_s390_use_sca_entries()) 3579 + return id < KVM_MAX_VCPUS; 3649 3580 3650 - if (!kvm_s390_use_sca_entries()) { 3651 - if (id < KVM_MAX_VCPUS) 3652 - return true; 3653 - return false; 3654 - } 3655 - if (id < KVM_S390_BSCA_CPU_SLOTS) 3656 - return true; 3657 - if (!sclp.has_esca || !sclp.has_64bscao) 3658 - return false; 3659 - 3660 - rc = kvm->arch.use_esca ? 
0 : sca_switch_to_extended(kvm); 3661 - 3662 - return rc == 0 && id < KVM_S390_ESCA_CPU_SLOTS; 3581 + return id < KVM_S390_ESCA_CPU_SLOTS; 3663 3582 } 3664 3583 3665 3584 /* needs disabled preemption to protect from TOD sync and vcpu_load/put */ ··· 3810 3919 vcpu->arch.sie_block->eca |= ECA_IB; 3811 3920 if (sclp.has_siif) 3812 3921 vcpu->arch.sie_block->eca |= ECA_SII; 3813 - if (sclp.has_sigpif) 3922 + if (kvm_s390_use_sca_entries()) 3814 3923 vcpu->arch.sie_block->eca |= ECA_SIGPI; 3815 3924 if (test_kvm_facility(vcpu->kvm, 129)) { 3816 3925 vcpu->arch.sie_block->eca |= ECA_VX; ··· 4257 4366 4258 4367 int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) 4259 4368 { 4260 - int ret = 0; 4261 - 4262 4369 vcpu_load(vcpu); 4263 4370 4264 4371 vcpu->run->s.regs.fpc = fpu->fpc; ··· 4267 4378 memcpy(vcpu->run->s.regs.fprs, &fpu->fprs, sizeof(fpu->fprs)); 4268 4379 4269 4380 vcpu_put(vcpu); 4270 - return ret; 4381 + return 0; 4271 4382 } 4272 4383 4273 4384 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) ··· 4675 4786 vcpu->arch.sie_block->gg14 = vcpu->run->s.regs.gprs[14]; 4676 4787 vcpu->arch.sie_block->gg15 = vcpu->run->s.regs.gprs[15]; 4677 4788 4678 - if (need_resched()) 4679 - schedule(); 4680 - 4681 4789 if (!kvm_is_ucontrol(vcpu->kvm)) { 4682 4790 rc = kvm_s390_deliver_pending_interrupts(vcpu); 4683 4791 if (rc || guestdbg_exit_pending(vcpu)) ··· 4959 5073 * The guest_state_{enter,exit}_irqoff() functions inform lockdep and 4960 5074 * tracing that entry to the guest will enable host IRQs, and exit from 4961 5075 * the guest will disable host IRQs. 4962 - * 4963 - * We must not use lockdep/tracing/RCU in this critical section, so we 4964 - * use the low-level arch_local_irq_*() helpers to enable/disable IRQs. 
4965 5076 */ 4966 - arch_local_irq_enable(); 4967 5077 ret = sie64a(scb, gprs, gasce); 4968 - arch_local_irq_disable(); 4969 5078 4970 5079 guest_state_exit_irqoff(); 4971 5080 ··· 4979 5098 */ 4980 5099 kvm_vcpu_srcu_read_lock(vcpu); 4981 5100 4982 - do { 5101 + while (true) { 4983 5102 rc = vcpu_pre_run(vcpu); 5103 + kvm_vcpu_srcu_read_unlock(vcpu); 4984 5104 if (rc || guestdbg_exit_pending(vcpu)) 4985 5105 break; 4986 5106 4987 - kvm_vcpu_srcu_read_unlock(vcpu); 4988 5107 /* 4989 5108 * As PF_VCPU will be used in fault handler, between 4990 5109 * guest_timing_enter_irqoff and guest_timing_exit_irqoff ··· 4996 5115 sizeof(sie_page->pv_grregs)); 4997 5116 } 4998 5117 5118 + xfer_to_guest_mode_check: 4999 5119 local_irq_disable(); 5120 + xfer_to_guest_mode_prepare(); 5121 + if (xfer_to_guest_mode_work_pending()) { 5122 + local_irq_enable(); 5123 + rc = kvm_xfer_to_guest_mode_handle_work(vcpu); 5124 + if (rc) 5125 + break; 5126 + goto xfer_to_guest_mode_check; 5127 + } 5128 + 5000 5129 guest_timing_enter_irqoff(); 5001 5130 __disable_cpu_timer_accounting(vcpu); 5002 5131 ··· 5036 5145 kvm_vcpu_srcu_read_lock(vcpu); 5037 5146 5038 5147 rc = vcpu_post_run(vcpu, exit_reason); 5039 - } while (!signal_pending(current) && !guestdbg_exit_pending(vcpu) && !rc); 5148 + if (rc || guestdbg_exit_pending(vcpu)) { 5149 + kvm_vcpu_srcu_read_unlock(vcpu); 5150 + break; 5151 + } 5152 + } 5040 5153 5041 - kvm_vcpu_srcu_read_unlock(vcpu); 5042 5154 return rc; 5043 5155 } 5044 5156 ··· 5257 5363 5258 5364 if (signal_pending(current) && !rc) { 5259 5365 kvm_run->exit_reason = KVM_EXIT_INTR; 5366 + vcpu->stat.signal_exits++; 5260 5367 rc = -EINTR; 5261 5368 } 5262 5369 ··· 5624 5729 return r; 5625 5730 } 5626 5731 5627 - long kvm_arch_vcpu_async_ioctl(struct file *filp, 5628 - unsigned int ioctl, unsigned long arg) 5732 + long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl, 5733 + unsigned long arg) 5629 5734 { 5630 5735 struct kvm_vcpu *vcpu = filp->private_data; 5631 5736 void __user *argp = (void __user *)arg;
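The open-coded need_resched()/signal_pending() checks are replaced by the generic virt entry work loop, which is why <linux/entry-virt.h> is now included and why the run loop bails out through kvm_xfer_to_guest_mode_handle_work(). A hedged sketch of the contract the loop above follows (enter_guest() is a hypothetical wrapper; the entry/virt helpers are the real ones used in the hunks):

    #include <linux/entry-virt.h>

    static int enter_guest(struct kvm_vcpu *vcpu)
    {
        int rc;

    retry:
        local_irq_disable();
        xfer_to_guest_mode_prepare();
        if (xfer_to_guest_mode_work_pending()) {
            /* Work (signal, resched, ...) is handled with IRQs on. */
            local_irq_enable();
            rc = kvm_xfer_to_guest_mode_handle_work(vcpu);
            if (rc)
                return rc;      /* e.g. -EINTR: exit to userspace */
            goto retry;         /* re-check with IRQs disabled */
        }
        /* ... guest_timing_enter_irqoff(), sie64a(), ... */
        local_irq_enable();
        return 0;
    }

The key invariant is that the final pending-work check runs with interrupts disabled, so nothing can sneak in between the check and the actual entry into SIE.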
+1 -8
arch/s390/kvm/kvm-s390.h
··· 570 570 int kvm_s390_handle_per_ifetch_icpt(struct kvm_vcpu *vcpu); 571 571 int kvm_s390_handle_per_event(struct kvm_vcpu *vcpu); 572 572 573 - /* support for Basic/Extended SCA handling */ 574 - static inline union ipte_control *kvm_s390_get_ipte_control(struct kvm *kvm) 575 - { 576 - struct bsca_block *sca = kvm->arch.sca; /* SCA version doesn't matter */ 577 - 578 - return &sca->ipte_control; 579 - } 580 573 static inline int kvm_s390_use_sca_entries(void) 581 574 { 582 575 /* ··· 577 584 * might use the entries. By not setting the entries and keeping them 578 585 * invalid, hardware will not access them but intercept. 579 586 */ 580 - return sclp.has_sigpif; 587 + return sclp.has_sigpif && sclp.has_esca; 581 588 } 582 589 void kvm_s390_reinject_machine_check(struct kvm_vcpu *vcpu, 583 590 struct mcck_volatile_info *mcck_info);
+14 -6
arch/s390/kvm/vsie.c
··· 782 782 else if ((gpa & ~0x1fffUL) == kvm_s390_get_prefix(vcpu)) 783 783 rc = set_validity_icpt(scb_s, 0x0011U); 784 784 else if ((gpa & PAGE_MASK) != 785 - ((gpa + sizeof(struct bsca_block) - 1) & PAGE_MASK)) 785 + ((gpa + offsetof(struct bsca_block, cpu[0]) - 1) & PAGE_MASK)) 786 786 rc = set_validity_icpt(scb_s, 0x003bU); 787 787 if (!rc) { 788 788 rc = pin_guest_page(vcpu->kvm, gpa, &hpa); ··· 1180 1180 current->thread.gmap_int_code = 0; 1181 1181 barrier(); 1182 1182 if (!kvm_s390_vcpu_sie_inhibited(vcpu)) { 1183 + xfer_to_guest_mode_check: 1183 1184 local_irq_disable(); 1185 + xfer_to_guest_mode_prepare(); 1186 + if (xfer_to_guest_mode_work_pending()) { 1187 + local_irq_enable(); 1188 + rc = kvm_xfer_to_guest_mode_handle_work(vcpu); 1189 + if (rc) 1190 + goto skip_sie; 1191 + goto xfer_to_guest_mode_check; 1192 + } 1184 1193 guest_timing_enter_irqoff(); 1185 1194 rc = kvm_s390_enter_exit_sie(scb_s, vcpu->run->s.regs.gprs, vsie_page->gmap->asce); 1186 1195 guest_timing_exit_irqoff(); 1187 1196 local_irq_enable(); 1188 1197 } 1198 + 1199 + skip_sie: 1189 1200 barrier(); 1190 1201 vcpu->arch.sie_block->prog0c &= ~PROG_IN_SIE; 1191 1202 ··· 1356 1345 * but rewind the PSW to re-enter SIE once that's completed 1357 1346 * instead of passing a "no action" intercept to the guest. 1358 1347 */ 1359 - if (signal_pending(current) || 1360 - kvm_s390_vcpu_has_irq(vcpu, 0) || 1348 + if (kvm_s390_vcpu_has_irq(vcpu, 0) || 1361 1349 kvm_s390_vcpu_sie_inhibited(vcpu)) { 1362 1350 kvm_s390_rewind_psw(vcpu, 4); 1363 1351 break; 1364 1352 } 1365 - cond_resched(); 1366 1353 } 1367 1354 1368 1355 if (rc == -EFAULT) { ··· 1492 1483 if (unlikely(scb_addr & 0x1ffUL)) 1493 1484 return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION); 1494 1485 1495 - if (signal_pending(current) || kvm_s390_vcpu_has_irq(vcpu, 0) || 1496 - kvm_s390_vcpu_sie_inhibited(vcpu)) { 1486 + if (kvm_s390_vcpu_has_irq(vcpu, 0) || kvm_s390_vcpu_sie_inhibited(vcpu)) { 1497 1487 kvm_s390_rewind_psw(vcpu, 4); 1498 1488 return 0; 1499 1489 }
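The validity check above was relaxed from sizeof(struct bsca_block) to offsetof(struct bsca_block, cpu[0]) because vSIE only pins the SCA header (ipte control and MCN), not the per-CPU entries, so only the header must avoid a page-boundary crossing. A standalone sketch of the test (crosses_page() is a hypothetical helper; sizes are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096UL
    #define PAGE_MASK (~(PAGE_SIZE - 1))

    /* True if [gpa, gpa + len) straddles a page boundary; with
     * len = offsetof(struct bsca_block, cpu[0]) only the SCA header
     * is constrained, so more guest layouts pass the 0x003b check. */
    static bool crosses_page(uint64_t gpa, uint64_t len)
    {
        return (gpa & PAGE_MASK) != ((gpa + len - 1) & PAGE_MASK);
    }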
+7
arch/x86/include/asm/cpufeatures.h
··· 339 339 #define X86_FEATURE_AMD_STIBP (13*32+15) /* Single Thread Indirect Branch Predictors */ 340 340 #define X86_FEATURE_AMD_STIBP_ALWAYS_ON (13*32+17) /* Single Thread Indirect Branch Predictors always-on preferred */ 341 341 #define X86_FEATURE_AMD_IBRS_SAME_MODE (13*32+19) /* Indirect Branch Restricted Speculation same mode protection*/ 342 + #define X86_FEATURE_EFER_LMSLE_MBZ (13*32+20) /* EFER.LMSLE must be zero */ 342 343 #define X86_FEATURE_AMD_PPIN (13*32+23) /* "amd_ppin" Protected Processor Inventory Number */ 343 344 #define X86_FEATURE_AMD_SSBD (13*32+24) /* Speculative Store Bypass Disable */ 344 345 #define X86_FEATURE_VIRT_SSBD (13*32+25) /* "virt_ssbd" Virtualized Speculative Store Bypass Disable */ ··· 507 506 #define X86_FEATURE_SGX_EUPDATESVN (21*32+17) /* Support for ENCLS[EUPDATESVN] instruction */ 508 507 509 508 #define X86_FEATURE_SDCIAE (21*32+18) /* L3 Smart Data Cache Injection Allocation Enforcement */ 509 + #define X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO (21*32+19) /* 510 + * Clear CPU buffers before VM-Enter if the vCPU 511 + * can access host MMIO (ignored for all intents 512 + * and purposes if CLEAR_CPU_BUF_VM is set). 513 + */ 514 + #define X86_FEATURE_X2AVIC_EXT (21*32+20) /* AMD SVM x2AVIC support for 4k vCPUs */ 510 515 511 516 /* 512 517 * BUG word(s)
+2 -2
arch/x86/include/asm/hardirq.h
··· 5 5 #include <linux/threads.h> 6 6 7 7 typedef struct { 8 - #if IS_ENABLED(CONFIG_KVM_INTEL) 8 + #if IS_ENABLED(CONFIG_CPU_MITIGATIONS) && IS_ENABLED(CONFIG_KVM_INTEL) 9 9 u8 kvm_cpu_l1tf_flush_l1d; 10 10 #endif 11 11 unsigned int __nmi_count; /* arch dependent */ ··· 68 68 DECLARE_PER_CPU_CACHE_HOT(u16, __softirq_pending); 69 69 #define local_softirq_pending_ref __softirq_pending 70 70 71 - #if IS_ENABLED(CONFIG_KVM_INTEL) 71 + #if IS_ENABLED(CONFIG_CPU_MITIGATIONS) && IS_ENABLED(CONFIG_KVM_INTEL) 72 72 /* 73 73 * This function is called from noinstr interrupt contexts 74 74 * and must be inlined to not get instrumentation.
+1
arch/x86/include/asm/kvm-x86-ops.h
··· 128 128 KVM_X86_OP_OPTIONAL(dev_get_attr) 129 129 KVM_X86_OP_OPTIONAL(mem_enc_ioctl) 130 130 KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl) 131 + KVM_X86_OP_OPTIONAL(vcpu_mem_enc_unlocked_ioctl) 131 132 KVM_X86_OP_OPTIONAL(mem_enc_register_region) 132 133 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region) 133 134 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
+14 -9
arch/x86/include/asm/kvm_host.h
··· 1055 1055 /* be preempted when it's in kernel-mode(cpl=0) */ 1056 1056 bool preempted_in_kernel; 1057 1057 1058 - /* Flush the L1 Data cache for L1TF mitigation on VMENTER */ 1059 - bool l1tf_flush_l1d; 1060 - 1061 1058 /* Host CPU on which VM-entry was most recently attempted */ 1062 1059 int last_vmentry_cpu; 1063 1060 ··· 1453 1456 bool use_master_clock; 1454 1457 u64 master_kernel_ns; 1455 1458 u64 master_cycle_now; 1456 - struct delayed_work kvmclock_update_work; 1457 - struct delayed_work kvmclock_sync_work; 1458 1459 1459 1460 #ifdef CONFIG_KVM_HYPERV 1460 1461 struct kvm_hv hyperv; ··· 1843 1848 void *external_spt); 1844 1849 /* Update the external page table from spte getting set. */ 1845 1850 int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, 1846 - kvm_pfn_t pfn_for_gfn); 1851 + u64 mirror_spte); 1847 1852 1848 1853 /* Update external page tables for page table about to be freed. */ 1849 1854 int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level, 1850 1855 void *external_spt); 1851 1856 1852 1857 /* Update external page table from spte getting removed, and flush TLB. */ 1853 - int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, 1854 - kvm_pfn_t pfn_for_gfn); 1858 + void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level, 1859 + u64 mirror_spte); 1855 1860 1856 1861 bool (*has_wbinvd_exit)(void); 1857 1862 ··· 1909 1914 int (*dev_get_attr)(u32 group, u64 attr, u64 *val); 1910 1915 int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp); 1911 1916 int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp); 1917 + int (*vcpu_mem_enc_unlocked_ioctl)(struct kvm_vcpu *vcpu, void __user *argp); 1912 1918 int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp); 1913 1919 int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp); 1914 1920 int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd); ··· 2139 2143 * the gfn, i.e. retrying the instruction will hit a 2140 2144 * !PRESENT fault, which results in a new shadow page 2141 2145 * and sends KVM back to square one. 2146 + * 2147 + * EMULTYPE_SKIP_SOFT_INT - Set in combination with EMULTYPE_SKIP to only skip 2148 + * an instruction if it could generate a given software 2149 + * interrupt, which must be encoded via 2150 + * EMULTYPE_SET_SOFT_INT_VECTOR(). 
2142 2151 */ 2143 2152 #define EMULTYPE_NO_DECODE (1 << 0) 2144 2153 #define EMULTYPE_TRAP_UD (1 << 1) ··· 2154 2153 #define EMULTYPE_PF (1 << 6) 2155 2154 #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7) 2156 2155 #define EMULTYPE_WRITE_PF_TO_SP (1 << 8) 2156 + #define EMULTYPE_SKIP_SOFT_INT (1 << 9) 2157 + 2158 + #define EMULTYPE_SET_SOFT_INT_VECTOR(v) ((u32)((v) & 0xff) << 16) 2159 + #define EMULTYPE_GET_SOFT_INT_VECTOR(e) (((e) >> 16) & 0xff) 2157 2160 2158 2161 static inline bool kvm_can_emulate_event_vectoring(int emul_type) 2159 2162 { ··· 2172 2167 void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu); 2173 2168 2174 2169 void kvm_prepare_event_vectoring_exit(struct kvm_vcpu *vcpu, gpa_t gpa); 2170 + void kvm_prepare_unexpected_reason_exit(struct kvm_vcpu *vcpu, u64 exit_reason); 2175 2171 2176 2172 void kvm_enable_efer_bits(u64); 2177 2173 bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer); ··· 2384 2378 int kvm_add_user_return_msr(u32 msr); 2385 2379 int kvm_find_user_return_msr(u32 msr); 2386 2380 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask); 2387 - void kvm_user_return_msr_update_cache(unsigned int index, u64 val); 2388 2381 u64 kvm_get_user_return_msr(unsigned int slot); 2389 2382 2390 2383 static inline bool kvm_is_supported_user_return_msr(u32 msr)
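The new EMULTYPE_SKIP_SOFT_INT flag carries its vector in bits 16-23 of the emulation type, packed and unpacked with the two macros above. A hedged sketch of a caller (skip_soft_int_instruction() is hypothetical; x86_emulate_instruction() is the existing entry point):

    /* Skip the current instruction only if it would raise software
     * interrupt 'vector'; the vector rides along in the emulation type. */
    static int skip_soft_int_instruction(struct kvm_vcpu *vcpu, u8 vector)
    {
        int emul_type = EMULTYPE_SKIP | EMULTYPE_SKIP_SOFT_INT |
                        EMULTYPE_SET_SOFT_INT_VECTOR(vector);

        return x86_emulate_instruction(vcpu, 0, emul_type, NULL, 0);
    }

On the emulator side, EMULTYPE_GET_SOFT_INT_VECTOR() recovers the vector so it can be compared against the operand of the decoded INT n.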
+15 -15
arch/x86/include/asm/nospec-branch.h
··· 308 308 * CFLAGS.ZF. 309 309 * Note: Only the memory operand variant of VERW clears the CPU buffers. 310 310 */ 311 - .macro __CLEAR_CPU_BUFFERS feature 312 311 #ifdef CONFIG_X86_64 313 - ALTERNATIVE "", "verw x86_verw_sel(%rip)", \feature 312 + #define VERW verw x86_verw_sel(%rip) 314 313 #else 315 - /* 316 - * In 32bit mode, the memory operand must be a %cs reference. The data 317 - * segments may not be usable (vm86 mode), and the stack segment may not 318 - * be flat (ESPFIX32). 319 - */ 320 - ALTERNATIVE "", "verw %cs:x86_verw_sel", \feature 314 + /* 315 + * In 32bit mode, the memory operand must be a %cs reference. The data segments 316 + * may not be usable (vm86 mode), and the stack segment may not be flat (ESPFIX32). 317 + */ 318 + #define VERW verw %cs:x86_verw_sel 321 319 #endif 322 - .endm 323 320 321 + /* 322 + * Provide a stringified VERW macro for simple usage, and a non-stringified 323 + * VERW macro for use in more elaborate sequences, e.g. to encode a conditional 324 + * VERW within an ALTERNATIVE. 325 + */ 326 + #define __CLEAR_CPU_BUFFERS __stringify(VERW) 327 + 328 + /* If necessary, emit VERW on exit-to-userspace to clear CPU buffers. */ 324 329 #define CLEAR_CPU_BUFFERS \ 325 - __CLEAR_CPU_BUFFERS X86_FEATURE_CLEAR_CPU_BUF 326 - 327 - #define VM_CLEAR_CPU_BUFFERS \ 328 - __CLEAR_CPU_BUFFERS X86_FEATURE_CLEAR_CPU_BUF_VM 330 + ALTERNATIVE "", __CLEAR_CPU_BUFFERS, X86_FEATURE_CLEAR_CPU_BUF 329 331 330 332 #ifdef CONFIG_X86_64 331 333 .macro CLEAR_BRANCH_HISTORY ··· 581 579 DECLARE_STATIC_KEY_FALSE(cpu_buf_idle_clear); 582 580 583 581 DECLARE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush); 584 - 585 - DECLARE_STATIC_KEY_FALSE(cpu_buf_vm_clear); 586 582 587 583 extern u16 x86_verw_sel; 588 584
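Turning VERW into a bare macro lets each call site build its own ALTERNATIVE instead of relying on the fixed feature wiring of the old __CLEAR_CPU_BUFFERS asm macro, which is why the VM_CLEAR_CPU_BUFFERS wrapper can go away. A hedged C-side sketch of what a consumer can now express (clear_cpu_buffers_for_vm() is hypothetical; __CLEAR_CPU_BUFFERS and the feature bit are from the hunks above):

    /* Patch in a memory-operand VERW only when the VM-entry mitigation
     * feature bit is set; VERW clobbers ZF, hence the "cc" clobber. */
    static __always_inline void clear_cpu_buffers_for_vm(void)
    {
        asm volatile(ALTERNATIVE("", __CLEAR_CPU_BUFFERS,
                                 X86_FEATURE_CLEAR_CPU_BUF_VM)
                     : : : "cc");
    }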
+4 -1
arch/x86/include/asm/svm.h
··· 279 279 AVIC_IPI_FAILURE_INVALID_IPI_VECTOR, 280 280 }; 281 281 282 - #define AVIC_PHYSICAL_MAX_INDEX_MASK GENMASK_ULL(8, 0) 282 + #define AVIC_PHYSICAL_MAX_INDEX_MASK GENMASK_ULL(11, 0) 283 283 284 284 /* 285 285 * For AVIC, the max index allowed for physical APIC ID table is 0xfe (254), as ··· 289 289 290 290 /* 291 291 * For x2AVIC, the max index allowed for physical APIC ID table is 0x1ff (511). 292 + * With X86_FEATURE_X2AVIC_EXT, the max index is increased to 0xfff (4095). 292 293 */ 293 294 #define X2AVIC_MAX_PHYSICAL_ID 0x1FFUL 295 + #define X2AVIC_4K_MAX_PHYSICAL_ID 0xFFFUL 294 296 295 297 static_assert((AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == AVIC_MAX_PHYSICAL_ID); 296 298 static_assert((X2AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_MAX_PHYSICAL_ID); 299 + static_assert((X2AVIC_4K_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_4K_MAX_PHYSICAL_ID); 297 300 298 301 #define SVM_SEV_FEAT_SNP_ACTIVE BIT(0) 299 302 #define SVM_SEV_FEAT_RESTRICTED_INJECTION BIT(3)
+1
arch/x86/include/uapi/asm/kvm.h
··· 502 502 /* vendor-specific groups and attributes for system fd */ 503 503 #define KVM_X86_GRP_SEV 1 504 504 # define KVM_X86_SEV_VMSA_FEATURES 0 505 + # define KVM_X86_SNP_POLICY_BITS 1 505 506 506 507 struct kvm_vmx_nested_state_data { 507 508 __u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
+9 -13
arch/x86/kernel/cpu/bugs.c
··· 145 145 */ 146 146 DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush); 147 147 148 - /* 149 - * Controls CPU Fill buffer clear before VMenter. This is a subset of 150 - * X86_FEATURE_CLEAR_CPU_BUF, and should only be enabled when KVM-only 151 - * mitigation is required. 152 - */ 153 - DEFINE_STATIC_KEY_FALSE(cpu_buf_vm_clear); 154 - EXPORT_SYMBOL_FOR_KVM(cpu_buf_vm_clear); 155 - 156 148 #undef pr_fmt 157 149 #define pr_fmt(fmt) "mitigations: " fmt 158 150 ··· 341 349 IS_ENABLED(CONFIG_MITIGATION_RFDS) ? RFDS_MITIGATION_AUTO : RFDS_MITIGATION_OFF; 342 350 343 351 /* 344 - * Set if any of MDS/TAA/MMIO/RFDS are going to enable VERW clearing 345 - * through X86_FEATURE_CLEAR_CPU_BUF on kernel and guest entry. 352 + * Set if any of MDS/TAA/MMIO/RFDS are going to enable VERW clearing on exit to 353 + * userspace *and* on entry to KVM guests. 346 354 */ 347 355 static bool verw_clear_cpu_buf_mitigation_selected __ro_after_init; 348 356 ··· 388 396 if (mds_mitigation == MDS_MITIGATION_FULL || 389 397 mds_mitigation == MDS_MITIGATION_VMWERV) { 390 398 setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF); 399 + setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM); 391 400 if (!boot_cpu_has(X86_BUG_MSBDS_ONLY) && 392 401 (mds_nosmt || smt_mitigations == SMT_MITIGATIONS_ON)) 393 402 cpu_smt_disable(false); ··· 500 507 * present on host, enable the mitigation for UCODE_NEEDED as well. 501 508 */ 502 509 setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF); 510 + setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM); 503 511 504 512 if (taa_nosmt || smt_mitigations == SMT_MITIGATIONS_ON) 505 513 cpu_smt_disable(false); ··· 602 608 */ 603 609 if (verw_clear_cpu_buf_mitigation_selected) { 604 610 setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF); 605 - static_branch_disable(&cpu_buf_vm_clear); 611 + setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM); 606 612 } else { 607 - static_branch_enable(&cpu_buf_vm_clear); 613 + setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO); 608 614 } 609 615 610 616 /* ··· 693 699 694 700 static void __init rfds_apply_mitigation(void) 695 701 { 696 - if (rfds_mitigation == RFDS_MITIGATION_VERW) 702 + if (rfds_mitigation == RFDS_MITIGATION_VERW) { 697 703 setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF); 704 + setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM); 705 + } 698 706 } 699 707 700 708 static __init int rfds_parse_cmdline(char *str)
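With the static key gone, the three possible VM-entry states are all encoded as feature bits: CLEAR_CPU_BUF_VM for unconditional MDS/TAA/RFDS-style clearing, CLEAR_CPU_BUF_VM_MMIO for the MMIO-Stale-Data-only case, and neither when no VERW is needed. A hedged sketch of how a consumer could distinguish them (need_verw_before_vmenter() is hypothetical; the feature names are the ones set in the hunks above):

    static bool need_verw_before_vmenter(bool vcpu_can_access_host_mmio)
    {
        /* Unconditional clearing subsumes the MMIO-only variant. */
        if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF_VM))
            return true;

        /* MMIO Stale Data: only vCPUs that can reach host MMIO. */
        if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO))
            return vcpu_can_access_host_mmio;

        return false;
    }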
+1
arch/x86/kernel/cpu/scattered.c
··· 53 53 { X86_FEATURE_PROC_FEEDBACK, CPUID_EDX, 11, 0x80000007, 0 }, 54 54 { X86_FEATURE_AMD_FAST_CPPC, CPUID_EDX, 15, 0x80000007, 0 }, 55 55 { X86_FEATURE_MBA, CPUID_EBX, 6, 0x80000008, 0 }, 56 + { X86_FEATURE_X2AVIC_EXT, CPUID_ECX, 6, 0x8000000a, 0 }, 56 57 { X86_FEATURE_COHERENCY_SFW_NO, CPUID_EBX, 31, 0x8000001f, 0 }, 57 58 { X86_FEATURE_SMBA, CPUID_EBX, 2, 0x80000020, 0 }, 58 59 { X86_FEATURE_BMEC, CPUID_EBX, 3, 0x80000020, 0 },
+1
arch/x86/kvm/cpuid.c
··· 1135 1135 F(AMD_STIBP), 1136 1136 F(AMD_STIBP_ALWAYS_ON), 1137 1137 F(AMD_IBRS_SAME_MODE), 1138 + PASSTHROUGH_F(EFER_LMSLE_MBZ), 1138 1139 F(AMD_PSFD), 1139 1140 F(AMD_IBPB_RET), 1140 1141 );
+223 -96
arch/x86/kvm/emulate.c
··· 81 81 */ 82 82 83 83 /* Operand sizes: 8-bit operands or specified/overridden size. */ 84 - #define ByteOp (1<<0) /* 8-bit operands. */ 85 - /* Destination operand type. */ 86 - #define DstShift 1 84 + #define ByteOp (1<<0) /* 8-bit operands. */ 85 + #define DstShift 1 /* Destination operand type at bits 1-5 */ 87 86 #define ImplicitOps (OpImplicit << DstShift) 88 87 #define DstReg (OpReg << DstShift) 89 88 #define DstMem (OpMem << DstShift) ··· 94 95 #define DstDX (OpDX << DstShift) 95 96 #define DstAccLo (OpAccLo << DstShift) 96 97 #define DstMask (OpMask << DstShift) 97 - /* Source operand type. */ 98 - #define SrcShift 6 98 + #define SrcShift 6 /* Source operand type at bits 6-10 */ 99 99 #define SrcNone (OpNone << SrcShift) 100 100 #define SrcReg (OpReg << SrcShift) 101 101 #define SrcMem (OpMem << SrcShift) ··· 117 119 #define SrcAccHi (OpAccHi << SrcShift) 118 120 #define SrcMask (OpMask << SrcShift) 119 121 #define BitOp (1<<11) 120 - #define MemAbs (1<<12) /* Memory operand is absolute displacement */ 122 + #define MemAbs (1<<12) /* Memory operand is absolute displacement */ 121 123 #define String (1<<13) /* String instruction (rep capable) */ 122 124 #define Stack (1<<14) /* Stack instruction (push/pop) */ 123 - #define GroupMask (7<<15) /* Opcode uses one of the group mechanisms */ 125 + #define GroupMask (7<<15) /* Group mechanisms, at bits 15-17 */ 124 126 #define Group (1<<15) /* Bits 3:5 of modrm byte extend opcode */ 125 127 #define GroupDual (2<<15) /* Alternate decoding of mod == 3 */ 126 128 #define Prefix (3<<15) /* Instruction varies with 66/f2/f3 prefix */ ··· 129 131 #define InstrDual (6<<15) /* Alternate instruction decoding of mod == 3 */ 130 132 #define ModeDual (7<<15) /* Different instruction for 32/64 bit */ 131 133 #define Sse (1<<18) /* SSE Vector instruction */ 132 - /* Generic ModRM decode. */ 133 - #define ModRM (1<<19) 134 - /* Destination is only written; never read. */ 135 - #define Mov (1<<20) 136 - /* Misc flags */ 134 + #define ModRM (1<<19) /* Generic ModRM decode. */ 135 + #define Mov (1<<20) /* Destination is only written; never read. 
*/ 137 136 #define Prot (1<<21) /* instruction generates #UD if not in prot-mode */ 138 137 #define EmulateOnUD (1<<22) /* Emulate if unsupported by the host */ 139 138 #define NoAccess (1<<23) /* Don't access memory (lea/invlpg/verr etc) */ ··· 138 143 #define Undefined (1<<25) /* No Such Instruction */ 139 144 #define Lock (1<<26) /* lock prefix is allowed for the instruction */ 140 145 #define Priv (1<<27) /* instruction generates #GP if current CPL != 0 */ 141 - #define No64 (1<<28) 146 + #define No64 (1<<28) /* Instruction generates #UD in 64-bit mode */ 142 147 #define PageTable (1 << 29) /* instruction used to write page table */ 143 148 #define NotImpl (1 << 30) /* instruction is not implemented */ 144 - /* Source 2 operand type */ 145 - #define Src2Shift (31) 149 + #define Avx ((u64)1 << 31) /* Instruction uses VEX prefix */ 150 + #define Src2Shift (32) /* Source 2 operand type at bits 32-36 */ 146 151 #define Src2None (OpNone << Src2Shift) 147 152 #define Src2Mem (OpMem << Src2Shift) 148 153 #define Src2CL (OpCL << Src2Shift) ··· 156 161 #define Src2FS (OpFS << Src2Shift) 157 162 #define Src2GS (OpGS << Src2Shift) 158 163 #define Src2Mask (OpMask << Src2Shift) 164 + /* free: 37-39 */ 159 165 #define Mmx ((u64)1 << 40) /* MMX Vector instruction */ 160 - #define AlignMask ((u64)7 << 41) 166 + #define AlignMask ((u64)3 << 41) /* Memory alignment requirement at bits 41-42 */ 161 167 #define Aligned ((u64)1 << 41) /* Explicitly aligned (e.g. MOVDQA) */ 162 168 #define Unaligned ((u64)2 << 41) /* Explicitly unaligned (e.g. MOVDQU) */ 163 - #define Avx ((u64)3 << 41) /* Advanced Vector Extensions */ 164 - #define Aligned16 ((u64)4 << 41) /* Aligned to 16 byte boundary (e.g. FXSAVE) */ 169 + #define Aligned16 ((u64)3 << 41) /* Aligned to 16 byte boundary (e.g. 
FXSAVE) */ 170 + /* free: 43-44 */ 165 171 #define NoWrite ((u64)1 << 45) /* No writeback */ 166 172 #define SrcWrite ((u64)1 << 46) /* Write back src operand */ 167 173 #define NoMod ((u64)1 << 47) /* Mod field is ignored */ ··· 237 241 X86_TRANSFER_CALL_JMP, 238 242 X86_TRANSFER_RET, 239 243 X86_TRANSFER_TASK_SWITCH, 244 + }; 245 + 246 + enum rex_bits { 247 + REX_B = 1, 248 + REX_X = 2, 249 + REX_R = 4, 250 + REX_W = 8, 240 251 }; 241 252 242 253 static void writeback_registers(struct x86_emulate_ctxt *ctxt) ··· 625 622 626 623 switch (alignment) { 627 624 case Unaligned: 628 - case Avx: 629 625 return 1; 630 626 case Aligned16: 631 627 return 16; ··· 926 924 int byteop) 927 925 { 928 926 void *p; 929 - int highbyte_regs = (ctxt->rex_prefix == 0) && byteop; 927 + int highbyte_regs = (ctxt->rex_prefix == REX_NONE) && byteop; 930 928 931 929 if (highbyte_regs && modrm_reg >= 4 && modrm_reg < 8) 932 930 p = (unsigned char *)reg_rmw(ctxt, modrm_reg & 3) + 1; ··· 1032 1030 op->val = *(u64 *)op->addr.reg; 1033 1031 break; 1034 1032 } 1033 + op->orig_val = op->val; 1035 1034 } 1036 1035 1037 1036 static int em_fninit(struct x86_emulate_ctxt *ctxt) ··· 1078 1075 return X86EMUL_CONTINUE; 1079 1076 } 1080 1077 1081 - static void decode_register_operand(struct x86_emulate_ctxt *ctxt, 1082 - struct operand *op) 1078 + static void __decode_register_operand(struct x86_emulate_ctxt *ctxt, 1079 + struct operand *op, int reg) 1083 1080 { 1084 - unsigned int reg; 1085 - 1086 - if (ctxt->d & ModRM) 1087 - reg = ctxt->modrm_reg; 1088 - else 1089 - reg = (ctxt->b & 7) | ((ctxt->rex_prefix & 1) << 3); 1090 - 1091 - if (ctxt->d & Sse) { 1081 + if ((ctxt->d & Avx) && ctxt->op_bytes == 32) { 1082 + op->type = OP_YMM; 1083 + op->bytes = 32; 1084 + op->addr.xmm = reg; 1085 + kvm_read_avx_reg(reg, &op->vec_val2); 1086 + return; 1087 + } 1088 + if (ctxt->d & (Avx|Sse)) { 1092 1089 op->type = OP_XMM; 1093 1090 op->bytes = 16; 1094 1091 op->addr.xmm = reg; ··· 1106 1103 op->type = OP_REG; 1107 1104 op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes; 1108 1105 op->addr.reg = decode_register(ctxt, reg, ctxt->d & ByteOp); 1109 - 1110 1106 fetch_register_operand(op); 1111 - op->orig_val = op->val; 1107 + } 1108 + 1109 + static void decode_register_operand(struct x86_emulate_ctxt *ctxt, 1110 + struct operand *op) 1111 + { 1112 + unsigned int reg; 1113 + 1114 + if (ctxt->d & ModRM) 1115 + reg = ctxt->modrm_reg; 1116 + else 1117 + reg = (ctxt->b & 7) | (ctxt->rex_bits & REX_B ? 8 : 0); 1118 + 1119 + __decode_register_operand(ctxt, op, reg); 1112 1120 } 1113 1121 1114 1122 static void adjust_modrm_seg(struct x86_emulate_ctxt *ctxt, int base_reg) ··· 1136 1122 int rc = X86EMUL_CONTINUE; 1137 1123 ulong modrm_ea = 0; 1138 1124 1139 - ctxt->modrm_reg = ((ctxt->rex_prefix << 1) & 8); /* REX.R */ 1140 - index_reg = (ctxt->rex_prefix << 2) & 8; /* REX.X */ 1141 - base_reg = (ctxt->rex_prefix << 3) & 8; /* REX.B */ 1125 + ctxt->modrm_reg = (ctxt->rex_bits & REX_R ? 8 : 0); 1126 + index_reg = (ctxt->rex_bits & REX_X ? 8 : 0); 1127 + base_reg = (ctxt->rex_bits & REX_B ? 8 : 0); 1142 1128 1143 1129 ctxt->modrm_mod = (ctxt->modrm & 0xc0) >> 6; 1144 1130 ctxt->modrm_reg |= (ctxt->modrm & 0x38) >> 3; ··· 1146 1132 ctxt->modrm_seg = VCPU_SREG_DS; 1147 1133 1148 1134 if (ctxt->modrm_mod == 3 || (ctxt->d & NoMod)) { 1149 - op->type = OP_REG; 1150 - op->bytes = (ctxt->d & ByteOp) ? 
1 : ctxt->op_bytes; 1151 - op->addr.reg = decode_register(ctxt, ctxt->modrm_rm, 1152 - ctxt->d & ByteOp); 1153 - if (ctxt->d & Sse) { 1154 - op->type = OP_XMM; 1155 - op->bytes = 16; 1156 - op->addr.xmm = ctxt->modrm_rm; 1157 - kvm_read_sse_reg(ctxt->modrm_rm, &op->vec_val); 1158 - return rc; 1159 - } 1160 - if (ctxt->d & Mmx) { 1161 - op->type = OP_MM; 1162 - op->bytes = 8; 1163 - op->addr.mm = ctxt->modrm_rm & 7; 1164 - return rc; 1165 - } 1166 - fetch_register_operand(op); 1135 + __decode_register_operand(ctxt, op, ctxt->modrm_rm); 1167 1136 return rc; 1168 1137 } 1169 1138 ··· 1780 1783 op->data, 1781 1784 op->bytes * op->count); 1782 1785 case OP_XMM: 1783 - kvm_write_sse_reg(op->addr.xmm, &op->vec_val); 1786 + if (!(ctxt->d & Avx)) { 1787 + kvm_write_sse_reg(op->addr.xmm, &op->vec_val); 1788 + break; 1789 + } 1790 + /* full YMM write but with high bytes cleared */ 1791 + memset(op->valptr + 16, 0, 16); 1792 + fallthrough; 1793 + case OP_YMM: 1794 + kvm_write_avx_reg(op->addr.xmm, &op->vec_val2); 1784 1795 break; 1785 1796 case OP_MM: 1786 1797 kvm_write_mmx_reg(op->addr.mm, &op->mm_val); ··· 2471 2466 2472 2467 setup_syscalls_segments(&cs, &ss); 2473 2468 2474 - if ((ctxt->rex_prefix & 0x8) != 0x0) 2469 + if (ctxt->rex_bits & REX_W) 2475 2470 usermode = X86EMUL_MODE_PROT64; 2476 2471 else 2477 2472 usermode = X86EMUL_MODE_PROT32; ··· 3963 3958 I2bv(((_f) | DstReg | SrcMem | ModRM) & ~Lock, _e), \ 3964 3959 I2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e) 3965 3960 3961 + static const struct opcode ud = I(SrcNone, emulate_ud); 3962 + 3966 3963 static const struct opcode group7_rm0[] = { 3967 3964 N, 3968 3965 I(SrcNone | Priv | EmulateOnUD, em_hypercall), ··· 4121 4114 } }; 4122 4115 4123 4116 static const struct gprefix pfx_0f_6f_0f_7f = { 4124 - I(Mmx, em_mov), I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov), 4117 + I(Mmx, em_mov), I(Sse | Avx | Aligned, em_mov), N, I(Sse | Avx | Unaligned, em_mov), 4125 4118 }; 4126 4119 4127 4120 static const struct instr_dual instr_dual_0f_2b = { ··· 4140 4133 I(Aligned, em_mov), I(Aligned, em_mov), N, N, 4141 4134 }; 4142 4135 4143 - static const struct gprefix pfx_0f_e7 = { 4144 - N, I(Sse, em_mov), N, N, 4136 + static const struct gprefix pfx_0f_e7_0f_38_2a = { 4137 + N, I(Sse | Avx, em_mov), N, N, 4145 4138 }; 4146 4139 4147 4140 static const struct escape escape_d9 = { { ··· 4354 4347 DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N, 4355 4348 N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N, 4356 4349 /* 0x10 - 0x1F */ 4357 - GP(ModRM | DstReg | SrcMem | Mov | Sse, &pfx_0f_10_0f_11), 4358 - GP(ModRM | DstMem | SrcReg | Mov | Sse, &pfx_0f_10_0f_11), 4350 + GP(ModRM | DstReg | SrcMem | Mov | Sse | Avx, &pfx_0f_10_0f_11), 4351 + GP(ModRM | DstMem | SrcReg | Mov | Sse | Avx, &pfx_0f_10_0f_11), 4359 4352 N, N, N, N, N, N, 4360 4353 D(ImplicitOps | ModRM | SrcMem | NoAccess), /* 4 * prefetch + 4 * reserved NOP */ 4361 4354 D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N, ··· 4371 4364 IIP(ModRM | SrcMem | Priv | Op3264 | NoMod, em_dr_write, dr_write, 4372 4365 check_dr_write), 4373 4366 N, N, N, N, 4374 - GP(ModRM | DstReg | SrcMem | Mov | Sse, &pfx_0f_28_0f_29), 4375 - GP(ModRM | DstMem | SrcReg | Mov | Sse, &pfx_0f_28_0f_29), 4376 - N, GP(ModRM | DstMem | SrcReg | Mov | Sse, &pfx_0f_2b), 4367 + GP(ModRM | DstReg | SrcMem | Mov | Sse | Avx, &pfx_0f_28_0f_29), 4368 + GP(ModRM | DstMem | SrcReg | Mov | Sse | Avx, &pfx_0f_28_0f_29), 4369 + N, GP(ModRM | DstMem | SrcReg | Mov | Sse | Avx, &pfx_0f_2b), 4377 4370 N, N, N, N, 
4378 4371 /* 0x30 - 0x3F */ 4379 4372 II(ImplicitOps | Priv, em_wrmsr, wrmsr), ··· 4438 4431 /* 0xD0 - 0xDF */ 4439 4432 N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, 4440 4433 /* 0xE0 - 0xEF */ 4441 - N, N, N, N, N, N, N, GP(SrcReg | DstMem | ModRM | Mov, &pfx_0f_e7), 4434 + N, N, N, N, N, N, N, GP(SrcReg | DstMem | ModRM | Mov, &pfx_0f_e7_0f_38_2a), 4442 4435 N, N, N, N, N, N, N, N, 4443 4436 /* 0xF0 - 0xFF */ 4444 4437 N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N ··· 4465 4458 * byte. 4466 4459 */ 4467 4460 static const struct opcode opcode_map_0f_38[256] = { 4468 - /* 0x00 - 0x7f */ 4469 - X16(N), X16(N), X16(N), X16(N), X16(N), X16(N), X16(N), X16(N), 4461 + /* 0x00 - 0x1f */ 4462 + X16(N), X16(N), 4463 + /* 0x20 - 0x2f */ 4464 + X8(N), 4465 + X2(N), GP(SrcReg | DstMem | ModRM | Mov | Aligned, &pfx_0f_e7_0f_38_2a), N, N, N, N, N, 4466 + /* 0x30 - 0x7f */ 4467 + X16(N), X16(N), X16(N), X16(N), X16(N), 4470 4468 /* 0x80 - 0xef */ 4471 4469 X16(N), X16(N), X16(N), X16(N), X16(N), X16(N), X16(N), 4472 4470 /* 0xf0 - 0xf1 */ ··· 4630 4618 op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes; 4631 4619 op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX); 4632 4620 fetch_register_operand(op); 4633 - op->orig_val = op->val; 4634 4621 break; 4635 4622 case OpAccLo: 4636 4623 op->type = OP_REG; 4637 4624 op->bytes = (ctxt->d & ByteOp) ? 2 : ctxt->op_bytes; 4638 4625 op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX); 4639 4626 fetch_register_operand(op); 4640 - op->orig_val = op->val; 4641 4627 break; 4642 4628 case OpAccHi: 4643 4629 if (ctxt->d & ByteOp) { ··· 4646 4636 op->bytes = ctxt->op_bytes; 4647 4637 op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RDX); 4648 4638 fetch_register_operand(op); 4649 - op->orig_val = op->val; 4650 4639 break; 4651 4640 case OpDI: 4652 4641 op->type = OP_MEM; ··· 4764 4755 return rc; 4765 4756 } 4766 4757 4758 + static int x86_decode_avx(struct x86_emulate_ctxt *ctxt, 4759 + u8 vex_1st, u8 vex_2nd, struct opcode *opcode) 4760 + { 4761 + u8 vex_3rd, map, pp, l, v; 4762 + int rc = X86EMUL_CONTINUE; 4763 + 4764 + if (ctxt->rep_prefix || ctxt->op_prefix || ctxt->rex_prefix) 4765 + goto ud; 4766 + 4767 + if (vex_1st == 0xc5) { 4768 + /* Expand RVVVVlpp to VEX3 format */ 4769 + vex_3rd = vex_2nd & ~0x80; /* VVVVlpp from VEX2, w=0 */ 4770 + vex_2nd = (vex_2nd & 0x80) | 0x61; /* R from VEX2, X=1 B=1 mmmmm=00001 */ 4771 + } else { 4772 + vex_3rd = insn_fetch(u8, ctxt); 4773 + } 4774 + 4775 + /* vex_2nd = RXBmmmmm, vex_3rd = wVVVVlpp. Fix polarity */ 4776 + vex_2nd ^= 0xE0; /* binary 11100000 */ 4777 + vex_3rd ^= 0x78; /* binary 01111000 */ 4778 + 4779 + ctxt->rex_prefix = REX_PREFIX; 4780 + ctxt->rex_bits = (vex_2nd & 0xE0) >> 5; /* RXB */ 4781 + ctxt->rex_bits |= (vex_3rd & 0x80) >> 4; /* w */ 4782 + if (ctxt->rex_bits && ctxt->mode != X86EMUL_MODE_PROT64) 4783 + goto ud; 4784 + 4785 + map = vex_2nd & 0x1f; 4786 + v = (vex_3rd >> 3) & 0xf; 4787 + l = vex_3rd & 0x4; 4788 + pp = vex_3rd & 0x3; 4789 + 4790 + ctxt->b = insn_fetch(u8, ctxt); 4791 + switch (map) { 4792 + case 1: 4793 + ctxt->opcode_len = 2; 4794 + *opcode = twobyte_table[ctxt->b]; 4795 + break; 4796 + case 2: 4797 + ctxt->opcode_len = 3; 4798 + *opcode = opcode_map_0f_38[ctxt->b]; 4799 + break; 4800 + case 3: 4801 + /* no 0f 3a instructions are supported yet */ 4802 + return X86EMUL_UNHANDLEABLE; 4803 + default: 4804 + goto ud; 4805 + } 4806 + 4807 + /* 4808 + * No three operand instructions are supported yet; those that 4809 + * *are* marked with the Avx flag reserve the VVVV flag. 
4810 + */ 4811 + if (v) 4812 + goto ud; 4813 + 4814 + if (l) 4815 + ctxt->op_bytes = 32; 4816 + else 4817 + ctxt->op_bytes = 16; 4818 + 4819 + switch (pp) { 4820 + case 0: break; 4821 + case 1: ctxt->op_prefix = true; break; 4822 + case 2: ctxt->rep_prefix = 0xf3; break; 4823 + case 3: ctxt->rep_prefix = 0xf2; break; 4824 + } 4825 + 4826 + done: 4827 + return rc; 4828 + ud: 4829 + *opcode = ud; 4830 + return rc; 4831 + } 4832 + 4767 4833 int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int emulation_type) 4768 4834 { 4769 4835 int rc = X86EMUL_CONTINUE; 4770 4836 int mode = ctxt->mode; 4771 4837 int def_op_bytes, def_ad_bytes, goffset, simd_prefix; 4772 - bool op_prefix = false; 4838 + bool vex_prefix = false; 4773 4839 bool has_seg_override = false; 4774 4840 struct opcode opcode; 4775 4841 u16 dummy; ··· 4896 4812 for (;;) { 4897 4813 switch (ctxt->b = insn_fetch(u8, ctxt)) { 4898 4814 case 0x66: /* operand-size override */ 4899 - op_prefix = true; 4815 + ctxt->op_prefix = true; 4900 4816 /* switch between 2/4 bytes */ 4901 4817 ctxt->op_bytes = def_op_bytes ^ 6; 4902 4818 break; ··· 4935 4851 case 0x40 ... 0x4f: /* REX */ 4936 4852 if (mode != X86EMUL_MODE_PROT64) 4937 4853 goto done_prefixes; 4938 - ctxt->rex_prefix = ctxt->b; 4854 + ctxt->rex_prefix = REX_PREFIX; 4855 + ctxt->rex_bits = ctxt->b & 0xf; 4939 4856 continue; 4940 4857 case 0xf0: /* LOCK */ 4941 4858 ctxt->lock_prefix = 1; ··· 4950 4865 } 4951 4866 4952 4867 /* Any legacy prefix after a REX prefix nullifies its effect. */ 4953 - 4954 - ctxt->rex_prefix = 0; 4868 + ctxt->rex_prefix = REX_NONE; 4869 + ctxt->rex_bits = 0; 4955 4870 } 4956 4871 4957 4872 done_prefixes: 4958 4873 4959 4874 /* REX prefix. */ 4960 - if (ctxt->rex_prefix & 8) 4961 - ctxt->op_bytes = 8; /* REX.W */ 4875 + if (ctxt->rex_bits & REX_W) 4876 + ctxt->op_bytes = 8; 4962 4877 4963 4878 /* Opcode byte(s). */ 4964 - opcode = opcode_table[ctxt->b]; 4965 - /* Two-byte opcode? */ 4966 - if (ctxt->b == 0x0f) { 4879 + if (ctxt->b == 0xc4 || ctxt->b == 0xc5) { 4880 + /* VEX or LDS/LES */ 4881 + u8 vex_2nd = insn_fetch(u8, ctxt); 4882 + if (mode != X86EMUL_MODE_PROT64 && (vex_2nd & 0xc0) != 0xc0) { 4883 + opcode = opcode_table[ctxt->b]; 4884 + ctxt->modrm = vex_2nd; 4885 + /* the Mod/RM byte has been fetched already! */ 4886 + goto done_modrm; 4887 + } 4888 + 4889 + vex_prefix = true; 4890 + rc = x86_decode_avx(ctxt, ctxt->b, vex_2nd, &opcode); 4891 + if (rc != X86EMUL_CONTINUE) 4892 + goto done; 4893 + } else if (ctxt->b == 0x0f) { 4894 + /* Two- or three-byte opcode */ 4967 4895 ctxt->opcode_len = 2; 4968 4896 ctxt->b = insn_fetch(u8, ctxt); 4969 4897 opcode = twobyte_table[ctxt->b]; ··· 4987 4889 ctxt->b = insn_fetch(u8, ctxt); 4988 4890 opcode = opcode_map_0f_38[ctxt->b]; 4989 4891 } 4892 + } else { 4893 + /* Opcode byte(s). 
*/ 4894 + opcode = opcode_table[ctxt->b]; 4990 4895 } 4991 - ctxt->d = opcode.flags; 4992 4896 4993 - if (ctxt->d & ModRM) 4897 + if (opcode.flags & ModRM) 4994 4898 ctxt->modrm = insn_fetch(u8, ctxt); 4995 4899 4996 - /* vex-prefix instructions are not implemented */ 4997 - if (ctxt->opcode_len == 1 && (ctxt->b == 0xc5 || ctxt->b == 0xc4) && 4998 - (mode == X86EMUL_MODE_PROT64 || (ctxt->modrm & 0xc0) == 0xc0)) { 4999 - ctxt->d = NotImpl; 5000 - } 5001 - 4900 + done_modrm: 4901 + ctxt->d = opcode.flags; 5002 4902 while (ctxt->d & GroupMask) { 5003 4903 switch (ctxt->d & GroupMask) { 5004 4904 case Group: ··· 5015 4919 opcode = opcode.u.group[goffset]; 5016 4920 break; 5017 4921 case Prefix: 5018 - if (ctxt->rep_prefix && op_prefix) 4922 + if (ctxt->rep_prefix && ctxt->op_prefix) 5019 4923 return EMULATION_FAILED; 5020 - simd_prefix = op_prefix ? 0x66 : ctxt->rep_prefix; 4924 + simd_prefix = ctxt->op_prefix ? 0x66 : ctxt->rep_prefix; 5021 4925 switch (simd_prefix) { 5022 4926 case 0x00: opcode = opcode.u.gprefix->pfx_no; break; 5023 4927 case 0x66: opcode = opcode.u.gprefix->pfx_66; break; ··· 5061 4965 /* Unrecognised? */ 5062 4966 if (ctxt->d == 0) 5063 4967 return EMULATION_FAILED; 4968 + 4969 + if (unlikely(vex_prefix)) { 4970 + /* 4971 + * Only specifically marked instructions support VEX. Since many 4972 + * instructions support it but are not annotated, return not implemented 4973 + * rather than #UD. 4974 + */ 4975 + if (!(ctxt->d & Avx)) 4976 + return EMULATION_FAILED; 4977 + 4978 + if (!(ctxt->d & AlignMask)) 4979 + ctxt->d |= Unaligned; 4980 + } 5064 4981 5065 4982 ctxt->execute = opcode.u.execute; 5066 4983 ··· 5145 5036 if ((ctxt->d & No16) && ctxt->op_bytes == 2) 5146 5037 ctxt->op_bytes = 4; 5147 5038 5148 - if (ctxt->d & Sse) 5149 - ctxt->op_bytes = 16; 5039 + if (vex_prefix) 5040 + ; 5041 + else if (ctxt->d & Sse) 5042 + ctxt->op_bytes = 16, ctxt->d &= ~Avx; 5150 5043 else if (ctxt->d & Mmx) 5151 5044 ctxt->op_bytes = 8; 5152 5045 } ··· 5248 5137 { 5249 5138 /* Clear fields that are set conditionally but read without a guard. 
*/ 5250 5139 ctxt->rip_relative = false; 5251 - ctxt->rex_prefix = 0; 5140 + ctxt->rex_prefix = REX_NONE; 5141 + ctxt->rex_bits = 0; 5252 5142 ctxt->lock_prefix = 0; 5143 + ctxt->op_prefix = false; 5253 5144 ctxt->rep_prefix = 0; 5254 5145 ctxt->regs_valid = 0; 5255 5146 ctxt->regs_dirty = 0; ··· 5281 5168 } 5282 5169 5283 5170 if (unlikely(ctxt->d & 5284 - (No64|Undefined|Sse|Mmx|Intercept|CheckPerm|Priv|Prot|String))) { 5171 + (No64|Undefined|Avx|Sse|Mmx|Intercept|CheckPerm|Priv|Prot|String))) { 5285 5172 if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) || 5286 5173 (ctxt->d & Undefined)) { 5287 5174 rc = emulate_ud(ctxt); 5288 5175 goto done; 5289 5176 } 5290 5177 5291 - if (((ctxt->d & (Sse|Mmx)) && ((ops->get_cr(ctxt, 0) & X86_CR0_EM))) 5292 - || ((ctxt->d & Sse) && !(ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR))) { 5178 + if ((ctxt->d & (Avx|Sse|Mmx)) && ((ops->get_cr(ctxt, 0) & X86_CR0_EM))) { 5293 5179 rc = emulate_ud(ctxt); 5294 5180 goto done; 5295 5181 } 5296 5182 5297 - if ((ctxt->d & (Sse|Mmx)) && (ops->get_cr(ctxt, 0) & X86_CR0_TS)) { 5183 + if (ctxt->d & Avx) { 5184 + u64 xcr = 0; 5185 + if (!(ops->get_cr(ctxt, 4) & X86_CR4_OSXSAVE) 5186 + || ops->get_xcr(ctxt, 0, &xcr) 5187 + || !(xcr & XFEATURE_MASK_YMM)) { 5188 + rc = emulate_ud(ctxt); 5189 + goto done; 5190 + } 5191 + } else if (ctxt->d & Sse) { 5192 + if (!(ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR)) { 5193 + rc = emulate_ud(ctxt); 5194 + goto done; 5195 + } 5196 + } 5197 + 5198 + if ((ctxt->d & (Avx|Sse|Mmx)) && (ops->get_cr(ctxt, 0) & X86_CR0_TS)) { 5298 5199 rc = emulate_nm(ctxt); 5299 5200 goto done; 5300 5201 }
+66
arch/x86/kvm/fpu.h
··· 15 15 #define sse128_l3(x) ({ __sse128_u t; t.vec = x; t.as_u32[3]; }) 16 16 #define sse128(lo, hi) ({ __sse128_u t; t.as_u64[0] = lo; t.as_u64[1] = hi; t.vec; }) 17 17 18 + typedef u32 __attribute__((vector_size(32))) avx256_t; 19 + 20 + static inline void _kvm_read_avx_reg(int reg, avx256_t *data) 21 + { 22 + switch (reg) { 23 + case 0: asm("vmovdqa %%ymm0, %0" : "=m"(*data)); break; 24 + case 1: asm("vmovdqa %%ymm1, %0" : "=m"(*data)); break; 25 + case 2: asm("vmovdqa %%ymm2, %0" : "=m"(*data)); break; 26 + case 3: asm("vmovdqa %%ymm3, %0" : "=m"(*data)); break; 27 + case 4: asm("vmovdqa %%ymm4, %0" : "=m"(*data)); break; 28 + case 5: asm("vmovdqa %%ymm5, %0" : "=m"(*data)); break; 29 + case 6: asm("vmovdqa %%ymm6, %0" : "=m"(*data)); break; 30 + case 7: asm("vmovdqa %%ymm7, %0" : "=m"(*data)); break; 31 + #ifdef CONFIG_X86_64 32 + case 8: asm("vmovdqa %%ymm8, %0" : "=m"(*data)); break; 33 + case 9: asm("vmovdqa %%ymm9, %0" : "=m"(*data)); break; 34 + case 10: asm("vmovdqa %%ymm10, %0" : "=m"(*data)); break; 35 + case 11: asm("vmovdqa %%ymm11, %0" : "=m"(*data)); break; 36 + case 12: asm("vmovdqa %%ymm12, %0" : "=m"(*data)); break; 37 + case 13: asm("vmovdqa %%ymm13, %0" : "=m"(*data)); break; 38 + case 14: asm("vmovdqa %%ymm14, %0" : "=m"(*data)); break; 39 + case 15: asm("vmovdqa %%ymm15, %0" : "=m"(*data)); break; 40 + #endif 41 + default: BUG(); 42 + } 43 + } 44 + 45 + static inline void _kvm_write_avx_reg(int reg, const avx256_t *data) 46 + { 47 + switch (reg) { 48 + case 0: asm("vmovdqa %0, %%ymm0" : : "m"(*data)); break; 49 + case 1: asm("vmovdqa %0, %%ymm1" : : "m"(*data)); break; 50 + case 2: asm("vmovdqa %0, %%ymm2" : : "m"(*data)); break; 51 + case 3: asm("vmovdqa %0, %%ymm3" : : "m"(*data)); break; 52 + case 4: asm("vmovdqa %0, %%ymm4" : : "m"(*data)); break; 53 + case 5: asm("vmovdqa %0, %%ymm5" : : "m"(*data)); break; 54 + case 6: asm("vmovdqa %0, %%ymm6" : : "m"(*data)); break; 55 + case 7: asm("vmovdqa %0, %%ymm7" : : "m"(*data)); break; 56 + #ifdef CONFIG_X86_64 57 + case 8: asm("vmovdqa %0, %%ymm8" : : "m"(*data)); break; 58 + case 9: asm("vmovdqa %0, %%ymm9" : : "m"(*data)); break; 59 + case 10: asm("vmovdqa %0, %%ymm10" : : "m"(*data)); break; 60 + case 11: asm("vmovdqa %0, %%ymm11" : : "m"(*data)); break; 61 + case 12: asm("vmovdqa %0, %%ymm12" : : "m"(*data)); break; 62 + case 13: asm("vmovdqa %0, %%ymm13" : : "m"(*data)); break; 63 + case 14: asm("vmovdqa %0, %%ymm14" : : "m"(*data)); break; 64 + case 15: asm("vmovdqa %0, %%ymm15" : : "m"(*data)); break; 65 + #endif 66 + default: BUG(); 67 + } 68 + } 69 + 18 70 static inline void _kvm_read_sse_reg(int reg, sse128_t *data) 19 71 { 20 72 switch (reg) { ··· 159 107 static inline void kvm_fpu_put(void) 160 108 { 161 109 fpregs_unlock(); 110 + } 111 + 112 + static inline void kvm_read_avx_reg(int reg, avx256_t *data) 113 + { 114 + kvm_fpu_get(); 115 + _kvm_read_avx_reg(reg, data); 116 + kvm_fpu_put(); 117 + } 118 + 119 + static inline void kvm_write_avx_reg(int reg, const avx256_t *data) 120 + { 121 + kvm_fpu_get(); 122 + _kvm_write_avx_reg(reg, data); 123 + kvm_fpu_put(); 162 124 } 163 125 164 126 static inline void kvm_read_sse_reg(int reg, sse128_t *data)
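The register number must be a compile-time constant inside the asm template, hence the switch rather than an indexed access; the aligned vmovdqa forms also require the 32-byte alignment that the vector_size(32) type already guarantees. A hedged usage sketch (copy_ymm() is hypothetical):

    /* Copy one guest YMM register to another during emulation; the
     * kvm_read/write_avx_reg() wrappers bracket the access with
     * kvm_fpu_get()/kvm_fpu_put() so the guest FPU state is live. */
    static void copy_ymm(int dst, int src)
    {
        avx256_t val;   /* vector_size(32) => 32-byte aligned, as vmovdqa needs */

        kvm_read_avx_reg(src, &val);
        kvm_write_avx_reg(dst, &val);
    }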
+1 -1
arch/x86/kvm/hyperv.c
··· 1568 1568 * only, there can be valuable data in the rest which needs 1569 1569 * to be preserved e.g. on migration. 1570 1570 */ 1571 - if (__put_user(0, (u32 __user *)addr)) 1571 + if (put_user(0, (u32 __user *)addr)) 1572 1572 return 1; 1573 1573 hv_vcpu->hv_vapic = data; 1574 1574 kvm_vcpu_mark_page_dirty(vcpu, gfn);
+16 -4
arch/x86/kvm/kvm_emulate.h
··· 237 237 bool (*is_smm)(struct x86_emulate_ctxt *ctxt); 238 238 int (*leave_smm)(struct x86_emulate_ctxt *ctxt); 239 239 void (*triple_fault)(struct x86_emulate_ctxt *ctxt); 240 + int (*get_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 *xcr); 240 241 int (*set_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr); 241 242 242 243 gva_t (*get_untagged_addr)(struct x86_emulate_ctxt *ctxt, gva_t addr, ··· 249 248 250 249 /* Type, address-of, and value of an instruction's operand. */ 251 250 struct operand { 252 - enum { OP_REG, OP_MEM, OP_MEM_STR, OP_IMM, OP_XMM, OP_MM, OP_NONE } type; 251 + enum { OP_REG, OP_MEM, OP_MEM_STR, OP_IMM, OP_XMM, OP_YMM, OP_MM, OP_NONE } type; 253 252 unsigned int bytes; 254 253 unsigned int count; 255 254 union { ··· 268 267 union { 269 268 unsigned long val; 270 269 u64 val64; 271 - char valptr[sizeof(sse128_t)]; 270 + char valptr[sizeof(avx256_t)]; 272 271 sse128_t vec_val; 272 + avx256_t vec_val2; 273 273 u64 mm_val; 274 274 void *data; 275 - }; 275 + } __aligned(32); 276 276 }; 277 277 278 278 #define X86_MAX_INSTRUCTION_LENGTH 15 ··· 319 317 #define NR_EMULATOR_GPRS 8 320 318 #endif 321 319 320 + /* 321 + * Distinguish between no prefix, REX, or in the future REX2. 322 + */ 323 + enum rex_type { 324 + REX_NONE, 325 + REX_PREFIX, 326 + }; 327 + 322 328 struct x86_emulate_ctxt { 323 329 void *vcpu; 324 330 const struct x86_emulate_ops *ops; ··· 358 348 u8 opcode_len; 359 349 u8 b; 360 350 u8 intercept; 351 + bool op_prefix; 361 352 u8 op_bytes; 362 353 u8 ad_bytes; 363 354 union { ··· 368 357 int (*check_perm)(struct x86_emulate_ctxt *ctxt); 369 358 370 359 bool rip_relative; 371 - u8 rex_prefix; 360 + enum rex_type rex_prefix; 361 + u8 rex_bits; 372 362 u8 lock_prefix; 373 363 u8 rep_prefix; 374 364 /* bitmaps of registers in _regs[] that can be read */
+31 -13
arch/x86/kvm/lapic.c
··· 2126 2126 2127 2127 static void advance_periodic_target_expiration(struct kvm_lapic *apic) 2128 2128 { 2129 + struct kvm_timer *ktimer = &apic->lapic_timer; 2129 2130 ktime_t now = ktime_get(); 2130 2131 u64 tscl = rdtsc(); 2131 2132 ktime_t delta; 2132 2133 2133 2134 /* 2134 - * Synchronize both deadlines to the same time source or 2135 - * differences in the periods (caused by differences in the 2136 - * underlying clocks or numerical approximation errors) will 2137 - * cause the two to drift apart over time as the errors 2138 - * accumulate. 2135 + * Use kernel time as the time source for both the hrtimer deadline and 2136 + * TSC-based deadline so that they stay synchronized. Computing each 2137 + * deadline independently will cause the two deadlines to drift apart 2138 + * over time as differences in the periods accumulate, e.g. due to 2139 + * differences in the underlying clocks or numerical approximation errors. 2139 2140 */ 2140 - apic->lapic_timer.target_expiration = 2141 - ktime_add_ns(apic->lapic_timer.target_expiration, 2142 - apic->lapic_timer.period); 2143 - delta = ktime_sub(apic->lapic_timer.target_expiration, now); 2144 - apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) + 2145 - nsec_to_cycles(apic->vcpu, delta); 2141 + ktimer->target_expiration = ktime_add_ns(ktimer->target_expiration, 2142 + ktimer->period); 2143 + 2144 + /* 2145 + * If the new expiration is in the past, e.g. because userspace stopped 2146 + * running the VM for an extended duration, then force the expiration 2147 + * to "now" and don't try to play catch-up with the missed events. KVM 2148 + * will only deliver a single interrupt regardless of how many events 2149 + * are pending, i.e. restarting the timer with an expiration in the 2150 + * past will do nothing more than waste host cycles, and can even lead 2151 + * to a hard lockup in extreme cases. 2152 + */ 2153 + if (ktime_before(ktimer->target_expiration, now)) 2154 + ktimer->target_expiration = now; 2155 + 2156 + /* 2157 + * Note, ensuring the expiration isn't in the past also prevents delta 2158 + * from going negative, which could cause the TSC deadline to become 2159 + * excessively large due to it being an unsigned value. 2160 + */ 2161 + delta = ktime_sub(ktimer->target_expiration, now); 2162 + ktimer->tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) + 2163 + nsec_to_cycles(apic->vcpu, delta); 2146 2164 } 2147 2165 2148 2166 static void start_sw_period(struct kvm_lapic *apic) ··· 2988 2970 2989 2971 apic_timer_expired(apic, true); 2990 2972 2991 - if (lapic_is_periodic(apic)) { 2973 + if (lapic_is_periodic(apic) && !WARN_ON_ONCE(!apic->lapic_timer.period)) { 2992 2974 advance_periodic_target_expiration(apic); 2993 - hrtimer_add_expires_ns(&ktimer->timer, ktimer->period); 2975 + hrtimer_set_expires(&ktimer->timer, ktimer->target_expiration); 2994 2976 return HRTIMER_RESTART; 2995 2977 } else 2996 2978 return HRTIMER_NORESTART;
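The clamp above is the core of the hard-lockup fix: a periodic timer that went unserviced for a long stretch no longer re-arms in the past once per missed period. A standalone sketch of the arithmetic (next_expiration() is a hypothetical model; units are nanoseconds):

    #include <stdint.h>

    static uint64_t next_expiration(uint64_t expiration, uint64_t period,
                                    uint64_t now)
    {
        expiration += period;
        /* Missed events collapse into a single interrupt; re-arming in
         * the past would just fire again immediately, over and over. */
        if (expiration < now)
            expiration = now;
        return expiration;
    }

Switching the hrtimer restart path to hrtimer_set_expires(..., target_expiration) then makes the hrtimer honor the same clamped deadline instead of blindly adding another period.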
+1 -4
arch/x86/kvm/mmu.h
··· 235 235 return -(u32)fault & errcode; 236 236 } 237 237 238 - bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm); 239 - 240 238 int kvm_mmu_post_init_vm(struct kvm *kvm); 241 239 void kvm_mmu_pre_destroy_vm(struct kvm *kvm); 242 240 ··· 255 257 #define tdp_mmu_enabled false 256 258 #endif 257 259 258 - bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa); 259 - int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level); 260 + int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn); 260 261 261 262 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm) 262 263 {
+88 -6
arch/x86/kvm/mmu/mmu.c
··· 4859 4859 */ 4860 4860 BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK)); 4861 4861 4862 - vcpu->arch.l1tf_flush_l1d = true; 4862 + kvm_request_l1tf_flush_l1d(); 4863 4863 if (!flags) { 4864 4864 trace_kvm_page_fault(vcpu, fault_address, error_code); 4865 4865 ··· 4924 4924 return direct_page_fault(vcpu, fault); 4925 4925 } 4926 4926 4927 - int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level) 4927 + static int kvm_tdp_page_prefault(struct kvm_vcpu *vcpu, gpa_t gpa, 4928 + u64 error_code, u8 *level) 4928 4929 { 4929 4930 int r; 4930 4931 ··· 4967 4966 return -EIO; 4968 4967 } 4969 4968 } 4970 - EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_tdp_map_page); 4971 4969 4972 4970 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu, 4973 4971 struct kvm_pre_fault_memory *range) ··· 5002 5002 * Shadow paging uses GVA for kvm page fault, so restrict to 5003 5003 * two-dimensional paging. 5004 5004 */ 5005 - r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level); 5005 + r = kvm_tdp_page_prefault(vcpu, range->gpa | direct_bits, error_code, &level); 5006 5006 if (r < 0) 5007 5007 return r; 5008 5008 ··· 5013 5013 end = (range->gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level); 5014 5014 return min(range->size, end - range->gpa); 5015 5015 } 5016 + 5017 + #ifdef CONFIG_KVM_GUEST_MEMFD 5018 + static void kvm_assert_gmem_invalidate_lock_held(struct kvm_memory_slot *slot) 5019 + { 5020 + #ifdef CONFIG_PROVE_LOCKING 5021 + if (WARN_ON_ONCE(!kvm_slot_has_gmem(slot)) || 5022 + WARN_ON_ONCE(!slot->gmem.file) || 5023 + WARN_ON_ONCE(!file_count(slot->gmem.file))) 5024 + return; 5025 + 5026 + lockdep_assert_held(&file_inode(slot->gmem.file)->i_mapping->invalidate_lock); 5027 + #endif 5028 + } 5029 + 5030 + int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) 5031 + { 5032 + struct kvm_page_fault fault = { 5033 + .addr = gfn_to_gpa(gfn), 5034 + .error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS, 5035 + .prefetch = true, 5036 + .is_tdp = true, 5037 + .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm), 5038 + 5039 + .max_level = PG_LEVEL_4K, 5040 + .req_level = PG_LEVEL_4K, 5041 + .goal_level = PG_LEVEL_4K, 5042 + .is_private = true, 5043 + 5044 + .gfn = gfn, 5045 + .slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn), 5046 + .pfn = pfn, 5047 + .map_writable = true, 5048 + }; 5049 + struct kvm *kvm = vcpu->kvm; 5050 + int r; 5051 + 5052 + lockdep_assert_held(&kvm->slots_lock); 5053 + 5054 + /* 5055 + * Mapping a pre-determined private pfn is intended only for use when 5056 + * populating a guest_memfd instance. Assert that the slot is backed 5057 + * by guest_memfd and that the gmem instance's invalidate_lock is held. 
5058 + */ 5059 + kvm_assert_gmem_invalidate_lock_held(fault.slot); 5060 + 5061 + if (KVM_BUG_ON(!tdp_mmu_enabled, kvm)) 5062 + return -EIO; 5063 + 5064 + if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn)) 5065 + return -EPERM; 5066 + 5067 + r = kvm_mmu_reload(vcpu); 5068 + if (r) 5069 + return r; 5070 + 5071 + r = mmu_topup_memory_caches(vcpu, false); 5072 + if (r) 5073 + return r; 5074 + 5075 + do { 5076 + if (signal_pending(current)) 5077 + return -EINTR; 5078 + 5079 + if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) 5080 + return -EIO; 5081 + 5082 + cond_resched(); 5083 + 5084 + guard(read_lock)(&kvm->mmu_lock); 5085 + 5086 + r = kvm_tdp_mmu_map(vcpu, &fault); 5087 + } while (r == RET_PF_RETRY); 5088 + 5089 + if (r != RET_PF_FIXED) 5090 + return -EIO; 5091 + 5092 + return 0; 5093 + } 5094 + EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_tdp_mmu_map_private_pfn); 5095 + #endif 5016 5096 5017 5097 static void nonpaging_init_context(struct kvm_mmu *context) 5018 5098 { ··· 6077 5997 out: 6078 5998 return r; 6079 5999 } 6080 - EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_load); 6081 6000 6082 6001 void kvm_mmu_unload(struct kvm_vcpu *vcpu) 6083 6002 { ··· 6942 6863 6943 6864 write_unlock(&kvm->mmu_lock); 6944 6865 } 6866 + EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_zap_gfn_range); 6945 6867 6946 6868 static bool slot_rmap_write_protect(struct kvm *kvm, 6947 6869 struct kvm_rmap_head *rmap_head, ··· 7284 7204 7285 7205 return need_tlb_flush; 7286 7206 } 7287 - EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_zap_gfn_range); 7288 7207 7289 7208 static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm, 7290 7209 const struct kvm_memory_slot *slot) ··· 7442 7363 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen) 7443 7364 { 7444 7365 WARN_ON_ONCE(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS); 7366 + 7367 + if (!enable_mmio_caching) 7368 + return; 7445 7369 7446 7370 gen &= MMIO_SPTE_GEN_MASK; 7447 7371
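The retry loop in kvm_tdp_mmu_map_private_pfn() above leans on the scoped-lock helpers from <linux/cleanup.h>: guard(read_lock) releases mmu_lock automatically at the end of each do-while iteration, so signals, VM-death requests, and rescheduling are all handled unlocked. A minimal sketch of the pattern, with a hypothetical try_once() standing in for kvm_tdp_mmu_map():

	/* Sketch only; try_once() is a made-up stand-in for the real work. */
	static int retry_under_read_lock(struct kvm *kvm)
	{
		do {
			if (signal_pending(current))
				return -EINTR;
			cond_resched();

			/* Dropped automatically when this scope (one iteration) ends. */
			guard(read_lock)(&kvm->mmu_lock);
			if (try_once(kvm))
				return 0;
		} while (true);
	}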
-10
arch/x86/kvm/mmu/mmu_internal.h
··· 39 39 #define INVALID_PAE_ROOT 0 40 40 #define IS_VALID_PAE_ROOT(x) (!!(x)) 41 41 42 - static inline hpa_t kvm_mmu_get_dummy_root(void) 43 - { 44 - return my_zero_pfn(0) << PAGE_SHIFT; 45 - } 46 - 47 - static inline bool kvm_mmu_is_dummy_root(hpa_t shadow_page) 48 - { 49 - return is_zero_pfn(shadow_page >> PAGE_SHIFT); 50 - } 51 - 52 42 typedef u64 __rcu *tdp_ptep_t; 53 43 54 44 struct kvm_mmu_page {
+1 -1
arch/x86/kvm/mmu/paging_tmpl.h
··· 402 402 goto error; 403 403 404 404 ptep_user = (pt_element_t __user *)((void *)host_addr + offset); 405 - if (unlikely(__get_user(pte, ptep_user))) 405 + if (unlikely(get_user(pte, ptep_user))) 406 406 goto error; 407 407 walker->ptep_user[walker->level - 1] = ptep_user; 408 408
+1 -1
arch/x86/kvm/mmu/spte.c
··· 292 292 mark_page_dirty_in_slot(vcpu->kvm, slot, gfn); 293 293 } 294 294 295 - if (static_branch_unlikely(&cpu_buf_vm_clear) && 295 + if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO) && 296 296 !kvm_vcpu_can_access_host_mmio(vcpu) && 297 297 kvm_is_mmio_pfn(pfn, &is_host_mmio)) 298 298 kvm_track_host_mmio_mapping(vcpu);
+10
arch/x86/kvm/mmu/spte.h
··· 246 246 */ 247 247 extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask; 248 248 249 + static inline hpa_t kvm_mmu_get_dummy_root(void) 250 + { 251 + return my_zero_pfn(0) << PAGE_SHIFT; 252 + } 253 + 254 + static inline bool kvm_mmu_is_dummy_root(hpa_t shadow_page) 255 + { 256 + return is_zero_pfn(shadow_page >> PAGE_SHIFT); 257 + } 258 + 249 259 static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page) 250 260 { 251 261 struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
+10 -40
arch/x86/kvm/mmu/tdp_mmu.c
··· 362 362 static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte, 363 363 int level) 364 364 { 365 - kvm_pfn_t old_pfn = spte_to_pfn(old_spte); 366 - int ret; 367 - 368 365 /* 369 366 * External (TDX) SPTEs are limited to PG_LEVEL_4K, and external 370 367 * PTs are removed in a special order, involving free_external_spt(). ··· 374 377 375 378 /* Zapping leaf spte is allowed only when write lock is held. */ 376 379 lockdep_assert_held_write(&kvm->mmu_lock); 377 - /* Because write lock is held, operation should success. */ 378 - ret = kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_pfn); 379 - KVM_BUG_ON(ret, kvm); 380 + 381 + kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_spte); 380 382 } 381 383 382 384 /** ··· 515 519 bool was_present = is_shadow_present_pte(old_spte); 516 520 bool is_present = is_shadow_present_pte(new_spte); 517 521 bool is_leaf = is_present && is_last_spte(new_spte, level); 518 - kvm_pfn_t new_pfn = spte_to_pfn(new_spte); 519 522 int ret = 0; 520 523 521 524 KVM_BUG_ON(was_present, kvm); ··· 533 538 * external page table, or leaf. 534 539 */ 535 540 if (is_leaf) { 536 - ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_pfn); 541 + ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte); 537 542 } else { 538 543 void *external_spt = get_external_spt(gfn, new_spte, level); 539 544 ··· 1268 1273 struct kvm_mmu_page *sp; 1269 1274 int ret = RET_PF_RETRY; 1270 1275 1276 + KVM_MMU_WARN_ON(!root || root->role.invalid); 1277 + 1271 1278 kvm_mmu_hugepage_adjust(vcpu, fault); 1272 1279 1273 1280 trace_kvm_mmu_spte_requested(fault); ··· 1936 1939 * 1937 1940 * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}. 1938 1941 */ 1939 - static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, 1940 - struct kvm_mmu_page *root) 1942 + int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, 1943 + int *root_level) 1941 1944 { 1945 + struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa); 1942 1946 struct tdp_iter iter; 1943 1947 gfn_t gfn = addr >> PAGE_SHIFT; 1944 1948 int leaf = -1; 1949 + 1950 + *root_level = vcpu->arch.mmu->root_role.level; 1945 1951 1946 1952 for_each_tdp_pte(iter, vcpu->kvm, root, gfn, gfn + 1) { 1947 1953 leaf = iter.level; ··· 1953 1953 1954 1954 return leaf; 1955 1955 } 1956 - 1957 - int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, 1958 - int *root_level) 1959 - { 1960 - struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa); 1961 - *root_level = vcpu->arch.mmu->root_role.level; 1962 - 1963 - return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root); 1964 - } 1965 - 1966 - bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa) 1967 - { 1968 - struct kvm *kvm = vcpu->kvm; 1969 - bool is_direct = kvm_is_addr_direct(kvm, gpa); 1970 - hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa : 1971 - vcpu->arch.mmu->mirror_root_hpa; 1972 - u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte; 1973 - int leaf; 1974 - 1975 - lockdep_assert_held(&kvm->mmu_lock); 1976 - rcu_read_lock(); 1977 - leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root)); 1978 - rcu_read_unlock(); 1979 - if (leaf < 0) 1980 - return false; 1981 - 1982 - spte = sptes[leaf]; 1983 - return is_shadow_present_pte(spte) && is_last_spte(spte, leaf); 1984 - } 1985 - EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_tdp_mmu_gpa_is_mapped); 1986 1956 1987 1957 /* 1988 1958 * Returns the last level spte pointer of the shadow page walk for the given
+70 -16
arch/x86/kvm/svm/avic.c
··· 106 106 static bool next_vm_id_wrapped = 0; 107 107 static DEFINE_SPINLOCK(svm_vm_data_hash_lock); 108 108 static bool x2avic_enabled; 109 - 109 + static u32 x2avic_max_physical_id; 110 110 111 111 static void avic_set_x2apic_msr_interception(struct vcpu_svm *svm, 112 112 bool intercept) ··· 158 158 svm->x2avic_msrs_intercepted = intercept; 159 159 } 160 160 161 + static u32 __avic_get_max_physical_id(struct kvm *kvm, struct kvm_vcpu *vcpu) 162 + { 163 + u32 arch_max; 164 + 165 + /* 166 + * Return the largest size (x2APIC) when querying without a vCPU, e.g. 167 + * to allocate the per-VM table. 168 + */ 169 + if (x2avic_enabled && (!vcpu || apic_x2apic_mode(vcpu->arch.apic))) 170 + arch_max = x2avic_max_physical_id; 171 + else 172 + arch_max = AVIC_MAX_PHYSICAL_ID; 173 + 174 + /* 175 + * Despite its name, KVM_CAP_MAX_VCPU_ID represents the maximum APIC ID 176 + * plus one, so the max possible APIC ID is one less than that. 177 + */ 178 + return min(kvm->arch.max_vcpu_ids - 1, arch_max); 179 + } 180 + 181 + static u32 avic_get_max_physical_id(struct kvm_vcpu *vcpu) 182 + { 183 + return __avic_get_max_physical_id(vcpu->kvm, vcpu); 184 + } 185 + 161 186 static void avic_activate_vmcb(struct vcpu_svm *svm) 162 187 { 163 188 struct vmcb *vmcb = svm->vmcb01.ptr; 189 + struct kvm_vcpu *vcpu = &svm->vcpu; 164 190 165 191 vmcb->control.int_ctl &= ~(AVIC_ENABLE_MASK | X2APIC_MODE_MASK); 192 + 166 193 vmcb->control.avic_physical_id &= ~AVIC_PHYSICAL_MAX_INDEX_MASK; 194 + vmcb->control.avic_physical_id |= avic_get_max_physical_id(vcpu); 167 195 168 196 vmcb->control.int_ctl |= AVIC_ENABLE_MASK; ··· 204 176 */ 205 177 if (x2avic_enabled && apic_x2apic_mode(svm->vcpu.arch.apic)) { 206 178 vmcb->control.int_ctl |= X2APIC_MODE_MASK; 207 - vmcb->control.avic_physical_id |= X2AVIC_MAX_PHYSICAL_ID; 179 + 208 180 /* Disabling MSR intercept for x2APIC registers */ 209 181 avic_set_x2apic_msr_interception(svm, false); 210 182 } else { ··· 214 186 */ 215 187 kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, &svm->vcpu); 216 188 217 - /* For xAVIC and hybrid-xAVIC modes */ 218 - vmcb->control.avic_physical_id |= AVIC_MAX_PHYSICAL_ID; 219 189 /* Enabling MSR intercept for x2APIC registers */ 220 190 avic_set_x2apic_msr_interception(svm, true); 221 191 } ··· 273 247 return 0; 274 248 } 275 249 250 + static int avic_get_physical_id_table_order(struct kvm *kvm) 251 + { 252 + /* Provision for the maximum physical ID supported in x2avic mode */ 253 + return get_order((__avic_get_max_physical_id(kvm, NULL) + 1) * sizeof(u64)); 254 + } 255 + 256 + int avic_alloc_physical_id_table(struct kvm *kvm) 257 + { 258 + struct kvm_svm *kvm_svm = to_kvm_svm(kvm); 259 + 260 + if (!irqchip_in_kernel(kvm) || !enable_apicv) 261 + return 0; 262 + 263 + if (kvm_svm->avic_physical_id_table) 264 + return 0; 265 + 266 + kvm_svm->avic_physical_id_table = (void *)__get_free_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 267 + avic_get_physical_id_table_order(kvm)); 268 + if (!kvm_svm->avic_physical_id_table) 269 + return -ENOMEM; 270 + 271 + return 0; 272 + } 273 + 276 274 void avic_vm_destroy(struct kvm *kvm) 277 275 { 278 276 unsigned long flags; ··· 306 256 return; 307 257 308 258 free_page((unsigned long)kvm_svm->avic_logical_id_table); 309 - free_page((unsigned long)kvm_svm->avic_physical_id_table); 259 + free_pages((unsigned long)kvm_svm->avic_physical_id_table, 260 + avic_get_physical_id_table_order(kvm)); 310 261 311 262 spin_lock_irqsave(&svm_vm_data_hash_lock, flags); 312 263 hash_del(&kvm_svm->hnode); ··· 324 273 325 274 if 
(!enable_apicv) 326 275 return 0; 327 - 328 - kvm_svm->avic_physical_id_table = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); 329 - if (!kvm_svm->avic_physical_id_table) 330 - goto free_avic; 331 276 332 277 kvm_svm->avic_logical_id_table = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); 333 278 if (!kvm_svm->avic_logical_id_table) ··· 389 342 * fully initialized AVIC. 390 343 */ 391 344 if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) || 392 - (id > X2AVIC_MAX_PHYSICAL_ID)) { 345 + (id > x2avic_max_physical_id)) { 393 346 kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG); 394 347 vcpu->arch.apic->apicv_active = false; 395 348 return 0; ··· 609 562 u32 icrh = svm->vmcb->control.exit_info_1 >> 32; 610 563 u32 icrl = svm->vmcb->control.exit_info_1; 611 564 u32 id = svm->vmcb->control.exit_info_2 >> 32; 612 - u32 index = svm->vmcb->control.exit_info_2 & 0x1FF; 565 + u32 index = svm->vmcb->control.exit_info_2 & AVIC_PHYSICAL_MAX_INDEX_MASK; 613 566 struct kvm_lapic *apic = vcpu->arch.apic; 614 567 615 568 trace_kvm_avic_incomplete_ipi(vcpu->vcpu_id, icrh, icrl, id, index); ··· 1009 962 if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK)) 1010 963 return; 1011 964 1012 - if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE)) 965 + if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= 966 + PAGE_SIZE << avic_get_physical_id_table_order(vcpu->kvm))) 1013 967 return; 1014 968 1015 969 /* ··· 1072 1024 1073 1025 lockdep_assert_preemption_disabled(); 1074 1026 1075 - if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE)) 1027 + if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= 1028 + PAGE_SIZE << avic_get_physical_id_table_order(vcpu->kvm))) 1076 1029 return; 1077 1030 1078 1031 /* ··· 1275 1226 1276 1227 /* AVIC is a prerequisite for x2AVIC. */ 1277 1228 x2avic_enabled = boot_cpu_has(X86_FEATURE_X2AVIC); 1278 - if (x2avic_enabled) 1279 - pr_info("x2AVIC enabled\n"); 1280 - else 1229 + if (x2avic_enabled) { 1230 + if (cpu_feature_enabled(X86_FEATURE_X2AVIC_EXT)) 1231 + x2avic_max_physical_id = X2AVIC_4K_MAX_PHYSICAL_ID; 1232 + else 1233 + x2avic_max_physical_id = X2AVIC_MAX_PHYSICAL_ID; 1234 + pr_info("x2AVIC enabled (max %u vCPUs)\n", x2avic_max_physical_id + 1); 1235 + } else { 1281 1236 svm_x86_ops.allow_apicv_in_x2apic_without_x2apic_virtualization = true; 1237 + } 1282 1238 1283 1239 /* 1284 1240 * Disable IPI virtualization for AMD Family 17h CPUs (Zen1 and Zen2)
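The table sizing above is plain arithmetic: one 8-byte entry per physical APIC ID, rounded up to a power-of-two page order. A standalone sketch (userspace C, assuming 4KiB pages and, as the names suggest, AVIC_MAX_PHYSICAL_ID == 255 and a 4K-extended x2AVIC maximum of 4095) of what avic_get_physical_id_table_order() computes:

	#include <stdio.h>

	/* Round a byte count up to a power-of-two count of 4KiB pages, like get_order(). */
	static int page_order(unsigned long bytes)
	{
		unsigned long pages = (bytes + 4095) / 4096;
		int order = 0;

		while ((1UL << order) < pages)
			order++;
		return order;
	}

	int main(void)
	{
		/* Legacy xAVIC: 256 IDs * 8 bytes = 2KiB -> order 0, one page. */
		printf("xAVIC: order %d\n", page_order(256UL * 8));
		/* 4K-extended x2AVIC: 4096 IDs * 8 bytes = 32KiB -> order 3. */
		printf("x2AVIC ext: order %d\n", page_order(4096UL * 8));
		return 0;
	}

This is why the single get_zeroed_page() allocation had to become __get_free_pages()/free_pages() with an explicit order.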
+2 -10
arch/x86/kvm/svm/nested.c
··· 613 613 struct kvm_vcpu *vcpu = &svm->vcpu; 614 614 615 615 nested_vmcb02_compute_g_pat(svm); 616 + vmcb_mark_dirty(vmcb02, VMCB_NPT); 616 617 617 618 /* Load the nested guest state */ 618 619 if (svm->nested.vmcb12_gpa != svm->nested.last_vmcb12_gpa) { ··· 752 751 vmcb02->control.nested_ctl = vmcb01->control.nested_ctl; 753 752 vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa; 754 753 vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa; 754 + vmcb_mark_dirty(vmcb02, VMCB_PERM_MAP); 755 755 756 756 /* 757 757 * Stash vmcb02's counter if the guest hasn't moved past the guilty ··· 1432 1430 case SVM_EXIT_IOIO: 1433 1431 vmexit = nested_svm_intercept_ioio(svm); 1434 1432 break; 1435 - case SVM_EXIT_READ_CR0 ... SVM_EXIT_WRITE_CR8: { 1436 - if (vmcb12_is_intercept(&svm->nested.ctl, exit_code)) 1437 - vmexit = NESTED_EXIT_DONE; 1438 - break; 1439 - } 1440 - case SVM_EXIT_READ_DR0 ... SVM_EXIT_WRITE_DR7: { 1441 - if (vmcb12_is_intercept(&svm->nested.ctl, exit_code)) 1442 - vmexit = NESTED_EXIT_DONE; 1443 - break; 1444 - } 1445 1433 case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: { 1446 1434 /* 1447 1435 * Host-intercepted exceptions have been checked already in
+28 -17
arch/x86/kvm/svm/sev.c
··· 65 65 #define AP_RESET_HOLD_NAE_EVENT 1 66 66 #define AP_RESET_HOLD_MSR_PROTO 2 67 67 68 - /* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */ 69 - #define SNP_POLICY_MASK_API_MINOR GENMASK_ULL(7, 0) 70 - #define SNP_POLICY_MASK_API_MAJOR GENMASK_ULL(15, 8) 71 - #define SNP_POLICY_MASK_SMT BIT_ULL(16) 72 - #define SNP_POLICY_MASK_RSVD_MBO BIT_ULL(17) 73 - #define SNP_POLICY_MASK_DEBUG BIT_ULL(19) 74 - #define SNP_POLICY_MASK_SINGLE_SOCKET BIT_ULL(20) 68 + /* 69 + * SEV-SNP policy bits that can be supported by KVM. These include policy bits 70 + * that have implementation support within KVM or policy bits that do not 71 + * require implementation support within KVM to enforce the policy. 72 + */ 73 + #define KVM_SNP_POLICY_MASK_VALID (SNP_POLICY_MASK_API_MINOR | \ 74 + SNP_POLICY_MASK_API_MAJOR | \ 75 + SNP_POLICY_MASK_SMT | \ 76 + SNP_POLICY_MASK_RSVD_MBO | \ 77 + SNP_POLICY_MASK_DEBUG | \ 78 + SNP_POLICY_MASK_SINGLE_SOCKET | \ 79 + SNP_POLICY_MASK_CXL_ALLOW | \ 80 + SNP_POLICY_MASK_MEM_AES_256_XTS | \ 81 + SNP_POLICY_MASK_RAPL_DIS | \ 82 + SNP_POLICY_MASK_CIPHERTEXT_HIDING_DRAM | \ 83 + SNP_POLICY_MASK_PAGE_SWAP_DISABLE) 75 84 76 - #define SNP_POLICY_MASK_VALID (SNP_POLICY_MASK_API_MINOR | \ 77 - SNP_POLICY_MASK_API_MAJOR | \ 78 - SNP_POLICY_MASK_SMT | \ 79 - SNP_POLICY_MASK_RSVD_MBO | \ 80 - SNP_POLICY_MASK_DEBUG | \ 81 - SNP_POLICY_MASK_SINGLE_SOCKET) 85 + static u64 snp_supported_policy_bits __ro_after_init; 82 86 83 87 #define INITIAL_VMSA_GPA 0xFFFFFFFFF000 84 88 ··· 2147 2143 *val = sev_supported_vmsa_features; 2148 2144 return 0; 2149 2145 2146 + case KVM_X86_SNP_POLICY_BITS: 2147 + *val = snp_supported_policy_bits; 2148 + return 0; 2149 + 2150 2150 default: 2151 2151 return -ENXIO; 2152 2152 } ··· 2215 2207 if (params.flags) 2216 2208 return -EINVAL; 2217 2209 2218 - if (params.policy & ~SNP_POLICY_MASK_VALID) 2210 + if (params.policy & ~snp_supported_policy_bits) 2219 2211 return -EINVAL; 2220 2212 2221 2213 /* Check for policy bits that must be set */ ··· 3108 3100 else if (sev_snp_supported) 3109 3101 sev_snp_supported = is_sev_snp_initialized(); 3110 3102 3111 - if (sev_snp_supported) 3103 + if (sev_snp_supported) { 3104 + snp_supported_policy_bits = sev_get_snp_policy_bits() & 3105 + KVM_SNP_POLICY_MASK_VALID; 3112 3106 nr_ciphertext_hiding_asids = init_args.max_snp_asid; 3107 + } 3113 3108 3114 3109 /* 3115 3110 * If ciphertext hiding is enabled, the joint SEV-ES/SEV-SNP ··· 5096 5085 5097 5086 /* Check if the SEV policy allows debugging */ 5098 5087 if (sev_snp_guest(vcpu->kvm)) { 5099 - if (!(sev->policy & SNP_POLICY_DEBUG)) 5088 + if (!(sev->policy & SNP_POLICY_MASK_DEBUG)) 5100 5089 return NULL; 5101 5090 } else { 5102 - if (sev->policy & SEV_POLICY_NODBG) 5091 + if (sev->policy & SEV_POLICY_MASK_NODBG) 5103 5092 return NULL; 5104 5093 } 5105 5094
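With the supported policy mask now derived from firmware capabilities, userspace can query KVM_X86_SNP_POLICY_BITS up front instead of probing with launches that fail -EINVAL. A rough sketch of such a query (assumptions, not confirmed by this diff: the attribute is read via KVM_GET_DEVICE_ATTR on the /dev/kvm fd and lives in the same attribute group as the existing SEV attributes):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int get_snp_policy_bits(int kvm_fd, uint64_t *bits)
	{
		struct kvm_device_attr attr = {
			.group = KVM_X86_GRP_SEV,	/* assumed group */
			.attr = KVM_X86_SNP_POLICY_BITS,
			.addr = (uint64_t)(unsigned long)bits,
		};

		return ioctl(kvm_fd, KVM_GET_DEVICE_ATTR, &attr);
	}

A VMM would then mask its requested policy against the returned bits before issuing KVM_SEV_SNP_LAUNCH_START.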
+65 -48
arch/x86/kvm/svm/svm.c
··· 272 272 } 273 273 274 274 static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu, 275 + int emul_type, 275 276 bool commit_side_effects) 276 277 { 277 278 struct vcpu_svm *svm = to_svm(vcpu); ··· 294 293 if (unlikely(!commit_side_effects)) 295 294 old_rflags = svm->vmcb->save.rflags; 296 295 297 - if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP)) 296 + if (!kvm_emulate_instruction(vcpu, emul_type)) 298 297 return 0; 299 298 300 299 if (unlikely(!commit_side_effects)) ··· 312 311 313 312 static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu) 314 313 { 315 - return __svm_skip_emulated_instruction(vcpu, true); 314 + return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true); 316 315 } 317 316 318 - static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu) 317 + static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu, u8 vector) 319 318 { 319 + const int emul_type = EMULTYPE_SKIP | EMULTYPE_SKIP_SOFT_INT | 320 + EMULTYPE_SET_SOFT_INT_VECTOR(vector); 320 321 unsigned long rip, old_rip = kvm_rip_read(vcpu); 321 322 struct vcpu_svm *svm = to_svm(vcpu); 322 323 ··· 334 331 * in use, the skip must not commit any side effects such as clearing 335 332 * the interrupt shadow or RFLAGS.RF. 336 333 */ 337 - if (!__svm_skip_emulated_instruction(vcpu, !nrips)) 334 + if (!__svm_skip_emulated_instruction(vcpu, emul_type, !nrips)) 338 335 return -EIO; 339 336 340 337 rip = kvm_rip_read(vcpu); ··· 370 367 kvm_deliver_exception_payload(vcpu, ex); 371 368 372 369 if (kvm_exception_is_soft(ex->vector) && 373 - svm_update_soft_interrupt_rip(vcpu)) 370 + svm_update_soft_interrupt_rip(vcpu, ex->vector)) 374 371 return; 375 372 376 373 svm->vmcb->control.event_inj = ex->vector ··· 1199 1196 { 1200 1197 svm->current_vmcb = target_vmcb; 1201 1198 svm->vmcb = target_vmcb->ptr; 1199 + } 1200 + 1201 + static int svm_vcpu_precreate(struct kvm *kvm) 1202 + { 1203 + return avic_alloc_physical_id_table(kvm); 1202 1204 } 1203 1205 1204 1206 static int svm_vcpu_create(struct kvm_vcpu *vcpu) ··· 3450 3442 3451 3443 static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code) 3452 3444 { 3453 - vcpu_unimpl(vcpu, "svm: unexpected exit reason 0x%llx\n", exit_code); 3454 3445 dump_vmcb(vcpu); 3455 - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; 3456 - vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON; 3457 - vcpu->run->internal.ndata = 2; 3458 - vcpu->run->internal.data[0] = exit_code; 3459 - vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu; 3446 + kvm_prepare_unexpected_reason_exit(vcpu, exit_code); 3460 3447 return 0; 3461 3448 } 3462 3449 ··· 3636 3633 3637 3634 static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected) 3638 3635 { 3636 + struct kvm_queued_interrupt *intr = &vcpu->arch.interrupt; 3639 3637 struct vcpu_svm *svm = to_svm(vcpu); 3640 3638 u32 type; 3641 3639 3642 - if (vcpu->arch.interrupt.soft) { 3643 - if (svm_update_soft_interrupt_rip(vcpu)) 3640 + if (intr->soft) { 3641 + if (svm_update_soft_interrupt_rip(vcpu, intr->nr)) 3644 3642 return; 3645 3643 3646 3644 type = SVM_EVTINJ_TYPE_SOFT; ··· 3649 3645 type = SVM_EVTINJ_TYPE_INTR; 3650 3646 } 3651 3647 3652 - trace_kvm_inj_virq(vcpu->arch.interrupt.nr, 3653 - vcpu->arch.interrupt.soft, reinjected); 3648 + trace_kvm_inj_virq(intr->nr, intr->soft, reinjected); 3654 3649 ++vcpu->stat.irq_injections; 3655 3650 3656 - svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr | 3657 - SVM_EVTINJ_VALID | type; 3651 + svm->vmcb->control.event_inj = intr->nr | 
SVM_EVTINJ_VALID | type; 3658 3652 } 3659 3653 3660 3654 void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode, ··· 4253 4251 svm_set_dr6(vcpu, DR6_ACTIVE_LOW); 4254 4252 4255 4253 clgi(); 4256 - kvm_load_guest_xsave_state(vcpu); 4257 4254 4258 4255 /* 4259 4256 * Hardware only context switches DEBUGCTL if LBR virtualization is ··· 4295 4294 vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl) 4296 4295 update_debugctlmsr(vcpu->arch.host_debugctl); 4297 4296 4298 - kvm_load_host_xsave_state(vcpu); 4299 4297 stgi(); 4300 4298 4301 4299 /* Any pending NMI will happen here */ ··· 4325 4325 kvm_read_and_reset_apf_flags(); 4326 4326 4327 4327 vcpu->arch.regs_avail &= ~SVM_REGS_LAZY_LOAD_SET; 4328 - 4329 - /* 4330 - * We need to handle MC intercepts here before the vcpu has a chance to 4331 - * change the physical cpu 4332 - */ 4333 - if (unlikely(svm->vmcb->control.exit_code == 4334 - SVM_EXIT_EXCP_BASE + MC_VECTOR)) 4335 - svm_handle_mce(vcpu); 4336 4328 4337 4329 trace_kvm_exit(vcpu, KVM_ISA_SVM); 4338 4330 ··· 4518 4526 case SVM_EXIT_WRITE_CR0: { 4519 4527 unsigned long cr0, val; 4520 4528 4521 - if (info->intercept == x86_intercept_cr_write) 4529 + /* 4530 + * Adjust the exit code accordingly if a CR other than CR0 is 4531 + * being written, and skip straight to the common handling as 4532 + * only CR0 has an additional selective intercept. 4533 + */ 4534 + if (info->intercept == x86_intercept_cr_write && info->modrm_reg) { 4522 4535 icpt_info.exit_code += info->modrm_reg; 4523 - 4524 - if (icpt_info.exit_code != SVM_EXIT_WRITE_CR0 || 4525 - info->intercept == x86_intercept_clts) 4526 4536 break; 4527 - 4528 - if (!(vmcb12_is_intercept(&svm->nested.ctl, 4529 - INTERCEPT_SELECTIVE_CR0))) 4530 - break; 4531 - 4532 - cr0 = vcpu->arch.cr0 & ~SVM_CR0_SELECTIVE_MASK; 4533 - val = info->src_val & ~SVM_CR0_SELECTIVE_MASK; 4534 - 4535 - if (info->intercept == x86_intercept_lmsw) { 4536 - cr0 &= 0xfUL; 4537 - val &= 0xfUL; 4538 - /* lmsw can't clear PE - catch this here */ 4539 - if (cr0 & X86_CR0_PE) 4540 - val |= X86_CR0_PE; 4541 4537 } 4542 4538 4539 + /* 4540 + * Convert the exit_code to SVM_EXIT_CR0_SEL_WRITE if a 4541 + * selective CR0 intercept is triggered (the common logic will 4542 + * treat the selective intercept as being enabled). Note, the 4543 + * unconditional intercept has higher priority, i.e. this is 4544 + * only relevant if *only* the selective intercept is enabled. 4545 + */ 4546 + if (vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_CR0_WRITE) || 4547 + !(vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_SELECTIVE_CR0))) 4548 + break; 4549 + 4550 + /* CLTS never triggers INTERCEPT_SELECTIVE_CR0 */ 4551 + if (info->intercept == x86_intercept_clts) 4552 + break; 4553 + 4554 + /* LMSW always triggers INTERCEPT_SELECTIVE_CR0 */ 4555 + if (info->intercept == x86_intercept_lmsw) { 4556 + icpt_info.exit_code = SVM_EXIT_CR0_SEL_WRITE; 4557 + break; 4558 + } 4559 + 4560 + /* 4561 + * MOV-to-CR0 only triggers INTERCEPT_SELECTIVE_CR0 if any bit 4562 + * other than SVM_CR0_SELECTIVE_MASK is changed. 
4563 + */ 4564 + cr0 = vcpu->arch.cr0 & ~SVM_CR0_SELECTIVE_MASK; 4565 + val = info->src_val & ~SVM_CR0_SELECTIVE_MASK; 4543 4566 if (cr0 ^ val) 4544 4567 icpt_info.exit_code = SVM_EXIT_CR0_SEL_WRITE; 4545 - 4546 4568 break; 4547 4569 } 4548 4570 case SVM_EXIT_READ_DR0: ··· 4628 4622 4629 4623 static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu) 4630 4624 { 4631 - if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_INTR) 4625 + switch (to_svm(vcpu)->vmcb->control.exit_code) { 4626 + case SVM_EXIT_EXCP_BASE + MC_VECTOR: 4627 + svm_handle_mce(vcpu); 4628 + break; 4629 + case SVM_EXIT_INTR: 4632 4630 vcpu->arch.at_instruction_boundary = true; 4631 + break; 4632 + default: 4633 + break; 4634 + } 4633 4635 } 4634 4636 4635 4637 static void svm_setup_mce(struct kvm_vcpu *vcpu) ··· 5026 5012 .emergency_disable_virtualization_cpu = svm_emergency_disable_virtualization_cpu, 5027 5013 .has_emulated_msr = svm_has_emulated_msr, 5028 5014 5015 + .vcpu_precreate = svm_vcpu_precreate, 5029 5016 .vcpu_create = svm_vcpu_create, 5030 5017 .vcpu_free = svm_vcpu_free, 5031 5018 .vcpu_reset = svm_vcpu_reset, ··· 5331 5316 5332 5317 if (nested) { 5333 5318 pr_info("Nested Virtualization enabled\n"); 5334 - kvm_enable_efer_bits(EFER_SVME | EFER_LMSLE); 5319 + kvm_enable_efer_bits(EFER_SVME); 5320 + if (!boot_cpu_has(X86_FEATURE_EFER_LMSLE_MBZ)) 5321 + kvm_enable_efer_bits(EFER_LMSLE); 5335 5322 5336 5323 r = nested_svm_init_msrpm_merge_offsets(); 5337 5324 if (r)
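The restructured intercept check above encodes three architectural rules: CLTS never triggers the selective CR0 intercept, LMSW always does, and MOV-to-CR0 does so only when a bit outside the selective mask changes. A compact sketch of the MOV case (assuming SVM_CR0_SELECTIVE_MASK is CR0.TS | CR0.MP, per KVM's existing definition):

	#define X86_CR0_MP	(1UL << 1)
	#define X86_CR0_TS	(1UL << 3)
	#define SVM_CR0_SELECTIVE_MASK	(X86_CR0_TS | X86_CR0_MP)

	/* Sketch: would this MOV-to-CR0 trip INTERCEPT_SELECTIVE_CR0? */
	static int mov_to_cr0_trips_selective(unsigned long old_cr0, unsigned long new_cr0)
	{
		return (old_cr0 & ~SVM_CR0_SELECTIVE_MASK) !=
		       (new_cr0 & ~SVM_CR0_SELECTIVE_MASK);
	}

E.g. toggling only CR0.TS is not intercepted selectively, while flipping CR0.PE is.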
+1 -3
arch/x86/kvm/svm/svm.h
··· 117 117 cpumask_var_t have_run_cpus; /* CPUs that have done VMRUN for this VM. */ 118 118 }; 119 119 120 - #define SEV_POLICY_NODBG BIT_ULL(0) 121 - #define SNP_POLICY_DEBUG BIT_ULL(19) 122 - 123 120 struct kvm_svm { 124 121 struct kvm kvm; 125 122 ··· 804 807 805 808 bool __init avic_hardware_setup(void); 806 809 void avic_hardware_unsetup(void); 810 + int avic_alloc_physical_id_table(struct kvm *kvm); 807 811 void avic_vm_destroy(struct kvm *kvm); 808 812 int avic_vm_init(struct kvm *kvm); 809 813 void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb);
+41 -12
arch/x86/kvm/svm/vmenter.S
··· 52 52 * there must not be any returns or indirect branches between this code 53 53 * and vmentry. 54 54 */ 55 - movl SVM_spec_ctrl(%_ASM_DI), %eax 56 - cmp PER_CPU_VAR(x86_spec_ctrl_current), %eax 55 + #ifdef CONFIG_X86_64 56 + mov SVM_spec_ctrl(%rdi), %rdx 57 + cmp PER_CPU_VAR(x86_spec_ctrl_current), %rdx 57 58 je 801b 59 + movl %edx, %eax 60 + shr $32, %rdx 61 + #else 62 + mov SVM_spec_ctrl(%edi), %eax 63 + mov PER_CPU_VAR(x86_spec_ctrl_current), %ecx 64 + xor %eax, %ecx 65 + mov SVM_spec_ctrl + 4(%edi), %edx 66 + mov PER_CPU_VAR(x86_spec_ctrl_current + 4), %esi 67 + xor %edx, %esi 68 + or %esi, %ecx 69 + je 801b 70 + #endif 58 71 mov $MSR_IA32_SPEC_CTRL, %ecx 59 - xor %edx, %edx 60 72 wrmsr 61 73 jmp 801b 62 74 .endm ··· 93 81 jnz 998f 94 82 rdmsr 95 83 movl %eax, SVM_spec_ctrl(%_ASM_DI) 84 + movl %edx, SVM_spec_ctrl + 4(%_ASM_DI) 96 85 998: 97 - 98 86 /* Now restore the host value of the MSR if different from the guest's. */ 99 - movl PER_CPU_VAR(x86_spec_ctrl_current), %eax 100 - cmp SVM_spec_ctrl(%_ASM_DI), %eax 87 + #ifdef CONFIG_X86_64 88 + mov PER_CPU_VAR(x86_spec_ctrl_current), %rdx 89 + cmp SVM_spec_ctrl(%rdi), %rdx 101 90 je 901b 102 - xor %edx, %edx 91 + movl %edx, %eax 92 + shr $32, %rdx 93 + #else 94 + mov PER_CPU_VAR(x86_spec_ctrl_current), %eax 95 + mov SVM_spec_ctrl(%edi), %esi 96 + xor %eax, %esi 97 + mov PER_CPU_VAR(x86_spec_ctrl_current + 4), %edx 98 + mov SVM_spec_ctrl + 4(%edi), %edi 99 + xor %edx, %edi 100 + or %edi, %esi 101 + je 901b 102 + #endif 103 103 wrmsr 104 104 jmp 901b 105 105 .endm 106 106 107 + #define SVM_CLEAR_CPU_BUFFERS \ 108 + ALTERNATIVE "", __CLEAR_CPU_BUFFERS, X86_FEATURE_CLEAR_CPU_BUF_VM 107 109 108 110 /** 109 111 * __svm_vcpu_run - Run a vCPU via a transition to SVM guest mode ··· 160 134 mov %_ASM_ARG1, %_ASM_DI 161 135 .endif 162 136 163 - /* Clobbers RAX, RCX, RDX. */ 137 + /* Clobbers RAX, RCX, RDX (and ESI on 32-bit), consumes RDI (@svm). */ 164 138 RESTORE_GUEST_SPEC_CTRL 165 139 166 140 /* ··· 196 170 mov VCPU_RDI(%_ASM_DI), %_ASM_DI 197 171 198 172 /* Clobbers EFLAGS.ZF */ 199 - VM_CLEAR_CPU_BUFFERS 173 + SVM_CLEAR_CPU_BUFFERS 200 174 201 175 /* Enter guest mode */ 202 176 3: vmrun %_ASM_AX ··· 237 211 /* IMPORTANT: Stuff the RSB immediately after VM-Exit, before RET! */ 238 212 FILL_RETURN_BUFFER %_ASM_AX, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_VMEXIT 239 213 240 - /* Clobbers RAX, RCX, RDX. */ 214 + /* 215 + * Clobbers RAX, RCX, RDX (and ESI, EDI on 32-bit), consumes RDI (@svm) 216 + * and RSP (pointer to @spec_ctrl_intercepted). 217 + */ 241 218 RESTORE_HOST_SPEC_CTRL 242 219 243 220 /* ··· 360 331 mov %rdi, SEV_ES_RDI (%rdx) 361 332 mov %rsi, SEV_ES_RSI (%rdx) 362 333 363 - /* Clobbers RAX, RCX, RDX (@hostsa). */ 334 + /* Clobbers RAX, RCX, and RDX (@hostsa), consumes RDI (@svm). */ 364 335 RESTORE_GUEST_SPEC_CTRL 365 336 366 337 /* Get svm->current_vmcb->pa into RAX. */ ··· 368 339 mov KVM_VMCB_pa(%rax), %rax 369 340 370 341 /* Clobbers EFLAGS.ZF */ 371 - VM_CLEAR_CPU_BUFFERS 342 + SVM_CLEAR_CPU_BUFFERS 372 343 373 344 /* Enter guest mode */ 374 345 1: vmrun %rax
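For readability, the 64-bit half of the widened SPEC_CTRL handling above is equivalent to the following C (a paraphrase only; the real logic must remain hand-written asm because it runs after the RSB fill and before any returns or indirect branches are safe):

	/* C paraphrase of RESTORE_GUEST_SPEC_CTRL on CONFIG_X86_64. */
	static void restore_guest_spec_ctrl(struct vcpu_svm *svm)
	{
		u64 guest = svm->spec_ctrl;

		/* Skip the expensive WRMSR when guest and host values match. */
		if (guest != this_cpu_read(x86_spec_ctrl_current))
			wrmsrl(MSR_IA32_SPEC_CTRL, guest);
	}

The 32-bit path needs the extra XOR/OR dance because it must compare the two 32-bit MSR halves without a 64-bit register.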
+9
arch/x86/kvm/vmx/main.c
··· 831 831 return tdx_vcpu_ioctl(vcpu, argp); 832 832 } 833 833 834 + static int vt_vcpu_mem_enc_unlocked_ioctl(struct kvm_vcpu *vcpu, void __user *argp) 835 + { 836 + if (!is_td_vcpu(vcpu)) 837 + return -EINVAL; 838 + 839 + return tdx_vcpu_unlocked_ioctl(vcpu, argp); 840 + } 841 + 834 842 static int vt_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, 835 843 bool is_private) 836 844 { ··· 1013 1005 1014 1006 .mem_enc_ioctl = vt_op_tdx_only(mem_enc_ioctl), 1015 1007 .vcpu_mem_enc_ioctl = vt_op_tdx_only(vcpu_mem_enc_ioctl), 1008 + .vcpu_mem_enc_unlocked_ioctl = vt_op_tdx_only(vcpu_mem_enc_unlocked_ioctl), 1016 1009 1017 1010 .gmem_max_mapping_level = vt_op_tdx_only(gmem_max_mapping_level) 1018 1011 };
+61 -112
arch/x86/kvm/vmx/nested.c
··· 23 23 static bool __read_mostly enable_shadow_vmcs = 1; 24 24 module_param_named(enable_shadow_vmcs, enable_shadow_vmcs, bool, S_IRUGO); 25 25 26 - static bool __read_mostly nested_early_check = 0; 27 - module_param(nested_early_check, bool, S_IRUGO); 26 + static bool __ro_after_init warn_on_missed_cc; 27 + module_param(warn_on_missed_cc, bool, 0444); 28 28 29 29 #define CC KVM_NESTED_VMENTER_CONSISTENCY_CHECK 30 30 ··· 555 555 if (CC(!page_address_valid(vcpu, vmcs12->virtual_apic_page_addr))) 556 556 return -EINVAL; 557 557 558 + if (CC(!nested_cpu_has_vid(vmcs12) && vmcs12->tpr_threshold >> 4)) 559 + return -EINVAL; 560 + 558 561 return 0; 559 562 } 560 563 ··· 764 761 vmcs12->vmcs_link_pointer, VMCS12_SIZE)) 765 762 return; 766 763 767 - kvm_read_guest_cached(vmx->vcpu.kvm, ghc, get_shadow_vmcs12(vcpu), 764 + kvm_read_guest_cached(vcpu->kvm, ghc, get_shadow_vmcs12(vcpu), 768 765 VMCS12_SIZE); 769 766 } 770 767 ··· 783 780 vmcs12->vmcs_link_pointer, VMCS12_SIZE)) 784 781 return; 785 782 786 - kvm_write_guest_cached(vmx->vcpu.kvm, ghc, get_shadow_vmcs12(vcpu), 783 + kvm_write_guest_cached(vcpu->kvm, ghc, get_shadow_vmcs12(vcpu), 787 784 VMCS12_SIZE); 788 785 } 789 786 ··· 2299 2296 return; 2300 2297 vmx->nested.vmcs02_initialized = true; 2301 2298 2302 - /* 2303 - * We don't care what the EPTP value is we just need to guarantee 2304 - * it's valid so we don't get a false positive when doing early 2305 - * consistency checks. 2306 - */ 2307 - if (enable_ept && nested_early_check) 2308 - vmcs_write64(EPT_POINTER, 2309 - construct_eptp(&vmx->vcpu, 0, PT64_ROOT_4LEVEL)); 2310 - 2311 2299 if (vmx->ve_info) 2312 2300 vmcs_write64(VE_INFORMATION_ADDRESS, __pa(vmx->ve_info)); 2313 2301 ··· 2743 2749 vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat); 2744 2750 vcpu->arch.pat = vmcs12->guest_ia32_pat; 2745 2751 } else if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) { 2746 - vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat); 2752 + vmcs_write64(GUEST_IA32_PAT, vcpu->arch.pat); 2747 2753 } 2748 2754 2749 2755 vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset( ··· 2955 2961 } 2956 2962 } 2957 2963 2964 + if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_TSC_SCALING) && 2965 + CC(!vmcs12->tsc_multiplier)) 2966 + return -EINVAL; 2967 + 2958 2968 return 0; 2959 2969 } 2960 2970 ··· 3072 3074 if (guest_cpu_cap_has_evmcs(vcpu)) 3073 3075 return nested_evmcs_check_controls(vmcs12); 3074 3076 #endif 3077 + 3078 + return 0; 3079 + } 3080 + 3081 + static int nested_vmx_check_controls_late(struct kvm_vcpu *vcpu, 3082 + struct vmcs12 *vmcs12) 3083 + { 3084 + void *vapic = to_vmx(vcpu)->nested.virtual_apic_map.hva; 3085 + u32 vtpr = vapic ? (*(u32 *)(vapic + APIC_TASKPRI)) >> 4 : 0; 3086 + 3087 + /* 3088 + * Don't bother with the consistency checks if KVM isn't configured to 3089 + * WARN on missed consistency checks, as KVM needs to rely on hardware 3090 + * to fully detect an illegal vTPR vs. TPR Threshold combination due to 3091 + * the vTPR being writable by L1 at all times (it's an in-memory value, 3092 + * not a VMCS field). I.e. even if the check passes now, it might fail 3093 + * at the actual VM-Enter. 3094 + * 3095 + * Keying off the module param also allows treating an invalid vAPIC 3096 + * mapping as a consistency check failure without increasing the risk 3097 + * of breaking a "real" VM. 
3098 + */ 3099 + if (!warn_on_missed_cc) 3100 + return 0; 3101 + 3102 + if ((exec_controls_get(to_vmx(vcpu)) & CPU_BASED_TPR_SHADOW) && 3103 + nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) && 3104 + !nested_cpu_has_vid(vmcs12) && 3105 + !nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) && 3106 + (CC(!vapic) || 3107 + CC((vmcs12->tpr_threshold & GENMASK(3, 0)) > (vtpr & GENMASK(3, 0))))) 3108 + return -EINVAL; 3075 3109 3076 3110 return 0; 3077 3111 } ··· 3359 3329 3360 3330 if (nested_check_guest_non_reg_state(vmcs12)) 3361 3331 return -EINVAL; 3362 - 3363 - return 0; 3364 - } 3365 - 3366 - static int nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu) 3367 - { 3368 - struct vcpu_vmx *vmx = to_vmx(vcpu); 3369 - unsigned long cr3, cr4; 3370 - bool vm_fail; 3371 - 3372 - if (!nested_early_check) 3373 - return 0; 3374 - 3375 - if (vmx->msr_autoload.host.nr) 3376 - vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0); 3377 - if (vmx->msr_autoload.guest.nr) 3378 - vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0); 3379 - 3380 - preempt_disable(); 3381 - 3382 - vmx_prepare_switch_to_guest(vcpu); 3383 - 3384 - /* 3385 - * Induce a consistency check VMExit by clearing bit 1 in GUEST_RFLAGS, 3386 - * which is reserved to '1' by hardware. GUEST_RFLAGS is guaranteed to 3387 - * be written (by prepare_vmcs02()) before the "real" VMEnter, i.e. 3388 - * there is no need to preserve other bits or save/restore the field. 3389 - */ 3390 - vmcs_writel(GUEST_RFLAGS, 0); 3391 - 3392 - cr3 = __get_current_cr3_fast(); 3393 - if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) { 3394 - vmcs_writel(HOST_CR3, cr3); 3395 - vmx->loaded_vmcs->host_state.cr3 = cr3; 3396 - } 3397 - 3398 - cr4 = cr4_read_shadow(); 3399 - if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) { 3400 - vmcs_writel(HOST_CR4, cr4); 3401 - vmx->loaded_vmcs->host_state.cr4 = cr4; 3402 - } 3403 - 3404 - vm_fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs, 3405 - __vmx_vcpu_run_flags(vmx)); 3406 - 3407 - if (vmx->msr_autoload.host.nr) 3408 - vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr); 3409 - if (vmx->msr_autoload.guest.nr) 3410 - vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr); 3411 - 3412 - if (vm_fail) { 3413 - u32 error = vmcs_read32(VM_INSTRUCTION_ERROR); 3414 - 3415 - preempt_enable(); 3416 - 3417 - trace_kvm_nested_vmenter_failed( 3418 - "early hardware check VM-instruction error: ", error); 3419 - WARN_ON_ONCE(error != VMXERR_ENTRY_INVALID_CONTROL_FIELD); 3420 - return 1; 3421 - } 3422 - 3423 - /* 3424 - * VMExit clears RFLAGS.IF and DR7, even on a consistency check. 3425 - */ 3426 - if (hw_breakpoint_active()) 3427 - set_debugreg(__this_cpu_read(cpu_dr7), 7); 3428 - local_irq_enable(); 3429 - preempt_enable(); 3430 - 3431 - /* 3432 - * A non-failing VMEntry means we somehow entered guest mode with 3433 - * an illegal RIP, and that's just the tip of the iceberg. There 3434 - * is no telling what memory has been modified or what state has 3435 - * been exposed to unknown code. Hitting this all but guarantees 3436 - * a (very critical) hardware issue. 3437 - */ 3438 - WARN_ON(!(vmcs_read32(VM_EXIT_REASON) & 3439 - VMX_EXIT_REASONS_FAILED_VMENTRY)); 3440 3332 3441 3333 return 0; 3442 3334 } ··· 3619 3667 &vmx->nested.pre_vmenter_ssp_tbl); 3620 3668 3621 3669 /* 3622 - * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and* 3623 - * nested early checks are disabled. In the event of a "late" VM-Fail, 3624 - * i.e. 
a VM-Fail detected by hardware but not KVM, KVM must unwind its 3625 - * software model to the pre-VMEntry host state. When EPT is disabled, 3626 - * GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3, which causes 3627 - * nested_vmx_restore_host_state() to corrupt vcpu->arch.cr3. Stuffing 3628 - * vmcs01.GUEST_CR3 results in the unwind naturally setting arch.cr3 to 3629 - * the correct value. Smashing vmcs01.GUEST_CR3 is safe because nested 3630 - * VM-Exits, and the unwind, reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is 3631 - * guaranteed to be overwritten with a shadow CR3 prior to re-entering 3632 - * L1. Don't stuff vmcs01.GUEST_CR3 when using nested early checks as 3633 - * KVM modifies vcpu->arch.cr3 if and only if the early hardware checks 3634 - * pass, and early VM-Fails do not reset KVM's MMU, i.e. the VM-Fail 3635 - * path would need to manually save/restore vmcs01.GUEST_CR3. 3670 + * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled. In the 3671 + * event of a "late" VM-Fail, i.e. a VM-Fail detected by hardware but 3672 + * not KVM, KVM must unwind its software model to the pre-VM-Entry host 3673 + * state. When EPT is disabled, GUEST_CR3 holds KVM's shadow CR3, not 3674 + * L1's "real" CR3, which causes nested_vmx_restore_host_state() to 3675 + * corrupt vcpu->arch.cr3. Stuffing vmcs01.GUEST_CR3 results in the 3676 + * unwind naturally setting arch.cr3 to the correct value. Smashing 3677 + * vmcs01.GUEST_CR3 is safe because nested VM-Exits, and the unwind, 3678 + * reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is guaranteed to be 3679 + * overwritten with a shadow CR3 prior to re-entering L1. 3636 3680 */ 3637 - if (!enable_ept && !nested_early_check) 3681 + if (!enable_ept) 3638 3682 vmcs_writel(GUEST_CR3, vcpu->arch.cr3); 3639 3683 3640 3684 vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02); ··· 3643 3695 return NVMX_VMENTRY_KVM_INTERNAL_ERROR; 3644 3696 } 3645 3697 3646 - if (nested_vmx_check_vmentry_hw(vcpu)) { 3698 + if (nested_vmx_check_controls_late(vcpu, vmcs12)) { 3647 3699 vmx_switch_vmcs(vcpu, &vmx->vmcs01); 3648 3700 return NVMX_VMENTRY_VMFAIL; 3649 3701 } ··· 3828 3880 goto vmentry_failed; 3829 3881 3830 3882 /* Hide L1D cache contents from the nested guest. */ 3831 - vmx->vcpu.arch.l1tf_flush_l1d = true; 3883 + kvm_request_l1tf_flush_l1d(); 3832 3884 3833 3885 /* 3834 3886 * Must happen outside of nested_vmx_enter_non_root_mode() as it will ··· 5112 5164 /* 5113 5165 * The only expected VM-instruction error is "VM entry with 5114 5166 * invalid control field(s)." Anything else indicates a 5115 - * problem with L0. And we should never get here with a 5116 - * VMFail of any type if early consistency checks are enabled. 5167 + * problem with L0. 5117 5168 */ 5118 5169 WARN_ON_ONCE(vmcs_read32(VM_INSTRUCTION_ERROR) != 5119 5170 VMXERR_ENTRY_INVALID_CONTROL_FIELD); 5120 - WARN_ON_ONCE(nested_early_check); 5171 + 5172 + /* VM-Fail at VM-Entry means KVM missed a consistency check. */ 5173 + WARN_ON_ONCE(warn_on_missed_cc); 5121 5174 } 5122 5175 5123 5176 /*
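Both the new early check (tpr_threshold bits 31:4 must be clear without virtual-interrupt delivery) and the late check above approximate the SDM's TPR-threshold consistency rule. A sketch of the core comparison, taking the raw vTPR byte from the virtual-APIC page:

	/* VM-Entry rule being approximated: threshold[3:0] <= vTPR[7:4]. */
	static int tpr_threshold_consistent(u32 vtpr, u32 tpr_threshold)
	{
		return (tpr_threshold & 0xf) <= ((vtpr >> 4) & 0xf);
	}

E.g. vTPR = 0x30 with a threshold of 5 is inconsistent and fails VM-Entry; since L1 can rewrite the in-memory vTPR at any point up to the actual VM-Enter, KVM can only WARN when hardware catches a case this software check missed.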
+3 -7
arch/x86/kvm/vmx/run_flags.h
··· 2 2 #ifndef __KVM_X86_VMX_RUN_FLAGS_H 3 3 #define __KVM_X86_VMX_RUN_FLAGS_H 4 4 5 - #define VMX_RUN_VMRESUME_SHIFT 0 6 - #define VMX_RUN_SAVE_SPEC_CTRL_SHIFT 1 7 - #define VMX_RUN_CLEAR_CPU_BUFFERS_FOR_MMIO_SHIFT 2 8 - 9 - #define VMX_RUN_VMRESUME BIT(VMX_RUN_VMRESUME_SHIFT) 10 - #define VMX_RUN_SAVE_SPEC_CTRL BIT(VMX_RUN_SAVE_SPEC_CTRL_SHIFT) 11 - #define VMX_RUN_CLEAR_CPU_BUFFERS_FOR_MMIO BIT(VMX_RUN_CLEAR_CPU_BUFFERS_FOR_MMIO_SHIFT) 5 + #define VMX_RUN_VMRESUME BIT(0) 6 + #define VMX_RUN_SAVE_SPEC_CTRL BIT(1) 7 + #define VMX_RUN_CLEAR_CPU_BUFFERS_FOR_MMIO BIT(2) 12 8 13 9 #endif /* __KVM_X86_VMX_RUN_FLAGS_H */
+378 -423
arch/x86/kvm/vmx/tdx.c
··· 24 24 #undef pr_fmt 25 25 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 26 26 27 - #define pr_tdx_error(__fn, __err) \ 28 - pr_err_ratelimited("SEAMCALL %s failed: 0x%llx\n", #__fn, __err) 27 + #define __TDX_BUG_ON(__err, __f, __kvm, __fmt, __args...) \ 28 + ({ \ 29 + struct kvm *_kvm = (__kvm); \ 30 + bool __ret = !!(__err); \ 31 + \ 32 + if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) { \ 33 + if (_kvm) \ 34 + kvm_vm_bugged(_kvm); \ 35 + pr_err_ratelimited("SEAMCALL " __f " failed: 0x%llx" __fmt "\n",\ 36 + __err, __args); \ 37 + } \ 38 + unlikely(__ret); \ 39 + }) 29 40 30 - #define __pr_tdx_error_N(__fn_str, __err, __fmt, ...) \ 31 - pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx, " __fmt, __err, __VA_ARGS__) 41 + #define TDX_BUG_ON(__err, __fn, __kvm) \ 42 + __TDX_BUG_ON(__err, #__fn, __kvm, "%s", "") 32 43 33 - #define pr_tdx_error_1(__fn, __err, __rcx) \ 34 - __pr_tdx_error_N(#__fn, __err, "rcx 0x%llx\n", __rcx) 44 + #define TDX_BUG_ON_1(__err, __fn, a1, __kvm) \ 45 + __TDX_BUG_ON(__err, #__fn, __kvm, ", " #a1 " 0x%llx", a1) 35 46 36 - #define pr_tdx_error_2(__fn, __err, __rcx, __rdx) \ 37 - __pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx\n", __rcx, __rdx) 47 + #define TDX_BUG_ON_2(__err, __fn, a1, a2, __kvm) \ 48 + __TDX_BUG_ON(__err, #__fn, __kvm, ", " #a1 " 0x%llx, " #a2 " 0x%llx", a1, a2) 38 49 39 - #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8) \ 40 - __pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8) 50 + #define TDX_BUG_ON_3(__err, __fn, a1, a2, a3, __kvm) \ 51 + __TDX_BUG_ON(__err, #__fn, __kvm, ", " #a1 " 0x%llx, " #a2 ", 0x%llx, " #a3 " 0x%llx", \ 52 + a1, a2, a3) 53 + 41 54 42 55 bool enable_tdx __ro_after_init; 43 56 module_param_named(tdx, enable_tdx, bool, 0444); ··· 294 281 vcpu->cpu = -1; 295 282 } 296 283 297 - static void tdx_no_vcpus_enter_start(struct kvm *kvm) 298 - { 299 - struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 300 - 301 - lockdep_assert_held_write(&kvm->mmu_lock); 302 - 303 - WRITE_ONCE(kvm_tdx->wait_for_sept_zap, true); 304 - 305 - kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE); 306 - } 307 - 308 - static void tdx_no_vcpus_enter_stop(struct kvm *kvm) 309 - { 310 - struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 311 - 312 - lockdep_assert_held_write(&kvm->mmu_lock); 313 - 314 - WRITE_ONCE(kvm_tdx->wait_for_sept_zap, false); 315 - } 284 + /* 285 + * Execute a SEAMCALL related to removing/blocking S-EPT entries, with a single 286 + * retry (if necessary) after forcing vCPUs to exit and wait for the operation 287 + * to complete. All flows that remove/block S-EPT entries run with mmu_lock 288 + * held for write, i.e. are mutually exclusive with each other, but they aren't 289 + * mutually exclusive with running vCPUs, and so can fail with "operand busy" 290 + * if a vCPU acquires a relevant lock in the TDX-Module, e.g. when doing TDCALL. 291 + * 292 + * Note, the retry is guaranteed to succeed, absent KVM and/or TDX-Module bugs. 293 + */ 294 + #define tdh_do_no_vcpus(tdh_func, kvm, args...) 
\ 295 + ({ \ 296 + struct kvm_tdx *__kvm_tdx = to_kvm_tdx(kvm); \ 297 + u64 __err; \ 298 + \ 299 + lockdep_assert_held_write(&kvm->mmu_lock); \ 300 + \ 301 + __err = tdh_func(args); \ 302 + if (unlikely(tdx_operand_busy(__err))) { \ 303 + WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, true); \ 304 + kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE); \ 305 + \ 306 + __err = tdh_func(args); \ 307 + \ 308 + WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, false); \ 309 + } \ 310 + __err; \ 311 + }) 316 312 317 313 /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */ 318 314 static int __tdx_reclaim_page(struct page *page) ··· 335 313 * before the HKID is released and control pages have also been 336 314 * released at this point, so there is no possibility of contention. 337 315 */ 338 - if (WARN_ON_ONCE(err)) { 339 - pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, rcx, rdx, r8); 316 + if (TDX_BUG_ON_3(err, TDH_PHYMEM_PAGE_RECLAIM, rcx, rdx, r8, NULL)) 340 317 return -EIO; 341 - } 318 + 342 319 return 0; 343 320 } 344 321 ··· 425 404 return; 426 405 427 406 smp_call_function_single(cpu, tdx_flush_vp, &arg, 1); 428 - if (KVM_BUG_ON(arg.err, vcpu->kvm)) 429 - pr_tdx_error(TDH_VP_FLUSH, arg.err); 407 + 408 + TDX_BUG_ON(arg.err, TDH_VP_FLUSH, vcpu->kvm); 430 409 } 431 410 432 411 void tdx_disable_virtualization_cpu(void) ··· 485 464 } 486 465 487 466 out: 488 - if (WARN_ON_ONCE(err)) 489 - pr_tdx_error(TDH_PHYMEM_CACHE_WB, err); 467 + TDX_BUG_ON(err, TDH_PHYMEM_CACHE_WB, NULL); 490 468 } 491 469 492 470 void tdx_mmu_release_hkid(struct kvm *kvm) ··· 524 504 err = tdh_mng_vpflushdone(&kvm_tdx->td); 525 505 if (err == TDX_FLUSHVP_NOT_DONE) 526 506 goto out; 527 - if (KVM_BUG_ON(err, kvm)) { 528 - pr_tdx_error(TDH_MNG_VPFLUSHDONE, err); 507 + if (TDX_BUG_ON(err, TDH_MNG_VPFLUSHDONE, kvm)) { 529 508 pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n", 530 509 kvm_tdx->hkid); 531 510 goto out; ··· 547 528 * tdh_mng_key_freeid() will fail. 548 529 */ 549 530 err = tdh_mng_key_freeid(&kvm_tdx->td); 550 - if (KVM_BUG_ON(err, kvm)) { 551 - pr_tdx_error(TDH_MNG_KEY_FREEID, err); 531 + if (TDX_BUG_ON(err, TDH_MNG_KEY_FREEID, kvm)) { 552 532 pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n", 553 533 kvm_tdx->hkid); 554 534 } else { ··· 598 580 * when it is reclaiming TDCS). 599 581 */ 600 582 err = tdh_phymem_page_wbinvd_tdr(&kvm_tdx->td); 601 - if (KVM_BUG_ON(err, kvm)) { 602 - pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err); 583 + if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm)) 603 584 return; 604 - } 585 + 605 586 tdx_quirk_reset_page(kvm_tdx->td.tdr_page); 606 587 607 588 __free_page(kvm_tdx->td.tdr_page); ··· 623 606 624 607 /* TDX_RND_NO_ENTROPY related retries are handled by sc_retry() */ 625 608 err = tdh_mng_key_config(&kvm_tdx->td); 626 - 627 - if (KVM_BUG_ON(err, &kvm_tdx->kvm)) { 628 - pr_tdx_error(TDH_MNG_KEY_CONFIG, err); 609 + if (TDX_BUG_ON(err, TDH_MNG_KEY_CONFIG, &kvm_tdx->kvm)) 629 610 return -EIO; 630 - } 631 611 632 612 return 0; 633 613 } ··· 777 763 return tdx_vcpu_state_details_intr_pending(vcpu_state_details); 778 764 } 779 765 780 - /* 781 - * Compared to vmx_prepare_switch_to_guest(), there is not much to do 782 - * as SEAMCALL/SEAMRET calls take care of most of save and restore. 
783 - */ 784 - void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) 785 - { 786 - struct vcpu_vt *vt = to_vt(vcpu); 787 - 788 - if (vt->guest_state_loaded) 789 - return; 790 - 791 - if (likely(is_64bit_mm(current->mm))) 792 - vt->msr_host_kernel_gs_base = current->thread.gsbase; 793 - else 794 - vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE); 795 - 796 - vt->guest_state_loaded = true; 797 - } 798 - 799 766 struct tdx_uret_msr { 800 767 u32 msr; 801 768 unsigned int slot; ··· 790 795 {.msr = MSR_TSC_AUX,}, 791 796 }; 792 797 793 - static void tdx_user_return_msr_update_cache(void) 798 + void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) 794 799 { 800 + struct vcpu_vt *vt = to_vt(vcpu); 795 801 int i; 796 802 803 + if (vt->guest_state_loaded) 804 + return; 805 + 806 + if (likely(is_64bit_mm(current->mm))) 807 + vt->msr_host_kernel_gs_base = current->thread.gsbase; 808 + else 809 + vt->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE); 810 + 811 + vt->guest_state_loaded = true; 812 + 813 + /* 814 + * Explicitly set user-return MSRs that are clobbered by the TDX-Module 815 + * if VP.ENTER succeeds, i.e. on TD-Exit, with the values that would be 816 + * written by the TDX-Module. Don't rely on the TDX-Module to actually 817 + * clobber the MSRs, as the contract is poorly defined and not upheld. 818 + * E.g. the TDX-Module will synthesize an EPT Violation without doing 819 + * VM-Enter if it suspects a zero-step attack, and never "restore" VMM 820 + * state. 821 + */ 797 822 for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) 798 - kvm_user_return_msr_update_cache(tdx_uret_msrs[i].slot, 799 - tdx_uret_msrs[i].defval); 823 + kvm_set_user_return_msr(tdx_uret_msrs[i].slot, 824 + tdx_uret_msrs[i].defval, -1ull); 800 825 } 801 826 802 827 static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu) 803 828 { 804 829 struct vcpu_vt *vt = to_vt(vcpu); 805 - struct vcpu_tdx *tdx = to_tdx(vcpu); 806 830 807 831 if (!vt->guest_state_loaded) 808 832 return; 809 833 810 834 ++vcpu->stat.host_state_reload; 811 835 wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base); 812 - 813 - if (tdx->guest_entered) { 814 - tdx_user_return_msr_update_cache(); 815 - tdx->guest_entered = false; 816 - } 817 836 818 837 vt->guest_state_loaded = false; 819 838 } ··· 838 829 tdx_prepare_switch_to_host(vcpu); 839 830 } 840 831 832 + /* 833 + * Life cycles for a TD and a vCPU: 834 + * 1. KVM_CREATE_VM ioctl. 835 + * TD state is TD_STATE_UNINITIALIZED. 836 + * hkid is not assigned at this stage. 837 + * 2. KVM_TDX_INIT_VM ioctl. 838 + * TD transitions to TD_STATE_INITIALIZED. 839 + * hkid is assigned after this stage. 840 + * 3. KVM_CREATE_VCPU ioctl. (only when TD is TD_STATE_INITIALIZED). 841 + * 3.1 tdx_vcpu_create() transitions vCPU state to VCPU_TD_STATE_UNINITIALIZED. 842 + * 3.2 vcpu_load() and vcpu_put() in kvm_arch_vcpu_create(). 843 + * 3.3 (conditional) if any error encountered after kvm_arch_vcpu_create() 844 + * kvm_arch_vcpu_destroy() --> tdx_vcpu_free(). 845 + * 4. KVM_TDX_INIT_VCPU ioctl. 846 + * tdx_vcpu_init() transitions vCPU state to VCPU_TD_STATE_INITIALIZED. 847 + * vCPU control structures are allocated at this stage. 848 + * 5. kvm_destroy_vm(). 849 + * 5.1 tdx_mmu_release_hkid(): (1) tdh_vp_flush(), disassociates all vCPUs. 850 + * (2) puts hkid to !assigned state. 851 + * 5.2 kvm_destroy_vcpus() --> tdx_vcpu_free(): 852 + * transitions vCPU to VCPU_TD_STATE_UNINITIALIZED state. 853 + * 5.3 tdx_vm_destroy() 854 + * transitions TD to TD_STATE_UNINITIALIZED state. 
855 + * 856 + * tdx_vcpu_free() can be invoked only at 3.3 or 5.2. 857 + * - If at 3.3, hkid is still assigned, but the vCPU must be in 858 + * VCPU_TD_STATE_UNINITIALIZED state. 859 + * - if at 5.2, hkid must be !assigned and all vCPUs must be in 860 + * VCPU_TD_STATE_INITIALIZED state and have been dissociated. 861 + */ 841 862 void tdx_vcpu_free(struct kvm_vcpu *vcpu) 842 863 { 843 864 struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm); 844 865 struct vcpu_tdx *tdx = to_tdx(vcpu); 845 866 int i; 846 867 868 + if (vcpu->cpu != -1) { 869 + KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm); 870 + tdx_flush_vp_on_cpu(vcpu); 871 + return; 872 + } 873 + 847 874 /* 848 875 * It is not possible to reclaim pages while hkid is assigned. It might 849 - * be assigned if: 850 - * 1. the TD VM is being destroyed but freeing hkid failed, in which 851 - * case the pages are leaked 852 - * 2. TD VCPU creation failed and this on the error path, in which case 853 - * there is nothing to do anyway 876 + * be assigned if the TD VM is being destroyed but freeing hkid failed, 877 + * in which case the pages are leaked. 854 878 */ 855 879 if (is_hkid_assigned(kvm_tdx)) 856 880 return; ··· 898 856 } 899 857 if (tdx->vp.tdvpr_page) { 900 858 tdx_reclaim_control_page(tdx->vp.tdvpr_page); 901 - tdx->vp.tdvpr_page = 0; 859 + tdx->vp.tdvpr_page = NULL; 902 860 tdx->vp.tdvpr_pa = 0; 903 861 } 904 862 ··· 1101 1059 update_debugctlmsr(vcpu->arch.host_debugctl); 1102 1060 1103 1061 tdx_load_host_xsave_state(vcpu); 1104 - tdx->guest_entered = true; 1105 1062 1106 1063 vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET; 1107 1064 ··· 1109 1068 1110 1069 if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR)) 1111 1070 return EXIT_FASTPATH_NONE; 1112 - 1113 - if (unlikely(vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY)) 1114 - kvm_machine_check(); 1115 1071 1116 1072 trace_kvm_exit(vcpu, KVM_ISA_VMX); 1117 1073 ··· 1621 1583 td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa); 1622 1584 } 1623 1585 1624 - static void tdx_unpin(struct kvm *kvm, struct page *page) 1586 + static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level, 1587 + kvm_pfn_t pfn) 1625 1588 { 1626 - put_page(page); 1589 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1590 + u64 err, entry, level_state; 1591 + gpa_t gpa = gfn_to_gpa(gfn); 1592 + 1593 + lockdep_assert_held(&kvm->slots_lock); 1594 + 1595 + if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) || 1596 + KVM_BUG_ON(!kvm_tdx->page_add_src, kvm)) 1597 + return -EIO; 1598 + 1599 + err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), 1600 + kvm_tdx->page_add_src, &entry, &level_state); 1601 + if (unlikely(tdx_operand_busy(err))) 1602 + return -EBUSY; 1603 + 1604 + if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_ADD, entry, level_state, kvm)) 1605 + return -EIO; 1606 + 1607 + return 0; 1627 1608 } 1628 1609 1629 1610 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, 1630 - enum pg_level level, struct page *page) 1611 + enum pg_level level, kvm_pfn_t pfn) 1631 1612 { 1632 1613 int tdx_level = pg_level_to_tdx_sept_level(level); 1633 1614 struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1615 + struct page *page = pfn_to_page(pfn); 1634 1616 gpa_t gpa = gfn_to_gpa(gfn); 1635 1617 u64 entry, level_state; 1636 1618 u64 err; 1637 1619 1638 1620 err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state); 1639 - if (unlikely(tdx_operand_busy(err))) { 1640 - tdx_unpin(kvm, page); 1621 + if (unlikely(tdx_operand_busy(err))) 1641 1622 return 
-EBUSY; 1642 - } 1643 1623 1644 - if (KVM_BUG_ON(err, kvm)) { 1645 - pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state); 1646 - tdx_unpin(kvm, page); 1624 + if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_AUG, entry, level_state, kvm)) 1647 1625 return -EIO; 1648 - } 1649 1626 1650 - return 0; 1651 - } 1652 - 1653 - /* 1654 - * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the 1655 - * callback tdx_gmem_post_populate() then maps pages into private memory. 1656 - * through the a seamcall TDH.MEM.PAGE.ADD(). The SEAMCALL also requires the 1657 - * private EPT structures for the page to have been built before, which is 1658 - * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that 1659 - * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD(). 1660 - * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there 1661 - * are no half-initialized shared EPT pages. 1662 - */ 1663 - static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn, 1664 - enum pg_level level, kvm_pfn_t pfn) 1665 - { 1666 - struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1667 - 1668 - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)) 1669 - return -EINVAL; 1670 - 1671 - /* nr_premapped will be decreased when tdh_mem_page_add() is called. */ 1672 - atomic64_inc(&kvm_tdx->nr_premapped); 1673 1627 return 0; 1674 1628 } 1675 1629 1676 1630 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, 1677 - enum pg_level level, kvm_pfn_t pfn) 1631 + enum pg_level level, u64 mirror_spte) 1678 1632 { 1679 1633 struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1680 - struct page *page = pfn_to_page(pfn); 1634 + kvm_pfn_t pfn = spte_to_pfn(mirror_spte); 1681 1635 1682 1636 /* TODO: handle large pages. */ 1683 1637 if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) 1684 - return -EINVAL; 1638 + return -EIO; 1639 + 1640 + WARN_ON_ONCE(!is_shadow_present_pte(mirror_spte) || 1641 + (mirror_spte & VMX_EPT_RWX_MASK) != VMX_EPT_RWX_MASK); 1685 1642 1686 1643 /* 1687 - * Because guest_memfd doesn't support page migration with 1688 - * a_ops->migrate_folio (yet), no callback is triggered for KVM on page 1689 - * migration. Until guest_memfd supports page migration, prevent page 1690 - * migration. 1691 - * TODO: Once guest_memfd introduces callback on page migration, 1692 - * implement it and remove get_page/put_page(). 1693 - */ 1694 - get_page(page); 1695 - 1696 - /* 1697 - * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching 1698 - * barrier in tdx_td_finalize(). 1644 + * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory() 1645 + * before kvm_tdx->state. Userspace must not be allowed to pre-fault 1646 + * arbitrary memory until the initial memory image is finalized. Pairs 1647 + * with the smp_wmb() in tdx_td_finalize(). 1699 1648 */ 1700 1649 smp_rmb(); 1701 - if (likely(kvm_tdx->state == TD_STATE_RUNNABLE)) 1702 - return tdx_mem_page_aug(kvm, gfn, level, page); 1703 - 1704 - return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn); 1705 - } 1706 - 1707 - static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, 1708 - enum pg_level level, struct page *page) 1709 - { 1710 - int tdx_level = pg_level_to_tdx_sept_level(level); 1711 - struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1712 - gpa_t gpa = gfn_to_gpa(gfn); 1713 - u64 err, entry, level_state; 1714 - 1715 - /* TODO: handle large pages. 
*/ 1716 - if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) 1717 - return -EINVAL; 1718 - 1719 - if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm)) 1720 - return -EINVAL; 1721 1650 1722 1651 /* 1723 - * When zapping private page, write lock is held. So no race condition 1724 - * with other vcpu sept operation. 1725 - * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs. 1652 + * If the TD isn't finalized/runnable, then userspace is initializing 1653 + * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD. 1726 1654 */ 1727 - err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry, 1728 - &level_state); 1655 + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) 1656 + return tdx_mem_page_add(kvm, gfn, level, pfn); 1729 1657 1730 - if (unlikely(tdx_operand_busy(err))) { 1731 - /* 1732 - * The second retry is expected to succeed after kicking off all 1733 - * other vCPUs and prevent them from invoking TDH.VP.ENTER. 1734 - */ 1735 - tdx_no_vcpus_enter_start(kvm); 1736 - err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry, 1737 - &level_state); 1738 - tdx_no_vcpus_enter_stop(kvm); 1739 - } 1740 - 1741 - if (KVM_BUG_ON(err, kvm)) { 1742 - pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state); 1743 - return -EIO; 1744 - } 1745 - 1746 - err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page); 1747 - 1748 - if (KVM_BUG_ON(err, kvm)) { 1749 - pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err); 1750 - return -EIO; 1751 - } 1752 - tdx_quirk_reset_page(page); 1753 - tdx_unpin(kvm, page); 1754 - return 0; 1658 + return tdx_mem_page_aug(kvm, gfn, level, pfn); 1755 1659 } 1756 1660 1757 1661 static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn, ··· 1709 1729 if (unlikely(tdx_operand_busy(err))) 1710 1730 return -EBUSY; 1711 1731 1712 - if (KVM_BUG_ON(err, kvm)) { 1713 - pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state); 1732 + if (TDX_BUG_ON_2(err, TDH_MEM_SEPT_ADD, entry, level_state, kvm)) 1714 1733 return -EIO; 1715 - } 1716 1734 1717 1735 return 0; 1718 - } 1719 - 1720 - /* 1721 - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is 1722 - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called 1723 - * successfully. 1724 - * 1725 - * Since tdh_mem_sept_add() must have been invoked successfully before a 1726 - * non-leaf entry present in the mirrored page table, the SEPT ZAP related 1727 - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead 1728 - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the 1729 - * SEPT. 1730 - * 1731 - * Further check if the returned entry from SEPT walking is with RWX permissions 1732 - * to filter out anything unexpected. 1733 - * 1734 - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from 1735 - * level_state returned from a SEAMCALL error is the same as that passed into 1736 - * the SEAMCALL. 
1737 - */ 1738 - static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err, 1739 - u64 entry, int level) 1740 - { 1741 - if (!err || kvm_tdx->state == TD_STATE_RUNNABLE) 1742 - return false; 1743 - 1744 - if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) 1745 - return false; 1746 - 1747 - if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK))) 1748 - return false; 1749 - 1750 - return true; 1751 - } 1752 - 1753 - static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn, 1754 - enum pg_level level, struct page *page) 1755 - { 1756 - int tdx_level = pg_level_to_tdx_sept_level(level); 1757 - struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1758 - gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level); 1759 - u64 err, entry, level_state; 1760 - 1761 - /* For now large page isn't supported yet. */ 1762 - WARN_ON_ONCE(level != PG_LEVEL_4K); 1763 - 1764 - err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); 1765 - 1766 - if (unlikely(tdx_operand_busy(err))) { 1767 - /* After no vCPUs enter, the second retry is expected to succeed */ 1768 - tdx_no_vcpus_enter_start(kvm); 1769 - err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state); 1770 - tdx_no_vcpus_enter_stop(kvm); 1771 - } 1772 - if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) && 1773 - !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) { 1774 - atomic64_dec(&kvm_tdx->nr_premapped); 1775 - tdx_unpin(kvm, page); 1776 - return 0; 1777 - } 1778 - 1779 - if (KVM_BUG_ON(err, kvm)) { 1780 - pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state); 1781 - return -EIO; 1782 - } 1783 - return 1; 1784 1736 } 1785 1737 1786 1738 /* ··· 1748 1836 if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) 1749 1837 return; 1750 1838 1839 + /* 1840 + * The full sequence of TDH.MEM.TRACK and forcing vCPUs out of guest 1841 + * mode must be serialized, as TDH.MEM.TRACK will fail if the previous 1842 + * tracking epoch hasn't completed. 1843 + */ 1751 1844 lockdep_assert_held_write(&kvm->mmu_lock); 1752 1845 1753 - err = tdh_mem_track(&kvm_tdx->td); 1754 - if (unlikely(tdx_operand_busy(err))) { 1755 - /* After no vCPUs enter, the second retry is expected to succeed */ 1756 - tdx_no_vcpus_enter_start(kvm); 1757 - err = tdh_mem_track(&kvm_tdx->td); 1758 - tdx_no_vcpus_enter_stop(kvm); 1759 - } 1760 - 1761 - if (KVM_BUG_ON(err, kvm)) 1762 - pr_tdx_error(TDH_MEM_TRACK, err); 1846 + err = tdh_do_no_vcpus(tdh_mem_track, kvm, &kvm_tdx->td); 1847 + TDX_BUG_ON(err, TDH_MEM_TRACK, kvm); 1763 1848 1764 1849 kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE); 1765 1850 } ··· 1775 1866 * and slot move/deletion. 
1776 1867 */ 1777 1868 if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm)) 1778 - return -EINVAL; 1869 + return -EIO; 1779 1870 1780 1871 /* 1781 1872 * The HKID assigned to this TD was already freed and cache was ··· 1784 1875 return tdx_reclaim_page(virt_to_page(private_spt)); 1785 1876 } 1786 1877 1787 - static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn, 1788 - enum pg_level level, kvm_pfn_t pfn) 1878 + static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn, 1879 + enum pg_level level, u64 mirror_spte) 1789 1880 { 1790 - struct page *page = pfn_to_page(pfn); 1791 - int ret; 1881 + struct page *page = pfn_to_page(spte_to_pfn(mirror_spte)); 1882 + int tdx_level = pg_level_to_tdx_sept_level(level); 1883 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 1884 + gpa_t gpa = gfn_to_gpa(gfn); 1885 + u64 err, entry, level_state; 1886 + 1887 + lockdep_assert_held_write(&kvm->mmu_lock); 1792 1888 1793 1889 /* 1794 1890 * HKID is released after all private pages have been removed, and set ··· 1801 1887 * there can't be anything populated in the private EPT. 1802 1888 */ 1803 1889 if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm)) 1804 - return -EINVAL; 1890 + return; 1805 1891 1806 - ret = tdx_sept_zap_private_spte(kvm, gfn, level, page); 1807 - if (ret <= 0) 1808 - return ret; 1892 + /* TODO: handle large pages. */ 1893 + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) 1894 + return; 1895 + 1896 + err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa, 1897 + tdx_level, &entry, &level_state); 1898 + if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm)) 1899 + return; 1809 1900 1810 1901 /* 1811 1902 * TDX requires TLB tracking before dropping private page. Do ··· 1818 1899 */ 1819 1900 tdx_track(kvm); 1820 1901 1821 - return tdx_sept_drop_private_spte(kvm, gfn, level, page); 1902 + /* 1903 + * When zapping private page, write lock is held. So no race condition 1904 + * with other vcpu sept operation. 1905 + * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs. 
1906 + */ 1907 + err = tdh_do_no_vcpus(tdh_mem_page_remove, kvm, &kvm_tdx->td, gpa, 1908 + tdx_level, &entry, &level_state); 1909 + if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm)) 1910 + return; 1911 + 1912 + err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page); 1913 + if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm)) 1914 + return; 1915 + 1916 + tdx_quirk_reset_page(page); 1822 1917 } 1823 1918 1824 1919 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode, ··· 2078 2145 } 2079 2146 2080 2147 unhandled_exit: 2081 - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; 2082 - vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON; 2083 - vcpu->run->internal.ndata = 2; 2084 - vcpu->run->internal.data[0] = vp_enter_ret; 2085 - vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu; 2148 + kvm_prepare_unexpected_reason_exit(vcpu, vp_enter_ret); 2086 2149 return 0; 2087 2150 } 2088 2151 ··· 2211 2282 if (cmd->flags) 2212 2283 return -EINVAL; 2213 2284 2214 - caps = kzalloc(sizeof(*caps) + 2215 - sizeof(struct kvm_cpuid_entry2) * td_conf->num_cpuid_config, 2216 - GFP_KERNEL); 2285 + user_caps = u64_to_user_ptr(cmd->data); 2286 + if (get_user(nr_user_entries, &user_caps->cpuid.nent)) 2287 + return -EFAULT; 2288 + 2289 + if (nr_user_entries < td_conf->num_cpuid_config) 2290 + return -E2BIG; 2291 + 2292 + caps = kzalloc(struct_size(caps, cpuid.entries, 2293 + td_conf->num_cpuid_config), GFP_KERNEL); 2217 2294 if (!caps) 2218 2295 return -ENOMEM; 2219 - 2220 - user_caps = u64_to_user_ptr(cmd->data); 2221 - if (get_user(nr_user_entries, &user_caps->cpuid.nent)) { 2222 - ret = -EFAULT; 2223 - goto out; 2224 - } 2225 - 2226 - if (nr_user_entries < td_conf->num_cpuid_config) { 2227 - ret = -E2BIG; 2228 - goto out; 2229 - } 2230 2296 2231 2297 ret = init_kvm_tdx_caps(td_conf, caps); 2232 2298 if (ret) 2233 2299 goto out; 2234 2300 2235 - if (copy_to_user(user_caps, caps, sizeof(*caps))) { 2301 + if (copy_to_user(user_caps, caps, struct_size(caps, cpuid.entries, 2302 + caps->cpuid.nent))) { 2236 2303 ret = -EFAULT; 2237 2304 goto out; 2238 2305 } 2239 - 2240 - if (copy_to_user(user_caps->cpuid.entries, caps->cpuid.entries, 2241 - caps->cpuid.nent * 2242 - sizeof(caps->cpuid.entries[0]))) 2243 - ret = -EFAULT; 2244 2306 2245 2307 out: 2246 2308 /* kfree() accepts NULL. 
*/ ··· 2457 2537 goto free_packages; 2458 2538 } 2459 2539 2460 - if (WARN_ON_ONCE(err)) { 2461 - pr_tdx_error(TDH_MNG_CREATE, err); 2540 + if (TDX_BUG_ON(err, TDH_MNG_CREATE, kvm)) { 2462 2541 ret = -EIO; 2463 2542 goto free_packages; 2464 2543 } ··· 2498 2579 ret = -EAGAIN; 2499 2580 goto teardown; 2500 2581 } 2501 - if (WARN_ON_ONCE(err)) { 2502 - pr_tdx_error(TDH_MNG_ADDCX, err); 2582 + if (TDX_BUG_ON(err, TDH_MNG_ADDCX, kvm)) { 2503 2583 ret = -EIO; 2504 2584 goto teardown; 2505 2585 } ··· 2515 2597 *seamcall_err = err; 2516 2598 ret = -EINVAL; 2517 2599 goto teardown; 2518 - } else if (WARN_ON_ONCE(err)) { 2519 - pr_tdx_error_1(TDH_MNG_INIT, err, rcx); 2600 + } else if (TDX_BUG_ON_1(err, TDH_MNG_INIT, rcx, kvm)) { 2520 2601 ret = -EIO; 2521 2602 goto teardown; 2522 2603 } ··· 2559 2642 free_tdr: 2560 2643 if (tdr_page) 2561 2644 __free_page(tdr_page); 2562 - kvm_tdx->td.tdr_page = 0; 2645 + kvm_tdx->td.tdr_page = NULL; 2563 2646 2564 2647 free_hkid: 2565 2648 tdx_hkid_free(kvm_tdx); ··· 2664 2747 return -EIO; 2665 2748 } 2666 2749 2750 + typedef void *tdx_vm_state_guard_t; 2751 + 2752 + static tdx_vm_state_guard_t tdx_acquire_vm_state_locks(struct kvm *kvm) 2753 + { 2754 + int r; 2755 + 2756 + mutex_lock(&kvm->lock); 2757 + 2758 + if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus)) { 2759 + r = -EBUSY; 2760 + goto out_err; 2761 + } 2762 + 2763 + r = kvm_lock_all_vcpus(kvm); 2764 + if (r) 2765 + goto out_err; 2766 + 2767 + /* 2768 + * Note the unintuitive ordering! vcpu->mutex must be taken outside 2769 + * kvm->slots_lock! 2770 + */ 2771 + mutex_lock(&kvm->slots_lock); 2772 + return kvm; 2773 + 2774 + out_err: 2775 + mutex_unlock(&kvm->lock); 2776 + return ERR_PTR(r); 2777 + } 2778 + 2779 + static void tdx_release_vm_state_locks(struct kvm *kvm) 2780 + { 2781 + mutex_unlock(&kvm->slots_lock); 2782 + kvm_unlock_all_vcpus(kvm); 2783 + mutex_unlock(&kvm->lock); 2784 + } 2785 + 2786 + DEFINE_CLASS(tdx_vm_state_guard, tdx_vm_state_guard_t, 2787 + if (!IS_ERR(_T)) tdx_release_vm_state_locks(_T), 2788 + tdx_acquire_vm_state_locks(kvm), struct kvm *kvm); 2789 + 2667 2790 static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd) 2668 2791 { 2792 + struct kvm_tdx_init_vm __user *user_data = u64_to_user_ptr(cmd->data); 2669 2793 struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 2670 2794 struct kvm_tdx_init_vm *init_vm; 2671 2795 struct td_params *td_params = NULL; 2796 + u32 nr_user_entries; 2672 2797 int ret; 2673 2798 2674 2799 BUILD_BUG_ON(sizeof(*init_vm) != 256 + sizeof_field(struct kvm_tdx_init_vm, cpuid)); ··· 2722 2763 if (cmd->flags) 2723 2764 return -EINVAL; 2724 2765 2725 - init_vm = kmalloc(sizeof(*init_vm) + 2726 - sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES, 2727 - GFP_KERNEL); 2728 - if (!init_vm) 2729 - return -ENOMEM; 2766 + if (get_user(nr_user_entries, &user_data->cpuid.nent)) 2767 + return -EFAULT; 2730 2768 2731 - if (copy_from_user(init_vm, u64_to_user_ptr(cmd->data), sizeof(*init_vm))) { 2732 - ret = -EFAULT; 2733 - goto out; 2734 - } 2769 + if (nr_user_entries > KVM_MAX_CPUID_ENTRIES) 2770 + return -E2BIG; 2735 2771 2736 - if (init_vm->cpuid.nent > KVM_MAX_CPUID_ENTRIES) { 2737 - ret = -E2BIG; 2738 - goto out; 2739 - } 2740 - 2741 - if (copy_from_user(init_vm->cpuid.entries, 2742 - u64_to_user_ptr(cmd->data) + sizeof(*init_vm), 2743 - flex_array_size(init_vm, cpuid.entries, init_vm->cpuid.nent))) { 2744 - ret = -EFAULT; 2745 - goto out; 2746 - } 2772 + init_vm = memdup_user(user_data, 2773 + struct_size(user_data, cpuid.entries, nr_user_entries)); 
2774 + if (IS_ERR(init_vm)) 2775 + return PTR_ERR(init_vm); 2747 2776 2748 2777 if (memchr_inv(init_vm->reserved, 0, sizeof(init_vm->reserved))) { 2749 2778 ret = -EINVAL; ··· 2815 2868 { 2816 2869 struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 2817 2870 2818 - guard(mutex)(&kvm->slots_lock); 2819 - 2820 2871 if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) 2821 - return -EINVAL; 2822 - /* 2823 - * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue 2824 - * TDH.MEM.PAGE.ADD(). 2825 - */ 2826 - if (atomic64_read(&kvm_tdx->nr_premapped)) 2827 2872 return -EINVAL; 2828 2873 2829 2874 cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td); 2830 2875 if (tdx_operand_busy(cmd->hw_error)) 2831 2876 return -EBUSY; 2832 - if (KVM_BUG_ON(cmd->hw_error, kvm)) { 2833 - pr_tdx_error(TDH_MR_FINALIZE, cmd->hw_error); 2877 + if (TDX_BUG_ON(cmd->hw_error, TDH_MR_FINALIZE, kvm)) 2834 2878 return -EIO; 2835 - } 2836 2879 2837 2880 kvm_tdx->state = TD_STATE_RUNNABLE; 2838 2881 /* TD_STATE_RUNNABLE must be set before 'pre_fault_allowed' */ ··· 2831 2894 return 0; 2832 2895 } 2833 2896 2897 + static int tdx_get_cmd(void __user *argp, struct kvm_tdx_cmd *cmd) 2898 + { 2899 + if (copy_from_user(cmd, argp, sizeof(*cmd))) 2900 + return -EFAULT; 2901 + 2902 + /* 2903 + * Userspace should never set hw_error. KVM writes hw_error to report 2904 + * hardware-defined error back to userspace. 2905 + */ 2906 + if (cmd->hw_error) 2907 + return -EINVAL; 2908 + 2909 + return 0; 2910 + } 2911 + 2834 2912 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) 2835 2913 { 2836 2914 struct kvm_tdx_cmd tdx_cmd; 2837 2915 int r; 2838 2916 2839 - if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd))) 2840 - return -EFAULT; 2917 + r = tdx_get_cmd(argp, &tdx_cmd); 2918 + if (r) 2919 + return r; 2841 2920 2842 - /* 2843 - * Userspace should never set hw_error. It is used to fill 2844 - * hardware-defined error by the kernel. 2845 - */ 2846 - if (tdx_cmd.hw_error) 2847 - return -EINVAL; 2921 + if (tdx_cmd.id == KVM_TDX_CAPABILITIES) 2922 + return tdx_get_capabilities(&tdx_cmd); 2848 2923 2849 - mutex_lock(&kvm->lock); 2924 + CLASS(tdx_vm_state_guard, guard)(kvm); 2925 + if (IS_ERR(guard)) 2926 + return PTR_ERR(guard); 2850 2927 2851 2928 switch (tdx_cmd.id) { 2852 - case KVM_TDX_CAPABILITIES: 2853 - r = tdx_get_capabilities(&tdx_cmd); 2854 - break; 2855 2929 case KVM_TDX_INIT_VM: 2856 2930 r = tdx_td_init(kvm, &tdx_cmd); 2857 2931 break; ··· 2870 2922 r = tdx_td_finalize(kvm, &tdx_cmd); 2871 2923 break; 2872 2924 default: 2873 - r = -EINVAL; 2874 - goto out; 2925 + return -EINVAL; 2875 2926 } 2876 2927 2877 2928 if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd))) 2878 - r = -EFAULT; 2929 + return -EFAULT; 2879 2930 2880 - out: 2881 - mutex_unlock(&kvm->lock); 2882 2931 return r; 2883 2932 } 2884 2933 ··· 2917 2972 } 2918 2973 2919 2974 err = tdh_vp_create(&kvm_tdx->td, &tdx->vp); 2920 - if (KVM_BUG_ON(err, vcpu->kvm)) { 2975 + if (TDX_BUG_ON(err, TDH_VP_CREATE, vcpu->kvm)) { 2921 2976 ret = -EIO; 2922 - pr_tdx_error(TDH_VP_CREATE, err); 2923 2977 goto free_tdcx; 2924 2978 } 2925 2979 2926 2980 for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) { 2927 2981 err = tdh_vp_addcx(&tdx->vp, tdx->vp.tdcx_pages[i]); 2928 - if (KVM_BUG_ON(err, vcpu->kvm)) { 2929 - pr_tdx_error(TDH_VP_ADDCX, err); 2982 + if (TDX_BUG_ON(err, TDH_VP_ADDCX, vcpu->kvm)) { 2930 2983 /* 2931 2984 * Pages already added are reclaimed by the vcpu_free 2932 2985 * method, but the rest are freed here. 
··· 2937 2994 } 2938 2995 } 2939 2996 2940 - err = tdh_vp_init(&tdx->vp, vcpu_rcx, vcpu->vcpu_id); 2941 - if (KVM_BUG_ON(err, vcpu->kvm)) { 2942 - pr_tdx_error(TDH_VP_INIT, err); 2943 - return -EIO; 2997 + /* 2998 + * tdh_vp_init() can take an exclusive lock of the TDR resource inside 2999 + * the TDX-Module. The TDR resource is also taken as shared in several 3000 + * no-fail MMU paths, which could return TDX_OPERAND_BUSY on contention 3001 + * (TDX-Module locks are try-lock implementations with no slow path). 3002 + * Take mmu_lock for write to reflect the nature of the lock taken by 3003 + * the TDX-Module, and to ensure the no-fail MMU paths succeed, e.g. if 3004 + * a concurrent PUNCH_HOLE on guest_memfd triggers removal of SPTEs. 3005 + */ 3006 + scoped_guard(write_lock, &vcpu->kvm->mmu_lock) { 3007 + err = tdh_vp_init(&tdx->vp, vcpu_rcx, vcpu->vcpu_id); 3008 + if (TDX_BUG_ON(err, TDH_VP_INIT, vcpu->kvm)) 3009 + return -EIO; 2944 3010 } 2945 3011 2946 3012 vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; ··· 2968 3016 free_tdvpr: 2969 3017 if (tdx->vp.tdvpr_page) 2970 3018 __free_page(tdx->vp.tdvpr_page); 2971 - tdx->vp.tdvpr_page = 0; 3019 + tdx->vp.tdvpr_page = NULL; 2972 3020 tdx->vp.tdvpr_pa = 0; 2973 3021 2974 3022 return ret; ··· 3006 3054 3007 3055 static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd) 3008 3056 { 3009 - struct kvm_cpuid2 __user *output, *td_cpuid; 3057 + struct kvm_cpuid2 __user *output; 3058 + struct kvm_cpuid2 *td_cpuid; 3010 3059 int r = 0, i = 0, leaf; 3011 3060 u32 level; 3012 3061 ··· 3120 3167 static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, 3121 3168 void __user *src, int order, void *_arg) 3122 3169 { 3123 - u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS; 3124 - struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 3125 3170 struct tdx_gmem_post_populate_arg *arg = _arg; 3126 - struct kvm_vcpu *vcpu = arg->vcpu; 3171 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 3172 + u64 err, entry, level_state; 3127 3173 gpa_t gpa = gfn_to_gpa(gfn); 3128 - u8 level = PG_LEVEL_4K; 3129 3174 struct page *src_page; 3130 3175 int ret, i; 3131 - u64 err, entry, level_state; 3176 + 3177 + if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) 3178 + return -EIO; 3132 3179 3133 3180 /* 3134 3181 * Get the source page if it has been faulted in. Return failure if the ··· 3140 3187 if (ret != 1) 3141 3188 return -ENOMEM; 3142 3189 3143 - ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level); 3144 - if (ret < 0) 3145 - goto out; 3190 + kvm_tdx->page_add_src = src_page; 3191 + ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn); 3192 + kvm_tdx->page_add_src = NULL; 3193 + 3194 + put_page(src_page); 3195 + 3196 + if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION)) 3197 + return ret; 3146 3198 3147 3199 /* 3148 - * The private mem cannot be zapped after kvm_tdp_map_page() 3149 - * because all paths are covered by slots_lock and the 3150 - * filemap invalidate lock. Check that they are indeed enough. 3200 + * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed 3201 + * between mapping the pfn and now, but slots_lock prevents memslot 3202 + * updates, filemap_invalidate_lock() prevents guest_memfd updates, 3203 + * mmu_notifier events can't reach S-EPT entries, and KVM's internal 3204 + * zapping flows are mutually exclusive with S-EPT mappings. 
3151 3205 */ 3152 - if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) { 3153 - scoped_guard(read_lock, &kvm->mmu_lock) { 3154 - if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) { 3155 - ret = -EIO; 3156 - goto out; 3157 - } 3158 - } 3206 + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { 3207 + err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state); 3208 + if (TDX_BUG_ON_2(err, TDH_MR_EXTEND, entry, level_state, kvm)) 3209 + return -EIO; 3159 3210 } 3160 3211 3161 - ret = 0; 3162 - err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), 3163 - src_page, &entry, &level_state); 3164 - if (err) { 3165 - ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO; 3166 - goto out; 3167 - } 3168 - 3169 - if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) 3170 - atomic64_dec(&kvm_tdx->nr_premapped); 3171 - 3172 - if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) { 3173 - for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { 3174 - err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, 3175 - &level_state); 3176 - if (err) { 3177 - ret = -EIO; 3178 - break; 3179 - } 3180 - } 3181 - } 3182 - 3183 - out: 3184 - put_page(src_page); 3185 - return ret; 3212 + return 0; 3186 3213 } 3187 3214 3188 3215 static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd) ··· 3177 3244 3178 3245 if (tdx->state != VCPU_TD_STATE_INITIALIZED) 3179 3246 return -EINVAL; 3180 - 3181 - guard(mutex)(&kvm->slots_lock); 3182 3247 3183 3248 /* Once TD is finalized, the initial guest memory is fixed. */ 3184 3249 if (kvm_tdx->state == TD_STATE_RUNNABLE) ··· 3195 3264 !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1)) 3196 3265 return -EINVAL; 3197 3266 3198 - kvm_mmu_reload(vcpu); 3199 3267 ret = 0; 3200 3268 while (region.nr_pages) { 3201 3269 if (signal_pending(current)) { ··· 3231 3301 return ret; 3232 3302 } 3233 3303 3304 + int tdx_vcpu_unlocked_ioctl(struct kvm_vcpu *vcpu, void __user *argp) 3305 + { 3306 + struct kvm *kvm = vcpu->kvm; 3307 + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); 3308 + struct kvm_tdx_cmd cmd; 3309 + int r; 3310 + 3311 + r = tdx_get_cmd(argp, &cmd); 3312 + if (r) 3313 + return r; 3314 + 3315 + CLASS(tdx_vm_state_guard, guard)(kvm); 3316 + if (IS_ERR(guard)) 3317 + return PTR_ERR(guard); 3318 + 3319 + if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) 3320 + return -EINVAL; 3321 + 3322 + vcpu_load(vcpu); 3323 + 3324 + switch (cmd.id) { 3325 + case KVM_TDX_INIT_MEM_REGION: 3326 + r = tdx_vcpu_init_mem_region(vcpu, &cmd); 3327 + break; 3328 + case KVM_TDX_INIT_VCPU: 3329 + r = tdx_vcpu_init(vcpu, &cmd); 3330 + break; 3331 + default: 3332 + r = -ENOIOCTLCMD; 3333 + break; 3334 + } 3335 + 3336 + vcpu_put(vcpu); 3337 + 3338 + return r; 3339 + } 3340 + 3234 3341 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) 3235 3342 { 3236 3343 struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm); ··· 3277 3310 if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE) 3278 3311 return -EINVAL; 3279 3312 3280 - if (copy_from_user(&cmd, argp, sizeof(cmd))) 3281 - return -EFAULT; 3282 - 3283 - if (cmd.hw_error) 3284 - return -EINVAL; 3313 + ret = tdx_get_cmd(argp, &cmd); 3314 + if (ret) 3315 + return ret; 3285 3316 3286 3317 switch (cmd.id) { 3287 - case KVM_TDX_INIT_VCPU: 3288 - ret = tdx_vcpu_init(vcpu, &cmd); 3289 - break; 3290 - case KVM_TDX_INIT_MEM_REGION: 3291 - ret = tdx_vcpu_init_mem_region(vcpu, &cmd); 3292 - break; 3293 3318 case KVM_TDX_GET_CPUID: 3294 3319 ret = tdx_vcpu_get_cpuid(vcpu, &cmd); 
3295 3320 break; ··· 3406 3447 /* 3407 3448 * Check if MSRs (tdx_uret_msrs) can be saved/restored 3408 3449 * before returning to user space. 3409 - * 3410 - * this_cpu_ptr(user_return_msrs)->registered isn't checked 3411 - * because the registration is done at vcpu runtime by 3412 - * tdx_user_return_msr_update_cache(). 3413 3450 */ 3414 3451 tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr); 3415 3452 if (tdx_uret_msrs[i].slot == -1) {
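For illustration of the TD/vCPU life cycle documented in the tdx.c comment above, here is a minimal, self-contained userspace sketch of that state machine. All toy_* names are invented for this note and are not KVM's API; the real transitions are driven by the KVM_TDX_* ioctls.

#include <stdio.h>

/* Toy model of the documented life cycle: KVM_TDX_INIT_VM moves the TD from
 * UNINITIALIZED to INITIALIZED and assigns the hkid; vCPUs may be created
 * only while the TD is INITIALIZED; teardown releases the hkid before the
 * vCPUs are freed, and the TD state is reset last. */
enum toy_td_state { TOY_TD_UNINITIALIZED, TOY_TD_INITIALIZED, TOY_TD_RUNNABLE };

struct toy_td {
	enum toy_td_state state;
	int hkid_assigned;
	int nr_vcpus;
};

static int toy_td_init(struct toy_td *td)
{
	if (td->state != TOY_TD_UNINITIALIZED)
		return -1;
	td->state = TOY_TD_INITIALIZED;
	td->hkid_assigned = 1;	/* "hkid is assigned after this stage" */
	return 0;
}

static int toy_create_vcpu(struct toy_td *td)
{
	if (td->state != TOY_TD_INITIALIZED)
		return -1;	/* step 3 requires TD_STATE_INITIALIZED */
	td->nr_vcpus++;
	return 0;
}

static void toy_destroy_vm(struct toy_td *td)
{
	td->hkid_assigned = 0;	/* 5.1: release the hkid first */
	td->nr_vcpus = 0;	/* 5.2: then free the vCPUs */
	td->state = TOY_TD_UNINITIALIZED;	/* 5.3: then reset the TD */
}

int main(void)
{
	struct toy_td td = { 0 };
	int init_ok = toy_td_init(&td);
	int vcpu_ok = toy_create_vcpu(&td);

	printf("init=%d create_vcpu=%d\n", init_ok, vcpu_ok);
	toy_destroy_vm(&td);
	return 0;
}

The ordering in toy_destroy_vm() mirrors steps 5.1-5.3 of the comment: pages cannot be reclaimed while the hkid is still assigned, which is exactly the invariant tdx_vcpu_free() checks.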
+6 -3
arch/x86/kvm/vmx/tdx.h
··· 36 36 37 37 struct tdx_td td; 38 38 39 - /* For KVM_TDX_INIT_MEM_REGION. */ 40 - atomic64_t nr_premapped; 39 + /* 40 + * Scratch pointer used to pass the source page to tdx_mem_page_add(). 41 + * Protected by slots_lock, and non-NULL only when mapping a private 42 + * pfn via tdx_gmem_post_populate(). 43 + */ 44 + struct page *page_add_src; 41 45 42 46 /* 43 47 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do ··· 71 67 u64 vp_enter_ret; 72 68 73 69 enum vcpu_tdx_state state; 74 - bool guest_entered; 75 70 76 71 u64 map_gpa_next; 77 72 u64 map_gpa_end;
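The new page_add_src field is the classic "scratch pointer guarded by a lock" handoff: rather than threading the source page through every call between kvm_gmem_populate() and tdx_mem_page_add(), the caller parks it in a field that slots_lock serializes. A hedged userspace sketch of the pattern, with a pthread mutex standing in for slots_lock and all names invented:

#include <pthread.h>
#include <stdio.h>

/* Toy version of the handoff: stash the argument in a locked field so a
 * callee several frames deep can pick it up without a parameter. */
static pthread_mutex_t toy_slots_lock = PTHREAD_MUTEX_INITIALIZER;
static void *toy_page_add_src;	/* non-NULL only inside the handoff */

static void toy_page_add(void)
{
	/* The deep callee reads the scratch field instead of a parameter. */
	printf("adding page from src %p\n", toy_page_add_src);
}

static void toy_populate(void *src)
{
	pthread_mutex_lock(&toy_slots_lock);
	toy_page_add_src = src;
	toy_page_add();		/* would be several call frames deep in KVM */
	toy_page_add_src = NULL;
	pthread_mutex_unlock(&toy_slots_lock);
}

int main(void)
{
	int page;

	toy_populate(&page);
	return 0;
}

Compared to the old nr_premapped counter, the scratch pointer also encodes the "only during tdx_gmem_post_populate()" invariant directly: a non-NULL value is proof the caller is mid-handoff.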
+33 -16
arch/x86/kvm/vmx/vmenter.S
··· 71 71 * @regs: unsigned long * (to guest registers) 72 72 * @flags: VMX_RUN_VMRESUME: use VMRESUME instead of VMLAUNCH 73 73 * VMX_RUN_SAVE_SPEC_CTRL: save guest SPEC_CTRL into vmx->spec_ctrl 74 + * VMX_RUN_CLEAR_CPU_BUFFERS_FOR_MMIO: vCPU can access host MMIO 74 75 * 75 76 * Returns: 76 77 * 0 on VM-Exit, 1 on VM-Fail ··· 93 92 /* Save @vmx for SPEC_CTRL handling */ 94 93 push %_ASM_ARG1 95 94 96 - /* Save @flags for SPEC_CTRL handling */ 95 + /* Save @flags (used for VMLAUNCH vs. VMRESUME and mitigations). */ 97 96 push %_ASM_ARG3 98 97 99 98 /* ··· 101 100 * @regs is needed after VM-Exit to save the guest's register values. 102 101 */ 103 102 push %_ASM_ARG2 104 - 105 - /* Copy @flags to EBX, _ASM_ARG3 is volatile. */ 106 - mov %_ASM_ARG3L, %ebx 107 103 108 104 lea (%_ASM_SP), %_ASM_ARG2 109 105 call vmx_update_host_rsp ··· 116 118 * and vmentry. 117 119 */ 118 120 mov 2*WORD_SIZE(%_ASM_SP), %_ASM_DI 119 - movl VMX_spec_ctrl(%_ASM_DI), %edi 120 - movl PER_CPU_VAR(x86_spec_ctrl_current), %esi 121 - cmp %edi, %esi 121 + #ifdef CONFIG_X86_64 122 + mov VMX_spec_ctrl(%rdi), %rdx 123 + cmp PER_CPU_VAR(x86_spec_ctrl_current), %rdx 122 124 je .Lspec_ctrl_done 125 + movl %edx, %eax 126 + shr $32, %rdx 127 + #else 128 + mov VMX_spec_ctrl(%edi), %eax 129 + mov PER_CPU_VAR(x86_spec_ctrl_current), %ecx 130 + xor %eax, %ecx 131 + mov VMX_spec_ctrl + 4(%edi), %edx 132 + mov PER_CPU_VAR(x86_spec_ctrl_current + 4), %edi 133 + xor %edx, %edi 134 + or %edi, %ecx 135 + je .Lspec_ctrl_done 136 + #endif 123 137 mov $MSR_IA32_SPEC_CTRL, %ecx 124 - xor %edx, %edx 125 - mov %edi, %eax 126 138 wrmsr 127 139 128 140 .Lspec_ctrl_done: ··· 144 136 145 137 /* Load @regs to RAX. */ 146 138 mov (%_ASM_SP), %_ASM_AX 147 - 148 - /* Check if vmlaunch or vmresume is needed */ 149 - bt $VMX_RUN_VMRESUME_SHIFT, %ebx 150 139 151 140 /* Load guest registers. Don't clobber flags. */ 152 141 mov VCPU_RCX(%_ASM_AX), %_ASM_CX ··· 165 160 /* Load guest RAX. This kills the @regs pointer! */ 166 161 mov VCPU_RAX(%_ASM_AX), %_ASM_AX 167 162 168 - /* Clobbers EFLAGS.ZF */ 169 - CLEAR_CPU_BUFFERS 163 + /* 164 + * Note, ALTERNATIVE_2 works in reverse order. If CLEAR_CPU_BUF_VM is 165 + * enabled, do VERW unconditionally. If CPU_BUF_VM_MMIO is enabled, 166 + * check @flags to see if the vCPU has access to host MMIO, and if so, 167 + * do VERW. Else, do nothing (no mitigations needed/enabled). 168 + */ 169 + ALTERNATIVE_2 "", \ 170 + __stringify(testl $VMX_RUN_CLEAR_CPU_BUFFERS_FOR_MMIO, WORD_SIZE(%_ASM_SP); \ 171 + jz .Lskip_mmio_verw; \ 172 + VERW; \ 173 + .Lskip_mmio_verw:), \ 174 + X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO, \ 175 + __stringify(VERW), X86_FEATURE_CLEAR_CPU_BUF_VM 170 176 171 - /* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */ 172 - jnc .Lvmlaunch 177 + /* Check @flags to see if VMLAUNCH or VMRESUME is needed. */ 178 + testl $VMX_RUN_VMRESUME, WORD_SIZE(%_ASM_SP) 179 + jz .Lvmlaunch 173 180 174 181 /* 175 182 * After a successful VMRESUME/VMLAUNCH, control flow "magically"
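The reworked SPEC_CTRL sequence above compares the full 64-bit value before deciding to WRMSR, then splits it into the EDX:EAX halves the instruction consumes (on 64-bit via movl %edx, %eax / shr $32, %rdx). A small C model of that split; toy_wrmsr() is illustrative and only prints what the assembly would load:

#include <stdint.h>
#include <stdio.h>

/* WRMSR takes the 64-bit value split across EDX:EAX; the 64-bit path above
 * derives both halves from a single 64-bit load, roughly like this. */
static void toy_wrmsr(uint32_t msr, uint64_t val)
{
	uint32_t eax = (uint32_t)val;		/* low half */
	uint32_t edx = (uint32_t)(val >> 32);	/* high half */

	printf("wrmsr 0x%08x <- edx=0x%08x eax=0x%08x\n", msr, edx, eax);
}

int main(void)
{
	uint64_t guest_spec_ctrl = 0x0000000100000003ull;	/* made-up value */
	uint64_t host_spec_ctrl = 0;

	/* Like the assembly, skip the WRMSR entirely when the values match. */
	if (guest_spec_ctrl != host_spec_ctrl)
		toy_wrmsr(0x48 /* MSR_IA32_SPEC_CTRL */, guest_spec_ctrl);
	return 0;
}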
+173 -150
arch/x86/kvm/vmx/vmx.c
··· 203 203 204 204 struct x86_pmu_lbr __ro_after_init vmx_lbr_caps; 205 205 206 + #ifdef CONFIG_CPU_MITIGATIONS 206 207 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush); 207 208 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond); 208 209 static DEFINE_MUTEX(vmx_l1d_flush_mutex); ··· 226 225 #define L1D_CACHE_ORDER 4 227 226 static void *vmx_l1d_flush_pages; 228 227 229 - static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf) 228 + static int __vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf) 230 229 { 231 230 struct page *page; 232 231 unsigned int i; ··· 303 302 return 0; 304 303 } 305 304 305 + static int vmx_setup_l1d_flush(void) 306 + { 307 + /* 308 + * Hand the parameter mitigation value in which was stored in the pre 309 + * module init parser. If no parameter was given, it will contain 310 + * 'auto' which will be turned into the default 'cond' mitigation mode. 311 + */ 312 + return __vmx_setup_l1d_flush(vmentry_l1d_flush_param); 313 + } 314 + 315 + static void vmx_cleanup_l1d_flush(void) 316 + { 317 + if (vmx_l1d_flush_pages) { 318 + free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER); 319 + vmx_l1d_flush_pages = NULL; 320 + } 321 + /* Restore state so sysfs ignores VMX */ 322 + l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO; 323 + } 324 + 306 325 static int vmentry_l1d_flush_parse(const char *s) 307 326 { 308 327 unsigned int i; ··· 360 339 } 361 340 362 341 mutex_lock(&vmx_l1d_flush_mutex); 363 - ret = vmx_setup_l1d_flush(l1tf); 342 + ret = __vmx_setup_l1d_flush(l1tf); 364 343 mutex_unlock(&vmx_l1d_flush_mutex); 365 344 return ret; 366 345 } ··· 372 351 373 352 return sysfs_emit(s, "%s\n", vmentry_l1d_param[l1tf_vmx_mitigation].option); 374 353 } 354 + 355 + /* 356 + * Software based L1D cache flush which is used when microcode providing 357 + * the cache control MSR is not loaded. 358 + * 359 + * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but to 360 + * flush it is required to read in 64 KiB because the replacement algorithm 361 + * is not exactly LRU. This could be sized at runtime via topology 362 + * information but as all relevant affected CPUs have 32KiB L1D cache size 363 + * there is no point in doing so. 364 + */ 365 + static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu) 366 + { 367 + int size = PAGE_SIZE << L1D_CACHE_ORDER; 368 + 369 + if (!static_branch_unlikely(&vmx_l1d_should_flush)) 370 + return; 371 + 372 + /* 373 + * This code is only executed when the flush mode is 'cond' or 374 + * 'always' 375 + */ 376 + if (static_branch_likely(&vmx_l1d_flush_cond)) { 377 + /* 378 + * Clear the per-cpu flush bit, it gets set again if the vCPU 379 + * is reloaded, i.e. if the vCPU is scheduled out or if KVM 380 + * exits to userspace, or if KVM reaches one of the unsafe 381 + * VMEXIT handlers, e.g. if KVM calls into the emulator, 382 + * or from the interrupt handlers. 
383 + */ 384 + if (!kvm_get_cpu_l1tf_flush_l1d()) 385 + return; 386 + kvm_clear_cpu_l1tf_flush_l1d(); 387 + } 388 + 389 + vcpu->stat.l1d_flush++; 390 + 391 + if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) { 392 + native_wrmsrq(MSR_IA32_FLUSH_CMD, L1D_FLUSH); 393 + return; 394 + } 395 + 396 + asm volatile( 397 + /* First ensure the pages are in the TLB */ 398 + "xorl %%eax, %%eax\n" 399 + ".Lpopulate_tlb:\n\t" 400 + "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t" 401 + "addl $4096, %%eax\n\t" 402 + "cmpl %%eax, %[size]\n\t" 403 + "jne .Lpopulate_tlb\n\t" 404 + "xorl %%eax, %%eax\n\t" 405 + "cpuid\n\t" 406 + /* Now fill the cache */ 407 + "xorl %%eax, %%eax\n" 408 + ".Lfill_cache:\n" 409 + "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t" 410 + "addl $64, %%eax\n\t" 411 + "cmpl %%eax, %[size]\n\t" 412 + "jne .Lfill_cache\n\t" 413 + "lfence\n" 414 + :: [flush_pages] "r" (vmx_l1d_flush_pages), 415 + [size] "r" (size) 416 + : "eax", "ebx", "ecx", "edx"); 417 + } 418 + 419 + #else /* CONFIG_CPU_MITIGATIONS */ 420 + static int vmx_setup_l1d_flush(void) 421 + { 422 + l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NEVER; 423 + return 0; 424 + } 425 + static void vmx_cleanup_l1d_flush(void) 426 + { 427 + l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO; 428 + } 429 + static __always_inline void vmx_l1d_flush(struct kvm_vcpu *vcpu) 430 + { 431 + 432 + } 433 + static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp) 434 + { 435 + pr_warn_once("Kernel compiled without mitigations, ignoring vmentry_l1d_flush\n"); 436 + return 0; 437 + } 438 + static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp) 439 + { 440 + return sysfs_emit(s, "never\n"); 441 + } 442 + #endif 443 + 444 + static const struct kernel_param_ops vmentry_l1d_flush_ops = { 445 + .set = vmentry_l1d_flush_set, 446 + .get = vmentry_l1d_flush_get, 447 + }; 448 + module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644); 375 449 376 450 static __always_inline void vmx_disable_fb_clear(struct vcpu_vmx *vmx) 377 451 { ··· 519 403 (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO))) 520 404 vmx->disable_fb_clear = false; 521 405 } 522 - 523 - static const struct kernel_param_ops vmentry_l1d_flush_ops = { 524 - .set = vmentry_l1d_flush_set, 525 - .get = vmentry_l1d_flush_get, 526 - }; 527 - module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, NULL, 0644); 528 406 529 407 static u32 vmx_segment_access_rights(struct kvm_segment *var); 530 408 ··· 862 752 loaded_vmcs->launched = 0; 863 753 864 754 865 - void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs) 755 + static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs) 866 756 { 867 757 int cpu = loaded_vmcs->cpu; 868 758 ··· 1013 903 if (!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)) 1014 904 flags |= VMX_RUN_SAVE_SPEC_CTRL; 1015 905 1016 - if (static_branch_unlikely(&cpu_buf_vm_clear) && 906 + if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO) && 1017 907 kvm_vcpu_can_access_host_mmio(&vmx->vcpu)) 1018 908 flags |= VMX_RUN_CLEAR_CPU_BUFFERS_FOR_MMIO; 1019 909 ··· 3329 3219 return to_vmx(vcpu)->vpid; 3330 3220 } 3331 3221 3222 + static u64 construct_eptp(hpa_t root_hpa) 3223 + { 3224 + u64 eptp = root_hpa | VMX_EPTP_MT_WB; 3225 + struct kvm_mmu_page *root; 3226 + 3227 + if (kvm_mmu_is_dummy_root(root_hpa)) 3228 + return eptp | VMX_EPTP_PWL_4; 3229 + 3230 + /* 3231 + * EPT roots should always have an associated MMU page. Return a "bad" 3232 + * EPTP to induce VM-Fail instead of continuing on in an unknown state. 
3233 + */ 3234 + root = root_to_sp(root_hpa); 3235 + if (WARN_ON_ONCE(!root)) 3236 + return INVALID_PAGE; 3237 + 3238 + eptp |= (root->role.level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4; 3239 + 3240 + if (enable_ept_ad_bits && !root->role.ad_disabled) 3241 + eptp |= VMX_EPTP_AD_ENABLE_BIT; 3242 + 3243 + return eptp; 3244 + } 3245 + 3246 + static void vmx_flush_tlb_ept_root(hpa_t root_hpa) 3247 + { 3248 + u64 eptp = construct_eptp(root_hpa); 3249 + 3250 + if (VALID_PAGE(eptp)) 3251 + ept_sync_context(eptp); 3252 + else 3253 + ept_sync_global(); 3254 + } 3255 + 3332 3256 void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) 3333 3257 { 3334 3258 struct kvm_mmu *mmu = vcpu->arch.mmu; ··· 3373 3229 return; 3374 3230 3375 3231 if (enable_ept) 3376 - ept_sync_context(construct_eptp(vcpu, root_hpa, 3377 - mmu->root_role.level)); 3232 + vmx_flush_tlb_ept_root(root_hpa); 3378 3233 else 3379 3234 vpid_sync_context(vmx_get_current_vpid(vcpu)); 3380 3235 } ··· 3539 3396 return 4; 3540 3397 } 3541 3398 3542 - u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) 3543 - { 3544 - u64 eptp = VMX_EPTP_MT_WB; 3545 - 3546 - eptp |= (root_level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4; 3547 - 3548 - if (enable_ept_ad_bits && 3549 - (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu))) 3550 - eptp |= VMX_EPTP_AD_ENABLE_BIT; 3551 - eptp |= root_hpa; 3552 - 3553 - return eptp; 3554 - } 3555 - 3556 3399 void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) 3557 3400 { 3558 3401 struct kvm *kvm = vcpu->kvm; 3559 3402 bool update_guest_cr3 = true; 3560 3403 unsigned long guest_cr3; 3561 - u64 eptp; 3562 3404 3563 3405 if (enable_ept) { 3564 - eptp = construct_eptp(vcpu, root_hpa, root_level); 3565 - vmcs_write64(EPT_POINTER, eptp); 3406 + KVM_MMU_WARN_ON(root_to_sp(root_hpa) && 3407 + root_level != root_to_sp(root_hpa)->role.level); 3408 + vmcs_write64(EPT_POINTER, construct_eptp(root_hpa)); 3566 3409 3567 3410 hv_track_root_tdp(vcpu, root_hpa); 3568 3411 ··· 6760 6631 return kvm_vmx_exit_handlers[exit_handler_index](vcpu); 6761 6632 6762 6633 unexpected_vmexit: 6763 - vcpu_unimpl(vcpu, "vmx: unexpected exit reason 0x%x\n", 6764 - exit_reason.full); 6765 6634 dump_vmcs(vcpu); 6766 - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; 6767 - vcpu->run->internal.suberror = 6768 - KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON; 6769 - vcpu->run->internal.ndata = 2; 6770 - vcpu->run->internal.data[0] = exit_reason.full; 6771 - vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu; 6635 + kvm_prepare_unexpected_reason_exit(vcpu, exit_reason.full); 6772 6636 return 0; 6773 6637 } 6774 6638 ··· 6781 6659 return 0; 6782 6660 } 6783 6661 return ret; 6784 - } 6785 - 6786 - /* 6787 - * Software based L1D cache flush which is used when microcode providing 6788 - * the cache control MSR is not loaded. 6789 - * 6790 - * The L1D cache is 32 KiB on Nehalem and later microarchitectures, but to 6791 - * flush it is required to read in 64 KiB because the replacement algorithm 6792 - * is not exactly LRU. This could be sized at runtime via topology 6793 - * information but as all relevant affected CPUs have 32KiB L1D cache size 6794 - * there is no point in doing so. 
6795 - */ 6796 - static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu) 6797 - { 6798 - int size = PAGE_SIZE << L1D_CACHE_ORDER; 6799 - 6800 - /* 6801 - * This code is only executed when the flush mode is 'cond' or 6802 - * 'always' 6803 - */ 6804 - if (static_branch_likely(&vmx_l1d_flush_cond)) { 6805 - bool flush_l1d; 6806 - 6807 - /* 6808 - * Clear the per-vcpu flush bit, it gets set again if the vCPU 6809 - * is reloaded, i.e. if the vCPU is scheduled out or if KVM 6810 - * exits to userspace, or if KVM reaches one of the unsafe 6811 - * VMEXIT handlers, e.g. if KVM calls into the emulator. 6812 - */ 6813 - flush_l1d = vcpu->arch.l1tf_flush_l1d; 6814 - vcpu->arch.l1tf_flush_l1d = false; 6815 - 6816 - /* 6817 - * Clear the per-cpu flush bit, it gets set again from 6818 - * the interrupt handlers. 6819 - */ 6820 - flush_l1d |= kvm_get_cpu_l1tf_flush_l1d(); 6821 - kvm_clear_cpu_l1tf_flush_l1d(); 6822 - 6823 - if (!flush_l1d) 6824 - return; 6825 - } 6826 - 6827 - vcpu->stat.l1d_flush++; 6828 - 6829 - if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) { 6830 - native_wrmsrq(MSR_IA32_FLUSH_CMD, L1D_FLUSH); 6831 - return; 6832 - } 6833 - 6834 - asm volatile( 6835 - /* First ensure the pages are in the TLB */ 6836 - "xorl %%eax, %%eax\n" 6837 - ".Lpopulate_tlb:\n\t" 6838 - "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t" 6839 - "addl $4096, %%eax\n\t" 6840 - "cmpl %%eax, %[size]\n\t" 6841 - "jne .Lpopulate_tlb\n\t" 6842 - "xorl %%eax, %%eax\n\t" 6843 - "cpuid\n\t" 6844 - /* Now fill the cache */ 6845 - "xorl %%eax, %%eax\n" 6846 - ".Lfill_cache:\n" 6847 - "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t" 6848 - "addl $64, %%eax\n\t" 6849 - "cmpl %%eax, %[size]\n\t" 6850 - "jne .Lfill_cache\n\t" 6851 - "lfence\n" 6852 - :: [flush_pages] "r" (vmx_l1d_flush_pages), 6853 - [size] "r" (size) 6854 - : "eax", "ebx", "ecx", "edx"); 6855 6662 } 6856 6663 6857 6664 void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) ··· 7101 7050 if (to_vt(vcpu)->emulation_required) 7102 7051 return; 7103 7052 7104 - if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXTERNAL_INTERRUPT) 7053 + switch (vmx_get_exit_reason(vcpu).basic) { 7054 + case EXIT_REASON_EXTERNAL_INTERRUPT: 7105 7055 handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu)); 7106 - else if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXCEPTION_NMI) 7056 + break; 7057 + case EXIT_REASON_EXCEPTION_NMI: 7107 7058 handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu)); 7059 + break; 7060 + case EXIT_REASON_MCE_DURING_VMENTRY: 7061 + kvm_machine_check(); 7062 + break; 7063 + default: 7064 + break; 7065 + } 7108 7066 } 7109 7067 7110 7068 /* ··· 7388 7328 7389 7329 guest_state_enter_irqoff(); 7390 7330 7391 - /* 7392 - * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW 7393 - * mitigation for MDS is done late in VMentry and is still 7394 - * executed in spite of L1D Flush. This is because an extra VERW 7395 - * should not matter much after the big hammer L1D Flush. 7396 - * 7397 - * cpu_buf_vm_clear is used when system is not vulnerable to MDS/TAA, 7398 - * and is affected by MMIO Stale Data. In such cases mitigation in only 7399 - * needed against an MMIO capable guest. 
7400 - */ 7401 - if (static_branch_unlikely(&vmx_l1d_should_flush)) 7402 - vmx_l1d_flush(vcpu); 7403 - else if (static_branch_unlikely(&cpu_buf_vm_clear) && 7404 - (flags & VMX_RUN_CLEAR_CPU_BUFFERS_FOR_MMIO)) 7405 - x86_clear_cpu_buffers(); 7331 + vmx_l1d_flush(vcpu); 7406 7332 7407 7333 vmx_disable_fb_clear(vmx); 7408 7334 ··· 7500 7454 if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) 7501 7455 vmx_set_interrupt_shadow(vcpu, 0); 7502 7456 7503 - kvm_load_guest_xsave_state(vcpu); 7504 - 7505 7457 pt_guest_enter(vmx); 7506 7458 7507 7459 atomic_switch_perf_msrs(vmx); ··· 7543 7499 7544 7500 pt_guest_exit(vmx); 7545 7501 7546 - kvm_load_host_xsave_state(vcpu); 7547 - 7548 7502 if (is_guest_mode(vcpu)) { 7549 7503 /* 7550 7504 * Track VMLAUNCH/VMRESUME that have made past guest state ··· 7557 7515 7558 7516 if (unlikely(vmx->fail)) 7559 7517 return EXIT_FASTPATH_NONE; 7560 - 7561 - if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY)) 7562 - kvm_machine_check(); 7563 7518 7564 7519 trace_kvm_exit(vcpu, KVM_ISA_VMX); 7565 7520 ··· 8718 8679 return r; 8719 8680 } 8720 8681 8721 - static void vmx_cleanup_l1d_flush(void) 8722 - { 8723 - if (vmx_l1d_flush_pages) { 8724 - free_pages((unsigned long)vmx_l1d_flush_pages, L1D_CACHE_ORDER); 8725 - vmx_l1d_flush_pages = NULL; 8726 - } 8727 - /* Restore state so sysfs ignores VMX */ 8728 - l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO; 8729 - } 8730 - 8731 8682 void vmx_exit(void) 8732 8683 { 8733 8684 allow_smaller_maxphyaddr = false; ··· 8753 8724 if (r) 8754 8725 return r; 8755 8726 8756 - /* 8757 - * Must be called after common x86 init so enable_ept is properly set 8758 - * up. Hand the parameter mitigation value in which was stored in 8759 - * the pre module init parser. If no parameter was given, it will 8760 - * contain 'auto' which will be turned into the default 'cond' 8761 - * mitigation mode. 8762 - */ 8763 - r = vmx_setup_l1d_flush(vmentry_l1d_flush_param); 8727 + /* Must be called after common x86 init so enable_ept is setup. */ 8728 + r = vmx_setup_l1d_flush(); 8764 8729 if (r) 8765 8730 goto err_l1d_flush; 8766 8731
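construct_eptp() now derives the page-walk level from the root's kvm_mmu_page instead of taking it as a parameter. For reference, a self-contained model of the EPTP encoding it builds; the TOY_* constants are defined locally for illustration and mirror the architectural layout (bits 2:0 memory type, bits 5:3 walk length minus one, bit 6 accessed/dirty enable), not the kernel's VMX_EPTP_* headers:

#include <stdint.h>
#include <stdio.h>

/* Architectural EPTP layout (Intel SDM): bits 2:0 = memory type (6 = WB),
 * bits 5:3 = page-walk length - 1, bit 6 = enable accessed/dirty flags,
 * upper bits = physical address of the root table. */
#define TOY_EPTP_MT_WB		6ull
#define TOY_EPTP_PWL(levels)	(((uint64_t)(levels) - 1) << 3)
#define TOY_EPTP_AD_ENABLE	(1ull << 6)

static uint64_t toy_construct_eptp(uint64_t root_hpa, int levels, int ad)
{
	uint64_t eptp = root_hpa | TOY_EPTP_MT_WB | TOY_EPTP_PWL(levels);

	if (ad)
		eptp |= TOY_EPTP_AD_ENABLE;
	return eptp;
}

int main(void)
{
	/* 4-level walk, A/D enabled, root at 0x1000: expect 0x105e. */
	printf("eptp = 0x%llx\n",
	       (unsigned long long)toy_construct_eptp(0x1000, 4, 1));
	return 0;
}

Deriving the walk length from root->role.level is what lets the new code return an intentionally invalid EPTP (and take a VM-Fail) when a non-dummy root unexpectedly has no MMU page, instead of guessing a level.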
-2
arch/x86/kvm/vmx/vmx.h
··· 369 369 void ept_save_pdptrs(struct kvm_vcpu *vcpu); 370 370 void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); 371 371 void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); 372 - u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level); 373 372 374 373 bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu); 375 374 void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu); ··· 680 681 void free_vmcs(struct vmcs *vmcs); 681 682 int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs); 682 683 void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs); 683 - void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs); 684 684 685 685 static inline struct vmcs *alloc_vmcs(bool shadow) 686 686 {
+1 -1
arch/x86/kvm/vmx/x86_ops.h
··· 73 73 void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt); 74 74 void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt); 75 75 void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt); 76 - void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned long val); 77 76 void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val); 78 77 void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu); 79 78 void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg); ··· 148 149 int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr); 149 150 150 151 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp); 152 + int tdx_vcpu_unlocked_ioctl(struct kvm_vcpu *vcpu, void __user *argp); 151 153 152 154 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu); 153 155 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
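The new tdx_vcpu_unlocked_ioctl() hook exists so the TDX code can take vcpu->mutex (via kvm_lock_all_vcpus()) before kvm->slots_lock, the ordering called out in tdx_acquire_vm_state_locks() above. A toy sketch of enforcing a fixed two-lock hierarchy through a single pair of helpers; the names and the pthread locks are illustrative only:

#include <pthread.h>
#include <stdio.h>

/* Two locks with a documented hierarchy: toy_vcpu_mutex is always taken
 * before toy_slots_lock, never the other way around. Funneling every
 * acquisition through one helper makes the order impossible to get wrong
 * at call sites. */
static pthread_mutex_t toy_vcpu_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t toy_slots_lock = PTHREAD_MUTEX_INITIALIZER;

static void toy_lock_vm_state(void)
{
	pthread_mutex_lock(&toy_vcpu_mutex);	/* outer lock first */
	pthread_mutex_lock(&toy_slots_lock);	/* inner lock second */
}

static void toy_unlock_vm_state(void)
{
	pthread_mutex_unlock(&toy_slots_lock);	/* release in reverse order */
	pthread_mutex_unlock(&toy_vcpu_mutex);
}

int main(void)
{
	toy_lock_vm_state();
	printf("vm state locked in the documented order\n");
	toy_unlock_vm_state();
	return 0;
}

The "unlocked" dispatch matters because the common vcpu ioctl path already holds vcpu->mutex, which would make taking slots_lock-then-vcpu->mutex (or locking all other vCPUs) a deadlock-prone inversion.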
+156 -129
arch/x86/kvm/x86.c
··· 159 159 unsigned int min_timer_period_us = 200; 160 160 module_param(min_timer_period_us, uint, 0644); 161 161 162 - static bool __read_mostly kvmclock_periodic_sync = true; 163 - module_param(kvmclock_periodic_sync, bool, 0444); 164 - 165 162 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */ 166 163 static u32 __read_mostly tsc_tolerance_ppm = 250; 167 164 module_param(tsc_tolerance_ppm, uint, 0644); ··· 209 212 u32 __read_mostly kvm_nr_uret_msrs; 210 213 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_nr_uret_msrs); 211 214 static u32 __read_mostly kvm_uret_msrs_list[KVM_MAX_NR_USER_RETURN_MSRS]; 212 - static struct kvm_user_return_msrs __percpu *user_return_msrs; 215 + static DEFINE_PER_CPU(struct kvm_user_return_msrs, user_return_msrs); 213 216 214 217 #define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \ 215 218 | XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \ ··· 572 575 vcpu->arch.apf.gfns[i] = ~0; 573 576 } 574 577 578 + static void kvm_destroy_user_return_msrs(void) 579 + { 580 + int cpu; 581 + 582 + for_each_possible_cpu(cpu) 583 + WARN_ON_ONCE(per_cpu(user_return_msrs, cpu).registered); 584 + 585 + kvm_nr_uret_msrs = 0; 586 + } 587 + 575 588 static void kvm_on_user_return(struct user_return_notifier *urn) 576 589 { 577 590 unsigned slot; 578 591 struct kvm_user_return_msrs *msrs 579 592 = container_of(urn, struct kvm_user_return_msrs, urn); 580 593 struct kvm_user_return_msr_values *values; 581 - unsigned long flags; 582 594 583 - /* 584 - * Disabling irqs at this point since the following code could be 585 - * interrupted and executed through kvm_arch_disable_virtualization_cpu() 586 - */ 587 - local_irq_save(flags); 588 - if (msrs->registered) { 589 - msrs->registered = false; 590 - user_return_notifier_unregister(urn); 591 - } 592 - local_irq_restore(flags); 595 + msrs->registered = false; 596 + user_return_notifier_unregister(urn); 597 + 593 598 for (slot = 0; slot < kvm_nr_uret_msrs; ++slot) { 594 599 values = &msrs->values[slot]; 595 600 if (values->host != values->curr) { ··· 642 643 643 644 static void kvm_user_return_msr_cpu_online(void) 644 645 { 645 - struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs); 646 + struct kvm_user_return_msrs *msrs = this_cpu_ptr(&user_return_msrs); 646 647 u64 value; 647 648 int i; 648 649 ··· 664 665 665 666 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask) 666 667 { 667 - struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs); 668 + struct kvm_user_return_msrs *msrs = this_cpu_ptr(&user_return_msrs); 668 669 int err; 669 670 670 671 value = (value & mask) | (msrs->values[slot].host & ~mask); ··· 680 681 } 681 682 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_set_user_return_msr); 682 683 683 - void kvm_user_return_msr_update_cache(unsigned int slot, u64 value) 684 - { 685 - struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs); 686 - 687 - msrs->values[slot].curr = value; 688 - kvm_user_return_register_notifier(msrs); 689 - } 690 - EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_user_return_msr_update_cache); 691 - 692 684 u64 kvm_get_user_return_msr(unsigned int slot) 693 685 { 694 - return this_cpu_ptr(user_return_msrs)->values[slot].curr; 686 + return this_cpu_ptr(&user_return_msrs)->values[slot].curr; 695 687 } 696 688 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_get_user_return_msr); 697 689 698 690 static void drop_user_return_notifiers(void) 699 691 { 700 - struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs); 692 + struct kvm_user_return_msrs *msrs = 
this_cpu_ptr(&user_return_msrs); 701 693 702 694 if (msrs->registered) 703 695 kvm_on_user_return(&msrs->urn); ··· 1035 1045 } 1036 1046 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_require_dr); 1037 1047 1048 + static bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu) 1049 + { 1050 + u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT; 1051 + 1052 + return (vcpu->arch.apf.msr_en_val & mask) == mask; 1053 + } 1054 + 1038 1055 static inline u64 pdptr_rsvd_bits(struct kvm_vcpu *vcpu) 1039 1056 { 1040 1057 return vcpu->arch.reserved_gpa_bits | rsvd_bits(5, 8) | rsvd_bits(1, 2); ··· 1134 1137 } 1135 1138 1136 1139 if ((cr0 ^ old_cr0) & X86_CR0_PG) { 1137 - kvm_clear_async_pf_completion_queue(vcpu); 1138 - kvm_async_pf_hash_reset(vcpu); 1139 - 1140 1140 /* 1141 1141 * Clearing CR0.PG is defined to flush the TLB from the guest's 1142 1142 * perspective. 1143 1143 */ 1144 1144 if (!(cr0 & X86_CR0_PG)) 1145 1145 kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu); 1146 + /* 1147 + * Check for async #PF completion events when enabling paging, 1148 + * as the vCPU may have previously encountered async #PFs (it's 1149 + * entirely legal for the guest to toggle paging on/off without 1150 + * waiting for the async #PF queue to drain). 1151 + */ 1152 + else if (kvm_pv_async_pf_enabled(vcpu)) 1153 + kvm_make_request(KVM_REQ_APF_READY, vcpu); 1146 1154 } 1147 1155 1148 1156 if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) ··· 1205 1203 } 1206 1204 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_lmsw); 1207 1205 1208 - void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu) 1206 + static void kvm_load_xfeatures(struct kvm_vcpu *vcpu, bool load_guest) 1209 1207 { 1210 1208 if (vcpu->arch.guest_state_protected) 1211 1209 return; 1212 1210 1213 - if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) { 1211 + if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) 1212 + return; 1214 1213 1215 - if (vcpu->arch.xcr0 != kvm_host.xcr0) 1216 - xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0); 1214 + if (vcpu->arch.xcr0 != kvm_host.xcr0) 1215 + xsetbv(XCR_XFEATURE_ENABLED_MASK, 1216 + load_guest ? vcpu->arch.xcr0 : kvm_host.xcr0); 1217 1217 1218 - if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) && 1219 - vcpu->arch.ia32_xss != kvm_host.xss) 1220 - wrmsrq(MSR_IA32_XSS, vcpu->arch.ia32_xss); 1221 - } 1218 + if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) && 1219 + vcpu->arch.ia32_xss != kvm_host.xss) 1220 + wrmsrq(MSR_IA32_XSS, load_guest ? 
vcpu->arch.ia32_xss : kvm_host.xss); 1221 + } 1222 + 1223 + static void kvm_load_guest_pkru(struct kvm_vcpu *vcpu) 1224 + { 1225 + if (vcpu->arch.guest_state_protected) 1226 + return; 1222 1227 1223 1228 if (cpu_feature_enabled(X86_FEATURE_PKU) && 1224 1229 vcpu->arch.pkru != vcpu->arch.host_pkru && ··· 1233 1224 kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) 1234 1225 wrpkru(vcpu->arch.pkru); 1235 1226 } 1236 - EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_guest_xsave_state); 1237 1227 1238 - void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu) 1228 + static void kvm_load_host_pkru(struct kvm_vcpu *vcpu) 1239 1229 { 1240 1230 if (vcpu->arch.guest_state_protected) 1241 1231 return; ··· 1246 1238 if (vcpu->arch.pkru != vcpu->arch.host_pkru) 1247 1239 wrpkru(vcpu->arch.host_pkru); 1248 1240 } 1249 - 1250 - if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) { 1251 - 1252 - if (vcpu->arch.xcr0 != kvm_host.xcr0) 1253 - xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0); 1254 - 1255 - if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) && 1256 - vcpu->arch.ia32_xss != kvm_host.xss) 1257 - wrmsrq(MSR_IA32_XSS, kvm_host.xss); 1258 - } 1259 - 1260 1241 } 1261 - EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_host_xsave_state); 1262 1242 1263 1243 #ifdef CONFIG_X86_64 1264 1244 static inline u64 kvm_guest_supported_xfd(struct kvm_vcpu *vcpu) ··· 3501 3505 /* 3502 3506 * kvmclock updates which are isolated to a given vcpu, such as 3503 3507 * vcpu->cpu migration, should not allow system_timestamp from 3504 - * the rest of the vcpus to remain static. Otherwise ntp frequency 3505 - * correction applies to one vcpu's system_timestamp but not 3506 - * the others. 3508 + * the rest of the vcpus to remain static. 3507 3509 * 3508 3510 * So in those cases, request a kvmclock update for all vcpus. 3509 - * We need to rate-limit these requests though, as they can 3510 - * considerably slow guests that have a large number of vcpus. 3511 - * The time for a remote vcpu to update its kvmclock is bound 3512 - * by the delay we use to rate-limit the updates. 3511 + * The worst case for a remote vcpu to update its kvmclock 3512 + * is then bounded by maximum nohz sleep latency. 
3513 3513 */ 3514 - 3515 - #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100) 3516 - 3517 - static void kvmclock_update_fn(struct work_struct *work) 3514 + static void kvm_gen_kvmclock_update(struct kvm_vcpu *v) 3518 3515 { 3519 3516 unsigned long i; 3520 - struct delayed_work *dwork = to_delayed_work(work); 3521 - struct kvm_arch *ka = container_of(dwork, struct kvm_arch, 3522 - kvmclock_update_work); 3523 - struct kvm *kvm = container_of(ka, struct kvm, arch); 3524 3517 struct kvm_vcpu *vcpu; 3518 + struct kvm *kvm = v->kvm; 3525 3519 3526 3520 kvm_for_each_vcpu(i, vcpu, kvm) { 3527 3521 kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); 3528 3522 kvm_vcpu_kick(vcpu); 3529 3523 } 3530 - } 3531 - 3532 - static void kvm_gen_kvmclock_update(struct kvm_vcpu *v) 3533 - { 3534 - struct kvm *kvm = v->kvm; 3535 - 3536 - kvm_make_request(KVM_REQ_CLOCK_UPDATE, v); 3537 - schedule_delayed_work(&kvm->arch.kvmclock_update_work, 3538 - KVMCLOCK_UPDATE_DELAY); 3539 - } 3540 - 3541 - #define KVMCLOCK_SYNC_PERIOD (300 * HZ) 3542 - 3543 - static void kvmclock_sync_fn(struct work_struct *work) 3544 - { 3545 - struct delayed_work *dwork = to_delayed_work(work); 3546 - struct kvm_arch *ka = container_of(dwork, struct kvm_arch, 3547 - kvmclock_sync_work); 3548 - struct kvm *kvm = container_of(ka, struct kvm, arch); 3549 - 3550 - schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0); 3551 - schedule_delayed_work(&kvm->arch.kvmclock_sync_work, 3552 - KVMCLOCK_SYNC_PERIOD); 3553 3524 } 3554 3525 3555 3526 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */ ··· 3611 3648 return 1; 3612 3649 } 3613 3650 return 0; 3614 - } 3615 - 3616 - static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu) 3617 - { 3618 - u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT; 3619 - 3620 - return (vcpu->arch.apf.msr_en_val & mask) == mask; 3621 3651 } 3622 3652 3623 3653 static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data) ··· 4138 4182 if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT)) 4139 4183 return 1; 4140 4184 if (data & 0x1) { 4141 - vcpu->arch.apf.pageready_pending = false; 4185 + /* 4186 + * Pairs with the smp_mb__after_atomic() in 4187 + * kvm_arch_async_page_present_queued(). 4188 + */ 4189 + smp_store_mb(vcpu->arch.apf.pageready_pending, false); 4190 + 4142 4191 kvm_check_async_pf_completion(vcpu); 4143 4192 } 4144 4193 break; ··· 5149 5188 { 5150 5189 struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); 5151 5190 5152 - vcpu->arch.l1tf_flush_l1d = true; 5191 + kvm_request_l1tf_flush_l1d(); 5153 5192 5154 5193 if (vcpu->scheduled_out && pmu->version && pmu->event_count) { 5155 5194 pmu->need_cleanup = true; ··· 7200 7239 return 0; 7201 7240 } 7202 7241 7242 + long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl, 7243 + unsigned long arg) 7244 + { 7245 + struct kvm_vcpu *vcpu = filp->private_data; 7246 + void __user *argp = (void __user *)arg; 7247 + 7248 + if (ioctl == KVM_MEMORY_ENCRYPT_OP && 7249 + kvm_x86_ops.vcpu_mem_enc_unlocked_ioctl) 7250 + return kvm_x86_call(vcpu_mem_enc_unlocked_ioctl)(vcpu, argp); 7251 + 7252 + return -ENOIOCTLCMD; 7253 + } 7254 + 7203 7255 int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 7204 7256 { 7205 7257 struct kvm *kvm = filp->private_data; ··· 7972 7998 unsigned int bytes, struct x86_exception *exception) 7973 7999 { 7974 8000 /* kvm_write_guest_virt_system can pull in tons of pages. 
*/ 7975 - vcpu->arch.l1tf_flush_l1d = true; 8001 + kvm_request_l1tf_flush_l1d(); 7976 8002 7977 8003 return kvm_write_guest_virt_helper(addr, val, bytes, vcpu, 7978 8004 PFERR_WRITE_MASK, exception); ··· 8816 8842 kvm_make_request(KVM_REQ_TRIPLE_FAULT, emul_to_vcpu(ctxt)); 8817 8843 } 8818 8844 8845 + static int emulator_get_xcr(struct x86_emulate_ctxt *ctxt, u32 index, u64 *xcr) 8846 + { 8847 + if (index != XCR_XFEATURE_ENABLED_MASK) 8848 + return 1; 8849 + *xcr = emul_to_vcpu(ctxt)->arch.xcr0; 8850 + return 0; 8851 + } 8852 + 8819 8853 static int emulator_set_xcr(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr) 8820 8854 { 8821 8855 return __kvm_set_xcr(emul_to_vcpu(ctxt), index, xcr); ··· 8896 8914 .is_smm = emulator_is_smm, 8897 8915 .leave_smm = emulator_leave_smm, 8898 8916 .triple_fault = emulator_triple_fault, 8917 + .get_xcr = emulator_get_xcr, 8899 8918 .set_xcr = emulator_set_xcr, 8900 8919 .get_untagged_addr = emulator_get_untagged_addr, 8901 8920 .is_canonical_addr = emulator_is_canonical_addr, ··· 9091 9108 run->internal.ndata = ndata; 9092 9109 } 9093 9110 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_prepare_event_vectoring_exit); 9111 + 9112 + void kvm_prepare_unexpected_reason_exit(struct kvm_vcpu *vcpu, u64 exit_reason) 9113 + { 9114 + vcpu_unimpl(vcpu, "unexpected exit reason 0x%llx\n", exit_reason); 9115 + 9116 + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; 9117 + vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON; 9118 + vcpu->run->internal.ndata = 2; 9119 + vcpu->run->internal.data[0] = exit_reason; 9120 + vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu; 9121 + } 9122 + EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_prepare_unexpected_reason_exit); 9094 9123 9095 9124 static int handle_emulation_failure(struct kvm_vcpu *vcpu, int emulation_type) 9096 9125 { ··· 9332 9337 return false; 9333 9338 } 9334 9339 9340 + static bool is_soft_int_instruction(struct x86_emulate_ctxt *ctxt, 9341 + int emulation_type) 9342 + { 9343 + u8 vector = EMULTYPE_GET_SOFT_INT_VECTOR(emulation_type); 9344 + 9345 + switch (ctxt->b) { 9346 + case 0xcc: 9347 + return vector == BP_VECTOR; 9348 + case 0xcd: 9349 + return vector == ctxt->src.val; 9350 + case 0xce: 9351 + return vector == OF_VECTOR; 9352 + default: 9353 + return false; 9354 + } 9355 + } 9356 + 9335 9357 /* 9336 9358 * Decode an instruction for emulation. The caller is responsible for handling 9337 9359 * code breakpoints. Note, manually detecting code breakpoints is unnecessary ··· 9406 9394 return handle_emulation_failure(vcpu, emulation_type); 9407 9395 } 9408 9396 9409 - vcpu->arch.l1tf_flush_l1d = true; 9397 + kvm_request_l1tf_flush_l1d(); 9410 9398 9411 9399 if (!(emulation_type & EMULTYPE_NO_DECODE)) { 9412 9400 kvm_clear_exception_queue(vcpu); ··· 9459 9447 * injecting single-step #DBs. 
9460 9448 */ 9461 9449 if (emulation_type & EMULTYPE_SKIP) { 9450 + if (emulation_type & EMULTYPE_SKIP_SOFT_INT && 9451 + !is_soft_int_instruction(ctxt, emulation_type)) 9452 + return 0; 9453 + 9462 9454 if (ctxt->mode != X86EMUL_MODE_PROT64) 9463 9455 ctxt->eip = (u32)ctxt->_eip; 9464 9456 else ··· 10047 10031 return -ENOMEM; 10048 10032 } 10049 10033 10050 - user_return_msrs = alloc_percpu(struct kvm_user_return_msrs); 10051 - if (!user_return_msrs) { 10052 - pr_err("failed to allocate percpu kvm_user_return_msrs\n"); 10053 - r = -ENOMEM; 10054 - goto out_free_x86_emulator_cache; 10055 - } 10056 - kvm_nr_uret_msrs = 0; 10057 - 10058 10034 r = kvm_mmu_vendor_module_init(); 10059 10035 if (r) 10060 - goto out_free_percpu; 10036 + goto out_free_x86_emulator_cache; 10061 10037 10062 10038 kvm_caps.supported_vm_types = BIT(KVM_X86_DEFAULT_VM); 10063 10039 kvm_caps.supported_mce_cap = MCG_CTL_P | MCG_SER_P; ··· 10073 10065 10074 10066 if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) 10075 10067 rdmsrq(MSR_IA32_ARCH_CAPABILITIES, kvm_host.arch_capabilities); 10068 + 10069 + WARN_ON_ONCE(kvm_nr_uret_msrs); 10076 10070 10077 10071 r = ops->hardware_setup(); 10078 10072 if (r != 0) ··· 10148 10138 kvm_x86_ops.enable_virtualization_cpu = NULL; 10149 10139 kvm_x86_call(hardware_unsetup)(); 10150 10140 out_mmu_exit: 10141 + kvm_destroy_user_return_msrs(); 10151 10142 kvm_mmu_vendor_module_exit(); 10152 - out_free_percpu: 10153 - free_percpu(user_return_msrs); 10154 10143 out_free_x86_emulator_cache: 10155 10144 kmem_cache_destroy(x86_emulator_cache); 10156 10145 return r; ··· 10177 10168 cancel_work_sync(&pvclock_gtod_work); 10178 10169 #endif 10179 10170 kvm_x86_call(hardware_unsetup)(); 10171 + kvm_destroy_user_return_msrs(); 10180 10172 kvm_mmu_vendor_module_exit(); 10181 - free_percpu(user_return_msrs); 10182 10173 kmem_cache_destroy(x86_emulator_cache); 10183 10174 #ifdef CONFIG_KVM_XEN 10184 10175 static_key_deferred_flush(&kvm_xen_enabled); ··· 11300 11291 if (vcpu->arch.guest_fpu.xfd_err) 11301 11292 wrmsrq(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err); 11302 11293 11294 + kvm_load_xfeatures(vcpu, true); 11295 + 11303 11296 if (unlikely(vcpu->arch.switch_db_regs && 11304 11297 !(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))) { 11305 11298 set_debugreg(DR7_FIXED_1, 7); ··· 11330 11319 11331 11320 guest_timing_enter_irqoff(); 11332 11321 11322 + /* 11323 + * Swap PKRU with hardware breakpoints disabled to minimize the number 11324 + * of flows where non-KVM code can run with guest state loaded. 11325 + */ 11326 + kvm_load_guest_pkru(vcpu); 11327 + 11333 11328 for (;;) { 11334 11329 /* 11335 11330 * Assert that vCPU vs. VM APICv state is consistent. An APICv ··· 11363 11346 /* Note, VM-Exits that go down the "slow" path are accounted below. */ 11364 11347 ++vcpu->stat.exits; 11365 11348 } 11349 + 11350 + kvm_load_host_pkru(vcpu); 11366 11351 11367 11352 /* 11368 11353 * Do this here before restoring debug registers on the host. 
And ··· 11395 11376 11396 11377 vcpu->mode = OUTSIDE_GUEST_MODE; 11397 11378 smp_wmb(); 11379 + 11380 + kvm_load_xfeatures(vcpu, false); 11398 11381 11399 11382 /* 11400 11383 * Sync xfd before calling handle_exit_irqoff() which may ··· 12755 12734 12756 12735 void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) 12757 12736 { 12758 - struct kvm *kvm = vcpu->kvm; 12759 - 12760 12737 if (mutex_lock_killable(&vcpu->mutex)) 12761 12738 return; 12762 12739 vcpu_load(vcpu); ··· 12765 12746 vcpu->arch.msr_kvm_poll_control = 1; 12766 12747 12767 12748 mutex_unlock(&vcpu->mutex); 12768 - 12769 - if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0) 12770 - schedule_delayed_work(&kvm->arch.kvmclock_sync_work, 12771 - KVMCLOCK_SYNC_PERIOD); 12772 12749 } 12773 12750 12774 12751 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) ··· 13103 13088 void kvm_arch_disable_virtualization_cpu(void) 13104 13089 { 13105 13090 kvm_x86_call(disable_virtualization_cpu)(); 13106 - drop_user_return_notifiers(); 13091 + 13092 + /* 13093 + * Leave the user-return notifiers as-is when disabling virtualization 13094 + * for reboot, i.e. when disabling via IPI function call, and instead 13095 + * pin kvm.ko (if it's a module) to defend against use-after-free (in 13096 + * the *very* unlikely scenario module unload is racing with reboot). 13097 + * On a forced reboot, tasks aren't frozen before shutdown, and so KVM 13098 + * could be actively modifying user-return MSR state when the IPI to 13099 + * disable virtualization arrives. Handle the extreme edge case here 13100 + * instead of trying to account for it in the normal flows. 13101 + */ 13102 + if (in_task() || WARN_ON_ONCE(!kvm_rebooting)) 13103 + drop_user_return_notifiers(); 13104 + else 13105 + __module_get(THIS_MODULE); 13107 13106 } 13108 13107 13109 13108 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu) ··· 13188 13159 spin_lock_init(&kvm->arch.hv_root_tdp_lock); 13189 13160 kvm->arch.hv_root_tdp = INVALID_PAGE; 13190 13161 #endif 13191 - 13192 - INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn); 13193 - INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn); 13194 13162 13195 13163 kvm_apicv_init(kvm); 13196 13164 kvm_hv_init_vm(kvm); ··· 13295 13269 * is unsafe, i.e. will lead to use-after-free. The PIT also needs to 13296 13270 * be stopped before IRQ routing is freed. 13297 13271 */ 13298 - cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work); 13299 - cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work); 13300 - 13301 13272 #ifdef CONFIG_KVM_IOAPIC 13302 13273 kvm_free_pit(kvm); 13303 13274 #endif ··· 13911 13888 if ((work->wakeup_all || work->notpresent_injected) && 13912 13889 kvm_pv_async_pf_enabled(vcpu) && 13913 13890 !apf_put_user_ready(vcpu, work->arch.token)) { 13914 - vcpu->arch.apf.pageready_pending = true; 13891 + WRITE_ONCE(vcpu->arch.apf.pageready_pending, true); 13915 13892 kvm_apic_set_irq(vcpu, &irq, NULL); 13916 13893 } 13917 13894 ··· 13922 13899 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu) 13923 13900 { 13924 13901 kvm_make_request(KVM_REQ_APF_READY, vcpu); 13925 - if (!vcpu->arch.apf.pageready_pending) 13902 + 13903 + /* Pairs with smp_store_mb() in kvm_set_msr_common(). */ 13904 + smp_mb__after_atomic(); 13905 + 13906 + if (!READ_ONCE(vcpu->arch.apf.pageready_pending)) 13926 13907 kvm_vcpu_kick(vcpu); 13927 13908 } 13928 13909
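Note on the async #PF pageready_pending handling above: the smp_store_mb()/smp_mb__after_atomic() pairing is the classic store-buffering pattern. Each side stores its own flag, executes a full barrier, then loads the other side's flag, which guarantees at least one side observes the other's store, so the kick cannot be lost. A minimal sketch of the same pattern in userspace C11 atomics (illustrative names, not KVM code):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool pageready_pending;	/* mirrors vcpu->arch.apf.pageready_pending */
static atomic_bool apf_ready_request;	/* mirrors KVM_REQ_APF_READY */

/* MSR-write side: clear the pending flag, then scan for completions. */
static void ack_pageready(void)
{
	atomic_store(&pageready_pending, false);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_store_mb() */
	if (atomic_load(&apf_ready_request)) {
		/* process the async #PF completion queue */
	}
}

/* Completion side: queue the request, then decide whether a kick is needed. */
static void page_present_queued(void)
{
	atomic_store(&apf_ready_request, true);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb__after_atomic() */
	if (!atomic_load(&pageready_pending)) {
		/* kick the vCPU so it re-checks the queue */
	}
}

With both fences in place, the MSR side cannot miss the request while the completion side simultaneously skips the kick.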
+14 -2
arch/x86/kvm/x86.h
··· 420 420 return !(kvm->arch.disabled_quirks & quirk); 421 421 } 422 422 423 + static __always_inline void kvm_request_l1tf_flush_l1d(void) 424 + { 425 + #if IS_ENABLED(CONFIG_CPU_MITIGATIONS) && IS_ENABLED(CONFIG_KVM_INTEL) 426 + /* 427 + * Use a raw write to set the per-CPU flag, as KVM will ensure a flush 428 + * even if preemption is currently enabled. If the current vCPU task 429 + * is migrated to a different CPU (or userspace runs the vCPU on a 430 + * different task) before the next VM-Entry, then kvm_arch_vcpu_load() 431 + * will request a flush on the new CPU. 432 + */ 433 + raw_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1); 434 + #endif 435 + } 436 + 423 437 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip); 424 438 425 439 u64 get_kvmclock_ns(struct kvm *kvm); ··· 636 622 #endif 637 623 } 638 624 639 - void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu); 640 - void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu); 641 625 int kvm_spec_ctrl_test_value(u64 value); 642 626 int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r, 643 627 struct x86_exception *e);
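The new helper only raises a request; the consumer is the VMX entry path, which is why a raw, migration-racy per-CPU write is good enough. A rough sketch of the consuming side, loosely modeled on vmx_l1d_flush() (simplified, hypothetical function name):

static void l1d_flush_if_requested(void)
{
	/*
	 * A request left behind on another CPU is harmless (at worst a
	 * spurious flush there), and a request lost to migration is
	 * re-raised by kvm_arch_vcpu_load() on the destination CPU.
	 */
	if (!this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d))
		return;
	this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);

	/* ... perform the L1D flush, e.g. via MSR_IA32_FLUSH_CMD ... */
}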
+37
drivers/crypto/ccp/sev-dev.c
··· 2770 2770 } 2771 2771 EXPORT_SYMBOL_GPL(sev_platform_shutdown); 2772 2772 2773 + u64 sev_get_snp_policy_bits(void) 2774 + { 2775 + struct psp_device *psp = psp_master; 2776 + struct sev_device *sev; 2777 + u64 policy_bits; 2778 + 2779 + if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP)) 2780 + return 0; 2781 + 2782 + if (!psp || !psp->sev_data) 2783 + return 0; 2784 + 2785 + sev = psp->sev_data; 2786 + 2787 + policy_bits = SNP_POLICY_MASK_BASE; 2788 + 2789 + if (sev->snp_plat_status.feature_info) { 2790 + if (sev->snp_feat_info_0.ecx & SNP_RAPL_DISABLE_SUPPORTED) 2791 + policy_bits |= SNP_POLICY_MASK_RAPL_DIS; 2792 + 2793 + if (sev->snp_feat_info_0.ecx & SNP_CIPHER_TEXT_HIDING_SUPPORTED) 2794 + policy_bits |= SNP_POLICY_MASK_CIPHERTEXT_HIDING_DRAM; 2795 + 2796 + if (sev->snp_feat_info_0.ecx & SNP_AES_256_XTS_POLICY_SUPPORTED) 2797 + policy_bits |= SNP_POLICY_MASK_MEM_AES_256_XTS; 2798 + 2799 + if (sev->snp_feat_info_0.ecx & SNP_CXL_ALLOW_POLICY_SUPPORTED) 2800 + policy_bits |= SNP_POLICY_MASK_CXL_ALLOW; 2801 + 2802 + if (sev_version_greater_or_equal(1, 58)) 2803 + policy_bits |= SNP_POLICY_MASK_PAGE_SWAP_DISABLE; 2804 + } 2805 + 2806 + return policy_bits; 2807 + } 2808 + EXPORT_SYMBOL_GPL(sev_get_snp_policy_bits); 2809 + 2773 2810 void sev_dev_destroy(struct psp_device *psp) 2774 2811 { 2775 2812 struct sev_device *sev = psp->sev_data;
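A plausible consumer of the new export is policy validation at SNP guest launch: reject a guest policy that sets bits the platform cannot honor. A hypothetical sketch (the caller name and exact checks are illustrative; the SNP_POLICY_MASK_* definitions are added to include/linux/psp-sev.h later in this diff):

static int snp_validate_guest_policy(u64 policy)
{
	u64 supported = sev_get_snp_policy_bits();

	/* Reject any policy bit the firmware/platform doesn't support. */
	if (policy & ~supported)
		return -EINVAL;

	/* Bit 17 is reserved and must be one per the SNP firmware ABI. */
	if (!(policy & SNP_POLICY_MASK_RSVD_MBO))
		return -EINVAL;

	return 0;
}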
+5 -2
drivers/irqchip/irq-apple-aic.c
··· 411 411 if (is_kernel_in_hyp_mode() && 412 412 (read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EL2_En) && 413 413 read_sysreg_s(SYS_ICH_MISR_EL2) != 0) { 414 + u64 val; 415 + 414 416 generic_handle_domain_irq(aic_irqc->hw_domain, 415 417 AIC_FIQ_HWIRQ(AIC_VGIC_MI)); 416 418 417 419 if (unlikely((read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EL2_En) && 418 - read_sysreg_s(SYS_ICH_MISR_EL2))) { 419 - pr_err_ratelimited("vGIC IRQ fired and not handled by KVM, disabling.\n"); 420 + (val = read_sysreg_s(SYS_ICH_MISR_EL2)))) { 421 + pr_err_ratelimited("vGIC IRQ fired and not handled by KVM (MISR=%llx), disabling.\n", 422 + val); 420 423 sysreg_clear_set_s(SYS_ICH_HCR_EL2, ICH_HCR_EL2_En, 0); 421 424 } 422 425 }
+3
drivers/irqchip/irq-gic.c
··· 1459 1459 if (ret) 1460 1460 return; 1461 1461 1462 + gic_v2_kvm_info.gicc_base = gic_data[0].cpu_base.common_base; 1463 + 1462 1464 if (static_branch_likely(&supports_deactivate_key)) 1463 1465 vgic_set_kvm_info(&gic_v2_kvm_info); 1464 1466 } ··· 1622 1620 return; 1623 1621 1624 1622 gic_v2_kvm_info.maint_irq = irq; 1623 + gic_v2_kvm_info.gicc_base = gic_data[0].cpu_base.common_base; 1625 1624 1626 1625 vgic_set_kvm_info(&gic_v2_kvm_info); 1627 1626 }
+2 -2
fs/btrfs/compression.c
··· 475 475 continue; 476 476 } 477 477 478 - folio = filemap_alloc_folio(mapping_gfp_constraint(mapping, 479 - ~__GFP_FS), 0); 478 + folio = filemap_alloc_folio(mapping_gfp_constraint(mapping, ~__GFP_FS), 479 + 0, NULL); 480 480 if (!folio) 481 481 break; 482 482
+1 -1
fs/btrfs/verity.c
··· 736 736 } 737 737 738 738 folio = filemap_alloc_folio(mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS), 739 - 0); 739 + 0, NULL); 740 740 if (!folio) 741 741 return ERR_PTR(-ENOMEM); 742 742
+1 -1
fs/erofs/zdata.c
··· 562 562 * Allocate a managed folio for cached I/O, or it may be 563 563 * then filled with a file-backed folio for in-place I/O 564 564 */ 565 - newfolio = filemap_alloc_folio(gfp, 0); 565 + newfolio = filemap_alloc_folio(gfp, 0, NULL); 566 566 if (!newfolio) 567 567 continue; 568 568 newfolio->private = Z_EROFS_PREALLOCATED_FOLIO;
+1 -1
fs/f2fs/compress.c
··· 1947 1947 return; 1948 1948 } 1949 1949 1950 - cfolio = filemap_alloc_folio(__GFP_NOWARN | __GFP_IO, 0); 1950 + cfolio = filemap_alloc_folio(__GFP_NOWARN | __GFP_IO, 0, NULL); 1951 1951 if (!cfolio) 1952 1952 return; 1953 1953
+19 -10
include/kvm/arm_vgic.h
··· 59 59 /* virtual control interface mapping, HYP VA */ 60 60 void __iomem *vctrl_hyp; 61 61 62 + /* Physical CPU interface, kernel VA */ 63 + void __iomem *gicc_base; 64 + 62 65 /* Number of implemented list registers */ 63 66 int nr_lr; 64 67 ··· 123 120 124 121 struct vgic_irq { 125 122 raw_spinlock_t irq_lock; /* Protects the content of the struct */ 123 + u32 intid; /* Guest visible INTID */ 126 124 struct rcu_head rcu; 127 125 struct list_head ap_list; 128 126 ··· 138 134 * affinity reg (v3). 139 135 */ 140 136 141 - u32 intid; /* Guest visible INTID */ 142 - bool line_level; /* Level only */ 143 - bool pending_latch; /* The pending latch state used to calculate 144 - * the pending state for both level 145 - * and edge triggered IRQs. */ 146 - bool active; 147 - bool pending_release; /* Used for LPIs only, unreferenced IRQ 137 + bool pending_release:1; /* Used for LPIs only, unreferenced IRQ 148 138 * pending a release */ 149 139 150 - bool enabled; 151 - bool hw; /* Tied to HW IRQ */ 140 + bool pending_latch:1; /* The pending latch state used to calculate 141 + * the pending state for both level 142 + * and edge triggered IRQs. */ 143 + enum vgic_irq_config config:1; /* Level or edge */ 144 + bool line_level:1; /* Level only */ 145 + bool enabled:1; 146 + bool active:1; 147 + bool hw:1; /* Tied to HW IRQ */ 148 + bool on_lr:1; /* Present in a CPU LR */ 152 149 refcount_t refcount; /* Used for LPIs */ 153 150 u32 hwintid; /* HW INTID number */ 154 151 unsigned int host_irq; /* linux irq corresponding to hwintid */ ··· 161 156 u8 active_source; /* GICv2 SGIs only */ 162 157 u8 priority; 163 158 u8 group; /* 0 == group 0, 1 == group 1 */ 164 - enum vgic_irq_config config; /* Level or edge */ 165 159 166 160 struct irq_ops *ops; 167 161 ··· 263 259 /* The GIC maintenance IRQ for nested hypervisors. */ 264 260 u32 mi_intid; 265 261 262 + /* Track the number of in-flight active SPIs */ 263 + atomic_t active_spis; 264 + 266 265 /* base addresses in guest physical address space: */ 267 266 gpa_t vgic_dist_base; /* distributor */ 268 267 union { ··· 287 280 struct vgic_irq *spis; 288 281 289 282 struct vgic_io_device dist_iodev; 283 + struct vgic_io_device cpuif_iodev; 290 284 291 285 bool has_its; 292 286 bool table_write_in_progress; ··· 425 417 void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu); 426 418 void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu); 427 419 void kvm_vgic_reset_mapped_irq(struct kvm_vcpu *vcpu, u32 vintid); 420 + void kvm_vgic_process_async_update(struct kvm_vcpu *vcpu); 428 421 429 422 void vgic_v3_dispatch_sgi(struct kvm_vcpu *vcpu, u64 reg, bool allow_group1); 430 423
+6
include/linux/irqchip/arm-gic.h
··· 86 86 87 87 #define GICH_HCR_EN (1 << 0) 88 88 #define GICH_HCR_UIE (1 << 1) 89 + #define GICH_HCR_LRENPIE (1 << 2) 89 90 #define GICH_HCR_NPIE (1 << 3) 91 + #define GICH_HCR_VGrp0EIE (1 << 4) 92 + #define GICH_HCR_VGrp0DIE (1 << 5) 93 + #define GICH_HCR_VGrp1EIE (1 << 6) 94 + #define GICH_HCR_VGrp1DIE (1 << 7) 95 + #define GICH_HCR_EOICOUNT GENMASK(31, 27) 90 96 91 97 #define GICH_LR_VIRTUALID (0x3ff << 0) 92 98 #define GICH_LR_PHYSID_CPUID_SHIFT (10)
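GICH_HCR.LRENPIE and EOICount are what make EOImode==1 deactivation tractable when a list register has to be reclaimed while its IRQ is still active: EOICount counts guest EOIs that found no matching LR, and LRENPIE asserts a maintenance interrupt while that count is non-zero so the hypervisor can fold the deferred deactivations back in. A small illustrative decode (assuming <linux/bitfield.h>; not the vgic code itself):

#include <linux/bitfield.h>

static inline unsigned int gich_hcr_eoi_count(u32 hcr)
{
	/* Guest EOIs performed for interrupts with no resident LR. */
	return FIELD_GET(GICH_HCR_EOICOUNT, hcr);
}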
+2
include/linux/irqchip/arm-vgic-info.h
··· 24 24 enum gic_type type; 25 25 /* Virtual CPU interface */ 26 26 struct resource vcpu; 27 + /* GICv2 GICC VA */ 28 + void __iomem *gicc_base; 27 29 /* Interrupt number */ 28 30 unsigned int maint_irq; 29 31 /* No interrupt mask, no need to use the above field */
+2 -12
include/linux/kvm_host.h
··· 1557 1557 unsigned int ioctl, unsigned long arg); 1558 1558 long kvm_arch_vcpu_ioctl(struct file *filp, 1559 1559 unsigned int ioctl, unsigned long arg); 1560 + long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, 1561 + unsigned int ioctl, unsigned long arg); 1560 1562 vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf); 1561 1563 1562 1564 int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext); ··· 2438 2436 return false; 2439 2437 } 2440 2438 #endif /* CONFIG_HAVE_KVM_NO_POLL */ 2441 - 2442 - #ifdef CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL 2443 - long kvm_arch_vcpu_async_ioctl(struct file *filp, 2444 - unsigned int ioctl, unsigned long arg); 2445 - #else 2446 - static inline long kvm_arch_vcpu_async_ioctl(struct file *filp, 2447 - unsigned int ioctl, 2448 - unsigned long arg) 2449 - { 2450 - return -ENOIOCTLCMD; 2451 - } 2452 - #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */ 2453 2439 2454 2440 void kvm_arch_guest_memory_reclaimed(struct kvm *kvm); 2455 2441
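The unlocked hook replaces the old HAVE_KVM_VCPU_ASYNC_IOCTL arrangement with an always-present arch hook. The generic dispatch shape it implies, sketched from virt/kvm/kvm_main.c (simplified; the real function also handles KVM_RUN and friends):

static long kvm_vcpu_ioctl(struct file *filp, unsigned int ioctl,
			   unsigned long arg)
{
	struct kvm_vcpu *vcpu = filp->private_data;
	long r;

	/* Runs without vcpu->mutex, e.g. KVM_MEMORY_ENCRYPT_OP on x86. */
	r = kvm_arch_vcpu_unlocked_ioctl(filp, ioctl, arg);
	if (r != -ENOIOCTLCMD)
		return r;

	if (mutex_lock_killable(&vcpu->mutex))
		return -EINTR;
	r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);	/* locked path */
	mutex_unlock(&vcpu->mutex);
	return r;
}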
+13 -5
include/linux/pagemap.h
··· 651 651 } 652 652 653 653 #ifdef CONFIG_NUMA 654 - struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order); 654 + struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order, 655 + struct mempolicy *policy); 655 656 #else 656 - static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order) 657 + static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order, 658 + struct mempolicy *policy) 657 659 { 658 660 return folio_alloc_noprof(gfp, order); 659 661 } ··· 666 664 667 665 static inline struct page *__page_cache_alloc(gfp_t gfp) 668 666 { 669 - return &filemap_alloc_folio(gfp, 0)->page; 667 + return &filemap_alloc_folio(gfp, 0, NULL)->page; 670 668 } 671 669 672 670 static inline gfp_t readahead_gfp_mask(struct address_space *x) ··· 752 750 } 753 751 754 752 void *filemap_get_entry(struct address_space *mapping, pgoff_t index); 755 - struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, 756 - fgf_t fgp_flags, gfp_t gfp); 753 + struct folio *__filemap_get_folio_mpol(struct address_space *mapping, 754 + pgoff_t index, fgf_t fgf_flags, gfp_t gfp, struct mempolicy *policy); 757 755 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index, 758 756 fgf_t fgp_flags, gfp_t gfp); 757 + 758 + static inline struct folio *__filemap_get_folio(struct address_space *mapping, 759 + pgoff_t index, fgf_t fgf_flags, gfp_t gfp) 760 + { 761 + return __filemap_get_folio_mpol(mapping, index, fgf_flags, gfp, NULL); 762 + } 759 763 760 764 /** 761 765 * write_begin_get_folio - Get folio for write_begin with flags.
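Callers that don't care about NUMA placement are unchanged: __filemap_get_folio() becomes a thin wrapper passing a NULL policy, and the filemap_alloc_folio() conversions in the fs/ hunks above are mechanical. A policy-aware caller would look roughly like this hypothetical wrapper:

static struct folio *lookup_or_create_folio(struct address_space *mapping,
					    pgoff_t index,
					    struct mempolicy *policy)
{
	fgf_t fgp = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;

	/* A NULL @policy preserves the old cpuset/local-node behavior. */
	return __filemap_get_folio_mpol(mapping, index, fgp,
					mapping_gfp_mask(mapping), policy);
}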
+37
include/linux/psp-sev.h
··· 14 14 15 15 #include <uapi/linux/psp-sev.h> 16 16 17 + /* As defined by SEV API, under "Guest Policy". */ 18 + #define SEV_POLICY_MASK_NODBG BIT(0) 19 + #define SEV_POLICY_MASK_NOKS BIT(1) 20 + #define SEV_POLICY_MASK_ES BIT(2) 21 + #define SEV_POLICY_MASK_NOSEND BIT(3) 22 + #define SEV_POLICY_MASK_DOMAIN BIT(4) 23 + #define SEV_POLICY_MASK_SEV BIT(5) 24 + #define SEV_POLICY_MASK_API_MAJOR GENMASK(23, 16) 25 + #define SEV_POLICY_MASK_API_MINOR GENMASK(31, 24) 26 + 27 + /* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */ 28 + #define SNP_POLICY_MASK_API_MINOR GENMASK_ULL(7, 0) 29 + #define SNP_POLICY_MASK_API_MAJOR GENMASK_ULL(15, 8) 30 + #define SNP_POLICY_MASK_SMT BIT_ULL(16) 31 + #define SNP_POLICY_MASK_RSVD_MBO BIT_ULL(17) 32 + #define SNP_POLICY_MASK_MIGRATE_MA BIT_ULL(18) 33 + #define SNP_POLICY_MASK_DEBUG BIT_ULL(19) 34 + #define SNP_POLICY_MASK_SINGLE_SOCKET BIT_ULL(20) 35 + #define SNP_POLICY_MASK_CXL_ALLOW BIT_ULL(21) 36 + #define SNP_POLICY_MASK_MEM_AES_256_XTS BIT_ULL(22) 37 + #define SNP_POLICY_MASK_RAPL_DIS BIT_ULL(23) 38 + #define SNP_POLICY_MASK_CIPHERTEXT_HIDING_DRAM BIT_ULL(24) 39 + #define SNP_POLICY_MASK_PAGE_SWAP_DISABLE BIT_ULL(25) 40 + 41 + /* Base SEV-SNP policy bitmask for minimum supported SEV firmware version */ 42 + #define SNP_POLICY_MASK_BASE (SNP_POLICY_MASK_API_MINOR | \ 43 + SNP_POLICY_MASK_API_MAJOR | \ 44 + SNP_POLICY_MASK_SMT | \ 45 + SNP_POLICY_MASK_RSVD_MBO | \ 46 + SNP_POLICY_MASK_MIGRATE_MA | \ 47 + SNP_POLICY_MASK_DEBUG | \ 48 + SNP_POLICY_MASK_SINGLE_SOCKET) 49 + 17 50 #define SEV_FW_BLOB_MAX_SIZE 0x4000 /* 16KB */ 18 51 19 52 /** ··· 882 849 u32 edx; 883 850 } __packed; 884 851 852 + #define SNP_RAPL_DISABLE_SUPPORTED BIT(2) 885 853 #define SNP_CIPHER_TEXT_HIDING_SUPPORTED BIT(3) 854 + #define SNP_AES_256_XTS_POLICY_SUPPORTED BIT(4) 855 + #define SNP_CXL_ALLOW_POLICY_SUPPORTED BIT(5) 886 856 887 857 #ifdef CONFIG_CRYPTO_DEV_SP_PSP 888 858 ··· 1031 995 void snp_free_firmware_page(void *addr); 1032 996 void sev_platform_shutdown(void); 1033 997 bool sev_is_snp_ciphertext_hiding_supported(void); 998 + u64 sev_get_snp_policy_bits(void); 1034 999 1035 1000 #else /* !CONFIG_CRYPTO_DEV_SP_PSP */ 1036 1001
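The API version fields are multi-bit, so they are extracted with FIELD_GET() rather than tested as flags; a short decode sketch under that assumption (the function is illustrative):

#include <linux/bitfield.h>

static void snp_policy_decode(u64 policy, u8 *api_major, u8 *api_minor,
			      bool *smt_allowed)
{
	*api_major = FIELD_GET(SNP_POLICY_MASK_API_MAJOR, policy);
	*api_minor = FIELD_GET(SNP_POLICY_MASK_API_MINOR, policy);
	*smt_allowed = policy & SNP_POLICY_MASK_SMT;
}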
+11
include/uapi/linux/kvm.h
··· 179 179 #define KVM_EXIT_LOONGARCH_IOCSR 38 180 180 #define KVM_EXIT_MEMORY_FAULT 39 181 181 #define KVM_EXIT_TDX 40 182 + #define KVM_EXIT_ARM_SEA 41 182 183 183 184 /* For KVM_EXIT_INTERNAL_ERROR */ 184 185 /* Emulate instruction failed. */ ··· 474 473 } setup_event_notify; 475 474 }; 476 475 } tdx; 476 + /* KVM_EXIT_ARM_SEA */ 477 + struct { 478 + #define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0) 479 + __u64 flags; 480 + __u64 esr; 481 + __u64 gva; 482 + __u64 gpa; 483 + } arm_sea; 477 484 /* Fix the size of the union. */ 478 485 char padding[256]; 479 486 }; ··· 972 963 #define KVM_CAP_RISCV_MP_STATE_RESET 242 973 964 #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243 974 965 #define KVM_CAP_GUEST_MEMFD_FLAGS 244 966 + #define KVM_CAP_ARM_SEA_TO_USER 245 967 + #define KVM_CAP_S390_USER_OPEREXEC 246 975 968 976 969 struct kvm_irq_routing_irqchip { 977 970 __u32 irqchip;
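The new exit hands the abort details to the VMM and leaves the response policy to userspace; a typical reaction is to inject an external data abort back into the guest via KVM_SET_VCPU_EVENTS, which is what the sea_to_user selftest later in this diff does. A hypothetical userspace handler sketch:

#include <stdio.h>
#include <linux/kvm.h>

static void handle_arm_sea(struct kvm_run *run)
{
	int gpa_valid = !!(run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GPA_VALID);

	fprintf(stderr, "SEA: esr=%#llx gva=%#llx gpa=%#llx (gpa %svalid)\n",
		run->arm_sea.esr, run->arm_sea.gva, run->arm_sea.gpa,
		gpa_valid ? "" : "in");

	/*
	 * Typical follow-up: set kvm_vcpu_events.exception.ext_dabt_pending
	 * and call the KVM_SET_VCPU_EVENTS ioctl to forward the abort.
	 */
}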
+1
include/uapi/linux/magic.h
··· 103 103 #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ 104 104 #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ 105 105 #define PID_FS_MAGIC 0x50494446 /* "PIDF" */ 106 + #define GUEST_MEMFD_MAGIC 0x474d454d /* "GMEM" */ 106 107 107 108 #endif /* __LINUX_MAGIC_H__ */
+14 -9
mm/filemap.c
··· 990 990 EXPORT_SYMBOL_GPL(filemap_add_folio); 991 991 992 992 #ifdef CONFIG_NUMA 993 - struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order) 993 + struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order, 994 + struct mempolicy *policy) 994 995 { 995 996 int n; 996 997 struct folio *folio; 998 + 999 + if (policy) 1000 + return folio_alloc_mpol_noprof(gfp, order, policy, 1001 + NO_INTERLEAVE_INDEX, numa_node_id()); 997 1002 998 1003 if (cpuset_do_page_mem_spread()) { 999 1004 unsigned int cpuset_mems_cookie; ··· 1916 1911 } 1917 1912 1918 1913 /** 1919 - * __filemap_get_folio - Find and get a reference to a folio. 1914 + * __filemap_get_folio_mpol - Find and get a reference to a folio. 1920 1915 * @mapping: The address_space to search. 1921 1916 * @index: The page index. 1922 1917 * @fgp_flags: %FGP flags modify how the folio is returned. 1923 1918 * @gfp: Memory allocation flags to use if %FGP_CREAT is specified. 1919 + * @policy: NUMA memory allocation policy to follow. 1924 1920 * 1925 1921 * Looks up the page cache entry at @mapping & @index. 1926 1922 * ··· 1932 1926 * 1933 1927 * Return: The found folio or an ERR_PTR() otherwise. 1934 1928 */ 1935 - struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, 1936 - fgf_t fgp_flags, gfp_t gfp) 1929 + struct folio *__filemap_get_folio_mpol(struct address_space *mapping, 1930 + pgoff_t index, fgf_t fgp_flags, gfp_t gfp, struct mempolicy *policy) 1937 1931 { 1938 1932 struct folio *folio; 1939 1933 ··· 2003 1997 err = -ENOMEM; 2004 1998 if (order > min_order) 2005 1999 alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; 2006 - folio = filemap_alloc_folio(alloc_gfp, order); 2000 + folio = filemap_alloc_folio(alloc_gfp, order, policy); 2007 2001 if (!folio) 2008 2002 continue; 2009 2003 ··· 2050 2044 folio_clear_dropbehind(folio); 2051 2045 return folio; 2052 2046 } 2053 - EXPORT_SYMBOL(__filemap_get_folio); 2047 + EXPORT_SYMBOL(__filemap_get_folio_mpol); 2054 2048 2055 2049 static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max, 2056 2050 xa_mark_t mark) ··· 2603 2597 if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ)) 2604 2598 return -EAGAIN; 2605 2599 2606 - folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order); 2600 + folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order, NULL); 2607 2601 if (!folio) 2608 2602 return -ENOMEM; 2609 2603 if (iocb->ki_flags & IOCB_DONTCACHE) ··· 4056 4050 repeat: 4057 4051 folio = filemap_get_folio(mapping, index); 4058 4052 if (IS_ERR(folio)) { 4059 - folio = filemap_alloc_folio(gfp, 4060 - mapping_min_folio_order(mapping)); 4053 + folio = filemap_alloc_folio(gfp, mapping_min_folio_order(mapping), NULL); 4061 4054 if (!folio) 4062 4055 return ERR_PTR(-ENOMEM); 4063 4056 index = mapping_align_index(mapping, index);
+6
mm/mempolicy.c
··· 356 356 357 357 return &default_policy; 358 358 } 359 + EXPORT_SYMBOL_FOR_MODULES(get_task_policy, "kvm"); 359 360 360 361 static const struct mempolicy_operations { 361 362 int (*create)(struct mempolicy *pol, const nodemask_t *nodes); ··· 490 489 return; 491 490 kmem_cache_free(policy_cache, pol); 492 491 } 492 + EXPORT_SYMBOL_FOR_MODULES(__mpol_put, "kvm"); 493 493 494 494 static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes) 495 495 { ··· 2955 2953 read_unlock(&sp->lock); 2956 2954 return pol; 2957 2955 } 2956 + EXPORT_SYMBOL_FOR_MODULES(mpol_shared_policy_lookup, "kvm"); 2958 2957 2959 2958 static void sp_free(struct sp_node *n) 2960 2959 { ··· 3241 3238 mpol_put(mpol); /* drop our incoming ref on sb mpol */ 3242 3239 } 3243 3240 } 3241 + EXPORT_SYMBOL_FOR_MODULES(mpol_shared_policy_init, "kvm"); 3244 3242 3245 3243 int mpol_set_shared_policy(struct shared_policy *sp, 3246 3244 struct vm_area_struct *vma, struct mempolicy *pol) ··· 3260 3256 sp_free(new); 3261 3257 return err; 3262 3258 } 3259 + EXPORT_SYMBOL_FOR_MODULES(mpol_set_shared_policy, "kvm"); 3263 3260 3264 3261 /* Free a backing policy store on inode delete. */ 3265 3262 void mpol_free_shared_policy(struct shared_policy *sp) ··· 3279 3274 } 3280 3275 write_unlock(&sp->lock); 3281 3276 } 3277 + EXPORT_SYMBOL_FOR_MODULES(mpol_free_shared_policy, "kvm"); 3282 3278 3283 3279 #ifdef CONFIG_NUMA_BALANCING 3284 3280 static int __initdata numabalancing_override;
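These exports are scoped to kvm.ko so guest_memfd can reuse the tmpfs-style per-inode shared-policy machinery instead of duplicating it. The expected usage pattern, sketched with hypothetical guest_memfd-side wrappers:

/* Per-inode policy store, initialized once at inode creation. */
static void gmem_policy_init(struct shared_policy *sp)
{
	mpol_shared_policy_init(sp, NULL);	/* NULL: default policy */
}

/*
 * Look up the policy covering @index; returns a referenced mempolicy
 * (or NULL, meaning "fall back to the task/default policy").  The
 * caller drops the reference with mpol_put(), hence the __mpol_put()
 * export above.
 */
static struct mempolicy *gmem_policy_lookup(struct shared_policy *sp,
					    pgoff_t index)
{
	return mpol_shared_policy_lookup(sp, index);
}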
+1 -1
mm/readahead.c
··· 186 186 { 187 187 struct folio *folio; 188 188 189 - folio = filemap_alloc_folio(gfp_mask, order); 189 + folio = filemap_alloc_folio(gfp_mask, order, NULL); 190 190 if (folio && ractl->dropbehind) 191 191 __folio_set_dropbehind(folio); 192 192
+2
tools/arch/arm64/include/asm/esr.h
··· 141 141 #define ESR_ELx_SF (UL(1) << ESR_ELx_SF_SHIFT) 142 142 #define ESR_ELx_AR_SHIFT (14) 143 143 #define ESR_ELx_AR (UL(1) << ESR_ELx_AR_SHIFT) 144 + #define ESR_ELx_VNCR_SHIFT (13) 145 + #define ESR_ELx_VNCR (UL(1) << ESR_ELx_VNCR_SHIFT) 144 146 #define ESR_ELx_CM_SHIFT (8) 145 147 #define ESR_ELx_CM (UL(1) << ESR_ELx_CM_SHIFT) 146 148
+1 -1
tools/testing/selftests/kvm/Makefile
··· 6 6 ifeq ($(ARCH),$(filter $(ARCH),arm64 s390 riscv x86 x86_64 loongarch)) 7 7 # Top-level selftests allows ARCH=x86_64 :-( 8 8 ifeq ($(ARCH),x86_64) 9 - ARCH := x86 9 + override ARCH := x86 10 10 endif 11 11 include Makefile.kvm 12 12 else
+9 -3
tools/testing/selftests/kvm/Makefile.kvm
··· 88 88 TEST_GEN_PROGS_x86 += x86/kvm_buslock_test 89 89 TEST_GEN_PROGS_x86 += x86/monitor_mwait_test 90 90 TEST_GEN_PROGS_x86 += x86/msrs_test 91 + TEST_GEN_PROGS_x86 += x86/nested_close_kvm_test 91 92 TEST_GEN_PROGS_x86 += x86/nested_emulation_test 92 93 TEST_GEN_PROGS_x86 += x86/nested_exceptions_test 94 + TEST_GEN_PROGS_x86 += x86/nested_invalid_cr3_test 95 + TEST_GEN_PROGS_x86 += x86/nested_tsc_adjust_test 96 + TEST_GEN_PROGS_x86 += x86/nested_tsc_scaling_test 93 97 TEST_GEN_PROGS_x86 += x86/platform_info_test 94 98 TEST_GEN_PROGS_x86 += x86/pmu_counters_test 95 99 TEST_GEN_PROGS_x86 += x86/pmu_event_filter_test ··· 115 111 TEST_GEN_PROGS_x86 += x86/userspace_io_test 116 112 TEST_GEN_PROGS_x86 += x86/userspace_msr_exit_test 117 113 TEST_GEN_PROGS_x86 += x86/vmx_apic_access_test 118 - TEST_GEN_PROGS_x86 += x86/vmx_close_while_nested_test 119 114 TEST_GEN_PROGS_x86 += x86/vmx_dirty_log_test 120 115 TEST_GEN_PROGS_x86 += x86/vmx_exception_with_invalid_guest_state 121 116 TEST_GEN_PROGS_x86 += x86/vmx_msrs_test 122 117 TEST_GEN_PROGS_x86 += x86/vmx_invalid_nested_guest_state 118 + TEST_GEN_PROGS_x86 += x86/vmx_nested_la57_state_test 123 119 TEST_GEN_PROGS_x86 += x86/vmx_set_nested_state_test 124 - TEST_GEN_PROGS_x86 += x86/vmx_tsc_adjust_test 125 - TEST_GEN_PROGS_x86 += x86/vmx_nested_tsc_scaling_test 126 120 TEST_GEN_PROGS_x86 += x86/apic_bus_clock_test 127 121 TEST_GEN_PROGS_x86 += x86/xapic_ipi_test 128 122 TEST_GEN_PROGS_x86 += x86/xapic_state_test ··· 158 156 TEST_GEN_PROGS_arm64 = $(TEST_GEN_PROGS_COMMON) 159 157 TEST_GEN_PROGS_arm64 += arm64/aarch32_id_regs 160 158 TEST_GEN_PROGS_arm64 += arm64/arch_timer_edge_cases 159 + TEST_GEN_PROGS_arm64 += arm64/at 161 160 TEST_GEN_PROGS_arm64 += arm64/debug-exceptions 162 161 TEST_GEN_PROGS_arm64 += arm64/hello_el2 163 162 TEST_GEN_PROGS_arm64 += arm64/host_sve ··· 166 163 TEST_GEN_PROGS_arm64 += arm64/external_aborts 167 164 TEST_GEN_PROGS_arm64 += arm64/page_fault_test 168 165 TEST_GEN_PROGS_arm64 += arm64/psci_test 166 + TEST_GEN_PROGS_arm64 += arm64/sea_to_user 169 167 TEST_GEN_PROGS_arm64 += arm64/set_id_regs 170 168 TEST_GEN_PROGS_arm64 += arm64/smccc_filter 171 169 TEST_GEN_PROGS_arm64 += arm64/vcpu_width_config ··· 198 194 TEST_GEN_PROGS_s390 += s390/cpumodel_subfuncs_test 199 195 TEST_GEN_PROGS_s390 += s390/shared_zeropage_test 200 196 TEST_GEN_PROGS_s390 += s390/ucontrol_test 197 + TEST_GEN_PROGS_s390 += s390/user_operexec 201 198 TEST_GEN_PROGS_s390 += rseq_test 202 199 203 200 TEST_GEN_PROGS_riscv = $(TEST_GEN_PROGS_COMMON) ··· 215 210 TEST_GEN_PROGS_riscv += rseq_test 216 211 TEST_GEN_PROGS_riscv += steal_time 217 212 213 + TEST_GEN_PROGS_loongarch = arch_timer 218 214 TEST_GEN_PROGS_loongarch += coalesced_io_test 219 215 TEST_GEN_PROGS_loongarch += demand_paging_test 220 216 TEST_GEN_PROGS_loongarch += dirty_log_perf_test
+166
tools/testing/selftests/kvm/arm64/at.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * at - Test for KVM's AT emulation in the EL2&0 and EL1&0 translation regimes. 4 + */ 5 + #include "kvm_util.h" 6 + #include "processor.h" 7 + #include "test_util.h" 8 + #include "ucall.h" 9 + 10 + #include <asm/sysreg.h> 11 + 12 + #define TEST_ADDR 0x80000000 13 + 14 + enum { 15 + CLEAR_ACCESS_FLAG, 16 + TEST_ACCESS_FLAG, 17 + }; 18 + 19 + static u64 *ptep_hva; 20 + 21 + #define copy_el2_to_el1(reg) \ 22 + write_sysreg_s(read_sysreg_s(SYS_##reg##_EL1), SYS_##reg##_EL12) 23 + 24 + /* Yes, this is an ugly hack */ 25 + #define __at(op, addr) write_sysreg_s(addr, op) 26 + 27 + #define test_at_insn(op, expect_fault) \ 28 + do { \ 29 + u64 par, fsc; \ 30 + bool fault; \ 31 + \ 32 + GUEST_SYNC(CLEAR_ACCESS_FLAG); \ 33 + \ 34 + __at(OP_AT_##op, TEST_ADDR); \ 35 + isb(); \ 36 + par = read_sysreg(par_el1); \ 37 + \ 38 + fault = par & SYS_PAR_EL1_F; \ 39 + fsc = FIELD_GET(SYS_PAR_EL1_FST, par); \ 40 + \ 41 + __GUEST_ASSERT((expect_fault) == fault, \ 42 + "AT "#op": %sexpected fault (par: %lx)", \ 43 + (expect_fault) ? "" : "un", par); \ 44 + if ((expect_fault)) { \ 45 + __GUEST_ASSERT(fsc == ESR_ELx_FSC_ACCESS_L(3), \ 46 + "AT "#op": expected access flag fault (par: %lx)", \ 47 + par); \ 48 + } else { \ 49 + GUEST_ASSERT_EQ(FIELD_GET(SYS_PAR_EL1_ATTR, par), MAIR_ATTR_NORMAL); \ 50 + GUEST_ASSERT_EQ(FIELD_GET(SYS_PAR_EL1_SH, par), PTE_SHARED >> 8); \ 51 + GUEST_ASSERT_EQ(par & SYS_PAR_EL1_PA, TEST_ADDR); \ 52 + GUEST_SYNC(TEST_ACCESS_FLAG); \ 53 + } \ 54 + } while (0) 55 + 56 + static void test_at(bool expect_fault) 57 + { 58 + test_at_insn(S1E2R, expect_fault); 59 + test_at_insn(S1E2W, expect_fault); 60 + 61 + /* Reuse the stage-1 MMU context from EL2 at EL1 */ 62 + copy_el2_to_el1(SCTLR); 63 + copy_el2_to_el1(MAIR); 64 + copy_el2_to_el1(TCR); 65 + copy_el2_to_el1(TTBR0); 66 + copy_el2_to_el1(TTBR1); 67 + 68 + /* Disable stage-2 translation and enter a non-host context */ 69 + write_sysreg(0, vtcr_el2); 70 + write_sysreg(0, vttbr_el2); 71 + sysreg_clear_set(hcr_el2, HCR_EL2_TGE | HCR_EL2_VM, 0); 72 + isb(); 73 + 74 + test_at_insn(S1E1R, expect_fault); 75 + test_at_insn(S1E1W, expect_fault); 76 + } 77 + 78 + static void guest_code(void) 79 + { 80 + sysreg_clear_set(tcr_el1, TCR_HA, 0); 81 + isb(); 82 + 83 + test_at(true); 84 + 85 + if (!SYS_FIELD_GET(ID_AA64MMFR1_EL1, HAFDBS, read_sysreg(id_aa64mmfr1_el1))) 86 + GUEST_DONE(); 87 + 88 + /* 89 + * KVM's software PTW makes the implementation choice that the AT 90 + * instruction sets the access flag. 91 + */ 92 + sysreg_clear_set(tcr_el1, 0, TCR_HA); 93 + isb(); 94 + test_at(false); 95 + 96 + GUEST_DONE(); 97 + } 98 + 99 + static void handle_sync(struct kvm_vcpu *vcpu, struct ucall *uc) 100 + { 101 + switch (uc->args[1]) { 102 + case CLEAR_ACCESS_FLAG: 103 + /* 104 + * Delete + reinstall the memslot to invalidate stage-2 105 + * mappings of the stage-1 page tables, forcing KVM to 106 + * use the 'slow' AT emulation path. 107 + * 108 + * This and clearing the access flag from host userspace 109 + * ensures that the access flag cannot be set speculatively 110 + * and is reliably cleared at the time of the AT instruction. 
111 + */ 112 + clear_bit(__ffs(PTE_AF), ptep_hva); 113 + vm_mem_region_reload(vcpu->vm, vcpu->vm->memslots[MEM_REGION_PT]); 114 + break; 115 + case TEST_ACCESS_FLAG: 116 + TEST_ASSERT(test_bit(__ffs(PTE_AF), ptep_hva), 117 + "Expected access flag to be set (desc: %lu)", *ptep_hva); 118 + break; 119 + default: 120 + TEST_FAIL("Unexpected SYNC arg: %lu", uc->args[1]); 121 + } 122 + } 123 + 124 + static void run_test(struct kvm_vcpu *vcpu) 125 + { 126 + struct ucall uc; 127 + 128 + while (true) { 129 + vcpu_run(vcpu); 130 + switch (get_ucall(vcpu, &uc)) { 131 + case UCALL_DONE: 132 + return; 133 + case UCALL_SYNC: 134 + handle_sync(vcpu, &uc); 135 + continue; 136 + case UCALL_ABORT: 137 + REPORT_GUEST_ASSERT(uc); 138 + return; 139 + default: 140 + TEST_FAIL("Unexpected ucall: %lu", uc.cmd); 141 + } 142 + } 143 + } 144 + 145 + int main(void) 146 + { 147 + struct kvm_vcpu_init init; 148 + struct kvm_vcpu *vcpu; 149 + struct kvm_vm *vm; 150 + 151 + TEST_REQUIRE(kvm_check_cap(KVM_CAP_ARM_EL2)); 152 + 153 + vm = vm_create(1); 154 + 155 + kvm_get_default_vcpu_target(vm, &init); 156 + init.features[0] |= BIT(KVM_ARM_VCPU_HAS_EL2); 157 + vcpu = aarch64_vcpu_add(vm, 0, &init, guest_code); 158 + kvm_arch_vm_finalize_vcpus(vm); 159 + 160 + virt_map(vm, TEST_ADDR, TEST_ADDR, 1); 161 + ptep_hva = virt_get_pte_hva_at_level(vm, TEST_ADDR, 3); 162 + run_test(vcpu); 163 + 164 + kvm_vm_free(vm); 165 + return 0; 166 + }
+331
tools/testing/selftests/kvm/arm64/sea_to_user.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * Test that KVM returns to userspace with KVM_EXIT_ARM_SEA if host APEI fails 4 + * to handle an SEA and userspace has opted in to KVM_CAP_ARM_SEA_TO_USER. 5 + * 6 + * After reaching userspace with the expected arm_sea info, also test 7 + * userspace injecting a synchronous external data abort back into the guest. 8 + * 9 + * This test uses EINJ to generate a REAL synchronous external data 10 + * abort by consuming a recoverable uncorrectable memory error. The device 11 + * under test must therefore support EINJ in both firmware and the host 12 + * kernel, including the notrigger feature; otherwise the test is skipped. 13 + * The platform under test must also have APEI that is unable to claim the 14 + * SEA, or the test is likewise skipped. 15 + */ 16 + 17 + #include <signal.h> 18 + #include <stdio.h> 19 + #include <stdlib.h> 20 + #include <unistd.h> 21 + 22 + #include "test_util.h" 23 + #include "kvm_util.h" 24 + #include "processor.h" 25 + #include "guest_modes.h" 26 + 27 + #define PAGE_PRESENT (1ULL << 63) 28 + #define PAGE_PHYSICAL 0x007fffffffffffffULL 29 + #define PAGE_ADDR_MASK (~(0xfffULL)) 30 + 31 + /* Group ISV and ISS[23:14]. */ 32 + #define ESR_ELx_INST_SYNDROME ((ESR_ELx_ISV) | (ESR_ELx_SAS) | \ 33 + (ESR_ELx_SSE) | (ESR_ELx_SRT_MASK) | \ 34 + (ESR_ELx_SF) | (ESR_ELx_AR)) 35 + 36 + #define EINJ_ETYPE "/sys/kernel/debug/apei/einj/error_type" 37 + #define EINJ_ADDR "/sys/kernel/debug/apei/einj/param1" 38 + #define EINJ_MASK "/sys/kernel/debug/apei/einj/param2" 39 + #define EINJ_FLAGS "/sys/kernel/debug/apei/einj/flags" 40 + #define EINJ_NOTRIGGER "/sys/kernel/debug/apei/einj/notrigger" 41 + #define EINJ_DOIT "/sys/kernel/debug/apei/einj/error_inject" 42 + /* Memory Uncorrectable non-fatal. */ 43 + #define ERROR_TYPE_MEMORY_UER 0x10 44 + /* Memory address and mask valid (param1 and param2). */ 45 + #define MASK_MEMORY_UER 0b10 46 + 47 + /* Guest virtual address region = [2G, 3G). */ 48 + #define START_GVA 0x80000000UL 49 + #define VM_MEM_SIZE 0x40000000UL 50 + /* Note: EINJ_OFFSET must be < VM_MEM_SIZE. 
*/ 51 + #define EINJ_OFFSET 0x01234badUL 52 + #define EINJ_GVA ((START_GVA) + (EINJ_OFFSET)) 53 + 54 + static vm_paddr_t einj_gpa; 55 + static void *einj_hva; 56 + static uint64_t einj_hpa; 57 + static bool far_invalid; 58 + 59 + static uint64_t translate_to_host_paddr(unsigned long vaddr) 60 + { 61 + uint64_t pinfo; 62 + int64_t offset = vaddr / getpagesize() * sizeof(pinfo); 63 + int fd; 64 + uint64_t page_addr; 65 + uint64_t paddr; 66 + 67 + fd = open("/proc/self/pagemap", O_RDONLY); 68 + if (fd < 0) 69 + ksft_exit_fail_perror("Failed to open /proc/self/pagemap"); 70 + if (pread(fd, &pinfo, sizeof(pinfo), offset) != sizeof(pinfo)) { 71 + close(fd); 72 + ksft_exit_fail_perror("Failed to read /proc/self/pagemap"); 73 + } 74 + 75 + close(fd); 76 + 77 + if ((pinfo & PAGE_PRESENT) == 0) 78 + ksft_exit_fail_perror("Page not present"); 79 + 80 + page_addr = (pinfo & PAGE_PHYSICAL) << MIN_PAGE_SHIFT; 81 + paddr = page_addr + (vaddr & (getpagesize() - 1)); 82 + return paddr; 83 + } 84 + 85 + static void write_einj_entry(const char *einj_path, uint64_t val) 86 + { 87 + char cmd[256] = {0}; 88 + FILE *cmdfile = NULL; 89 + 90 + sprintf(cmd, "echo %#lx > %s", val, einj_path); 91 + cmdfile = popen(cmd, "r"); 92 + 93 + if (pclose(cmdfile) == 0) 94 + ksft_print_msg("echo %#lx > %s - done\n", val, einj_path); 95 + else 96 + ksft_exit_fail_perror("Failed to write EINJ entry"); 97 + } 98 + 99 + static void inject_uer(uint64_t paddr) 100 + { 101 + if (access("/sys/firmware/acpi/tables/EINJ", R_OK) == -1) 102 + ksft_test_result_skip("EINJ table not available in firmware"); 103 + 104 + if (access(EINJ_ETYPE, R_OK | W_OK) == -1) 105 + ksft_test_result_skip("EINJ module probably not loaded?"); 106 + 107 + write_einj_entry(EINJ_ETYPE, ERROR_TYPE_MEMORY_UER); 108 + write_einj_entry(EINJ_FLAGS, MASK_MEMORY_UER); 109 + write_einj_entry(EINJ_ADDR, paddr); 110 + write_einj_entry(EINJ_MASK, ~0x0UL); 111 + write_einj_entry(EINJ_NOTRIGGER, 1); 112 + write_einj_entry(EINJ_DOIT, 1); 113 + } 114 + 115 + /* 116 + * When host APEI successfully claims the SEA caused by guest_code, the 117 + * kernel sends a SIGBUS signal with BUS_MCEERR_AR to the test thread. 118 + * 119 + * We set up this SIGBUS handler to skip the test for that case. 120 + */ 121 + static void sigbus_signal_handler(int sig, siginfo_t *si, void *v) 122 + { 123 + ksft_print_msg("SIGBUS (%d) received, dumping siginfo...\n", sig); 124 + ksft_print_msg("si_signo=%d, si_errno=%d, si_code=%d, si_addr=%p\n", 125 + si->si_signo, si->si_errno, si->si_code, si->si_addr); 126 + if (si->si_code == BUS_MCEERR_AR) 127 + ksft_test_result_skip("SEA is claimed by host APEI\n"); 128 + else 129 + ksft_test_result_fail("Exit with signal unhandled\n"); 130 + 131 + exit(0); 132 + } 133 + 134 + static void setup_sigbus_handler(void) 135 + { 136 + struct sigaction act; 137 + 138 + memset(&act, 0, sizeof(act)); 139 + sigemptyset(&act.sa_mask); 140 + act.sa_sigaction = sigbus_signal_handler; 141 + act.sa_flags = SA_SIGINFO; 142 + TEST_ASSERT(sigaction(SIGBUS, &act, NULL) == 0, 143 + "Failed to set up SIGBUS handler"); 144 + } 145 + 146 + static void guest_code(void) 147 + { 148 + uint64_t guest_data; 149 + 150 + /* Consuming the error will cause an SEA. 
*/ 151 + guest_data = *(uint64_t *)EINJ_GVA; 152 + 153 + GUEST_FAIL("Poison not protected by SEA: gva=%#lx, guest_data=%#lx\n", 154 + EINJ_GVA, guest_data); 155 + } 156 + 157 + static void expect_sea_handler(struct ex_regs *regs) 158 + { 159 + u64 esr = read_sysreg(esr_el1); 160 + u64 far = read_sysreg(far_el1); 161 + bool expect_far_invalid = far_invalid; 162 + 163 + GUEST_PRINTF("Handling Guest SEA\n"); 164 + GUEST_PRINTF("ESR_EL1=%#lx, FAR_EL1=%#lx\n", esr, far); 165 + 166 + GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_DABT_CUR); 167 + GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); 168 + 169 + if (expect_far_invalid) { 170 + GUEST_ASSERT_EQ(esr & ESR_ELx_FnV, ESR_ELx_FnV); 171 + GUEST_PRINTF("Guest observed garbage value in FAR\n"); 172 + } else { 173 + GUEST_ASSERT_EQ(esr & ESR_ELx_FnV, 0); 174 + GUEST_ASSERT_EQ(far, EINJ_GVA); 175 + } 176 + 177 + GUEST_DONE(); 178 + } 179 + 180 + static void vcpu_inject_sea(struct kvm_vcpu *vcpu) 181 + { 182 + struct kvm_vcpu_events events = {}; 183 + 184 + events.exception.ext_dabt_pending = true; 185 + vcpu_events_set(vcpu, &events); 186 + } 187 + 188 + static void run_vm(struct kvm_vm *vm, struct kvm_vcpu *vcpu) 189 + { 190 + struct ucall uc; 191 + bool guest_done = false; 192 + struct kvm_run *run = vcpu->run; 193 + u64 esr; 194 + 195 + /* Resume the vCPU after error injection to consume the error. */ 196 + vcpu_run(vcpu); 197 + 198 + ksft_print_msg("Dump kvm_run info about KVM_EXIT_%s\n", 199 + exit_reason_str(run->exit_reason)); 200 + ksft_print_msg("kvm_run.arm_sea: esr=%#llx, flags=%#llx\n", 201 + run->arm_sea.esr, run->arm_sea.flags); 202 + ksft_print_msg("kvm_run.arm_sea: gva=%#llx, gpa=%#llx\n", 203 + run->arm_sea.gva, run->arm_sea.gpa); 204 + 205 + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_ARM_SEA); 206 + 207 + esr = run->arm_sea.esr; 208 + TEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_DABT_LOW); 209 + TEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); 210 + TEST_ASSERT_EQ(ESR_ELx_ISS2(esr), 0); 211 + TEST_ASSERT_EQ((esr & ESR_ELx_INST_SYNDROME), 0); 212 + TEST_ASSERT_EQ(esr & ESR_ELx_VNCR, 0); 213 + 214 + if (!(esr & ESR_ELx_FnV)) { 215 + ksft_print_msg("Expect gva to match given that the FnV bit is 0\n"); 216 + TEST_ASSERT_EQ(run->arm_sea.gva, EINJ_GVA); 217 + } 218 + 219 + if (run->arm_sea.flags & KVM_EXIT_ARM_SEA_FLAG_GPA_VALID) { 220 + ksft_print_msg("Expect gpa to match given that KVM_EXIT_ARM_SEA_FLAG_GPA_VALID is set\n"); 221 + TEST_ASSERT_EQ(run->arm_sea.gpa, einj_gpa & PAGE_ADDR_MASK); 222 + } 223 + 224 + far_invalid = esr & ESR_ELx_FnV; 225 + 226 + /* Inject an SEA into the guest and expect it to be handled by the guest's SEA handler. */ 227 + vcpu_inject_sea(vcpu); 228 + 229 + /* Expect the guest to reach GUEST_DONE gracefully. 
*/ 230 + do { 231 + vcpu_run(vcpu); 232 + switch (get_ucall(vcpu, &uc)) { 233 + case UCALL_PRINTF: 234 + ksft_print_msg("From guest: %s", uc.buffer); 235 + break; 236 + case UCALL_DONE: 237 + ksft_print_msg("Guest done gracefully!\n"); 238 + guest_done = 1; 239 + break; 240 + case UCALL_ABORT: 241 + ksft_print_msg("Guest aborted!\n"); 242 + guest_done = 1; 243 + REPORT_GUEST_ASSERT(uc); 244 + break; 245 + default: 246 + TEST_FAIL("Unexpected ucall: %lu\n", uc.cmd); 247 + } 248 + } while (!guest_done); 249 + } 250 + 251 + static struct kvm_vm *vm_create_with_sea_handler(struct kvm_vcpu **vcpu) 252 + { 253 + size_t backing_page_size; 254 + size_t guest_page_size; 255 + size_t alignment; 256 + uint64_t num_guest_pages; 257 + vm_paddr_t start_gpa; 258 + enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB; 259 + struct kvm_vm *vm; 260 + 261 + backing_page_size = get_backing_src_pagesz(src_type); 262 + guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size; 263 + alignment = max(backing_page_size, guest_page_size); 264 + num_guest_pages = VM_MEM_SIZE / guest_page_size; 265 + 266 + vm = __vm_create_with_one_vcpu(vcpu, num_guest_pages, guest_code); 267 + vm_init_descriptor_tables(vm); 268 + vcpu_init_descriptor_tables(*vcpu); 269 + 270 + vm_install_sync_handler(vm, 271 + /*vector=*/VECTOR_SYNC_CURRENT, 272 + /*ec=*/ESR_ELx_EC_DABT_CUR, 273 + /*handler=*/expect_sea_handler); 274 + 275 + start_gpa = (vm->max_gfn - num_guest_pages) * guest_page_size; 276 + start_gpa = align_down(start_gpa, alignment); 277 + 278 + vm_userspace_mem_region_add( 279 + /*vm=*/vm, 280 + /*src_type=*/src_type, 281 + /*guest_paddr=*/start_gpa, 282 + /*slot=*/1, 283 + /*npages=*/num_guest_pages, 284 + /*flags=*/0); 285 + 286 + virt_map(vm, START_GVA, start_gpa, num_guest_pages); 287 + 288 + ksft_print_msg("Mapped %#lx pages: gva=%#lx to gpa=%#lx\n", 289 + num_guest_pages, START_GVA, start_gpa); 290 + return vm; 291 + } 292 + 293 + static void vm_inject_memory_uer(struct kvm_vm *vm) 294 + { 295 + uint64_t guest_data; 296 + 297 + einj_gpa = addr_gva2gpa(vm, EINJ_GVA); 298 + einj_hva = addr_gva2hva(vm, EINJ_GVA); 299 + 300 + /* Populate certain data before injecting UER. */ 301 + *(uint64_t *)einj_hva = 0xBAADCAFE; 302 + guest_data = *(uint64_t *)einj_hva; 303 + ksft_print_msg("Before EINJect: data=%#lx\n", 304 + guest_data); 305 + 306 + einj_hpa = translate_to_host_paddr((unsigned long)einj_hva); 307 + 308 + ksft_print_msg("EINJ_GVA=%#lx, einj_gpa=%#lx, einj_hva=%p, einj_hpa=%#lx\n", 309 + EINJ_GVA, einj_gpa, einj_hva, einj_hpa); 310 + 311 + inject_uer(einj_hpa); 312 + ksft_print_msg("Memory UER EINJected\n"); 313 + } 314 + 315 + int main(int argc, char *argv[]) 316 + { 317 + struct kvm_vm *vm; 318 + struct kvm_vcpu *vcpu; 319 + 320 + TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_SEA_TO_USER)); 321 + 322 + setup_sigbus_handler(); 323 + 324 + vm = vm_create_with_sea_handler(&vcpu); 325 + vm_enable_cap(vm, KVM_CAP_ARM_SEA_TO_USER, 0); 326 + vm_inject_memory_uer(vm); 327 + run_vm(vm, vcpu); 328 + kvm_vm_free(vm); 329 + 330 + return 0; 331 + }
+264 -23
tools/testing/selftests/kvm/arm64/vgic_irq.c
··· 29 29 bool level_sensitive; /* 1 is level, 0 is edge */ 30 30 int kvm_max_routes; /* output of KVM_CAP_IRQ_ROUTING */ 31 31 bool kvm_supports_irqfd; /* output of KVM_CAP_IRQFD */ 32 + uint32_t shared_data; 32 33 }; 33 34 34 35 /* ··· 206 205 do { \ 207 206 uint32_t _intid; \ 208 207 _intid = gic_get_and_ack_irq(); \ 209 - GUEST_ASSERT(_intid == 0 || _intid == IAR_SPURIOUS); \ 208 + GUEST_ASSERT(_intid == IAR_SPURIOUS); \ 210 209 } while (0) 211 210 212 211 #define CAT_HELPER(a, b) a ## b ··· 360 359 * interrupts for the whole test. 361 360 */ 362 361 static void test_inject_preemption(struct test_args *args, 363 - uint32_t first_intid, int num, 364 - kvm_inject_cmd cmd) 362 + uint32_t first_intid, int num, 363 + const unsigned long *exclude, 364 + kvm_inject_cmd cmd) 365 365 { 366 366 uint32_t intid, prio, step = KVM_PRIO_STEPS; 367 367 int i; ··· 381 379 for (i = 0; i < num; i++) { 382 380 uint32_t tmp; 383 381 intid = i + first_intid; 382 + 383 + if (exclude && test_bit(i, exclude)) 384 + continue; 385 + 384 386 KVM_INJECT(cmd, intid); 385 387 /* Each successive IRQ will preempt the previous one. */ 386 388 tmp = wait_for_and_activate_irq(); ··· 396 390 /* finish handling the IRQs starting with the highest priority one. */ 397 391 for (i = 0; i < num; i++) { 398 392 intid = num - i - 1 + first_intid; 393 + 394 + if (exclude && test_bit(intid - first_intid, exclude)) 395 + continue; 396 + 399 397 gic_set_eoi(intid); 400 - if (args->eoi_split) 401 - gic_set_dir(intid); 398 + } 399 + 400 + if (args->eoi_split) { 401 + for (i = 0; i < num; i++) { 402 + intid = i + first_intid; 403 + 404 + if (exclude && test_bit(i, exclude)) 405 + continue; 406 + 407 + if (args->eoi_split) 408 + gic_set_dir(intid); 409 + } 402 410 } 403 411 404 412 local_irq_enable(); 405 413 406 - for (i = 0; i < num; i++) 414 + for (i = 0; i < num; i++) { 415 + if (exclude && test_bit(i, exclude)) 416 + continue; 417 + 407 418 GUEST_ASSERT(!gic_irq_get_active(i + first_intid)); 419 + } 408 420 GUEST_ASSERT_EQ(gic_read_ap1r0(), 0); 409 421 GUEST_ASSERT_IAR_EMPTY(); 410 422 ··· 460 436 461 437 static void test_preemption(struct test_args *args, struct kvm_inject_desc *f) 462 438 { 463 - /* 464 - * Test up to 4 levels of preemption. The reason is that KVM doesn't 465 - * currently implement the ability to have more than the number-of-LRs 466 - * number of concurrently active IRQs. The number of LRs implemented is 467 - * IMPLEMENTATION DEFINED, however, it seems that most implement 4. 468 - */ 439 + /* Timer PPIs cannot be injected from userspace */ 440 + static const unsigned long ppi_exclude = (BIT(27 - MIN_PPI) | 441 + BIT(30 - MIN_PPI) | 442 + BIT(28 - MIN_PPI) | 443 + BIT(26 - MIN_PPI)); 444 + 469 445 if (f->sgi) 470 - test_inject_preemption(args, MIN_SGI, 4, f->cmd); 446 + test_inject_preemption(args, MIN_SGI, 16, NULL, f->cmd); 471 447 472 448 if (f->ppi) 473 - test_inject_preemption(args, MIN_PPI, 4, f->cmd); 449 + test_inject_preemption(args, MIN_PPI, 16, &ppi_exclude, f->cmd); 474 450 475 451 if (f->spi) 476 - test_inject_preemption(args, MIN_SPI, 4, f->cmd); 452 + test_inject_preemption(args, MIN_SPI, 31, NULL, f->cmd); 477 453 } 478 454 479 455 static void test_restore_active(struct test_args *args, struct kvm_inject_desc *f) 480 456 { 481 - /* Test up to 4 active IRQs. Same reason as in test_preemption. 
*/ 482 457 if (f->sgi) 483 - guest_restore_active(args, MIN_SGI, 4, f->cmd); 458 + guest_restore_active(args, MIN_SGI, 16, f->cmd); 484 459 485 460 if (f->ppi) 486 - guest_restore_active(args, MIN_PPI, 4, f->cmd); 461 + guest_restore_active(args, MIN_PPI, 16, f->cmd); 487 462 488 463 if (f->spi) 489 - guest_restore_active(args, MIN_SPI, 4, f->cmd); 464 + guest_restore_active(args, MIN_SPI, 31, f->cmd); 490 465 } 491 466 492 467 static void guest_code(struct test_args *args) ··· 496 473 497 474 gic_init(GIC_V3, 1); 498 475 499 - for (i = 0; i < nr_irqs; i++) 500 - gic_irq_enable(i); 501 - 502 476 for (i = MIN_SPI; i < nr_irqs; i++) 503 477 gic_irq_set_config(i, !level_sensitive); 478 + 479 + for (i = 0; i < nr_irqs; i++) 480 + gic_irq_enable(i); 504 481 505 482 gic_set_eoi_split(args->eoi_split); 506 483 ··· 659 636 } 660 637 661 638 for (f = 0, i = intid; i < (uint64_t)intid + num; i++, f++) 662 - close(fd[f]); 639 + kvm_close(fd[f]); 663 640 } 664 641 665 642 /* handles the valid case: intid=0xffffffff num=1 */ ··· 802 779 kvm_vm_free(vm); 803 780 } 804 781 782 + static void guest_code_asym_dir(struct test_args *args, int cpuid) 783 + { 784 + gic_init(GIC_V3, 2); 785 + 786 + gic_set_eoi_split(1); 787 + gic_set_priority_mask(CPU_PRIO_MASK); 788 + 789 + if (cpuid == 0) { 790 + uint32_t intid; 791 + 792 + local_irq_disable(); 793 + 794 + gic_set_priority(MIN_PPI, IRQ_DEFAULT_PRIO); 795 + gic_irq_enable(MIN_SPI); 796 + gic_irq_set_pending(MIN_SPI); 797 + 798 + intid = wait_for_and_activate_irq(); 799 + GUEST_ASSERT_EQ(intid, MIN_SPI); 800 + 801 + gic_set_eoi(intid); 802 + isb(); 803 + 804 + WRITE_ONCE(args->shared_data, MIN_SPI); 805 + dsb(ishst); 806 + 807 + do { 808 + dsb(ishld); 809 + } while (READ_ONCE(args->shared_data) == MIN_SPI); 810 + GUEST_ASSERT(!gic_irq_get_active(MIN_SPI)); 811 + } else { 812 + do { 813 + dsb(ishld); 814 + } while (READ_ONCE(args->shared_data) != MIN_SPI); 815 + 816 + gic_set_dir(MIN_SPI); 817 + isb(); 818 + 819 + WRITE_ONCE(args->shared_data, 0); 820 + dsb(ishst); 821 + } 822 + 823 + GUEST_DONE(); 824 + } 825 + 826 + static void guest_code_group_en(struct test_args *args, int cpuid) 827 + { 828 + uint32_t intid; 829 + 830 + gic_init(GIC_V3, 2); 831 + 832 + gic_set_eoi_split(0); 833 + gic_set_priority_mask(CPU_PRIO_MASK); 834 + /* SGI0 is G0, which is disabled */ 835 + gic_irq_set_group(0, 0); 836 + 837 + /* Configure all SGIs with decreasing priority */ 838 + for (intid = 0; intid < MIN_PPI; intid++) { 839 + gic_set_priority(intid, (intid + 1) * 8); 840 + gic_irq_enable(intid); 841 + gic_irq_set_pending(intid); 842 + } 843 + 844 + /* Ack and EOI all G1 interrupts */ 845 + for (int i = 1; i < MIN_PPI; i++) { 846 + intid = wait_for_and_activate_irq(); 847 + 848 + GUEST_ASSERT(intid < MIN_PPI); 849 + gic_set_eoi(intid); 850 + isb(); 851 + } 852 + 853 + /* 854 + * Check that SGI0 is still pending, inactive, and that we cannot 855 + * ack anything. 
856 + */ 857 + GUEST_ASSERT(gic_irq_get_pending(0)); 858 + GUEST_ASSERT(!gic_irq_get_active(0)); 859 + GUEST_ASSERT_IAR_EMPTY(); 860 + GUEST_ASSERT(read_sysreg_s(SYS_ICC_IAR0_EL1) == IAR_SPURIOUS); 861 + 862 + /* Open the G0 gates, and verify we can ack SGI0 */ 863 + write_sysreg_s(1, SYS_ICC_IGRPEN0_EL1); 864 + isb(); 865 + 866 + do { 867 + intid = read_sysreg_s(SYS_ICC_IAR0_EL1); 868 + } while (intid == IAR_SPURIOUS); 869 + 870 + GUEST_ASSERT(intid == 0); 871 + GUEST_DONE(); 872 + } 873 + 874 + static void guest_code_timer_spi(struct test_args *args, int cpuid) 875 + { 876 + uint32_t intid; 877 + u64 val; 878 + 879 + gic_init(GIC_V3, 2); 880 + 881 + gic_set_eoi_split(1); 882 + gic_set_priority_mask(CPU_PRIO_MASK); 883 + 884 + /* Add a pending SPI so that KVM starts trapping DIR */ 885 + gic_set_priority(MIN_SPI + cpuid, IRQ_DEFAULT_PRIO); 886 + gic_irq_set_pending(MIN_SPI + cpuid); 887 + 888 + /* Configure the timer with a higher priority, make it pending */ 889 + gic_set_priority(27, IRQ_DEFAULT_PRIO - 8); 890 + 891 + isb(); 892 + val = read_sysreg(cntvct_el0); 893 + write_sysreg(val, cntv_cval_el0); 894 + write_sysreg(1, cntv_ctl_el0); 895 + isb(); 896 + 897 + GUEST_ASSERT(gic_irq_get_pending(27)); 898 + 899 + /* Enable both interrupts */ 900 + gic_irq_enable(MIN_SPI + cpuid); 901 + gic_irq_enable(27); 902 + 903 + /* The timer must fire */ 904 + intid = wait_for_and_activate_irq(); 905 + GUEST_ASSERT(intid == 27); 906 + 907 + /* Check that we can deassert it */ 908 + write_sysreg(0, cntv_ctl_el0); 909 + isb(); 910 + 911 + GUEST_ASSERT(!gic_irq_get_pending(27)); 912 + 913 + /* 914 + * Priority drop, deactivation -- we expect that the host 915 + * deactivation will have been effective 916 + */ 917 + gic_set_eoi(27); 918 + gic_set_dir(27); 919 + 920 + GUEST_ASSERT(!gic_irq_get_active(27)); 921 + 922 + /* Do it one more time */ 923 + isb(); 924 + val = read_sysreg(cntvct_el0); 925 + write_sysreg(val, cntv_cval_el0); 926 + write_sysreg(1, cntv_ctl_el0); 927 + isb(); 928 + 929 + GUEST_ASSERT(gic_irq_get_pending(27)); 930 + 931 + /* The timer must fire again */ 932 + intid = wait_for_and_activate_irq(); 933 + GUEST_ASSERT(intid == 27); 934 + 935 + GUEST_DONE(); 936 + } 937 + 938 + static void *test_vcpu_run(void *arg) 939 + { 940 + struct kvm_vcpu *vcpu = arg; 941 + struct ucall uc; 942 + 943 + while (1) { 944 + vcpu_run(vcpu); 945 + 946 + switch (get_ucall(vcpu, &uc)) { 947 + case UCALL_ABORT: 948 + REPORT_GUEST_ASSERT(uc); 949 + break; 950 + case UCALL_DONE: 951 + return NULL; 952 + default: 953 + TEST_FAIL("Unknown ucall %lu", uc.cmd); 954 + } 955 + } 956 + 957 + return NULL; 958 + } 959 + 960 + static void test_vgic_two_cpus(void *gcode) 961 + { 962 + pthread_t thr[2]; 963 + struct kvm_vcpu *vcpus[2]; 964 + struct test_args args = {}; 965 + struct kvm_vm *vm; 966 + vm_vaddr_t args_gva; 967 + int gic_fd, ret; 968 + 969 + vm = vm_create_with_vcpus(2, gcode, vcpus); 970 + 971 + vm_init_descriptor_tables(vm); 972 + vcpu_init_descriptor_tables(vcpus[0]); 973 + vcpu_init_descriptor_tables(vcpus[1]); 974 + 975 + /* Setup the guest args page (so it gets the args). 
*/ 976 + args_gva = vm_vaddr_alloc_page(vm); 977 + memcpy(addr_gva2hva(vm, args_gva), &args, sizeof(args)); 978 + vcpu_args_set(vcpus[0], 2, args_gva, 0); 979 + vcpu_args_set(vcpus[1], 2, args_gva, 1); 980 + 981 + gic_fd = vgic_v3_setup(vm, 2, 64); 982 + 983 + ret = pthread_create(&thr[0], NULL, test_vcpu_run, vcpus[0]); 984 + if (ret) 985 + TEST_FAIL("Can't create thread for vcpu 0 (%d)\n", ret); 986 + ret = pthread_create(&thr[1], NULL, test_vcpu_run, vcpus[1]); 987 + if (ret) 988 + TEST_FAIL("Can't create thread for vcpu 1 (%d)\n", ret); 989 + 990 + pthread_join(thr[0], NULL); 991 + pthread_join(thr[1], NULL); 992 + 993 + close(gic_fd); 994 + kvm_vm_free(vm); 995 + } 996 + 805 997 static void help(const char *name) 806 998 { 807 999 printf( ··· 1073 835 test_vgic(nr_irqs, false /* level */, true /* eoi_split */); 1074 836 test_vgic(nr_irqs, true /* level */, false /* eoi_split */); 1075 837 test_vgic(nr_irqs, true /* level */, true /* eoi_split */); 838 + test_vgic_two_cpus(guest_code_asym_dir); 839 + test_vgic_two_cpus(guest_code_group_en); 840 + test_vgic_two_cpus(guest_code_timer_spi); 1076 841 } else { 1077 842 test_vgic(nr_irqs, level_sensitive, eoi_split); 1078 843 }
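The test_vgic_two_cpus() scaffolding above accepts any guest function with the (struct test_args *, int cpuid) signature, so further two-vCPU scenarios are cheap to add; a minimal hypothetical sketch (not part of this merge) of the expected shape:

/* Hypothetical smoke test: both vCPUs bring up the GIC and exit,
 * illustrating the guest signature test_vgic_two_cpus() expects. */
static void guest_code_smoke(struct test_args *args, int cpuid)
{
	gic_init(GIC_V3, 2);
	GUEST_ASSERT(cpuid == 0 || cpuid == 1);
	GUEST_DONE();
}

/* ...and in main(): test_vgic_two_cpus(guest_code_smoke); */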
+4
tools/testing/selftests/kvm/arm64/vgic_lpi_stress.c
··· 118 118 119 119 guest_setup_its_mappings(); 120 120 guest_invalidate_all_rdists(); 121 + 122 + /* SYNC to ensure ITS setup is complete */ 123 + for (cpuid = 0; cpuid < test_data.nr_cpus; cpuid++) 124 + its_send_sync_cmd(test_data.cmdq_base_va, cpuid); 121 125 } 122 126 123 127 static void guest_code(size_t nr_lpis)
+98
tools/testing/selftests/kvm/guest_memfd_test.c
··· 19 19 #include <sys/stat.h> 20 20 21 21 #include "kvm_util.h" 22 + #include "numaif.h" 22 23 #include "test_util.h" 23 24 #include "ucall_common.h" 24 25 ··· 72 71 memset(mem, val, page_size); 73 72 for (i = 0; i < total_size; i++) 74 73 TEST_ASSERT_EQ(READ_ONCE(mem[i]), val); 74 + 75 + kvm_munmap(mem, total_size); 76 + } 77 + 78 + static void test_mbind(int fd, size_t total_size) 79 + { 80 + const unsigned long nodemask_0 = 1; /* nid: 0 */ 81 + unsigned long nodemask = 0; 82 + unsigned long maxnode = 8; 83 + int policy; 84 + char *mem; 85 + int ret; 86 + 87 + if (!is_multi_numa_node_system()) 88 + return; 89 + 90 + mem = kvm_mmap(total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd); 91 + 92 + /* Test MPOL_INTERLEAVE policy */ 93 + kvm_mbind(mem, page_size * 2, MPOL_INTERLEAVE, &nodemask_0, maxnode, 0); 94 + kvm_get_mempolicy(&policy, &nodemask, maxnode, mem, MPOL_F_ADDR); 95 + TEST_ASSERT(policy == MPOL_INTERLEAVE && nodemask == nodemask_0, 96 + "Wanted MPOL_INTERLEAVE (%u) and nodemask 0x%lx, got %u and 0x%lx", 97 + MPOL_INTERLEAVE, nodemask_0, policy, nodemask); 98 + 99 + /* Test basic MPOL_BIND policy */ 100 + kvm_mbind(mem + page_size * 2, page_size * 2, MPOL_BIND, &nodemask_0, maxnode, 0); 101 + kvm_get_mempolicy(&policy, &nodemask, maxnode, mem + page_size * 2, MPOL_F_ADDR); 102 + TEST_ASSERT(policy == MPOL_BIND && nodemask == nodemask_0, 103 + "Wanted MPOL_BIND (%u) and nodemask 0x%lx, got %u and 0x%lx", 104 + MPOL_BIND, nodemask_0, policy, nodemask); 105 + 106 + /* Test MPOL_DEFAULT policy */ 107 + kvm_mbind(mem, total_size, MPOL_DEFAULT, NULL, 0, 0); 108 + kvm_get_mempolicy(&policy, &nodemask, maxnode, mem, MPOL_F_ADDR); 109 + TEST_ASSERT(policy == MPOL_DEFAULT && !nodemask, 110 + "Wanted MPOL_DEFAULT (%u) and nodemask 0x0, got %u and 0x%lx", 111 + MPOL_DEFAULT, policy, nodemask); 112 + 113 + /* Test with invalid policy */ 114 + ret = mbind(mem, page_size, 999, &nodemask_0, maxnode, 0); 115 + TEST_ASSERT(ret == -1 && errno == EINVAL, 116 + "mbind with invalid policy should fail with EINVAL"); 117 + 118 + kvm_munmap(mem, total_size); 119 + } 120 + 121 + static void test_numa_allocation(int fd, size_t total_size) 122 + { 123 + unsigned long node0_mask = 1; /* Node 0 */ 124 + unsigned long node1_mask = 2; /* Node 1 */ 125 + unsigned long maxnode = 8; 126 + void *pages[4]; 127 + int status[4]; 128 + char *mem; 129 + int i; 130 + 131 + if (!is_multi_numa_node_system()) 132 + return; 133 + 134 + mem = kvm_mmap(total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd); 135 + 136 + for (i = 0; i < 4; i++) 137 + pages[i] = (char *)mem + page_size * i; 138 + 139 + /* Set NUMA policy after allocation */ 140 + memset(mem, 0xaa, page_size); 141 + kvm_mbind(pages[0], page_size, MPOL_BIND, &node0_mask, maxnode, 0); 142 + kvm_fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, page_size); 143 + 144 + /* Set NUMA policy before allocation */ 145 + kvm_mbind(pages[0], page_size * 2, MPOL_BIND, &node1_mask, maxnode, 0); 146 + kvm_mbind(pages[2], page_size * 2, MPOL_BIND, &node0_mask, maxnode, 0); 147 + memset(mem, 0xaa, total_size); 148 + 149 + /* Validate if pages are allocated on specified NUMA nodes */ 150 + kvm_move_pages(0, 4, pages, NULL, status, 0); 151 + TEST_ASSERT(status[0] == 1, "Expected page 0 on node 1, got it on node %d", status[0]); 152 + TEST_ASSERT(status[1] == 1, "Expected page 1 on node 1, got it on node %d", status[1]); 153 + TEST_ASSERT(status[2] == 0, "Expected page 2 on node 0, got it on node %d", status[2]); 154 + TEST_ASSERT(status[3] == 0, "Expected page 3 on node 0, 
got it on node %d", status[3]); 155 + 156 + /* Punch hole for all pages */ 157 + kvm_fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, total_size); 158 + 159 + /* Change NUMA policy nodes and reallocate */ 160 + kvm_mbind(pages[0], page_size * 2, MPOL_BIND, &node0_mask, maxnode, 0); 161 + kvm_mbind(pages[2], page_size * 2, MPOL_BIND, &node1_mask, maxnode, 0); 162 + memset(mem, 0xaa, total_size); 163 + 164 + kvm_move_pages(0, 4, pages, NULL, status, 0); 165 + TEST_ASSERT(status[0] == 0, "Expected page 0 on node 0, got it on node %d", status[0]); 166 + TEST_ASSERT(status[1] == 0, "Expected page 1 on node 0, got it on node %d", status[1]); 167 + TEST_ASSERT(status[2] == 1, "Expected page 2 on node 1, got it on node %d", status[2]); 168 + TEST_ASSERT(status[3] == 1, "Expected page 3 on node 1, got it on node %d", status[3]); 75 169 76 170 kvm_munmap(mem, total_size); 77 171 } ··· 369 273 if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) { 370 274 gmem_test(mmap_supported, vm, flags); 371 275 gmem_test(fault_overflow, vm, flags); 276 + gmem_test(numa_allocation, vm, flags); 372 277 } else { 373 278 gmem_test(fault_private, vm, flags); 374 279 } 375 280 376 281 gmem_test(mmap_cow, vm, flags); 282 + gmem_test(mbind, vm, flags); 377 283 } else { 378 284 gmem_test(mmap_not_supported, vm, flags); 379 285 }
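A note on the kvm_move_pages() checks above: with a NULL nodes argument, move_pages(2) performs a pure query, and each status[i] is filled with the NUMA node currently backing that page, or a negative errno such as -ENOENT if the page is not present. A standalone sketch using the raw syscall, independent of the selftest wrappers:

#include <sys/syscall.h>
#include <unistd.h>

/* Query which node backs 'page'; status[0] becomes the node id (>= 0)
 * or a negative errno on a per-page failure. */
static long query_page_node(void *page, int *status)
{
	void *pages[1] = { page };

	return syscall(__NR_move_pages, 0 /* self */, 1, pages,
		       NULL /* query only */, status, 0);
}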
+1
tools/testing/selftests/kvm/include/arm64/gic.h
··· 57 57 void gic_irq_clear_pending(unsigned int intid); 58 58 bool gic_irq_get_pending(unsigned int intid); 59 59 void gic_irq_set_config(unsigned int intid, bool is_edge); 60 + void gic_irq_set_group(unsigned int intid, bool group); 60 61 61 62 void gic_rdist_enable_lpis(vm_paddr_t cfg_table, size_t cfg_table_size, 62 63 vm_paddr_t pend_table);
+1
tools/testing/selftests/kvm/include/arm64/gic_v3_its.h
··· 15 15 void its_send_mapti_cmd(void *cmdq_base, u32 device_id, u32 event_id, 16 16 u32 collection_id, u32 intid); 17 17 void its_send_invall_cmd(void *cmdq_base, u32 collection_id); 18 + void its_send_sync_cmd(void *cmdq_base, u32 vcpu_id); 18 19 19 20 #endif // __SELFTESTS_GIC_V3_ITS_H__
+81
tools/testing/selftests/kvm/include/kvm_syscalls.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + #ifndef SELFTEST_KVM_SYSCALLS_H 3 + #define SELFTEST_KVM_SYSCALLS_H 4 + 5 + #include <sys/syscall.h> 6 + 7 + #define MAP_ARGS0(m,...) 8 + #define MAP_ARGS1(m,t,a,...) m(t,a) 9 + #define MAP_ARGS2(m,t,a,...) m(t,a), MAP_ARGS1(m,__VA_ARGS__) 10 + #define MAP_ARGS3(m,t,a,...) m(t,a), MAP_ARGS2(m,__VA_ARGS__) 11 + #define MAP_ARGS4(m,t,a,...) m(t,a), MAP_ARGS3(m,__VA_ARGS__) 12 + #define MAP_ARGS5(m,t,a,...) m(t,a), MAP_ARGS4(m,__VA_ARGS__) 13 + #define MAP_ARGS6(m,t,a,...) m(t,a), MAP_ARGS5(m,__VA_ARGS__) 14 + #define MAP_ARGS(n,...) MAP_ARGS##n(__VA_ARGS__) 15 + 16 + #define __DECLARE_ARGS(t, a) t a 17 + #define __UNPACK_ARGS(t, a) a 18 + 19 + #define DECLARE_ARGS(nr_args, args...) MAP_ARGS(nr_args, __DECLARE_ARGS, args) 20 + #define UNPACK_ARGS(nr_args, args...) MAP_ARGS(nr_args, __UNPACK_ARGS, args) 21 + 22 + #define __KVM_SYSCALL_ERROR(_name, _ret) \ 23 + "%s failed, rc: %i errno: %i (%s)", (_name), (_ret), errno, strerror(errno) 24 + 25 + /* Define a kvm_<syscall>() API to assert success. */ 26 + #define __KVM_SYSCALL_DEFINE(name, nr_args, args...) \ 27 + static inline void kvm_##name(DECLARE_ARGS(nr_args, args)) \ 28 + { \ 29 + int r; \ 30 + \ 31 + r = name(UNPACK_ARGS(nr_args, args)); \ 32 + TEST_ASSERT(!r, __KVM_SYSCALL_ERROR(#name, r)); \ 33 + } 34 + 35 + /* 36 + * Macro to define syscall APIs, either because KVM selftests doesn't link to 37 + * the standard library, e.g. libnuma, or because there is no library that yet 38 + * provides the syscall. These define a bare <name>() wrapper around the raw syscall plus an asserting kvm_<name>() variant. 39 + */ 40 + #define KVM_SYSCALL_DEFINE(name, nr_args, args...) \ 41 + static inline long name(DECLARE_ARGS(nr_args, args)) \ 42 + { \ 43 + return syscall(__NR_##name, UNPACK_ARGS(nr_args, args)); \ 44 + } \ 45 + __KVM_SYSCALL_DEFINE(name, nr_args, args) 46 + 47 + /* 48 + * Special case mmap(), as KVM selftests rarely/never specify an address, 49 + * rarely specify an offset, and because the unique return code requires 50 + * special handling anyway. 51 + */ 52 + static inline void *__kvm_mmap(size_t size, int prot, int flags, int fd, 53 + off_t offset) 54 + { 55 + void *mem; 56 + 57 + mem = mmap(NULL, size, prot, flags, fd, offset); 58 + TEST_ASSERT(mem != MAP_FAILED, __KVM_SYSCALL_ERROR("mmap()", 59 + (int)(unsigned long)MAP_FAILED)); 60 + return mem; 61 + } 62 + 63 + static inline void *kvm_mmap(size_t size, int prot, int flags, int fd) 64 + { 65 + return __kvm_mmap(size, prot, flags, fd, 0); 66 + } 67 + 68 + static inline int kvm_dup(int fd) 69 + { 70 + int new_fd = dup(fd); 71 + 72 + TEST_ASSERT(new_fd >= 0, __KVM_SYSCALL_ERROR("dup()", new_fd)); 73 + return new_fd; 74 + } 75 + 76 + __KVM_SYSCALL_DEFINE(munmap, 2, void *, mem, size_t, size); 77 + __KVM_SYSCALL_DEFINE(close, 1, int, fd); 78 + __KVM_SYSCALL_DEFINE(fallocate, 4, int, fd, int, mode, loff_t, offset, loff_t, len); 79 + __KVM_SYSCALL_DEFINE(ftruncate, 2, unsigned int, fd, off_t, length); 80 + 81 + #endif /* SELFTEST_KVM_SYSCALLS_H */
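To make the macro machinery concrete, here is roughly what KVM_SYSCALL_DEFINE(mbind, 6, ...) from the reworked numaif.h below expands to, namely a bare wrapper around the raw syscall plus the asserting kvm_ variant. Hand-expanded sketch, not literal preprocessor output:

static inline long mbind(void *addr, unsigned long size, int mode,
			 const unsigned long *nodemask,
			 unsigned long maxnode, unsigned int flags)
{
	return syscall(__NR_mbind, addr, size, mode, nodemask, maxnode, flags);
}

static inline void kvm_mbind(void *addr, unsigned long size, int mode,
			     const unsigned long *nodemask,
			     unsigned long maxnode, unsigned int flags)
{
	int r = mbind(addr, size, mode, nodemask, maxnode, flags);

	TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("mbind", r));
}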
+10 -35
tools/testing/selftests/kvm/include/kvm_util.h
··· 23 23 24 24 #include <pthread.h> 25 25 26 + #include "kvm_syscalls.h" 26 27 #include "kvm_util_arch.h" 27 28 #include "kvm_util_types.h" 28 29 #include "sparsebit.h" ··· 178 177 VM_MODE_P40V48_4K, 179 178 VM_MODE_P40V48_16K, 180 179 VM_MODE_P40V48_64K, 181 - VM_MODE_PXXV48_4K, /* For 48bits VA but ANY bits PA */ 180 + VM_MODE_PXXVYY_4K, /* For 48-bit or 57-bit VA, depending on host support */ 182 181 VM_MODE_P47V64_4K, 183 182 VM_MODE_P44V64_4K, 184 183 VM_MODE_P36V48_4K, ··· 220 219 221 220 #elif defined(__x86_64__) 222 221 223 - #define VM_MODE_DEFAULT VM_MODE_PXXV48_4K 222 + #define VM_MODE_DEFAULT VM_MODE_PXXVYY_4K 224 223 #define MIN_PAGE_SHIFT 12U 225 224 #define ptes_per_page(page_size) ((page_size) / 8) 226 225 ··· 282 281 static inline bool kvm_has_cap(long cap) 283 282 { 284 283 return kvm_check_cap(cap); 285 - } 286 - 287 - #define __KVM_SYSCALL_ERROR(_name, _ret) \ 288 - "%s failed, rc: %i errno: %i (%s)", (_name), (_ret), errno, strerror(errno) 289 - 290 - static inline void *__kvm_mmap(size_t size, int prot, int flags, int fd, 291 - off_t offset) 292 - { 293 - void *mem; 294 - 295 - mem = mmap(NULL, size, prot, flags, fd, offset); 296 - TEST_ASSERT(mem != MAP_FAILED, __KVM_SYSCALL_ERROR("mmap()", 297 - (int)(unsigned long)MAP_FAILED)); 298 - 299 - return mem; 300 - } 301 - 302 - static inline void *kvm_mmap(size_t size, int prot, int flags, int fd) 303 - { 304 - return __kvm_mmap(size, prot, flags, fd, 0); 305 - } 306 - 307 - static inline void kvm_munmap(void *mem, size_t size) 308 - { 309 - int ret; 310 - 311 - ret = munmap(mem, size); 312 - TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret)); 313 284 } 314 285 315 286 /* ··· 673 700 uint32_t guest_memfd, uint64_t guest_memfd_offset); 674 701 675 702 void vm_userspace_mem_region_add(struct kvm_vm *vm, 676 - enum vm_mem_backing_src_type src_type, 677 - uint64_t guest_paddr, uint32_t slot, uint64_t npages, 678 - uint32_t flags); 703 + enum vm_mem_backing_src_type src_type, 704 + uint64_t gpa, uint32_t slot, uint64_t npages, 705 + uint32_t flags); 679 706 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type, 680 - uint64_t guest_paddr, uint32_t slot, uint64_t npages, 681 - uint32_t flags, int guest_memfd_fd, uint64_t guest_memfd_offset); 707 + uint64_t gpa, uint32_t slot, uint64_t npages, uint32_t flags, 708 + int guest_memfd_fd, uint64_t guest_memfd_offset); 682 709 683 710 #ifndef vm_arch_has_protected_memory 684 711 static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm) ··· 688 715 #endif 689 716 690 717 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags); 718 + void vm_mem_region_reload(struct kvm_vm *vm, uint32_t slot); 691 719 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa); 692 720 void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot); 693 721 struct kvm_vcpu *__vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpu_id); ··· 1204 1230 static inline void virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr) 1205 1231 { 1206 1232 virt_arch_pg_map(vm, vaddr, paddr); 1233 + sparsebit_set(vm->vpages_mapped, vaddr >> vm->page_shift); 1207 1234 } 1208 1235 1209 1236
+85
tools/testing/selftests/kvm/include/loongarch/arch_timer.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * LoongArch Constant Timer specific interface 4 + */ 5 + #ifndef SELFTEST_KVM_ARCH_TIMER_H 6 + #define SELFTEST_KVM_ARCH_TIMER_H 7 + 8 + #include "processor.h" 9 + 10 + /* LoongArch timer frequency is constant 100MHZ */ 11 + #define TIMER_FREQ (100UL << 20) 12 + #define msec_to_cycles(msec) (TIMER_FREQ * (unsigned long)(msec) / 1000) 13 + #define usec_to_cycles(usec) (TIMER_FREQ * (unsigned long)(usec) / 1000000) 14 + #define cycles_to_usec(cycles) ((unsigned long)(cycles) * 1000000 / TIMER_FREQ) 15 + 16 + static inline unsigned long timer_get_cycles(void) 17 + { 18 + unsigned long val = 0; 19 + 20 + __asm__ __volatile__( 21 + "rdtime.d %0, $zero\n\t" 22 + : "=r"(val) 23 + : 24 + ); 25 + 26 + return val; 27 + } 28 + 29 + static inline unsigned long timer_get_cfg(void) 30 + { 31 + return csr_read(LOONGARCH_CSR_TCFG); 32 + } 33 + 34 + static inline unsigned long timer_get_val(void) 35 + { 36 + return csr_read(LOONGARCH_CSR_TVAL); 37 + } 38 + 39 + static inline void disable_timer(void) 40 + { 41 + csr_write(0, LOONGARCH_CSR_TCFG); 42 + } 43 + 44 + static inline void timer_irq_enable(void) 45 + { 46 + unsigned long val; 47 + 48 + val = csr_read(LOONGARCH_CSR_ECFG); 49 + val |= ECFGF_TIMER; 50 + csr_write(val, LOONGARCH_CSR_ECFG); 51 + } 52 + 53 + static inline void timer_irq_disable(void) 54 + { 55 + unsigned long val; 56 + 57 + val = csr_read(LOONGARCH_CSR_ECFG); 58 + val &= ~ECFGF_TIMER; 59 + csr_write(val, LOONGARCH_CSR_ECFG); 60 + } 61 + 62 + static inline void timer_set_next_cmp_ms(unsigned int msec, bool period) 63 + { 64 + unsigned long val; 65 + 66 + val = msec_to_cycles(msec) & CSR_TCFG_VAL; 67 + val |= CSR_TCFG_EN; 68 + if (period) 69 + val |= CSR_TCFG_PERIOD; 70 + csr_write(val, LOONGARCH_CSR_TCFG); 71 + } 72 + 73 + static inline void __delay(uint64_t cycles) 74 + { 75 + uint64_t start = timer_get_cycles(); 76 + 77 + while ((timer_get_cycles() - start) < cycles) 78 + cpu_relax(); 79 + } 80 + 81 + static inline void udelay(unsigned long usec) 82 + { 83 + __delay(usec_to_cycles(usec)); 84 + } 85 + #endif /* SELFTEST_KVM_ARCH_TIMER_H */
+80 -1
tools/testing/selftests/kvm/include/loongarch/processor.h
··· 83 83 #define LOONGARCH_CSR_PRMD 0x1 84 84 #define LOONGARCH_CSR_EUEN 0x2 85 85 #define LOONGARCH_CSR_ECFG 0x4 86 + #define ECFGB_TIMER 11 87 + #define ECFGF_TIMER (BIT_ULL(ECFGB_TIMER)) 86 88 #define LOONGARCH_CSR_ESTAT 0x5 /* Exception status */ 89 + #define CSR_ESTAT_EXC_SHIFT 16 90 + #define CSR_ESTAT_EXC_WIDTH 6 91 + #define CSR_ESTAT_EXC (0x3f << CSR_ESTAT_EXC_SHIFT) 92 + #define EXCCODE_INT 0 /* Interrupt */ 93 + #define INT_TI 11 /* Timer interrupt*/ 87 94 #define LOONGARCH_CSR_ERA 0x6 /* ERA */ 88 95 #define LOONGARCH_CSR_BADV 0x7 /* Bad virtual address */ 89 96 #define LOONGARCH_CSR_EENTRY 0xc ··· 113 106 #define LOONGARCH_CSR_KS1 0x31 114 107 #define LOONGARCH_CSR_TMID 0x40 115 108 #define LOONGARCH_CSR_TCFG 0x41 109 + #define CSR_TCFG_VAL (BIT_ULL(48) - BIT_ULL(2)) 110 + #define CSR_TCFG_PERIOD_SHIFT 1 111 + #define CSR_TCFG_PERIOD (0x1UL << CSR_TCFG_PERIOD_SHIFT) 112 + #define CSR_TCFG_EN (0x1UL) 113 + #define LOONGARCH_CSR_TVAL 0x42 114 + #define LOONGARCH_CSR_TINTCLR 0x44 /* Timer interrupt clear */ 115 + #define CSR_TINTCLR_TI_SHIFT 0 116 + #define CSR_TINTCLR_TI (1 << CSR_TINTCLR_TI_SHIFT) 116 117 /* TLB refill exception entry */ 117 118 #define LOONGARCH_CSR_TLBRENTRY 0x88 118 119 #define LOONGARCH_CSR_TLBRSAVE 0x8b 119 120 #define LOONGARCH_CSR_TLBREHI 0x8e 120 121 #define CSR_TLBREHI_PS_SHIFT 0 121 122 #define CSR_TLBREHI_PS (0x3fUL << CSR_TLBREHI_PS_SHIFT) 123 + 124 + #define csr_read(csr) \ 125 + ({ \ 126 + register unsigned long __v; \ 127 + __asm__ __volatile__( \ 128 + "csrrd %[val], %[reg]\n\t" \ 129 + : [val] "=r" (__v) \ 130 + : [reg] "i" (csr) \ 131 + : "memory"); \ 132 + __v; \ 133 + }) 134 + 135 + #define csr_write(v, csr) \ 136 + ({ \ 137 + register unsigned long __v = v; \ 138 + __asm__ __volatile__ ( \ 139 + "csrwr %[val], %[reg]\n\t" \ 140 + : [val] "+r" (__v) \ 141 + : [reg] "i" (csr) \ 142 + : "memory"); \ 143 + __v; \ 144 + }) 122 145 123 146 #define EXREGS_GPRS (32) 124 147 ··· 161 124 unsigned long pc; 162 125 unsigned long estat; 163 126 unsigned long badv; 127 + unsigned long prmd; 164 128 }; 165 129 166 130 #define PC_OFFSET_EXREGS offsetof(struct ex_regs, pc) 167 131 #define ESTAT_OFFSET_EXREGS offsetof(struct ex_regs, estat) 168 132 #define BADV_OFFSET_EXREGS offsetof(struct ex_regs, badv) 133 + #define PRMD_OFFSET_EXREGS offsetof(struct ex_regs, prmd) 169 134 #define EXREGS_SIZE sizeof(struct ex_regs) 170 135 136 + #define VECTOR_NUM 64 137 + 138 + typedef void(*handler_fn)(struct ex_regs *); 139 + 140 + struct handlers { 141 + handler_fn exception_handlers[VECTOR_NUM]; 142 + }; 143 + 144 + void vm_init_descriptor_tables(struct kvm_vm *vm); 145 + void vm_install_exception_handler(struct kvm_vm *vm, int vector, handler_fn handler); 146 + 147 + static inline void cpu_relax(void) 148 + { 149 + asm volatile("nop" ::: "memory"); 150 + } 151 + 152 + static inline void local_irq_enable(void) 153 + { 154 + unsigned int flags = CSR_CRMD_IE; 155 + register unsigned int mask asm("$t0") = CSR_CRMD_IE; 156 + 157 + __asm__ __volatile__( 158 + "csrxchg %[val], %[mask], %[reg]\n\t" 159 + : [val] "+r" (flags) 160 + : [mask] "r" (mask), [reg] "i" (LOONGARCH_CSR_CRMD) 161 + : "memory"); 162 + } 163 + 164 + static inline void local_irq_disable(void) 165 + { 166 + unsigned int flags = 0; 167 + register unsigned int mask asm("$t0") = CSR_CRMD_IE; 168 + 169 + __asm__ __volatile__( 170 + "csrxchg %[val], %[mask], %[reg]\n\t" 171 + : [val] "+r" (flags) 172 + : [mask] "r" (mask), [reg] "i" (LOONGARCH_CSR_CRMD) 173 + : "memory"); 174 + } 171 175 #else 172 176 #define 
PC_OFFSET_EXREGS ((EXREGS_GPRS + 0) * 8) 173 177 #define ESTAT_OFFSET_EXREGS ((EXREGS_GPRS + 1) * 8) 174 178 #define BADV_OFFSET_EXREGS ((EXREGS_GPRS + 2) * 8) 175 - #define EXREGS_SIZE ((EXREGS_GPRS + 3) * 8) 179 + #define PRMD_OFFSET_EXREGS ((EXREGS_GPRS + 3) * 8) 180 + #define EXREGS_SIZE ((EXREGS_GPRS + 4) * 8) 176 181 #endif 177 182 178 183 #endif /* SELFTEST_KVM_PROCESSOR_H */
+69 -41
tools/testing/selftests/kvm/include/numaif.h
··· 1 1 /* SPDX-License-Identifier: GPL-2.0-only */ 2 - /* 3 - * tools/testing/selftests/kvm/include/numaif.h 4 - * 5 - * Copyright (C) 2020, Google LLC. 6 - * 7 - * This work is licensed under the terms of the GNU GPL, version 2. 8 - * 9 - * Header file that provides access to NUMA API functions not explicitly 10 - * exported to user space. 11 - */ 2 + /* Copyright (C) 2020, Google LLC. */ 12 3 13 4 #ifndef SELFTEST_KVM_NUMAIF_H 14 5 #define SELFTEST_KVM_NUMAIF_H 15 6 16 - #define __NR_get_mempolicy 239 17 - #define __NR_migrate_pages 256 7 + #include <dirent.h> 18 8 19 - /* System calls */ 20 - long get_mempolicy(int *policy, const unsigned long *nmask, 21 - unsigned long maxnode, void *addr, int flags) 9 + #include <linux/mempolicy.h> 10 + 11 + #include "kvm_syscalls.h" 12 + 13 + KVM_SYSCALL_DEFINE(get_mempolicy, 5, int *, policy, const unsigned long *, nmask, 14 + unsigned long, maxnode, void *, addr, int, flags); 15 + 16 + KVM_SYSCALL_DEFINE(set_mempolicy, 3, int, mode, const unsigned long *, nmask, 17 + unsigned long, maxnode); 18 + 19 + KVM_SYSCALL_DEFINE(set_mempolicy_home_node, 4, unsigned long, start, 20 + unsigned long, len, unsigned long, home_node, 21 + unsigned long, flags); 22 + 23 + KVM_SYSCALL_DEFINE(migrate_pages, 4, int, pid, unsigned long, maxnode, 24 + const unsigned long *, frommask, const unsigned long *, tomask); 25 + 26 + KVM_SYSCALL_DEFINE(move_pages, 6, int, pid, unsigned long, count, void *, pages, 27 + const int *, nodes, int *, status, int, flags); 28 + 29 + KVM_SYSCALL_DEFINE(mbind, 6, void *, addr, unsigned long, size, int, mode, 30 + const unsigned long *, nodemask, unsigned long, maxnode, 31 + unsigned int, flags); 32 + 33 + static inline int get_max_numa_node(void) 22 34 { 23 - return syscall(__NR_get_mempolicy, policy, nmask, 24 - maxnode, addr, flags); 35 + struct dirent *de; 36 + int max_node = 0; 37 + DIR *d; 38 + 39 + /* 40 + * Assume there's a single node if the kernel doesn't support NUMA, 41 + * or if no nodes are found. 42 + */ 43 + d = opendir("/sys/devices/system/node"); 44 + if (!d) 45 + return 0; 46 + 47 + while ((de = readdir(d)) != NULL) { 48 + int node_id; 49 + char *endptr; 50 + 51 + if (strncmp(de->d_name, "node", 4) != 0) 52 + continue; 53 + 54 + node_id = strtol(de->d_name + 4, &endptr, 10); 55 + if (*endptr != '\0') 56 + continue; 57 + 58 + if (node_id > max_node) 59 + max_node = node_id; 60 + } 61 + closedir(d); 62 + 63 + return max_node; 25 64 } 26 65 27 - long migrate_pages(int pid, unsigned long maxnode, 28 - const unsigned long *frommask, 29 - const unsigned long *tomask) 66 + static inline bool is_numa_available(void) 30 67 { 31 - return syscall(__NR_migrate_pages, pid, maxnode, frommask, tomask); 68 + /* 69 + * Probe for NUMA by doing a dummy get_mempolicy(). If the syscall 70 + * fails with ENOSYS, then the kernel was built without NUMA support. 71 + * If the syscall fails with EPERM, then the process/user lacks the 72 + * necessary capabilities (CAP_SYS_NICE). 73 + */ 74 + return !get_mempolicy(NULL, NULL, 0, NULL, 0) || 75 + (errno != ENOSYS && errno != EPERM); 32 76 } 33 77 34 - /* Policies */ 35 - #define MPOL_DEFAULT 0 36 - #define MPOL_PREFERRED 1 37 - #define MPOL_BIND 2 38 - #define MPOL_INTERLEAVE 3 39 - 40 - #define MPOL_MAX MPOL_INTERLEAVE 41 - 42 - /* Flags for get_mem_policy */ 43 - #define MPOL_F_NODE (1<<0) /* return next il node or node of address */ 44 - /* Warning: MPOL_F_NODE is unsupported and 45 - * subject to change. Don't use. 
46 - */ 47 - #define MPOL_F_ADDR (1<<1) /* look up vma using address */ 48 - #define MPOL_F_MEMS_ALLOWED (1<<2) /* query nodes allowed in cpuset */ 49 - 50 - /* Flags for mbind */ 51 - #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ 52 - #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */ 53 - #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ 78 + static inline bool is_multi_numa_node_system(void) 79 + { 80 + return is_numa_available() && get_max_numa_node() >= 1; 81 + } 54 82 55 83 #endif /* SELFTEST_KVM_NUMAIF_H */
+1 -1
tools/testing/selftests/kvm/include/x86/processor.h
··· 1441 1441 PG_LEVEL_2M, 1442 1442 PG_LEVEL_1G, 1443 1443 PG_LEVEL_512G, 1444 - PG_LEVEL_NUM 1444 + PG_LEVEL_256T 1445 1445 }; 1446 1446 1447 1447 #define PG_LEVEL_SHIFT(_level) ((_level - 1) * 9 + 12)
+1 -2
tools/testing/selftests/kvm/include/x86/vmx.h
··· 568 568 void nested_identity_map_1g(struct vmx_pages *vmx, struct kvm_vm *vm, 569 569 uint64_t addr, uint64_t size); 570 570 bool kvm_cpu_has_ept(void); 571 - void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm, 572 - uint32_t eptp_memslot); 571 + void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm); 573 572 void prepare_virtualize_apic_accesses(struct vmx_pages *vmx, struct kvm_vm *vm); 574 573 575 574 #endif /* SELFTEST_KVM_VMX_H */
+2 -2
tools/testing/selftests/kvm/kvm_binary_stats_test.c
··· 239 239 * single stats file works and doesn't cause explosions. 240 240 */ 241 241 vm_stats_fds = vm_get_stats_fd(vms[i]); 242 - stats_test(dup(vm_stats_fds)); 242 + stats_test(kvm_dup(vm_stats_fds)); 243 243 244 244 /* Verify userspace can instantiate multiple stats files. */ 245 245 stats_test(vm_get_stats_fd(vms[i])); 246 246 247 247 for (j = 0; j < max_vcpu; ++j) { 248 248 vcpu_stats_fds[j] = vcpu_get_stats_fd(vcpus[i * max_vcpu + j]); 249 - stats_test(dup(vcpu_stats_fds[j])); 249 + stats_test(kvm_dup(vcpu_stats_fds[j])); 250 250 stats_test(vcpu_get_stats_fd(vcpus[i * max_vcpu + j])); 251 251 } 252 252
+6
tools/testing/selftests/kvm/lib/arm64/gic.c
··· 155 155 GUEST_ASSERT(gic_common_ops); 156 156 gic_common_ops->gic_irq_set_config(intid, is_edge); 157 157 } 158 + 159 + void gic_irq_set_group(unsigned int intid, bool group) 160 + { 161 + GUEST_ASSERT(gic_common_ops); 162 + gic_common_ops->gic_irq_set_group(intid, group); 163 + }
+1
tools/testing/selftests/kvm/lib/arm64/gic_private.h
··· 25 25 void (*gic_irq_clear_pending)(uint32_t intid); 26 26 bool (*gic_irq_get_pending)(uint32_t intid); 27 27 void (*gic_irq_set_config)(uint32_t intid, bool is_edge); 28 + void (*gic_irq_set_group)(uint32_t intid, bool group); 28 29 }; 29 30 30 31 extern const struct gic_common_ops gicv3_ops;
+22
tools/testing/selftests/kvm/lib/arm64/gic_v3.c
··· 293 293 } 294 294 295 295 296 + static void gicv3_set_group(uint32_t intid, bool grp) 297 + { 298 + uint32_t cpu_or_dist; 299 + uint32_t val; 300 + 301 + cpu_or_dist = (get_intid_range(intid) == SPI_RANGE) ? DIST_BIT : guest_get_vcpuid(); 302 + val = gicv3_reg_readl(cpu_or_dist, GICD_IGROUPR + (intid / 32) * 4); 303 + if (grp) 304 + val |= BIT(intid % 32); 305 + else 306 + val &= ~BIT(intid % 32); 307 + gicv3_reg_writel(cpu_or_dist, GICD_IGROUPR + (intid / 32) * 4, val); 308 + } 309 + 296 310 static void gicv3_cpu_init(unsigned int cpu) 297 311 { 298 312 volatile void *sgi_base; 299 313 unsigned int i; 300 314 volatile void *redist_base_cpu; 315 + u64 typer; 301 316 302 317 GUEST_ASSERT(cpu < gicv3_data.nr_cpus); 303 318 304 319 redist_base_cpu = gicr_base_cpu(cpu); 305 320 sgi_base = sgi_base_from_redist(redist_base_cpu); 321 + 322 + /* Verify assumption that GICR_TYPER.Processor_number == cpu */ 323 + typer = readq_relaxed(redist_base_cpu + GICR_TYPER); 324 + GUEST_ASSERT_EQ(GICR_TYPER_CPU_NUMBER(typer), cpu); 306 325 307 326 gicv3_enable_redist(redist_base_cpu); 308 327 ··· 347 328 /* Set a default priority threshold */ 348 329 write_sysreg_s(ICC_PMR_DEF_PRIO, SYS_ICC_PMR_EL1); 349 330 331 + /* Disable Group-0 interrupts */ 332 + write_sysreg_s(0, SYS_ICC_IGRPEN0_EL1); 350 333 /* Enable non-secure Group-1 interrupts */ 351 334 write_sysreg_s(ICC_IGRPEN1_EL1_MASK, SYS_ICC_IGRPEN1_EL1); 352 335 } ··· 421 400 .gic_irq_clear_pending = gicv3_irq_clear_pending, 422 401 .gic_irq_get_pending = gicv3_irq_get_pending, 423 402 .gic_irq_set_config = gicv3_irq_set_config, 403 + .gic_irq_set_group = gicv3_set_group, 424 404 }; 425 405 426 406 void gic_rdist_enable_lpis(vm_paddr_t cfg_table, size_t cfg_table_size,
+10
tools/testing/selftests/kvm/lib/arm64/gic_v3_its.c
··· 253 253 254 254 its_send_cmd(cmdq_base, &cmd); 255 255 } 256 + 257 + void its_send_sync_cmd(void *cmdq_base, u32 vcpu_id) 258 + { 259 + struct its_cmd_block cmd = {}; 260 + 261 + its_encode_cmd(&cmd, GITS_CMD_SYNC); 262 + its_encode_target(&cmd, procnum_to_rdbase(vcpu_id)); 263 + 264 + its_send_cmd(cmdq_base, &cmd); 265 + }
+1 -1
tools/testing/selftests/kvm/lib/arm64/processor.c
··· 324 324 325 325 /* Configure base granule size */ 326 326 switch (vm->mode) { 327 - case VM_MODE_PXXV48_4K: 327 + case VM_MODE_PXXVYY_4K: 328 328 TEST_FAIL("AArch64 does not support 4K sized pages " 329 329 "with ANY-bit physical address ranges"); 330 330 case VM_MODE_P52V48_64K:
+80 -65
tools/testing/selftests/kvm/lib/kvm_util.c
··· 201 201 [VM_MODE_P40V48_4K] = "PA-bits:40, VA-bits:48, 4K pages", 202 202 [VM_MODE_P40V48_16K] = "PA-bits:40, VA-bits:48, 16K pages", 203 203 [VM_MODE_P40V48_64K] = "PA-bits:40, VA-bits:48, 64K pages", 204 - [VM_MODE_PXXV48_4K] = "PA-bits:ANY, VA-bits:48, 4K pages", 204 + [VM_MODE_PXXVYY_4K] = "PA-bits:ANY, VA-bits:48 or 57, 4K pages", 205 205 [VM_MODE_P47V64_4K] = "PA-bits:47, VA-bits:64, 4K pages", 206 206 [VM_MODE_P44V64_4K] = "PA-bits:44, VA-bits:64, 4K pages", 207 207 [VM_MODE_P36V48_4K] = "PA-bits:36, VA-bits:48, 4K pages", ··· 228 228 [VM_MODE_P40V48_4K] = { 40, 48, 0x1000, 12 }, 229 229 [VM_MODE_P40V48_16K] = { 40, 48, 0x4000, 14 }, 230 230 [VM_MODE_P40V48_64K] = { 40, 48, 0x10000, 16 }, 231 - [VM_MODE_PXXV48_4K] = { 0, 0, 0x1000, 12 }, 231 + [VM_MODE_PXXVYY_4K] = { 0, 0, 0x1000, 12 }, 232 232 [VM_MODE_P47V64_4K] = { 47, 64, 0x1000, 12 }, 233 233 [VM_MODE_P44V64_4K] = { 44, 64, 0x1000, 12 }, 234 234 [VM_MODE_P36V48_4K] = { 36, 48, 0x1000, 12 }, ··· 310 310 case VM_MODE_P36V47_16K: 311 311 vm->pgtable_levels = 3; 312 312 break; 313 - case VM_MODE_PXXV48_4K: 313 + case VM_MODE_PXXVYY_4K: 314 314 #ifdef __x86_64__ 315 315 kvm_get_cpu_address_width(&vm->pa_bits, &vm->va_bits); 316 316 kvm_init_vm_address_properties(vm); 317 - /* 318 - * Ignore KVM support for 5-level paging (vm->va_bits == 57), 319 - * it doesn't take effect unless a CR4.LA57 is set, which it 320 - * isn't for this mode (48-bit virtual address space). 321 - */ 322 - TEST_ASSERT(vm->va_bits == 48 || vm->va_bits == 57, 323 - "Linear address width (%d bits) not supported", 324 - vm->va_bits); 317 + 325 318 pr_debug("Guest physical address width detected: %d\n", 326 319 vm->pa_bits); 327 - vm->pgtable_levels = 4; 328 - vm->va_bits = 48; 320 + pr_debug("Guest virtual address width detected: %d\n", 321 + vm->va_bits); 322 + 323 + if (vm->va_bits == 57) { 324 + vm->pgtable_levels = 5; 325 + } else { 326 + TEST_ASSERT(vm->va_bits == 48, 327 + "Unexpected guest virtual address width: %d", 328 + vm->va_bits); 329 + vm->pgtable_levels = 4; 330 + } 329 331 #else 330 - TEST_FAIL("VM_MODE_PXXV48_4K not supported on non-x86 platforms"); 332 + TEST_FAIL("VM_MODE_PXXVYY_4K not supported on non-x86 platforms"); 331 333 #endif 332 334 break; 333 335 case VM_MODE_P47V64_4K: ··· 706 704 707 705 static void kvm_stats_release(struct kvm_binary_stats *stats) 708 706 { 709 - int ret; 710 - 711 707 if (stats->fd < 0) 712 708 return; 713 709 ··· 714 714 stats->desc = NULL; 715 715 } 716 716 717 - ret = close(stats->fd); 718 - TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("close()", ret)); 717 + kvm_close(stats->fd); 719 718 stats->fd = -1; 720 719 } 721 720 ··· 737 738 */ 738 739 static void vm_vcpu_rm(struct kvm_vm *vm, struct kvm_vcpu *vcpu) 739 740 { 740 - int ret; 741 - 742 741 if (vcpu->dirty_gfns) { 743 742 kvm_munmap(vcpu->dirty_gfns, vm->dirty_ring_size); 744 743 vcpu->dirty_gfns = NULL; ··· 744 747 745 748 kvm_munmap(vcpu->run, vcpu_mmap_sz()); 746 749 747 - ret = close(vcpu->fd); 748 - TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("close()", ret)); 749 - 750 + kvm_close(vcpu->fd); 750 751 kvm_stats_release(&vcpu->stats); 751 752 752 753 list_del(&vcpu->list); ··· 756 761 void kvm_vm_release(struct kvm_vm *vmp) 757 762 { 758 763 struct kvm_vcpu *vcpu, *tmp; 759 - int ret; 760 764 761 765 list_for_each_entry_safe(vcpu, tmp, &vmp->vcpus, list) 762 766 vm_vcpu_rm(vmp, vcpu); 763 767 764 - ret = close(vmp->fd); 765 - TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("close()", ret)); 766 - 767 - ret = close(vmp->kvm_fd); 768 - TEST_ASSERT(!ret, 
__KVM_SYSCALL_ERROR("close()", ret)); 768 + kvm_close(vmp->fd); 769 + kvm_close(vmp->kvm_fd); 769 770 770 771 /* Free cached stats metadata and close FD */ 771 772 kvm_stats_release(&vmp->stats); ··· 819 828 int kvm_memfd_alloc(size_t size, bool hugepages) 820 829 { 821 830 int memfd_flags = MFD_CLOEXEC; 822 - int fd, r; 831 + int fd; 823 832 824 833 if (hugepages) 825 834 memfd_flags |= MFD_HUGETLB; ··· 827 836 fd = memfd_create("kvm_selftest", memfd_flags); 828 837 TEST_ASSERT(fd != -1, __KVM_SYSCALL_ERROR("memfd_create()", fd)); 829 838 830 - r = ftruncate(fd, size); 831 - TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("ftruncate()", r)); 832 - 833 - r = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size); 834 - TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r)); 839 + kvm_ftruncate(fd, size); 840 + kvm_fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size); 835 841 836 842 return fd; 837 843 } ··· 945 957 946 958 /* FIXME: This thing needs to be ripped apart and rewritten. */ 947 959 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type, 948 - uint64_t guest_paddr, uint32_t slot, uint64_t npages, 949 - uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset) 960 + uint64_t gpa, uint32_t slot, uint64_t npages, uint32_t flags, 961 + int guest_memfd, uint64_t guest_memfd_offset) 950 962 { 951 963 int ret; 952 964 struct userspace_mem_region *region; ··· 960 972 "Number of guest pages is not compatible with the host. " 961 973 "Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages)); 962 974 963 - TEST_ASSERT((guest_paddr % vm->page_size) == 0, "Guest physical " 975 + TEST_ASSERT((gpa % vm->page_size) == 0, "Guest physical " 964 976 "address not on a page boundary.\n" 965 - " guest_paddr: 0x%lx vm->page_size: 0x%x", 966 - guest_paddr, vm->page_size); 967 - TEST_ASSERT((((guest_paddr >> vm->page_shift) + npages) - 1) 977 + " gpa: 0x%lx vm->page_size: 0x%x", 978 + gpa, vm->page_size); 979 + TEST_ASSERT((((gpa >> vm->page_shift) + npages) - 1) 968 980 <= vm->max_gfn, "Physical range beyond maximum " 969 981 "supported physical address,\n" 970 - " guest_paddr: 0x%lx npages: 0x%lx\n" 982 + " gpa: 0x%lx npages: 0x%lx\n" 971 983 " vm->max_gfn: 0x%lx vm->page_size: 0x%x", 972 - guest_paddr, npages, vm->max_gfn, vm->page_size); 984 + gpa, npages, vm->max_gfn, vm->page_size); 973 985 974 986 /* 975 987 * Confirm a mem region with an overlapping address doesn't 976 988 * already exist. 
977 989 */ 978 990 region = (struct userspace_mem_region *) userspace_mem_region_find( 979 - vm, guest_paddr, (guest_paddr + npages * vm->page_size) - 1); 991 + vm, gpa, (gpa + npages * vm->page_size) - 1); 980 992 if (region != NULL) 981 993 TEST_FAIL("overlapping userspace_mem_region already " 982 994 "exists\n" 983 - " requested guest_paddr: 0x%lx npages: 0x%lx " 984 - "page_size: 0x%x\n" 985 - " existing guest_paddr: 0x%lx size: 0x%lx", 986 - guest_paddr, npages, vm->page_size, 995 + " requested gpa: 0x%lx npages: 0x%lx page_size: 0x%x\n" 996 + " existing gpa: 0x%lx size: 0x%lx", 997 + gpa, npages, vm->page_size, 987 998 (uint64_t) region->region.guest_phys_addr, 988 999 (uint64_t) region->region.memory_size); 989 1000 ··· 996 1009 "already exists.\n" 997 1010 " requested slot: %u paddr: 0x%lx npages: 0x%lx\n" 998 1011 " existing slot: %u paddr: 0x%lx size: 0x%lx", 999 - slot, guest_paddr, npages, 1000 - region->region.slot, 1012 + slot, gpa, npages, region->region.slot, 1001 1013 (uint64_t) region->region.guest_phys_addr, 1002 1014 (uint64_t) region->region.memory_size); 1003 1015 } ··· 1022 1036 if (src_type == VM_MEM_SRC_ANONYMOUS_THP) 1023 1037 alignment = max(backing_src_pagesz, alignment); 1024 1038 1025 - TEST_ASSERT_EQ(guest_paddr, align_up(guest_paddr, backing_src_pagesz)); 1039 + TEST_ASSERT_EQ(gpa, align_up(gpa, backing_src_pagesz)); 1026 1040 1027 1041 /* Add enough memory to align up if necessary */ 1028 1042 if (alignment > 1) ··· 1070 1084 * needing to track if the fd is owned by the framework 1071 1085 * or by the caller. 1072 1086 */ 1073 - guest_memfd = dup(guest_memfd); 1074 - TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd)); 1087 + guest_memfd = kvm_dup(guest_memfd); 1075 1088 } 1076 1089 1077 1090 region->region.guest_memfd = guest_memfd; ··· 1082 1097 region->unused_phy_pages = sparsebit_alloc(); 1083 1098 if (vm_arch_has_protected_memory(vm)) 1084 1099 region->protected_phy_pages = sparsebit_alloc(); 1085 - sparsebit_set_num(region->unused_phy_pages, 1086 - guest_paddr >> vm->page_shift, npages); 1100 + sparsebit_set_num(region->unused_phy_pages, gpa >> vm->page_shift, npages); 1087 1101 region->region.slot = slot; 1088 1102 region->region.flags = flags; 1089 - region->region.guest_phys_addr = guest_paddr; 1103 + region->region.guest_phys_addr = gpa; 1090 1104 region->region.memory_size = npages * vm->page_size; 1091 1105 region->region.userspace_addr = (uintptr_t) region->host_mem; 1092 1106 ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region); 1093 1107 TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n" 1094 1108 " rc: %i errno: %i\n" 1095 1109 " slot: %u flags: 0x%x\n" 1096 - " guest_phys_addr: 0x%lx size: 0x%lx guest_memfd: %d", 1097 - ret, errno, slot, flags, 1098 - guest_paddr, (uint64_t) region->region.memory_size, 1110 + " guest_phys_addr: 0x%lx size: 0x%llx guest_memfd: %d", 1111 + ret, errno, slot, flags, gpa, region->region.memory_size, 1099 1112 region->region.guest_memfd); 1100 1113 1101 1114 /* Add to quick lookup data structures */ ··· 1115 1132 1116 1133 void vm_userspace_mem_region_add(struct kvm_vm *vm, 1117 1134 enum vm_mem_backing_src_type src_type, 1118 - uint64_t guest_paddr, uint32_t slot, 1119 - uint64_t npages, uint32_t flags) 1135 + uint64_t gpa, uint32_t slot, uint64_t npages, 1136 + uint32_t flags) 1120 1137 { 1121 - vm_mem_add(vm, src_type, guest_paddr, slot, npages, flags, -1, 0); 1138 + vm_mem_add(vm, src_type, gpa, slot, npages, flags, -1, 0); 1122 1139 } 1123 1140 1124 1141 /* 
··· 1182 1199 TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n" 1183 1200 " rc: %i errno: %i slot: %u flags: 0x%x", 1184 1201 ret, errno, slot, flags); 1202 + } 1203 + 1204 + void vm_mem_region_reload(struct kvm_vm *vm, uint32_t slot) 1205 + { 1206 + struct userspace_mem_region *region = memslot2region(vm, slot); 1207 + struct kvm_userspace_memory_region2 tmp = region->region; 1208 + 1209 + tmp.memory_size = 0; 1210 + vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &tmp); 1211 + vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region); 1185 1212 } 1186 1213 1187 1214 /* ··· 1449 1456 pages--, vaddr += vm->page_size, paddr += vm->page_size) { 1450 1457 1451 1458 virt_pg_map(vm, vaddr, paddr); 1452 - 1453 - sparsebit_set(vm->vpages_mapped, vaddr >> vm->page_shift); 1454 1459 } 1455 1460 1456 1461 return vaddr_start; ··· 1562 1571 1563 1572 while (npages--) { 1564 1573 virt_pg_map(vm, vaddr, paddr); 1565 - sparsebit_set(vm->vpages_mapped, vaddr >> vm->page_shift); 1566 1574 1567 1575 vaddr += page_size; 1568 1576 paddr += page_size; ··· 2015 2025 KVM_EXIT_STRING(NOTIFY), 2016 2026 KVM_EXIT_STRING(LOONGARCH_IOCSR), 2017 2027 KVM_EXIT_STRING(MEMORY_FAULT), 2028 + KVM_EXIT_STRING(ARM_SEA), 2018 2029 }; 2019 2030 2020 2031 /* ··· 2296 2305 { 2297 2306 } 2298 2307 2308 + static void report_unexpected_signal(int signum) 2309 + { 2310 + #define KVM_CASE_SIGNUM(sig) \ 2311 + case sig: TEST_FAIL("Unexpected " #sig " (%d)\n", signum) 2312 + 2313 + switch (signum) { 2314 + KVM_CASE_SIGNUM(SIGBUS); 2315 + KVM_CASE_SIGNUM(SIGSEGV); 2316 + KVM_CASE_SIGNUM(SIGILL); 2317 + KVM_CASE_SIGNUM(SIGFPE); 2318 + default: 2319 + TEST_FAIL("Unexpected signal %d\n", signum); 2320 + } 2321 + } 2322 + 2299 2323 void __attribute((constructor)) kvm_selftest_init(void) 2300 2324 { 2325 + struct sigaction sig_sa = { 2326 + .sa_handler = report_unexpected_signal, 2327 + }; 2328 + 2301 2329 /* Tell stdout not to buffer its content. */ 2302 2330 setbuf(stdout, NULL); 2331 + 2332 + sigaction(SIGBUS, &sig_sa, NULL); 2333 + sigaction(SIGSEGV, &sig_sa, NULL); 2334 + sigaction(SIGILL, &sig_sa, NULL); 2335 + sigaction(SIGFPE, &sig_sa, NULL); 2303 2336 2304 2337 guest_random_seed = last_guest_seed = random(); 2305 2338 pr_info("Random seed: 0x%x\n", guest_random_seed);
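The new vm_mem_region_reload() relies on KVM's long-standing convention that a KVM_SET_USER_MEMORY_REGION2 call with memory_size == 0 deletes the slot, so the helper deletes and immediately re-registers a region with identical parameters. A hypothetical caller (slot constant assumed) that wants KVM to drop and lazily rebuild all mappings for a slot:

	/* Force KVM to zap the slot's mappings and rebuild them on the
	 * next fault, without moving or resizing the region. */
	vm_mem_region_reload(vm, TEST_SLOT);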
+6
tools/testing/selftests/kvm/lib/loongarch/exception.S
··· 51 51 st.d t0, sp, ESTAT_OFFSET_EXREGS 52 52 csrrd t0, LOONGARCH_CSR_BADV 53 53 st.d t0, sp, BADV_OFFSET_EXREGS 54 + csrrd t0, LOONGARCH_CSR_PRMD 55 + st.d t0, sp, PRMD_OFFSET_EXREGS 54 56 55 57 or a0, sp, zero 56 58 bl route_exception 59 + ld.d t0, sp, PC_OFFSET_EXREGS 60 + csrwr t0, LOONGARCH_CSR_ERA 61 + ld.d t0, sp, PRMD_OFFSET_EXREGS 62 + csrwr t0, LOONGARCH_CSR_PRMD 57 63 restore_gprs sp 58 64 csrrd sp, LOONGARCH_CSR_KS0 59 65 ertn
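With the stub now reloading CSR_ERA and CSR_PRMD from the saved frame before ertn, a C handler can redirect where the guest resumes simply by editing regs->pc. An illustrative, hypothetical handler that skips a faulting instruction (LoongArch instructions are fixed at 4 bytes):

static void skip_faulting_insn(struct ex_regs *regs)
{
	/* The exception stub writes regs->pc back into CSR_ERA on exit,
	 * so this resumes the guest at the next instruction. */
	regs->pc += 4;
}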
+45 -2
tools/testing/selftests/kvm/lib/loongarch/processor.c
··· 3 3 #include <assert.h> 4 4 #include <linux/compiler.h> 5 5 6 + #include <asm/kvm.h> 6 7 #include "kvm_util.h" 7 8 #include "processor.h" 8 9 #include "ucall_common.h" ··· 12 11 #define LOONGARCH_GUEST_STACK_VADDR_MIN 0x200000 13 12 14 13 static vm_paddr_t invalid_pgtable[4]; 14 + static vm_vaddr_t exception_handlers; 15 15 16 16 static uint64_t virt_pte_index(struct kvm_vm *vm, vm_vaddr_t gva, int level) 17 17 { ··· 185 183 186 184 void route_exception(struct ex_regs *regs) 187 185 { 186 + int vector; 188 187 unsigned long pc, estat, badv; 188 + struct handlers *handlers; 189 + 190 + handlers = (struct handlers *)exception_handlers; 191 + vector = (regs->estat & CSR_ESTAT_EXC) >> CSR_ESTAT_EXC_SHIFT; 192 + if (handlers && handlers->exception_handlers[vector]) 193 + return handlers->exception_handlers[vector](regs); 189 194 190 195 pc = regs->pc; 191 196 badv = regs->badv; 192 197 estat = regs->estat; 193 198 ucall(UCALL_UNHANDLED, 3, pc, estat, badv); 194 199 while (1) ; 200 + } 201 + 202 + void vm_init_descriptor_tables(struct kvm_vm *vm) 203 + { 204 + void *addr; 205 + 206 + vm->handlers = __vm_vaddr_alloc(vm, sizeof(struct handlers), 207 + LOONGARCH_GUEST_STACK_VADDR_MIN, MEM_REGION_DATA); 208 + 209 + addr = addr_gva2hva(vm, vm->handlers); 210 + memset(addr, 0, vm->page_size); 211 + exception_handlers = vm->handlers; 212 + sync_global_to_guest(vm, exception_handlers); 213 + } 214 + 215 + void vm_install_exception_handler(struct kvm_vm *vm, int vector, handler_fn handler) 216 + { 217 + struct handlers *handlers = addr_gva2hva(vm, vm->handlers); 218 + 219 + assert(vector < VECTOR_NUM); 220 + handlers->exception_handlers[vector] = handler; 221 + } 222 + 223 + uint32_t guest_get_vcpuid(void) 224 + { 225 + return csr_read(LOONGARCH_CSR_CPUID); 195 226 } 196 227 197 228 void vcpu_args_set(struct kvm_vcpu *vcpu, unsigned int num, ...) ··· 244 209 va_end(ap); 245 210 246 211 vcpu_regs_set(vcpu, &regs); 212 + } 213 + 214 + static void loongarch_set_reg(struct kvm_vcpu *vcpu, uint64_t id, uint64_t val) 215 + { 216 + __vcpu_set_reg(vcpu, id, val); 247 217 } 248 218 249 219 static void loongarch_get_csr(struct kvm_vcpu *vcpu, uint64_t id, void *addr) ··· 282 242 TEST_FAIL("Unknown guest mode, mode: 0x%x", vm->mode); 283 243 } 284 244 285 - /* user mode and page enable mode */ 286 - val = PLV_USER | CSR_CRMD_PG; 245 + /* kernel mode and page enable mode */ 246 + val = PLV_KERN | CSR_CRMD_PG; 287 247 loongarch_set_csr(vcpu, LOONGARCH_CSR_CRMD, val); 288 248 loongarch_set_csr(vcpu, LOONGARCH_CSR_PRMD, val); 289 249 loongarch_set_csr(vcpu, LOONGARCH_CSR_EUEN, 1); ··· 291 251 loongarch_set_csr(vcpu, LOONGARCH_CSR_TCFG, 0); 292 252 loongarch_set_csr(vcpu, LOONGARCH_CSR_ASID, 1); 293 253 254 + /* time count start from 0 */ 294 255 val = 0; 256 + loongarch_set_reg(vcpu, KVM_REG_LOONGARCH_COUNTER, val); 257 + 295 258 width = vm->page_shift - 3; 296 259 297 260 switch (vm->pgtable_levels) {
+1 -1
tools/testing/selftests/kvm/lib/x86/memstress.c
··· 63 63 { 64 64 uint64_t start, end; 65 65 66 - prepare_eptp(vmx, vm, 0); 66 + prepare_eptp(vmx, vm); 67 67 68 68 /* 69 69 * Identity map the first 4G and the test region with 1G pages so that
+40 -40
tools/testing/selftests/kvm/lib/x86/processor.c
··· 158 158 159 159 void virt_arch_pgd_alloc(struct kvm_vm *vm) 160 160 { 161 - TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use " 162 - "unknown or unsupported guest mode, mode: 0x%x", vm->mode); 161 + TEST_ASSERT(vm->mode == VM_MODE_PXXVYY_4K, 162 + "Unknown or unsupported guest mode: 0x%x", vm->mode); 163 163 164 - /* If needed, create page map l4 table. */ 164 + /* If needed, create the top-level page table. */ 165 165 if (!vm->pgd_created) { 166 166 vm->pgd = vm_alloc_page_table(vm); 167 167 vm->pgd_created = true; ··· 218 218 void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level) 219 219 { 220 220 const uint64_t pg_size = PG_LEVEL_SIZE(level); 221 - uint64_t *pml4e, *pdpe, *pde; 222 - uint64_t *pte; 221 + uint64_t *pte = &vm->pgd; 222 + int current_level; 223 223 224 - TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, 225 - "Unknown or unsupported guest mode, mode: 0x%x", vm->mode); 224 + TEST_ASSERT(vm->mode == VM_MODE_PXXVYY_4K, 225 + "Unknown or unsupported guest mode: 0x%x", vm->mode); 226 226 227 227 TEST_ASSERT((vaddr % pg_size) == 0, 228 228 "Virtual address not aligned,\n" ··· 243 243 * Allocate upper level page tables, if not already present. Return 244 244 * early if a hugepage was created. 245 245 */ 246 - pml4e = virt_create_upper_pte(vm, &vm->pgd, vaddr, paddr, PG_LEVEL_512G, level); 247 - if (*pml4e & PTE_LARGE_MASK) 248 - return; 249 - 250 - pdpe = virt_create_upper_pte(vm, pml4e, vaddr, paddr, PG_LEVEL_1G, level); 251 - if (*pdpe & PTE_LARGE_MASK) 252 - return; 253 - 254 - pde = virt_create_upper_pte(vm, pdpe, vaddr, paddr, PG_LEVEL_2M, level); 255 - if (*pde & PTE_LARGE_MASK) 256 - return; 246 + for (current_level = vm->pgtable_levels; 247 + current_level > PG_LEVEL_4K; 248 + current_level--) { 249 + pte = virt_create_upper_pte(vm, pte, vaddr, paddr, 250 + current_level, level); 251 + if (*pte & PTE_LARGE_MASK) 252 + return; 253 + } 257 254 258 255 /* Fill in page table entry. 
*/ 259 - pte = virt_get_pte(vm, pde, vaddr, PG_LEVEL_4K); 256 + pte = virt_get_pte(vm, pte, vaddr, PG_LEVEL_4K); 260 257 TEST_ASSERT(!(*pte & PTE_PRESENT_MASK), 261 258 "PTE already present for 4k page at vaddr: 0x%lx", vaddr); 262 259 *pte = PTE_PRESENT_MASK | PTE_WRITABLE_MASK | (paddr & PHYSICAL_PAGE_MASK); ··· 286 289 287 290 for (i = 0; i < nr_pages; i++) { 288 291 __virt_pg_map(vm, vaddr, paddr, level); 292 + sparsebit_set_num(vm->vpages_mapped, vaddr >> vm->page_shift, 293 + pg_size / vm->page_size); 289 294 290 295 vaddr += pg_size; 291 296 paddr += pg_size; ··· 309 310 uint64_t *__vm_get_page_table_entry(struct kvm_vm *vm, uint64_t vaddr, 310 311 int *level) 311 312 { 312 - uint64_t *pml4e, *pdpe, *pde; 313 + int va_width = 12 + (vm->pgtable_levels) * 9; 314 + uint64_t *pte = &vm->pgd; 315 + int current_level; 313 316 314 317 TEST_ASSERT(!vm->arch.is_pt_protected, 315 318 "Walking page tables of protected guests is impossible"); 316 319 317 320 TEST_ASSERT(*level >= PG_LEVEL_NONE && *level <= vm->pgtable_levels, 318 321 "Invalid PG_LEVEL_* '%d'", *level); 319 322 320 - TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use " 321 - "unknown or unsupported guest mode, mode: 0x%x", vm->mode); 323 + TEST_ASSERT(vm->mode == VM_MODE_PXXVYY_4K, 324 + "Unknown or unsupported guest mode: 0x%x", vm->mode); 322 325 TEST_ASSERT(sparsebit_is_set(vm->vpages_valid, 323 326 (vaddr >> vm->page_shift)), 324 327 "Invalid virtual address, vaddr: 0x%lx", 325 328 vaddr); 326 329 /* 327 - * Based on the mode check above there are 48 bits in the vaddr, so 328 - * shift 16 to sign extend the last bit (bit-47), 330 + * Check that the vaddr is a sign-extended va_width value. 329 331 */ 330 - TEST_ASSERT(vaddr == (((int64_t)vaddr << 16) >> 16), 331 - "Canonical check failed. The virtual address is invalid."); 332 + TEST_ASSERT(vaddr == 333 + (((int64_t)vaddr << (64 - va_width) >> (64 - va_width))), 334 + "Canonical check failed. The virtual address is invalid."); 332 335 333 - pml4e = virt_get_pte(vm, &vm->pgd, vaddr, PG_LEVEL_512G); 334 - if (vm_is_target_pte(pml4e, level, PG_LEVEL_512G)) 335 - return pml4e; 336 + for (current_level = vm->pgtable_levels; 337 + current_level > PG_LEVEL_4K; 338 + current_level--) { 339 + pte = virt_get_pte(vm, pte, vaddr, current_level); 340 + if (vm_is_target_pte(pte, level, current_level)) 341 + return pte; 342 + } 336 343 337 - pdpe = virt_get_pte(vm, pml4e, vaddr, PG_LEVEL_1G); 338 - if (vm_is_target_pte(pdpe, level, PG_LEVEL_1G)) 339 - return pdpe; 340 - 341 - pde = virt_get_pte(vm, pdpe, vaddr, PG_LEVEL_2M); 342 - if (vm_is_target_pte(pde, level, PG_LEVEL_2M)) 343 - return pde; 344 - 345 - return virt_get_pte(vm, pde, vaddr, PG_LEVEL_4K); 344 345 return virt_get_pte(vm, pte, vaddr, PG_LEVEL_4K); 346 345 } 347 346 348 347 uint64_t *vm_get_page_table_entry(struct kvm_vm *vm, uint64_t vaddr) ··· 523 526 { 524 527 struct kvm_sregs sregs; 525 528 526 - TEST_ASSERT_EQ(vm->mode, VM_MODE_PXXV48_4K); 529 + TEST_ASSERT(vm->mode == VM_MODE_PXXVYY_4K, 530 + "Unknown or unsupported guest mode: 0x%x", vm->mode); 527 531 528 532 /* Set mode specific system register values. */ 529 533 vcpu_sregs_get(vcpu, &sregs); ··· 538 540 sregs.cr4 |= X86_CR4_PAE | X86_CR4_OSFXSR; 539 541 if (kvm_cpu_has(X86_FEATURE_XSAVE)) 540 542 sregs.cr4 |= X86_CR4_OSXSAVE; 543 + if (vm->pgtable_levels == 5) 544 + sregs.cr4 |= X86_CR4_LA57; 541 545 sregs.efer |= (EFER_LME | EFER_LMA | EFER_NX); 542 546 543 547 kvm_seg_set_unusable(&sregs.ldt);
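The rewritten walkers derive everything from vm->pgtable_levels: each paging level translates 9 bits of virtual address on top of the 12-bit page offset, which is exactly where the va_width expression and the generalized canonical check come from. A compact restatement (helper name is illustrative, not from the patch):

/* 4-level paging: 12 + 4 * 9 = 48-bit VAs; 5-level: 12 + 5 * 9 = 57,
 * matching the conditional CR4.LA57 setup above. */
static inline int va_width(int pgtable_levels)
{
	return 12 + 9 * pgtable_levels;
}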
+4 -5
tools/testing/selftests/kvm/lib/x86/vmx.c
··· 401 401 struct eptPageTableEntry *pt = vmx->eptp_hva, *pte; 402 402 uint16_t index; 403 403 404 - TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use " 405 - "unknown or unsupported guest mode, mode: 0x%x", vm->mode); 404 + TEST_ASSERT(vm->mode == VM_MODE_PXXVYY_4K, 405 + "Unknown or unsupported guest mode: 0x%x", vm->mode); 406 406 407 407 TEST_ASSERT((nested_paddr >> 48) == 0, 408 - "Nested physical address 0x%lx requires 5-level paging", 408 + "Nested physical address 0x%lx is > 48-bits and requires 5-level EPT", 409 409 nested_paddr); 410 410 TEST_ASSERT((nested_paddr % page_size) == 0, 411 411 "Nested physical address not on page boundary,\n" ··· 534 534 return ctrl & SECONDARY_EXEC_ENABLE_EPT; 535 535 } 536 536 537 - void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm, 538 - uint32_t eptp_memslot) 537 + void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm) 539 538 { 540 539 TEST_ASSERT(kvm_cpu_has_ept(), "KVM doesn't support nested EPT"); 541 540
+200
tools/testing/selftests/kvm/loongarch/arch_timer.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * The test validates periodic/one-shot constant timer IRQ using 4 + * CSR.TCFG and CSR.TVAL registers. 5 + */ 6 + #include "arch_timer.h" 7 + #include "kvm_util.h" 8 + #include "processor.h" 9 + #include "timer_test.h" 10 + #include "ucall_common.h" 11 + 12 + static void do_idle(void) 13 + { 14 + unsigned int intid; 15 + unsigned long estat; 16 + 17 + __asm__ __volatile__("idle 0" : : : "memory"); 18 + 19 + estat = csr_read(LOONGARCH_CSR_ESTAT); 20 + intid = !!(estat & BIT(INT_TI)); 21 + 22 + /* Make sure pending timer IRQ arrived */ 23 + GUEST_ASSERT_EQ(intid, 1); 24 + csr_write(CSR_TINTCLR_TI, LOONGARCH_CSR_TINTCLR); 25 + } 26 + 27 + static void guest_irq_handler(struct ex_regs *regs) 28 + { 29 + unsigned int intid; 30 + uint32_t cpu = guest_get_vcpuid(); 31 + uint64_t xcnt, val, cfg, xcnt_diff_us; 32 + struct test_vcpu_shared_data *shared_data = &vcpu_shared_data[cpu]; 33 + 34 + intid = !!(regs->estat & BIT(INT_TI)); 35 + 36 + /* Make sure we are dealing with the correct timer IRQ */ 37 + GUEST_ASSERT_EQ(intid, 1); 38 + 39 + cfg = timer_get_cfg(); 40 + if (cfg & CSR_TCFG_PERIOD) { 41 + WRITE_ONCE(shared_data->nr_iter, shared_data->nr_iter - 1); 42 + if (shared_data->nr_iter == 0) 43 + disable_timer(); 44 + csr_write(CSR_TINTCLR_TI, LOONGARCH_CSR_TINTCLR); 45 + return; 46 + } 47 + 48 + /* 49 + * On real machine, value of LOONGARCH_CSR_TVAL is BIT_ULL(48) - 1 50 + * On virtual machine, its value counts down from BIT_ULL(48) - 1 51 + */ 52 + val = timer_get_val(); 53 + xcnt = timer_get_cycles(); 54 + xcnt_diff_us = cycles_to_usec(xcnt - shared_data->xcnt); 55 + 56 + /* Basic 'timer condition met' check */ 57 + __GUEST_ASSERT(val > cfg, 58 + "val = 0x%lx, cfg = 0x%lx, xcnt_diff_us = 0x%lx", 59 + val, cfg, xcnt_diff_us); 60 + 61 + csr_write(CSR_TINTCLR_TI, LOONGARCH_CSR_TINTCLR); 62 + WRITE_ONCE(shared_data->nr_iter, shared_data->nr_iter + 1); 63 + } 64 + 65 + static void guest_test_period_timer(uint32_t cpu) 66 + { 67 + uint32_t irq_iter, config_iter; 68 + uint64_t us; 69 + struct test_vcpu_shared_data *shared_data = &vcpu_shared_data[cpu]; 70 + 71 + shared_data->nr_iter = test_args.nr_iter; 72 + shared_data->xcnt = timer_get_cycles(); 73 + us = msecs_to_usecs(test_args.timer_period_ms) + test_args.timer_err_margin_us; 74 + timer_set_next_cmp_ms(test_args.timer_period_ms, true); 75 + 76 + for (config_iter = 0; config_iter < test_args.nr_iter; config_iter++) { 77 + /* Setup a timeout for the interrupt to arrive */ 78 + udelay(us); 79 + } 80 + 81 + irq_iter = READ_ONCE(shared_data->nr_iter); 82 + __GUEST_ASSERT(irq_iter == 0, 83 + "irq_iter = 0x%x.\n" 84 + " Guest period timer interrupt was not triggered within the specified\n" 85 + " interval, try to increase the error margin by [-e] option.\n", 86 + irq_iter); 87 + } 88 + 89 + static void guest_test_oneshot_timer(uint32_t cpu) 90 + { 91 + uint32_t irq_iter, config_iter; 92 + uint64_t us; 93 + struct test_vcpu_shared_data *shared_data = &vcpu_shared_data[cpu]; 94 + 95 + shared_data->nr_iter = 0; 96 + shared_data->guest_stage = 0; 97 + us = msecs_to_usecs(test_args.timer_period_ms) + test_args.timer_err_margin_us; 98 + for (config_iter = 0; config_iter < test_args.nr_iter; config_iter++) { 99 + shared_data->xcnt = timer_get_cycles(); 100 + 101 + /* Setup the next interrupt */ 102 + timer_set_next_cmp_ms(test_args.timer_period_ms, false); 103 + /* Setup a timeout for the interrupt to arrive */ 104 + udelay(us); 105 + 106 + irq_iter = READ_ONCE(shared_data->nr_iter); 107 + 
__GUEST_ASSERT(config_iter + 1 == irq_iter, 108 + "config_iter + 1 = 0x%x, irq_iter = 0x%x.\n" 109 + " Guest timer interrupt was not triggered within the specified\n" 110 + " interval, try to increase the error margin by [-e] option.\n", 111 + config_iter + 1, irq_iter); 112 + } 113 + } 114 + 115 + static void guest_test_emulate_timer(uint32_t cpu) 116 + { 117 + uint32_t config_iter; 118 + uint64_t xcnt_diff_us, us; 119 + struct test_vcpu_shared_data *shared_data = &vcpu_shared_data[cpu]; 120 + 121 + local_irq_disable(); 122 + shared_data->nr_iter = 0; 123 + us = msecs_to_usecs(test_args.timer_period_ms); 124 + for (config_iter = 0; config_iter < test_args.nr_iter; config_iter++) { 125 + shared_data->xcnt = timer_get_cycles(); 126 + 127 + /* Setup the next interrupt */ 128 + timer_set_next_cmp_ms(test_args.timer_period_ms, false); 129 + do_idle(); 130 + 131 + xcnt_diff_us = cycles_to_usec(timer_get_cycles() - shared_data->xcnt); 132 + __GUEST_ASSERT(xcnt_diff_us >= us, 133 + "xcnt_diff_us = 0x%lx, us = 0x%lx.\n", 134 + xcnt_diff_us, us); 135 + } 136 + local_irq_enable(); 137 + } 138 + 139 + static void guest_time_count_test(uint32_t cpu) 140 + { 141 + uint32_t config_iter; 142 + unsigned long start, end, prev, us; 143 + 144 + /* Assuming that test case starts to run in 1 second */ 145 + start = timer_get_cycles(); 146 + us = msec_to_cycles(1000); 147 + __GUEST_ASSERT(start <= us, 148 + "start = 0x%lx, us = 0x%lx.\n", 149 + start, us); 150 + 151 + us = msec_to_cycles(test_args.timer_period_ms); 152 + for (config_iter = 0; config_iter < test_args.nr_iter; config_iter++) { 153 + start = timer_get_cycles(); 154 + end = start + us; 155 + /* test time count growing up always */ 156 + while (start < end) { 157 + prev = start; 158 + start = timer_get_cycles(); 159 + __GUEST_ASSERT(prev <= start, 160 + "prev = 0x%lx, start = 0x%lx.\n", 161 + prev, start); 162 + } 163 + } 164 + } 165 + 166 + static void guest_code(void) 167 + { 168 + uint32_t cpu = guest_get_vcpuid(); 169 + 170 + /* must run at first */ 171 + guest_time_count_test(cpu); 172 + 173 + timer_irq_enable(); 174 + local_irq_enable(); 175 + guest_test_period_timer(cpu); 176 + guest_test_oneshot_timer(cpu); 177 + guest_test_emulate_timer(cpu); 178 + 179 + GUEST_DONE(); 180 + } 181 + 182 + struct kvm_vm *test_vm_create(void) 183 + { 184 + struct kvm_vm *vm; 185 + int nr_vcpus = test_args.nr_vcpus; 186 + 187 + vm = vm_create_with_vcpus(nr_vcpus, guest_code, vcpus); 188 + vm_init_descriptor_tables(vm); 189 + vm_install_exception_handler(vm, EXCCODE_INT, guest_irq_handler); 190 + 191 + /* Make all the test's cmdline args visible to the guest */ 192 + sync_global_to_guest(vm, test_args); 193 + 194 + return vm; 195 + } 196 + 197 + void test_vm_cleanup(struct kvm_vm *vm) 198 + { 199 + kvm_vm_free(vm); 200 + }
+5 -5
tools/testing/selftests/kvm/mmu_stress_test.c
···
  	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)",
  		    errno, strerror(errno));

- 	nr_vcpus = CPU_COUNT(&possible_mask) * 3/4;
+ 	nr_vcpus = CPU_COUNT(&possible_mask);
  	TEST_ASSERT(nr_vcpus > 0, "Uh, no CPUs?");
+ 	if (nr_vcpus >= 2)
+ 		nr_vcpus = nr_vcpus * 3/4;
  }

  int main(int argc, char *argv[])
···

  #ifdef __x86_64__
  	/* Identity map memory in the guest using 1gb pages. */
- 	for (i = 0; i < slot_size; i += SZ_1G)
- 		__virt_pg_map(vm, gpa + i, gpa + i, PG_LEVEL_1G);
+ 	virt_map_level(vm, gpa, gpa, slot_size, PG_LEVEL_1G);
  #else
- 	for (i = 0; i < slot_size; i += vm->page_size)
- 		virt_pg_map(vm, gpa + i, gpa + i);
+ 	virt_map(vm, gpa, gpa, slot_size >> vm->page_shift);
  #endif
  }
+14 -18
tools/testing/selftests/kvm/pre_fault_memory_test.c
···
  #define TEST_NPAGES (TEST_SIZE / PAGE_SIZE)
  #define TEST_SLOT   10

- static void guest_code(uint64_t base_gpa)
+ static void guest_code(uint64_t base_gva)
  {
  	volatile uint64_t val __used;
  	int i;

  	for (i = 0; i < TEST_NPAGES; i++) {
- 		uint64_t *src = (uint64_t *)(base_gpa + i * PAGE_SIZE);
+ 		uint64_t *src = (uint64_t *)(base_gva + i * PAGE_SIZE);

  		val = *src;
  	}
···
  static void __test_pre_fault_memory(unsigned long vm_type, bool private)
  {
+ 	uint64_t gpa, gva, alignment, guest_page_size;
  	const struct vm_shape shape = {
  		.mode = VM_MODE_DEFAULT,
  		.type = vm_type,
···
  	struct kvm_vm *vm;
  	struct ucall uc;

- 	uint64_t guest_test_phys_mem;
- 	uint64_t guest_test_virt_mem;
- 	uint64_t alignment, guest_page_size;
-
  	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);

  	alignment = guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size;
- 	guest_test_phys_mem = (vm->max_gfn - TEST_NPAGES) * guest_page_size;
+ 	gpa = (vm->max_gfn - TEST_NPAGES) * guest_page_size;
  #ifdef __s390x__
  	alignment = max(0x100000UL, guest_page_size);
  #else
  	alignment = SZ_2M;
  #endif
- 	guest_test_phys_mem = align_down(guest_test_phys_mem, alignment);
- 	guest_test_virt_mem = guest_test_phys_mem & ((1ULL << (vm->va_bits - 1)) - 1);
+ 	gpa = align_down(gpa, alignment);
+ 	gva = gpa & ((1ULL << (vm->va_bits - 1)) - 1);

- 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
- 				    guest_test_phys_mem, TEST_SLOT, TEST_NPAGES,
- 				    private ? KVM_MEM_GUEST_MEMFD : 0);
- 	virt_map(vm, guest_test_virt_mem, guest_test_phys_mem, TEST_NPAGES);
+ 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, gpa, TEST_SLOT,
+ 				    TEST_NPAGES, private ? KVM_MEM_GUEST_MEMFD : 0);
+ 	virt_map(vm, gva, gpa, TEST_NPAGES);

  	if (private)
- 		vm_mem_set_private(vm, guest_test_phys_mem, TEST_SIZE);
+ 		vm_mem_set_private(vm, gpa, TEST_SIZE);

- 	pre_fault_memory(vcpu, guest_test_phys_mem, 0, SZ_2M, 0, private);
- 	pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, PAGE_SIZE * 2, PAGE_SIZE, private);
- 	pre_fault_memory(vcpu, guest_test_phys_mem, TEST_SIZE, PAGE_SIZE, PAGE_SIZE, private);
+ 	pre_fault_memory(vcpu, gpa, 0, SZ_2M, 0, private);
+ 	pre_fault_memory(vcpu, gpa, SZ_2M, PAGE_SIZE * 2, PAGE_SIZE, private);
+ 	pre_fault_memory(vcpu, gpa, TEST_SIZE, PAGE_SIZE, PAGE_SIZE, private);

- 	vcpu_args_set(vcpu, 1, guest_test_virt_mem);
+ 	vcpu_args_set(vcpu, 1, gva);
  	vcpu_run(vcpu);

  	run = vcpu->run;
+4
tools/testing/selftests/kvm/riscv/get-reg-list.c
···
  	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_SUSP:
  	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_STA:
  	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_FWFT:
+ 	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_MPXY:
  	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_EXPERIMENTAL:
  	case KVM_REG_RISCV_SBI_EXT | KVM_REG_RISCV_SBI_SINGLE | KVM_RISCV_SBI_EXT_VENDOR:
  		return true;
···
  	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_SUSP),
  	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_STA),
  	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_FWFT),
+ 	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_MPXY),
  	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_EXPERIMENTAL),
  	KVM_SBI_EXT_ARR(KVM_RISCV_SBI_EXT_VENDOR),
  };
···
  	KVM_SBI_EXT_SIMPLE_CONFIG(pmu, PMU);
  	KVM_SBI_EXT_SIMPLE_CONFIG(dbcn, DBCN);
  	KVM_SBI_EXT_SIMPLE_CONFIG(susp, SUSP);
+ 	KVM_SBI_EXT_SIMPLE_CONFIG(mpxy, MPXY);
  	KVM_SBI_EXT_SUBLIST_CONFIG(fwft, FWFT);

  	KVM_ISA_EXT_SUBLIST_CONFIG(aia, AIA);
···
  	&config_sbi_pmu,
  	&config_sbi_dbcn,
  	&config_sbi_susp,
+ 	&config_sbi_mpxy,
  	&config_sbi_fwft,
  	&config_aia,
  	&config_fp_f,
+140
tools/testing/selftests/kvm/s390/user_operexec.c
// SPDX-License-Identifier: GPL-2.0-only
/* Test operation exception forwarding.
 *
 * Copyright IBM Corp. 2025
 *
 * Authors:
 *  Janosch Frank <frankja@linux.ibm.com>
 */
#include "kselftest.h"
#include "kvm_util.h"
#include "test_util.h"
#include "sie.h"

#include <linux/kvm.h>

static void guest_code_instr0(void)
{
    asm(".word 0x0000");
}

static void test_user_instr0(void)
{
    struct kvm_vcpu *vcpu;
    struct kvm_vm *vm;
    int rc;

    vm = vm_create_with_one_vcpu(&vcpu, guest_code_instr0);
    rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_INSTR0, 0);
    TEST_ASSERT_EQ(0, rc);

    vcpu_run(vcpu);
    TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0);

    kvm_vm_free(vm);
}

static void guest_code_user_operexec(void)
{
    asm(".word 0x0807");
}

static void test_user_operexec(void)
{
    struct kvm_vcpu *vcpu;
    struct kvm_vm *vm;
    int rc;

    vm = vm_create_with_one_vcpu(&vcpu, guest_code_user_operexec);
    rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_OPEREXEC, 0);
    TEST_ASSERT_EQ(0, rc);

    vcpu_run(vcpu);
    TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0x0807);

    kvm_vm_free(vm);

    /*
     * Since user_operexec is the superset, it can also be used for
     * the 0 instruction.
     */
    vm = vm_create_with_one_vcpu(&vcpu, guest_code_instr0);
    rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_OPEREXEC, 0);
    TEST_ASSERT_EQ(0, rc);

    vcpu_run(vcpu);
    TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0);

    kvm_vm_free(vm);
}

/* Combine user_instr0 and user_operexec */
static void test_user_operexec_combined(void)
{
    struct kvm_vcpu *vcpu;
    struct kvm_vm *vm;
    int rc;

    vm = vm_create_with_one_vcpu(&vcpu, guest_code_user_operexec);
    rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_INSTR0, 0);
    TEST_ASSERT_EQ(0, rc);
    rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_OPEREXEC, 0);
    TEST_ASSERT_EQ(0, rc);

    vcpu_run(vcpu);
    TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0x0807);

    kvm_vm_free(vm);

    /* Reverse the enablement order */
    vm = vm_create_with_one_vcpu(&vcpu, guest_code_user_operexec);
    rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_OPEREXEC, 0);
    TEST_ASSERT_EQ(0, rc);
    rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_INSTR0, 0);
    TEST_ASSERT_EQ(0, rc);

    vcpu_run(vcpu);
    TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
    TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0x0807);

    kvm_vm_free(vm);
}

/*
 * Run all tests above.
 *
 * Enabling the capabilities after a VCPU has been added is covered
 * automatically, since each test enables them after VCPU creation.
 */
static struct testdef {
    const char *name;
    void (*test)(void);
} testlist[] = {
    { "instr0", test_user_instr0 },
    { "operexec", test_user_operexec },
    { "operexec_combined", test_user_operexec_combined },
};

int main(int argc, char *argv[])
{
    int idx;

    TEST_REQUIRE(kvm_has_cap(KVM_CAP_S390_USER_INSTR0));

    ksft_print_header();
    ksft_set_plan(ARRAY_SIZE(testlist));
    for (idx = 0; idx < ARRAY_SIZE(testlist); idx++) {
        testlist[idx].test();
        ksft_test_result_pass("%s\n", testlist[idx].name);
    }
    ksft_finished();
}
+1 -1
tools/testing/selftests/kvm/x86/hyperv_features.c
···

  	if (!(hcall->control & HV_HYPERCALL_FAST_BIT)) {
  		input = pgs_gpa;
- 		output = pgs_gpa + 4096;
+ 		output = pgs_gpa + PAGE_SIZE;
  	} else {
  		input = output = 0;
  	}
+9 -9
tools/testing/selftests/kvm/x86/hyperv_ipi.c
···
  	/* 'Slow' HvCallSendSyntheticClusterIpi to RECEIVER_VCPU_ID_1 */
  	ipi->vector = IPI_VECTOR;
  	ipi->cpu_mask = 1 << RECEIVER_VCPU_ID_1;
- 	hyperv_hypercall(HVCALL_SEND_IPI, pgs_gpa, pgs_gpa + 4096);
+ 	hyperv_hypercall(HVCALL_SEND_IPI, pgs_gpa, pgs_gpa + PAGE_SIZE);
  	nop_loop();
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_1] == ++ipis_expected[0]);
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_2] == ipis_expected[1]);
···
  	GUEST_SYNC(stage++);

  	/* 'Slow' HvCallSendSyntheticClusterIpiEx to RECEIVER_VCPU_ID_1 */
- 	memset(hcall_page, 0, 4096);
+ 	memset(hcall_page, 0, PAGE_SIZE);
  	ipi_ex->vector = IPI_VECTOR;
  	ipi_ex->vp_set.format = HV_GENERIC_SET_SPARSE_4K;
  	ipi_ex->vp_set.valid_bank_mask = 1 << 0;
  	ipi_ex->vp_set.bank_contents[0] = BIT(RECEIVER_VCPU_ID_1);
  	hyperv_hypercall(HVCALL_SEND_IPI_EX | (1 << HV_HYPERCALL_VARHEAD_OFFSET),
- 			 pgs_gpa, pgs_gpa + 4096);
+ 			 pgs_gpa, pgs_gpa + PAGE_SIZE);
  	nop_loop();
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_1] == ++ipis_expected[0]);
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_2] == ipis_expected[1]);
···
  	GUEST_SYNC(stage++);

  	/* 'Slow' HvCallSendSyntheticClusterIpiEx to RECEIVER_VCPU_ID_2 */
- 	memset(hcall_page, 0, 4096);
+ 	memset(hcall_page, 0, PAGE_SIZE);
  	ipi_ex->vector = IPI_VECTOR;
  	ipi_ex->vp_set.format = HV_GENERIC_SET_SPARSE_4K;
  	ipi_ex->vp_set.valid_bank_mask = 1 << 1;
  	ipi_ex->vp_set.bank_contents[0] = BIT(RECEIVER_VCPU_ID_2 - 64);
  	hyperv_hypercall(HVCALL_SEND_IPI_EX | (1 << HV_HYPERCALL_VARHEAD_OFFSET),
- 			 pgs_gpa, pgs_gpa + 4096);
+ 			 pgs_gpa, pgs_gpa + PAGE_SIZE);
  	nop_loop();
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_1] == ipis_expected[0]);
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_2] == ++ipis_expected[1]);
···
  	GUEST_SYNC(stage++);

  	/* 'Slow' HvCallSendSyntheticClusterIpiEx to both RECEIVER_VCPU_ID_{1,2} */
- 	memset(hcall_page, 0, 4096);
+ 	memset(hcall_page, 0, PAGE_SIZE);
  	ipi_ex->vector = IPI_VECTOR;
  	ipi_ex->vp_set.format = HV_GENERIC_SET_SPARSE_4K;
  	ipi_ex->vp_set.valid_bank_mask = 1 << 1 | 1;
  	ipi_ex->vp_set.bank_contents[0] = BIT(RECEIVER_VCPU_ID_1);
  	ipi_ex->vp_set.bank_contents[1] = BIT(RECEIVER_VCPU_ID_2 - 64);
  	hyperv_hypercall(HVCALL_SEND_IPI_EX | (2 << HV_HYPERCALL_VARHEAD_OFFSET),
- 			 pgs_gpa, pgs_gpa + 4096);
+ 			 pgs_gpa, pgs_gpa + PAGE_SIZE);
  	nop_loop();
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_1] == ++ipis_expected[0]);
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_2] == ++ipis_expected[1]);
···
  	GUEST_SYNC(stage++);

  	/* 'Slow' HvCallSendSyntheticClusterIpiEx to HV_GENERIC_SET_ALL */
- 	memset(hcall_page, 0, 4096);
+ 	memset(hcall_page, 0, PAGE_SIZE);
  	ipi_ex->vector = IPI_VECTOR;
  	ipi_ex->vp_set.format = HV_GENERIC_SET_ALL;
- 	hyperv_hypercall(HVCALL_SEND_IPI_EX, pgs_gpa, pgs_gpa + 4096);
+ 	hyperv_hypercall(HVCALL_SEND_IPI_EX, pgs_gpa, pgs_gpa + PAGE_SIZE);
  	nop_loop();
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_1] == ++ipis_expected[0]);
  	GUEST_ASSERT(ipis_rcvd[RECEIVER_VCPU_ID_2] == ++ipis_expected[1]);
+1 -1
tools/testing/selftests/kvm/x86/hyperv_tlb_flush.c
···
  	for (i = 0; i < NTEST_PAGES; i++) {
  		pte = vm_get_page_table_entry(vm, data->test_pages + i * PAGE_SIZE);
  		gpa = addr_hva2gpa(vm, pte);
- 		__virt_pg_map(vm, gva + PAGE_SIZE * i, gpa & PAGE_MASK, PG_LEVEL_4K);
+ 		virt_pg_map(vm, gva + PAGE_SIZE * i, gpa & PAGE_MASK);
  		data->test_pages_pte[i] = gva + (gpa & ~PAGE_MASK);
  	}
+116
tools/testing/selftests/kvm/x86/nested_invalid_cr3_test.c
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) 2025, Google LLC.
 *
 * This test verifies that L1 fails to enter L2 with an invalid CR3, and
 * succeeds otherwise.
 */
#include "kvm_util.h"
#include "vmx.h"
#include "svm_util.h"
#include "kselftest.h"

#define L2_GUEST_STACK_SIZE 64

static void l2_guest_code(void)
{
    vmcall();
}

static void l1_svm_code(struct svm_test_data *svm)
{
    unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
    uintptr_t save_cr3;

    generic_svm_setup(svm, l2_guest_code,
                      &l2_guest_stack[L2_GUEST_STACK_SIZE]);

    /* Try to run L2 with an invalid CR3 and make sure it fails */
    save_cr3 = svm->vmcb->save.cr3;
    svm->vmcb->save.cr3 = -1ull;
    run_guest(svm->vmcb, svm->vmcb_gpa);
    GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_ERR);

    /* Now restore CR3 and make sure L2 runs successfully */
    svm->vmcb->save.cr3 = save_cr3;
    run_guest(svm->vmcb, svm->vmcb_gpa);
    GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL);

    GUEST_DONE();
}

static void l1_vmx_code(struct vmx_pages *vmx_pages)
{
    unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
    uintptr_t save_cr3;

    GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
    GUEST_ASSERT(load_vmcs(vmx_pages));

    prepare_vmcs(vmx_pages, l2_guest_code,
                 &l2_guest_stack[L2_GUEST_STACK_SIZE]);

    /* Try to run L2 with an invalid CR3 and make sure it fails */
    save_cr3 = vmreadz(GUEST_CR3);
    vmwrite(GUEST_CR3, -1ull);
    GUEST_ASSERT(!vmlaunch());
    GUEST_ASSERT(vmreadz(VM_EXIT_REASON) ==
                 (EXIT_REASON_FAILED_VMENTRY | EXIT_REASON_INVALID_STATE));

    /* Now restore CR3 and make sure L2 runs successfully */
    vmwrite(GUEST_CR3, save_cr3);
    GUEST_ASSERT(!vmlaunch());
    GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);

    GUEST_DONE();
}

static void l1_guest_code(void *data)
{
    if (this_cpu_has(X86_FEATURE_VMX))
        l1_vmx_code(data);
    else
        l1_svm_code(data);
}

int main(int argc, char *argv[])
{
    struct kvm_vcpu *vcpu;
    struct kvm_vm *vm;
    vm_vaddr_t guest_gva = 0;

    TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX) ||
                 kvm_cpu_has(X86_FEATURE_SVM));

    vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);

    if (kvm_cpu_has(X86_FEATURE_VMX))
        vcpu_alloc_vmx(vm, &guest_gva);
    else
        vcpu_alloc_svm(vm, &guest_gva);

    vcpu_args_set(vcpu, 1, guest_gva);

    for (;;) {
        struct ucall uc;

        vcpu_run(vcpu);
        TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);

        switch (get_ucall(vcpu, &uc)) {
        case UCALL_ABORT:
            REPORT_GUEST_ASSERT(uc);
        case UCALL_SYNC:
            break;
        case UCALL_DONE:
            goto done;
        default:
            TEST_FAIL("Unknown ucall %lu", uc.cmd);
        }
    }

done:
    kvm_vm_free(vm);
    return 0;
}
+3 -6
tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
···
  	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
  	pthread_t threads[KVM_MAX_VCPUS];
  	struct kvm_vm *vm;
- 	int memfd, i, r;
+ 	int memfd, i;

  	const struct vm_shape shape = {
  		.mode = VM_MODE_DEFAULT,
···
  	 * should prevent the VM from being fully destroyed until the last
  	 * reference to the guest_memfd is also put.
  	 */
- 	r = fallocate(memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, memfd_size);
- 	TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r));
-
- 	r = fallocate(memfd, FALLOC_FL_KEEP_SIZE, 0, memfd_size);
- 	TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("fallocate()", r));
+ 	kvm_fallocate(memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, memfd_size);
+ 	kvm_fallocate(memfd, FALLOC_FL_KEEP_SIZE, 0, memfd_size);

  	close(memfd);
  }
+1 -1
tools/testing/selftests/kvm/x86/sev_smoke_test.c
···
  	vm_sev_launch(vm, policy, NULL);

  	/* This page is shared, so make it decrypted. */
- 	memset(hva, 0, 4096);
+ 	memset(hva, 0, PAGE_SIZE);

  	vcpu_run(vcpu);
+1 -1
tools/testing/selftests/kvm/x86/state_test.c
···

  	if (this_cpu_has(X86_FEATURE_XSAVE)) {
  		uint64_t supported_xcr0 = this_cpu_supported_xcr0();
- 		uint8_t buffer[4096];
+ 		uint8_t buffer[PAGE_SIZE];

  		memset(buffer, 0xcc, sizeof(buffer));
+1 -1
tools/testing/selftests/kvm/x86/userspace_io_test.c
···
  		regs.rcx = 1;
  	if (regs.rcx == 3)
  		regs.rcx = 8192;
- 	memset((void *)run + run->io.data_offset, 0xaa, 4096);
+ 	memset((void *)run + run->io.data_offset, 0xaa, PAGE_SIZE);
  	vcpu_regs_set(vcpu, &regs);
  }
+33 -9
tools/testing/selftests/kvm/x86/vmx_close_while_nested_test.c → tools/testing/selftests/kvm/x86/nested_close_kvm_test.c
···
  // SPDX-License-Identifier: GPL-2.0-only
  /*
- * vmx_close_while_nested
- *
   * Copyright (C) 2019, Red Hat, Inc.
   *
   * Verify that nothing bad happens if a KVM user exits with open
···
  #include "kvm_util.h"
  #include "processor.h"
  #include "vmx.h"
+ #include "svm_util.h"

  #include <string.h>
  #include <sys/ioctl.h>
···
  	PORT_L0_EXIT = 0x2000,
  };

+ #define L2_GUEST_STACK_SIZE 64
+
  static void l2_guest_code(void)
  {
  	/* Exit to L0 */
···
  		: : [port] "d" (PORT_L0_EXIT) : "rax");
  }

- static void l1_guest_code(struct vmx_pages *vmx_pages)
+ static void l1_vmx_code(struct vmx_pages *vmx_pages)
  {
- #define L2_GUEST_STACK_SIZE 64
  	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];

  	GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
···
  	GUEST_ASSERT(0);
  }

+ static void l1_svm_code(struct svm_test_data *svm)
+ {
+ 	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+ 	/* Prepare the VMCB for L2 execution. */
+ 	generic_svm_setup(svm, l2_guest_code,
+ 			  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+ 	run_guest(svm->vmcb, svm->vmcb_gpa);
+ 	GUEST_ASSERT(0);
+ }
+
+ static void l1_guest_code(void *data)
+ {
+ 	if (this_cpu_has(X86_FEATURE_VMX))
+ 		l1_vmx_code(data);
+ 	else
+ 		l1_svm_code(data);
+ }
+
  int main(int argc, char *argv[])
  {
- 	vm_vaddr_t vmx_pages_gva;
+ 	vm_vaddr_t guest_gva;
  	struct kvm_vcpu *vcpu;
  	struct kvm_vm *vm;

- 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
+ 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX) ||
+ 		     kvm_cpu_has(X86_FEATURE_SVM));

  	vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);

- 	/* Allocate VMX pages and shared descriptors (vmx_pages). */
- 	vcpu_alloc_vmx(vm, &vmx_pages_gva);
- 	vcpu_args_set(vcpu, 1, vmx_pages_gva);
+ 	if (kvm_cpu_has(X86_FEATURE_VMX))
+ 		vcpu_alloc_vmx(vm, &guest_gva);
+ 	else
+ 		vcpu_alloc_svm(vm, &guest_gva);
+
+ 	vcpu_args_set(vcpu, 1, guest_gva);

  	for (;;) {
  		volatile struct kvm_run *run = vcpu->run;
+6 -6
tools/testing/selftests/kvm/x86/vmx_dirty_log_test.c
···
  	 * GPAs as the EPT enabled case.
  	 */
  	if (enable_ept) {
- 		prepare_eptp(vmx, vm, 0);
+ 		prepare_eptp(vmx, vm);
  		nested_map_memslot(vmx, vm, 0);
- 		nested_map(vmx, vm, NESTED_TEST_MEM1, GUEST_TEST_MEM, 4096);
- 		nested_map(vmx, vm, NESTED_TEST_MEM2, GUEST_TEST_MEM, 4096);
+ 		nested_map(vmx, vm, NESTED_TEST_MEM1, GUEST_TEST_MEM, PAGE_SIZE);
+ 		nested_map(vmx, vm, NESTED_TEST_MEM2, GUEST_TEST_MEM, PAGE_SIZE);
  	}

  	bmap = bitmap_zalloc(TEST_MEM_PAGES);
  	host_test_mem = addr_gpa2hva(vm, GUEST_TEST_MEM);

  	while (!done) {
- 		memset(host_test_mem, 0xaa, TEST_MEM_PAGES * 4096);
+ 		memset(host_test_mem, 0xaa, TEST_MEM_PAGES * PAGE_SIZE);
  		vcpu_run(vcpu);
  		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
···
  		}

  		TEST_ASSERT(!test_bit(1, bmap), "Page 1 incorrectly reported dirty");
- 		TEST_ASSERT(host_test_mem[4096 / 8] == 0xaaaaaaaaaaaaaaaaULL, "Page 1 written by guest");
+ 		TEST_ASSERT(host_test_mem[PAGE_SIZE / 8] == 0xaaaaaaaaaaaaaaaaULL, "Page 1 written by guest");
  		TEST_ASSERT(!test_bit(2, bmap), "Page 2 incorrectly reported dirty");
- 		TEST_ASSERT(host_test_mem[8192 / 8] == 0xaaaaaaaaaaaaaaaaULL, "Page 2 written by guest");
+ 		TEST_ASSERT(host_test_mem[PAGE_SIZE * 2 / 8] == 0xaaaaaaaaaaaaaaaaULL, "Page 2 written by guest");
  		break;
  	case UCALL_DONE:
  		done = true;
+132
tools/testing/selftests/kvm/x86/vmx_nested_la57_state_test.c
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) 2025, Google LLC.
 *
 * Test KVM's ability to save and restore nested state when the L1 guest
 * is using 5-level paging and the L2 guest is using 4-level paging.
 *
 * This test would have failed prior to commit 9245fd6b8531 ("KVM: x86:
 * model canonical checks more precisely").
 */
#include "test_util.h"
#include "kvm_util.h"
#include "processor.h"
#include "vmx.h"

#define LA57_GS_BASE 0xff2bc0311fb00000ull

static void l2_guest_code(void)
{
    /*
     * Sync with L0 to trigger save/restore. After
     * resuming, execute VMCALL to exit back to L1.
     */
    GUEST_SYNC(1);
    vmcall();
}

static void l1_guest_code(struct vmx_pages *vmx_pages)
{
#define L2_GUEST_STACK_SIZE 64
    unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
    u64 guest_cr4;
    vm_paddr_t pml5_pa, pml4_pa;
    u64 *pml5;
    u64 exit_reason;

    /* Set GS_BASE to a value that is only canonical with LA57. */
    wrmsr(MSR_GS_BASE, LA57_GS_BASE);
    GUEST_ASSERT(rdmsr(MSR_GS_BASE) == LA57_GS_BASE);

    GUEST_ASSERT(vmx_pages->vmcs_gpa);
    GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
    GUEST_ASSERT(load_vmcs(vmx_pages));

    prepare_vmcs(vmx_pages, l2_guest_code,
                 &l2_guest_stack[L2_GUEST_STACK_SIZE]);

    /*
     * Set up L2 with a 4-level page table by pointing its CR3 to
     * L1's first PML4 table and clearing CR4.LA57. This creates
     * the CR4.LA57 mismatch that exercises the bug.
     */
    pml5_pa = get_cr3() & PHYSICAL_PAGE_MASK;
    pml5 = (u64 *)pml5_pa;
    pml4_pa = pml5[0] & PHYSICAL_PAGE_MASK;
    vmwrite(GUEST_CR3, pml4_pa);

    guest_cr4 = vmreadz(GUEST_CR4);
    guest_cr4 &= ~X86_CR4_LA57;
    vmwrite(GUEST_CR4, guest_cr4);

    GUEST_ASSERT(!vmlaunch());

    exit_reason = vmreadz(VM_EXIT_REASON);
    GUEST_ASSERT(exit_reason == EXIT_REASON_VMCALL);
}

void guest_code(struct vmx_pages *vmx_pages)
{
    l1_guest_code(vmx_pages);
    GUEST_DONE();
}

int main(int argc, char *argv[])
{
    vm_vaddr_t vmx_pages_gva = 0;
    struct kvm_vm *vm;
    struct kvm_vcpu *vcpu;
    struct kvm_x86_state *state;
    struct ucall uc;
    int stage;

    TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
    TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_LA57));
    TEST_REQUIRE(kvm_has_cap(KVM_CAP_NESTED_STATE));

    vm = vm_create_with_one_vcpu(&vcpu, guest_code);

    /*
     * L1 needs to read its own PML5 table to set up L2. Identity map
     * the PML5 table to facilitate this.
     */
    virt_map(vm, vm->pgd, vm->pgd, 1);

    vcpu_alloc_vmx(vm, &vmx_pages_gva);
    vcpu_args_set(vcpu, 1, vmx_pages_gva);

    for (stage = 1;; stage++) {
        vcpu_run(vcpu);
        TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);

        switch (get_ucall(vcpu, &uc)) {
        case UCALL_ABORT:
            REPORT_GUEST_ASSERT(uc);
            /* NOT REACHED */
        case UCALL_SYNC:
            break;
        case UCALL_DONE:
            goto done;
        default:
            TEST_FAIL("Unknown ucall %lu", uc.cmd);
        }

        TEST_ASSERT(uc.args[1] == stage,
                    "Expected stage %d, got stage %lu", stage, (ulong)uc.args[1]);
        if (stage == 1) {
            pr_info("L2 is active; performing save/restore.\n");
            state = vcpu_save_state(vcpu);

            kvm_vm_release(vm);

            /* Restore state in a new VM. */
            vcpu = vm_recreate_with_one_vcpu(vm);
            vcpu_load_state(vcpu, state);
            kvm_x86_state_cleanup(state);
        }
    }

done:
    kvm_vm_free(vm);
    return 0;
}
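For reference, the choice of LA57_GS_BASE = 0xff2bc0311fb00000 is what makes the test bite: under 57-bit addressing the address is canonical (bits 63:57 all replicate bit 56, which is set here), but under 48-bit addressing it is not (bits 63:48 are 0xff2b, which is not a sign-extension of bit 47). A canonical check performed against the wrong paging depth therefore misjudges exactly this value.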
+43 -5
tools/testing/selftests/kvm/x86/vmx_nested_tsc_scaling_test.c → tools/testing/selftests/kvm/x86/nested_tsc_scaling_test.c
···

  #include "kvm_util.h"
  #include "vmx.h"
+ #include "svm_util.h"
  #include "kselftest.h"

  /* L2 is scaled up (from L1's perspective) by this factor */
···
  	__asm__ __volatile__("vmcall");
  }

- static void l1_guest_code(struct vmx_pages *vmx_pages)
+ static void l1_svm_code(struct svm_test_data *svm)
+ {
+ 	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+ 	/* check that L1's frequency looks alright before launching L2 */
+ 	check_tsc_freq(UCHECK_L1);
+
+ 	generic_svm_setup(svm, l2_guest_code,
+ 			  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+ 	/* enable TSC scaling for L2 */
+ 	wrmsr(MSR_AMD64_TSC_RATIO, L2_SCALE_FACTOR << 32);
+
+ 	/* launch L2 */
+ 	run_guest(svm->vmcb, svm->vmcb_gpa);
+ 	GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL);
+
+ 	/* check that L1's frequency still looks good */
+ 	check_tsc_freq(UCHECK_L1);
+
+ 	GUEST_DONE();
+ }
+
+ static void l1_vmx_code(struct vmx_pages *vmx_pages)
  {
  	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
  	uint32_t control;
···
  	GUEST_DONE();
  }

+ static void l1_guest_code(void *data)
+ {
+ 	if (this_cpu_has(X86_FEATURE_VMX))
+ 		l1_vmx_code(data);
+ 	else
+ 		l1_svm_code(data);
+ }
+
  int main(int argc, char *argv[])
  {
  	struct kvm_vcpu *vcpu;
  	struct kvm_vm *vm;
- 	vm_vaddr_t vmx_pages_gva;
+ 	vm_vaddr_t guest_gva = 0;

  	uint64_t tsc_start, tsc_end;
  	uint64_t tsc_khz;
···
  	uint64_t l1_tsc_freq = 0;
  	uint64_t l2_tsc_freq = 0;

- 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
+ 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX) ||
+ 		     kvm_cpu_has(X86_FEATURE_SVM));
  	TEST_REQUIRE(kvm_has_cap(KVM_CAP_TSC_CONTROL));
  	TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
···
  	printf("real TSC frequency is around: %"PRIu64"\n", l0_tsc_freq);

  	vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
- 	vcpu_alloc_vmx(vm, &vmx_pages_gva);
- 	vcpu_args_set(vcpu, 1, vmx_pages_gva);
+
+ 	if (kvm_cpu_has(X86_FEATURE_VMX))
+ 		vcpu_alloc_vmx(vm, &guest_gva);
+ 	else
+ 		vcpu_alloc_svm(vm, &guest_gva);
+
+ 	vcpu_args_set(vcpu, 1, guest_gva);

  	tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
  	TEST_ASSERT(tsc_khz != -1, "vcpu ioctl KVM_GET_TSC_KHZ failed");
+41 -32
tools/testing/selftests/kvm/x86/vmx_tsc_adjust_test.c → tools/testing/selftests/kvm/x86/nested_tsc_adjust_test.c
···
  // SPDX-License-Identifier: GPL-2.0-only
  /*
- * vmx_tsc_adjust_test
- *
   * Copyright (C) 2018, Google LLC.
   *
   * IA32_TSC_ADJUST test
···
  #include "kvm_util.h"
  #include "processor.h"
  #include "vmx.h"
+ #include "svm_util.h"

  #include <string.h>
  #include <sys/ioctl.h>
···

  #define TSC_ADJUST_VALUE (1ll << 32)
  #define TSC_OFFSET_VALUE -(1ll << 48)
+
+ #define L2_GUEST_STACK_SIZE 64

  enum {
  	PORT_ABORT = 0x1000,
···
  	__asm__ __volatile__("vmcall");
  }

- static void l1_guest_code(struct vmx_pages *vmx_pages)
+ static void l1_guest_code(void *data)
  {
- #define L2_GUEST_STACK_SIZE 64
  	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
- 	uint32_t control;
- 	uintptr_t save_cr3;

+ 	/* Set TSC from L1 and make sure TSC_ADJUST is updated correctly */
  	GUEST_ASSERT(rdtsc() < TSC_ADJUST_VALUE);
  	wrmsr(MSR_IA32_TSC, rdtsc() - TSC_ADJUST_VALUE);
  	check_ia32_tsc_adjust(-1 * TSC_ADJUST_VALUE);

- 	GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
- 	GUEST_ASSERT(load_vmcs(vmx_pages));
+ 	/*
+ 	 * Run L2 with TSC_OFFSET. L2 will write to TSC, and L1 is not
+ 	 * intercepting the write so it should update L1's TSC_ADJUST.
+ 	 */
+ 	if (this_cpu_has(X86_FEATURE_VMX)) {
+ 		struct vmx_pages *vmx_pages = data;
+ 		uint32_t control;

- 	/* Prepare the VMCS for L2 execution. */
- 	prepare_vmcs(vmx_pages, l2_guest_code,
- 		     &l2_guest_stack[L2_GUEST_STACK_SIZE]);
- 	control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
- 	control |= CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_USE_TSC_OFFSETTING;
- 	vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
- 	vmwrite(TSC_OFFSET, TSC_OFFSET_VALUE);
+ 		GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
+ 		GUEST_ASSERT(load_vmcs(vmx_pages));

- 	/* Jump into L2. First, test failure to load guest CR3. */
- 	save_cr3 = vmreadz(GUEST_CR3);
- 	vmwrite(GUEST_CR3, -1ull);
- 	GUEST_ASSERT(!vmlaunch());
- 	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) ==
- 		     (EXIT_REASON_FAILED_VMENTRY | EXIT_REASON_INVALID_STATE));
- 	check_ia32_tsc_adjust(-1 * TSC_ADJUST_VALUE);
- 	vmwrite(GUEST_CR3, save_cr3);
+ 		prepare_vmcs(vmx_pages, l2_guest_code,
+ 			     &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+ 		control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
+ 		control |= CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_USE_TSC_OFFSETTING;
+ 		vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
+ 		vmwrite(TSC_OFFSET, TSC_OFFSET_VALUE);

- 	GUEST_ASSERT(!vmlaunch());
- 	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
+ 		GUEST_ASSERT(!vmlaunch());
+ 		GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
+ 	} else {
+ 		struct svm_test_data *svm = data;
+
+ 		generic_svm_setup(svm, l2_guest_code,
+ 				  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+ 		svm->vmcb->control.tsc_offset = TSC_OFFSET_VALUE;
+ 		run_guest(svm->vmcb, svm->vmcb_gpa);
+ 		GUEST_ASSERT(svm->vmcb->control.exit_code == SVM_EXIT_VMMCALL);
+ 	}

  	check_ia32_tsc_adjust(-2 * TSC_ADJUST_VALUE);
-
  	GUEST_DONE();
  }

···
  int main(int argc, char *argv[])
  {
- 	vm_vaddr_t vmx_pages_gva;
+ 	vm_vaddr_t nested_gva;
  	struct kvm_vcpu *vcpu;

- 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
+ 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX) ||
+ 		     kvm_cpu_has(X86_FEATURE_SVM));

- 	vm = vm_create_with_one_vcpu(&vcpu, (void *) l1_guest_code);
+ 	vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
+ 	if (kvm_cpu_has(X86_FEATURE_VMX))
+ 		vcpu_alloc_vmx(vm, &nested_gva);
+ 	else
+ 		vcpu_alloc_svm(vm, &nested_gva);

- 	/* Allocate VMX pages and shared descriptors (vmx_pages). */
- 	vcpu_alloc_vmx(vm, &vmx_pages_gva);
- 	vcpu_args_set(vcpu, 1, vmx_pages_gva);
+ 	vcpu_args_set(vcpu, 1, nested_gva);

  	for (;;) {
  		struct ucall uc;
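For reference, the final check_ia32_tsc_adjust(-2 * TSC_ADJUST_VALUE) follows from the architectural rule that a guest write of IA32_TSC adds (new value - current TSC) to IA32_TSC_ADJUST: L1's wrmsr() above rewinds the TSC by TSC_ADJUST_VALUE, taking TSC_ADJUST to -1 * TSC_ADJUST_VALUE, and L2's unintercepted TSC write (in l2_guest_code, not shown in this hunk) rewinds it by the same amount again, accumulating to -2 * TSC_ADJUST_VALUE in L1's view.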
+2 -3
tools/testing/selftests/kvm/x86/xapic_ipi_test.c
···
  	int nodes = 0;
  	time_t start_time, last_update, now;
  	time_t interval_secs = 1;
- 	int i, r;
+ 	int i;
  	int from, to;
  	unsigned long bit;
  	uint64_t hlt_count;
···
  		 delay_usecs);

  	/* Get set of first 64 numa nodes available */
- 	r = get_mempolicy(NULL, &nodemask, sizeof(nodemask) * 8,
- 			  0, MPOL_F_MEMS_ALLOWED);
- 	TEST_ASSERT(r == 0, "get_mempolicy failed errno=%d", errno);
+ 	kvm_get_mempolicy(NULL, &nodemask, sizeof(nodemask) * 8,
+ 			  0, MPOL_F_MEMS_ALLOWED);

  	fprintf(stderr, "Numa nodes found amongst first %lu possible nodes "
  		"(each 1-bit indicates node is present): %#lx\n",
-3
virt/kvm/Kconfig
···
  	tristate
  	select IRQ_BYPASS_MANAGER

- config HAVE_KVM_VCPU_ASYNC_IOCTL
- 	bool
-
  config HAVE_KVM_VCPU_RUN_PID_CHANGE
  	bool
+1 -1
virt/kvm/eventfd.c
···
   */
  int kvm_irqfd_init(void)
  {
- 	irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup", 0, 0);
+ 	irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup", WQ_PERCPU, 0);
  	if (!irqfd_cleanup_wq)
  		return -ENOMEM;
+279 -94
virt/kvm/guest_memfd.c
···
  // SPDX-License-Identifier: GPL-2.0
+ #include <linux/anon_inodes.h>
  #include <linux/backing-dev.h>
  #include <linux/falloc.h>
+ #include <linux/fs.h>
  #include <linux/kvm_host.h>
+ #include <linux/mempolicy.h>
+ #include <linux/pseudo_fs.h>
  #include <linux/pagemap.h>
- #include <linux/anon_inodes.h>

  #include "kvm_mm.h"

- struct kvm_gmem {
+ static struct vfsmount *kvm_gmem_mnt;
+
+ /*
+  * A guest_memfd instance can be associated with multiple VMs, each with
+  * its own "view" of the underlying physical memory.
+  *
+  * The gmem's inode is effectively the raw underlying physical storage, and is
+  * used to track properties of the physical memory, while each gmem file is
+  * effectively a single VM's view of that storage, and is used to track assets
+  * specific to its associated VM, e.g. memslots=>gmem bindings.
+  */
+ struct gmem_file {
  	struct kvm *kvm;
  	struct xarray bindings;
  	struct list_head entry;
  };
+
+ struct gmem_inode {
+ 	struct shared_policy policy;
+ 	struct inode vfs_inode;
+
+ 	u64 flags;
+ };
+
+ static __always_inline struct gmem_inode *GMEM_I(struct inode *inode)
+ {
+ 	return container_of(inode, struct gmem_inode, vfs_inode);
+ }
+
+ #define kvm_gmem_for_each_file(f, mapping) \
+ 	list_for_each_entry(f, &(mapping)->i_private_list, entry)

  /**
   * folio_file_pfn - like folio_file_page, but return a pfn.
···
  static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
  {
  	return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
+ }
+
+ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
+ {
+ 	return gfn - slot->base_gfn + slot->gmem.pgoff;
  }

  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
···
  	 * The order will be passed when creating the guest_memfd, and
  	 * checked when creating memslots.
  	 */
- 	WARN_ON(!IS_ALIGNED(slot->gmem.pgoff, 1 << folio_order(folio)));
- 	index = gfn - slot->base_gfn + slot->gmem.pgoff;
- 	index = ALIGN_DOWN(index, 1 << folio_order(folio));
+ 	WARN_ON(!IS_ALIGNED(slot->gmem.pgoff, folio_nr_pages(folio)));
+ 	index = kvm_gmem_get_index(slot, gfn);
+ 	index = ALIGN_DOWN(index, folio_nr_pages(folio));
  	r = __kvm_gmem_prepare_folio(kvm, slot, index, folio);
  	if (!r)
  		kvm_gmem_mark_prepared(folio);
···
  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
  {
  	/* TODO: Support huge pages. */
- 	return filemap_grab_folio(inode->i_mapping, index);
+ 	struct mempolicy *policy;
+ 	struct folio *folio;
+
+ 	/*
+ 	 * Fast-path: See if folio is already present in mapping to avoid
+ 	 * policy_lookup.
+ 	 */
+ 	folio = __filemap_get_folio(inode->i_mapping, index,
+ 				    FGP_LOCK | FGP_ACCESSED, 0);
+ 	if (!IS_ERR(folio))
+ 		return folio;
+
+ 	policy = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, index);
+ 	folio = __filemap_get_folio_mpol(inode->i_mapping, index,
+ 					 FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+ 					 mapping_gfp_mask(inode->i_mapping), policy);
+ 	mpol_cond_put(policy);
+
+ 	return folio;
  }

  static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *inode)
  {
- 	if ((u64)inode->i_private & GUEST_MEMFD_FLAG_INIT_SHARED)
+ 	if (GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED)
  		return KVM_FILTER_SHARED;

  	return KVM_FILTER_PRIVATE;
  }

- static void __kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+ static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
  					pgoff_t end,
  					enum kvm_gfn_range_filter attr_filter)
  {
  	bool flush = false, found_memslot = false;
  	struct kvm_memory_slot *slot;
- 	struct kvm *kvm = gmem->kvm;
+ 	struct kvm *kvm = f->kvm;
  	unsigned long index;

- 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+ 	xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
  		pgoff_t pgoff = slot->gmem.pgoff;

  		struct kvm_gfn_range gfn_range = {
···
  static void kvm_gmem_invalidate_begin(struct inode *inode, pgoff_t start,
  				      pgoff_t end)
  {
- 	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
  	enum kvm_gfn_range_filter attr_filter;
- 	struct kvm_gmem *gmem;
+ 	struct gmem_file *f;

  	attr_filter = kvm_gmem_get_invalidate_filter(inode);

- 	list_for_each_entry(gmem, gmem_list, entry)
- 		__kvm_gmem_invalidate_begin(gmem, start, end, attr_filter);
+ 	kvm_gmem_for_each_file(f, inode->i_mapping)
+ 		__kvm_gmem_invalidate_begin(f, start, end, attr_filter);
  }

- static void __kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
+ static void __kvm_gmem_invalidate_end(struct gmem_file *f, pgoff_t start,
  				      pgoff_t end)
  {
- 	struct kvm *kvm = gmem->kvm;
+ 	struct kvm *kvm = f->kvm;

- 	if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+ 	if (xa_find(&f->bindings, &start, end - 1, XA_PRESENT)) {
  		KVM_MMU_LOCK(kvm);
  		kvm_mmu_invalidate_end(kvm);
  		KVM_MMU_UNLOCK(kvm);
···
  static void kvm_gmem_invalidate_end(struct inode *inode, pgoff_t start,
  				    pgoff_t end)
  {
- 	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
- 	struct kvm_gmem *gmem;
+ 	struct gmem_file *f;

- 	list_for_each_entry(gmem, gmem_list, entry)
- 		__kvm_gmem_invalidate_end(gmem, start, end);
+ 	kvm_gmem_for_each_file(f, inode->i_mapping)
+ 		__kvm_gmem_invalidate_end(f, start, end);
  }

  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
···

  static int kvm_gmem_release(struct inode *inode, struct file *file)
  {
- 	struct kvm_gmem *gmem = file->private_data;
+ 	struct gmem_file *f = file->private_data;
  	struct kvm_memory_slot *slot;
- 	struct kvm *kvm = gmem->kvm;
+ 	struct kvm *kvm = f->kvm;
  	unsigned long index;

  	/*
···

  	filemap_invalidate_lock(inode->i_mapping);

- 	xa_for_each(&gmem->bindings, index, slot)
+ 	xa_for_each(&f->bindings, index, slot)
  		WRITE_ONCE(slot->gmem.file, NULL);

···
  	/*
  	 * Zap all SPTEs pointed at by this file. Do not free the backing
  	 * memory, as its lifetime is associated with the inode, not the file.
  	 */
- 	__kvm_gmem_invalidate_begin(gmem, 0, -1ul,
+ 	__kvm_gmem_invalidate_begin(f, 0, -1ul,
  				    kvm_gmem_get_invalidate_filter(inode));
- 	__kvm_gmem_invalidate_end(gmem, 0, -1ul);
+ 	__kvm_gmem_invalidate_end(f, 0, -1ul);

- 	list_del(&gmem->entry);
+ 	list_del(&f->entry);

  	filemap_invalidate_unlock(inode->i_mapping);

  	mutex_unlock(&kvm->slots_lock);

- 	xa_destroy(&gmem->bindings);
- 	kfree(gmem);
+ 	xa_destroy(&f->bindings);
+ 	kfree(f);

  	kvm_put_kvm(kvm);
···
  	return get_file_active(&slot->gmem.file);
  }

- static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
- {
- 	return gfn - slot->base_gfn + slot->gmem.pgoff;
- }
+ DEFINE_CLASS(gmem_get_file, struct file *, if (_T) fput(_T),
+ 	     kvm_gmem_get_file(slot), struct kvm_memory_slot *slot);

  static bool kvm_gmem_supports_mmap(struct inode *inode)
  {
- 	const u64 flags = (u64)inode->i_private;
-
- 	return flags & GUEST_MEMFD_FLAG_MMAP;
+ 	return GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_MMAP;
  }

  static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
···
  	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
  		return VM_FAULT_SIGBUS;

- 	if (!((u64)inode->i_private & GUEST_MEMFD_FLAG_INIT_SHARED))
+ 	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
  		return VM_FAULT_SIGBUS;

  	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
  	if (IS_ERR(folio)) {
- 		int err = PTR_ERR(folio);
-
- 		if (err == -EAGAIN)
+ 		if (PTR_ERR(folio) == -EAGAIN)
  			return VM_FAULT_RETRY;

- 		return vmf_error(err);
+ 		return vmf_error(PTR_ERR(folio));
  	}

  	if (WARN_ON_ONCE(folio_test_large(folio))) {
···
  	return ret;
  }

+ #ifdef CONFIG_NUMA
+ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
+ {
+ 	struct inode *inode = file_inode(vma->vm_file);
+
+ 	return mpol_set_shared_policy(&GMEM_I(inode)->policy, vma, mpol);
+ }
+
+ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
+ 					     unsigned long addr, pgoff_t *pgoff)
+ {
+ 	struct inode *inode = file_inode(vma->vm_file);
+
+ 	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
+
+ 	/*
+ 	 * Return the memory policy for this index, or NULL if none is set.
+ 	 *
+ 	 * Returning NULL, e.g. instead of the current task's memory policy, is
+ 	 * important for the .get_policy kernel ABI: it indicates that no
+ 	 * explicit policy has been set via mbind() for this memory. The caller
+ 	 * can then replace NULL with the default memory policy instead of the
+ 	 * current task's memory policy.
+ 	 */
+ 	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
+ }
+ #endif /* CONFIG_NUMA */
+
  static const struct vm_operations_struct kvm_gmem_vm_ops = {
- 	.fault = kvm_gmem_fault_user_mapping,
+ 	.fault		= kvm_gmem_fault_user_mapping,
+ #ifdef CONFIG_NUMA
+ 	.get_policy	= kvm_gmem_get_policy,
+ 	.set_policy	= kvm_gmem_set_policy,
+ #endif
  };

  static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
···
  	.release = kvm_gmem_release,
  	.fallocate = kvm_gmem_fallocate,
  };
-
- void kvm_gmem_init(struct module *module)
- {
- 	kvm_gmem_fops.owner = module;
- }

  static int kvm_gmem_migrate_folio(struct address_space *mapping,
  				  struct folio *dst, struct folio *src,
···

  static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
  {
- 	const char *anon_name = "[kvm-gmem]";
- 	struct kvm_gmem *gmem;
+ 	static const char *name = "[kvm-gmem]";
+ 	struct gmem_file *f;
  	struct inode *inode;
  	struct file *file;
  	int fd, err;
···
  	if (fd < 0)
  		return fd;

- 	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
- 	if (!gmem) {
+ 	f = kzalloc(sizeof(*f), GFP_KERNEL);
+ 	if (!f) {
  		err = -ENOMEM;
  		goto err_fd;
  	}

- 	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
- 					 O_RDWR, NULL);
- 	if (IS_ERR(file)) {
- 		err = PTR_ERR(file);
+ 	/* __fput() will take care of fops_put(). */
+ 	if (!fops_get(&kvm_gmem_fops)) {
+ 		err = -ENOENT;
  		goto err_gmem;
  	}

- 	file->f_flags |= O_LARGEFILE;
+ 	inode = anon_inode_make_secure_inode(kvm_gmem_mnt->mnt_sb, name, NULL);
+ 	if (IS_ERR(inode)) {
+ 		err = PTR_ERR(inode);
+ 		goto err_fops;
+ 	}

- 	inode = file->f_inode;
- 	WARN_ON(file->f_mapping != inode->i_mapping);
-
- 	inode->i_private = (void *)(unsigned long)flags;
  	inode->i_op = &kvm_gmem_iops;
  	inode->i_mapping->a_ops = &kvm_gmem_aops;
  	inode->i_mode |= S_IFREG;
···
  	/* Unmovable mappings are supposed to be marked unevictable as well. */
  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));

+ 	GMEM_I(inode)->flags = flags;
+
+ 	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
+ 	if (IS_ERR(file)) {
+ 		err = PTR_ERR(file);
+ 		goto err_inode;
+ 	}
+
+ 	file->f_flags |= O_LARGEFILE;
+ 	file->private_data = f;
+
  	kvm_get_kvm(kvm);
- 	gmem->kvm = kvm;
- 	xa_init(&gmem->bindings);
- 	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+ 	f->kvm = kvm;
+ 	xa_init(&f->bindings);
+ 	list_add(&f->entry, &inode->i_mapping->i_private_list);

  	fd_install(fd, file);
  	return fd;

+ err_inode:
+ 	iput(inode);
+ err_fops:
+ 	fops_put(&kvm_gmem_fops);
  err_gmem:
- 	kfree(gmem);
+ 	kfree(f);
  err_fd:
  	put_unused_fd(fd);
  	return err;
···
  {
  	loff_t size = slot->npages << PAGE_SHIFT;
  	unsigned long start, end;
- 	struct kvm_gmem *gmem;
+ 	struct gmem_file *f;
  	struct inode *inode;
  	struct file *file;
  	int r = -EINVAL;
···
  	if (file->f_op != &kvm_gmem_fops)
  		goto err;

- 	gmem = file->private_data;
- 	if (gmem->kvm != kvm)
+ 	f = file->private_data;
+ 	if (f->kvm != kvm)
  		goto err;

  	inode = file_inode(file);
···
  	start = offset >> PAGE_SHIFT;
  	end = start + slot->npages;

- 	if (!xa_empty(&gmem->bindings) &&
- 	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+ 	if (!xa_empty(&f->bindings) &&
+ 	    xa_find(&f->bindings, &start, end - 1, XA_PRESENT)) {
  		filemap_invalidate_unlock(inode->i_mapping);
  		goto err;
  	}
···
  	if (kvm_gmem_supports_mmap(inode))
  		slot->flags |= KVM_MEMSLOT_GMEM_ONLY;

- 	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
+ 	xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
  	filemap_invalidate_unlock(inode->i_mapping);

  	/*
···
  	return r;
  }

- static void __kvm_gmem_unbind(struct kvm_memory_slot *slot, struct kvm_gmem *gmem)
+ static void __kvm_gmem_unbind(struct kvm_memory_slot *slot, struct gmem_file *f)
  {
  	unsigned long start = slot->gmem.pgoff;
  	unsigned long end = start + slot->npages;

- 	xa_store_range(&gmem->bindings, start, end - 1, NULL, GFP_KERNEL);
+ 	xa_store_range(&f->bindings, start, end - 1, NULL, GFP_KERNEL);

  	/*
  	 * synchronize_srcu(&kvm->srcu) ensured that kvm_gmem_get_pfn()
···

  void kvm_gmem_unbind(struct kvm_memory_slot *slot)
  {
- 	struct file *file;
-
  	/*
  	 * Nothing to do if the underlying file was _already_ closed, as
  	 * kvm_gmem_release() invalidates and nullifies all bindings.
···
  	if (!slot->gmem.file)
  		return;

- 	file = kvm_gmem_get_file(slot);
+ 	CLASS(gmem_get_file, file)(slot);

  	/*
  	 * However, if the file is _being_ closed, then the bindings need to be
···
  	filemap_invalidate_lock(file->f_mapping);
  	__kvm_gmem_unbind(slot, file->private_data);
  	filemap_invalidate_unlock(file->f_mapping);
-
- 	fput(file);
  }

  /* Returns a locked folio on success. */
···
  			    pgoff_t index, kvm_pfn_t *pfn,
  			    bool *is_prepared, int *max_order)
  {
- 	struct file *gmem_file = READ_ONCE(slot->gmem.file);
- 	struct kvm_gmem *gmem = file->private_data;
+ 	struct file *slot_file = READ_ONCE(slot->gmem.file);
+ 	struct gmem_file *f = file->private_data;
  	struct folio *folio;

- 	if (file != gmem_file) {
- 		WARN_ON_ONCE(gmem_file);
+ 	if (file != slot_file) {
+ 		WARN_ON_ONCE(slot_file);
  		return ERR_PTR(-EFAULT);
  	}

- 	gmem = file->private_data;
- 	if (xa_load(&gmem->bindings, index) != slot) {
- 		WARN_ON_ONCE(xa_load(&gmem->bindings, index));
+ 	if (xa_load(&f->bindings, index) != slot) {
+ 		WARN_ON_ONCE(xa_load(&f->bindings, index));
  		return ERR_PTR(-EIO);
  	}
···
  		     int *max_order)
  {
  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
- 	struct file *file = kvm_gmem_get_file(slot);
  	struct folio *folio;
  	bool is_prepared = false;
  	int r = 0;

+ 	CLASS(gmem_get_file, file)(slot);
  	if (!file)
  		return -EFAULT;

  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
- 	if (IS_ERR(folio)) {
- 		r = PTR_ERR(folio);
- 		goto out;
- 	}
+ 	if (IS_ERR(folio))
+ 		return PTR_ERR(folio);

  	if (!is_prepared)
  		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
···
  	else
  		folio_put(folio);

- out:
- 	fput(file);
  	return r;
  }
  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
···
  long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
  		       kvm_gmem_populate_cb post_populate, void *opaque)
  {
- 	struct file *file;
  	struct kvm_memory_slot *slot;
  	void __user *p;
···
  	if (!kvm_slot_has_gmem(slot))
  		return -EINVAL;

- 	file = kvm_gmem_get_file(slot);
+ 	CLASS(gmem_get_file, file)(slot);
  	if (!file)
  		return -EFAULT;
···
  	filemap_invalidate_unlock(file->f_mapping);

- 	fput(file);
  	return ret && !i ? ret : i;
  }
  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate);
  #endif
+
+ static struct kmem_cache *kvm_gmem_inode_cachep;
+
+ static void kvm_gmem_init_inode_once(void *__gi)
+ {
+ 	struct gmem_inode *gi = __gi;
+
+ 	/*
+ 	 * Note! Don't initialize the inode with anything specific to the
+ 	 * guest_memfd instance, or that might be specific to how the inode is
+ 	 * used (from the VFS-layer's perspective). This hook is called only
+ 	 * during the initial slab allocation, i.e. only fields/state that are
+ 	 * idempotent across _all_ use of the inode _object_ can be initialized
+ 	 * at this time!
+ 	 */
+ 	inode_init_once(&gi->vfs_inode);
+ }
+
+ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
+ {
+ 	struct gmem_inode *gi;
+
+ 	gi = alloc_inode_sb(sb, kvm_gmem_inode_cachep, GFP_KERNEL);
+ 	if (!gi)
+ 		return NULL;
+
+ 	mpol_shared_policy_init(&gi->policy, NULL);
+
+ 	gi->flags = 0;
+ 	return &gi->vfs_inode;
+ }
+
+ static void kvm_gmem_destroy_inode(struct inode *inode)
+ {
+ 	mpol_free_shared_policy(&GMEM_I(inode)->policy);
+ }
+
+ static void kvm_gmem_free_inode(struct inode *inode)
+ {
+ 	kmem_cache_free(kvm_gmem_inode_cachep, GMEM_I(inode));
+ }
+
+ static const struct super_operations kvm_gmem_super_operations = {
+ 	.statfs		= simple_statfs,
+ 	.alloc_inode	= kvm_gmem_alloc_inode,
+ 	.destroy_inode	= kvm_gmem_destroy_inode,
+ 	.free_inode	= kvm_gmem_free_inode,
+ };
+
+ static int kvm_gmem_init_fs_context(struct fs_context *fc)
+ {
+ 	struct pseudo_fs_context *ctx;
+
+ 	if (!init_pseudo(fc, GUEST_MEMFD_MAGIC))
+ 		return -ENOMEM;
+
+ 	fc->s_iflags |= SB_I_NOEXEC;
+ 	fc->s_iflags |= SB_I_NODEV;
+ 	ctx = fc->fs_private;
+ 	ctx->ops = &kvm_gmem_super_operations;
+
+ 	return 0;
+ }
+
+ static struct file_system_type kvm_gmem_fs = {
+ 	.name		= "guest_memfd",
+ 	.init_fs_context = kvm_gmem_init_fs_context,
+ 	.kill_sb	= kill_anon_super,
+ };
+
+ static int kvm_gmem_init_mount(void)
+ {
+ 	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+
+ 	if (IS_ERR(kvm_gmem_mnt))
+ 		return PTR_ERR(kvm_gmem_mnt);
+
+ 	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+ 	return 0;
+ }
+
+ int kvm_gmem_init(struct module *module)
+ {
+ 	struct kmem_cache_args args = {
+ 		.align = 0,
+ 		.ctor = kvm_gmem_init_inode_once,
+ 	};
+ 	int ret;
+
+ 	kvm_gmem_fops.owner = module;
+ 	kvm_gmem_inode_cachep = kmem_cache_create("kvm_gmem_inode_cache",
+ 						  sizeof(struct gmem_inode),
+ 						  &args, SLAB_ACCOUNT);
+ 	if (!kvm_gmem_inode_cachep)
+ 		return -ENOMEM;
+
+ 	ret = kvm_gmem_init_mount();
+ 	if (ret) {
+ 		kmem_cache_destroy(kvm_gmem_inode_cachep);
+ 		return ret;
+ 	}
+ 	return 0;
+ }
+
+ void kvm_gmem_exit(void)
+ {
+ 	kern_unmount(kvm_gmem_mnt);
+ 	kvm_gmem_mnt = NULL;
+ 	rcu_barrier();
+ 	kmem_cache_destroy(kvm_gmem_inode_cachep);
+ }
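With .get_policy/.set_policy wired up to the inode's shared policy, a guest_memfd mapping can be given a NUMA policy from userspace like any other shared mapping. A minimal userspace sketch, assuming a gmem_fd already created with KVM_CREATE_GUEST_MEMFD and the GUEST_MEMFD_FLAG_MMAP flag; the helper name and error handling are illustrative only, and mbind() requires linking against libnuma or a raw syscall:

/* Hedged userspace sketch: bind a mmap()able guest_memfd range to one node. */
#include <numaif.h>     /* mbind(), MPOL_BIND */
#include <sys/mman.h>

static int gmem_bind_to_node(int gmem_fd, size_t size, int node)
{
    unsigned long nodemask = 1UL << node;
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                     gmem_fd, 0);

    if (mem == MAP_FAILED)
        return -1;

    /*
     * Stored in the gmem inode's shared policy and consulted by
     * kvm_gmem_get_folio() when allocating backing folios.
     */
    return mbind(mem, size, MPOL_BIND, &nodemask,
                 8 * sizeof(nodemask), 0);
}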
+10 -5
virt/kvm/kvm_main.c
···

  		yielded = kvm_vcpu_yield_to(vcpu);
  		if (yielded > 0) {
- 			WRITE_ONCE(kvm->last_boosted_vcpu, i);
+ 			WRITE_ONCE(kvm->last_boosted_vcpu, idx);
  			break;
  		} else if (yielded < 0 && !--try) {
  			break;
···
  		return r;

  	/*
- 	 * Some architectures have vcpu ioctls that are asynchronous to vcpu
- 	 * execution; mutex_lock() would break them.
+ 	 * Let arch code handle select vCPU ioctls without holding vcpu->mutex,
+ 	 * e.g. to support ioctls that can run asynchronously to vCPU execution.
  	 */
- 	r = kvm_arch_vcpu_async_ioctl(filp, ioctl, arg);
+ 	r = kvm_arch_vcpu_unlocked_ioctl(filp, ioctl, arg);
  	if (r != -ENOIOCTLCMD)
  		return r;
···
  	if (WARN_ON_ONCE(r))
  		goto err_vfio;

- 	kvm_gmem_init(module);
+ 	r = kvm_gmem_init(module);
+ 	if (r)
+ 		goto err_gmem;

  	r = kvm_init_virtualization();
  	if (r)
···
  err_register:
  	kvm_uninit_virtualization();
  err_virt:
+ 	kvm_gmem_exit();
+ err_gmem:
  	kvm_vfio_ops_exit();
  err_vfio:
  	kvm_async_pf_deinit();
···
  	for_each_possible_cpu(cpu)
  		free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
  	kmem_cache_destroy(kvm_vcpu_cache);
+ 	kvm_gmem_exit();
  	kvm_vfio_ops_exit();
  	kvm_async_pf_deinit();
  	kvm_irqfd_exit();
+5 -4
virt/kvm/kvm_mm.h
···
  #endif /* HAVE_KVM_PFNCACHE */

  #ifdef CONFIG_KVM_GUEST_MEMFD
- void kvm_gmem_init(struct module *module);
+ int kvm_gmem_init(struct module *module);
+ void kvm_gmem_exit(void);
  int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
  int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
  		  unsigned int fd, loff_t offset);
  void kvm_gmem_unbind(struct kvm_memory_slot *slot);
  #else
- static inline void kvm_gmem_init(struct module *module)
+ static inline int kvm_gmem_init(struct module *module)
  {
-
+ 	return 0;
  }
-
+ static inline void kvm_gmem_exit(void) {};
  static inline int kvm_gmem_bind(struct kvm *kvm,
  				struct kvm_memory_slot *slot,
  				unsigned int fd, loff_t offset)