Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

+28 -8

Documentation/virt/kvm/api.rst

···
3357 3357     - KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
3358 3358     - KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
3359 3359     - KVM_GUESTDBG_EXIT_PENDING: trigger an immediate guest exit [s390]
     3360 +   - KVM_GUESTDBG_BLOCKIRQ: avoid injecting interrupts/NMI/SMI [x86]
3360 3361
3361 3362   For example KVM_GUESTDBG_USE_SW_BP indicates that software breakpoints
3362 3363   are enabled in memory so we need to ensure breakpoint exceptions are
···
5209 5208   #define KVM_STATS_TYPE_CUMULATIVE (0x0 << KVM_STATS_TYPE_SHIFT)
5210 5209   #define KVM_STATS_TYPE_INSTANT (0x1 << KVM_STATS_TYPE_SHIFT)
5211 5210   #define KVM_STATS_TYPE_PEAK (0x2 << KVM_STATS_TYPE_SHIFT)
     5211 + #define KVM_STATS_TYPE_LINEAR_HIST (0x3 << KVM_STATS_TYPE_SHIFT)
     5212 + #define KVM_STATS_TYPE_LOG_HIST (0x4 << KVM_STATS_TYPE_SHIFT)
     5213 + #define KVM_STATS_TYPE_MAX KVM_STATS_TYPE_LOG_HIST
5212 5214
5213 5215   #define KVM_STATS_UNIT_SHIFT 4
5214 5216   #define KVM_STATS_UNIT_MASK (0xF << KVM_STATS_UNIT_SHIFT)
···
5219 5215   #define KVM_STATS_UNIT_BYTES (0x1 << KVM_STATS_UNIT_SHIFT)
5220 5216   #define KVM_STATS_UNIT_SECONDS (0x2 << KVM_STATS_UNIT_SHIFT)
5221 5217   #define KVM_STATS_UNIT_CYCLES (0x3 << KVM_STATS_UNIT_SHIFT)
     5218 + #define KVM_STATS_UNIT_MAX KVM_STATS_UNIT_CYCLES
5222 5219
5223 5220   #define KVM_STATS_BASE_SHIFT 8
5224 5221   #define KVM_STATS_BASE_MASK (0xF << KVM_STATS_BASE_SHIFT)
5225 5222   #define KVM_STATS_BASE_POW10 (0x0 << KVM_STATS_BASE_SHIFT)
5226 5223   #define KVM_STATS_BASE_POW2 (0x1 << KVM_STATS_BASE_SHIFT)
     5224 + #define KVM_STATS_BASE_MAX KVM_STATS_BASE_POW2
5227 5225
5228 5226   struct kvm_stats_desc {
5229 5227       __u32 flags;
5230 5228       __s16 exponent;
5231 5229       __u16 size;
5232 5230       __u32 offset;
5233      -     __u32 unused;
     5231 +     __u32 bucket_size;
5234 5232       char name[];
5235 5233   };
5236 5234
···
5243 5237   Bits 0-3 of ``flags`` encode the type:
5244 5238
5245 5239     * ``KVM_STATS_TYPE_CUMULATIVE``
5246      -     The statistics data is cumulative. The value of data can only be increased.
     5240 +     The statistics reports a cumulative count. The value of data can only be increased.
5247 5241       Most of the counters used in KVM are of this type.
5248 5242       The corresponding ``size`` field for this type is always 1.
5249 5243       All cumulative statistics data are read/write.
5250 5244     * ``KVM_STATS_TYPE_INSTANT``
5251      -     The statistics data is instantaneous. Its value can be increased or
     5245 +     The statistics reports an instantaneous value. Its value can be increased or
5252 5246       decreased. This type is usually used as a measurement of some resources,
5253 5247       like the number of dirty pages, the number of large pages, etc.
5254 5248       All instant statistics are read only.
5255 5249       The corresponding ``size`` field for this type is always 1.
5256 5250     * ``KVM_STATS_TYPE_PEAK``
5257      -     The statistics data is peak. The value of data can only be increased, and
5258      -     represents a peak value for a measurement, for example the maximum number
     5251 +     The statistics data reports a peak value, for example the maximum number
5259 5252       of items in a hash table bucket, the longest time waited and so on.
     5253 +     The value of data can only be increased.
5260 5254       The corresponding ``size`` field for this type is always 1.
     5255 +   * ``KVM_STATS_TYPE_LINEAR_HIST``
     5256 +     The statistic is reported as a linear histogram. The number of
     5257 +     buckets is specified by the ``size`` field. The size of buckets is specified
     5258 +     by the ``hist_param`` field. The range of the Nth bucket (1 <= N < ``size``)
     5259 +     is [``hist_param``*(N-1), ``hist_param``*N), while the range of the last
     5260 +     bucket is [``hist_param``*(``size``-1), +INF). (+INF means positive infinity
     5261 +     value.) The bucket value indicates how many samples fell in the bucket's range.
     5262 +   * ``KVM_STATS_TYPE_LOG_HIST``
     5263 +     The statistic is reported as a logarithmic histogram. The number of
     5264 +     buckets is specified by the ``size`` field. The range of the first bucket is
     5265 +     [0, 1), while the range of the last bucket is [pow(2, ``size``-2), +INF).
     5266 +     Otherwise, The Nth bucket (1 < N < ``size``) covers
     5267 +     [pow(2, N-2), pow(2, N-1)). The bucket value indicates how many samples fell
     5268 +     in the bucket's range.
5261 5269
5262 5270   Bits 4-7 of ``flags`` encode the unit:
5263 5271
···
5306 5286   The ``offset`` field is the offset from the start of Data Block to the start of
5307 5287   the corresponding statistics data.
5308 5288
5309      - The ``unused`` field is reserved for future support for other types of
5310      - statistics data, like log/linear histogram. Its value is always 0 for the types
5311      - defined above.
     5289 + The ``bucket_size`` field is used as a parameter for histogram statistics data.
     5290 + It is only used by linear histogram statistics data, specifying the size of a
     5291 + bucket.
5312 5292
5313 5293   The ``name`` field is the name string of the statistics data. The name string
5314 5294   starts at the end of ``struct kvm_stats_desc``. The maximum length including
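The bucket ranges documented above can be turned into index arithmetic. The following is a minimal sketch, not code from this merge: `linear_hist_index()` and `log_hist_index()` are hypothetical helper names, indices are zero-based (the documentation counts buckets from 1), and the sketch assumes that the ``hist_param`` mentioned in the bucket description is the same value as the ``bucket_size`` struct field.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers (not part of the KVM uAPI): map a sample to a
 * zero-based bucket index, following the ranges in the documentation.
 */

/* Linear histogram: bucket i covers [bucket_size * i, bucket_size * (i + 1)). */
static inline uint32_t linear_hist_index(uint64_t value, uint32_t bucket_size,
                                         uint16_t size)
{
    assert(bucket_size > 0 && size > 0);

    uint64_t index = value / bucket_size;

    /* The last bucket is open-ended: [bucket_size * (size - 1), +INF). */
    return index < size ? (uint32_t)index : (uint32_t)(size - 1);
}

/* Logarithmic histogram: bucket 0 is [0, 1), bucket i >= 1 is [2^(i-1), 2^i). */
static inline uint32_t log_hist_index(uint64_t value, uint16_t size)
{
    assert(size > 0);

    uint32_t index = 0;

    /* Count significant bits: 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, ... */
    while (value) {
        index++;
        value >>= 1;
    }

    /* The last bucket is open-ended: [2^(size - 2), +INF). */
    return index < size ? index : (uint32_t)(size - 1);
}
```

With a bucket size of 10 and four buckets, a sample of 29 lands in index 2 and anything from 30 upward lands in the open-ended last bucket. A loop is used for the logarithm so the sketch does not depend on compiler builtins.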

+6

Documentation/virt/kvm/locking.rst

···
  21   21     can be taken inside a kvm->srcu read-side critical section,
  22   22     while kvm->slots_lock cannot.
  23   23
       24 + - kvm->mn_active_invalidate_count ensures that pairs of
       25 +   invalidate_range_start() and invalidate_range_end() callbacks
       26 +   use the same memslots array. kvm->slots_lock and kvm->slots_arch_lock
       27 +   are taken on the waiting side in install_new_memslots, so MMU notifiers
       28 +   must not take either kvm->slots_lock or kvm->slots_arch_lock.
       29 +
  24   30   On x86:
  25   31
  26   32   - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock

+9 -9

arch/arm64/include/asm/cpufeature.h

···
 602  602   {
 603  603       u32 val = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL1_SHIFT);
 604  604
 605      -     return val == ID_AA64PFR0_EL1_32BIT_64BIT;
      605 +     return val == ID_AA64PFR0_ELx_32BIT_64BIT;
 606  606   }
 607  607
 608  608   static inline bool id_aa64pfr0_32bit_el0(u64 pfr0)
 609  609   {
 610  610       u32 val = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL0_SHIFT);
 611  611
 612      -     return val == ID_AA64PFR0_EL0_32BIT_64BIT;
      612 +     return val == ID_AA64PFR0_ELx_32BIT_64BIT;
 613  613   }
 614  614
 615  615   static inline bool id_aa64pfr0_sve(u64 pfr0)
···
 784  784   static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
 785  785   {
 786  786       switch (parange) {
 787      -     case 0: return 32;
 788      -     case 1: return 36;
 789      -     case 2: return 40;
 790      -     case 3: return 42;
 791      -     case 4: return 44;
 792      -     case 5: return 48;
 793      -     case 6: return 52;
      787 +     case ID_AA64MMFR0_PARANGE_32: return 32;
      788 +     case ID_AA64MMFR0_PARANGE_36: return 36;
      789 +     case ID_AA64MMFR0_PARANGE_40: return 40;
      790 +     case ID_AA64MMFR0_PARANGE_42: return 42;
      791 +     case ID_AA64MMFR0_PARANGE_44: return 44;
      792 +     case ID_AA64MMFR0_PARANGE_48: return 48;
      793 +     case ID_AA64MMFR0_PARANGE_52: return 52;
 794  794       /*
 795  795        * A future PE could use a value unknown to the kernel.
 796  796        * However, by the "D10.1.4 Principles of the ID scheme

+38 -16

arch/arm64/include/asm/kvm_arm.h

···
  12   12   #include <asm/types.h>
  13   13
  14   14   /* Hyp Configuration Register (HCR) bits */
       15 +
       16 + #define HCR_TID5 (UL(1) << 58)
       17 + #define HCR_DCT (UL(1) << 57)
  15   18   #define HCR_ATA_SHIFT 56
  16   19   #define HCR_ATA (UL(1) << HCR_ATA_SHIFT)
       20 + #define HCR_AMVOFFEN (UL(1) << 51)
       21 + #define HCR_FIEN (UL(1) << 47)
  17   22   #define HCR_FWB (UL(1) << 46)
  18   23   #define HCR_API (UL(1) << 41)
  19   24   #define HCR_APK (UL(1) << 40)
···
  37   32   #define HCR_TVM (UL(1) << 26)
  38   33   #define HCR_TTLB (UL(1) << 25)
  39   34   #define HCR_TPU (UL(1) << 24)
  40      - #define HCR_TPC (UL(1) << 23)
       35 + #define HCR_TPC (UL(1) << 23) /* HCR_TPCP if FEAT_DPB */
  41   36   #define HCR_TSW (UL(1) << 22)
  42      - #define HCR_TAC (UL(1) << 21)
       37 + #define HCR_TACR (UL(1) << 21)
  43   38   #define HCR_TIDCP (UL(1) << 20)
  44   39   #define HCR_TSC (UL(1) << 19)
  45   40   #define HCR_TID3 (UL(1) << 18)
···
  61   56   #define HCR_PTW (UL(1) << 2)
  62   57   #define HCR_SWIO (UL(1) << 1)
  63   58   #define HCR_VM (UL(1) << 0)
       59 + #define HCR_RES0 ((UL(1) << 48) | (UL(1) << 39))
  64   60
  65   61   /*
  66   62    * The bits we set in HCR:
  67   63    * TLOR: Trap LORegion register accesses
  68   64    * RW: 64bit by default, can be overridden for 32bit VMs
  69      -  * TAC: Trap ACTLR
       65 +  * TACR: Trap ACTLR
  70   66    * TSC: Trap SMC
  71   67    * TSW: Trap cache operations by set/way
  72   68    * TWE: Trap WFE
···
  82   76    * PTW: Take a stage2 fault if a stage1 walk steps in device memory
  83   77    */
  84   78   #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
  85      -                          HCR_BSU_IS | HCR_FB | HCR_TAC | \
       79 +                          HCR_BSU_IS | HCR_FB | HCR_TACR | \
  86   80                            HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
  87   81                            HCR_FMO | HCR_IMO | HCR_PTW )
  88   82   #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
···
 281  275   #define CPTR_EL2_TTA (1 << 20)
 282  276   #define CPTR_EL2_TFP (1 << CPTR_EL2_TFP_SHIFT)
 283  277   #define CPTR_EL2_TZ (1 << 8)
 284      - #define CPTR_EL2_RES1 0x000032ff /* known RES1 bits in CPTR_EL2 */
 285      - #define CPTR_EL2_DEFAULT CPTR_EL2_RES1
      278 + #define CPTR_NVHE_EL2_RES1 0x000032ff /* known RES1 bits in CPTR_EL2 (nVHE) */
      279 + #define CPTR_EL2_DEFAULT CPTR_NVHE_EL2_RES1
      280 + #define CPTR_NVHE_EL2_RES0 (GENMASK(63, 32) | \
      281 +                             GENMASK(29, 21) | \
      282 +                             GENMASK(19, 14) | \
      283 +                             BIT(11))
 286  284
 287  285   /* Hyp Debug Configuration Register bits */
 288  286   #define MDCR_EL2_E2TB_MASK (UL(0x3))
 289  287   #define MDCR_EL2_E2TB_SHIFT (UL(24))
 290      - #define MDCR_EL2_TTRF (1 << 19)
 291      - #define MDCR_EL2_TPMS (1 << 14)
      288 + #define MDCR_EL2_HPMFZS (UL(1) << 36)
      289 + #define MDCR_EL2_HPMFZO (UL(1) << 29)
      290 + #define MDCR_EL2_MTPME (UL(1) << 28)
      291 + #define MDCR_EL2_TDCC (UL(1) << 27)
      292 + #define MDCR_EL2_HCCD (UL(1) << 23)
      293 + #define MDCR_EL2_TTRF (UL(1) << 19)
      294 + #define MDCR_EL2_HPMD (UL(1) << 17)
      295 + #define MDCR_EL2_TPMS (UL(1) << 14)
 292  296   #define MDCR_EL2_E2PB_MASK (UL(0x3))
 293  297   #define MDCR_EL2_E2PB_SHIFT (UL(12))
 294      - #define MDCR_EL2_TDRA (1 << 11)
 295      - #define MDCR_EL2_TDOSA (1 << 10)
 296      - #define MDCR_EL2_TDA (1 << 9)
 297      - #define MDCR_EL2_TDE (1 << 8)
 298      - #define MDCR_EL2_HPME (1 << 7)
 299      - #define MDCR_EL2_TPM (1 << 6)
 300      - #define MDCR_EL2_TPMCR (1 << 5)
 301      - #define MDCR_EL2_HPMN_MASK (0x1F)
      298 + #define MDCR_EL2_TDRA (UL(1) << 11)
      299 + #define MDCR_EL2_TDOSA (UL(1) << 10)
      300 + #define MDCR_EL2_TDA (UL(1) << 9)
      301 + #define MDCR_EL2_TDE (UL(1) << 8)
      302 + #define MDCR_EL2_HPME (UL(1) << 7)
      303 + #define MDCR_EL2_TPM (UL(1) << 6)
      304 + #define MDCR_EL2_TPMCR (UL(1) << 5)
      305 + #define MDCR_EL2_HPMN_MASK (UL(0x1F))
      306 + #define MDCR_EL2_RES0 (GENMASK(63, 37) | \
      307 +                        GENMASK(35, 30) | \
      308 +                        GENMASK(25, 24) | \
      309 +                        GENMASK(22, 20) | \
      310 +                        BIT(18) | \
      311 +                        GENMASK(16, 15))
 302  312
 303  313   /* For compatibility with fault code shared with 32-bit */
 304  314   #define FSC_FAULT ESR_ELx_FSC_FAULT

+3 -4

arch/arm64/include/asm/kvm_asm.h

···
  59   59   #define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs 13
  60   60   #define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs 14
  61   61   #define __KVM_HOST_SMCCC_FUNC___pkvm_init 15
  62      - #define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings 16
       62 + #define __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp 16
  63   63   #define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
  64   64   #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18
  65   65   #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19
  66      - #define __KVM_HOST_SMCCC_FUNC___pkvm_mark_hyp 20
  67      - #define __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc 21
       66 + #define __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc 20
  68   67
  69   68   #ifndef __ASSEMBLY__
  70   69
···
 209  210   extern void __vgic_v3_write_vmcr(u32 vmcr);
 210  211   extern void __vgic_v3_init_lrs(void);
 211  212
 212      - extern u32 __kvm_get_mdcr_el2(void);
      213 + extern u64 __kvm_get_mdcr_el2(void);
 213  214
 214  215   #define __KVM_EXTABLE(from, to) \
 215  216       " .pushsection __kvm_ex_table, \"a\"\n" \

+13 -4

arch/arm64/include/asm/kvm_host.h

···
  66   66   extern unsigned int kvm_sve_max_vl;
  67   67   int kvm_arm_init_sve(void);
  68   68
  69      - int __attribute_const__ kvm_target_cpu(void);
       69 + u32 __attribute_const__ kvm_target_cpu(void);
  70   70   int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
  71   71   void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
  72   72
···
 185  185       PMCNTENSET_EL0, /* Count Enable Set Register */
 186  186       PMINTENSET_EL1, /* Interrupt Enable Set Register */
 187  187       PMOVSSET_EL0, /* Overflow Flag Status Set Register */
 188      -     PMSWINC_EL0, /* Software Increment Register */
 189  188       PMUSERENR_EL0, /* User Enable Register */
 190  189
 191  190       /* Pointer Authentication Registers in a strict increasing order. */
···
 286  287       /* Stage 2 paging state used by the hardware on next switch */
 287  288       struct kvm_s2_mmu *hw_mmu;
 288  289
 289      -     /* HYP configuration */
      290 +     /* Values of trap registers for the guest. */
 290  291       u64 hcr_el2;
 291      -     u32 mdcr_el2;
      292 +     u64 mdcr_el2;
      293 +     u64 cptr_el2;
      294 +
      295 +     /* Values of trap registers for the host before guest entry. */
      296 +     u64 mdcr_el2_host;
 292  297
 293  298       /* Exception Information */
 294  299       struct kvm_vcpu_fault_info fault;
···
 579  576       u64 wfi_exit_stat;
 580  577       u64 mmio_exit_user;
 581  578       u64 mmio_exit_kernel;
      579 +     u64 signal_exits;
 582  580       u64 exits;
 583  581   };
 584  582
···
 774  770   void kvm_arch_free_vm(struct kvm *kvm);
 775  771
 776  772   int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type);
      773 +
      774 + static inline bool kvm_vm_is_protected(struct kvm *kvm)
      775 + {
      776 +     return false;
      777 + }
 777  778
 778  779   int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature);
 779  780   bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);

+1 -1

arch/arm64/include/asm/kvm_hyp.h

···
  95   95
  96   96   #ifndef __KVM_NVHE_HYPERVISOR__
  97   97   void activate_traps_vhe_load(struct kvm_vcpu *vcpu);
  98      - void deactivate_traps_vhe_put(void);
       98 + void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu);
  99   99   #endif
 100  100
 101  101   u64 __guest_enter(struct kvm_vcpu *vcpu);

+9 -8

arch/arm64/include/asm/kvm_mmu.h

···
 252  252
 253  253   #define kvm_phys_to_vttbr(addr) phys_to_ttbr(addr)
 254  254
      255 + /*
      256 +  * When this is (directly or indirectly) used on the TLB invalidation
      257 +  * path, we rely on a previously issued DSB so that page table updates
      258 +  * and VMID reads are correctly ordered.
      259 +  */
 255  260   static __always_inline u64 kvm_get_vttbr(struct kvm_s2_mmu *mmu)
 256  261   {
 257  262       struct kvm_vmid *vmid = &mmu->vmid;
···
 264  259       u64 cnp = system_supports_cnp() ? VTTBR_CNP_BIT : 0;
 265  260
 266  261       baddr = mmu->pgd_phys;
 267      -     vmid_field = (u64)vmid->vmid << VTTBR_VMID_SHIFT;
      262 +     vmid_field = (u64)READ_ONCE(vmid->vmid) << VTTBR_VMID_SHIFT;
 268  263       return kvm_phys_to_vttbr(baddr) | vmid_field | cnp;
 269  264   }
 270  265
···
 272  267    * Must be called from hyp code running at EL2 with an updated VTTBR
 273  268    * and interrupts disabled.
 274  269    */
 275      - static __always_inline void __load_stage2(struct kvm_s2_mmu *mmu, unsigned long vtcr)
      270 + static __always_inline void __load_stage2(struct kvm_s2_mmu *mmu,
      271 +                                           struct kvm_arch *arch)
 276  272   {
 277      -     write_sysreg(vtcr, vtcr_el2);
      273 +     write_sysreg(arch->vtcr, vtcr_el2);
 278  274       write_sysreg(kvm_get_vttbr(mmu), vttbr_el2);
 279  275
 280  276       /*
···
 284  278        * the guest.
 285  279        */
 286  280       asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
 287      - }
 288      -
 289      - static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
 290      - {
 291      -     __load_stage2(mmu, kern_hyp_va(mmu->arch)->vtcr);
 292  281   }
 293  282
 294  283   static inline struct kvm *kvm_s2_mmu_to_kvm(struct kvm_s2_mmu *mmu)
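The `READ_ONCE()` added to `kvm_get_vttbr()` pairs with a `WRITE_ONCE()` on the VMID elsewhere in this merge, so that a concurrent reader sees one whole value. A minimal userspace sketch of the same composition follows. Everything in it is a stand-in: `READ_ONCE_U32` imitates the kernel macro with a volatile access, the `VTTBR_*` values are assumed to match the kernel's definitions, `make_vttbr()` is a hypothetical name, and `kvm_phys_to_vttbr()` is treated as the identity, which would not hold for 52-bit physical addresses.

```c
#include <stdint.h>

/* Assumed values, restated here only so the sketch is self-contained. */
#define VTTBR_CNP_BIT    (UINT64_C(1) << 0)
#define VTTBR_VMID_SHIFT 48

/* Stand-in for the kernel's READ_ONCE(): force a single load of the value. */
#define READ_ONCE_U32(p) (*(const volatile uint32_t *)(p))

/*
 * Hypothetical helper: combine the stage-2 table base, the VMID and the
 * CnP bit into a VTTBR-style value. The VMID is loaded exactly once, so a
 * concurrent update cannot be observed halfway through the computation.
 */
static inline uint64_t make_vttbr(uint64_t pgd_phys, const uint32_t *vmid,
                                  int cnp)
{
    uint64_t vmid_field = (uint64_t)READ_ONCE_U32(vmid) << VTTBR_VMID_SHIFT;

    return pgd_phys | vmid_field | (cnp ? VTTBR_CNP_BIT : 0);
}
```

The volatile access only constrains the compiler. As the comment added in the diff notes, ordering against page-table updates on the TLB invalidation path still relies on a previously issued DSB.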

+126 -50

arch/arm64/include/asm/kvm_pgtable.h

···
  25   25
  26   26   typedef u64 kvm_pte_t;
  27   27
       28 + #define KVM_PTE_VALID BIT(0)
       29 +
       30 + #define KVM_PTE_ADDR_MASK GENMASK(47, PAGE_SHIFT)
       31 + #define KVM_PTE_ADDR_51_48 GENMASK(15, 12)
       32 +
       33 + static inline bool kvm_pte_valid(kvm_pte_t pte)
       34 + {
       35 +     return pte & KVM_PTE_VALID;
       36 + }
       37 +
       38 + static inline u64 kvm_pte_to_phys(kvm_pte_t pte)
       39 + {
       40 +     u64 pa = pte & KVM_PTE_ADDR_MASK;
       41 +
       42 +     if (PAGE_SHIFT == 16)
       43 +         pa |= FIELD_GET(KVM_PTE_ADDR_51_48, pte) << 48;
       44 +
       45 +     return pa;
       46 + }
       47 +
       48 + static inline u64 kvm_granule_shift(u32 level)
       49 + {
       50 +     /* Assumes KVM_PGTABLE_MAX_LEVELS is 4 */
       51 +     return ARM64_HW_PGTABLE_LEVEL_SHIFT(level);
       52 + }
       53 +
       54 + static inline u64 kvm_granule_size(u32 level)
       55 + {
       56 +     return BIT(kvm_granule_shift(level));
       57 + }
       58 +
       59 + static inline bool kvm_level_supports_block_mapping(u32 level)
       60 + {
       61 +     /*
       62 +      * Reject invalid block mappings and don't bother with 4TB mappings for
       63 +      * 52-bit PAs.
       64 +      */
       65 +     return !(level == 0 || (PAGE_SIZE != SZ_4K && level == 1));
       66 + }
       67 +
  28   68   /**
  29   69    * struct kvm_pgtable_mm_ops - Memory management callbacks.
  30   70    * @zalloc_page: Allocate a single zeroed memory page.
···
 116   76   };
 117   77
 118   78   /**
       79 +  * enum kvm_pgtable_prot - Page-table permissions and attributes.
       80 +  * @KVM_PGTABLE_PROT_X: Execute permission.
       81 +  * @KVM_PGTABLE_PROT_W: Write permission.
       82 +  * @KVM_PGTABLE_PROT_R: Read permission.
       83 +  * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
       84 +  * @KVM_PGTABLE_PROT_SW0: Software bit 0.
       85 +  * @KVM_PGTABLE_PROT_SW1: Software bit 1.
       86 +  * @KVM_PGTABLE_PROT_SW2: Software bit 2.
       87 +  * @KVM_PGTABLE_PROT_SW3: Software bit 3.
       88 +  */
       89 + enum kvm_pgtable_prot {
       90 +     KVM_PGTABLE_PROT_X = BIT(0),
       91 +     KVM_PGTABLE_PROT_W = BIT(1),
       92 +     KVM_PGTABLE_PROT_R = BIT(2),
       93 +
       94 +     KVM_PGTABLE_PROT_DEVICE = BIT(3),
       95 +
       96 +     KVM_PGTABLE_PROT_SW0 = BIT(55),
       97 +     KVM_PGTABLE_PROT_SW1 = BIT(56),
       98 +     KVM_PGTABLE_PROT_SW2 = BIT(57),
       99 +     KVM_PGTABLE_PROT_SW3 = BIT(58),
      100 + };
      101 +
      102 + #define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
      103 + #define KVM_PGTABLE_PROT_RWX (KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
      104 +
      105 + #define PKVM_HOST_MEM_PROT KVM_PGTABLE_PROT_RWX
      106 + #define PKVM_HOST_MMIO_PROT KVM_PGTABLE_PROT_RW
      107 +
      108 + #define PAGE_HYP KVM_PGTABLE_PROT_RW
      109 + #define PAGE_HYP_EXEC (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
      110 + #define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
      111 + #define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
      112 +
      113 + typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
      114 +                                            enum kvm_pgtable_prot prot);
      115 +
      116 + /**
 119  117    * struct kvm_pgtable - KVM page-table.
 120  118    * @ia_bits: Maximum input address size, in bits.
 121  119    * @start_level: Level at which the page-table walk starts.
 122  120    * @pgd: Pointer to the first top-level entry of the page-table.
 123  121    * @mm_ops: Memory management callbacks.
 124  122    * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
      123 +  * @flags: Stage-2 page-table flags.
      124 +  * @force_pte_cb: Function that returns true if page level mappings must
      125 +  * be used instead of block mappings.
 125  126    */
 126  127   struct kvm_pgtable {
 127  128       u32 ia_bits;
···
 173   92       /* Stage-2 only */
 174   93       struct kvm_s2_mmu *mmu;
 175   94       enum kvm_pgtable_stage2_flags flags;
 176      - };
 177      -
 178      - /**
 179      -  * enum kvm_pgtable_prot - Page-table permissions and attributes.
 180      -  * @KVM_PGTABLE_PROT_X: Execute permission.
 181      -  * @KVM_PGTABLE_PROT_W: Write permission.
 182      -  * @KVM_PGTABLE_PROT_R: Read permission.
 183      -  * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
 184      -  */
 185      - enum kvm_pgtable_prot {
 186      -     KVM_PGTABLE_PROT_X = BIT(0),
 187      -     KVM_PGTABLE_PROT_W = BIT(1),
 188      -     KVM_PGTABLE_PROT_R = BIT(2),
 189      -
 190      -     KVM_PGTABLE_PROT_DEVICE = BIT(3),
 191      - };
 192      -
 193      - #define PAGE_HYP (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
 194      - #define PAGE_HYP_EXEC (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
 195      - #define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
 196      - #define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
 197      -
 198      - /**
 199      -  * struct kvm_mem_range - Range of Intermediate Physical Addresses
 200      -  * @start: Start of the range.
 201      -  * @end: End of the range.
 202      -  */
 203      - struct kvm_mem_range {
 204      -     u64 start;
 205      -     u64 end;
       95 +     kvm_pgtable_force_pte_cb_t force_pte_cb;
 206   96   };
 207   97
 208   98   /**
···
 268  216   u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
 269  217
 270  218   /**
 271      -  * kvm_pgtable_stage2_init_flags() - Initialise a guest stage-2 page-table.
      219 +  * __kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
 272  220    * @pgt: Uninitialised page-table structure to initialise.
 273  221    * @arch: Arch-specific KVM structure representing the guest virtual
 274  222    * machine.
 275  223    * @mm_ops: Memory management callbacks.
 276  224    * @flags: Stage-2 configuration flags.
      225 +  * @force_pte_cb: Function that returns true if page level mappings must
      226 +  * be used instead of block mappings.
 277  227    *
 278  228    * Return: 0 on success, negative error code on failure.
 279  229    */
 280      - int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
 281      -                                   struct kvm_pgtable_mm_ops *mm_ops,
 282      -                                   enum kvm_pgtable_stage2_flags flags);
      230 + int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_arch *arch,
      231 +                               struct kvm_pgtable_mm_ops *mm_ops,
      232 +                               enum kvm_pgtable_stage2_flags flags,
      233 +                               kvm_pgtable_force_pte_cb_t force_pte_cb);
 283  234
 284  235   #define kvm_pgtable_stage2_init(pgt, arch, mm_ops) \
 285      -     kvm_pgtable_stage2_init_flags(pgt, arch, mm_ops, 0)
      236 +     __kvm_pgtable_stage2_init(pgt, arch, mm_ops, 0, NULL)
 286  237
 287  238   /**
 288  239    * kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
···
 429  374    * If there is a valid, leaf page-table entry used to translate @addr, then
 430  375    * relax the permissions in that entry according to the read, write and
 431  376    * execute permissions specified by @prot. No permissions are removed, and
 432      -  * TLB invalidation is performed after updating the entry.
      377 +  * TLB invalidation is performed after updating the entry. Software bits cannot
      378 +  * be set or cleared using kvm_pgtable_stage2_relax_perms().
 433  379    *
 434  380    * Return: 0 on success, negative error code on failure.
 435  381    */
···
 489  433                        struct kvm_pgtable_walker *walker);
 490  434
 491  435   /**
 492      -  * kvm_pgtable_stage2_find_range() - Find a range of Intermediate Physical
 493      -  * Addresses with compatible permission
 494      -  * attributes.
 495      -  * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
 496      -  * @addr: Address that must be covered by the range.
 497      -  * @prot: Protection attributes that the range must be compatible with.
 498      -  * @range: Range structure used to limit the search space at call time and
 499      -  * that will hold the result.
      436 +  * kvm_pgtable_get_leaf() - Walk a page-table and retrieve the leaf entry
      437 +  * with its level.
      438 +  * @pgt: Page-table structure initialised by kvm_pgtable_*_init()
      439 +  * or a similar initialiser.
      440 +  * @addr: Input address for the start of the walk.
      441 +  * @ptep: Pointer to storage for the retrieved PTE.
      442 +  * @level: Pointer to storage for the level of the retrieved PTE.
 500  443    *
 501      -  * The offset of @addr within a page is ignored. An IPA is compatible with @prot
 502      -  * iff its corresponding stage-2 page-table entry has default ownership and, if
 503      -  * valid, is mapped with protection attributes identical to @prot.
      444 +  * The offset of @addr within a page is ignored.
      445 +  *
      446 +  * The walker will walk the page-table entries corresponding to the input
      447 +  * address specified, retrieving the leaf corresponding to this address.
      448 +  * Invalid entries are treated as leaf entries.
 504  449    *
 505  450    * Return: 0 on success, negative error code on failure.
 506  451    */
 507      - int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
 508      -                                   enum kvm_pgtable_prot prot,
 509      -                                   struct kvm_mem_range *range);
      452 + int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
      453 +                          kvm_pte_t *ptep, u32 *level);
      454 +
      455 + /**
      456 +  * kvm_pgtable_stage2_pte_prot() - Retrieve the protection attributes of a
      457 +  * stage-2 Page-Table Entry.
      458 +  * @pte: Page-table entry
      459 +  *
      460 +  * Return: protection attributes of the page-table entry in the enum
      461 +  * kvm_pgtable_prot format.
      462 +  */
      463 + enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte);
      464 +
      465 + /**
      466 +  * kvm_pgtable_hyp_pte_prot() - Retrieve the protection attributes of a stage-1
      467 +  * Page-Table Entry.
      468 +  * @pte: Page-table entry
      469 +  *
      470 +  * Return: protection attributes of the page-table entry in the enum
      471 +  * kvm_pgtable_prot format.
      472 +  */
      473 + enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);
 510  474   #endif /* __ARM64_KVM_PGTABLE_H__ */
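The `kvm_pte_to_phys()` helper moved into this header folds physical address bits [51:48] back in from PTE bits [15:12] when the page size is 64K. Here is a minimal userspace sketch of that bit manipulation, assuming a 64K granule. `PAGE_SHIFT_SKETCH`, `GENMASK_U64`, `pte_valid()` and `pte_to_phys()` are stand-ins written for this example, and `FIELD_GET()` is replaced by an explicit mask and shift.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 64K pages, i.e. a page shift of 16. */
#define PAGE_SHIFT_SKETCH 16

/* Stand-in for the kernel's GENMASK(): bits h..l set, inclusive. */
#define GENMASK_U64(h, l) \
    (((~UINT64_C(0)) >> (63 - (h))) & ((~UINT64_C(0)) << (l)))

#define PTE_VALID      (UINT64_C(1) << 0)
#define PTE_ADDR_MASK  GENMASK_U64(47, PAGE_SHIFT_SKETCH)
#define PTE_ADDR_51_48 GENMASK_U64(15, 12)

static inline bool pte_valid(uint64_t pte)
{
    return pte & PTE_VALID;
}

static inline uint64_t pte_to_phys(uint64_t pte)
{
    /* Output address bits [47:PAGE_SHIFT] sit in place inside the PTE. */
    uint64_t pa = pte & PTE_ADDR_MASK;

    /* With 64K pages, PTE bits [15:12] carry PA bits [51:48]. */
    if (PAGE_SHIFT_SKETCH == 16)
        pa |= ((pte & PTE_ADDR_51_48) >> 12) << 48;

    return pa;
}
```

For instance, a PTE of `0x000012345678A001` is valid and decodes to the physical address `0x000A123456780000`: the `0xA` nibble in bits [15:12] becomes bits [51:48] of the result.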

+22 -4

arch/arm64/include/asm/sysreg.h

···
 784  784   #define ID_AA64PFR0_AMU 0x1
 785  785   #define ID_AA64PFR0_SVE 0x1
 786  786   #define ID_AA64PFR0_RAS_V1 0x1
      787 + #define ID_AA64PFR0_RAS_V1P1 0x2
 787  788   #define ID_AA64PFR0_FP_NI 0xf
 788  789   #define ID_AA64PFR0_FP_SUPPORTED 0x0
 789  790   #define ID_AA64PFR0_ASIMD_NI 0xf
 790  791   #define ID_AA64PFR0_ASIMD_SUPPORTED 0x0
 791      - #define ID_AA64PFR0_EL1_64BIT_ONLY 0x1
 792      - #define ID_AA64PFR0_EL1_32BIT_64BIT 0x2
 793      - #define ID_AA64PFR0_EL0_64BIT_ONLY 0x1
 794      - #define ID_AA64PFR0_EL0_32BIT_64BIT 0x2
      792 + #define ID_AA64PFR0_ELx_64BIT_ONLY 0x1
      793 + #define ID_AA64PFR0_ELx_32BIT_64BIT 0x2
 795  794
 796  795   /* id_aa64pfr1 */
 797  796   #define ID_AA64PFR1_MPAMFRAC_SHIFT 16
···
 846  847   #define ID_AA64MMFR0_ASID_SHIFT 4
 847  848   #define ID_AA64MMFR0_PARANGE_SHIFT 0
 848  849
      850 + #define ID_AA64MMFR0_ASID_8 0x0
      851 + #define ID_AA64MMFR0_ASID_16 0x2
      852 +
 849  853   #define ID_AA64MMFR0_TGRAN4_NI 0xf
 850  854   #define ID_AA64MMFR0_TGRAN4_SUPPORTED_MIN 0x0
 851  855   #define ID_AA64MMFR0_TGRAN4_SUPPORTED_MAX 0x7
···
 859  857   #define ID_AA64MMFR0_TGRAN16_SUPPORTED_MIN 0x1
 860  858   #define ID_AA64MMFR0_TGRAN16_SUPPORTED_MAX 0xf
 861  859
      860 + #define ID_AA64MMFR0_PARANGE_32 0x0
      861 + #define ID_AA64MMFR0_PARANGE_36 0x1
      862 + #define ID_AA64MMFR0_PARANGE_40 0x2
      863 + #define ID_AA64MMFR0_PARANGE_42 0x3
      864 + #define ID_AA64MMFR0_PARANGE_44 0x4
 862  865   #define ID_AA64MMFR0_PARANGE_48 0x5
 863  866   #define ID_AA64MMFR0_PARANGE_52 0x6
      867 +
      868 + #define ARM64_MIN_PARANGE_BITS 32
 864  869
 865  870   #define ID_AA64MMFR0_TGRAN_2_SUPPORTED_DEFAULT 0x0
 866  871   #define ID_AA64MMFR0_TGRAN_2_SUPPORTED_NONE 0x1
···
 913  904   #define ID_AA64MMFR2_CNP_SHIFT 0
 914  905
 915  906   /* id_aa64dfr0 */
      907 + #define ID_AA64DFR0_MTPMU_SHIFT 48
 916  908   #define ID_AA64DFR0_TRBE_SHIFT 44
 917  909   #define ID_AA64DFR0_TRACE_FILT_SHIFT 40
 918  910   #define ID_AA64DFR0_DOUBLELOCK_SHIFT 36
···
1044 1034   #define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN4_SHIFT
1045 1035   #define ID_AA64MMFR0_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_TGRAN4_SUPPORTED_MIN
1046 1036   #define ID_AA64MMFR0_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_TGRAN4_SUPPORTED_MAX
     1037 + #define ID_AA64MMFR0_TGRAN_2_SHIFT ID_AA64MMFR0_TGRAN4_2_SHIFT
1047 1038   #elif defined(CONFIG_ARM64_16K_PAGES)
1048 1039   #define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN16_SHIFT
1049 1040   #define ID_AA64MMFR0_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_TGRAN16_SUPPORTED_MIN
1050 1041   #define ID_AA64MMFR0_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_TGRAN16_SUPPORTED_MAX
     1042 + #define ID_AA64MMFR0_TGRAN_2_SHIFT ID_AA64MMFR0_TGRAN16_2_SHIFT
1051 1043   #elif defined(CONFIG_ARM64_64K_PAGES)
1052 1044   #define ID_AA64MMFR0_TGRAN_SHIFT ID_AA64MMFR0_TGRAN64_SHIFT
1053 1045   #define ID_AA64MMFR0_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_TGRAN64_SUPPORTED_MIN
1054 1046   #define ID_AA64MMFR0_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_TGRAN64_SUPPORTED_MAX
     1047 + #define ID_AA64MMFR0_TGRAN_2_SHIFT ID_AA64MMFR0_TGRAN64_2_SHIFT
1055 1048   #endif
1056 1049
1057 1050   #define MVFR2_FPMISC_SHIFT 4
···
1184 1171   #define ICH_VTR_SEIS_MASK (1 << ICH_VTR_SEIS_SHIFT)
1185 1172   #define ICH_VTR_A3V_SHIFT 21
1186 1173   #define ICH_VTR_A3V_MASK (1 << ICH_VTR_A3V_SHIFT)
     1174 +
     1175 + #define ARM64_FEATURE_FIELD_BITS 4
     1176 +
     1177 + /* Create a mask for the feature bits of the specified feature. */
     1178 + #define ARM64_FEATURE_MASK(x) (GENMASK_ULL(x##_SHIFT + ARM64_FEATURE_FIELD_BITS - 1, x##_SHIFT))
1187 1179
1188 1180   #ifdef __ASSEMBLY__
1189 1181
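`ARM64_FEATURE_MASK()` builds a 4-bit mask from a field's `_SHIFT` constant, and the new `ID_AA64MMFR0_PARANGE_*` names give the encodings of one such field. A minimal userspace sketch of how the two fit together is shown below. `FEATURE_MASK()`, `feature_field()` and `parange_to_phys_shift()` are hypothetical stand-ins rather than kernel functions, the shift is passed directly instead of being token-pasted, and the `default` case returns 0 purely for illustration (the kernel chooses its own fallback for unknown encodings).

```c
#include <stdint.h>

#define FEATURE_FIELD_BITS 4

/* Stand-in for the kernel's GENMASK_ULL(): bits h..l set, inclusive. */
#define GENMASK_U64(h, l) \
    (((~UINT64_C(0)) >> (63 - (h))) & ((~UINT64_C(0)) << (l)))

/* Mask covering the 4-bit ID register field that starts at 'shift'. */
#define FEATURE_MASK(shift) \
    GENMASK_U64((shift) + FEATURE_FIELD_BITS - 1, (shift))

/* Shifts as listed in the diff for ID_AA64MMFR0_EL1. */
#define MMFR0_ASID_SHIFT    4
#define MMFR0_PARANGE_SHIFT 0

/* Extract one unsigned 4-bit field from an ID register value. */
static inline unsigned int feature_field(uint64_t reg, unsigned int shift)
{
    return (unsigned int)((reg & FEATURE_MASK(shift)) >> shift);
}

/* Map a PARange encoding to a physical address width in bits. */
static inline unsigned int parange_to_phys_shift(unsigned int parange)
{
    switch (parange) {
    case 0x0: return 32;
    case 0x1: return 36;
    case 0x2: return 40;
    case 0x3: return 42;
    case 0x4: return 44;
    case 0x5: return 48;
    case 0x6: return 52;
    default:  return 0; /* sketch only: unknown encoding */
    }
}
```

For a register value whose low byte is `0x25`, the ASID field reads 2 (16-bit ASIDs) and the PARange field reads 5, which maps to a 48-bit physical address space.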

+4 -4

arch/arm64/kernel/cpufeature.c

···
 240  240       S_ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR0_FP_SHIFT, 4, ID_AA64PFR0_FP_NI),
 241  241       ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL3_SHIFT, 4, 0),
 242  242       ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL2_SHIFT, 4, 0),
 243      -     ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_SHIFT, 4, ID_AA64PFR0_EL1_64BIT_ONLY),
 244      -     ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL0_SHIFT, 4, ID_AA64PFR0_EL0_64BIT_ONLY),
      243 +     ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_SHIFT, 4, ID_AA64PFR0_ELx_64BIT_ONLY),
      244 +     ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL0_SHIFT, 4, ID_AA64PFR0_ELx_64BIT_ONLY),
 245  245       ARM64_FTR_END,
 246  246   };
 247  247
···
1983 1983       .sys_reg = SYS_ID_AA64PFR0_EL1,
1984 1984       .sign = FTR_UNSIGNED,
1985 1985       .field_pos = ID_AA64PFR0_EL0_SHIFT,
1986      -     .min_field_value = ID_AA64PFR0_EL0_32BIT_64BIT,
     1986 +     .min_field_value = ID_AA64PFR0_ELx_32BIT_64BIT,
1987 1987   },
1988 1988   #ifdef CONFIG_KVM
1989 1989   {
···
1994 1994       .sys_reg = SYS_ID_AA64PFR0_EL1,
1995 1995       .sign = FTR_UNSIGNED,
1996 1996       .field_pos = ID_AA64PFR0_EL1_SHIFT,
1997      -     .min_field_value = ID_AA64PFR0_EL1_32BIT_64BIT,
     1997 +     .min_field_value = ID_AA64PFR0_ELx_32BIT_64BIT,
1998 1998   },
1999 1999   {
2000 2000       .desc = "Protected KVM",

+2 -2

arch/arm64/kernel/vmlinux.lds.S

···
 181  181       /* everything from this point to __init_begin will be marked RO NX */
 182  182       RO_DATA(PAGE_SIZE)
 183  183
      184 +     HYPERVISOR_DATA_SECTIONS
      185 +
 184  186       idmap_pg_dir = .;
 185  187       . += IDMAP_DIR_SIZE;
 186  188       idmap_pg_end = .;
···
 261  259       _data = .;
 262  260       _sdata = .;
 263  261       RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
 264      -
 265      -     HYPERVISOR_DATA_SECTIONS
 266  262
 267  263       /*
 268  264        * Data written with the MMU off but read with the MMU on requires

+10

arch/arm64/kvm/Kconfig

···
  26   26       select HAVE_KVM_ARCH_TLB_FLUSH_ALL
  27   27       select KVM_MMIO
  28   28       select KVM_GENERIC_DIRTYLOG_READ_PROTECT
       29 +     select KVM_XFER_TO_GUEST_WORK
  29   30       select SRCU
  30   31       select KVM_VFIO
  31   32       select HAVE_KVM_EVENTFD
···
  46   45   if KVM
  47   46
  48   47   source "virt/kvm/Kconfig"
       48 +
       49 + config NVHE_EL2_DEBUG
       50 +     bool "Debug mode for non-VHE EL2 object"
       51 +     help
       52 +       Say Y here to enable the debug mode for the non-VHE KVM EL2 object.
       53 +       Failure reports will BUG() in the hypervisor. This is intended for
       54 +       local EL2 hypervisor development.
       55 +
       56 +       If unsure, say N.
  49   57
  50   58   endif # KVM
  51   59

+63 -98

arch/arm64/kvm/arm.c

···
   6    6
   7    7   #include <linux/bug.h>
   8    8   #include <linux/cpu_pm.h>
        9 + #include <linux/entry-kvm.h>
   9   10   #include <linux/errno.h>
  10   11   #include <linux/err.h>
  11   12   #include <linux/kvm_host.h>
···
  16   15   #include <linux/fs.h>
  17   16   #include <linux/mman.h>
  18   17   #include <linux/sched.h>
       18 + #include <linux/kmemleak.h>
  19   19   #include <linux/kvm.h>
  20   20   #include <linux/kvm_irqfd.h>
  21   21   #include <linux/irqbypass.h>
···
  43   41   #include <kvm/arm_hypercalls.h>
  44   42   #include <kvm/arm_pmu.h>
  45   43   #include <kvm/arm_psci.h>
  46      -
  47      - #ifdef REQUIRES_VIRT
  48      - __asm__(".arch_extension virt");
  49      - #endif
  50   44
  51   45   static enum kvm_mode kvm_mode = KVM_MODE_DEFAULT;
  52   46   DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
···
 573  575           kvm_call_hyp(__kvm_flush_vm_context);
 574  576       }
 575  577
 576      -     vmid->vmid = kvm_next_vmid;
      578 +     WRITE_ONCE(vmid->vmid, kvm_next_vmid);
 577  579       kvm_next_vmid++;
 578  580       kvm_next_vmid &= (1 << kvm_get_vmid_bits()) - 1;
 579  581
···
 717  719   }
 718  720
 719  721   /**
      722 +  * kvm_vcpu_exit_request - returns true if the VCPU should *not* enter the guest
      723 +  * @vcpu: The VCPU pointer
      724 +  * @ret: Pointer to write optional return code
      725 +  *
      726 +  * Returns: true if the VCPU needs to return to a preemptible + interruptible
      727 +  * and skip guest entry.
      728 +  *
      729 +  * This function disambiguates between two different types of exits: exits to a
      730 +  * preemptible + interruptible kernel context and exits to userspace. For an
      731 +  * exit to userspace, this function will write the return code to ret and return
      732 +  * true. For an exit to preemptible + interruptible kernel context (i.e. check
      733 +  * for pending work and re-enter), return true without writing to ret.
      734 +  */
      735 + static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
      736 + {
      737 +     struct kvm_run *run = vcpu->run;
      738 +
      739 +     /*
      740 +      * If we're using a userspace irqchip, then check if we need
      741 +      * to tell a userspace irqchip about timer or PMU level
      742 +      * changes and if so, exit to userspace (the actual level
      743 +      * state gets updated in kvm_timer_update_run and
      744 +      * kvm_pmu_update_run below).
      745 +      */
      746 +     if (static_branch_unlikely(&userspace_irqchip_in_use)) {
      747 +         if (kvm_timer_should_notify_user(vcpu) ||
      748 +             kvm_pmu_should_notify_user(vcpu)) {
      749 +             *ret = -EINTR;
      750 +             run->exit_reason = KVM_EXIT_INTR;
      751 +             return true;
      752 +         }
      753 +     }
      754 +
      755 +     return kvm_request_pending(vcpu) ||
      756 +            need_new_vmid_gen(&vcpu->arch.hw_mmu->vmid) ||
      757 +            xfer_to_guest_mode_work_pending();
      758 + }
      759 +
      760 + /**
 720  761    * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
 721  762    * @vcpu: The VCPU pointer
 722  763    *
···
 798  761           /*
 799  762            * Check conditions before entering the guest
 800  763            */
 801      -         cond_resched();
      764 +         ret = xfer_to_guest_mode_handle_work(vcpu);
      765 +         if (!ret)
      766 +             ret = 1;
 802  767
 803  768           update_vmid(&vcpu->arch.hw_mmu->vmid);
 804  769
···
 820  781           kvm_vgic_flush_hwstate(vcpu);
 821  782
 822  783           /*
 823      -          * Exit if we have a signal pending so that we can deliver the
 824      -          * signal to user space.
 825      -          */
 826      -         if (signal_pending(current)) {
 827      -             ret = -EINTR;
 828      -             run->exit_reason = KVM_EXIT_INTR;
 829      -         }
 830      -
 831      -         /*
 832      -          * If we're using a userspace irqchip, then check if we need
 833      -          * to tell a userspace irqchip about timer or PMU level
 834      -          * changes and if so, exit to userspace (the actual level
 835      -          * state gets updated in kvm_timer_update_run and
 836      -          * kvm_pmu_update_run below).
 837      -          */
 838      -         if (static_branch_unlikely(&userspace_irqchip_in_use)) {
 839      -             if (kvm_timer_should_notify_user(vcpu) ||
 840      -                 kvm_pmu_should_notify_user(vcpu)) {
 841      -                 ret = -EINTR;
 842      -                 run->exit_reason = KVM_EXIT_INTR;
 843      -             }
 844      -         }
 845      -
 846      -         /*
 847  784            * Ensure we set mode to IN_GUEST_MODE after we disable
 848  785            * interrupts and before the final VCPU requests check.
 849  786            * See the comment in kvm_vcpu_exiting_guest_mode() and
···
 827  812            */
 828  813           smp_store_mb(vcpu->mode, IN_GUEST_MODE);
 829  814
 830      -         if (ret <= 0 || need_new_vmid_gen(&vcpu->arch.hw_mmu->vmid) ||
 831      -             kvm_request_pending(vcpu)) {
      815 +         if (ret <= 0 || kvm_vcpu_exit_request(vcpu, &ret)) {
 832  816               vcpu->mode = OUTSIDE_GUEST_MODE;
 833  817               isb(); /* Ensure work in x_flush_hwstate is committed */
 834  818               kvm_pmu_sync_hwstate(vcpu);
···
1053 1039                                  const struct kvm_vcpu_init *init)
1054 1040   {
1055 1041       unsigned int i, ret;
1056      -     int phys_target = kvm_target_cpu();
     1042 +     u32 phys_target = kvm_target_cpu();
1057 1043
1058 1044       if (init->target != phys_target)
1059 1045           return -EINVAL;
···
1122 1108       }
1123 1109
1124 1110       vcpu_reset_hcr(vcpu);
     1111 +     vcpu->arch.cptr_el2 = CPTR_EL2_DEFAULT;
1125 1112
1126 1113       /*
1127 1114        * Handle the "start in power-off" case.
···
1233 1218           r = -EFAULT;
1234 1219           if (copy_from_user(&reg, argp, sizeof(reg)))
1235 1220               break;
     1221 +
     1222 +         /*
     1223 +          * We could owe a reset due to PSCI. Handle the pending reset
     1224 +          * here to ensure userspace register accesses are ordered after
     1225 +          * the reset.
1226 + */ 1227 + if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu)) 1228 + kvm_reset_vcpu(vcpu); 1236 1229 1237 1230 if (ioctl == KVM_SET_ONE_REG) 1238 1231 r = kvm_arm_set_reg(vcpu, &reg); ··· 1723 1700 return true; 1724 1701 } 1725 1702 1726 - static int init_common_resources(void) 1727 - { 1728 - return kvm_set_ipa_limit(); 1729 - } 1730 - 1731 1703 static int init_subsystems(void) 1732 1704 { 1733 1705 int err = 0; ··· 1976 1958 WARN_ON(kvm_call_hyp_nvhe(__pkvm_prot_finalize)); 1977 1959 } 1978 1960 1979 - static inline int pkvm_mark_hyp(phys_addr_t start, phys_addr_t end) 1980 - { 1981 - return kvm_call_hyp_nvhe(__pkvm_mark_hyp, start, end); 1982 - } 1983 - 1984 - #define pkvm_mark_hyp_section(__section) \ 1985 - pkvm_mark_hyp(__pa_symbol(__section##_start), \ 1986 - __pa_symbol(__section##_end)) 1987 - 1988 1961 static int finalize_hyp_mode(void) 1989 1962 { 1990 - int cpu, ret; 1991 - 1992 1963 if (!is_protected_kvm_enabled()) 1993 1964 return 0; 1994 1965 1995 - ret = pkvm_mark_hyp_section(__hyp_idmap_text); 1996 - if (ret) 1997 - return ret; 1998 - 1999 - ret = pkvm_mark_hyp_section(__hyp_text); 2000 - if (ret) 2001 - return ret; 2002 - 2003 - ret = pkvm_mark_hyp_section(__hyp_rodata); 2004 - if (ret) 2005 - return ret; 2006 - 2007 - ret = pkvm_mark_hyp_section(__hyp_bss); 2008 - if (ret) 2009 - return ret; 2010 - 2011 - ret = pkvm_mark_hyp(hyp_mem_base, hyp_mem_base + hyp_mem_size); 2012 - if (ret) 2013 - return ret; 2014 - 2015 - for_each_possible_cpu(cpu) { 2016 - phys_addr_t start = virt_to_phys((void *)kvm_arm_hyp_percpu_base[cpu]); 2017 - phys_addr_t end = start + (PAGE_SIZE << nvhe_percpu_order()); 2018 - 2019 - ret = pkvm_mark_hyp(start, end); 2020 - if (ret) 2021 - return ret; 2022 - 2023 - start = virt_to_phys((void *)per_cpu(kvm_arm_hyp_stack_page, cpu)); 2024 - end = start + PAGE_SIZE; 2025 - ret = pkvm_mark_hyp(start, end); 2026 - if (ret) 2027 - return ret; 2028 - } 1966 + /* 1967 + * Exclude HYP BSS from kmemleak so that it doesn't get peeked 
1968 + * at, which would end badly once the section is inaccessible. 1969 + * None of other sections should ever be introspected. 1970 + */ 1971 + kmemleak_free_part(__hyp_bss_start, __hyp_bss_end - __hyp_bss_start); 2029 1972 2030 1973 /* 2031 1974 * Flip the static key upfront as that may no longer be possible ··· 1996 2017 on_each_cpu(_kvm_host_prot_finalize, NULL, 1); 1997 2018 1998 2019 return 0; 1999 - } 2000 - 2001 - static void check_kvm_target_cpu(void *ret) 2002 - { 2003 - *(int *)ret = kvm_target_cpu(); 2004 2020 } 2005 2021 2006 2022 struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr) ··· 2057 2083 int kvm_arch_init(void *opaque) 2058 2084 { 2059 2085 int err; 2060 - int ret, cpu; 2061 2086 bool in_hyp_mode; 2062 2087 2063 2088 if (!is_hyp_mode_available()) { ··· 2071 2098 kvm_info("Guests without required CPU erratum workarounds can deadlock system!\n" \ 2072 2099 "Only trusted guests should be used on this system.\n"); 2073 2100 2074 - for_each_online_cpu(cpu) { 2075 - smp_call_function_single(cpu, check_kvm_target_cpu, &ret, 1); 2076 - if (ret < 0) { 2077 - kvm_err("Error, CPU %d not supported!\n", cpu); 2078 - return -ENODEV; 2079 - } 2080 - } 2081 - 2082 - err = init_common_resources(); 2101 + err = kvm_set_ipa_limit(); 2083 2102 if (err) 2084 2103 return err; 2085 2104

+1 -1

arch/arm64/kvm/debug.c

··· 21 21 DBG_MDSCR_KDE | \ 22 22 DBG_MDSCR_MDE) 23 23 24 - static DEFINE_PER_CPU(u32, mdcr_el2); 24 + static DEFINE_PER_CPU(u64, mdcr_el2); 25 25 26 26 /** 27 27 * save/restore_guest_debug_regs

+3 -6

arch/arm64/kvm/guest.c

··· 31 31 const struct _kvm_stats_desc kvm_vm_stats_desc[] = { 32 32 KVM_GENERIC_VM_STATS() 33 33 }; 34 - static_assert(ARRAY_SIZE(kvm_vm_stats_desc) == 35 - sizeof(struct kvm_vm_stat) / sizeof(u64)); 36 34 37 35 const struct kvm_stats_header kvm_vm_stats_header = { 38 36 .name_size = KVM_STATS_NAME_SIZE, ··· 48 50 STATS_DESC_COUNTER(VCPU, wfi_exit_stat), 49 51 STATS_DESC_COUNTER(VCPU, mmio_exit_user), 50 52 STATS_DESC_COUNTER(VCPU, mmio_exit_kernel), 53 + STATS_DESC_COUNTER(VCPU, signal_exits), 51 54 STATS_DESC_COUNTER(VCPU, exits) 52 55 }; 53 - static_assert(ARRAY_SIZE(kvm_vcpu_stats_desc) == 54 - sizeof(struct kvm_vcpu_stat) / sizeof(u64)); 55 56 56 57 const struct kvm_stats_header kvm_vcpu_stats_header = { 57 58 .name_size = KVM_STATS_NAME_SIZE, ··· 839 842 return 0; 840 843 } 841 844 842 - int __attribute_const__ kvm_target_cpu(void) 845 + u32 __attribute_const__ kvm_target_cpu(void) 843 846 { 844 847 unsigned long implementor = read_cpuid_implementor(); 845 848 unsigned long part_number = read_cpuid_part_number(); ··· 871 874 872 875 int kvm_vcpu_preferred_target(struct kvm_vcpu_init *init) 873 876 { 874 - int target = kvm_target_cpu(); 877 + u32 target = kvm_target_cpu(); 875 878 876 879 if (target < 0) 877 880 return -ENODEV;

+17 -26

arch/arm64/kvm/handle_exit.c

··· 113 113 * guest and host are using the same debug facilities it will be up to 114 114 * userspace to re-inject the correct exception for guest delivery. 115 115 * 116 - * @return: 0 (while setting vcpu->run->exit_reason), -1 for error 116 + * @return: 0 (while setting vcpu->run->exit_reason) 117 117 */ 118 118 static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu) 119 119 { 120 120 struct kvm_run *run = vcpu->run; 121 121 u32 esr = kvm_vcpu_get_esr(vcpu); 122 - int ret = 0; 123 122 124 123 run->exit_reason = KVM_EXIT_DEBUG; 125 124 run->debug.arch.hsr = esr; 126 125 127 - switch (ESR_ELx_EC(esr)) { 128 - case ESR_ELx_EC_WATCHPT_LOW: 126 + if (ESR_ELx_EC(esr) == ESR_ELx_EC_WATCHPT_LOW) 129 127 run->debug.arch.far = vcpu->arch.fault.far_el2; 130 - fallthrough; 131 - case ESR_ELx_EC_SOFTSTP_LOW: 132 - case ESR_ELx_EC_BREAKPT_LOW: 133 - case ESR_ELx_EC_BKPT32: 134 - case ESR_ELx_EC_BRK64: 135 - break; 136 - default: 137 - kvm_err("%s: un-handled case esr: %#08x\n", 138 - __func__, (unsigned int) esr); 139 - ret = -1; 140 - break; 141 - } 142 128 143 - return ret; 129 + return 0; 144 130 } 145 131 146 132 static int kvm_handle_unknown_ec(struct kvm_vcpu *vcpu) ··· 278 292 kvm_handle_guest_serror(vcpu, kvm_vcpu_get_esr(vcpu)); 279 293 } 280 294 281 - void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr, u64 elr, 295 + void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr, 296 + u64 elr_virt, u64 elr_phys, 282 297 u64 par, uintptr_t vcpu, 283 298 u64 far, u64 hpfar) { 284 - u64 elr_in_kimg = __phys_to_kimg(__hyp_pa(elr)); 285 - u64 hyp_offset = elr_in_kimg - kaslr_offset() - elr; 299 + u64 elr_in_kimg = __phys_to_kimg(elr_phys); 300 + u64 hyp_offset = elr_in_kimg - kaslr_offset() - elr_virt; 286 301 u64 mode = spsr & PSR_MODE_MASK; 287 302 288 303 /* ··· 296 309 kvm_err("Invalid host exception to nVHE hyp!\n"); 297 310 } else if (ESR_ELx_EC(esr) == ESR_ELx_EC_BRK64 && 298 311 (esr & ESR_ELx_BRK64_ISS_COMMENT_MASK) == BUG_BRK_IMM) { 299 - struct 
bug_entry *bug = find_bug(elr_in_kimg); 300 312 const char *file = NULL; 301 313 unsigned int line = 0; 302 314 303 315 /* All hyp bugs, including warnings, are treated as fatal. */ 304 - if (bug) 305 - bug_get_file_line(bug, &file, &line); 316 + if (!is_protected_kvm_enabled() || 317 + IS_ENABLED(CONFIG_NVHE_EL2_DEBUG)) { 318 + struct bug_entry *bug = find_bug(elr_in_kimg); 319 + 320 + if (bug) 321 + bug_get_file_line(bug, &file, &line); 322 + } 306 323 307 324 if (file) 308 325 kvm_err("nVHE hyp BUG at: %s:%u!\n", file, line); 309 326 else 310 - kvm_err("nVHE hyp BUG at: %016llx!\n", elr + hyp_offset); 327 + kvm_err("nVHE hyp BUG at: %016llx!\n", elr_virt + hyp_offset); 311 328 } else { 312 - kvm_err("nVHE hyp panic at: %016llx!\n", elr + hyp_offset); 329 + kvm_err("nVHE hyp panic at: %016llx!\n", elr_virt + hyp_offset); 313 330 } 314 331 315 332 /* ··· 325 334 kvm_err("Hyp Offset: 0x%llx\n", hyp_offset); 326 335 327 336 panic("HYP panic:\nPS:%08llx PC:%016llx ESR:%08llx\nFAR:%016llx HPFAR:%016llx PAR:%016llx\nVCPU:%016lx\n", 328 - spsr, elr, esr, far, hpfar, par, vcpu); 337 + spsr, elr_virt, esr, far, hpfar, par, vcpu); 329 338 }
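Note on the offset arithmetic in nvhe_hyp_panic_handler() above: the handler derives hyp_offset from the kernel-image address of the faulting PC, the KASLR offset, and the EL2 virtual ELR, then prints elr_virt + hyp_offset. A minimal userspace sketch of that arithmetic follows. The helper names and the sample addresses are hypothetical; in the kernel the inputs come from __phys_to_kimg(), kaslr_offset() and the ELR handed up from EL2.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the computation in nvhe_hyp_panic_handler().
 * All inputs are plain integers here; unsigned arithmetic wraps modulo
 * 2^64, which is what makes the round trip below exact.
 */
static uint64_t hyp_offset(uint64_t elr_in_kimg, uint64_t kaslr_off,
                           uint64_t elr_virt)
{
    return elr_in_kimg - kaslr_off - elr_virt;
}

/* Address that ends up in the "nVHE hyp panic at:" style messages. */
static uint64_t reported_pc(uint64_t elr_virt, uint64_t offset)
{
    return elr_virt + offset;
}

/* Made-up sample addresses, used only to exercise the arithmetic. */
static void demo_hyp_offset(void)
{
    uint64_t kimg = 0xffff800010123456ULL;  /* assumed kernel-image address */
    uint64_t kaslr = 0x0000000008000000ULL; /* assumed KASLR offset */
    uint64_t elr = 0x0000c00012345678ULL;   /* assumed EL2 virtual ELR */

    /* The reported PC is the kernel-image address minus the KASLR offset. */
    assert(reported_pc(elr, hyp_offset(kimg, kaslr, elr)) == kimg - kaslr);
}
```

In other words, the printed address is the kernel-image address with the KASLR offset removed, which is the form that can be looked up in an unrandomized symbol table.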

+5 -1

arch/arm64/kvm/hyp/include/hyp/switch.h

··· 92 92 write_sysreg(0, pmselr_el0); 93 93 write_sysreg(ARMV8_PMU_USERENR_MASK, pmuserenr_el0); 94 94 } 95 + 96 + vcpu->arch.mdcr_el2_host = read_sysreg(mdcr_el2); 95 97 write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2); 96 98 } 97 99 98 - static inline void __deactivate_traps_common(void) 100 + static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) 99 101 { 102 + write_sysreg(vcpu->arch.mdcr_el2_host, mdcr_el2); 103 + 100 104 write_sysreg(0, hstr_el2); 101 105 if (kvm_arm_support_pmu_v3()) 102 106 write_sysreg(0, pmuserenr_el0);

+34 -3

arch/arm64/kvm/hyp/include/nvhe/mem_protect.h

··· 12 12 #include <asm/virt.h> 13 13 #include <nvhe/spinlock.h> 14 14 15 + /* 16 + * SW bits 0-1 are reserved to track the memory ownership state of each page: 17 + * 00: The page is owned exclusively by the page-table owner. 18 + * 01: The page is owned by the page-table owner, but is shared 19 + * with another entity. 20 + * 10: The page is shared with, but not owned by the page-table owner. 21 + * 11: Reserved for future use (lending). 22 + */ 23 + enum pkvm_page_state { 24 + PKVM_PAGE_OWNED = 0ULL, 25 + PKVM_PAGE_SHARED_OWNED = KVM_PGTABLE_PROT_SW0, 26 + PKVM_PAGE_SHARED_BORROWED = KVM_PGTABLE_PROT_SW1, 27 + }; 28 + 29 + #define PKVM_PAGE_STATE_PROT_MASK (KVM_PGTABLE_PROT_SW0 | KVM_PGTABLE_PROT_SW1) 30 + static inline enum kvm_pgtable_prot pkvm_mkstate(enum kvm_pgtable_prot prot, 31 + enum pkvm_page_state state) 32 + { 33 + return (prot & ~PKVM_PAGE_STATE_PROT_MASK) | state; 34 + } 35 + 36 + static inline enum pkvm_page_state pkvm_getstate(enum kvm_pgtable_prot prot) 37 + { 38 + return prot & PKVM_PAGE_STATE_PROT_MASK; 39 + } 40 + 15 41 struct host_kvm { 16 42 struct kvm_arch arch; 17 43 struct kvm_pgtable pgt; ··· 46 20 }; 47 21 extern struct host_kvm host_kvm; 48 22 49 - int __pkvm_prot_finalize(void); 50 - int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end); 23 + extern const u8 pkvm_hyp_id; 51 24 25 + int __pkvm_prot_finalize(void); 26 + int __pkvm_host_share_hyp(u64 pfn); 27 + 28 + bool addr_is_memory(phys_addr_t phys); 29 + int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot); 30 + int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id); 52 31 int kvm_host_prepare_stage2(void *pgt_pool_base); 53 32 void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt); 54 33 55 34 static __always_inline void __load_host_stage2(void) 56 35 { 57 36 if (static_branch_likely(&kvm_protected_mode_initialized)) 58 - __load_stage2(&host_kvm.arch.mmu, host_kvm.arch.vtcr); 37 + __load_stage2(&host_kvm.arch.mmu, 
&host_kvm.arch); 59 38 else 60 39 write_sysreg(0, vttbr_el2); 61 40 }
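Note on the page-state helpers above: pkvm_mkstate() clears the two software bits and ORs in the new state, and pkvm_getstate() masks everything else away. A minimal userspace sketch of the same encoding follows. The bit positions (55 and 56), the low-bit permission value, and every name in the block are assumptions for illustration, and the states are plain macros rather than an enum because standard C enum constants may not hold 64-bit values.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t prot_t;

/* Assumed positions of the two software bits; illustrative only. */
#define PROT_SW0 (1ULL << 55)
#define PROT_SW1 (1ULL << 56)
#define PAGE_STATE_MASK (PROT_SW0 | PROT_SW1)

/* The three states described in the header comment (11 is reserved). */
#define PAGE_OWNED           0ULL
#define PAGE_SHARED_OWNED    PROT_SW0
#define PAGE_SHARED_BORROWED PROT_SW1

/* Replace the state bits of prot, leaving all other bits untouched. */
static prot_t mkstate(prot_t prot, prot_t state)
{
    return (prot & ~PAGE_STATE_MASK) | state;
}

/* Extract only the state bits. */
static prot_t getstate(prot_t prot)
{
    return prot & PAGE_STATE_MASK;
}

static void demo_page_state(void)
{
    prot_t rw = 0x6; /* hypothetical permission bits */
    prot_t p = mkstate(rw, PAGE_SHARED_OWNED);

    assert(getstate(p) == PAGE_SHARED_OWNED);
    assert((p & ~PAGE_STATE_MASK) == rw); /* permissions preserved */
    assert(getstate(rw) == PAGE_OWNED);   /* no SW bits means "owned" */
}
```

Because mkstate() clears the mask before setting the new state, re-tagging a page never leaves stale state bits behind.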

+1 -2

arch/arm64/kvm/hyp/include/nvhe/mm.h

··· 23 23 int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back); 24 24 int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot); 25 25 int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot); 26 - int __pkvm_create_mappings(unsigned long start, unsigned long size, 27 - unsigned long phys, enum kvm_pgtable_prot prot); 26 + int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot); 28 27 unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size, 29 28 enum kvm_pgtable_prot prot); 30 29

+25

arch/arm64/kvm/hyp/include/nvhe/spinlock.h

··· 15 15 16 16 #include <asm/alternative.h> 17 17 #include <asm/lse.h> 18 + #include <asm/rwonce.h> 18 19 19 20 typedef union hyp_spinlock { 20 21 u32 __val; ··· 89 88 : 90 89 : "memory"); 91 90 } 91 + 92 + static inline bool hyp_spin_is_locked(hyp_spinlock_t *lock) 93 + { 94 + hyp_spinlock_t lockval = READ_ONCE(*lock); 95 + 96 + return lockval.owner != lockval.next; 97 + } 98 + 99 + #ifdef CONFIG_NVHE_EL2_DEBUG 100 + static inline void hyp_assert_lock_held(hyp_spinlock_t *lock) 101 + { 102 + /* 103 + * The __pkvm_init() path accesses protected data-structures without 104 + * holding locks as the other CPUs are guaranteed to not enter EL2 105 + * concurrently at this point in time. The point by which EL2 is 106 + * initialized on all CPUs is reflected in the pkvm static key, so 107 + * wait until it is set before checking the lock state. 108 + */ 109 + if (static_branch_likely(&kvm_protected_mode_initialized)) 110 + BUG_ON(!hyp_spin_is_locked(lock)); 111 + } 112 + #else 113 + static inline void hyp_assert_lock_held(hyp_spinlock_t *lock) { } 114 + #endif 92 115 93 116 #endif /* __ARM64_KVM_NVHE_SPINLOCK_H__ */
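Note on hyp_spin_is_locked() above: with a ticket lock, the lock is free exactly when the owner and next counters are equal, so the check reduces to comparing the two fields. The sketch below shows that invariant with C11 atomics in userspace. It is an illustration under assumptions, not the hypervisor implementation: the type and function names are made up, and it loads the two counters separately, whereas the kernel helper snapshots the whole lock word with a single READ_ONCE().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ticket lock: 'next' hands out tickets, 'owner' is served. */
struct ticket_lock {
    atomic_ushort owner;
    atomic_ushort next;
};

static void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->owner, 0);
    atomic_init(&l->next, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a ticket, then wait until it is being served. */
    unsigned short ticket = atomic_fetch_add(&l->next, 1);

    while (atomic_load(&l->owner) != ticket)
        ; /* spin */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    /* Serve the next ticket. */
    atomic_fetch_add(&l->owner, 1);
}

/* Held (or contended) whenever some ticket has not yet been released. */
static bool ticket_lock_is_locked(struct ticket_lock *l)
{
    return atomic_load(&l->owner) != atomic_load(&l->next);
}
```

As in the diff, such a check only says that somebody holds the lock, not that the caller does, which is why it suits a debug-only assertion rather than a correctness mechanism.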

+1 -1

arch/arm64/kvm/hyp/nvhe/debug-sr.c

··· 109 109 __debug_switch_to_host_common(vcpu); 110 110 } 111 111 112 - u32 __kvm_get_mdcr_el2(void) 112 + u64 __kvm_get_mdcr_el2(void) 113 113 { 114 114 return read_sysreg(mdcr_el2); 115 115 }

+17 -4

arch/arm64/kvm/hyp/nvhe/host.S

··· 7 7 #include <linux/linkage.h> 8 8 9 9 #include <asm/assembler.h> 10 + #include <asm/kvm_arm.h> 10 11 #include <asm/kvm_asm.h> 11 12 #include <asm/kvm_mmu.h> 12 13 ··· 86 85 87 86 mov x29, x0 88 87 88 + #ifdef CONFIG_NVHE_EL2_DEBUG 89 + /* Ensure host stage-2 is disabled */ 90 + mrs x0, hcr_el2 91 + bic x0, x0, #HCR_VM 92 + msr hcr_el2, x0 93 + isb 94 + tlbi vmalls12e1 95 + dsb nsh 96 + #endif 97 + 89 98 /* Load the panic arguments into x0-7 */ 90 99 mrs x0, esr_el2 91 - get_vcpu_ptr x4, x5 92 - mrs x5, far_el2 93 - mrs x6, hpfar_el2 94 - mov x7, xzr // Unused argument 100 + mov x4, x3 101 + mov x3, x2 102 + hyp_pa x3, x6 103 + get_vcpu_ptr x5, x6 104 + mrs x6, far_el2 105 + mrs x7, hpfar_el2 95 106 96 107 /* Enter the host, conditionally restoring the host context. */ 97 108 cbz x29, __host_enter_without_restoring

+4 -16

arch/arm64/kvm/hyp/nvhe/hyp-main.c

··· 140 140 cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot); 141 141 } 142 142 143 - static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt) 143 + static void handle___pkvm_host_share_hyp(struct kvm_cpu_context *host_ctxt) 144 144 { 145 - DECLARE_REG(unsigned long, start, host_ctxt, 1); 146 - DECLARE_REG(unsigned long, size, host_ctxt, 2); 147 - DECLARE_REG(unsigned long, phys, host_ctxt, 3); 148 - DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4); 145 + DECLARE_REG(u64, pfn, host_ctxt, 1); 149 146 150 - cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot); 147 + cpu_reg(host_ctxt, 1) = __pkvm_host_share_hyp(pfn); 151 148 } 152 149 153 150 static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt) ··· 159 162 static void handle___pkvm_prot_finalize(struct kvm_cpu_context *host_ctxt) 160 163 { 161 164 cpu_reg(host_ctxt, 1) = __pkvm_prot_finalize(); 162 - } 163 - 164 - static void handle___pkvm_mark_hyp(struct kvm_cpu_context *host_ctxt) 165 - { 166 - DECLARE_REG(phys_addr_t, start, host_ctxt, 1); 167 - DECLARE_REG(phys_addr_t, end, host_ctxt, 2); 168 - 169 - cpu_reg(host_ctxt, 1) = __pkvm_mark_hyp(start, end); 170 165 } 171 166 typedef void (*hcall_t)(struct kvm_cpu_context *); 172 167 ··· 182 193 HANDLE_FUNC(__vgic_v3_restore_aprs), 183 194 HANDLE_FUNC(__pkvm_init), 184 195 HANDLE_FUNC(__pkvm_cpu_set_vector), 185 - HANDLE_FUNC(__pkvm_create_mappings), 196 + HANDLE_FUNC(__pkvm_host_share_hyp), 186 197 HANDLE_FUNC(__pkvm_create_private_mapping), 187 198 HANDLE_FUNC(__pkvm_prot_finalize), 188 - HANDLE_FUNC(__pkvm_mark_hyp), 189 199 }; 190 200 191 201 static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)

+209 -43

arch/arm64/kvm/hyp/nvhe/mem_protect.c

··· 31 31 u64 id_aa64mmfr0_el1_sys_val; 32 32 u64 id_aa64mmfr1_el1_sys_val; 33 33 34 - static const u8 pkvm_hyp_id = 1; 34 + const u8 pkvm_hyp_id = 1; 35 35 36 36 static void *host_s2_zalloc_pages_exact(size_t size) 37 37 { ··· 89 89 id_aa64mmfr1_el1_sys_val, phys_shift); 90 90 } 91 91 92 + static bool host_stage2_force_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot); 93 + 92 94 int kvm_host_prepare_stage2(void *pgt_pool_base) 93 95 { 94 96 struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu; ··· 103 101 if (ret) 104 102 return ret; 105 103 106 - ret = kvm_pgtable_stage2_init_flags(&host_kvm.pgt, &host_kvm.arch, 107 - &host_kvm.mm_ops, KVM_HOST_S2_FLAGS); 104 + ret = __kvm_pgtable_stage2_init(&host_kvm.pgt, &host_kvm.arch, 105 + &host_kvm.mm_ops, KVM_HOST_S2_FLAGS, 106 + host_stage2_force_pte_cb); 108 107 if (ret) 109 108 return ret; 110 109 111 110 mmu->pgd_phys = __hyp_pa(host_kvm.pgt.pgd); 112 111 mmu->arch = &host_kvm.arch; 113 112 mmu->pgt = &host_kvm.pgt; 114 - mmu->vmid.vmid_gen = 0; 115 - mmu->vmid.vmid = 0; 113 + WRITE_ONCE(mmu->vmid.vmid_gen, 0); 114 + WRITE_ONCE(mmu->vmid.vmid, 0); 116 115 117 116 return 0; 118 117 } ··· 129 126 kvm_flush_dcache_to_poc(params, sizeof(*params)); 130 127 131 128 write_sysreg(params->hcr_el2, hcr_el2); 132 - __load_stage2(&host_kvm.arch.mmu, host_kvm.arch.vtcr); 129 + __load_stage2(&host_kvm.arch.mmu, &host_kvm.arch); 133 130 134 131 /* 135 132 * Make sure to have an ISB before the TLB maintenance below but only ··· 162 159 return kvm_pgtable_stage2_unmap(pgt, addr, BIT(pgt->ia_bits) - addr); 163 160 } 164 161 162 + struct kvm_mem_range { 163 + u64 start; 164 + u64 end; 165 + }; 166 + 165 167 static bool find_mem_range(phys_addr_t addr, struct kvm_mem_range *range) 166 168 { 167 169 int cur, left = 0, right = hyp_memblock_nr; ··· 197 189 return false; 198 190 } 199 191 192 + bool addr_is_memory(phys_addr_t phys) 193 + { 194 + struct kvm_mem_range range; 195 + 196 + return find_mem_range(phys, &range); 197 + } 198 + 199 + 
static bool is_in_mem_range(u64 addr, struct kvm_mem_range *range) 200 + { 201 + return range->start <= addr && addr < range->end; 202 + } 203 + 200 204 static bool range_is_memory(u64 start, u64 end) 201 205 { 202 - struct kvm_mem_range r1, r2; 206 + struct kvm_mem_range r; 203 207 204 - if (!find_mem_range(start, &r1) || !find_mem_range(end - 1, &r2)) 205 - return false; 206 - if (r1.start != r2.start) 208 + if (!find_mem_range(start, &r)) 207 209 return false; 208 210 209 - return true; 211 + return is_in_mem_range(end - 1, &r); 210 212 } 211 213 212 214 static inline int __host_stage2_idmap(u64 start, u64 end, ··· 226 208 prot, &host_s2_pool); 227 209 } 228 210 229 - static int host_stage2_idmap(u64 addr) 211 + /* 212 + * The pool has been provided with enough pages to cover all of memory with 213 + * page granularity, but it is difficult to know how much of the MMIO range 214 + * we will need to cover upfront, so we may need to 'recycle' the pages if we 215 + * run out. 216 + */ 217 + #define host_stage2_try(fn, ...) 
\ 218 + ({ \ 219 + int __ret; \ 220 + hyp_assert_lock_held(&host_kvm.lock); \ 221 + __ret = fn(__VA_ARGS__); \ 222 + if (__ret == -ENOMEM) { \ 223 + __ret = host_stage2_unmap_dev_all(); \ 224 + if (!__ret) \ 225 + __ret = fn(__VA_ARGS__); \ 226 + } \ 227 + __ret; \ 228 + }) 229 + 230 + static inline bool range_included(struct kvm_mem_range *child, 231 + struct kvm_mem_range *parent) 230 232 { 231 - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W; 232 - struct kvm_mem_range range; 233 - bool is_memory = find_mem_range(addr, &range); 233 + return parent->start <= child->start && child->end <= parent->end; 234 + } 235 + 236 + static int host_stage2_adjust_range(u64 addr, struct kvm_mem_range *range) 237 + { 238 + struct kvm_mem_range cur; 239 + kvm_pte_t pte; 240 + u32 level; 234 241 int ret; 235 242 236 - if (is_memory) 237 - prot |= KVM_PGTABLE_PROT_X; 243 + hyp_assert_lock_held(&host_kvm.lock); 244 + ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, &level); 245 + if (ret) 246 + return ret; 247 + 248 + if (kvm_pte_valid(pte)) 249 + return -EAGAIN; 250 + 251 + if (pte) 252 + return -EPERM; 253 + 254 + do { 255 + u64 granule = kvm_granule_size(level); 256 + cur.start = ALIGN_DOWN(addr, granule); 257 + cur.end = cur.start + granule; 258 + level++; 259 + } while ((level < KVM_PGTABLE_MAX_LEVELS) && 260 + !(kvm_level_supports_block_mapping(level) && 261 + range_included(&cur, range))); 262 + 263 + *range = cur; 264 + 265 + return 0; 266 + } 267 + 268 + int host_stage2_idmap_locked(phys_addr_t addr, u64 size, 269 + enum kvm_pgtable_prot prot) 270 + { 271 + hyp_assert_lock_held(&host_kvm.lock); 272 + 273 + return host_stage2_try(__host_stage2_idmap, addr, addr + size, prot); 274 + } 275 + 276 + int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) 277 + { 278 + hyp_assert_lock_held(&host_kvm.lock); 279 + 280 + return host_stage2_try(kvm_pgtable_stage2_set_owner, &host_kvm.pgt, 281 + addr, size, &host_s2_pool, owner_id); 282 + } 
283 + 284 + static bool host_stage2_force_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot) 285 + { 286 + /* 287 + * Block mappings must be used with care in the host stage-2 as a 288 + * kvm_pgtable_stage2_map() operation targeting a page in the range of 289 + * an existing block will delete the block under the assumption that 290 + * mappings in the rest of the block range can always be rebuilt lazily. 291 + * That assumption is correct for the host stage-2 with RWX mappings 292 + * targeting memory or RW mappings targeting MMIO ranges (see 293 + * host_stage2_idmap() below which implements some of the host memory 294 + * abort logic). However, this is not safe for any other mappings where 295 + * the host stage-2 page-table is in fact the only place where this 296 + * state is stored. In all those cases, it is safer to use page-level 297 + * mappings, hence avoiding to lose the state because of side-effects in 298 + * kvm_pgtable_stage2_map(). 299 + */ 300 + if (range_is_memory(addr, end)) 301 + return prot != PKVM_HOST_MEM_PROT; 302 + else 303 + return prot != PKVM_HOST_MMIO_PROT; 304 + } 305 + 306 + static int host_stage2_idmap(u64 addr) 307 + { 308 + struct kvm_mem_range range; 309 + bool is_memory = find_mem_range(addr, &range); 310 + enum kvm_pgtable_prot prot; 311 + int ret; 312 + 313 + prot = is_memory ? PKVM_HOST_MEM_PROT : PKVM_HOST_MMIO_PROT; 238 314 239 315 hyp_spin_lock(&host_kvm.lock); 240 - ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range); 316 + ret = host_stage2_adjust_range(addr, &range); 241 317 if (ret) 242 318 goto unlock; 243 319 244 - ret = __host_stage2_idmap(range.start, range.end, prot); 245 - if (ret != -ENOMEM) 246 - goto unlock; 247 - 248 - /* 249 - * The pool has been provided with enough pages to cover all of memory 250 - * with page granularity, but it is difficult to know how much of the 251 - * MMIO range we will need to cover upfront, so we may need to 'recycle' 252 - * the pages if we run out. 
253 - */ 254 - ret = host_stage2_unmap_dev_all(); 255 - if (ret) 256 - goto unlock; 257 - 258 - ret = __host_stage2_idmap(range.start, range.end, prot); 259 - 320 + ret = host_stage2_idmap_locked(range.start, range.end - range.start, prot); 260 321 unlock: 261 322 hyp_spin_unlock(&host_kvm.lock); 262 323 263 324 return ret; 264 325 } 265 326 266 - int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end) 327 + static inline bool check_prot(enum kvm_pgtable_prot prot, 328 + enum kvm_pgtable_prot required, 329 + enum kvm_pgtable_prot denied) 267 330 { 331 + return (prot & (required | denied)) == required; 332 + } 333 + 334 + int __pkvm_host_share_hyp(u64 pfn) 335 + { 336 + phys_addr_t addr = hyp_pfn_to_phys(pfn); 337 + enum kvm_pgtable_prot prot, cur; 338 + void *virt = __hyp_va(addr); 339 + enum pkvm_page_state state; 340 + kvm_pte_t pte; 268 341 int ret; 269 342 270 - /* 271 - * host_stage2_unmap_dev_all() currently relies on MMIO mappings being 272 - * non-persistent, so don't allow changing page ownership in MMIO range. 273 - */ 274 - if (!range_is_memory(start, end)) 343 + if (!addr_is_memory(addr)) 275 344 return -EINVAL; 276 345 277 346 hyp_spin_lock(&host_kvm.lock); 278 - ret = kvm_pgtable_stage2_set_owner(&host_kvm.pgt, start, end - start, 279 - &host_s2_pool, pkvm_hyp_id); 347 + hyp_spin_lock(&pkvm_pgd_lock); 348 + 349 + ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, NULL); 350 + if (ret) 351 + goto unlock; 352 + if (!pte) 353 + goto map_shared; 354 + 355 + /* 356 + * Check attributes in the host stage-2 PTE. We need the page to be: 357 + * - mapped RWX as we're sharing memory; 358 + * - not borrowed, as that implies absence of ownership. 
359 + * Otherwise, we can't let it got through 360 + */ 361 + cur = kvm_pgtable_stage2_pte_prot(pte); 362 + prot = pkvm_mkstate(0, PKVM_PAGE_SHARED_BORROWED); 363 + if (!check_prot(cur, PKVM_HOST_MEM_PROT, prot)) { 364 + ret = -EPERM; 365 + goto unlock; 366 + } 367 + 368 + state = pkvm_getstate(cur); 369 + if (state == PKVM_PAGE_OWNED) 370 + goto map_shared; 371 + 372 + /* 373 + * Tolerate double-sharing the same page, but this requires 374 + * cross-checking the hypervisor stage-1. 375 + */ 376 + if (state != PKVM_PAGE_SHARED_OWNED) { 377 + ret = -EPERM; 378 + goto unlock; 379 + } 380 + 381 + ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, NULL); 382 + if (ret) 383 + goto unlock; 384 + 385 + /* 386 + * If the page has been shared with the hypervisor, it must be 387 + * already mapped as SHARED_BORROWED in its stage-1. 388 + */ 389 + cur = kvm_pgtable_hyp_pte_prot(pte); 390 + prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED); 391 + if (!check_prot(cur, prot, ~prot)) 392 + ret = -EPERM; 393 + goto unlock; 394 + 395 + map_shared: 396 + /* 397 + * If the page is not yet shared, adjust mappings in both page-tables 398 + * while both locks are held. 399 + */ 400 + prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED); 401 + ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, prot); 402 + BUG_ON(ret); 403 + 404 + prot = pkvm_mkstate(PKVM_HOST_MEM_PROT, PKVM_PAGE_SHARED_OWNED); 405 + ret = host_stage2_idmap_locked(addr, PAGE_SIZE, prot); 406 + BUG_ON(ret); 407 + 408 + unlock: 409 + hyp_spin_unlock(&pkvm_pgd_lock); 280 410 hyp_spin_unlock(&host_kvm.lock); 281 411 282 - return ret != -EAGAIN ? ret : 0; 412 + return ret; 283 413 } 284 414 285 415 void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
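Note on check_prot() above: the expression (prot & (required | denied)) == required holds only when every required bit is set and no denied bit is set; bits in neither set are ignored. Passing ~required as the denied set, as the stage-1 cross-check does, turns it into an exact match. A minimal userspace sketch follows, with hypothetical bit values standing in for the kernel's kvm_pgtable_prot flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t prot_t;

/* Hypothetical flag values, for illustration only. */
#define PROT_X   (1ULL << 0)
#define PROT_W   (1ULL << 1)
#define PROT_R   (1ULL << 2)
#define PROT_SW0 (1ULL << 55) /* stands in for "shared, owned" */
#define PROT_SW1 (1ULL << 56) /* stands in for "shared, borrowed" */
#define PROT_RWX (PROT_R | PROT_W | PROT_X)

/* True iff all 'required' bits are set and no 'denied' bit is set. */
static bool check_prot(prot_t prot, prot_t required, prot_t denied)
{
    return (prot & (required | denied)) == required;
}

static void demo_check_prot(void)
{
    /* RWX and not borrowed: acceptable. */
    assert(check_prot(PROT_RWX, PROT_RWX, PROT_SW1));
    /* A bit outside both sets (SW0 here) is ignored. */
    assert(check_prot(PROT_RWX | PROT_SW0, PROT_RWX, PROT_SW1));
    /* Borrowed pages are rejected. */
    assert(!check_prot(PROT_RWX | PROT_SW1, PROT_RWX, PROT_SW1));
    /* Missing a required permission is rejected. */
    assert(!check_prot(PROT_R | PROT_W, PROT_RWX, PROT_SW1));
}
```

The two call sites in __pkvm_host_share_hyp() use both modes: a permissive check on the host stage-2 entry, and an exact match on the hypervisor stage-1 entry.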

+18 -4

arch/arm64/kvm/hyp/nvhe/mm.c

··· 23 23 struct memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS]; 24 24 unsigned int hyp_memblock_nr; 25 25 26 - int __pkvm_create_mappings(unsigned long start, unsigned long size, 27 - unsigned long phys, enum kvm_pgtable_prot prot) 26 + static int __pkvm_create_mappings(unsigned long start, unsigned long size, 27 + unsigned long phys, enum kvm_pgtable_prot prot) 28 28 { 29 29 int err; 30 30 ··· 67 67 return addr; 68 68 } 69 69 70 - int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot) 70 + int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot) 71 71 { 72 72 unsigned long start = (unsigned long)from; 73 73 unsigned long end = (unsigned long)to; 74 74 unsigned long virt_addr; 75 75 phys_addr_t phys; 76 + 77 + hyp_assert_lock_held(&pkvm_pgd_lock); 76 78 77 79 start = start & PAGE_MASK; 78 80 end = PAGE_ALIGN(end); ··· 83 81 int err; 84 82 85 83 phys = hyp_virt_to_phys((void *)virt_addr); 86 - err = __pkvm_create_mappings(virt_addr, PAGE_SIZE, phys, prot); 84 + err = kvm_pgtable_hyp_map(&pkvm_pgtable, virt_addr, PAGE_SIZE, 85 + phys, prot); 87 86 if (err) 88 87 return err; 89 88 } 90 89 91 90 return 0; 91 + } 92 + 93 + int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot) 94 + { 95 + int ret; 96 + 97 + hyp_spin_lock(&pkvm_pgd_lock); 98 + ret = pkvm_create_mappings_locked(from, to, prot); 99 + hyp_spin_unlock(&pkvm_pgd_lock); 100 + 101 + return ret; 92 102 } 93 103 94 104 int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back)

+74 -8

arch/arm64/kvm/hyp/nvhe/setup.c

···
 {
 	void *start, *end, *virt = hyp_phys_to_virt(phys);
 	unsigned long pgt_size = hyp_s1_pgtable_pages() << PAGE_SHIFT;
+	enum kvm_pgtable_prot prot;
 	int ret, i;
 
 	/* Recreate the hyp page-table using the early page allocator */
···
 	if (ret)
 		return ret;
 
-	ret = pkvm_create_mappings(__start_rodata, __end_rodata, PAGE_HYP_RO);
-	if (ret)
-		return ret;
-
 	ret = pkvm_create_mappings(__hyp_rodata_start, __hyp_rodata_end, PAGE_HYP_RO);
 	if (ret)
 		return ret;
 
 	ret = pkvm_create_mappings(__hyp_bss_start, __hyp_bss_end, PAGE_HYP);
-	if (ret)
-		return ret;
-
-	ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop, PAGE_HYP_RO);
 	if (ret)
 		return ret;
 
···
 		if (ret)
 			return ret;
 	}
+
+	/*
+	 * Map the host's .bss and .rodata sections RO in the hypervisor, but
+	 * transfer the ownership from the host to the hypervisor itself to
+	 * make sure it can't be donated or shared with another entity.
+	 *
+	 * The ownership transition requires matching changes in the host
+	 * stage-2. This will be done later (see finalize_host_mappings()) once
+	 * the hyp_vmemmap is addressable.
+	 */
+	prot = pkvm_mkstate(PAGE_HYP_RO, PKVM_PAGE_SHARED_OWNED);
+	ret = pkvm_create_mappings(__start_rodata, __end_rodata, prot);
+	if (ret)
+		return ret;
+
+	ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop, prot);
+	if (ret)
+		return ret;
 
 	return 0;
 }
···
 	hyp_put_page(&hpool, addr);
 }
 
+static int finalize_host_mappings_walker(u64 addr, u64 end, u32 level,
+					 kvm_pte_t *ptep,
+					 enum kvm_pgtable_walk_flags flag,
+					 void * const arg)
+{
+	enum kvm_pgtable_prot prot;
+	enum pkvm_page_state state;
+	kvm_pte_t pte = *ptep;
+	phys_addr_t phys;
+
+	if (!kvm_pte_valid(pte))
+		return 0;
+
+	if (level != (KVM_PGTABLE_MAX_LEVELS - 1))
+		return -EINVAL;
+
+	phys = kvm_pte_to_phys(pte);
+	if (!addr_is_memory(phys))
+		return 0;
+
+	/*
+	 * Adjust the host stage-2 mappings to match the ownership attributes
+	 * configured in the hypervisor stage-1.
+	 */
+	state = pkvm_getstate(kvm_pgtable_hyp_pte_prot(pte));
+	switch (state) {
+	case PKVM_PAGE_OWNED:
+		return host_stage2_set_owner_locked(phys, PAGE_SIZE, pkvm_hyp_id);
+	case PKVM_PAGE_SHARED_OWNED:
+		prot = pkvm_mkstate(PKVM_HOST_MEM_PROT, PKVM_PAGE_SHARED_BORROWED);
+		break;
+	case PKVM_PAGE_SHARED_BORROWED:
+		prot = pkvm_mkstate(PKVM_HOST_MEM_PROT, PKVM_PAGE_SHARED_OWNED);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return host_stage2_idmap_locked(phys, PAGE_SIZE, prot);
+}
+
+static int finalize_host_mappings(void)
+{
+	struct kvm_pgtable_walker walker = {
+		.cb	= finalize_host_mappings_walker,
+		.flags	= KVM_PGTABLE_WALK_LEAF,
+	};
+
+	return kvm_pgtable_walk(&pkvm_pgtable, 0, BIT(pkvm_pgtable.ia_bits), &walker);
+}
+
 void __noreturn __pkvm_init_finalise(void)
 {
 	struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
···
 		goto out;
 
 	ret = kvm_host_prepare_stage2(host_s2_pgt_base);
+	if (ret)
+		goto out;
+
+	ret = finalize_host_mappings();
 	if (ret)
 		goto out;
 
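
The walker in the setup.c hunks mirrors each hypervisor stage-1 ownership state into the host stage-2: a page the hypervisor shares as owner is recorded as borrowed on the host side, and vice versa. The sketch below isolates just that mapping. The enum values and the STATE_NO_ACCESS marker are hypothetical names chosen for illustration, and the real code calls host_stage2_set_owner_locked() for the owned case rather than returning a state.

```c
#include <assert.h>

/* Hypothetical stand-ins for the pKVM page states named in the diff. */
enum page_state {
	STATE_OWNED,
	STATE_SHARED_OWNED,
	STATE_SHARED_BORROWED,
	STATE_NO_ACCESS,	/* illustration only: host loses the page */
};

/*
 * Given the state recorded in the hypervisor stage-1, compute the state
 * the host stage-2 should record for the same page. Returns 0 on success
 * and -1 for an input the walker would reject.
 */
static int host_state_for(enum page_state hyp_state,
			  enum page_state *host_state)
{
	switch (hyp_state) {
	case STATE_OWNED:
		/* Exclusively owned by the hypervisor. */
		*host_state = STATE_NO_ACCESS;
		return 0;
	case STATE_SHARED_OWNED:
		/* Hypervisor owns and shares, so the host borrows. */
		*host_state = STATE_SHARED_BORROWED;
		return 0;
	case STATE_SHARED_BORROWED:
		/* Hypervisor borrows, so the host is the sharing owner. */
		*host_state = STATE_SHARED_OWNED;
		return 0;
	default:
		return -1;
	}
}
```

The two shared states are each other's mirror image, so applying the mapping twice to a shared state returns the original state.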

+6 -11

arch/arm64/kvm/hyp/nvhe/switch.c

···
 	___activate_traps(vcpu);
 	__activate_traps_common(vcpu);
 
-	val = CPTR_EL2_DEFAULT;
+	val = vcpu->arch.cptr_el2;
 	val |= CPTR_EL2_TTA | CPTR_EL2_TAM;
 	if (!update_fp_enabled(vcpu)) {
 		val |= CPTR_EL2_TFP | CPTR_EL2_TZ;
···
 static void __deactivate_traps(struct kvm_vcpu *vcpu)
 {
 	extern char __kvm_hyp_host_vector[];
-	u64 mdcr_el2, cptr;
+	u64 cptr;
 
 	___deactivate_traps(vcpu);
-
-	mdcr_el2 = read_sysreg(mdcr_el2);
 
 	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
 		u64 val;
···
 		isb();
 	}
 
-	__deactivate_traps_common();
+	__deactivate_traps_common(vcpu);
 
-	mdcr_el2 &= MDCR_EL2_HPMN_MASK;
-	mdcr_el2 |= MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT;
-	mdcr_el2 |= MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT;
-
-	write_sysreg(mdcr_el2, mdcr_el2);
 	write_sysreg(this_cpu_ptr(&kvm_init_params)->hcr_el2, hcr_el2);
 
 	cptr = CPTR_EL2_DEFAULT;
···
 {
 	struct kvm_cpu_context *host_ctxt;
 	struct kvm_cpu_context *guest_ctxt;
+	struct kvm_s2_mmu *mmu;
 	bool pmu_switch_needed;
 	u64 exit_code;
 
···
 	__sysreg32_restore_state(vcpu);
 	__sysreg_restore_state_nvhe(guest_ctxt);
 
-	__load_guest_stage2(kern_hyp_va(vcpu->arch.hw_mmu));
+	mmu = kern_hyp_va(vcpu->arch.hw_mmu);
+	__load_stage2(mmu, kern_hyp_va(mmu->arch));
 	__activate_traps(vcpu);
 
 	__hyp_vgic_restore_state(vcpu);

+2 -2

arch/arm64/kvm/hyp/nvhe/tlb.c

···
 	}
 
 	/*
-	 * __load_guest_stage2() includes an ISB only when the AT
+	 * __load_stage2() includes an ISB only when the AT
 	 * workaround is applied. Take care of the opposite condition,
 	 * ensuring that we always have an ISB, but not two ISBs back
 	 * to back.
 	 */
-	__load_guest_stage2(mmu);
+	__load_stage2(mmu, kern_hyp_va(mmu->arch));
 	asm(ALTERNATIVE("isb", "nop", ARM64_WORKAROUND_SPECULATIVE_AT));
 }
 

+124 -123

arch/arm64/kvm/hyp/pgtable.c

··· 11 11 #include <asm/kvm_pgtable.h> 12 12 #include <asm/stage2_pgtable.h> 13 13 14 - #define KVM_PTE_VALID BIT(0) 15 14 16 15 #define KVM_PTE_TYPE BIT(1) 17 16 #define KVM_PTE_TYPE_BLOCK 0 18 17 #define KVM_PTE_TYPE_PAGE 1 19 18 #define KVM_PTE_TYPE_TABLE 1 20 - 21 - #define KVM_PTE_ADDR_MASK GENMASK(47, PAGE_SHIFT) 22 - #define KVM_PTE_ADDR_51_48 GENMASK(15, 12) 23 19 24 20 #define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2) 25 21 ··· 36 40 37 41 #define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51) 38 42 43 + #define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55) 44 + 39 45 #define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54) 40 46 41 47 #define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54) ··· 46 48 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \ 47 49 KVM_PTE_LEAF_ATTR_HI_S2_XN) 48 50 49 - #define KVM_PTE_LEAF_ATTR_S2_IGNORED GENMASK(58, 55) 50 - 51 - #define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56) 51 + #define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2) 52 52 #define KVM_MAX_OWNER_ID 1 53 53 54 54 struct kvm_pgtable_walk_data { ··· 57 61 u64 end; 58 62 }; 59 63 60 - static u64 kvm_granule_shift(u32 level) 61 - { 62 - /* Assumes KVM_PGTABLE_MAX_LEVELS is 4 */ 63 - return ARM64_HW_PGTABLE_LEVEL_SHIFT(level); 64 - } 65 - 66 - static u64 kvm_granule_size(u32 level) 67 - { 68 - return BIT(kvm_granule_shift(level)); 69 - } 70 - 71 64 #define KVM_PHYS_INVALID (-1ULL) 72 65 73 66 static bool kvm_phys_is_valid(u64 phys) 74 67 { 75 68 return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_PARANGE_MAX)); 76 - } 77 - 78 - static bool kvm_level_supports_block_mapping(u32 level) 79 - { 80 - /* 81 - * Reject invalid block mappings and don't bother with 4TB mappings for 82 - * 52-bit PAs. 
83 - */ 84 - return !(level == 0 || (PAGE_SIZE != SZ_4K && level == 1)); 85 69 } 86 70 87 71 static bool kvm_block_mapping_supported(u64 addr, u64 end, u64 phys, u32 level) ··· 111 135 return __kvm_pgd_page_idx(&pgt, -1ULL) + 1; 112 136 } 113 137 114 - static bool kvm_pte_valid(kvm_pte_t pte) 115 - { 116 - return pte & KVM_PTE_VALID; 117 - } 118 - 119 138 static bool kvm_pte_table(kvm_pte_t pte, u32 level) 120 139 { 121 140 if (level == KVM_PGTABLE_MAX_LEVELS - 1) ··· 120 149 return false; 121 150 122 151 return FIELD_GET(KVM_PTE_TYPE, pte) == KVM_PTE_TYPE_TABLE; 123 - } 124 - 125 - static u64 kvm_pte_to_phys(kvm_pte_t pte) 126 - { 127 - u64 pa = pte & KVM_PTE_ADDR_MASK; 128 - 129 - if (PAGE_SHIFT == 16) 130 - pa |= FIELD_GET(KVM_PTE_ADDR_51_48, pte) << 48; 131 - 132 - return pa; 133 152 } 134 153 135 154 static kvm_pte_t kvm_phys_to_pte(u64 pa) ··· 287 326 return _kvm_pgtable_walk(&walk_data); 288 327 } 289 328 329 + struct leaf_walk_data { 330 + kvm_pte_t pte; 331 + u32 level; 332 + }; 333 + 334 + static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep, 335 + enum kvm_pgtable_walk_flags flag, void * const arg) 336 + { 337 + struct leaf_walk_data *data = arg; 338 + 339 + data->pte = *ptep; 340 + data->level = level; 341 + 342 + return 0; 343 + } 344 + 345 + int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr, 346 + kvm_pte_t *ptep, u32 *level) 347 + { 348 + struct leaf_walk_data data; 349 + struct kvm_pgtable_walker walker = { 350 + .cb = leaf_walker, 351 + .flags = KVM_PGTABLE_WALK_LEAF, 352 + .arg = &data, 353 + }; 354 + int ret; 355 + 356 + ret = kvm_pgtable_walk(pgt, ALIGN_DOWN(addr, PAGE_SIZE), 357 + PAGE_SIZE, &walker); 358 + if (!ret) { 359 + if (ptep) 360 + *ptep = data.pte; 361 + if (level) 362 + *level = data.level; 363 + } 364 + 365 + return ret; 366 + } 367 + 290 368 struct hyp_map_data { 291 369 u64 phys; 292 370 kvm_pte_t attr; ··· 357 357 attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap); 358 358 attr |= 
FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh); 359 359 attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF; 360 + attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW; 360 361 *ptep = attr; 361 362 362 363 return 0; 364 + } 365 + 366 + enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte) 367 + { 368 + enum kvm_pgtable_prot prot = pte & KVM_PTE_LEAF_ATTR_HI_SW; 369 + u32 ap; 370 + 371 + if (!kvm_pte_valid(pte)) 372 + return prot; 373 + 374 + if (!(pte & KVM_PTE_LEAF_ATTR_HI_S1_XN)) 375 + prot |= KVM_PGTABLE_PROT_X; 376 + 377 + ap = FIELD_GET(KVM_PTE_LEAF_ATTR_LO_S1_AP, pte); 378 + if (ap == KVM_PTE_LEAF_ATTR_LO_S1_AP_RO) 379 + prot |= KVM_PGTABLE_PROT_R; 380 + else if (ap == KVM_PTE_LEAF_ATTR_LO_S1_AP_RW) 381 + prot |= KVM_PGTABLE_PROT_RW; 382 + 383 + return prot; 384 + } 385 + 386 + static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new) 387 + { 388 + /* 389 + * Tolerate KVM recreating the exact same mapping, or changing software 390 + * bits if the existing mapping was valid. 391 + */ 392 + if (old == new) 393 + return false; 394 + 395 + if (!kvm_pte_valid(old)) 396 + return true; 397 + 398 + return !WARN_ON((old ^ new) & ~KVM_PTE_LEAF_ATTR_HI_SW); 363 399 } 364 400 365 401 static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level, ··· 407 371 if (!kvm_block_mapping_supported(addr, end, phys, level)) 408 372 return false; 409 373 410 - /* Tolerate KVM recreating the exact same mapping */ 411 374 new = kvm_init_valid_leaf_pte(phys, data->attr, level); 412 - if (old != new && !WARN_ON(kvm_pte_valid(old))) 375 + if (hyp_pte_needs_update(old, new)) 413 376 smp_store_release(ptep, new); 414 377 415 378 data->phys += granule; ··· 473 438 pgt->start_level = KVM_PGTABLE_MAX_LEVELS - levels; 474 439 pgt->mm_ops = mm_ops; 475 440 pgt->mmu = NULL; 441 + pgt->force_pte_cb = NULL; 442 + 476 443 return 0; 477 444 } 478 445 ··· 512 475 void *memcache; 513 476 514 477 struct kvm_pgtable_mm_ops *mm_ops; 478 + 479 + /* Force mappings to page granularity */ 480 + bool force_pte; 515 481 }; 516 
482 517 483 u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift) ··· 579 539 580 540 attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh); 581 541 attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF; 542 + attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW; 582 543 *ptep = attr; 583 544 584 545 return 0; 546 + } 547 + 548 + enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte) 549 + { 550 + enum kvm_pgtable_prot prot = pte & KVM_PTE_LEAF_ATTR_HI_SW; 551 + 552 + if (!kvm_pte_valid(pte)) 553 + return prot; 554 + 555 + if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R) 556 + prot |= KVM_PGTABLE_PROT_R; 557 + if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W) 558 + prot |= KVM_PGTABLE_PROT_W; 559 + if (!(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN)) 560 + prot |= KVM_PGTABLE_PROT_X; 561 + 562 + return prot; 585 563 } 586 564 587 565 static bool stage2_pte_needs_update(kvm_pte_t old, kvm_pte_t new) ··· 646 588 return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN); 647 589 } 648 590 591 + static bool stage2_leaf_mapping_allowed(u64 addr, u64 end, u32 level, 592 + struct stage2_map_data *data) 593 + { 594 + if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1))) 595 + return false; 596 + 597 + return kvm_block_mapping_supported(addr, end, data->phys, level); 598 + } 599 + 649 600 static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level, 650 601 kvm_pte_t *ptep, 651 602 struct stage2_map_data *data) ··· 664 597 struct kvm_pgtable *pgt = data->mmu->pgt; 665 598 struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops; 666 599 667 - if (!kvm_block_mapping_supported(addr, end, phys, level)) 600 + if (!stage2_leaf_mapping_allowed(addr, end, level, data)) 668 601 return -E2BIG; 669 602 670 603 if (kvm_phys_is_valid(phys)) ··· 708 641 if (data->anchor) 709 642 return 0; 710 643 711 - if (!kvm_block_mapping_supported(addr, end, data->phys, level)) 644 + if (!stage2_leaf_mapping_allowed(addr, end, level, data)) 712 645 return 0; 713 646 714 647 data->childp = kvm_pte_follow(*ptep, data->mm_ops); ··· 838 771 .mmu = pgt->mmu, 839 
772 .memcache = mc, 840 773 .mm_ops = pgt->mm_ops, 774 + .force_pte = pgt->force_pte_cb && pgt->force_pte_cb(addr, addr + size, prot), 841 775 }; 842 776 struct kvm_pgtable_walker walker = { 843 777 .cb = stage2_map_walker, ··· 870 802 .memcache = mc, 871 803 .mm_ops = pgt->mm_ops, 872 804 .owner_id = owner_id, 805 + .force_pte = true, 873 806 }; 874 807 struct kvm_pgtable_walker walker = { 875 808 .cb = stage2_map_walker, ··· 1064 995 u32 level; 1065 996 kvm_pte_t set = 0, clr = 0; 1066 997 998 + if (prot & KVM_PTE_LEAF_ATTR_HI_SW) 999 + return -EINVAL; 1000 + 1067 1001 if (prot & KVM_PGTABLE_PROT_R) 1068 1002 set |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R; 1069 1003 ··· 1115 1043 return kvm_pgtable_walk(pgt, addr, size, &walker); 1116 1044 } 1117 1045 1118 - int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch, 1119 - struct kvm_pgtable_mm_ops *mm_ops, 1120 - enum kvm_pgtable_stage2_flags flags) 1046 + 1047 + int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_arch *arch, 1048 + struct kvm_pgtable_mm_ops *mm_ops, 1049 + enum kvm_pgtable_stage2_flags flags, 1050 + kvm_pgtable_force_pte_cb_t force_pte_cb) 1121 1051 { 1122 1052 size_t pgd_sz; 1123 1053 u64 vtcr = arch->vtcr; ··· 1137 1063 pgt->mm_ops = mm_ops; 1138 1064 pgt->mmu = &arch->mmu; 1139 1065 pgt->flags = flags; 1066 + pgt->force_pte_cb = force_pte_cb; 1140 1067 1141 1068 /* Ensure zeroed PGD pages are visible to the hardware walker */ 1142 1069 dsb(ishst); ··· 1176 1101 pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level) * PAGE_SIZE; 1177 1102 pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz); 1178 1103 pgt->pgd = NULL; 1179 - } 1180 - 1181 - #define KVM_PTE_LEAF_S2_COMPAT_MASK (KVM_PTE_LEAF_ATTR_S2_PERMS | \ 1182 - KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | \ 1183 - KVM_PTE_LEAF_ATTR_S2_IGNORED) 1184 - 1185 - static int stage2_check_permission_walker(u64 addr, u64 end, u32 level, 1186 - kvm_pte_t *ptep, 1187 - enum kvm_pgtable_walk_flags flag, 1188 - void * const arg) 1189 - { 
1190 - kvm_pte_t old_attr, pte = *ptep, *new_attr = arg; 1191 - 1192 - /* 1193 - * Compatible mappings are either invalid and owned by the page-table 1194 - * owner (whose id is 0), or valid with matching permission attributes. 1195 - */ 1196 - if (kvm_pte_valid(pte)) { 1197 - old_attr = pte & KVM_PTE_LEAF_S2_COMPAT_MASK; 1198 - if (old_attr != *new_attr) 1199 - return -EEXIST; 1200 - } else if (pte) { 1201 - return -EEXIST; 1202 - } 1203 - 1204 - return 0; 1205 - } 1206 - 1207 - int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr, 1208 - enum kvm_pgtable_prot prot, 1209 - struct kvm_mem_range *range) 1210 - { 1211 - kvm_pte_t attr; 1212 - struct kvm_pgtable_walker check_perm_walker = { 1213 - .cb = stage2_check_permission_walker, 1214 - .flags = KVM_PGTABLE_WALK_LEAF, 1215 - .arg = &attr, 1216 - }; 1217 - u64 granule, start, end; 1218 - u32 level; 1219 - int ret; 1220 - 1221 - ret = stage2_set_prot_attr(pgt, prot, &attr); 1222 - if (ret) 1223 - return ret; 1224 - attr &= KVM_PTE_LEAF_S2_COMPAT_MASK; 1225 - 1226 - for (level = pgt->start_level; level < KVM_PGTABLE_MAX_LEVELS; level++) { 1227 - granule = kvm_granule_size(level); 1228 - start = ALIGN_DOWN(addr, granule); 1229 - end = start + granule; 1230 - 1231 - if (!kvm_level_supports_block_mapping(level)) 1232 - continue; 1233 - 1234 - if (start < range->start || range->end < end) 1235 - continue; 1236 - 1237 - /* 1238 - * Check the presence of existing mappings with incompatible 1239 - * permissions within the current block range, and try one level 1240 - * deeper if one is found. 1241 - */ 1242 - ret = kvm_pgtable_walk(pgt, start, granule, &check_perm_walker); 1243 - if (ret != -EEXIST) 1244 - break; 1245 - } 1246 - 1247 - if (!ret) { 1248 - range->start = start; 1249 - range->end = end; 1250 - } 1251 - 1252 - return ret; 1253 1104 }
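
The pgtable.c hunks above reserve bits 58:55 of a leaf PTE as software bits (KVM_PTE_LEAF_ATTR_HI_SW) and let hyp_pte_needs_update() accept a rewrite of a valid PTE only when nothing outside those bits changes. The following user-space sketch reproduces that bit arithmetic under stated assumptions: GENMASK64, PTE_SW_MASK, PTE_VALID and the helper names are local to the example, and the kernel's WARN_ON() side effect is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local equivalent of the kernel's GENMASK() for 64-bit values. */
#define GENMASK64(h, l) \
	(((~UINT64_C(0)) >> (63 - (h))) & ((~UINT64_C(0)) << (l)))

#define PTE_VALID	(UINT64_C(1) << 0)
#define PTE_SW_MASK	GENMASK64(58, 55)	/* software-defined bits */

/* Store software bits in a PTE without touching any other field. */
static uint64_t pte_set_sw(uint64_t pte, uint64_t sw_bits)
{
	return (pte & ~PTE_SW_MASK) | (sw_bits & PTE_SW_MASK);
}

/* Read the software bits back. */
static uint64_t pte_get_sw(uint64_t pte)
{
	return pte & PTE_SW_MASK;
}

/*
 * Same decision as hyp_pte_needs_update(): identical PTEs need no write,
 * an invalid old PTE can always be replaced, and a valid old PTE may only
 * differ from the new one in its software bits.
 */
static bool pte_needs_update(uint64_t old, uint64_t new)
{
	if (old == new)
		return false;

	if (!(old & PTE_VALID))
		return true;

	return ((old ^ new) & ~PTE_SW_MASK) == 0;
}
```

Because the software bits are ignored by the hardware walker, changing them on a live mapping does not alter the translation, which is presumably why the diff tolerates that one kind of in-place update.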

+1 -1

arch/arm64/kvm/hyp/vhe/debug-sr.c

···
 	__debug_switch_to_host_common(vcpu);
 }
 
-u32 __kvm_get_mdcr_el2(void)
+u64 __kvm_get_mdcr_el2(void)
 {
 	return read_sysreg(mdcr_el2);
 }

+5 -13

arch/arm64/kvm/hyp/vhe/switch.c

···
 	__activate_traps_common(vcpu);
 }
 
-void deactivate_traps_vhe_put(void)
+void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu)
 {
-	u64 mdcr_el2 = read_sysreg(mdcr_el2);
-
-	mdcr_el2 &= MDCR_EL2_HPMN_MASK |
-		    MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT |
-		    MDCR_EL2_TPMS;
-
-	write_sysreg(mdcr_el2, mdcr_el2);
-
-	__deactivate_traps_common();
+	__deactivate_traps_common(vcpu);
 }
 
 /* Switch to the guest for VHE systems running in EL2 */
···
 	 *
 	 * We have already configured the guest's stage 1 translation in
 	 * kvm_vcpu_load_sysregs_vhe above. We must now call
-	 * __load_guest_stage2 before __activate_traps, because
-	 * __load_guest_stage2 configures stage 2 translation, and
+	 * __load_stage2 before __activate_traps, because
+	 * __load_stage2 configures stage 2 translation, and
 	 * __activate_traps clear HCR_EL2.TGE (among other things).
 	 */
-	__load_guest_stage2(vcpu->arch.hw_mmu);
+	__load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
 	__activate_traps(vcpu);
 
 	__kvm_adjust_pc(vcpu);

+1 -1

arch/arm64/kvm/hyp/vhe/sysreg-sr.c

···
 	struct kvm_cpu_context *host_ctxt;
 
 	host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
-	deactivate_traps_vhe_put();
+	deactivate_traps_vhe_put(vcpu);
 
 	__sysreg_save_el1_state(guest_ctxt);
 	__sysreg_save_user_state(guest_ctxt);

+2 -2

arch/arm64/kvm/hyp/vhe/tlb.c

···
 	 *
 	 * ARM erratum 1165522 requires some special handling (again),
 	 * as we need to make sure both stages of translation are in
-	 * place before clearing TGE. __load_guest_stage2() already
+	 * place before clearing TGE. __load_stage2() already
 	 * has an ISB in order to deal with this.
 	 */
-	__load_guest_stage2(mmu);
+	__load_stage2(mmu, mmu->arch);
 	val = read_sysreg(hcr_el2);
 	val &= ~HCR_TGE;
 	write_sysreg(val, hcr_el2);

+64 -12

arch/arm64/kvm/mmu.c

···
  */
 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
+	++kvm->stat.generic.remote_tlb_flush_requests;
 	kvm_call_hyp(__kvm_tlb_flush_vmid, &kvm->arch.mmu);
 }
 
···
 {
 	int err;
 
-	if (!kvm_host_owns_hyp_mappings()) {
-		return kvm_call_hyp_nvhe(__pkvm_create_mappings,
-					 start, size, phys, prot);
-	}
+	if (WARN_ON(!kvm_host_owns_hyp_mappings()))
+		return -EINVAL;
 
 	mutex_lock(&kvm_hyp_pgd_mutex);
 	err = kvm_pgtable_hyp_map(hyp_pgtable, start, size, phys, prot);
···
 		return page_to_phys(vmalloc_to_page(kaddr)) +
 			offset_in_page(kaddr);
 	}
+}
+
+static int pkvm_share_hyp(phys_addr_t start, phys_addr_t end)
+{
+	phys_addr_t addr;
+	int ret;
+
+	for (addr = ALIGN_DOWN(start, PAGE_SIZE); addr < end; addr += PAGE_SIZE) {
+		ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
+					__phys_to_pfn(addr));
+		if (ret)
+			return ret;
+	}
+
+	return 0;
 }
 
 /**
···
 
 	if (is_kernel_in_hyp_mode())
 		return 0;
+
+	if (!kvm_host_owns_hyp_mappings()) {
+		if (WARN_ON(prot != PAGE_HYP))
+			return -EPERM;
+		return pkvm_share_hyp(kvm_kaddr_to_phys(from),
+				      kvm_kaddr_to_phys(to));
+	}
 
 	start = start & PAGE_MASK;
 	end = PAGE_ALIGN(end);
···
 	return 0;
 }
 
+static struct kvm_pgtable_mm_ops kvm_user_mm_ops = {
+	/* We shouldn't need any other callback to walk the PT */
+	.phys_to_virt		= kvm_host_va,
+};
+
+static int get_user_mapping_size(struct kvm *kvm, u64 addr)
+{
+	struct kvm_pgtable pgt = {
+		.pgd		= (kvm_pte_t *)kvm->mm->pgd,
+		.ia_bits	= VA_BITS,
+		.start_level	= (KVM_PGTABLE_MAX_LEVELS -
+				   CONFIG_PGTABLE_LEVELS),
+		.mm_ops		= &kvm_user_mm_ops,
+	};
+	kvm_pte_t pte = 0;	/* Keep GCC quiet... */
+	u32 level = ~0;
+	int ret;
+
+	ret = kvm_pgtable_get_leaf(&pgt, addr, &pte, &level);
+	VM_BUG_ON(ret);
+	VM_BUG_ON(level >= KVM_PGTABLE_MAX_LEVELS);
+	VM_BUG_ON(!(pte & PTE_VALID));
+
+	return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(level));
+}
+
 static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.zalloc_page		= stage2_memcache_zalloc_page,
 	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
···
 	mmu->arch = &kvm->arch;
 	mmu->pgt = pgt;
 	mmu->pgd_phys = __pa(pgt->pgd);
-	mmu->vmid.vmid_gen = 0;
+	WRITE_ONCE(mmu->vmid.vmid_gen, 0);
 	return 0;
 
 out_destroy_pgtable:
···
  * Returns the size of the mapping.
  */
 static unsigned long
-transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
+transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			    unsigned long hva, kvm_pfn_t *pfnp,
 			    phys_addr_t *ipap)
 {
···
 	 * sure that the HVA and IPA are sufficiently aligned and that the
 	 * block map is contained within the memslot.
 	 */
-	if (kvm_is_transparent_hugepage(pfn) &&
-	    fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+	if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
+	    get_user_mapping_size(kvm, hva) >= PMD_SIZE) {
 		/*
 		 * The address we faulted on is backed by a transparent huge
 		 * page. However, because we map the compound huge page and
···
 		*ipap &= PMD_MASK;
 		kvm_release_pfn_clean(pfn);
 		pfn &= ~(PTRS_PER_PMD - 1);
-		kvm_get_pfn(pfn);
+		get_page(pfn_to_page(pfn));
 		*pfnp = pfn;
 
 		return PMD_SIZE;
···
 	 * If we are not forced to use page mapping, check if we are
 	 * backed by a THP and thus use block mapping if possible.
 	 */
-	if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
-		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
-							   &pfn, &fault_ipa);
+	if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
+		if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
+			vma_pagesize = fault_granule;
+		else
+			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
+								   hva, &pfn,
+								   &fault_ipa);
+	}
 
 	if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) {
 		/* Check the VMM hasn't introduced a new VM_SHARED VMA */

+1 -1

arch/arm64/kvm/perf.c

···
 
 int kvm_perf_init(void)
 {
-	if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
+	if (kvm_pmu_probe_pmuver() != ID_AA64DFR0_PMUVER_IMP_DEF && !is_protected_kvm_enabled())
 		static_branch_enable(&kvm_arm_pmu_available);
 
 	return perf_register_guest_info_callbacks(&kvm_guest_cbs);

+7 -7

arch/arm64/kvm/pmu-emul.c

···
 		reg = __vcpu_sys_reg(vcpu, PMOVSSET_EL0);
 		reg &= __vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
 		reg &= __vcpu_sys_reg(vcpu, PMINTENSET_EL1);
-		reg &= kvm_pmu_valid_counter_mask(vcpu);
 	}
 
 	return reg;
···
  */
 void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val)
 {
-	unsigned long mask = kvm_pmu_valid_counter_mask(vcpu);
 	int i;
 
 	if (val & ARMV8_PMU_PMCR_E) {
 		kvm_pmu_enable_counter_mask(vcpu,
-		       __vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & mask);
+		       __vcpu_sys_reg(vcpu, PMCNTENSET_EL0));
 	} else {
-		kvm_pmu_disable_counter_mask(vcpu, mask);
+		kvm_pmu_disable_counter_mask(vcpu,
+		       __vcpu_sys_reg(vcpu, PMCNTENSET_EL0));
 	}
 
 	if (val & ARMV8_PMU_PMCR_C)
 		kvm_pmu_set_counter_value(vcpu, ARMV8_PMU_CYCLE_IDX, 0);
 
 	if (val & ARMV8_PMU_PMCR_P) {
+		unsigned long mask = kvm_pmu_valid_counter_mask(vcpu);
 		mask &= ~BIT(ARMV8_PMU_CYCLE_IDX);
 		for_each_set_bit(i, &mask, 32)
 			kvm_pmu_set_counter_value(vcpu, i, 0);
···
 	struct perf_event_attr attr = { };
 	struct perf_event *event;
 	struct arm_pmu *pmu;
-	int pmuver = 0xf;
+	int pmuver = ID_AA64DFR0_PMUVER_IMP_DEF;
 
 	/*
 	 * Create a dummy event that only counts user cycles. As we'll never
···
 	if (IS_ERR(event)) {
 		pr_err_once("kvm: pmu event creation failed %ld\n",
 			    PTR_ERR(event));
-		return 0xf;
+		return ID_AA64DFR0_PMUVER_IMP_DEF;
 	}
 
 	if (event->pmu) {
···
 	if (!vcpu->kvm->arch.pmuver)
 		vcpu->kvm->arch.pmuver = kvm_pmu_probe_pmuver();
 
-	if (vcpu->kvm->arch.pmuver == 0xf)
+	if (vcpu->kvm->arch.pmuver == ID_AA64DFR0_PMUVER_IMP_DEF)
 		return -ENODEV;
 
 	switch (attr->attr) {

+12 -3

arch/arm64/kvm/psci.c

···
 	kvm_vcpu_kick(vcpu);
 }
 
+static inline bool kvm_psci_valid_affinity(struct kvm_vcpu *vcpu,
+					   unsigned long affinity)
+{
+	return !(affinity & ~MPIDR_HWID_BITMASK);
+}
+
 static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
 {
 	struct vcpu_reset_state *reset_state;
···
 	struct kvm_vcpu *vcpu = NULL;
 	unsigned long cpu_id;
 
-	cpu_id = smccc_get_arg1(source_vcpu) & MPIDR_HWID_BITMASK;
-	if (vcpu_mode_is_32bit(source_vcpu))
-		cpu_id &= ~((u32) 0);
+	cpu_id = smccc_get_arg1(source_vcpu);
+	if (!kvm_psci_valid_affinity(source_vcpu, cpu_id))
+		return PSCI_RET_INVALID_PARAMS;
 
 	vcpu = kvm_mpidr_to_vcpu(kvm, cpu_id);
 
···
 
 	target_affinity = smccc_get_arg1(vcpu);
 	lowest_affinity_level = smccc_get_arg2(vcpu);
+
+	if (!kvm_psci_valid_affinity(vcpu, target_affinity))
+		return PSCI_RET_INVALID_PARAMS;
 
 	/* Determine target affinity mask */
 	target_affinity_mask = psci_affinity_mask(lowest_affinity_level);
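
The psci.c hunks replace a silent mask of the caller-supplied MPIDR with an explicit validity check: any bit set outside the affinity fields now makes CPU_ON and AFFINITY_INFO return PSCI_RET_INVALID_PARAMS. The sketch below shows the check in isolation. HWID_BITMASK is assumed to match arm64's MPIDR_HWID_BITMASK (Aff3 in bits 39:32, Aff2..Aff0 in bits 23:0), RET_INVALID_PARAMS is assumed to match PSCI's INVALID_PARAMETERS code of -2, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match arm64's MPIDR_HWID_BITMASK: Aff3 | Aff2..Aff0. */
#define HWID_BITMASK		UINT64_C(0xff00ffffff)
/* Assumed to match PSCI_RET_INVALID_PARAMS. */
#define RET_INVALID_PARAMS	(-2)
#define RET_SUCCESS		0

/* True when no bit outside the affinity fields is set. */
static bool valid_affinity(uint64_t affinity)
{
	return !(affinity & ~HWID_BITMASK);
}

/* Hypothetical caller: reject malformed IDs instead of masking them. */
static int check_target_cpu(uint64_t cpu_id)
{
	if (!valid_affinity(cpu_id))
		return RET_INVALID_PARAMS;

	return RET_SUCCESS;
}
```

With the previous masking behaviour, an ID with stray bits set could have resolved to an existing vCPU after truncation. Rejecting it up front makes such a request fail visibly.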

+21 -22

arch/arm64/kvm/reset.c

···
  */
 int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_reset_state reset_state;
 	int ret;
 	bool loaded;
 	u32 pstate;
+
+	mutex_lock(&vcpu->kvm->lock);
+	reset_state = vcpu->arch.reset_state;
+	WRITE_ONCE(vcpu->arch.reset_state.reset, false);
+	mutex_unlock(&vcpu->kvm->lock);
 
 	/* Reset PMU outside of the non-preemptible section */
 	kvm_pmu_vcpu_reset(vcpu);
···
 	 * Additional reset state handling that PSCI may have imposed on us.
 	 * Must be done after all the sys_reg reset.
 	 */
-	if (vcpu->arch.reset_state.reset) {
-		unsigned long target_pc = vcpu->arch.reset_state.pc;
+	if (reset_state.reset) {
+		unsigned long target_pc = reset_state.pc;
 
 		/* Gracefully handle Thumb2 entry point */
 		if (vcpu_mode_is_32bit(vcpu) && (target_pc & 1)) {
···
 		}
 
 		/* Propagate caller endianness */
-		if (vcpu->arch.reset_state.be)
+		if (reset_state.be)
 			kvm_vcpu_set_be(vcpu);
 
 		*vcpu_pc(vcpu) = target_pc;
-		vcpu_set_reg(vcpu, 0, vcpu->arch.reset_state.r0);
-
-		vcpu->arch.reset_state.reset = false;
+		vcpu_set_reg(vcpu, 0, reset_state.r0);
 	}
 
 	/* Reset timer */
···
 
 int kvm_set_ipa_limit(void)
 {
-	unsigned int parange, tgran_2;
+	unsigned int parange;
 	u64 mmfr0;
 
 	mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
 	parange = cpuid_feature_extract_unsigned_field(mmfr0,
 				ID_AA64MMFR0_PARANGE_SHIFT);
+	/*
+	 * IPA size beyond 48 bits could not be supported
+	 * on either 4K or 16K page size. Hence let's cap
+	 * it to 48 bits, in case it's reported as larger
+	 * on the system.
+	 */
+	if (PAGE_SIZE != SZ_64K)
+		parange = min(parange, (unsigned int)ID_AA64MMFR0_PARANGE_48);
 
 	/*
 	 * Check with ARMv8.5-GTG that our PAGE_SIZE is supported at
 	 * Stage-2. If not, things will stop very quickly.
 	 */
-	switch (PAGE_SIZE) {
-	default:
-	case SZ_4K:
-		tgran_2 = ID_AA64MMFR0_TGRAN4_2_SHIFT;
-		break;
-	case SZ_16K:
-		tgran_2 = ID_AA64MMFR0_TGRAN16_2_SHIFT;
-		break;
-	case SZ_64K:
-		tgran_2 = ID_AA64MMFR0_TGRAN64_2_SHIFT;
-		break;
-	}
-
-	switch (cpuid_feature_extract_unsigned_field(mmfr0, tgran_2)) {
+	switch (cpuid_feature_extract_unsigned_field(mmfr0, ID_AA64MMFR0_TGRAN_2_SHIFT)) {
 	case ID_AA64MMFR0_TGRAN_2_SUPPORTED_NONE:
 		kvm_err("PAGE_SIZE not supported at Stage-2, giving up\n");
 		return -EINVAL;
···
 	phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
 	if (phys_shift) {
 		if (phys_shift > kvm_ipa_limit ||
-		    phys_shift < 32)
+		    phys_shift < ARM64_MIN_PARANGE_BITS)
 			return -EINVAL;
 	} else {
 		phys_shift = KVM_PHYS_SHIFT;

+81 -53

arch/arm64/kvm/sys_regs.c

··· 44 44 * 64bit interface. 45 45 */ 46 46 47 - #define reg_to_encoding(x) \ 48 - sys_reg((u32)(x)->Op0, (u32)(x)->Op1, \ 49 - (u32)(x)->CRn, (u32)(x)->CRm, (u32)(x)->Op2) 50 - 51 47 static bool read_from_write_only(struct kvm_vcpu *vcpu, 52 48 struct sys_reg_params *params, 53 49 const struct sys_reg_desc *r) ··· 314 318 /* 315 319 * We want to avoid world-switching all the DBG registers all the 316 320 * time: 317 - * 321 + * 318 322 * - If we've touched any debug register, it is likely that we're 319 323 * going to touch more of them. It then makes sense to disable the 320 324 * traps and start doing the save/restore dance 321 325 * - If debug is active (DBG_MDSCR_KDE or DBG_MDSCR_MDE set), it is 322 326 * then mandatory to save/restore the registers, as the guest 323 327 * depends on them. 324 - * 328 + * 325 329 * For this, we use a DIRTY bit, indicating the guest has modified the 326 330 * debug registers, used as follow: 327 331 * ··· 599 603 return REG_HIDDEN; 600 604 } 601 605 606 + static void reset_pmu_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 607 + { 608 + u64 n, mask = BIT(ARMV8_PMU_CYCLE_IDX); 609 + 610 + /* No PMU available, any PMU reg may UNDEF... 
*/ 611 + if (!kvm_arm_support_pmu_v3()) 612 + return; 613 + 614 + n = read_sysreg(pmcr_el0) >> ARMV8_PMU_PMCR_N_SHIFT; 615 + n &= ARMV8_PMU_PMCR_N_MASK; 616 + if (n) 617 + mask |= GENMASK(n - 1, 0); 618 + 619 + reset_unknown(vcpu, r); 620 + __vcpu_sys_reg(vcpu, r->reg) &= mask; 621 + } 622 + 623 + static void reset_pmevcntr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 624 + { 625 + reset_unknown(vcpu, r); 626 + __vcpu_sys_reg(vcpu, r->reg) &= GENMASK(31, 0); 627 + } 628 + 629 + static void reset_pmevtyper(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 630 + { 631 + reset_unknown(vcpu, r); 632 + __vcpu_sys_reg(vcpu, r->reg) &= ARMV8_PMU_EVTYPE_MASK; 633 + } 634 + 635 + static void reset_pmselr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 636 + { 637 + reset_unknown(vcpu, r); 638 + __vcpu_sys_reg(vcpu, r->reg) &= ARMV8_PMU_COUNTER_MASK; 639 + } 640 + 602 641 static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) 603 642 { 604 643 u64 pmcr, val; ··· 876 845 kvm_pmu_disable_counter_mask(vcpu, val); 877 846 } 878 847 } else { 879 - p->regval = __vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & mask; 848 + p->regval = __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); 880 849 } 881 850 882 851 return true; ··· 900 869 /* accessing PMINTENCLR_EL1 */ 901 870 __vcpu_sys_reg(vcpu, PMINTENSET_EL1) &= ~val; 902 871 } else { 903 - p->regval = __vcpu_sys_reg(vcpu, PMINTENSET_EL1) & mask; 872 + p->regval = __vcpu_sys_reg(vcpu, PMINTENSET_EL1); 904 873 } 905 874 906 875 return true; ··· 922 891 /* accessing PMOVSCLR_EL0 */ 923 892 __vcpu_sys_reg(vcpu, PMOVSSET_EL0) &= ~(p->regval & mask); 924 893 } else { 925 - p->regval = __vcpu_sys_reg(vcpu, PMOVSSET_EL0) & mask; 894 + p->regval = __vcpu_sys_reg(vcpu, PMOVSSET_EL0); 926 895 } 927 896 928 897 return true; ··· 975 944 trap_wcr, reset_wcr, 0, 0, get_wcr, set_wcr } 976 945 977 946 #define PMU_SYS_REG(r) \ 978 - SYS_DESC(r), .reset = reset_unknown, .visibility = pmu_visibility 947 + SYS_DESC(r), .reset = 
reset_pmu_reg, .visibility = pmu_visibility 979 948 980 949 /* Macro to expand the PMEVCNTRn_EL0 register */ 981 950 #define PMU_PMEVCNTR_EL0(n) \ 982 951 { PMU_SYS_REG(SYS_PMEVCNTRn_EL0(n)), \ 952 + .reset = reset_pmevcntr, \ 983 953 .access = access_pmu_evcntr, .reg = (PMEVCNTR0_EL0 + n), } 984 954 985 955 /* Macro to expand the PMEVTYPERn_EL0 register */ 986 956 #define PMU_PMEVTYPER_EL0(n) \ 987 957 { PMU_SYS_REG(SYS_PMEVTYPERn_EL0(n)), \ 958 + .reset = reset_pmevtyper, \ 988 959 .access = access_pmu_evtyper, .reg = (PMEVTYPER0_EL0 + n), } 989 960 990 961 static bool undef_access(struct kvm_vcpu *vcpu, struct sys_reg_params *p, ··· 1059 1026 return true; 1060 1027 } 1061 1028 1062 - #define FEATURE(x) (GENMASK_ULL(x##_SHIFT + 3, x##_SHIFT)) 1063 - 1064 1029 /* Read a sanitised cpufeature ID register by sys_reg_desc */ 1065 1030 static u64 read_id_reg(const struct kvm_vcpu *vcpu, 1066 1031 struct sys_reg_desc const *r, bool raz) ··· 1069 1038 switch (id) { 1070 1039 case SYS_ID_AA64PFR0_EL1: 1071 1040 if (!vcpu_has_sve(vcpu)) 1072 - val &= ~FEATURE(ID_AA64PFR0_SVE); 1073 - val &= ~FEATURE(ID_AA64PFR0_AMU); 1074 - val &= ~FEATURE(ID_AA64PFR0_CSV2); 1075 - val |= FIELD_PREP(FEATURE(ID_AA64PFR0_CSV2), (u64)vcpu->kvm->arch.pfr0_csv2); 1076 - val &= ~FEATURE(ID_AA64PFR0_CSV3); 1077 - val |= FIELD_PREP(FEATURE(ID_AA64PFR0_CSV3), (u64)vcpu->kvm->arch.pfr0_csv3); 1041 + val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_SVE); 1042 + val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_AMU); 1043 + val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_CSV2); 1044 + val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64PFR0_CSV2), (u64)vcpu->kvm->arch.pfr0_csv2); 1045 + val &= ~ARM64_FEATURE_MASK(ID_AA64PFR0_CSV3); 1046 + val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64PFR0_CSV3), (u64)vcpu->kvm->arch.pfr0_csv3); 1078 1047 break; 1079 1048 case SYS_ID_AA64PFR1_EL1: 1080 - val &= ~FEATURE(ID_AA64PFR1_MTE); 1049 + val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_MTE); 1081 1050 if (kvm_has_mte(vcpu->kvm)) { 1082 1051 u64 pfr, mte; 1083 
1052 1084 1053 pfr = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1); 1085 1054 mte = cpuid_feature_extract_unsigned_field(pfr, ID_AA64PFR1_MTE_SHIFT); 1086 - val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE), mte); 1055 + val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64PFR1_MTE), mte); 1087 1056 } 1088 1057 break; 1089 1058 case SYS_ID_AA64ISAR1_EL1: 1090 1059 if (!vcpu_has_ptrauth(vcpu)) 1091 - val &= ~(FEATURE(ID_AA64ISAR1_APA) | 1092 - FEATURE(ID_AA64ISAR1_API) | 1093 - FEATURE(ID_AA64ISAR1_GPA) | 1094 - FEATURE(ID_AA64ISAR1_GPI)); 1060 + val &= ~(ARM64_FEATURE_MASK(ID_AA64ISAR1_APA) | 1061 + ARM64_FEATURE_MASK(ID_AA64ISAR1_API) | 1062 + ARM64_FEATURE_MASK(ID_AA64ISAR1_GPA) | 1063 + ARM64_FEATURE_MASK(ID_AA64ISAR1_GPI)); 1095 1064 break; 1096 1065 case SYS_ID_AA64DFR0_EL1: 1097 1066 /* Limit debug to ARMv8.0 */ 1098 - val &= ~FEATURE(ID_AA64DFR0_DEBUGVER); 1099 - val |= FIELD_PREP(FEATURE(ID_AA64DFR0_DEBUGVER), 6); 1067 + val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_DEBUGVER); 1068 + val |= FIELD_PREP(ARM64_FEATURE_MASK(ID_AA64DFR0_DEBUGVER), 6); 1100 1069 /* Limit guests to PMUv3 for ARMv8.4 */ 1101 1070 val = cpuid_feature_cap_perfmon_field(val, 1102 1071 ID_AA64DFR0_PMUVER_SHIFT, 1103 1072 kvm_vcpu_has_pmu(vcpu) ? 
ID_AA64DFR0_PMUVER_8_4 : 0); 1104 1073 /* Hide SPE from guests */ 1105 - val &= ~FEATURE(ID_AA64DFR0_PMSVER); 1074 + val &= ~ARM64_FEATURE_MASK(ID_AA64DFR0_PMSVER); 1106 1075 break; 1107 1076 case SYS_ID_DFR0_EL1: 1108 1077 /* Limit guests to PMUv3 for ARMv8.4 */ ··· 1278 1247 const struct kvm_one_reg *reg, void __user *uaddr) 1279 1248 { 1280 1249 return __set_id_reg(vcpu, rd, uaddr, true); 1250 + } 1251 + 1252 + static int set_wi_reg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd, 1253 + const struct kvm_one_reg *reg, void __user *uaddr) 1254 + { 1255 + int err; 1256 + u64 val; 1257 + 1258 + /* Perform the access even if we are going to ignore the value */ 1259 + err = reg_from_user(&val, uaddr, sys_reg_to_index(rd)); 1260 + if (err) 1261 + return err; 1262 + 1263 + return 0; 1281 1264 } 1282 1265 1283 1266 static bool access_ctr(struct kvm_vcpu *vcpu, struct sys_reg_params *p, ··· 1637 1592 .access = access_pmcnten, .reg = PMCNTENSET_EL0 }, 1638 1593 { PMU_SYS_REG(SYS_PMOVSCLR_EL0), 1639 1594 .access = access_pmovs, .reg = PMOVSSET_EL0 }, 1595 + /* 1596 + * PM_SWINC_EL0 is exposed to userspace as RAZ/WI, as it was 1597 + * previously (and pointlessly) advertised in the past... 
1598 + */ 1640 1599 { PMU_SYS_REG(SYS_PMSWINC_EL0), 1641 - .access = access_pmswinc, .reg = PMSWINC_EL0 }, 1600 + .get_user = get_raz_id_reg, .set_user = set_wi_reg, 1601 + .access = access_pmswinc, .reset = NULL }, 1642 1602 { PMU_SYS_REG(SYS_PMSELR_EL0), 1643 - .access = access_pmselr, .reg = PMSELR_EL0 }, 1603 + .access = access_pmselr, .reset = reset_pmselr, .reg = PMSELR_EL0 }, 1644 1604 { PMU_SYS_REG(SYS_PMCEID0_EL0), 1645 1605 .access = access_pmceid, .reset = NULL }, 1646 1606 { PMU_SYS_REG(SYS_PMCEID1_EL0), 1647 1607 .access = access_pmceid, .reset = NULL }, 1648 1608 { PMU_SYS_REG(SYS_PMCCNTR_EL0), 1649 - .access = access_pmu_evcntr, .reg = PMCCNTR_EL0 }, 1609 + .access = access_pmu_evcntr, .reset = reset_unknown, .reg = PMCCNTR_EL0 }, 1650 1610 { PMU_SYS_REG(SYS_PMXEVTYPER_EL0), 1651 1611 .access = access_pmu_evtyper, .reset = NULL }, 1652 1612 { PMU_SYS_REG(SYS_PMXEVCNTR_EL0), ··· 2156 2106 return 0; 2157 2107 } 2158 2108 2159 - static int match_sys_reg(const void *key, const void *elt) 2160 - { 2161 - const unsigned long pval = (unsigned long)key; 2162 - const struct sys_reg_desc *r = elt; 2163 - 2164 - return pval - reg_to_encoding(r); 2165 - } 2166 - 2167 - static const struct sys_reg_desc *find_reg(const struct sys_reg_params *params, 2168 - const struct sys_reg_desc table[], 2169 - unsigned int num) 2170 - { 2171 - unsigned long pval = reg_to_encoding(params); 2172 - 2173 - return bsearch((void *)pval, table, num, sizeof(table[0]), match_sys_reg); 2174 - } 2175 - 2176 2109 int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu) 2177 2110 { 2178 2111 kvm_inject_undefined(vcpu); ··· 2398 2365 2399 2366 trace_kvm_handle_sys_reg(esr); 2400 2367 2401 - params.Op0 = (esr >> 20) & 3; 2402 - params.Op1 = (esr >> 14) & 0x7; 2403 - params.CRn = (esr >> 10) & 0xf; 2404 - params.CRm = (esr >> 1) & 0xf; 2405 - params.Op2 = (esr >> 17) & 0x7; 2368 + params = esr_sys64_to_params(esr); 2406 2369 params.regval = vcpu_get_reg(vcpu, Rt); 2407 - params.is_write = !(esr 
& 1); 2408 2370 2409 2371 ret = emulate_sys_reg(vcpu, &params); 2410 2372
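
The mask arithmetic in reset_pmu_reg() above can be tried outside the kernel. The following is a minimal sketch, not the kernel code: the helper name pmu_valid_counter_mask() is hypothetical, and the three constants are assumptions standing in for ARMV8_PMU_CYCLE_IDX, ARMV8_PMU_PMCR_N_SHIFT and ARMV8_PMU_PMCR_N_MASK. It keeps the cycle-counter bit plus one bit per implemented event counter, as reported by the N field of PMCR_EL0.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, standing in for the arm64 PMUv3 definitions. */
#define PMU_CYCLE_IDX     31    /* cycle counter is tracked in bit 31 */
#define PMU_PMCR_N_SHIFT  11    /* PMCR_EL0.N starts at bit 11 */
#define PMU_PMCR_N_MASK   0x1f  /* PMCR_EL0.N is five bits wide */

/* Hypothetical helper: bits that may stay set after a PMU register reset. */
static uint64_t pmu_valid_counter_mask(uint64_t pmcr)
{
	uint64_t n = (pmcr >> PMU_PMCR_N_SHIFT) & PMU_PMCR_N_MASK;
	uint64_t mask = UINT64_C(1) << PMU_CYCLE_IDX;

	/* n is at most 31 here, so the shift below is well defined. */
	assert(n < 64);

	/* Equivalent to GENMASK(n - 1, 0); skipped when no event counters exist. */
	if (n)
		mask |= (UINT64_C(1) << n) - 1;

	return mask;
}
```

With six event counters (N = 6) the helper keeps bits 0 to 5 and bit 31, so a register reset to an unknown value and then ANDed with this mask can never expose a counter the hardware does not implement.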

+31

arch/arm64/kvm/sys_regs.h

··· 11 11 #ifndef __ARM64_KVM_SYS_REGS_LOCAL_H__ 12 12 #define __ARM64_KVM_SYS_REGS_LOCAL_H__ 13 13 14 + #include <linux/bsearch.h> 15 + 16 + #define reg_to_encoding(x) \ 17 + sys_reg((u32)(x)->Op0, (u32)(x)->Op1, \ 18 + (u32)(x)->CRn, (u32)(x)->CRm, (u32)(x)->Op2) 19 + 14 20 struct sys_reg_params { 15 21 u8 Op0; 16 22 u8 Op1; ··· 26 20 u64 regval; 27 21 bool is_write; 28 22 }; 23 + 24 + #define esr_sys64_to_params(esr) \ 25 + ((struct sys_reg_params){ .Op0 = ((esr) >> 20) & 3, \ 26 + .Op1 = ((esr) >> 14) & 0x7, \ 27 + .CRn = ((esr) >> 10) & 0xf, \ 28 + .CRm = ((esr) >> 1) & 0xf, \ 29 + .Op2 = ((esr) >> 17) & 0x7, \ 30 + .is_write = !((esr) & 1) }) 29 31 30 32 struct sys_reg_desc { 31 33 /* Sysreg string for debug */ ··· 164 150 if (i1->CRm != i2->CRm) 165 151 return i1->CRm - i2->CRm; 166 152 return i1->Op2 - i2->Op2; 153 + } 154 + 155 + static inline int match_sys_reg(const void *key, const void *elt) 156 + { 157 + const unsigned long pval = (unsigned long)key; 158 + const struct sys_reg_desc *r = elt; 159 + 160 + return pval - reg_to_encoding(r); 161 + } 162 + 163 + static inline const struct sys_reg_desc * 164 + find_reg(const struct sys_reg_params *params, const struct sys_reg_desc table[], 165 + unsigned int num) 166 + { 167 + unsigned long pval = reg_to_encoding(params); 168 + 169 + return __inline_bsearch((void *)pval, table, num, sizeof(table[0]), match_sys_reg); 167 170 } 168 171 169 172 const struct sys_reg_desc *find_reg_by_id(u64 id,
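
The decode-then-search path that this header now exposes can be sketched in user space. This is an illustration under stated assumptions, not the kernel implementation: the field positions are copied from esr_sys64_to_params() in the hunk above, while DEMO_ENC(), demo_table and demo_find_reg() are hypothetical stand-ins for sys_reg(), the descriptor table and find_reg(). Standard bsearch() is used in place of __inline_bsearch(), and the key is passed by address rather than cast into the pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct sys_reg_params {
	uint8_t Op0, Op1, CRn, CRm, Op2;
	bool is_write;
};

/* Field positions copied from esr_sys64_to_params() in the hunk above. */
static struct sys_reg_params decode_sys64(uint32_t esr)
{
	return (struct sys_reg_params){
		.Op0 = (esr >> 20) & 0x3,
		.Op1 = (esr >> 14) & 0x7,
		.CRn = (esr >> 10) & 0xf,
		.CRm = (esr >> 1) & 0xf,
		.Op2 = (esr >> 17) & 0x7,
		.is_write = !(esr & 1),
	};
}

/* Assumed packing of the five fields into one sortable key. */
#define DEMO_ENC(op0, op1, crn, crm, op2) \
	(((uint32_t)(op0) << 19) | ((uint32_t)(op1) << 16) | \
	 ((uint32_t)(crn) << 12) | ((uint32_t)(crm) << 8) | \
	 ((uint32_t)(op2) << 5))

struct demo_reg_desc {
	uint32_t enc;
	const char *name;
};

/* Illustrative entries only; must stay sorted by enc for bsearch(). */
static const struct demo_reg_desc demo_table[] = {
	{ DEMO_ENC(3, 3, 9, 12, 0), "PMCR_EL0" },
	{ DEMO_ENC(3, 3, 9, 12, 1), "PMCNTENSET_EL0" },
	{ DEMO_ENC(3, 3, 9, 12, 5), "PMSELR_EL0" },
};

/* Three-way comparison of the search key against one table element. */
static int demo_match(const void *key, const void *elt)
{
	uint32_t k = *(const uint32_t *)key;
	uint32_t e = ((const struct demo_reg_desc *)elt)->enc;

	return (k > e) - (k < e);
}

/* Decode the trapped access and look its descriptor up; NULL if unknown. */
static const struct demo_reg_desc *demo_find_reg(uint32_t esr)
{
	struct sys_reg_params p = decode_sys64(esr);
	uint32_t key = DEMO_ENC(p.Op0, p.Op1, p.CRn, p.CRm, p.Op2);
	size_t num = sizeof(demo_table) / sizeof(demo_table[0]);

	assert(num > 0);
	return bsearch(&key, demo_table, num, sizeof(demo_table[0]), demo_match);
}
```

Keeping the decode and the comparator next to each other is what lets both be inlined at every call site, which appears to be the point of moving them into the header.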

+7 -3

arch/arm64/kvm/trace_handle_exit.h

··· 78 78 TP_printk("flags: 0x%08x", __entry->guest_debug) 79 79 ); 80 80 81 + /* 82 + * The dreg32 name is a leftover from a distant past. This will really 83 + * output a 64bit value... 84 + */ 81 85 TRACE_EVENT(kvm_arm_set_dreg32, 82 - TP_PROTO(const char *name, __u32 value), 86 + TP_PROTO(const char *name, __u64 value), 83 87 TP_ARGS(name, value), 84 88 85 89 TP_STRUCT__entry( 86 90 __field(const char *, name) 87 - __field(__u32, value) 91 + __field(__u64, value) 88 92 ), 89 93 90 94 TP_fast_assign( ··· 96 92 __entry->value = value; 97 93 ), 98 94 99 - TP_printk("%s: 0x%08x", __entry->name, __entry->value) 95 + TP_printk("%s: 0x%llx", __entry->name, __entry->value) 100 96 ); 101 97 102 98 TRACE_DEFINE_SIZEOF(__u64);

+2 -2

arch/arm64/kvm/vgic/vgic-mmio-v2.c

··· 282 282 case GIC_CPU_PRIMASK: 283 283 /* 284 284 * Our KVM_DEV_TYPE_ARM_VGIC_V2 device ABI exports the 285 - * the PMR field as GICH_VMCR.VMPriMask rather than 285 + * PMR field as GICH_VMCR.VMPriMask rather than 286 286 * GICC_PMR.Priority, so we expose the upper five bits of 287 287 * priority mask to userspace using the lower bits in the 288 288 * unsigned long. ··· 329 329 case GIC_CPU_PRIMASK: 330 330 /* 331 331 * Our KVM_DEV_TYPE_ARM_VGIC_V2 device ABI exports the 332 - * the PMR field as GICH_VMCR.VMPriMask rather than 332 + * PMR field as GICH_VMCR.VMPriMask rather than 333 333 * GICC_PMR.Priority, so we expose the upper five bits of 334 334 * priority mask to userspace using the lower bits in the 335 335 * unsigned long.

+5 -31

arch/arm64/kvm/vgic/vgic-v2.c

··· 60 60 u32 val = cpuif->vgic_lr[lr]; 61 61 u32 cpuid, intid = val & GICH_LR_VIRTUALID; 62 62 struct vgic_irq *irq; 63 + bool deactivated; 63 64 64 65 /* Extract the source vCPU id from the LR */ 65 66 cpuid = val & GICH_LR_PHYSID_CPUID; ··· 76 75 77 76 raw_spin_lock(&irq->irq_lock); 78 77 79 - /* Always preserve the active bit */ 78 + /* Always preserve the active bit, note deactivation */ 79 + deactivated = irq->active && !(val & GICH_LR_ACTIVE_BIT); 80 80 irq->active = !!(val & GICH_LR_ACTIVE_BIT); 81 81 82 82 if (irq->active && vgic_irq_is_sgi(intid)) ··· 98 96 if (irq->config == VGIC_CONFIG_LEVEL && !(val & GICH_LR_STATE)) 99 97 irq->pending_latch = false; 100 98 101 - /* 102 - * Level-triggered mapped IRQs are special because we only 103 - * observe rising edges as input to the VGIC. 104 - * 105 - * If the guest never acked the interrupt we have to sample 106 - * the physical line and set the line level, because the 107 - * device state could have changed or we simply need to 108 - * process the still pending interrupt later. 109 - * 110 - * If this causes us to lower the level, we have to also clear 111 - * the physical active state, since we will otherwise never be 112 - * told when the interrupt becomes asserted again. 113 - * 114 - * Another case is when the interrupt requires a helping hand 115 - * on deactivation (no HW deactivation, for example). 
116 - */ 117 - if (vgic_irq_is_mapped_level(irq)) { 118 - bool resample = false; 119 - 120 - if (val & GICH_LR_PENDING_BIT) { 121 - irq->line_level = vgic_get_phys_line_level(irq); 122 - resample = !irq->line_level; 123 - } else if (vgic_irq_needs_resampling(irq) && 124 - !(irq->active || irq->pending_latch)) { 125 - resample = true; 126 - } 127 - 128 - if (resample) 129 - vgic_irq_set_phys_active(irq, false); 130 - } 99 + /* Handle resampling for mapped interrupts if required */ 100 + vgic_irq_handle_resampling(irq, deactivated, val & GICH_LR_PENDING_BIT); 131 101 132 102 raw_spin_unlock(&irq->irq_lock); 133 103 vgic_put_irq(vcpu->kvm, irq);

+5 -31

arch/arm64/kvm/vgic/vgic-v3.c

··· 46 46 u32 intid, cpuid; 47 47 struct vgic_irq *irq; 48 48 bool is_v2_sgi = false; 49 + bool deactivated; 49 50 50 51 cpuid = val & GICH_LR_PHYSID_CPUID; 51 52 cpuid >>= GICH_LR_PHYSID_CPUID_SHIFT; ··· 69 68 70 69 raw_spin_lock(&irq->irq_lock); 71 70 72 - /* Always preserve the active bit */ 71 + /* Always preserve the active bit, note deactivation */ 72 + deactivated = irq->active && !(val & ICH_LR_ACTIVE_BIT); 73 73 irq->active = !!(val & ICH_LR_ACTIVE_BIT); 74 74 75 75 if (irq->active && is_v2_sgi) ··· 91 89 if (irq->config == VGIC_CONFIG_LEVEL && !(val & ICH_LR_STATE)) 92 90 irq->pending_latch = false; 93 91 94 - /* 95 - * Level-triggered mapped IRQs are special because we only 96 - * observe rising edges as input to the VGIC. 97 - * 98 - * If the guest never acked the interrupt we have to sample 99 - * the physical line and set the line level, because the 100 - * device state could have changed or we simply need to 101 - * process the still pending interrupt later. 102 - * 103 - * If this causes us to lower the level, we have to also clear 104 - * the physical active state, since we will otherwise never be 105 - * told when the interrupt becomes asserted again. 106 - * 107 - * Another case is when the interrupt requires a helping hand 108 - * on deactivation (no HW deactivation, for example). 109 - */ 110 - if (vgic_irq_is_mapped_level(irq)) { 111 - bool resample = false; 112 - 113 - if (val & ICH_LR_PENDING_BIT) { 114 - irq->line_level = vgic_get_phys_line_level(irq); 115 - resample = !irq->line_level; 116 - } else if (vgic_irq_needs_resampling(irq) && 117 - !(irq->active || irq->pending_latch)) { 118 - resample = true; 119 - } 120 - 121 - if (resample) 122 - vgic_irq_set_phys_active(irq, false); 123 - } 92 + /* Handle resampling for mapped interrupts if required */ 93 + vgic_irq_handle_resampling(irq, deactivated, val & ICH_LR_PENDING_BIT); 124 94 125 95 raw_spin_unlock(&irq->irq_lock); 126 96 vgic_put_irq(vcpu->kvm, irq);

+38 -1

arch/arm64/kvm/vgic/vgic.c

··· 106 106 if (intid >= VGIC_MIN_LPI) 107 107 return vgic_get_lpi(kvm, intid); 108 108 109 - WARN(1, "Looking up struct vgic_irq for reserved INTID"); 110 109 return NULL; 111 110 } 112 111 ··· 1020 1021 vgic_put_irq(vcpu->kvm, irq); 1021 1022 1022 1023 return map_is_active; 1024 + } 1025 + 1026 + /* 1027 + * Level-triggered mapped IRQs are special because we only observe rising 1028 + * edges as input to the VGIC. 1029 + * 1030 + * If the guest never acked the interrupt we have to sample the physical 1031 + * line and set the line level, because the device state could have changed 1032 + * or we simply need to process the still pending interrupt later. 1033 + * 1034 + * We could also have entered the guest with the interrupt active+pending. 1035 + * On the next exit, we need to re-evaluate the pending state, as it could 1036 + * otherwise result in a spurious interrupt by injecting a now potentially 1037 + * stale pending state. 1038 + * 1039 + * If this causes us to lower the level, we have to also clear the physical 1040 + * active state, since we will otherwise never be told when the interrupt 1041 + * becomes asserted again. 1042 + * 1043 + * Another case is when the interrupt requires a helping hand on 1044 + * deactivation (no HW deactivation, for example). 1045 + */ 1046 + void vgic_irq_handle_resampling(struct vgic_irq *irq, 1047 + bool lr_deactivated, bool lr_pending) 1048 + { 1049 + if (vgic_irq_is_mapped_level(irq)) { 1050 + bool resample = false; 1051 + 1052 + if (unlikely(vgic_irq_needs_resampling(irq))) { 1053 + resample = !(irq->active || irq->pending_latch); 1054 + } else if (lr_pending || (lr_deactivated && irq->line_level)) { 1055 + irq->line_level = vgic_get_phys_line_level(irq); 1056 + resample = !irq->line_level; 1057 + } 1058 + 1059 + if (resample) 1060 + vgic_irq_set_phys_active(irq, false); 1061 + } 1023 1062 }
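
The decision tree in vgic_irq_handle_resampling() is easier to follow with the hardware accessors replaced by plain fields. The sketch below is a simplified model, not the kernel function: struct model_irq and both helpers are hypothetical, and the fields mapped_level, needs_resampling, phys_line and phys_active stand in for vgic_irq_is_mapped_level(), vgic_irq_needs_resampling(), vgic_get_phys_line_level() and the state changed by vgic_irq_set_phys_active().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state the resampling helper looks at. */
struct model_irq {
	bool mapped_level;      /* level-triggered and mapped to a HW interrupt */
	bool needs_resampling;  /* no HW deactivation, needs a helping hand */
	bool active;
	bool pending_latch;
	bool line_level;        /* cached view of the line */
	bool phys_line;         /* what sampling the physical line would return */
	bool phys_active;       /* physical active state */
};

/* Same branch structure as the function added in the hunk above. */
static void model_handle_resampling(struct model_irq *irq,
				    bool lr_deactivated, bool lr_pending)
{
	bool resample = false;

	assert(irq != NULL);
	if (!irq->mapped_level)
		return;

	if (irq->needs_resampling) {
		resample = !(irq->active || irq->pending_latch);
	} else if (lr_pending || (lr_deactivated && irq->line_level)) {
		/* Re-read the line so a stale pending state is not injected. */
		irq->line_level = irq->phys_line;
		resample = !irq->line_level;
	}

	/* Lowering the level also clears the physical active state. */
	if (resample)
		irq->phys_active = false;
}

/* Convenience wrapper: run the model on a copy, report phys_active. */
static bool model_phys_active_after(struct model_irq irq,
				    bool lr_deactivated, bool lr_pending)
{
	model_handle_resampling(&irq, lr_deactivated, lr_pending);
	return irq.phys_active;
}
```

The case new to this commit is the active+pending entry: the guest deactivates the interrupt while the cached line_level is still set, the line is sampled again, and if the device has since dropped it the physical active state is cleared.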

+2

arch/arm64/kvm/vgic/vgic.h

··· 169 169 bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, 170 170 unsigned long flags); 171 171 void vgic_kick_vcpus(struct kvm *kvm); 172 + void vgic_irq_handle_resampling(struct vgic_irq *irq, 173 + bool lr_deactivated, bool lr_pending); 172 174 173 175 int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr, 174 176 phys_addr_t addr, phys_addr_t alignment);

-4

arch/mips/kvm/mips.c

··· 41 41 const struct _kvm_stats_desc kvm_vm_stats_desc[] = { 42 42 KVM_GENERIC_VM_STATS() 43 43 }; 44 - static_assert(ARRAY_SIZE(kvm_vm_stats_desc) == 45 - sizeof(struct kvm_vm_stat) / sizeof(u64)); 46 44 47 45 const struct kvm_stats_header kvm_vm_stats_header = { 48 46 .name_size = KVM_STATS_NAME_SIZE, ··· 83 85 STATS_DESC_COUNTER(VCPU, vz_cpucfg_exits), 84 86 #endif 85 87 }; 86 - static_assert(ARRAY_SIZE(kvm_vcpu_stats_desc) == 87 - sizeof(struct kvm_vcpu_stat) / sizeof(u64)); 88 88 89 89 const struct kvm_stats_header kvm_vcpu_stats_header = { 90 90 .name_size = KVM_STATS_NAME_SIZE,

+1 -2

arch/mips/kvm/vz.c

··· 388 388 u32 compare, u32 cause) 389 389 { 390 390 u32 start_count, after_count; 391 - ktime_t freeze_time; 392 391 unsigned long flags; 393 392 394 393 /* ··· 395 396 * this with interrupts disabled to avoid latency. 396 397 */ 397 398 local_irq_save(flags); 398 - freeze_time = kvm_mips_freeze_hrtimer(vcpu, &start_count); 399 + kvm_mips_freeze_hrtimer(vcpu, &start_count); 399 400 write_c0_gtoffset(start_count - read_c0_count()); 400 401 local_irq_restore(flags); 401 402

-1

arch/powerpc/include/asm/kvm_host.h

··· 103 103 u64 emulated_inst_exits; 104 104 u64 dec_exits; 105 105 u64 ext_intr_exits; 106 - u64 halt_wait_ns; 107 106 u64 halt_successful_wait; 108 107 u64 dbell_exits; 109 108 u64 gdbell_exits;

-5

arch/powerpc/kvm/book3s.c

··· 43 43 STATS_DESC_ICOUNTER(VM, num_2M_pages), 44 44 STATS_DESC_ICOUNTER(VM, num_1G_pages) 45 45 }; 46 - static_assert(ARRAY_SIZE(kvm_vm_stats_desc) == 47 - sizeof(struct kvm_vm_stat) / sizeof(u64)); 48 46 49 47 const struct kvm_stats_header kvm_vm_stats_header = { 50 48 .name_size = KVM_STATS_NAME_SIZE, ··· 69 71 STATS_DESC_COUNTER(VCPU, emulated_inst_exits), 70 72 STATS_DESC_COUNTER(VCPU, dec_exits), 71 73 STATS_DESC_COUNTER(VCPU, ext_intr_exits), 72 - STATS_DESC_TIME_NSEC(VCPU, halt_wait_ns), 73 74 STATS_DESC_COUNTER(VCPU, halt_successful_wait), 74 75 STATS_DESC_COUNTER(VCPU, dbell_exits), 75 76 STATS_DESC_COUNTER(VCPU, gdbell_exits), ··· 85 88 STATS_DESC_COUNTER(VCPU, pthru_host), 86 89 STATS_DESC_COUNTER(VCPU, pthru_bad_aff) 87 90 }; 88 - static_assert(ARRAY_SIZE(kvm_vcpu_stats_desc) == 89 - sizeof(struct kvm_vcpu_stat) / sizeof(u64)); 90 91 91 92 const struct kvm_stats_header kvm_vcpu_stats_header = { 92 93 .name_size = KVM_STATS_NAME_SIZE,

+1 -1

arch/powerpc/kvm/book3s_64_vio.c

··· 346 346 unsigned long gfn = tce >> PAGE_SHIFT; 347 347 struct kvm_memory_slot *memslot; 348 348 349 - memslot = search_memslots(kvm_memslots(kvm), gfn); 349 + memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn); 350 350 if (!memslot) 351 351 return -EINVAL; 352 352

+1 -1

arch/powerpc/kvm/book3s_64_vio_hv.c

··· 80 80 unsigned long gfn = tce >> PAGE_SHIFT; 81 81 struct kvm_memory_slot *memslot; 82 82 83 - memslot = search_memslots(kvm_memslots_raw(kvm), gfn); 83 + memslot = __gfn_to_memslot(kvm_memslots_raw(kvm), gfn); 84 84 if (!memslot) 85 85 return -EINVAL; 86 86

+15 -3

arch/powerpc/kvm/book3s_hv.c

··· 4202 4202 4203 4203 /* Attribute wait time */ 4204 4204 if (do_sleep) { 4205 - vc->runner->stat.halt_wait_ns += 4205 + vc->runner->stat.generic.halt_wait_ns += 4206 4206 ktime_to_ns(cur) - ktime_to_ns(start_wait); 4207 + KVM_STATS_LOG_HIST_UPDATE( 4208 + vc->runner->stat.generic.halt_wait_hist, 4209 + ktime_to_ns(cur) - ktime_to_ns(start_wait)); 4207 4210 /* Attribute failed poll time */ 4208 - if (vc->halt_poll_ns) 4211 + if (vc->halt_poll_ns) { 4209 4212 vc->runner->stat.generic.halt_poll_fail_ns += 4210 4213 ktime_to_ns(start_wait) - 4211 4214 ktime_to_ns(start_poll); 4215 + KVM_STATS_LOG_HIST_UPDATE( 4216 + vc->runner->stat.generic.halt_poll_fail_hist, 4217 + ktime_to_ns(start_wait) - 4218 + ktime_to_ns(start_poll)); 4219 + } 4212 4220 } else { 4213 4221 /* Attribute successful poll time */ 4214 - if (vc->halt_poll_ns) 4222 + if (vc->halt_poll_ns) { 4215 4223 vc->runner->stat.generic.halt_poll_success_ns += 4216 4224 ktime_to_ns(cur) - 4217 4225 ktime_to_ns(start_poll); 4226 + KVM_STATS_LOG_HIST_UPDATE( 4227 + vc->runner->stat.generic.halt_poll_success_hist, 4228 + ktime_to_ns(cur) - ktime_to_ns(start_poll)); 4229 + } 4218 4230 } 4219 4231 4220 4232 /* Adjust poll time */
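
The hunk above only shows the call sites of KVM_STATS_LOG_HIST_UPDATE(), not its definition. As a rough sketch of what a logarithmic histogram update does, assuming a value lands in the bucket given by the position of its highest set bit and that oversized values are clamped into the last bucket, one could write the following. The names fls64_portable(), log_hist_bucket() and log_hist_update() are hypothetical and are not taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1-based position of the highest set bit; 0 when no bit is set. */
static int fls64_portable(uint64_t v)
{
	int pos = 0;

	while (v) {
		pos++;
		v >>= 1;
	}
	return pos;
}

/* Bucket index for a value, clamped to the last available bucket. */
static size_t log_hist_bucket(size_t nbuckets, uint64_t value)
{
	size_t index = (size_t)fls64_portable(value);

	assert(nbuckets > 0);
	if (index >= nbuckets)
		index = nbuckets - 1;
	return index;
}

/* Count one sample, e.g. a halt wait time in nanoseconds. */
static void log_hist_update(uint64_t *buckets, size_t nbuckets, uint64_t value)
{
	buckets[log_hist_bucket(nbuckets, value)]++;
}
```

Under these assumptions bucket 0 holds only zero, bucket 1 holds the value 1, and each later bucket covers twice the range of the one before it, which suits wait times that span several orders of magnitude.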

-5

arch/powerpc/kvm/booke.c

··· 41 41 STATS_DESC_ICOUNTER(VM, num_2M_pages), 42 42 STATS_DESC_ICOUNTER(VM, num_1G_pages) 43 43 }; 44 - static_assert(ARRAY_SIZE(kvm_vm_stats_desc) == 45 - sizeof(struct kvm_vm_stat) / sizeof(u64)); 46 44 47 45 const struct kvm_stats_header kvm_vm_stats_header = { 48 46 .name_size = KVM_STATS_NAME_SIZE, ··· 67 69 STATS_DESC_COUNTER(VCPU, emulated_inst_exits), 68 70 STATS_DESC_COUNTER(VCPU, dec_exits), 69 71 STATS_DESC_COUNTER(VCPU, ext_intr_exits), 70 - STATS_DESC_TIME_NSEC(VCPU, halt_wait_ns), 71 72 STATS_DESC_COUNTER(VCPU, halt_successful_wait), 72 73 STATS_DESC_COUNTER(VCPU, dbell_exits), 73 74 STATS_DESC_COUNTER(VCPU, gdbell_exits), ··· 76 79 STATS_DESC_COUNTER(VCPU, pthru_host), 77 80 STATS_DESC_COUNTER(VCPU, pthru_bad_aff) 78 81 }; 79 - static_assert(ARRAY_SIZE(kvm_vcpu_stats_desc) == 80 - sizeof(struct kvm_vcpu_stat) / sizeof(u64)); 81 82 82 83 const struct kvm_stats_header kvm_vcpu_stats_header = { 83 84 .name_size = KVM_STATS_NAME_SIZE,

+2

arch/s390/include/asm/kvm_host.h

··· 244 244 __u8 fpf; /* 0x0060 */ 245 245 #define ECB_GS 0x40 246 246 #define ECB_TE 0x10 247 + #define ECB_SPECI 0x08 247 248 #define ECB_SRSI 0x04 248 249 #define ECB_HOSTPROTINT 0x02 249 250 __u8 ecb; /* 0x0061 */ ··· 956 955 atomic64_t cmma_dirty_pages; 957 956 /* subset of available cpu features enabled by user space */ 958 957 DECLARE_BITMAP(cpu_feat, KVM_S390_VM_CPU_FEAT_NR_BITS); 958 + /* indexed by vcpu_idx */ 959 959 DECLARE_BITMAP(idle_mask, KVM_MAX_VCPUS); 960 960 struct kvm_s390_gisa_interrupt gisa_int; 961 961 struct kvm_s390_pv pv;

+6 -6

arch/s390/kvm/interrupt.c

··· 419 419 static void __set_cpu_idle(struct kvm_vcpu *vcpu) 420 420 { 421 421 kvm_s390_set_cpuflags(vcpu, CPUSTAT_WAIT); 422 - set_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask); 422 + set_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask); 423 423 } 424 424 425 425 static void __unset_cpu_idle(struct kvm_vcpu *vcpu) 426 426 { 427 427 kvm_s390_clear_cpuflags(vcpu, CPUSTAT_WAIT); 428 - clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask); 428 + clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask); 429 429 } 430 430 431 431 static void __reset_intercept_indicators(struct kvm_vcpu *vcpu) ··· 3050 3050 3051 3051 static void __airqs_kick_single_vcpu(struct kvm *kvm, u8 deliverable_mask) 3052 3052 { 3053 - int vcpu_id, online_vcpus = atomic_read(&kvm->online_vcpus); 3053 + int vcpu_idx, online_vcpus = atomic_read(&kvm->online_vcpus); 3054 3054 struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 3055 3055 struct kvm_vcpu *vcpu; 3056 3056 3057 - for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) { 3058 - vcpu = kvm_get_vcpu(kvm, vcpu_id); 3057 + for_each_set_bit(vcpu_idx, kvm->arch.idle_mask, online_vcpus) { 3058 + vcpu = kvm_get_vcpu(kvm, vcpu_idx); 3059 3059 if (psw_ioint_disabled(vcpu)) 3060 3060 continue; 3061 3061 deliverable_mask &= (u8)(vcpu->arch.sie_block->gcr[6] >> 24); 3062 3062 if (deliverable_mask) { 3063 3063 /* lately kicked but not yet running */ 3064 - if (test_and_set_bit(vcpu_id, gi->kicked_mask)) 3064 + if (test_and_set_bit(vcpu_idx, gi->kicked_mask)) 3065 3065 return; 3066 3066 kvm_s390_vcpu_wakeup(vcpu); 3067 3067 return;
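
The switch from vcpu_id to kvm_vcpu_get_idx() matters because kvm_get_vcpu() takes a creation index, while the id is chosen by userspace and need not match it. The toy model below makes the mismatch concrete. Everything in it is hypothetical: three vCPUs created with the ids 5, 0 and 3, a single-word idle mask, and helper names invented for the example.

```c
#include <assert.h>

#define DEMO_ONLINE 3

/* vCPU ids in creation order: index 0 has id 5, index 1 id 0, index 2 id 3. */
static const int demo_ids[DEMO_ONLINE] = { 5, 0, 3 };

/* Creation index for a given id, or -1 if no such vCPU exists. */
static int demo_idx_of(int vcpu_id)
{
	for (int i = 0; i < DEMO_ONLINE; i++)
		if (demo_ids[i] == vcpu_id)
			return i;
	return -1;
}

/* Mark a vCPU idle using its index, as the patched code does. */
static unsigned long demo_mark_idle(unsigned long idle_mask, int vcpu_id)
{
	int idx = demo_idx_of(vcpu_id);

	assert(idx >= 0);
	return idle_mask | (1UL << idx);
}

/* Scan the online bits and return the id of the first idle vCPU, or -1. */
static int demo_first_idle_id(unsigned long idle_mask)
{
	for (int idx = 0; idx < DEMO_ONLINE; idx++)
		if (idle_mask & (1UL << idx))
			return demo_ids[idx];
	return -1;
}
```

In this model, setting bit 3 for the vCPU with id 3 places the bit outside the scanned range, so the idle vCPU is never found, and setting bit 0 for the vCPU with id 0 makes the scan return the vCPU with id 5 instead. Indexing the mask by creation index avoids both outcomes.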

+5 -7

arch/s390/kvm/kvm-s390.c

··· 66 66 STATS_DESC_COUNTER(VM, inject_service_signal), 67 67 STATS_DESC_COUNTER(VM, inject_virtio) 68 68 }; 69 - static_assert(ARRAY_SIZE(kvm_vm_stats_desc) == 70 - sizeof(struct kvm_vm_stat) / sizeof(u64)); 71 69 72 70 const struct kvm_stats_header kvm_vm_stats_header = { 73 71 .name_size = KVM_STATS_NAME_SIZE, ··· 172 174 STATS_DESC_COUNTER(VCPU, instruction_diagnose_other), 173 175 STATS_DESC_COUNTER(VCPU, pfault_sync) 174 176 }; 175 - static_assert(ARRAY_SIZE(kvm_vcpu_stats_desc) == 176 - sizeof(struct kvm_vcpu_stat) / sizeof(u64)); 177 177 178 178 const struct kvm_stats_header kvm_vcpu_stats_header = { 179 179 .name_size = KVM_STATS_NAME_SIZE, ··· 1949 1953 static int gfn_to_memslot_approx(struct kvm_memslots *slots, gfn_t gfn) 1950 1954 { 1951 1955 int start = 0, end = slots->used_slots; 1952 - int slot = atomic_read(&slots->lru_slot); 1956 + int slot = atomic_read(&slots->last_used_slot); 1953 1957 struct kvm_memory_slot *memslots = slots->memslots; 1954 1958 1955 1959 if (gfn >= memslots[slot].base_gfn && ··· 1970 1974 1971 1975 if (gfn >= memslots[start].base_gfn && 1972 1976 gfn < memslots[start].base_gfn + memslots[start].npages) { 1973 - atomic_set(&slots->lru_slot, start); 1977 + atomic_set(&slots->last_used_slot, start); 1974 1978 } 1975 1979 1976 1980 return start; ··· 3220 3224 vcpu->arch.sie_block->ecb |= ECB_SRSI; 3221 3225 if (test_kvm_facility(vcpu->kvm, 73)) 3222 3226 vcpu->arch.sie_block->ecb |= ECB_TE; 3227 + if (!kvm_is_ucontrol(vcpu->kvm)) 3228 + vcpu->arch.sie_block->ecb |= ECB_SPECI; 3223 3229 3224 3230 if (test_kvm_facility(vcpu->kvm, 8) && vcpu->kvm->arch.use_pfmfi) 3225 3231 vcpu->arch.sie_block->ecb2 |= ECB2_PFMFI; ··· 4066 4068 kvm_s390_patch_guest_per_regs(vcpu); 4067 4069 } 4068 4070 4069 - clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.gisa_int.kicked_mask); 4071 + clear_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.gisa_int.kicked_mask); 4070 4072 4071 4073 vcpu->arch.sie_block->icptcode = 0; 4072 4074 cpuflags = 
atomic_read(&vcpu->arch.sie_block->cpuflags);

+1 -1

arch/s390/kvm/kvm-s390.h

··· 79 79 80 80 static inline int is_vcpu_idle(struct kvm_vcpu *vcpu) 81 81 { 82 - return test_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask); 82 + return test_bit(kvm_vcpu_get_idx(vcpu), vcpu->kvm->arch.idle_mask); 83 83 } 84 84 85 85 static inline int kvm_is_ucontrol(struct kvm *kvm)

+2

arch/s390/kvm/vsie.c

··· 510 510 prefix_unmapped(vsie_page); 511 511 scb_s->ecb |= ECB_TE; 512 512 } 513 + /* specification exception interpretation */ 514 + scb_s->ecb |= scb_o->ecb & ECB_SPECI; 513 515 /* branch prediction */ 514 516 if (test_kvm_facility(vcpu->kvm, 82)) 515 517 scb_s->fpf |= scb_o->fpf & FPF_BPBC;

-1

arch/x86/include/asm/kvm-x86-ops.h

··· 72 72 KVM_X86_OP(enable_irq_window) 73 73 KVM_X86_OP(update_cr8_intercept) 74 74 KVM_X86_OP(check_apicv_inhibit_reasons) 75 - KVM_X86_OP_NULL(pre_update_apicv_exec_ctrl) 76 75 KVM_X86_OP(refresh_apicv_exec_ctrl) 77 76 KVM_X86_OP(hwapic_irr_update) 78 77 KVM_X86_OP(hwapic_isr_update)

+45 -51

arch/x86/include/asm/kvm_host.h

··· 37 37 38 38 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS 39 39 40 - #define KVM_MAX_VCPUS 288 41 - #define KVM_SOFT_MAX_VCPUS 240 42 - #define KVM_MAX_VCPU_ID 1023 40 + #define KVM_MAX_VCPUS 1024 41 + #define KVM_SOFT_MAX_VCPUS 710 42 + 43 + /* 44 + * In x86, the VCPU ID corresponds to the APIC ID, and APIC IDs 45 + * might be larger than the actual number of VCPUs because the 46 + * APIC ID encodes CPU topology information. 47 + * 48 + * In the worst case, we'll need less than one extra bit for the 49 + * Core ID, and less than one extra bit for the Package (Die) ID, 50 + * so ratio of 4 should be enough. 51 + */ 52 + #define KVM_VCPU_ID_RATIO 4 53 + #define KVM_MAX_VCPU_ID (KVM_MAX_VCPUS * KVM_VCPU_ID_RATIO) 54 + 43 55 /* memory slots that are not exposed to userspace */ 44 56 #define KVM_PRIVATE_MEM_SLOTS 3 45 57 ··· 135 123 #define KVM_HPAGE_SIZE(x) (1UL << KVM_HPAGE_SHIFT(x)) 136 124 #define KVM_HPAGE_MASK(x) (~(KVM_HPAGE_SIZE(x) - 1)) 137 125 #define KVM_PAGES_PER_HPAGE(x) (KVM_HPAGE_SIZE(x) / PAGE_SIZE) 138 - 139 - static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) 140 - { 141 - /* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */ 142 - return (gfn >> KVM_HPAGE_GFN_SHIFT(level)) - 143 - (base_gfn >> KVM_HPAGE_GFN_SHIFT(level)); 144 - } 145 126 146 127 #define KVM_PERMILLE_MMU_PAGES 20 147 128 #define KVM_MIN_ALLOC_MMU_PAGES 64UL ··· 234 229 KVM_GUESTDBG_USE_HW_BP | \ 235 230 KVM_GUESTDBG_USE_SW_BP | \ 236 231 KVM_GUESTDBG_INJECT_BP | \ 237 - KVM_GUESTDBG_INJECT_DB) 232 + KVM_GUESTDBG_INJECT_DB | \ 233 + KVM_GUESTDBG_BLOCKIRQ) 238 234 239 235 240 236 #define PFERR_PRESENT_BIT 0 ··· 453 447 454 448 u64 *pae_root; 455 449 u64 *pml4_root; 450 + u64 *pml5_root; 456 451 457 452 /* 458 453 * check zero bits on shadow page table entries, these ··· 489 482 * ctrl value for fixed counters. 
490 483 */ 491 484 u64 current_config; 485 + bool is_paused; 492 486 }; 493 487 494 488 struct kvm_pmu { ··· 530 522 enum { 531 523 KVM_DEBUGREG_BP_ENABLED = 1, 532 524 KVM_DEBUGREG_WONT_EXIT = 2, 533 - KVM_DEBUGREG_RELOAD = 4, 534 525 }; 535 526 536 527 struct kvm_mtrr_range { ··· 730 723 731 724 u64 reserved_gpa_bits; 732 725 int maxphyaddr; 733 - int max_tdp_level; 734 726 735 727 /* emulate context */ 736 728 ··· 994 988 /* How many vCPUs have VP index != vCPU index */ 995 989 atomic_t num_mismatched_vp_indexes; 996 990 991 + /* 992 + * How many SynICs use 'AutoEOI' feature 993 + * (protected by arch.apicv_update_lock) 994 + */ 995 + unsigned int synic_auto_eoi_used; 996 + 997 997 struct hv_partition_assist_pg *hv_pa_pg; 998 998 struct kvm_hv_syndbg hv_syndbg; 999 999 }; ··· 1014 1002 /* Xen emulation context */ 1015 1003 struct kvm_xen { 1016 1004 bool long_mode; 1017 - bool shinfo_set; 1018 1005 u8 upcall_vector; 1019 - struct gfn_to_hva_cache shinfo_cache; 1006 + gfn_t shinfo_gfn; 1020 1007 }; 1021 1008 1022 1009 enum kvm_irqchip_mode { ··· 1071 1060 struct mutex apic_map_lock; 1072 1061 struct kvm_apic_map __rcu *apic_map; 1073 1062 atomic_t apic_map_dirty; 1063 + 1064 + /* Protects apic_access_memslot_enabled and apicv_inhibit_reasons */ 1065 + struct mutex apicv_update_lock; 1074 1066 1075 1067 bool apic_access_memslot_enabled; 1076 1068 unsigned long apicv_inhibit_reasons; ··· 1227 1213 u64 mmu_recycled; 1228 1214 u64 mmu_cache_miss; 1229 1215 u64 mmu_unsync; 1230 - u64 lpages; 1216 + union { 1217 + struct { 1218 + atomic64_t pages_4k; 1219 + atomic64_t pages_2m; 1220 + atomic64_t pages_1g; 1221 + }; 1222 + atomic64_t pages[KVM_NR_PAGE_SIZES]; 1223 + }; 1231 1224 u64 nx_lpage_splits; 1232 1225 u64 max_mmu_page_hash_collisions; 1226 + u64 max_mmu_rmap_size; 1233 1227 }; 1234 1228 1235 1229 struct kvm_vcpu_stat { ··· 1381 1359 void (*enable_irq_window)(struct kvm_vcpu *vcpu); 1382 1360 void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr); 
1383 1361 bool (*check_apicv_inhibit_reasons)(ulong bit); 1384 - void (*pre_update_apicv_exec_ctrl)(struct kvm *kvm, bool activate); 1385 1362 void (*refresh_apicv_exec_ctrl)(struct kvm_vcpu *vcpu); 1386 1363 void (*hwapic_irr_update)(struct kvm_vcpu *vcpu, int max_irr); 1387 1364 void (*hwapic_isr_update)(struct kvm_vcpu *vcpu, int isr); ··· 1564 1543 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu); 1565 1544 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu); 1566 1545 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 1567 - struct kvm_memory_slot *memslot, 1546 + const struct kvm_memory_slot *memslot, 1568 1547 int start_level); 1569 1548 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, 1570 1549 const struct kvm_memory_slot *memslot); 1571 1550 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, 1572 - struct kvm_memory_slot *memslot); 1551 + const struct kvm_memory_slot *memslot); 1573 1552 void kvm_mmu_zap_all(struct kvm *kvm); 1574 1553 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen); 1575 1554 unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm); ··· 1765 1744 void kvm_request_apicv_update(struct kvm *kvm, bool activate, 1766 1745 unsigned long bit); 1767 1746 1747 + void __kvm_request_apicv_update(struct kvm *kvm, bool activate, 1748 + unsigned long bit); 1749 + 1768 1750 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); 1769 1751 1770 1752 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, ··· 1778 1754 void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid); 1779 1755 void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd); 1780 1756 1781 - void kvm_configure_mmu(bool enable_tdp, int tdp_max_root_level, 1782 - int tdp_huge_page_level); 1757 + void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level, 1758 + int tdp_max_root_level, int tdp_huge_page_level); 1783 1759 1784 1760 static inline u16 kvm_read_ldt(void) 1785 1761 { ··· 1802 1778 return value; 
1803 1779 } 1804 1780 #endif 1805 - 1806 - static inline u32 get_rdx_init_val(void) 1807 - { 1808 - return 0x600; /* P6 family */ 1809 - } 1810 1781 1811 1782 static inline void kvm_inject_gp(struct kvm_vcpu *vcpu, u32 error_code) 1812 1783 { ··· 1834 1815 1835 1816 #define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0) 1836 1817 #define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm) 1837 - 1838 - asmlinkage void kvm_spurious_fault(void); 1839 - 1840 - /* 1841 - * Hardware virtualization extension instructions may fault if a 1842 - * reboot turns off virtualization while processes are running. 1843 - * Usually after catching the fault we just panic; during reboot 1844 - * instead the instruction is ignored. 1845 - */ 1846 - #define __kvm_handle_fault_on_reboot(insn) \ 1847 - "666: \n\t" \ 1848 - insn "\n\t" \ 1849 - "jmp 668f \n\t" \ 1850 - "667: \n\t" \ 1851 - "1: \n\t" \ 1852 - ".pushsection .discard.instr_begin \n\t" \ 1853 - ".long 1b - . \n\t" \ 1854 - ".popsection \n\t" \ 1855 - "call kvm_spurious_fault \n\t" \ 1856 - "1: \n\t" \ 1857 - ".pushsection .discard.instr_end \n\t" \ 1858 - ".long 1b - . \n\t" \ 1859 - ".popsection \n\t" \ 1860 - "668: \n\t" \ 1861 - _ASM_EXTABLE(666b, 667b) 1862 1818 1863 1819 #define KVM_ARCH_WANT_MMU_NOTIFIER 1864 1820

+1

arch/x86/include/uapi/asm/kvm.h

@@ -295,6 +295,7 @@
 #define KVM_GUESTDBG_USE_HW_BP		0x00020000
 #define KVM_GUESTDBG_INJECT_DB		0x00040000
 #define KVM_GUESTDBG_INJECT_BP		0x00080000
+#define KVM_GUESTDBG_BLOCKIRQ		0x00100000

 /* for KVM_SET_GUEST_DEBUG */
 struct kvm_guest_debug_arch {

+3 -2

arch/x86/kernel/kvm.c

@@ -884,9 +884,10 @@
 	} else {
 		local_irq_disable();

+		/* safe_halt() will enable IRQ */
 		if (READ_ONCE(*ptr) == val)
 			safe_halt();
-
-		local_irq_enable();
+		else
+			local_irq_enable();
 	}
 }
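The kvm.c change above relies on safe_halt() re-enabling interrupts itself, as the added comment states, so local_irq_enable() is only needed on the path that skips the halt. The control flow can be sketched as a small userspace model. Everything prefixed with `model_` is a hypothetical stand-in that only tracks a flag and a call counter; it is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state, not kernel code. */
static bool irqs_enabled = true;
static int enable_calls;

static void model_irq_disable(void)
{
	irqs_enabled = false;
}

static void model_irq_enable(void)
{
	irqs_enabled = true;
	enable_calls++;
}

/* Stand-in for safe_halt(): modelled as enabling IRQs, then halting. */
static void model_safe_halt(void)
{
	model_irq_enable();
	/* ...the real primitive would halt the CPU here... */
}

/* Mirrors the patched branch: IRQs are enabled exactly once on either path. */
static void model_wait(const int *ptr, int val)
{
	enable_calls = 0;
	model_irq_disable();

	if (*ptr == val)
		model_safe_halt();
	else
		model_irq_enable();

	assert(irqs_enabled);
}
```

With the pre-patch ordering, the halt path in this model would record two enable calls instead of one, which is the redundancy the hunk removes.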

+111

arch/x86/kvm/debugfs.c

··· 7 7 #include <linux/kvm_host.h> 8 8 #include <linux/debugfs.h> 9 9 #include "lapic.h" 10 + #include "mmu.h" 11 + #include "mmu/mmu_internal.h" 10 12 11 13 static int vcpu_get_timer_advance_ns(void *data, u64 *val) 12 14 { ··· 74 72 debugfs_dentry, vcpu, 75 73 &vcpu_tsc_scaling_frac_fops); 76 74 } 75 + } 76 + 77 + /* 78 + * This covers statistics <1024 (11=log(1024)+1), which should be enough to 79 + * cover RMAP_RECYCLE_THRESHOLD. 80 + */ 81 + #define RMAP_LOG_SIZE 11 82 + 83 + static const char *kvm_lpage_str[KVM_NR_PAGE_SIZES] = { "4K", "2M", "1G" }; 84 + 85 + static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v) 86 + { 87 + struct kvm_rmap_head *rmap; 88 + struct kvm *kvm = m->private; 89 + struct kvm_memory_slot *slot; 90 + struct kvm_memslots *slots; 91 + unsigned int lpage_size, index; 92 + /* Still small enough to be on the stack */ 93 + unsigned int *log[KVM_NR_PAGE_SIZES], *cur; 94 + int i, j, k, l, ret; 95 + 96 + ret = -ENOMEM; 97 + memset(log, 0, sizeof(log)); 98 + for (i = 0; i < KVM_NR_PAGE_SIZES; i++) { 99 + log[i] = kcalloc(RMAP_LOG_SIZE, sizeof(unsigned int), GFP_KERNEL); 100 + if (!log[i]) 101 + goto out; 102 + } 103 + 104 + mutex_lock(&kvm->slots_lock); 105 + write_lock(&kvm->mmu_lock); 106 + 107 + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 108 + slots = __kvm_memslots(kvm, i); 109 + for (j = 0; j < slots->used_slots; j++) { 110 + slot = &slots->memslots[j]; 111 + for (k = 0; k < KVM_NR_PAGE_SIZES; k++) { 112 + rmap = slot->arch.rmap[k]; 113 + lpage_size = kvm_mmu_slot_lpages(slot, k + 1); 114 + cur = log[k]; 115 + for (l = 0; l < lpage_size; l++) { 116 + index = ffs(pte_list_count(&rmap[l])); 117 + if (WARN_ON_ONCE(index >= RMAP_LOG_SIZE)) 118 + index = RMAP_LOG_SIZE - 1; 119 + cur[index]++; 120 + } 121 + } 122 + } 123 + } 124 + 125 + write_unlock(&kvm->mmu_lock); 126 + mutex_unlock(&kvm->slots_lock); 127 + 128 + /* index=0 counts no rmap; index=1 counts 1 rmap */ 129 + seq_printf(m, "Rmap_Count:\t0\t1\t"); 130 + for (i = 2; i < 
RMAP_LOG_SIZE; i++) { 131 + j = 1 << (i - 1); 132 + k = (1 << i) - 1; 133 + seq_printf(m, "%d-%d\t", j, k); 134 + } 135 + seq_printf(m, "\n"); 136 + 137 + for (i = 0; i < KVM_NR_PAGE_SIZES; i++) { 138 + seq_printf(m, "Level=%s:\t", kvm_lpage_str[i]); 139 + cur = log[i]; 140 + for (j = 0; j < RMAP_LOG_SIZE; j++) 141 + seq_printf(m, "%d\t", cur[j]); 142 + seq_printf(m, "\n"); 143 + } 144 + 145 + ret = 0; 146 + out: 147 + for (i = 0; i < KVM_NR_PAGE_SIZES; i++) 148 + kfree(log[i]); 149 + 150 + return ret; 151 + } 152 + 153 + static int kvm_mmu_rmaps_stat_open(struct inode *inode, struct file *file) 154 + { 155 + struct kvm *kvm = inode->i_private; 156 + 157 + if (!kvm_get_kvm_safe(kvm)) 158 + return -ENOENT; 159 + 160 + return single_open(file, kvm_mmu_rmaps_stat_show, kvm); 161 + } 162 + 163 + static int kvm_mmu_rmaps_stat_release(struct inode *inode, struct file *file) 164 + { 165 + struct kvm *kvm = inode->i_private; 166 + 167 + kvm_put_kvm(kvm); 168 + 169 + return single_release(inode, file); 170 + } 171 + 172 + static const struct file_operations mmu_rmaps_stat_fops = { 173 + .open = kvm_mmu_rmaps_stat_open, 174 + .read = seq_read, 175 + .llseek = seq_lseek, 176 + .release = kvm_mmu_rmaps_stat_release, 177 + }; 178 + 179 + int kvm_arch_create_vm_debugfs(struct kvm *kvm) 180 + { 181 + debugfs_create_file("mmu_rmaps_stat", 0644, kvm->debugfs_dentry, kvm, 182 + &mmu_rmaps_stat_fops); 183 + return 0; 77 184 }
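The header row printed by kvm_mmu_rmaps_stat_show() labels the buckets 0, 1, 2-3, 4-7, and so on up to 512-1023. A minimal sketch of a bucketing function that matches those labels follows, assuming the bucket is chosen by the position of the highest set bit of the rmap count. `highest_bit`, `rmap_bucket` and `bucket_range` are hypothetical names for illustration. Note that the hunk itself calls ffs(), which reports the lowest set bit; the two agree only when the count is 0 or a power of two, so treat this as an illustration of the printed ranges rather than of the patch's exact behavior.

```c
#include <assert.h>

#define RMAP_LOG_SIZE 11	/* same value as in the hunk above */

/* 1-based position of the highest set bit; 0 when x == 0. */
static unsigned int highest_bit(unsigned int x)
{
	unsigned int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Bucket index for an rmap holding `count` entries, clamped to the last bucket. */
static unsigned int rmap_bucket(unsigned int count)
{
	unsigned int index = highest_bit(count);

	if (index >= RMAP_LOG_SIZE)
		index = RMAP_LOG_SIZE - 1;
	return index;
}

/* Inclusive range covered by bucket i (2 <= i < RMAP_LOG_SIZE), as in the header row. */
static void bucket_range(unsigned int i, unsigned int *lo, unsigned int *hi)
{
	assert(i >= 2 && i < RMAP_LOG_SIZE);
	*lo = 1u << (i - 1);
	*hi = (1u << i) - 1;
}
```

For example, a count of 3 lands in bucket 2 (range 2-3) here, whereas the lowest-set-bit position of 3 is 1.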

+26 -6

arch/x86/kvm/hyperv.c

··· 88 88 static void synic_update_vector(struct kvm_vcpu_hv_synic *synic, 89 89 int vector) 90 90 { 91 + struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic); 92 + struct kvm_hv *hv = to_kvm_hv(vcpu->kvm); 93 + int auto_eoi_old, auto_eoi_new; 94 + 91 95 if (vector < HV_SYNIC_FIRST_VALID_VECTOR) 92 96 return; 93 97 ··· 100 96 else 101 97 __clear_bit(vector, synic->vec_bitmap); 102 98 99 + auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256); 100 + 103 101 if (synic_has_vector_auto_eoi(synic, vector)) 104 102 __set_bit(vector, synic->auto_eoi_bitmap); 105 103 else 106 104 __clear_bit(vector, synic->auto_eoi_bitmap); 105 + 106 + auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256); 107 + 108 + if (!!auto_eoi_old == !!auto_eoi_new) 109 + return; 110 + 111 + mutex_lock(&vcpu->kvm->arch.apicv_update_lock); 112 + 113 + if (auto_eoi_new) 114 + hv->synic_auto_eoi_used++; 115 + else 116 + hv->synic_auto_eoi_used--; 117 + 118 + __kvm_request_apicv_update(vcpu->kvm, 119 + !hv->synic_auto_eoi_used, 120 + APICV_INHIBIT_REASON_HYPERV); 121 + 122 + mutex_unlock(&vcpu->kvm->arch.apicv_update_lock); 107 123 } 108 124 109 125 static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint, ··· 957 933 958 934 synic = to_hv_synic(vcpu); 959 935 960 - /* 961 - * Hyper-V SynIC auto EOI SINT's are 962 - * not compatible with APICV, so request 963 - * to deactivate APICV permanently. 964 - */ 965 - kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV); 966 936 synic->active = true; 967 937 synic->dont_zero_synic_pages = dont_zero_synic_pages; 968 938 synic->control = HV_SYNIC_CONTROL_ENABLE; ··· 2494 2476 ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED; 2495 2477 if (!cpu_smt_possible()) 2496 2478 ent->eax |= HV_X64_NO_NONARCH_CORESHARING; 2479 + 2480 + ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED; 2497 2481 /* 2498 2482 * Default number of spinlock retry attempts, matches 2499 2483 * HyperV 2016.
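The synic_update_vector() change above replaces the permanent APICv inhibit with a per-VM count of SynICs that currently have at least one AutoEOI vector: the count changes only when a SynIC's AutoEOI bitmap goes from empty to non-empty or back, and APICv is allowed again once the count returns to zero. A minimal userspace sketch of that bookkeeping follows. The `model_` types and function are hypothetical, the bitmap weight is reduced to a plain integer, and the apicv_update_lock locking from the hunk is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-VM state touched by the hunk. */
struct model_vm {
	unsigned int synic_auto_eoi_used;	/* SynICs with >= 1 AutoEOI vector */
	bool apicv_inhibited_hyperv;		/* models APICV_INHIBIT_REASON_HYPERV */
};

struct model_synic {
	struct model_vm *vm;
	unsigned int auto_eoi_vectors;		/* models bitmap_weight(auto_eoi_bitmap, 256) */
};

/* Apply a new AutoEOI vector count and update the VM-wide inhibit state. */
static void model_set_auto_eoi_weight(struct model_synic *synic,
				      unsigned int new_weight)
{
	unsigned int old_weight = synic->auto_eoi_vectors;

	synic->auto_eoi_vectors = new_weight;

	/* Only empty <-> non-empty transitions matter. */
	if (!!old_weight == !!new_weight)
		return;

	if (new_weight) {
		synic->vm->synic_auto_eoi_used++;
	} else {
		assert(synic->vm->synic_auto_eoi_used > 0);
		synic->vm->synic_auto_eoi_used--;
	}

	/* The hunk activates APICv for this reason only when no SynIC uses AutoEOI. */
	synic->vm->apicv_inhibited_hyperv = synic->vm->synic_auto_eoi_used != 0;
}
```

The design point the sketch tries to show is that the inhibit is reference counted across SynICs rather than set once and never cleared.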

+2 -1

arch/x86/kvm/i8254.c

@@ -220,7 +220,8 @@
 	struct kvm_pit *pit = vcpu->kvm->arch.vpit;
 	struct hrtimer *timer;

-	if (!kvm_vcpu_is_bsp(vcpu) || !pit)
+	/* Somewhat arbitrarily make vcpu0 the owner of the PIT. */
+	if (vcpu->vcpu_id || !pit)
 		return;

 	timer = &pit->pit_state.timer;

-4

arch/x86/kvm/ioapic.h

@@ -35,11 +35,7 @@
 #define IOAPIC_INIT		0x5
 #define IOAPIC_EXTINT		0x7

-#ifdef CONFIG_X86
 #define RTC_GSI 8
-#else
-#define RTC_GSI -1U
-#endif

 struct dest_map {
 	/* vcpu bitmap where IRQ has been sent */

+13 -13

arch/x86/kvm/lapic.c

··· 192 192 if (atomic_read_acquire(&kvm->arch.apic_map_dirty) == CLEAN) 193 193 return; 194 194 195 + WARN_ONCE(!irqchip_in_kernel(kvm), 196 + "Dirty APIC map without an in-kernel local APIC"); 197 + 195 198 mutex_lock(&kvm->arch.apic_map_lock); 196 199 /* 197 200 * Read kvm->arch.apic_map_dirty before kvm->arch.apic_map ··· 2268 2265 u64 old_value = vcpu->arch.apic_base; 2269 2266 struct kvm_lapic *apic = vcpu->arch.apic; 2270 2267 2271 - if (!apic) 2272 - value |= MSR_IA32_APICBASE_BSP; 2273 - 2274 2268 vcpu->arch.apic_base = value; 2275 2269 2276 2270 if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE) ··· 2323 2323 struct kvm_lapic *apic = vcpu->arch.apic; 2324 2324 int i; 2325 2325 2326 + if (!init_event) { 2327 + vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE | 2328 + MSR_IA32_APICBASE_ENABLE; 2329 + if (kvm_vcpu_is_reset_bsp(vcpu)) 2330 + vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP; 2331 + } 2332 + 2326 2333 if (!apic) 2327 2334 return; 2328 2335 ··· 2337 2330 hrtimer_cancel(&apic->lapic_timer.timer); 2338 2331 2339 2332 if (!init_event) { 2340 - kvm_lapic_set_base(vcpu, APIC_DEFAULT_PHYS_BASE | 2341 - MSR_IA32_APICBASE_ENABLE); 2333 + apic->base_address = APIC_DEFAULT_PHYS_BASE; 2334 + 2342 2335 kvm_apic_set_xapic_id(apic, vcpu->vcpu_id); 2343 2336 } 2344 2337 kvm_apic_set_version(apic->vcpu); ··· 2371 2364 apic->highest_isr_cache = -1; 2372 2365 update_divide_count(apic); 2373 2366 atomic_set(&apic->lapic_timer.pending, 0); 2374 - if (kvm_vcpu_is_bsp(vcpu)) 2375 - kvm_lapic_set_base(vcpu, 2376 - vcpu->arch.apic_base | MSR_IA32_APICBASE_BSP); 2367 + 2377 2368 vcpu->arch.pv_eoi.msr_val = 0; 2378 2369 apic_update_ppr(apic); 2379 2370 if (vcpu->arch.apicv_active) { ··· 2481 2476 lapic_timer_advance_dynamic = false; 2482 2477 } 2483 2478 2484 - /* 2485 - * APIC is created enabled. This will prevent kvm_lapic_set_base from 2486 - * thinking that APIC state has changed. 
2487 - */ 2488 - vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE; 2489 2479 static_branch_inc(&apic_sw_disabled.key); /* sw disabled at reset */ 2490 2480 kvm_iodevice_init(&apic->dev, &apic_mmio_ops); 2491 2481

+25

arch/x86/kvm/mmu.h

··· 240 240 return smp_load_acquire(&kvm->arch.memslots_have_rmaps); 241 241 } 242 242 243 + static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) 244 + { 245 + /* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */ 246 + return (gfn >> KVM_HPAGE_GFN_SHIFT(level)) - 247 + (base_gfn >> KVM_HPAGE_GFN_SHIFT(level)); 248 + } 249 + 250 + static inline unsigned long 251 + __kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, unsigned long npages, 252 + int level) 253 + { 254 + return gfn_to_index(slot->base_gfn + npages - 1, 255 + slot->base_gfn, level) + 1; 256 + } 257 + 258 + static inline unsigned long 259 + kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, int level) 260 + { 261 + return __kvm_mmu_slot_lpages(slot, slot->npages, level); 262 + } 263 + 264 + static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count) 265 + { 266 + atomic64_add(count, &kvm->stat.pages[level - 1]); 267 + } 243 268 #endif
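The helpers moved into mmu.h above compute how many pages of a given level a memslot touches: gfn_to_index() shifts both the GFN and the slot base down to the granularity of the level, and __kvm_mmu_slot_lpages() takes the index of the slot's last GFN plus one. A worked userspace sketch follows. It assumes the x86 shift of 9 GFN bits per level (that macro definition is not part of the text shown here), and `slot_lpages` is a hypothetical flattening that takes the base GFN directly instead of a struct kvm_memory_slot.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumed x86 definition: level 1 is 4K, each higher level covers 9 more GFN bits. */
#define KVM_HPAGE_GFN_SHIFT(level)	(((level) - 1) * 9)

/* Index of gfn within a slot starting at base_gfn, in units of `level` pages. */
static gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
{
	return (gfn >> KVM_HPAGE_GFN_SHIFT(level)) -
	       (base_gfn >> KVM_HPAGE_GFN_SHIFT(level));
}

/* Number of `level` pages needed to cover npages 4K pages starting at base_gfn. */
static unsigned long slot_lpages(gfn_t base_gfn, unsigned long npages, int level)
{
	assert(npages > 0);
	return gfn_to_index(base_gfn + npages - 1, base_gfn, level) + 1;
}
```

A slot of 1024 4K pages starting at GFN 0x100 is not 2M-aligned, so it straddles three 2M regions rather than two; the same slot starting at GFN 0x200 covers exactly two.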

+328 -194

arch/x86/kvm/mmu/mmu.c

··· 97 97 bool tdp_enabled = false; 98 98 99 99 static int max_huge_page_level __read_mostly; 100 + static int tdp_root_level __read_mostly; 100 101 static int max_tdp_level __read_mostly; 101 102 102 103 enum { ··· 138 137 139 138 #include <trace/events/kvm.h> 140 139 141 - /* make pte_list_desc fit well in cache line */ 142 - #define PTE_LIST_EXT 3 140 + /* make pte_list_desc fit well in cache lines */ 141 + #define PTE_LIST_EXT 14 143 142 143 + /* 144 + * Slight optimization of cacheline layout, by putting `more' and `spte_count' 145 + * at the start; then accessing it will only use one single cacheline for 146 + * either full (entries==PTE_LIST_EXT) case or entries<=6. 147 + */ 144 148 struct pte_list_desc { 145 - u64 *sptes[PTE_LIST_EXT]; 146 149 struct pte_list_desc *more; 150 + /* 151 + * Stores number of entries stored in the pte_list_desc. No need to be 152 + * u64 but just for easier alignment. When PTE_LIST_EXT, means full. 153 + */ 154 + u64 spte_count; 155 + u64 *sptes[PTE_LIST_EXT]; 147 156 }; 148 157 149 158 struct kvm_shadow_walk_iterator { ··· 204 193 * the single source of truth for the MMU's state. 205 194 */ 206 195 #define BUILD_MMU_ROLE_REGS_ACCESSOR(reg, name, flag) \ 207 - static inline bool ____is_##reg##_##name(struct kvm_mmu_role_regs *regs)\ 196 + static inline bool __maybe_unused ____is_##reg##_##name(struct kvm_mmu_role_regs *regs)\ 208 197 { \ 209 198 return !!(regs->reg & flag); \ 210 199 } ··· 226 215 * and the vCPU may be incorrect/irrelevant. 227 216 */ 228 217 #define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name) \ 229 - static inline bool is_##reg##_##name(struct kvm_mmu *mmu) \ 218 + static inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu) \ 230 219 { \ 231 220 return !!(mmu->mmu_role. base_or_ext . 
reg##_##name); \ 232 221 } ··· 334 323 static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access, 335 324 struct x86_exception *exception) 336 325 { 337 - /* Check if guest physical address doesn't exceed guest maximum */ 338 - if (kvm_vcpu_is_illegal_gpa(vcpu, gpa)) { 339 - exception->error_code |= PFERR_RSVD_MASK; 340 - return UNMAPPED_GVA; 341 - } 342 - 343 326 return gpa; 344 327 } 345 328 ··· 597 592 * Rules for using mmu_spte_clear_track_bits: 598 593 * It sets the sptep from present to nonpresent, and track the 599 594 * state bits, it is used to clear the last level sptep. 600 - * Returns non-zero if the PTE was previously valid. 595 + * Returns the old PTE. 601 596 */ 602 - static int mmu_spte_clear_track_bits(u64 *sptep) 597 + static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep) 603 598 { 604 599 kvm_pfn_t pfn; 605 600 u64 old_spte = *sptep; 601 + int level = sptep_to_sp(sptep)->role.level; 606 602 607 603 if (!spte_has_volatile_bits(old_spte)) 608 604 __update_clear_spte_fast(sptep, 0ull); ··· 611 605 old_spte = __update_clear_spte_slow(sptep, 0ull); 612 606 613 607 if (!is_shadow_present_pte(old_spte)) 614 - return 0; 608 + return old_spte; 609 + 610 + kvm_update_page_stats(kvm, level, -1); 615 611 616 612 pfn = spte_to_pfn(old_spte); 617 613 ··· 630 622 if (is_dirty_spte(old_spte)) 631 623 kvm_set_pfn_dirty(pfn); 632 624 633 - return 1; 625 + return old_spte; 634 626 } 635 627 636 628 /* ··· 694 686 695 687 static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu) 696 688 { 697 - /* 698 - * Prevent page table teardown by making any free-er wait during 699 - * kvm_flush_remote_tlbs() IPI to all active vcpus. 700 - */ 701 - local_irq_disable(); 689 + if (is_tdp_mmu(vcpu->arch.mmu)) { 690 + kvm_tdp_mmu_walk_lockless_begin(); 691 + } else { 692 + /* 693 + * Prevent page table teardown by making any free-er wait during 694 + * kvm_flush_remote_tlbs() IPI to all active vcpus. 
695 + */ 696 + local_irq_disable(); 702 697 703 - /* 704 - * Make sure a following spte read is not reordered ahead of the write 705 - * to vcpu->mode. 706 - */ 707 - smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES); 698 + /* 699 + * Make sure a following spte read is not reordered ahead of the write 700 + * to vcpu->mode. 701 + */ 702 + smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES); 703 + } 708 704 } 709 705 710 706 static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) 711 707 { 712 - /* 713 - * Make sure the write to vcpu->mode is not reordered in front of 714 - * reads to sptes. If it does, kvm_mmu_commit_zap_page() can see us 715 - * OUTSIDE_GUEST_MODE and proceed to free the shadow page table. 716 - */ 717 - smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE); 718 - local_irq_enable(); 708 + if (is_tdp_mmu(vcpu->arch.mmu)) { 709 + kvm_tdp_mmu_walk_lockless_end(); 710 + } else { 711 + /* 712 + * Make sure the write to vcpu->mode is not reordered in front of 713 + * reads to sptes. If it does, kvm_mmu_commit_zap_page() can see us 714 + * OUTSIDE_GUEST_MODE and proceed to free the shadow page table. 
715 + */ 716 + smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE); 717 + local_irq_enable(); 718 + } 719 719 } 720 720 721 721 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) ··· 802 786 return &slot->arch.lpage_info[level - 2][idx]; 803 787 } 804 788 805 - static void update_gfn_disallow_lpage_count(struct kvm_memory_slot *slot, 789 + static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot, 806 790 gfn_t gfn, int count) 807 791 { 808 792 struct kvm_lpage_info *linfo; ··· 815 799 } 816 800 } 817 801 818 - void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn) 802 + void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn) 819 803 { 820 804 update_gfn_disallow_lpage_count(slot, gfn, 1); 821 805 } 822 806 823 - void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn) 807 + void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn) 824 808 { 825 809 update_gfn_disallow_lpage_count(slot, gfn, -1); 826 810 } ··· 909 893 struct kvm_rmap_head *rmap_head) 910 894 { 911 895 struct pte_list_desc *desc; 912 - int i, count = 0; 896 + int count = 0; 913 897 914 898 if (!rmap_head->val) { 915 899 rmap_printk("%p %llx 0->1\n", spte, *spte); ··· 919 903 desc = mmu_alloc_pte_list_desc(vcpu); 920 904 desc->sptes[0] = (u64 *)rmap_head->val; 921 905 desc->sptes[1] = spte; 906 + desc->spte_count = 2; 922 907 rmap_head->val = (unsigned long)desc | 1; 923 908 ++count; 924 909 } else { 925 910 rmap_printk("%p %llx many->many\n", spte, *spte); 926 911 desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); 927 - while (desc->sptes[PTE_LIST_EXT-1]) { 912 + while (desc->spte_count == PTE_LIST_EXT) { 928 913 count += PTE_LIST_EXT; 929 - 930 914 if (!desc->more) { 931 915 desc->more = mmu_alloc_pte_list_desc(vcpu); 932 916 desc = desc->more; 917 + desc->spte_count = 0; 933 918 break; 934 919 } 935 920 desc = desc->more; 936 921 } 937 - for (i = 0; desc->sptes[i]; ++i) 938 
- ++count; 939 - desc->sptes[i] = spte; 922 + count += desc->spte_count; 923 + desc->sptes[desc->spte_count++] = spte; 940 924 } 941 925 return count; 942 926 } ··· 946 930 struct pte_list_desc *desc, int i, 947 931 struct pte_list_desc *prev_desc) 948 932 { 949 - int j; 933 + int j = desc->spte_count - 1; 950 934 951 - for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j) 952 - ; 953 935 desc->sptes[i] = desc->sptes[j]; 954 936 desc->sptes[j] = NULL; 955 - if (j != 0) 937 + desc->spte_count--; 938 + if (desc->spte_count) 956 939 return; 957 940 if (!prev_desc && !desc->more) 958 941 rmap_head->val = 0; ··· 984 969 desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); 985 970 prev_desc = NULL; 986 971 while (desc) { 987 - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) { 972 + for (i = 0; i < desc->spte_count; ++i) { 988 973 if (desc->sptes[i] == spte) { 989 974 pte_list_desc_remove_entry(rmap_head, 990 975 desc, i, prev_desc); ··· 999 984 } 1000 985 } 1001 986 1002 - static void pte_list_remove(struct kvm_rmap_head *rmap_head, u64 *sptep) 987 + static void pte_list_remove(struct kvm *kvm, struct kvm_rmap_head *rmap_head, 988 + u64 *sptep) 1003 989 { 1004 - mmu_spte_clear_track_bits(sptep); 990 + mmu_spte_clear_track_bits(kvm, sptep); 1005 991 __pte_list_remove(sptep, rmap_head); 1006 992 } 1007 993 1008 - static struct kvm_rmap_head *__gfn_to_rmap(gfn_t gfn, int level, 1009 - struct kvm_memory_slot *slot) 994 + /* Return true if rmap existed, false otherwise */ 995 + static bool pte_list_destroy(struct kvm *kvm, struct kvm_rmap_head *rmap_head) 996 + { 997 + struct pte_list_desc *desc, *next; 998 + int i; 999 + 1000 + if (!rmap_head->val) 1001 + return false; 1002 + 1003 + if (!(rmap_head->val & 1)) { 1004 + mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val); 1005 + goto out; 1006 + } 1007 + 1008 + desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); 1009 + 1010 + for (; desc; desc = next) { 1011 + for (i = 0; i < desc->spte_count; i++) 1012 + 
mmu_spte_clear_track_bits(kvm, desc->sptes[i]); 1013 + next = desc->more; 1014 + mmu_free_pte_list_desc(desc); 1015 + } 1016 + out: 1017 + /* rmap_head is meaningless now, remember to reset it */ 1018 + rmap_head->val = 0; 1019 + return true; 1020 + } 1021 + 1022 + unsigned int pte_list_count(struct kvm_rmap_head *rmap_head) 1023 + { 1024 + struct pte_list_desc *desc; 1025 + unsigned int count = 0; 1026 + 1027 + if (!rmap_head->val) 1028 + return 0; 1029 + else if (!(rmap_head->val & 1)) 1030 + return 1; 1031 + 1032 + desc = (struct pte_list_desc *)(rmap_head->val & ~1ul); 1033 + 1034 + while (desc) { 1035 + count += desc->spte_count; 1036 + desc = desc->more; 1037 + } 1038 + 1039 + return count; 1040 + } 1041 + 1042 + static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level, 1043 + const struct kvm_memory_slot *slot) 1010 1044 { 1011 1045 unsigned long idx; 1012 1046 1013 1047 idx = gfn_to_index(gfn, slot->base_gfn, level); 1014 1048 return &slot->arch.rmap[level - PG_LEVEL_4K][idx]; 1015 - } 1016 - 1017 - static struct kvm_rmap_head *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, 1018 - struct kvm_mmu_page *sp) 1019 - { 1020 - struct kvm_memslots *slots; 1021 - struct kvm_memory_slot *slot; 1022 - 1023 - slots = kvm_memslots_for_spte_role(kvm, sp->role); 1024 - slot = __gfn_to_memslot(slots, gfn); 1025 - return __gfn_to_rmap(gfn, sp->role.level, slot); 1026 1049 } 1027 1050 1028 1051 static bool rmap_can_add(struct kvm_vcpu *vcpu) ··· 1073 1020 1074 1021 static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn) 1075 1022 { 1023 + struct kvm_memory_slot *slot; 1076 1024 struct kvm_mmu_page *sp; 1077 1025 struct kvm_rmap_head *rmap_head; 1078 1026 1079 1027 sp = sptep_to_sp(spte); 1080 1028 kvm_mmu_page_set_gfn(sp, spte - sp->spt, gfn); 1081 - rmap_head = gfn_to_rmap(vcpu->kvm, gfn, sp); 1029 + slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 1030 + rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); 1082 1031 return pte_list_add(vcpu, spte, rmap_head); 1083 1032 } 
1084 1033 1034 + 1085 1035 static void rmap_remove(struct kvm *kvm, u64 *spte) 1086 1036 { 1037 + struct kvm_memslots *slots; 1038 + struct kvm_memory_slot *slot; 1087 1039 struct kvm_mmu_page *sp; 1088 1040 gfn_t gfn; 1089 1041 struct kvm_rmap_head *rmap_head; 1090 1042 1091 1043 sp = sptep_to_sp(spte); 1092 1044 gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt); 1093 - rmap_head = gfn_to_rmap(kvm, gfn, sp); 1045 + 1046 + /* 1047 + * Unlike rmap_add and rmap_recycle, rmap_remove does not run in the 1048 + * context of a vCPU so have to determine which memslots to use based 1049 + * on context information in sp->role. 1050 + */ 1051 + slots = kvm_memslots_for_spte_role(kvm, sp->role); 1052 + 1053 + slot = __gfn_to_memslot(slots, gfn); 1054 + rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); 1055 + 1094 1056 __pte_list_remove(spte, rmap_head); 1095 1057 } 1096 1058 ··· 1187 1119 1188 1120 static void drop_spte(struct kvm *kvm, u64 *sptep) 1189 1121 { 1190 - if (mmu_spte_clear_track_bits(sptep)) 1122 + u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep); 1123 + 1124 + if (is_shadow_present_pte(old_spte)) 1191 1125 rmap_remove(kvm, sptep); 1192 1126 } 1193 1127 ··· 1199 1129 if (is_large_pte(*sptep)) { 1200 1130 WARN_ON(sptep_to_sp(sptep)->role.level == PG_LEVEL_4K); 1201 1131 drop_spte(kvm, sptep); 1202 - --kvm->stat.lpages; 1203 1132 return true; 1204 1133 } 1205 1134 ··· 1287 1218 * Returns true iff any D or W bits were cleared. 
1288 1219 */ 1289 1220 static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head, 1290 - struct kvm_memory_slot *slot) 1221 + const struct kvm_memory_slot *slot) 1291 1222 { 1292 1223 u64 *sptep; 1293 1224 struct rmap_iterator iter; ··· 1325 1256 return; 1326 1257 1327 1258 while (mask) { 1328 - rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), 1329 - PG_LEVEL_4K, slot); 1259 + rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), 1260 + PG_LEVEL_4K, slot); 1330 1261 __rmap_write_protect(kvm, rmap_head, false); 1331 1262 1332 1263 /* clear the first set bit */ ··· 1358 1289 return; 1359 1290 1360 1291 while (mask) { 1361 - rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), 1362 - PG_LEVEL_4K, slot); 1292 + rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask), 1293 + PG_LEVEL_4K, slot); 1363 1294 __rmap_clear_dirty(kvm, rmap_head, slot); 1364 1295 1365 1296 /* clear the first set bit */ ··· 1425 1356 1426 1357 if (kvm_memslots_have_rmaps(kvm)) { 1427 1358 for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) { 1428 - rmap_head = __gfn_to_rmap(gfn, i, slot); 1359 + rmap_head = gfn_to_rmap(gfn, i, slot); 1429 1360 write_protected |= __rmap_write_protect(kvm, rmap_head, true); 1430 1361 } 1431 1362 } ··· 1446 1377 } 1447 1378 1448 1379 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head, 1449 - struct kvm_memory_slot *slot) 1380 + const struct kvm_memory_slot *slot) 1450 1381 { 1451 - u64 *sptep; 1452 - struct rmap_iterator iter; 1453 - bool flush = false; 1454 - 1455 - while ((sptep = rmap_get_first(rmap_head, &iter))) { 1456 - rmap_printk("spte %p %llx.\n", sptep, *sptep); 1457 - 1458 - pte_list_remove(rmap_head, sptep); 1459 - flush = true; 1460 - } 1461 - 1462 - return flush; 1382 + return pte_list_destroy(kvm, rmap_head); 1463 1383 } 1464 1384 1465 1385 static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head, ··· 1479 1421 need_flush = 
1; 1480 1422 1481 1423 if (pte_write(pte)) { 1482 - pte_list_remove(rmap_head, sptep); 1424 + pte_list_remove(kvm, rmap_head, sptep); 1483 1425 goto restart; 1484 1426 } else { 1485 1427 new_spte = kvm_mmu_changed_pte_notifier_make_spte( 1486 1428 *sptep, new_pfn); 1487 1429 1488 - mmu_spte_clear_track_bits(sptep); 1430 + mmu_spte_clear_track_bits(kvm, sptep); 1489 1431 mmu_spte_set(sptep, new_spte); 1490 1432 } 1491 1433 } ··· 1500 1442 1501 1443 struct slot_rmap_walk_iterator { 1502 1444 /* input fields. */ 1503 - struct kvm_memory_slot *slot; 1445 + const struct kvm_memory_slot *slot; 1504 1446 gfn_t start_gfn; 1505 1447 gfn_t end_gfn; 1506 1448 int start_level; ··· 1520 1462 { 1521 1463 iterator->level = level; 1522 1464 iterator->gfn = iterator->start_gfn; 1523 - iterator->rmap = __gfn_to_rmap(iterator->gfn, level, iterator->slot); 1524 - iterator->end_rmap = __gfn_to_rmap(iterator->end_gfn, level, 1525 - iterator->slot); 1465 + iterator->rmap = gfn_to_rmap(iterator->gfn, level, iterator->slot); 1466 + iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot); 1526 1467 } 1527 1468 1528 1469 static void 1529 1470 slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator, 1530 - struct kvm_memory_slot *slot, int start_level, 1471 + const struct kvm_memory_slot *slot, int start_level, 1531 1472 int end_level, gfn_t start_gfn, gfn_t end_gfn) 1532 1473 { 1533 1474 iterator->slot = slot; ··· 1641 1584 1642 1585 static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn) 1643 1586 { 1587 + struct kvm_memory_slot *slot; 1644 1588 struct kvm_rmap_head *rmap_head; 1645 1589 struct kvm_mmu_page *sp; 1646 1590 1647 1591 sp = sptep_to_sp(spte); 1648 - 1649 - rmap_head = gfn_to_rmap(vcpu->kvm, gfn, sp); 1592 + slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 1593 + rmap_head = gfn_to_rmap(gfn, sp->role.level, slot); 1650 1594 1651 1595 kvm_unmap_rmapp(vcpu->kvm, rmap_head, NULL, gfn, sp->role.level, __pte(0)); 1652 1596 
kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn, ··· 2290 2232 if (is_shadow_present_pte(pte)) { 2291 2233 if (is_last_spte(pte, sp->role.level)) { 2292 2234 drop_spte(kvm, spte); 2293 - if (is_large_pte(pte)) 2294 - --kvm->stat.lpages; 2295 2235 } else { 2296 2236 child = to_shadow_page(pte & PT64_BASE_ADDR_MASK); 2297 2237 drop_parent_pte(child, spte); ··· 2772 2716 2773 2717 pgprintk("%s: setting spte %llx\n", __func__, *sptep); 2774 2718 trace_kvm_mmu_set_spte(level, gfn, sptep); 2775 - if (!was_rmapped && is_large_pte(*sptep)) 2776 - ++vcpu->kvm->stat.lpages; 2777 2719 2778 - if (is_shadow_present_pte(*sptep)) { 2779 - if (!was_rmapped) { 2780 - rmap_count = rmap_add(vcpu, sptep, gfn); 2781 - if (rmap_count > RMAP_RECYCLE_THRESHOLD) 2782 - rmap_recycle(vcpu, sptep, gfn); 2783 - } 2720 + if (!was_rmapped) { 2721 + kvm_update_page_stats(vcpu->kvm, level, 1); 2722 + rmap_count = rmap_add(vcpu, sptep, gfn); 2723 + if (rmap_count > RMAP_RECYCLE_THRESHOLD) 2724 + rmap_recycle(vcpu, sptep, gfn); 2784 2725 } 2785 2726 2786 2727 return ret; ··· 2905 2852 kvm_pfn_t pfn, int max_level) 2906 2853 { 2907 2854 struct kvm_lpage_info *linfo; 2855 + int host_level; 2908 2856 2909 2857 max_level = min(max_level, max_huge_page_level); 2910 2858 for ( ; max_level > PG_LEVEL_4K; max_level--) { ··· 2917 2863 if (max_level == PG_LEVEL_4K) 2918 2864 return PG_LEVEL_4K; 2919 2865 2920 - return host_pfn_mapping_level(kvm, gfn, pfn, slot); 2866 + host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); 2867 + return min(host_level, max_level); 2921 2868 } 2922 2869 2923 2870 int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn, ··· 2942 2887 if (!slot) 2943 2888 return PG_LEVEL_4K; 2944 2889 2945 - level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, gfn, pfn, max_level); 2946 - if (level == PG_LEVEL_4K) 2947 - return level; 2948 - 2949 - *req_level = level = min(level, max_level); 2950 - 2951 2890 /* 2952 2891 * Enforce the iTLB multihit workaround after capturing the 
requested 2953 2892 * level, which will be used to do precise, accurate accounting. 2954 2893 */ 2955 - if (huge_page_disallowed) 2894 + *req_level = level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, gfn, pfn, max_level); 2895 + if (level == PG_LEVEL_4K || huge_page_disallowed) 2956 2896 return PG_LEVEL_4K; 2957 2897 2958 2898 /* ··· 3015 2965 break; 3016 2966 3017 2967 drop_large_spte(vcpu, it.sptep); 3018 - if (!is_shadow_present_pte(*it.sptep)) { 3019 - sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr, 3020 - it.level - 1, true, ACC_ALL); 2968 + if (is_shadow_present_pte(*it.sptep)) 2969 + continue; 3021 2970 3022 - link_shadow_page(vcpu, it.sptep, sp); 3023 - if (is_tdp && huge_page_disallowed && 3024 - req_level >= it.level) 3025 - account_huge_nx_page(vcpu->kvm, sp); 3026 - } 2971 + sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr, 2972 + it.level - 1, true, ACC_ALL); 2973 + 2974 + link_shadow_page(vcpu, it.sptep, sp); 2975 + if (is_tdp && huge_page_disallowed && 2976 + req_level >= it.level) 2977 + account_huge_nx_page(vcpu->kvm, sp); 3027 2978 } 3028 2979 3029 2980 ret = mmu_set_spte(vcpu, it.sptep, ACC_ALL, ··· 3173 3122 } 3174 3123 3175 3124 /* 3176 - * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS. 3125 + * Returns the last level spte pointer of the shadow page walk for the given 3126 + * gpa, and sets *spte to the spte value. This spte may be non-preset. If no 3127 + * walk could be performed, returns NULL and *spte does not contain valid data. 3128 + * 3129 + * Contract: 3130 + * - Must be called between walk_shadow_page_lockless_{begin,end}. 3131 + * - The returned sptep must not be used after walk_shadow_page_lockless_end. 
3177 3132 */ 3178 - static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, 3179 - u32 error_code) 3133 + static u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte) 3180 3134 { 3181 3135 struct kvm_shadow_walk_iterator iterator; 3136 + u64 old_spte; 3137 + u64 *sptep = NULL; 3138 + 3139 + for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) { 3140 + sptep = iterator.sptep; 3141 + *spte = old_spte; 3142 + 3143 + if (!is_shadow_present_pte(old_spte)) 3144 + break; 3145 + } 3146 + 3147 + return sptep; 3148 + } 3149 + 3150 + /* 3151 + * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS. 3152 + */ 3153 + static int fast_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code) 3154 + { 3182 3155 struct kvm_mmu_page *sp; 3183 3156 int ret = RET_PF_INVALID; 3184 3157 u64 spte = 0ull; 3158 + u64 *sptep = NULL; 3185 3159 uint retry_count = 0; 3186 3160 3187 3161 if (!page_fault_can_be_fast(error_code)) ··· 3217 3141 do { 3218 3142 u64 new_spte; 3219 3143 3220 - for_each_shadow_entry_lockless(vcpu, cr2_or_gpa, iterator, spte) 3221 - if (!is_shadow_present_pte(spte)) 3222 - break; 3144 + if (is_tdp_mmu(vcpu->arch.mmu)) 3145 + sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, gpa, &spte); 3146 + else 3147 + sptep = fast_pf_get_last_sptep(vcpu, gpa, &spte); 3223 3148 3224 3149 if (!is_shadow_present_pte(spte)) 3225 3150 break; 3226 3151 3227 - sp = sptep_to_sp(iterator.sptep); 3152 + sp = sptep_to_sp(sptep); 3228 3153 if (!is_last_spte(spte, sp->role.level)) 3229 3154 break; 3230 3155 ··· 3283 3206 * since the gfn is not stable for indirect shadow page. See 3284 3207 * Documentation/virt/kvm/locking.rst to get more detail. 
3285 3208 */ 3286 - if (fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte, 3287 - new_spte)) { 3209 + if (fast_pf_fix_direct_spte(vcpu, sp, sptep, spte, new_spte)) { 3288 3210 ret = RET_PF_FIXED; 3289 3211 break; 3290 3212 } ··· 3296 3220 3297 3221 } while (true); 3298 3222 3299 - trace_fast_page_fault(vcpu, cr2_or_gpa, error_code, iterator.sptep, 3300 - spte, ret); 3223 + trace_fast_page_fault(vcpu, gpa, error_code, sptep, spte, ret); 3301 3224 walk_shadow_page_lockless_end(vcpu); 3302 3225 3303 3226 return ret; ··· 3530 3455 * the shadow page table may be a PAE or a long mode page table. 3531 3456 */ 3532 3457 pm_mask = PT_PRESENT_MASK | shadow_me_mask; 3533 - if (mmu->shadow_root_level == PT64_ROOT_4LEVEL) { 3458 + if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) { 3534 3459 pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK; 3535 3460 3536 3461 if (WARN_ON_ONCE(!mmu->pml4_root)) { 3537 3462 r = -EIO; 3538 3463 goto out_unlock; 3539 3464 } 3540 - 3541 3465 mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask; 3466 + 3467 + if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) { 3468 + if (WARN_ON_ONCE(!mmu->pml5_root)) { 3469 + r = -EIO; 3470 + goto out_unlock; 3471 + } 3472 + mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask; 3473 + } 3542 3474 } 3543 3475 3544 3476 for (i = 0; i < 4; ++i) { ··· 3564 3482 mmu->pae_root[i] = root | pm_mask; 3565 3483 } 3566 3484 3567 - if (mmu->shadow_root_level == PT64_ROOT_4LEVEL) 3485 + if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) 3486 + mmu->root_hpa = __pa(mmu->pml5_root); 3487 + else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL) 3568 3488 mmu->root_hpa = __pa(mmu->pml4_root); 3569 3489 else 3570 3490 mmu->root_hpa = __pa(mmu->pae_root); ··· 3582 3498 static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu) 3583 3499 { 3584 3500 struct kvm_mmu *mmu = vcpu->arch.mmu; 3585 - u64 *pml4_root, *pae_root; 3501 + bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL; 3502 + u64 *pml5_root = NULL; 3503 + u64 
*pml4_root = NULL; 3504 + u64 *pae_root; 3586 3505 3587 3506 /* 3588 3507 * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP ··· 3598 3511 return 0; 3599 3512 3600 3513 /* 3601 - * This mess only works with 4-level paging and needs to be updated to 3602 - * work with 5-level paging. 3514 + * NPT, the only paging mode that uses this horror, uses a fixed number 3515 + * of levels for the shadow page tables, e.g. all MMUs are 4-level or 3516 + * all MMus are 5-level. Thus, this can safely require that pml5_root 3517 + * is allocated if the other roots are valid and pml5 is needed, as any 3518 + * prior MMU would also have required pml5. 3603 3519 */ 3604 - if (WARN_ON_ONCE(mmu->shadow_root_level != PT64_ROOT_4LEVEL)) 3605 - return -EIO; 3606 - 3607 - if (mmu->pae_root && mmu->pml4_root) 3520 + if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root)) 3608 3521 return 0; 3609 3522 3610 3523 /* 3611 3524 * The special roots should always be allocated in concert. Yell and 3612 3525 * bail if KVM ends up in a state where only one of the roots is valid. 
3613 3526 */ 3614 - if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root)) 3527 + if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root || 3528 + (need_pml5 && mmu->pml5_root))) 3615 3529 return -EIO; 3616 3530 3617 3531 /* ··· 3623 3535 if (!pae_root) 3624 3536 return -ENOMEM; 3625 3537 3538 + #ifdef CONFIG_X86_64 3626 3539 pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); 3627 - if (!pml4_root) { 3628 - free_page((unsigned long)pae_root); 3629 - return -ENOMEM; 3540 + if (!pml4_root) 3541 + goto err_pml4; 3542 + 3543 + if (need_pml5) { 3544 + pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); 3545 + if (!pml5_root) 3546 + goto err_pml5; 3630 3547 } 3548 + #endif 3631 3549 3632 3550 mmu->pae_root = pae_root; 3633 3551 mmu->pml4_root = pml4_root; 3552 + mmu->pml5_root = pml5_root; 3634 3553 3635 3554 return 0; 3555 + 3556 + #ifdef CONFIG_X86_64 3557 + err_pml5: 3558 + free_page((unsigned long)pml4_root); 3559 + err_pml4: 3560 + free_page((unsigned long)pae_root); 3561 + return -ENOMEM; 3562 + #endif 3636 3563 } 3637 3564 3638 3565 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu) ··· 3743 3640 /* 3744 3641 * Return the level of the lowest level SPTE added to sptes. 3745 3642 * That SPTE may be non-present. 3643 + * 3644 + * Must be called between walk_shadow_page_lockless_{begin,end}. 
3746 3645 */ 3747 3646 static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level) 3748 3647 { 3749 3648 struct kvm_shadow_walk_iterator iterator; 3750 3649 int leaf = -1; 3751 3650 u64 spte; 3752 - 3753 - walk_shadow_page_lockless_begin(vcpu); 3754 3651 3755 3652 for (shadow_walk_init(&iterator, vcpu, addr), 3756 3653 *root_level = iterator.level; ··· 3765 3662 break; 3766 3663 } 3767 3664 3768 - walk_shadow_page_lockless_end(vcpu); 3769 - 3770 3665 return leaf; 3771 3666 } 3772 3667 ··· 3776 3675 int root, leaf, level; 3777 3676 bool reserved = false; 3778 3677 3678 + walk_shadow_page_lockless_begin(vcpu); 3679 + 3779 3680 if (is_tdp_mmu(vcpu->arch.mmu)) 3780 3681 leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root); 3781 3682 else 3782 3683 leaf = get_walk(vcpu, addr, sptes, &root); 3684 + 3685 + walk_shadow_page_lockless_end(vcpu); 3783 3686 3784 3687 if (unlikely(leaf < 0)) { 3785 3688 *sptep = 0ull; ··· 3900 3795 kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); 3901 3796 } 3902 3797 3903 - static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, 3798 + static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, 3904 3799 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva, 3905 - bool write, bool *writable) 3800 + bool write, bool *writable, int *r) 3906 3801 { 3907 3802 struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 3908 3803 bool async; ··· 3913 3808 * be zapped before KVM inserts a new MMIO SPTE for the gfn. 3914 3809 */ 3915 3810 if (slot && (slot->flags & KVM_MEMSLOT_INVALID)) 3916 - return true; 3811 + goto out_retry; 3917 3812 3918 - /* Don't expose private memslots to L2. */ 3919 - if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) { 3920 - *pfn = KVM_PFN_NOSLOT; 3921 - *writable = false; 3922 - return false; 3813 + if (!kvm_is_visible_memslot(slot)) { 3814 + /* Don't expose private memslots to L2. 
*/ 3815 + if (is_guest_mode(vcpu)) { 3816 + *pfn = KVM_PFN_NOSLOT; 3817 + *writable = false; 3818 + return false; 3819 + } 3820 + /* 3821 + * If the APIC access page exists but is disabled, go directly 3822 + * to emulation without caching the MMIO access or creating a 3823 + * MMIO SPTE. That way the cache doesn't need to be purged 3824 + * when the AVIC is re-enabled. 3825 + */ 3826 + if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT && 3827 + !kvm_apicv_activated(vcpu->kvm)) { 3828 + *r = RET_PF_EMULATE; 3829 + return true; 3830 + } 3923 3831 } 3924 3832 3925 3833 async = false; ··· 3946 3828 if (kvm_find_async_pf_gfn(vcpu, gfn)) { 3947 3829 trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn); 3948 3830 kvm_make_request(KVM_REQ_APF_HALT, vcpu); 3949 - return true; 3831 + goto out_retry; 3950 3832 } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn)) 3951 - return true; 3833 + goto out_retry; 3952 3834 } 3953 3835 3954 3836 *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, 3955 3837 write, writable, hva); 3956 - return false; 3838 + 3839 + out_retry: 3840 + *r = RET_PF_RETRY; 3841 + return true; 3957 3842 } 3958 3843 3959 3844 static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, ··· 3975 3854 if (page_fault_handle_page_track(vcpu, error_code, gfn)) 3976 3855 return RET_PF_EMULATE; 3977 3856 3978 - if (!is_tdp_mmu_fault) { 3979 - r = fast_page_fault(vcpu, gpa, error_code); 3980 - if (r != RET_PF_INVALID) 3981 - return r; 3982 - } 3857 + r = fast_page_fault(vcpu, gpa, error_code); 3858 + if (r != RET_PF_INVALID) 3859 + return r; 3983 3860 3984 3861 r = mmu_topup_memory_caches(vcpu, false); 3985 3862 if (r) ··· 3986 3867 mmu_seq = vcpu->kvm->mmu_notifier_seq; 3987 3868 smp_rmb(); 3988 3869 3989 - if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva, 3990 - write, &map_writable)) 3991 - return RET_PF_RETRY; 3870 + if (kvm_faultin_pfn(vcpu, prefault, gfn, gpa, &pfn, &hva, 3871 + write, &map_writable, &r)) 3872 + return r; 3992 3873 
3993 3874 if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r)) 3994 3875 return r; ··· 4707 4588 4708 4589 static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu) 4709 4590 { 4591 + /* tdp_root_level is architecture forced level, use it if nonzero */ 4592 + if (tdp_root_level) 4593 + return tdp_root_level; 4594 + 4710 4595 /* Use 5-level TDP if and only if it's useful/necessary. */ 4711 4596 if (max_tdp_level == 5 && cpuid_maxphyaddr(vcpu) <= 48) 4712 4597 return 4; ··· 5283 5160 if (r == RET_PF_INVALID) { 5284 5161 r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, 5285 5162 lower_32_bits(error_code), false); 5286 - if (WARN_ON_ONCE(r == RET_PF_INVALID)) 5163 + if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm)) 5287 5164 return -EIO; 5288 5165 } 5289 5166 ··· 5402 5279 */ 5403 5280 } 5404 5281 5405 - void kvm_configure_mmu(bool enable_tdp, int tdp_max_root_level, 5406 - int tdp_huge_page_level) 5282 + void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level, 5283 + int tdp_max_root_level, int tdp_huge_page_level) 5407 5284 { 5408 5285 tdp_enabled = enable_tdp; 5286 + tdp_root_level = tdp_forced_root_level; 5409 5287 max_tdp_level = tdp_max_root_level; 5410 5288 5411 5289 /* ··· 5426 5302 EXPORT_SYMBOL_GPL(kvm_configure_mmu); 5427 5303 5428 5304 /* The return value indicates if tlb flush on all vcpus is needed. */ 5429 - typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_head, 5430 - struct kvm_memory_slot *slot); 5305 + typedef bool (*slot_level_handler) (struct kvm *kvm, 5306 + struct kvm_rmap_head *rmap_head, 5307 + const struct kvm_memory_slot *slot); 5431 5308 5432 5309 /* The caller should hold mmu-lock before calling this function. 
*/ 5433 5310 static __always_inline bool 5434 - slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot, 5311 + slot_handle_level_range(struct kvm *kvm, const struct kvm_memory_slot *memslot, 5435 5312 slot_level_handler fn, int start_level, int end_level, 5436 5313 gfn_t start_gfn, gfn_t end_gfn, bool flush_on_yield, 5437 5314 bool flush) ··· 5459 5334 } 5460 5335 5461 5336 static __always_inline bool 5462 - slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot, 5337 + slot_handle_level(struct kvm *kvm, const struct kvm_memory_slot *memslot, 5463 5338 slot_level_handler fn, int start_level, int end_level, 5464 5339 bool flush_on_yield) 5465 5340 { ··· 5470 5345 } 5471 5346 5472 5347 static __always_inline bool 5473 - slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot *memslot, 5348 + slot_handle_leaf(struct kvm *kvm, const struct kvm_memory_slot *memslot, 5474 5349 slot_level_handler fn, bool flush_on_yield) 5475 5350 { 5476 5351 return slot_handle_level(kvm, memslot, fn, PG_LEVEL_4K, ··· 5483 5358 set_memory_encrypted((unsigned long)mmu->pae_root, 1); 5484 5359 free_page((unsigned long)mmu->pae_root); 5485 5360 free_page((unsigned long)mmu->pml4_root); 5361 + free_page((unsigned long)mmu->pml5_root); 5486 5362 } 5487 5363 5488 5364 static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) ··· 5713 5587 kvm_mmu_uninit_tdp_mmu(kvm); 5714 5588 } 5715 5589 5590 + /* 5591 + * Invalidate (zap) SPTEs that cover GFNs from gfn_start and up to gfn_end 5592 + * (not including it) 5593 + */ 5716 5594 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) 5717 5595 { 5718 5596 struct kvm_memslots *slots; ··· 5724 5594 int i; 5725 5595 bool flush = false; 5726 5596 5597 + write_lock(&kvm->mmu_lock); 5598 + 5599 + kvm_inc_notifier_count(kvm, gfn_start, gfn_end); 5600 + 5727 5601 if (kvm_memslots_have_rmaps(kvm)) { 5728 - write_lock(&kvm->mmu_lock); 5729 5602 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { 5730 
5603 slots = __kvm_memslots(kvm, i); 5731 5604 kvm_for_each_memslot(memslot, slots) { ··· 5739 5606 if (start >= end) 5740 5607 continue; 5741 5608 5742 - flush = slot_handle_level_range(kvm, memslot, 5609 + flush = slot_handle_level_range(kvm, 5610 + (const struct kvm_memory_slot *) memslot, 5743 5611 kvm_zap_rmapp, PG_LEVEL_4K, 5744 5612 KVM_MAX_HUGEPAGE_LEVEL, start, 5745 5613 end - 1, true, flush); 5746 5614 } 5747 5615 } 5748 5616 if (flush) 5749 - kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end); 5750 - write_unlock(&kvm->mmu_lock); 5617 + kvm_flush_remote_tlbs_with_address(kvm, gfn_start, 5618 + gfn_end - gfn_start); 5751 5619 } 5752 5620 5753 5621 if (is_tdp_mmu_enabled(kvm)) { 5754 - flush = false; 5755 - 5756 - read_lock(&kvm->mmu_lock); 5757 5622 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) 5758 5623 flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start, 5759 - gfn_end, flush, true); 5624 + gfn_end, flush); 5760 5625 if (flush) 5761 5626 kvm_flush_remote_tlbs_with_address(kvm, gfn_start, 5762 - gfn_end); 5763 - 5764 - read_unlock(&kvm->mmu_lock); 5627 + gfn_end - gfn_start); 5765 5628 } 5629 + 5630 + if (flush) 5631 + kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end); 5632 + 5633 + kvm_dec_notifier_count(kvm, gfn_start, gfn_end); 5634 + 5635 + write_unlock(&kvm->mmu_lock); 5766 5636 } 5767 5637 5768 5638 static bool slot_rmap_write_protect(struct kvm *kvm, 5769 5639 struct kvm_rmap_head *rmap_head, 5770 - struct kvm_memory_slot *slot) 5640 + const struct kvm_memory_slot *slot) 5771 5641 { 5772 5642 return __rmap_write_protect(kvm, rmap_head, false); 5773 5643 } 5774 5644 5775 5645 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 5776 - struct kvm_memory_slot *memslot, 5646 + const struct kvm_memory_slot *memslot, 5777 5647 int start_level) 5778 5648 { 5779 5649 bool flush = false; ··· 5812 5676 5813 5677 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, 5814 5678 struct kvm_rmap_head *rmap_head, 5815 - struct 
kvm_memory_slot *slot) 5679 + const struct kvm_memory_slot *slot) 5816 5680 { 5817 5681 u64 *sptep; 5818 5682 struct rmap_iterator iter; ··· 5835 5699 if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && 5836 5700 sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn, 5837 5701 pfn, PG_LEVEL_NUM)) { 5838 - pte_list_remove(rmap_head, sptep); 5702 + pte_list_remove(kvm, rmap_head, sptep); 5839 5703 5840 5704 if (kvm_available_flush_tlb_with_range()) 5841 5705 kvm_flush_remote_tlbs_with_address(kvm, sp->gfn, ··· 5851 5715 } 5852 5716 5853 5717 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, 5854 - const struct kvm_memory_slot *memslot) 5718 + const struct kvm_memory_slot *slot) 5855 5719 { 5856 - /* FIXME: const-ify all uses of struct kvm_memory_slot. */ 5857 - struct kvm_memory_slot *slot = (struct kvm_memory_slot *)memslot; 5858 5720 bool flush = false; 5859 5721 5860 5722 if (kvm_memslots_have_rmaps(kvm)) { ··· 5888 5754 } 5889 5755 5890 5756 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, 5891 - struct kvm_memory_slot *memslot) 5757 + const struct kvm_memory_slot *memslot) 5892 5758 { 5893 5759 bool flush = false; 5894 5760
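
The kvm_zap_gfn_range() hunk above changes the last argument of two kvm_flush_remote_tlbs_with_address() calls from gfn_end to gfn_end - gfn_start, and the prototype shown in mmu_internal.h names that parameter "pages". A minimal sketch of converting a half-open GFN range into a (start, page count) pair follows. struct flush_req and both helpers are hypothetical names used only for illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request type: a start GFN plus a page count, not an end GFN. */
struct flush_req {
	uint64_t start_gfn;
	uint64_t pages;
};

/* Convert the half-open range [gfn_start, gfn_end) into (start, count). */
static struct flush_req flush_req_from_range(uint64_t gfn_start, uint64_t gfn_end)
{
	struct flush_req req = { gfn_start, gfn_end - gfn_start };
	return req;
}

/* Last GFN covered by the request; only meaningful when pages > 0. */
static uint64_t flush_req_last_gfn(struct flush_req req)
{
	return req.start_gfn + req.pages - 1;
}
```

For the range [0x100, 0x110) this yields 16 pages ending at GFN 0x10f. Passing the end GFN where a count is expected would instead request 0x110 pages.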

+2 -2

arch/x86/kvm/mmu/mmu_audit.c

··· 147 147
 		return;
 	}
 
-	rmap_head = __gfn_to_rmap(gfn, rev_sp->role.level, slot);
+	rmap_head = gfn_to_rmap(gfn, rev_sp->role.level, slot);
 	if (!rmap_head->val) {
 		if (!__ratelimit(&ratelimit_state))
 			return;
··· 200 200
 
 	slots = kvm_memslots_for_spte_role(kvm, sp->role);
 	slot = __gfn_to_memslot(slots, sp->gfn);
-	rmap_head = __gfn_to_rmap(sp->gfn, PG_LEVEL_4K, slot);
+	rmap_head = gfn_to_rmap(sp->gfn, PG_LEVEL_4K, slot);
 
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
 		if (is_writable_pte(*sptep))

+12 -6

arch/x86/kvm/mmu/mmu_internal.h

··· 31 31
 #define IS_VALID_PAE_ROOT(x)	(!!(x))
 
 struct kvm_mmu_page {
+	/*
+	 * Note, "link" through "spt" fit in a single 64 byte cache line on
+	 * 64-bit kernels, keep it that way unless there's a reason not to.
+	 */
 	struct list_head link;
 	struct hlist_node hash_link;
-	struct list_head lpage_disallowed_link;
 
+	bool tdp_mmu_page;
 	bool unsync;
 	u8 mmu_valid_gen;
-	bool mmio_cached;
 	bool lpage_disallowed; /* Can't be replaced by an equiv large page */
 
 	/*
··· 62 59
 	struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 	DECLARE_BITMAP(unsync_child_bitmap, 512);
 
+	struct list_head lpage_disallowed_link;
 #ifdef CONFIG_X86_32
 	/*
 	 * Used out of the mmu-lock to avoid reading spte values while an
··· 75 71
 	atomic_t write_flooding_count;
 
 #ifdef CONFIG_X86_64
-	bool tdp_mmu_page;
-
 	/* Used for freeing the page asynchronously if it is a TDP MMU page. */
 	struct rcu_head rcu_head;
 #endif
··· 126 124
 
 int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync);
 
-void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
-void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn,
 				    int min_level);
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 					u64 start_gfn, u64 pages);
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
 
 /*
  * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
··· 143 140
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
  * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
+ *
+ * Any names added to this enum should be exported to userspace for use in
+ * tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
  */
 enum {
 	RET_PF_RETRY = 0,

+6

arch/x86/kvm/mmu/mmutrace.h

··· 54 54
 	{ PFERR_RSVD_MASK, "RSVD" },	\
 	{ PFERR_FETCH_MASK, "F" }
 
+TRACE_DEFINE_ENUM(RET_PF_RETRY);
+TRACE_DEFINE_ENUM(RET_PF_EMULATE);
+TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_FIXED);
+TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
+
 /*
  * A pagetable walk has started
  */

+1

arch/x86/kvm/mmu/page_track.c

··· 16 16
 
 #include <asm/kvm_page_track.h>
 
+#include "mmu.h"
 #include "mmu_internal.h"
 
 void kvm_page_track_free_memslot(struct kvm_memory_slot *slot)

+3 -3

arch/x86/kvm/mmu/paging_tmpl.h

··· 881 881
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
-			 write_fault, &map_writable))
-		return RET_PF_RETRY;
+	if (kvm_faultin_pfn(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
+			    write_fault, &map_writable, &r))
+		return r;
 
 	if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
 		return r;
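
The call sites above switch from "helper returns true, caller returns RET_PF_RETRY" to "helper returns true and supplies the return code through an out-parameter". A minimal user-space sketch of that calling convention follows. All names (sketch_slot, sketch_faultin_pfn, the SKETCH_* codes, the placeholder pfn) are hypothetical. The sketch also assumes the ordinary path returns false so that the caller keeps going, which is what the call sites imply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; the values are not the kernel's RET_PF_* values. */
enum sketch_ret { SKETCH_RETRY, SKETCH_EMULATE, SKETCH_FIXED };

/* Hypothetical, heavily reduced stand-in for a memslot. */
struct sketch_slot {
	bool invalid;          /* slot is being deleted or moved */
	bool visible;          /* false for private/internal slots */
	bool apic_access_page; /* slot backs the APIC access page */
};

/*
 * Returns true if the caller should stop and return *r.
 * Returns false if *pfn was filled in and the caller should continue.
 */
static bool sketch_faultin_pfn(const struct sketch_slot *slot, bool apicv_active,
			       unsigned long *pfn, int *r)
{
	if (slot && slot->invalid)
		goto out_retry;

	/* Inactive APIC access page: ask the caller to emulate. */
	if (slot && !slot->visible && slot->apic_access_page && !apicv_active) {
		*r = SKETCH_EMULATE;
		return true;
	}

	*pfn = 0x1234; /* placeholder for the real gfn-to-pfn lookup */
	return false;

out_retry:
	*r = SKETCH_RETRY;
	return true;
}

/* Caller shape mirroring "if (helper(..., &r)) return r;". */
static int sketch_page_fault(const struct sketch_slot *slot, bool apicv_active)
{
	unsigned long pfn = 0;
	int r = SKETCH_FIXED;

	if (sketch_faultin_pfn(slot, apicv_active, &pfn, &r))
		return r;

	(void)pfn; /* a real handler would map pfn here */
	return SKETCH_FIXED;
}
```

The out-parameter lets one helper report more than one early-exit reason (retry or emulate here) without changing its boolean "stop now" contract.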

+91 -48

arch/x86/kvm/mmu/tdp_mmu.c

··· 10 10 #include <asm/cmpxchg.h> 11 11 #include <trace/events/kvm.h> 12 12 13 - static bool __read_mostly tdp_mmu_enabled = false; 13 + static bool __read_mostly tdp_mmu_enabled = true; 14 14 module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644); 15 15 16 16 /* Initializes the TDP MMU for the VM, if enabled. */ ··· 255 255 * 256 256 * @kvm: kvm instance 257 257 * @sp: the new page 258 - * @shared: This operation may not be running under the exclusive use of 259 - * the MMU lock and the operation must synchronize with other 260 - * threads that might be adding or removing pages. 261 258 * @account_nx: This page replaces a NX large page and should be marked for 262 259 * eventual reclaim. 263 260 */ 264 261 static void tdp_mmu_link_page(struct kvm *kvm, struct kvm_mmu_page *sp, 265 - bool shared, bool account_nx) 262 + bool account_nx) 266 263 { 267 - if (shared) 268 - spin_lock(&kvm->arch.tdp_mmu_pages_lock); 269 - else 270 - lockdep_assert_held_write(&kvm->mmu_lock); 271 - 264 + spin_lock(&kvm->arch.tdp_mmu_pages_lock); 272 265 list_add(&sp->link, &kvm->arch.tdp_mmu_pages); 273 266 if (account_nx) 274 267 account_huge_nx_page(kvm, sp); 275 - 276 - if (shared) 277 - spin_unlock(&kvm->arch.tdp_mmu_pages_lock); 268 + spin_unlock(&kvm->arch.tdp_mmu_pages_lock); 278 269 } 279 270 280 271 /** ··· 436 445 437 446 trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte); 438 447 439 - if (is_large_pte(old_spte) != is_large_pte(new_spte)) { 440 - if (is_large_pte(old_spte)) 441 - atomic64_sub(1, (atomic64_t*)&kvm->stat.lpages); 442 - else 443 - atomic64_add(1, (atomic64_t*)&kvm->stat.lpages); 444 - } 445 - 446 448 /* 447 449 * The only times a SPTE should be changed from a non-present to 448 450 * non-present state is when an MMIO entry is installed/modified/ ··· 461 477 return; 462 478 } 463 479 480 + if (is_leaf != was_leaf) 481 + kvm_update_page_stats(kvm, level, is_leaf ? 
1 : -1); 464 482 465 483 if (was_leaf && is_dirty_spte(old_spte) && 466 484 (!is_present || !is_dirty_spte(new_spte) || pfn_changed)) ··· 512 526 if (is_removed_spte(iter->old_spte)) 513 527 return false; 514 528 529 + /* 530 + * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and 531 + * does not hold the mmu_lock. 532 + */ 515 533 if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte, 516 534 new_spte) != iter->old_spte) 517 535 return false; ··· 527 537 return true; 528 538 } 529 539 530 - static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm, 531 - struct tdp_iter *iter, 532 - u64 new_spte) 540 + /* 541 + * tdp_mmu_map_set_spte_atomic - Set a leaf TDP MMU SPTE atomically to resolve a 542 + * TDP page fault. 543 + * 544 + * @vcpu: The vcpu instance that took the TDP page fault. 545 + * @iter: a tdp_iter instance currently on the SPTE that should be set 546 + * @new_spte: The value the SPTE should be set to 547 + * 548 + * Returns: true if the SPTE was set, false if it was not. If false is returned, 549 + * this function will have no side-effects. 550 + */ 551 + static inline bool tdp_mmu_map_set_spte_atomic(struct kvm_vcpu *vcpu, 552 + struct tdp_iter *iter, 553 + u64 new_spte) 533 554 { 555 + struct kvm *kvm = vcpu->kvm; 556 + 534 557 if (!tdp_mmu_set_spte_atomic_no_dirty_log(kvm, iter, new_spte)) 535 558 return false; 536 559 537 - handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn, 538 - iter->old_spte, new_spte, iter->level); 560 + /* 561 + * Use kvm_vcpu_gfn_to_memslot() instead of going through 562 + * handle_changed_spte_dirty_log() to leverage vcpu->last_used_slot. 563 + */ 564 + if (is_writable_pte(new_spte)) { 565 + struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, iter->gfn); 566 + 567 + if (slot && kvm_slot_dirty_track_enabled(slot)) { 568 + /* Enforced by kvm_mmu_hugepage_adjust. 
*/ 569 + WARN_ON_ONCE(iter->level > PG_LEVEL_4K); 570 + mark_page_dirty_in_slot(kvm, slot, iter->gfn); 571 + } 572 + } 573 + 539 574 return true; 540 575 } 541 576 ··· 573 558 * immediately installing a present entry in its place 574 559 * before the TLBs are flushed. 575 560 */ 576 - if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE)) 561 + if (!tdp_mmu_set_spte_atomic_no_dirty_log(kvm, iter, REMOVED_SPTE)) 577 562 return false; 578 563 579 564 kvm_flush_remote_tlbs_with_address(kvm, iter->gfn, ··· 804 789 * non-root pages mapping GFNs strictly within that range. Returns true if 805 790 * SPTEs have been cleared and a TLB flush is needed before releasing the 806 791 * MMU lock. 807 - * 808 - * If shared is true, this thread holds the MMU lock in read mode and must 809 - * account for the possibility that other threads are modifying the paging 810 - * structures concurrently. If shared is false, this thread should hold the 811 - * MMU in write mode. 812 792 */ 813 793 bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start, 814 - gfn_t end, bool can_yield, bool flush, 815 - bool shared) 794 + gfn_t end, bool can_yield, bool flush) 816 795 { 817 796 struct kvm_mmu_page *root; 818 797 819 - for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, shared) 798 + for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) 820 799 flush = zap_gfn_range(kvm, root, start, end, can_yield, flush, 821 - shared); 800 + false); 822 801 823 802 return flush; 824 803 } ··· 823 814 int i; 824 815 825 816 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) 826 - flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, 827 - flush, false); 817 + flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush); 828 818 829 819 if (flush) 830 820 kvm_flush_remote_tlbs(kvm); ··· 948 940 949 941 if (new_spte == iter->old_spte) 950 942 ret = RET_PF_SPURIOUS; 951 - else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte)) 943 + else if (!tdp_mmu_map_set_spte_atomic(vcpu, iter, new_spte)) 952 
944 return RET_PF_RETRY; 953 945 954 946 /* ··· 1052 1044 new_spte = make_nonleaf_spte(child_pt, 1053 1045 !shadow_accessed_mask); 1054 1046 1055 - if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, 1056 - new_spte)) { 1057 - tdp_mmu_link_page(vcpu->kvm, sp, true, 1047 + if (tdp_mmu_set_spte_atomic_no_dirty_log(vcpu->kvm, &iter, new_spte)) { 1048 + tdp_mmu_link_page(vcpu->kvm, sp, 1058 1049 huge_page_disallowed && 1059 1050 req_level >= iter.level); 1060 1051 ··· 1262 1255 * only affect leaf SPTEs down to min_level. 1263 1256 * Returns true if an SPTE has been changed and the TLBs need to be flushed. 1264 1257 */ 1265 - bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot, 1266 - int min_level) 1258 + bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, 1259 + const struct kvm_memory_slot *slot, int min_level) 1267 1260 { 1268 1261 struct kvm_mmu_page *root; 1269 1262 bool spte_set = false; ··· 1333 1326 * each SPTE. Returns true if an SPTE has been changed and the TLBs need to 1334 1327 * be flushed. 1335 1328 */ 1336 - bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot) 1329 + bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, 1330 + const struct kvm_memory_slot *slot) 1337 1331 { 1338 1332 struct kvm_mmu_page *root; 1339 1333 bool spte_set = false; ··· 1537 1529 /* 1538 1530 * Return the level of the lowest level SPTE added to sptes. 1539 1531 * That SPTE may be non-present. 1532 + * 1533 + * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}. 
1540 1534 */ 1541 1535 int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, 1542 1536 int *root_level) ··· 1550 1540 1551 1541 *root_level = vcpu->arch.mmu->shadow_root_level; 1552 1542 1553 - rcu_read_lock(); 1554 - 1555 1543 tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { 1556 1544 leaf = iter.level; 1557 1545 sptes[leaf] = iter.old_spte; 1558 1546 } 1559 1547 1560 - rcu_read_unlock(); 1561 - 1562 1548 return leaf; 1549 + } 1550 + 1551 + /* 1552 + * Returns the last level spte pointer of the shadow page walk for the given 1553 + * gpa, and sets *spte to the spte value. This spte may be non-preset. If no 1554 + * walk could be performed, returns NULL and *spte does not contain valid data. 1555 + * 1556 + * Contract: 1557 + * - Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}. 1558 + * - The returned sptep must not be used after kvm_tdp_mmu_walk_lockless_end. 1559 + * 1560 + * WARNING: This function is only intended to be called during fast_page_fault. 1561 + */ 1562 + u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr, 1563 + u64 *spte) 1564 + { 1565 + struct tdp_iter iter; 1566 + struct kvm_mmu *mmu = vcpu->arch.mmu; 1567 + gfn_t gfn = addr >> PAGE_SHIFT; 1568 + tdp_ptep_t sptep = NULL; 1569 + 1570 + tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) { 1571 + *spte = iter.old_spte; 1572 + sptep = iter.sptep; 1573 + } 1574 + 1575 + /* 1576 + * Perform the rcu_dereference to get the raw spte pointer value since 1577 + * we are passing it up to fast_page_fault, which is shared with the 1578 + * legacy MMU and thus does not retain the TDP MMU-specific __rcu 1579 + * annotation. 1580 + * 1581 + * This is safe since fast_page_fault obeys the contracts of this 1582 + * function as well as all TDP MMU contracts around modifying SPTEs 1583 + * outside of mmu_lock. 1584 + */ 1585 + return rcu_dereference(sptep); 1563 1586 }

+19 -10

arch/x86/kvm/mmu/tdp_mmu.h

··· 20 20
 			  bool shared);
 
 bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush,
-				 bool shared);
+				 gfn_t end, bool can_yield, bool flush);
 static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
-					     gfn_t start, gfn_t end, bool flush,
-					     bool shared)
+					     gfn_t start, gfn_t end, bool flush)
 {
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush,
-					   shared);
+	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
 }
 static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
··· 41 44
 	 */
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
-					   sp->gfn, end, false, false, false);
+					   sp->gfn, end, false, false);
 }
 
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
··· 58 61
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 
-bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
-			     int min_level);
+bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
+			     const struct kvm_memory_slot *slot, int min_level);
 bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
-				  struct kvm_memory_slot *slot);
+				  const struct kvm_memory_slot *slot);
 void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				       struct kvm_memory_slot *slot,
 				       gfn_t gfn, unsigned long mask,
··· 74 77
 				   struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
+static inline void kvm_tdp_mmu_walk_lockless_begin(void)
+{
+	rcu_read_lock();
+}
+
+static inline void kvm_tdp_mmu_walk_lockless_end(void)
+{
+	rcu_read_unlock();
+}
+
 int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 			 int *root_level);
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
+					u64 *spte);
 
 #ifdef CONFIG_X86_64
 bool kvm_mmu_init_tdp_mmu(struct kvm *kvm);

+4 -1

arch/x86/kvm/pmu.c

··· 137 137
 	pmc->perf_event = event;
 	pmc_to_pmu(pmc)->event_count++;
 	clear_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
+	pmc->is_paused = false;
 }
 
 static void pmc_pause_counter(struct kvm_pmc *pmc)
 {
 	u64 counter = pmc->counter;
 
-	if (!pmc->perf_event)
+	if (!pmc->perf_event || pmc->is_paused)
 		return;
 
 	/* update counter, reset event value to avoid redundant accumulation */
 	counter += perf_event_pause(pmc->perf_event, true);
 	pmc->counter = counter & pmc_bitmask(pmc);
+	pmc->is_paused = true;
 }
 
 static bool pmc_resume_counter(struct kvm_pmc *pmc)
··· 165 163
 
 	/* reuse perf_event to serve as pmc_reprogram_counter() does*/
 	perf_event_enable(pmc->perf_event);
+	pmc->is_paused = false;
 
 	clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
 	return true;

+1 -1

arch/x86/kvm/pmu.h

··· 55 55
 	u64 counter, enabled, running;
 
 	counter = pmc->counter;
-	if (pmc->perf_event)
+	if (pmc->perf_event && !pmc->is_paused)
 		counter += perf_event_read_value(pmc->perf_event,
 						 &enabled, &running);
 	/* FIXME: Scaling needed? */
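
The pmu.c and pmu.h hunks above add an is_paused flag that is set on pause, cleared on resume and reprogram, and checked before both pausing and reading. A minimal user-space sketch of that guard follows. mock_event, mock_pmc and the mock_* helpers are hypothetical stand-ins for the perf event interfaces, not kernel APIs. The calls field exists only so the effect of the guard can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a perf event backing one counter. */
struct mock_event {
	uint64_t delta;  /* counts accumulated since the last reset */
	unsigned calls;  /* how often the mock "perf" layer was entered */
};

struct mock_pmc {
	uint64_t counter;
	struct mock_event *event;
	bool is_paused;
};

/* Stand-in for pausing with reset: return the delta and zero it. */
static uint64_t mock_event_pause(struct mock_event *ev)
{
	uint64_t d = ev->delta;

	ev->calls++;
	ev->delta = 0;
	return d;
}

/* Stand-in for a non-destructive read of the event. */
static uint64_t mock_event_read(struct mock_event *ev)
{
	ev->calls++;
	return ev->delta;
}

static void mock_pmc_pause(struct mock_pmc *pmc)
{
	/* A second pause is a no-op: the delta is already folded in. */
	if (!pmc->event || pmc->is_paused)
		return;

	pmc->counter += mock_event_pause(pmc->event);
	pmc->is_paused = true;
}

static void mock_pmc_resume(struct mock_pmc *pmc)
{
	if (!pmc->event)
		return;

	pmc->is_paused = false;
}

static uint64_t mock_pmc_read(struct mock_pmc *pmc)
{
	uint64_t counter = pmc->counter;

	/* While paused, the saved counter is already complete. */
	if (pmc->event && !pmc->is_paused)
		counter += mock_event_read(pmc->event);
	return counter;
}
```

In this model, pausing twice or reading while paused does not enter the mock perf layer again, and the value read stays the same until the counter is resumed.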

+17 -32

arch/x86/kvm/svm/avic.c

··· 197 197 vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK; 198 198 vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK; 199 199 vmcb->control.avic_physical_id |= AVIC_MAX_PHYSICAL_ID_COUNT; 200 + vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE & VMCB_AVIC_APIC_BAR_MASK; 201 + 200 202 if (kvm_apicv_activated(svm->vcpu.kvm)) 201 203 vmcb->control.int_ctl |= AVIC_ENABLE_MASK; 202 204 else ··· 227 225 * field of the VMCB. Therefore, we set up the 228 226 * APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (4KB) here. 229 227 */ 230 - static int avic_update_access_page(struct kvm *kvm, bool activate) 228 + static int avic_alloc_access_page(struct kvm *kvm) 231 229 { 232 230 void __user *ret; 233 231 int r = 0; 234 232 235 233 mutex_lock(&kvm->slots_lock); 236 - /* 237 - * During kvm_destroy_vm(), kvm_pit_set_reinject() could trigger 238 - * APICv mode change, which update APIC_ACCESS_PAGE_PRIVATE_MEMSLOT 239 - * memory region. So, we need to ensure that kvm->mm == current->mm. 240 - */ 241 - if ((kvm->arch.apic_access_memslot_enabled == activate) || 242 - (kvm->mm != current->mm)) 234 + 235 + if (kvm->arch.apic_access_memslot_enabled) 243 236 goto out; 244 237 245 238 ret = __x86_set_memory_region(kvm, 246 239 APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, 247 240 APIC_DEFAULT_PHYS_BASE, 248 - activate ? 
PAGE_SIZE : 0); 241 + PAGE_SIZE); 249 242 if (IS_ERR(ret)) { 250 243 r = PTR_ERR(ret); 251 244 goto out; 252 245 } 253 246 254 - kvm->arch.apic_access_memslot_enabled = activate; 247 + kvm->arch.apic_access_memslot_enabled = true; 255 248 out: 256 249 mutex_unlock(&kvm->slots_lock); 257 250 return r; ··· 267 270 if (kvm_apicv_activated(vcpu->kvm)) { 268 271 int ret; 269 272 270 - ret = avic_update_access_page(vcpu->kvm, true); 273 + ret = avic_alloc_access_page(vcpu->kvm); 271 274 if (ret) 272 275 return ret; 273 276 } ··· 584 587 avic_handle_ldr_update(vcpu); 585 588 } 586 589 587 - void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate) 588 - { 589 - if (!enable_apicv || !lapic_in_kernel(vcpu)) 590 - return; 591 - 592 - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); 593 - kvm_request_apicv_update(vcpu->kvm, activate, 594 - APICV_INHIBIT_REASON_IRQWIN); 595 - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); 596 - } 597 - 598 590 void svm_set_virtual_apic_mode(struct kvm_vcpu *vcpu) 599 591 { 600 592 return; ··· 652 666 vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK; 653 667 } 654 668 vmcb_mark_dirty(vmcb, VMCB_AVIC); 669 + 670 + if (activated) 671 + avic_vcpu_load(vcpu, vcpu->cpu); 672 + else 673 + avic_vcpu_put(vcpu); 655 674 656 675 svm_set_pi_irte_mode(vcpu, activated); 657 676 } ··· 909 918 return supported & BIT(bit); 910 919 } 911 920 912 - void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate) 913 - { 914 - avic_update_access_page(kvm, activate); 915 - } 916 921 917 922 static inline int 918 923 avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r) ··· 947 960 int h_physical_id = kvm_cpu_get_apicid(cpu); 948 961 struct vcpu_svm *svm = to_svm(vcpu); 949 962 950 - if (!kvm_vcpu_apicv_active(vcpu)) 951 - return; 952 - 953 963 /* 954 964 * Since the host physical APIC id is 8 bits, 955 965 * we can support host APIC ID upto 255. 
··· 974 990 u64 entry; 975 991 struct vcpu_svm *svm = to_svm(vcpu); 976 992 977 - if (!kvm_vcpu_apicv_active(vcpu)) 978 - return; 979 - 980 993 entry = READ_ONCE(*(svm->avic_physical_id_cache)); 981 994 if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) 982 995 avic_update_iommu_vcpu_affinity(vcpu, -1, 0); ··· 990 1009 struct vcpu_svm *svm = to_svm(vcpu); 991 1010 992 1011 svm->avic_is_running = is_run; 1012 + 1013 + if (!kvm_vcpu_apicv_active(vcpu)) 1014 + return; 1015 + 993 1016 if (is_run) 994 1017 avic_vcpu_load(vcpu, vcpu->cpu); 995 1018 else

-5

arch/x86/kvm/svm/nested.c

··· 666 666 goto out; 667 667 } 668 668 669 - 670 - /* Clear internal status */ 671 - kvm_clear_exception_queue(vcpu); 672 - kvm_clear_interrupt_queue(vcpu); 673 - 674 669 /* 675 670 * Since vmcb01 is not in use, we can use it to store some of the L1 676 671 * state.

+1 -2

arch/x86/kvm/svm/sev.c

··· 28 28 #include "cpuid.h" 29 29 #include "trace.h" 30 30 31 - #define __ex(x) __kvm_handle_fault_on_reboot(x) 32 - 33 31 #ifndef CONFIG_KVM_AMD_SEV 34 32 /* 35 33 * When this config is not defined, SEV feature is not supported and APIs in ··· 582 584 save->xcr0 = svm->vcpu.arch.xcr0; 583 585 save->pkru = svm->vcpu.arch.pkru; 584 586 save->xss = svm->vcpu.arch.ia32_xss; 587 + save->dr6 = svm->vcpu.arch.dr6; 585 588 586 589 /* 587 590 * SEV-ES will use a VMSA that is pointed to by the VMCB, not

+32 -65

arch/x86/kvm/svm/svm.c

··· 46 46 #include "kvm_onhyperv.h" 47 47 #include "svm_onhyperv.h" 48 48 49 - #define __ex(x) __kvm_handle_fault_on_reboot(x) 50 - 51 49 MODULE_AUTHOR("Qumranet"); 52 50 MODULE_LICENSE("GPL"); 53 51 ··· 259 261 static int get_max_npt_level(void) 260 262 { 261 263 #ifdef CONFIG_X86_64 262 - return PT64_ROOT_4LEVEL; 264 + return pgtable_l5_enabled() ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL; 263 265 #else 264 266 return PT32E_ROOT_LEVEL; 265 267 #endif ··· 457 459 458 460 if (sev_active()) { 459 461 pr_info("KVM is unsupported when running as an SEV guest\n"); 460 - return 0; 461 - } 462 - 463 - if (pgtable_l5_enabled()) { 464 - pr_info("KVM doesn't yet support 5-level paging on AMD SVM\n"); 465 462 return 0; 466 463 } 467 464 ··· 1008 1015 if (!boot_cpu_has(X86_FEATURE_NPT)) 1009 1016 npt_enabled = false; 1010 1017 1011 - kvm_configure_mmu(npt_enabled, get_max_npt_level(), PG_LEVEL_1G); 1018 + /* Force VM NPT level equal to the host's max NPT level */ 1019 + kvm_configure_mmu(npt_enabled, get_max_npt_level(), 1020 + get_max_npt_level(), PG_LEVEL_1G); 1012 1021 pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis"); 1013 1022 1014 1023 /* Note, SEV setup consumes npt_enabled. 
*/ ··· 1156 1161 struct vmcb_control_area *control = &svm->vmcb->control; 1157 1162 struct vmcb_save_area *save = &svm->vmcb->save; 1158 1163 1159 - vcpu->arch.hflags = 0; 1160 - 1161 1164 svm_set_intercept(svm, INTERCEPT_CR0_READ); 1162 1165 svm_set_intercept(svm, INTERCEPT_CR3_READ); 1163 1166 svm_set_intercept(svm, INTERCEPT_CR4_READ); ··· 1234 1241 SVM_SELECTOR_S_MASK | SVM_SELECTOR_CODE_MASK; 1235 1242 save->cs.limit = 0xffff; 1236 1243 1244 + save->gdtr.base = 0; 1237 1245 save->gdtr.limit = 0xffff; 1246 + save->idtr.base = 0; 1238 1247 save->idtr.limit = 0xffff; 1239 1248 1240 1249 init_sys_seg(&save->ldtr, SEG_TYPE_LDT); 1241 1250 init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16); 1242 - 1243 - svm_set_cr4(vcpu, 0); 1244 - svm_set_efer(vcpu, 0); 1245 - save->dr6 = 0xffff0ff0; 1246 - kvm_set_rflags(vcpu, X86_EFLAGS_FIXED); 1247 - save->rip = 0x0000fff0; 1248 - vcpu->arch.regs[VCPU_REGS_RIP] = save->rip; 1249 - 1250 - /* 1251 - * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0. 1252 - * It also updates the guest-visible cr0 value. 1253 - */ 1254 - svm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET); 1255 - kvm_mmu_reset_context(vcpu); 1256 - 1257 - save->cr4 = X86_CR4_PAE; 1258 - /* rdx = ?? 
*/ 1259 1251 1260 1252 if (npt_enabled) { 1261 1253 /* Setup VMCB for Nested Paging */ ··· 1251 1273 svm_clr_intercept(svm, INTERCEPT_CR3_WRITE); 1252 1274 save->g_pat = vcpu->arch.pat; 1253 1275 save->cr3 = 0; 1254 - save->cr4 = 0; 1255 1276 } 1256 1277 svm->current_vmcb->asid_generation = 0; 1257 1278 svm->asid = 0; 1258 1279 1259 1280 svm->nested.vmcb12_gpa = INVALID_GPA; 1260 1281 svm->nested.last_vmcb12_gpa = INVALID_GPA; 1261 - vcpu->arch.hflags = 0; 1262 1282 1263 1283 if (!kvm_pause_in_guest(vcpu->kvm)) { 1264 1284 control->pause_filter_count = pause_filter_count; ··· 1306 1330 static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) 1307 1331 { 1308 1332 struct vcpu_svm *svm = to_svm(vcpu); 1309 - u32 dummy; 1310 - u32 eax = 1; 1311 1333 1312 1334 svm->spec_ctrl = 0; 1313 1335 svm->virt_spec_ctrl = 0; 1314 1336 1315 - if (!init_event) { 1316 - vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE | 1317 - MSR_IA32_APICBASE_ENABLE; 1318 - if (kvm_vcpu_is_reset_bsp(vcpu)) 1319 - vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP; 1320 - } 1321 1337 init_vmcb(vcpu); 1322 - 1323 - kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, false); 1324 - kvm_rdx_write(vcpu, eax); 1325 - 1326 - if (kvm_vcpu_apicv_active(vcpu) && !init_event) 1327 - avic_update_vapic_bar(svm, APIC_DEFAULT_PHYS_BASE); 1328 1338 } 1329 1339 1330 1340 void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb) ··· 1475 1513 sd->current_vmcb = svm->vmcb; 1476 1514 indirect_branch_prediction_barrier(); 1477 1515 } 1478 - avic_vcpu_load(vcpu, cpu); 1516 + if (kvm_vcpu_apicv_active(vcpu)) 1517 + avic_vcpu_load(vcpu, cpu); 1479 1518 } 1480 1519 1481 1520 static void svm_vcpu_put(struct kvm_vcpu *vcpu) 1482 1521 { 1483 - avic_vcpu_put(vcpu); 1522 + if (kvm_vcpu_apicv_active(vcpu)) 1523 + avic_vcpu_put(vcpu); 1524 + 1484 1525 svm_prepare_host_switch(vcpu); 1485 1526 1486 1527 ++vcpu->stat.host_state_reload; ··· 1525 1560 load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)); 1526 
1561 break; 1527 1562 default: 1528 - WARN_ON_ONCE(1); 1563 + KVM_BUG_ON(1, vcpu->kvm); 1529 1564 } 1530 1565 } 1531 1566 ··· 2043 2078 return -EINVAL; 2044 2079 2045 2080 /* 2046 - * VMCB is undefined after a SHUTDOWN intercept 2047 - * so reinitialize it. 2081 + * VMCB is undefined after a SHUTDOWN intercept. INIT the vCPU to put 2082 + * the VMCB in a known good state. Unfortuately, KVM doesn't have 2083 + * KVM_MP_STATE_SHUTDOWN and can't add it without potentially breaking 2084 + * userspace. At a platform view, INIT is acceptable behavior as 2085 + * there exist bare metal platforms that automatically INIT the CPU 2086 + * in response to shutdown. 2048 2087 */ 2049 2088 clear_page(svm->vmcb); 2050 - init_vmcb(vcpu); 2089 + kvm_vcpu_reset(vcpu, true); 2051 2090 2052 2091 kvm_run->exit_reason = KVM_EXIT_SHUTDOWN; 2053 2092 return 0; ··· 2962 2993 svm->msr_decfg = data; 2963 2994 break; 2964 2995 } 2965 - case MSR_IA32_APICBASE: 2966 - if (kvm_vcpu_apicv_active(vcpu)) 2967 - avic_update_vapic_bar(to_svm(vcpu), data); 2968 - fallthrough; 2969 2996 default: 2970 2997 return kvm_set_msr_common(vcpu, msr); 2971 2998 } ··· 2986 3021 * In this case AVIC was temporarily disabled for 2987 3022 * requesting the IRQ window and we have to re-enable it. 
2988 3023 */ 2989 - svm_toggle_avic_for_irq_window(vcpu, true); 3024 + kvm_request_apicv_update(vcpu->kvm, true, APICV_INHIBIT_REASON_IRQWIN); 2990 3025 2991 3026 ++vcpu->stat.irq_window_exits; 2992 3027 return 1; ··· 3234 3269 "excp_to:", save->last_excp_to); 3235 3270 } 3236 3271 3272 + static bool svm_check_exit_valid(struct kvm_vcpu *vcpu, u64 exit_code) 3273 + { 3274 + return (exit_code < ARRAY_SIZE(svm_exit_handlers) && 3275 + svm_exit_handlers[exit_code]); 3276 + } 3277 + 3237 3278 static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code) 3238 3279 { 3239 - if (exit_code < ARRAY_SIZE(svm_exit_handlers) && 3240 - svm_exit_handlers[exit_code]) 3241 - return 0; 3242 - 3243 3280 vcpu_unimpl(vcpu, "svm: unexpected exit reason 0x%llx\n", exit_code); 3244 3281 dump_vmcb(vcpu); 3245 3282 vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; ··· 3249 3282 vcpu->run->internal.ndata = 2; 3250 3283 vcpu->run->internal.data[0] = exit_code; 3251 3284 vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu; 3252 - 3253 - return -EINVAL; 3285 + return 0; 3254 3286 } 3255 3287 3256 3288 int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code) 3257 3289 { 3258 - if (svm_handle_invalid_exit(vcpu, exit_code)) 3259 - return 0; 3290 + if (!svm_check_exit_valid(vcpu, exit_code)) 3291 + return svm_handle_invalid_exit(vcpu, exit_code); 3260 3292 3261 3293 #ifdef CONFIG_RETPOLINE 3262 3294 if (exit_code == SVM_EXIT_MSR) ··· 3539 3573 * via AVIC. In such case, we need to temporarily disable AVIC, 3540 3574 * and fallback to injecting IRQ via V_IRQ. 
3541 3575 */ 3542 - svm_toggle_avic_for_irq_window(vcpu, false); 3576 + kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_IRQWIN); 3543 3577 svm_set_vintr(svm); 3544 3578 } 3545 3579 } ··· 3773 3807 } 3774 3808 3775 3809 pre_svm_run(vcpu); 3810 + 3811 + WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu)); 3776 3812 3777 3813 sync_lapic_to_cr8(vcpu); 3778 3814 ··· 4578 4610 .set_virtual_apic_mode = svm_set_virtual_apic_mode, 4579 4611 .refresh_apicv_exec_ctrl = svm_refresh_apicv_exec_ctrl, 4580 4612 .check_apicv_inhibit_reasons = svm_check_apicv_inhibit_reasons, 4581 - .pre_update_apicv_exec_ctrl = svm_pre_update_apicv_exec_ctrl, 4582 4613 .load_eoi_exitmap = svm_load_eoi_exitmap, 4583 4614 .hwapic_irr_update = svm_hwapic_irr_update, 4584 4615 .hwapic_isr_update = svm_hwapic_isr_update,
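
Note on the exit-handler hunks above: the validity test is split out into `svm_check_exit_valid()`, so `svm_handle_invalid_exit()` only reports the error and `svm_invoke_exit_handler()` decides which path to take. A minimal sketch of that bounds-checked sparse-table dispatch follows. The handler table, the handler functions and the return values are hypothetical, and the reporting step is reduced to a plain return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

typedef int (*exit_handler_fn)(void);

static int handle_exit_a(void) { return 1; }
static int handle_exit_c(void) { return 1; }

/* Sparse table: index 1 deliberately has no handler (NULL slot). */
static const exit_handler_fn exit_handlers[] = {
	[0] = handle_exit_a,
	[2] = handle_exit_c,
};

/* Pure predicate: in range and a handler is installed. */
static bool check_exit_valid(uint64_t exit_code)
{
	return exit_code < ARRAY_SIZE(exit_handlers) &&
	       exit_handlers[exit_code] != NULL;
}

/* Reporting only; in this sketch 0 means "exit to userspace". */
static int handle_invalid_exit(uint64_t exit_code)
{
	(void)exit_code;
	return 0;
}

static int invoke_exit_handler(uint64_t exit_code)
{
	if (!check_exit_valid(exit_code))
		return handle_invalid_exit(exit_code);

	return exit_handlers[exit_code]();
}

/* Usage example: valid, missing and out-of-range exit codes. */
static void exit_dispatch_self_check(void)
{
	assert(invoke_exit_handler(0) == 1);
	assert(invoke_exit_handler(1) == 0);
	assert(invoke_exit_handler(99) == 0);
}
```

Keeping the predicate free of side effects lets other callers test an exit code without triggering the error report.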

-8

arch/x86/kvm/svm/svm.h

··· 503 503 504 504 #define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL 505 505 506 - static inline void avic_update_vapic_bar(struct vcpu_svm *svm, u64 data) 507 - { 508 - svm->vmcb->control.avic_vapic_bar = data & VMCB_AVIC_APIC_BAR_MASK; 509 - vmcb_mark_dirty(svm->vmcb, VMCB_AVIC); 510 - } 511 - 512 506 static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu) 513 507 { 514 508 struct vcpu_svm *svm = to_svm(vcpu); ··· 518 524 void avic_vm_destroy(struct kvm *kvm); 519 525 int avic_vm_init(struct kvm *kvm); 520 526 void avic_init_vmcb(struct vcpu_svm *svm); 521 - void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate); 522 527 int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu); 523 528 int avic_unaccelerated_access_interception(struct kvm_vcpu *vcpu); 524 529 int avic_init_vcpu(struct vcpu_svm *svm); ··· 527 534 void svm_set_virtual_apic_mode(struct kvm_vcpu *vcpu); 528 535 void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu); 529 536 bool svm_check_apicv_inhibit_reasons(ulong bit); 530 - void svm_pre_update_apicv_exec_ctrl(struct kvm *kvm, bool activate); 531 537 void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); 532 538 void svm_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr); 533 539 void svm_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr);

+1 -1

arch/x86/kvm/svm/svm_ops.h

··· 4 4 5 5 #include <linux/compiler_types.h> 6 6 7 - #include <asm/kvm_host.h> 7 + #include "x86.h" 8 8 9 9 #define svm_asm(insn, clobber...) \ 10 10 do { \

-1

arch/x86/kvm/vmx/evmcs.c

··· 14 14 15 15 #if IS_ENABLED(CONFIG_HYPERV) 16 16 17 - #define ROL16(val, n) ((u16)(((u16)(val) << (n)) | ((u16)(val) >> (16 - (n))))) 18 17 #define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x) 19 18 #define EVMCS1_FIELD(number, name, clean_field)[ROL16(number, 6)] = \ 20 19 {EVMCS1_OFFSET(name), clean_field}

-4

arch/x86/kvm/vmx/evmcs.h

··· 73 73 extern const struct evmcs_field vmcs_field_to_evmcs_1[]; 74 74 extern const unsigned int nr_evmcs_1_fields; 75 75 76 - #define ROL16(val, n) ((u16)(((u16)(val) << (n)) | ((u16)(val) >> (16 - (n))))) 77 - 78 76 static __always_inline int get_evmcs_offset(unsigned long field, 79 77 u16 *clean_field) 80 78 { ··· 92 94 93 95 return evmcs_field->offset; 94 96 } 95 - 96 - #undef ROL16 97 97 98 98 static inline void evmcs_write64(unsigned long field, u64 value) 99 99 {

+30 -26

arch/x86/kvm/vmx/nested.c

··· 2207 2207 } 2208 2208 } 2209 2209 2210 - static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) 2210 + static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs01, 2211 + struct vmcs12 *vmcs12) 2211 2212 { 2212 2213 u32 exec_control; 2213 2214 u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12); ··· 2219 2218 /* 2220 2219 * PIN CONTROLS 2221 2220 */ 2222 - exec_control = vmx_pin_based_exec_ctrl(vmx); 2221 + exec_control = __pin_controls_get(vmcs01); 2223 2222 exec_control |= (vmcs12->pin_based_vm_exec_control & 2224 2223 ~PIN_BASED_VMX_PREEMPTION_TIMER); 2225 2224 2226 2225 /* Posted interrupts setting is only taken from vmcs12. */ 2227 - if (nested_cpu_has_posted_intr(vmcs12)) { 2226 + vmx->nested.pi_pending = false; 2227 + if (nested_cpu_has_posted_intr(vmcs12)) 2228 2228 vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv; 2229 - vmx->nested.pi_pending = false; 2230 - } else { 2229 + else 2231 2230 exec_control &= ~PIN_BASED_POSTED_INTR; 2232 - } 2233 2231 pin_controls_set(vmx, exec_control); 2234 2232 2235 2233 /* 2236 2234 * EXEC CONTROLS 2237 2235 */ 2238 - exec_control = vmx_exec_control(vmx); /* L0's desires */ 2236 + exec_control = __exec_controls_get(vmcs01); /* L0's desires */ 2239 2237 exec_control &= ~CPU_BASED_INTR_WINDOW_EXITING; 2240 2238 exec_control &= ~CPU_BASED_NMI_WINDOW_EXITING; 2241 2239 exec_control &= ~CPU_BASED_TPR_SHADOW; ··· 2271 2271 * SECONDARY EXEC CONTROLS 2272 2272 */ 2273 2273 if (cpu_has_secondary_exec_ctrls()) { 2274 - exec_control = vmx->secondary_exec_control; 2274 + exec_control = __secondary_exec_controls_get(vmcs01); 2275 2275 2276 2276 /* Take the following fields only from vmcs12 */ 2277 2277 exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | 2278 + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | 2278 2279 SECONDARY_EXEC_ENABLE_INVPCID | 2279 2280 SECONDARY_EXEC_ENABLE_RDTSCP | 2280 2281 SECONDARY_EXEC_XSAVES | ··· 2283 2282 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | 2284 2283 
SECONDARY_EXEC_APIC_REGISTER_VIRT | 2285 2284 SECONDARY_EXEC_ENABLE_VMFUNC | 2286 - SECONDARY_EXEC_TSC_SCALING); 2285 + SECONDARY_EXEC_TSC_SCALING | 2286 + SECONDARY_EXEC_DESC); 2287 + 2287 2288 if (nested_cpu_has(vmcs12, 2288 2289 CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) 2289 2290 exec_control |= vmcs12->secondary_vm_exec_control; ··· 2325 2322 * on the related bits (if supported by the CPU) in the hope that 2326 2323 * we can avoid VMWrites during vmx_set_efer(). 2327 2324 */ 2328 - exec_control = (vmcs12->vm_entry_controls | vmx_vmentry_ctrl()) & 2329 - ~VM_ENTRY_IA32E_MODE & ~VM_ENTRY_LOAD_IA32_EFER; 2325 + exec_control = __vm_entry_controls_get(vmcs01); 2326 + exec_control |= vmcs12->vm_entry_controls; 2327 + exec_control &= ~(VM_ENTRY_IA32E_MODE | VM_ENTRY_LOAD_IA32_EFER); 2330 2328 if (cpu_has_load_ia32_efer()) { 2331 2329 if (guest_efer & EFER_LMA) 2332 2330 exec_control |= VM_ENTRY_IA32E_MODE; ··· 2343 2339 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER 2344 2340 * bits may be modified by vmx_set_efer() in prepare_vmcs02(). 
2345 2341 */ 2346 - exec_control = vmx_vmexit_ctrl(); 2342 + exec_control = __vm_exit_controls_get(vmcs01); 2347 2343 if (cpu_has_load_ia32_efer() && guest_efer != host_efer) 2348 2344 exec_control |= VM_EXIT_LOAD_IA32_EFER; 2345 + else 2346 + exec_control &= ~VM_EXIT_LOAD_IA32_EFER; 2349 2347 vm_exit_controls_set(vmx, exec_control); 2350 2348 2351 2349 /* ··· 3390 3384 3391 3385 vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02); 3392 3386 3393 - prepare_vmcs02_early(vmx, vmcs12); 3387 + prepare_vmcs02_early(vmx, &vmx->vmcs01, vmcs12); 3394 3388 3395 3389 if (from_vmentry) { 3396 3390 if (unlikely(!nested_get_vmcs12_pages(vcpu))) { ··· 4310 4304 seg.l = 1; 4311 4305 else 4312 4306 seg.db = 1; 4313 - vmx_set_segment(vcpu, &seg, VCPU_SREG_CS); 4307 + __vmx_set_segment(vcpu, &seg, VCPU_SREG_CS); 4314 4308 seg = (struct kvm_segment) { 4315 4309 .base = 0, 4316 4310 .limit = 0xFFFFFFFF, ··· 4321 4315 .g = 1 4322 4316 }; 4323 4317 seg.selector = vmcs12->host_ds_selector; 4324 - vmx_set_segment(vcpu, &seg, VCPU_SREG_DS); 4318 + __vmx_set_segment(vcpu, &seg, VCPU_SREG_DS); 4325 4319 seg.selector = vmcs12->host_es_selector; 4326 - vmx_set_segment(vcpu, &seg, VCPU_SREG_ES); 4320 + __vmx_set_segment(vcpu, &seg, VCPU_SREG_ES); 4327 4321 seg.selector = vmcs12->host_ss_selector; 4328 - vmx_set_segment(vcpu, &seg, VCPU_SREG_SS); 4322 + __vmx_set_segment(vcpu, &seg, VCPU_SREG_SS); 4329 4323 seg.selector = vmcs12->host_fs_selector; 4330 4324 seg.base = vmcs12->host_fs_base; 4331 - vmx_set_segment(vcpu, &seg, VCPU_SREG_FS); 4325 + __vmx_set_segment(vcpu, &seg, VCPU_SREG_FS); 4332 4326 seg.selector = vmcs12->host_gs_selector; 4333 4327 seg.base = vmcs12->host_gs_base; 4334 - vmx_set_segment(vcpu, &seg, VCPU_SREG_GS); 4328 + __vmx_set_segment(vcpu, &seg, VCPU_SREG_GS); 4335 4329 seg = (struct kvm_segment) { 4336 4330 .base = vmcs12->host_tr_base, 4337 4331 .limit = 0x67, ··· 4339 4333 .type = 11, 4340 4334 .present = 1 4341 4335 }; 4342 - vmx_set_segment(vcpu, &seg, VCPU_SREG_TR); 4336 + 
__vmx_set_segment(vcpu, &seg, VCPU_SREG_TR); 4337 + 4338 + memset(&seg, 0, sizeof(seg)); 4339 + seg.unusable = 1; 4340 + __vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR); 4343 4341 4344 4342 kvm_set_dr(vcpu, 7, 0x400); 4345 4343 vmcs_write64(GUEST_IA32_DEBUGCTL, 0); 4346 - 4347 - if (cpu_has_vmx_msr_bitmap()) 4348 - vmx_update_msr_bitmap(vcpu); 4349 4344 4350 4345 if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr, 4351 4346 vmcs12->vm_exit_msr_load_count)) ··· 4425 4418 ept_save_pdptrs(vcpu); 4426 4419 4427 4420 kvm_mmu_reset_context(vcpu); 4428 - 4429 - if (cpu_has_vmx_msr_bitmap()) 4430 - vmx_update_msr_bitmap(vcpu); 4431 4421 4432 4422 /* 4433 4423 * This nasty bit of open coding is a compromise between blindly
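
Note on the VM-entry controls hunk above: the rewritten code starts from the value cached for vmcs01, ORs in the vmcs12 bits, and then clears both dynamically managed bits with a single combined mask before re-adding them as needed. The sketch below shows only that bit manipulation. The `DEMO_*` constants and `merge_entry_controls()` are hypothetical names with illustrative bit positions, and the `guest_long_mode` parameter stands in for the `guest_efer & EFER_LMA` test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define DEMO_ENTRY_IA32E_MODE	(1u << 9)
#define DEMO_ENTRY_LOAD_EFER	(1u << 15)

static uint32_t merge_entry_controls(uint32_t vmcs01_ctl, uint32_t vmcs12_ctl,
				     bool guest_long_mode)
{
	uint32_t ctl = vmcs01_ctl;	/* start from the cached vmcs01 value */

	ctl |= vmcs12_ctl;		/* add what the nested hypervisor asked for */

	/* Clear both dynamic bits with one combined mask. */
	ctl &= ~(DEMO_ENTRY_IA32E_MODE | DEMO_ENTRY_LOAD_EFER);

	if (guest_long_mode)
		ctl |= DEMO_ENTRY_IA32E_MODE;

	return ctl;
}

/* Usage example: dynamic bits from either source are dropped, then recomputed. */
static void merge_entry_controls_self_check(void)
{
	uint32_t from_vmcs01 = 0x0004u | DEMO_ENTRY_LOAD_EFER;
	uint32_t from_vmcs12 = 0x0010u | DEMO_ENTRY_IA32E_MODE;

	assert(merge_entry_controls(from_vmcs01, from_vmcs12, false) == 0x0014u);
	assert(merge_entry_controls(from_vmcs01, from_vmcs12, true) ==
	       (0x0014u | DEMO_ENTRY_IA32E_MODE));
}
```

Clearing with `~(A | B)` is equivalent to the older `& ~A & ~B` form; the three-statement layout mainly makes the order of operations easier to read.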

+2 -2

arch/x86/kvm/vmx/pmu_intel.c

··· 437 437 !(msr & MSR_PMC_FULL_WIDTH_BIT)) 438 438 data = (s64)(s32)data; 439 439 pmc->counter += data - pmc_read_counter(pmc); 440 - if (pmc->perf_event) 440 + if (pmc->perf_event && !pmc->is_paused) 441 441 perf_event_period(pmc->perf_event, 442 442 get_sample_period(pmc, data)); 443 443 return 0; 444 444 } else if ((pmc = get_fixed_pmc(pmu, msr))) { 445 445 pmc->counter += data - pmc_read_counter(pmc); 446 - if (pmc->perf_event) 446 + if (pmc->perf_event && !pmc->is_paused) 447 447 perf_event_period(pmc->perf_event, 448 448 get_sample_period(pmc, data)); 449 449 return 0;

+2

arch/x86/kvm/vmx/vmcs.h

··· 11 11 12 12 #include "capabilities.h" 13 13 14 + #define ROL16(val, n) ((u16)(((u16)(val) << (n)) | ((u16)(val) >> (16 - (n))))) 15 + 14 16 struct vmcs_hdr { 15 17 u32 revision_id:31; 16 18 u32 shadow_vmcs:1;
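
Note on the hunk above: `ROL16` moves into vmcs.h so that the duplicate definitions in evmcs.c, evmcs.h, vmcs12.c and vmcs12.h can be removed. The `FIELD()` and `EVMCS1_FIELD()` macros use `ROL16(number, 6)` to turn a 16-bit field encoding into a table index. The sketch below reproduces the macro with `<stdint.h>` types in place of the kernel's `u16`. The helper names are hypothetical, and 0x6800 is used purely as a sample encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Same rotation as the kernel macro, using uint16_t instead of u16. */
#define ROL16(val, n) \
	((uint16_t)(((uint16_t)(val) << (n)) | ((uint16_t)(val) >> (16 - (n)))))

/* Rotate left by 6: the high bits of the encoding land in the low bits. */
static uint16_t field_to_index(uint16_t encoding)
{
	return ROL16(encoding, 6);
}

/* Rotate left by 10, which undoes a 16-bit rotate left by 6. */
static uint16_t index_to_field(uint16_t index)
{
	return ROL16(index, 10);
}

/* Usage example: a sample encoding and its round trip. */
static void rol16_self_check(void)
{
	assert(field_to_index(0x6800) == 0x001A);
	assert(index_to_field(0x001A) == 0x6800);
}
```

Because the rotation is a bijection on 16-bit values, distinct encodings always map to distinct indices, which is what allows the designated-initializer tables to be indexed by the rotated value.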

-1

arch/x86/kvm/vmx/vmcs12.c

··· 2 2 3 3 #include "vmcs12.h" 4 4 5 - #define ROL16(val, n) ((u16)(((u16)(val) << (n)) | ((u16)(val) >> (16 - (n))))) 6 5 #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x) 7 6 #define FIELD(number, name) [ROL16(number, 6)] = VMCS12_OFFSET(name) 8 7 #define FIELD64(number, name) \

-4

arch/x86/kvm/vmx/vmcs12.h

··· 364 364 extern const unsigned short vmcs_field_to_offset_table[]; 365 365 extern const unsigned int nr_vmcs12_fields; 366 366 367 - #define ROL16(val, n) ((u16)(((u16)(val) << (n)) | ((u16)(val) >> (16 - (n))))) 368 - 369 367 static inline short vmcs_field_to_offset(unsigned long field) 370 368 { 371 369 unsigned short offset; ··· 382 384 return -ENOENT; 383 385 return offset; 384 386 } 385 - 386 - #undef ROL16 387 387 388 388 static inline u64 vmcs12_read_any(struct vmcs12 *vmcs12, unsigned long field, 389 389 u16 offset)

+181 -160

arch/x86/kvm/vmx/vmx.c

··· 136 136 #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD) 137 137 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE 138 138 #define KVM_VM_CR0_ALWAYS_ON \ 139 - (KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | \ 140 - X86_CR0_WP | X86_CR0_PG | X86_CR0_PE) 139 + (KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | X86_CR0_PG | X86_CR0_PE) 141 140 142 141 #define KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR4_VMXE 143 142 #define KVM_PMODE_VM_CR4_ALWAYS_ON (X86_CR4_PAE | X86_CR4_VMXE) ··· 1647 1648 } 1648 1649 1649 1650 /* 1650 - * Set up the vmcs to automatically save and restore system 1651 - * msrs. Don't touch the 64-bit msrs if the guest is in legacy 1652 - * mode, as fiddling with msrs is very expensive. 1651 + * Configuring user return MSRs to automatically save, load, and restore MSRs 1652 + * that need to be shoved into hardware when running the guest. Note, omitting 1653 + * an MSR here does _NOT_ mean it's not emulated, only that it will not be 1654 + * loaded into hardware when running the guest. 1653 1655 */ 1654 - static void setup_msrs(struct vcpu_vmx *vmx) 1656 + static void vmx_setup_uret_msrs(struct vcpu_vmx *vmx) 1655 1657 { 1656 1658 #ifdef CONFIG_X86_64 1657 1659 bool load_syscall_msrs; ··· 1681 1681 * so that TSX remains always disabled. 1682 1682 */ 1683 1683 vmx_setup_uret_msr(vmx, MSR_IA32_TSX_CTRL, boot_cpu_has(X86_FEATURE_RTM)); 1684 - 1685 - if (cpu_has_vmx_msr_bitmap()) 1686 - vmx_update_msr_bitmap(&vmx->vcpu); 1687 1684 1688 1685 /* 1689 1686 * The set of MSRs to load may have changed, reload MSRs before the ··· 2260 2263 vcpu->arch.cr0 |= vmcs_readl(GUEST_CR0) & guest_owned_bits; 2261 2264 break; 2262 2265 case VCPU_EXREG_CR3: 2263 - if (is_unrestricted_guest(vcpu) || 2264 - (enable_ept && is_paging(vcpu))) 2266 + /* 2267 + * When intercepting CR3 loads, e.g. for shadowing paging, KVM's 2268 + * CR3 is loaded into hardware, not the guest's CR3. 
2269 + */ 2270 + if (!(exec_controls_get(to_vmx(vcpu)) & CPU_BASED_CR3_LOAD_EXITING)) 2265 2271 vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); 2266 2272 break; 2267 2273 case VCPU_EXREG_CR4: ··· 2274 2274 vcpu->arch.cr4 |= vmcs_readl(GUEST_CR4) & guest_owned_bits; 2275 2275 break; 2276 2276 default: 2277 - WARN_ON_ONCE(1); 2277 + KVM_BUG_ON(1, vcpu->kvm); 2278 2278 break; 2279 2279 } 2280 2280 } ··· 2733 2733 save->dpl = save->selector & SEGMENT_RPL_MASK; 2734 2734 save->s = 1; 2735 2735 } 2736 - vmx_set_segment(vcpu, save, seg); 2736 + __vmx_set_segment(vcpu, save, seg); 2737 2737 } 2738 2738 2739 2739 static void enter_pmode(struct kvm_vcpu *vcpu) ··· 2754 2754 2755 2755 vmx->rmode.vm86_active = 0; 2756 2756 2757 - vmx_set_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_TR], VCPU_SREG_TR); 2757 + __vmx_set_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_TR], VCPU_SREG_TR); 2758 2758 2759 2759 flags = vmcs_readl(GUEST_RFLAGS); 2760 2760 flags &= RMODE_GUEST_OWNED_EFLAGS_BITS; ··· 2852 2852 fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.segs[VCPU_SREG_DS]); 2853 2853 fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.segs[VCPU_SREG_GS]); 2854 2854 fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.segs[VCPU_SREG_FS]); 2855 - 2856 - kvm_mmu_reset_context(vcpu); 2857 2855 } 2858 2856 2859 2857 int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer) ··· 2872 2874 2873 2875 msr->data = efer & ~EFER_LME; 2874 2876 } 2875 - setup_msrs(vmx); 2877 + vmx_setup_uret_msrs(vmx); 2876 2878 return 0; 2877 2879 } 2878 2880 ··· 2995 2997 kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR); 2996 2998 } 2997 2999 2998 - static void ept_update_paging_mode_cr0(unsigned long *hw_cr0, 2999 - unsigned long cr0, 3000 - struct kvm_vcpu *vcpu) 3001 - { 3002 - struct vcpu_vmx *vmx = to_vmx(vcpu); 3003 - 3004 - if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3)) 3005 - vmx_cache_reg(vcpu, VCPU_EXREG_CR3); 3006 - if (!(cr0 & X86_CR0_PG)) { 3007 - /* From paging/starting to nonpaging */ 3008 - exec_controls_setbit(vmx, CPU_BASED_CR3_LOAD_EXITING | 
3009 - CPU_BASED_CR3_STORE_EXITING); 3010 - vcpu->arch.cr0 = cr0; 3011 - vmx_set_cr4(vcpu, kvm_read_cr4(vcpu)); 3012 - } else if (!is_paging(vcpu)) { 3013 - /* From nonpaging to paging */ 3014 - exec_controls_clearbit(vmx, CPU_BASED_CR3_LOAD_EXITING | 3015 - CPU_BASED_CR3_STORE_EXITING); 3016 - vcpu->arch.cr0 = cr0; 3017 - vmx_set_cr4(vcpu, kvm_read_cr4(vcpu)); 3018 - } 3019 - 3020 - if (!(cr0 & X86_CR0_WP)) 3021 - *hw_cr0 &= ~X86_CR0_WP; 3022 - } 3000 + #define CR3_EXITING_BITS (CPU_BASED_CR3_LOAD_EXITING | \ 3001 + CPU_BASED_CR3_STORE_EXITING) 3023 3002 3024 3003 void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) 3025 3004 { 3026 3005 struct vcpu_vmx *vmx = to_vmx(vcpu); 3027 - unsigned long hw_cr0; 3006 + unsigned long hw_cr0, old_cr0_pg; 3007 + u32 tmp; 3008 + 3009 + old_cr0_pg = kvm_read_cr0_bits(vcpu, X86_CR0_PG); 3028 3010 3029 3011 hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF); 3030 3012 if (is_unrestricted_guest(vcpu)) 3031 3013 hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST; 3032 3014 else { 3033 3015 hw_cr0 |= KVM_VM_CR0_ALWAYS_ON; 3016 + if (!enable_ept) 3017 + hw_cr0 |= X86_CR0_WP; 3034 3018 3035 3019 if (vmx->rmode.vm86_active && (cr0 & X86_CR0_PE)) 3036 3020 enter_pmode(vcpu); ··· 3021 3041 enter_rmode(vcpu); 3022 3042 } 3023 3043 3024 - #ifdef CONFIG_X86_64 3025 - if (vcpu->arch.efer & EFER_LME) { 3026 - if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) 3027 - enter_lmode(vcpu); 3028 - if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) 3029 - exit_lmode(vcpu); 3030 - } 3031 - #endif 3032 - 3033 - if (enable_ept && !is_unrestricted_guest(vcpu)) 3034 - ept_update_paging_mode_cr0(&hw_cr0, cr0, vcpu); 3035 - 3036 3044 vmcs_writel(CR0_READ_SHADOW, cr0); 3037 3045 vmcs_writel(GUEST_CR0, hw_cr0); 3038 3046 vcpu->arch.cr0 = cr0; 3039 3047 kvm_register_mark_available(vcpu, VCPU_EXREG_CR0); 3048 + 3049 + #ifdef CONFIG_X86_64 3050 + if (vcpu->arch.efer & EFER_LME) { 3051 + if (!old_cr0_pg && (cr0 & X86_CR0_PG)) 3052 + enter_lmode(vcpu); 3053 + else if (old_cr0_pg && 
!(cr0 & X86_CR0_PG)) 3054 + exit_lmode(vcpu); 3055 + } 3056 + #endif 3057 + 3058 + if (enable_ept && !is_unrestricted_guest(vcpu)) { 3059 + /* 3060 + * Ensure KVM has an up-to-date snapshot of the guest's CR3. If 3061 + * the below code _enables_ CR3 exiting, vmx_cache_reg() will 3062 + * (correctly) stop reading vmcs.GUEST_CR3 because it thinks 3063 + * KVM's CR3 is installed. 3064 + */ 3065 + if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3)) 3066 + vmx_cache_reg(vcpu, VCPU_EXREG_CR3); 3067 + 3068 + /* 3069 + * When running with EPT but not unrestricted guest, KVM must 3070 + * intercept CR3 accesses when paging is _disabled_. This is 3071 + * necessary because restricted guests can't actually run with 3072 + * paging disabled, and so KVM stuffs its own CR3 in order to 3073 + * run the guest when identity mapped page tables. 3074 + * 3075 + * Do _NOT_ check the old CR0.PG, e.g. to optimize away the 3076 + * update, it may be stale with respect to CR3 interception, 3077 + * e.g. after nested VM-Enter. 3078 + * 3079 + * Lastly, honor L1's desires, i.e. intercept CR3 loads and/or 3080 + * stores to forward them to L1, even if KVM does not need to 3081 + * intercept them to preserve its identity mapped page tables. 3082 + */ 3083 + if (!(cr0 & X86_CR0_PG)) { 3084 + exec_controls_setbit(vmx, CR3_EXITING_BITS); 3085 + } else if (!is_guest_mode(vcpu)) { 3086 + exec_controls_clearbit(vmx, CR3_EXITING_BITS); 3087 + } else { 3088 + tmp = exec_controls_get(vmx); 3089 + tmp &= ~CR3_EXITING_BITS; 3090 + tmp |= get_vmcs12(vcpu)->cpu_based_vm_exec_control & CR3_EXITING_BITS; 3091 + exec_controls_set(vmx, tmp); 3092 + } 3093 + 3094 + /* Note, vmx_set_cr4() consumes the new vcpu->arch.cr0. 
*/ 3095 + if ((old_cr0_pg ^ cr0) & X86_CR0_PG) 3096 + vmx_set_cr4(vcpu, kvm_read_cr4(vcpu)); 3097 + } 3040 3098 3041 3099 /* depends on vcpu->arch.cr0 to be set to a new value */ 3042 3100 vmx->emulation_required = emulation_required(vcpu); ··· 3289 3271 return ar; 3290 3272 } 3291 3273 3292 - void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) 3274 + void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) 3293 3275 { 3294 3276 struct vcpu_vmx *vmx = to_vmx(vcpu); 3295 3277 const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; ··· 3302 3284 vmcs_write16(sf->selector, var->selector); 3303 3285 else if (var->s) 3304 3286 fix_rmode_seg(seg, &vmx->rmode.segs[seg]); 3305 - goto out; 3287 + return; 3306 3288 } 3307 3289 3308 3290 vmcs_writel(sf->base, var->base); ··· 3324 3306 var->type |= 0x1; /* Accessed */ 3325 3307 3326 3308 vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var)); 3309 + } 3327 3310 3328 - out: 3329 - vmx->emulation_required = emulation_required(vcpu); 3311 + static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) 3312 + { 3313 + __vmx_set_segment(vcpu, var, seg); 3314 + 3315 + to_vmx(vcpu)->emulation_required = emulation_required(vcpu); 3330 3316 } 3331 3317 3332 3318 static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) ··· 3812 3790 vmx_set_msr_bitmap_write(msr_bitmap, msr); 3813 3791 } 3814 3792 3815 - static u8 vmx_msr_bitmap_mode(struct kvm_vcpu *vcpu) 3816 - { 3817 - u8 mode = 0; 3818 - 3819 - if (cpu_has_secondary_exec_ctrls() && 3820 - (secondary_exec_controls_get(to_vmx(vcpu)) & 3821 - SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) { 3822 - mode |= MSR_BITMAP_MODE_X2APIC; 3823 - if (enable_apicv && kvm_vcpu_apicv_active(vcpu)) 3824 - mode |= MSR_BITMAP_MODE_X2APIC_APICV; 3825 - } 3826 - 3827 - return mode; 3828 - } 3829 - 3830 3793 static void vmx_reset_x2apic_msrs(struct kvm_vcpu *vcpu, u8 mode) 3831 3794 { 3832 3795 unsigned long 
*msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap; ··· 3829 3822 } 3830 3823 } 3831 3824 3832 - static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu, u8 mode) 3825 + static void vmx_update_msr_bitmap_x2apic(struct kvm_vcpu *vcpu) 3833 3826 { 3827 + struct vcpu_vmx *vmx = to_vmx(vcpu); 3828 + u8 mode; 3829 + 3834 3830 if (!cpu_has_vmx_msr_bitmap()) 3835 3831 return; 3832 + 3833 + if (cpu_has_secondary_exec_ctrls() && 3834 + (secondary_exec_controls_get(vmx) & 3835 + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) { 3836 + mode = MSR_BITMAP_MODE_X2APIC; 3837 + if (enable_apicv && kvm_vcpu_apicv_active(vcpu)) 3838 + mode |= MSR_BITMAP_MODE_X2APIC_APICV; 3839 + } else { 3840 + mode = 0; 3841 + } 3842 + 3843 + if (mode == vmx->x2apic_msr_bitmap_mode) 3844 + return; 3845 + 3846 + vmx->x2apic_msr_bitmap_mode = mode; 3836 3847 3837 3848 vmx_reset_x2apic_msrs(vcpu, mode); 3838 3849 ··· 3866 3841 vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_EOI), MSR_TYPE_W); 3867 3842 vmx_disable_intercept_for_msr(vcpu, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W); 3868 3843 } 3869 - } 3870 - 3871 - void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu) 3872 - { 3873 - struct vcpu_vmx *vmx = to_vmx(vcpu); 3874 - u8 mode = vmx_msr_bitmap_mode(vcpu); 3875 - u8 changed = mode ^ vmx->msr_bitmap_mode; 3876 - 3877 - if (!changed) 3878 - return; 3879 - 3880 - if (changed & (MSR_BITMAP_MODE_X2APIC | MSR_BITMAP_MODE_X2APIC_APICV)) 3881 - vmx_update_msr_bitmap_x2apic(vcpu, mode); 3882 - 3883 - vmx->msr_bitmap_mode = mode; 3884 3844 } 3885 3845 3886 3846 void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu) ··· 3924 3914 } 3925 3915 3926 3916 pt_update_intercept_for_msr(vcpu); 3927 - vmx_update_msr_bitmap_x2apic(vcpu, vmx_msr_bitmap_mode(vcpu)); 3928 3917 } 3929 3918 3930 3919 static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu, ··· 4095 4086 vmcs_writel(CR4_GUEST_HOST_MASK, ~vcpu->arch.cr4_guest_owned_bits); 4096 4087 } 4097 4088 4098 - u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx 
*vmx) 4089 + static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx) 4099 4090 { 4100 4091 u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl; 4101 4092 ··· 4109 4100 pin_based_exec_ctrl &= ~PIN_BASED_VMX_PREEMPTION_TIMER; 4110 4101 4111 4102 return pin_based_exec_ctrl; 4103 + } 4104 + 4105 + static u32 vmx_vmentry_ctrl(void) 4106 + { 4107 + u32 vmentry_ctrl = vmcs_config.vmentry_ctrl; 4108 + 4109 + if (vmx_pt_mode_is_system()) 4110 + vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP | 4111 + VM_ENTRY_LOAD_IA32_RTIT_CTL); 4112 + /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */ 4113 + return vmentry_ctrl & 4114 + ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_EFER); 4115 + } 4116 + 4117 + static u32 vmx_vmexit_ctrl(void) 4118 + { 4119 + u32 vmexit_ctrl = vmcs_config.vmexit_ctrl; 4120 + 4121 + if (vmx_pt_mode_is_system()) 4122 + vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP | 4123 + VM_EXIT_CLEAR_IA32_RTIT_CTL); 4124 + /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */ 4125 + return vmexit_ctrl & 4126 + ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER); 4112 4127 } 4113 4128 4114 4129 static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) ··· 4151 4118 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); 4152 4119 } 4153 4120 4154 - if (cpu_has_vmx_msr_bitmap()) 4155 - vmx_update_msr_bitmap(vcpu); 4121 + vmx_update_msr_bitmap_x2apic(vcpu); 4156 4122 } 4157 4123 4158 - u32 vmx_exec_control(struct vcpu_vmx *vmx) 4124 + static u32 vmx_exec_control(struct vcpu_vmx *vmx) 4159 4125 { 4160 4126 u32 exec_control = vmcs_config.cpu_based_exec_ctrl; 4161 4127 ··· 4236 4204 #define vmx_adjust_sec_exec_exiting(vmx, exec_control, lname, uname) \ 4237 4205 vmx_adjust_sec_exec_control(vmx, exec_control, lname, uname, uname##_EXITING, true) 4238 4206 4239 - static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx) 4207 + static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx) 4240 4208 { 4241 4209 struct kvm_vcpu *vcpu = 
&vmx->vcpu; 4242 4210 ··· 4322 4290 if (!vcpu->kvm->arch.bus_lock_detection_enabled) 4323 4291 exec_control &= ~SECONDARY_EXEC_BUS_LOCK_DETECTION; 4324 4292 4325 - vmx->secondary_exec_control = exec_control; 4293 + return exec_control; 4326 4294 } 4327 4295 4328 4296 #define VMX_XSS_EXIT_BITMAP 0 ··· 4346 4314 4347 4315 exec_controls_set(vmx, vmx_exec_control(vmx)); 4348 4316 4349 - if (cpu_has_secondary_exec_ctrls()) { 4350 - vmx_compute_secondary_exec_control(vmx); 4351 - secondary_exec_controls_set(vmx, vmx->secondary_exec_control); 4352 - } 4317 + if (cpu_has_secondary_exec_ctrls()) 4318 + secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx)); 4353 4319 4354 4320 if (kvm_vcpu_apicv_active(&vmx->vcpu)) { 4355 4321 vmcs_write64(EOI_EXIT_BITMAP0, 0); ··· 4418 4388 vmx->pt_desc.guest.output_mask = 0x7F; 4419 4389 vmcs_write64(GUEST_IA32_RTIT_CTL, 0); 4420 4390 } 4391 + 4392 + vmcs_write32(GUEST_SYSENTER_CS, 0); 4393 + vmcs_writel(GUEST_SYSENTER_ESP, 0); 4394 + vmcs_writel(GUEST_SYSENTER_EIP, 0); 4395 + vmcs_write64(GUEST_IA32_DEBUGCTL, 0); 4396 + 4397 + if (cpu_has_vmx_tpr_shadow()) { 4398 + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0); 4399 + if (cpu_need_tpr_shadow(&vmx->vcpu)) 4400 + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 4401 + __pa(vmx->vcpu.arch.apic->regs)); 4402 + vmcs_write32(TPR_THRESHOLD, 0); 4403 + } 4404 + 4405 + vmx_setup_uret_msrs(vmx); 4421 4406 } 4422 4407 4423 4408 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) 4424 4409 { 4425 4410 struct vcpu_vmx *vmx = to_vmx(vcpu); 4426 - struct msr_data apic_base_msr; 4427 - u64 cr0; 4428 4411 4429 4412 vmx->rmode.vm86_active = 0; 4430 4413 vmx->spec_ctrl = 0; 4431 4414 4432 4415 vmx->msr_ia32_umwait_control = 0; 4433 4416 4434 - vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val(); 4435 4417 vmx->hv_deadline_tsc = -1; 4436 4418 kvm_set_cr8(vcpu, 0); 4437 - 4438 - if (!init_event) { 4439 - apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | 4440 - MSR_IA32_APICBASE_ENABLE; 4441 - if 
(kvm_vcpu_is_reset_bsp(vcpu)) 4442 - apic_base_msr.data |= MSR_IA32_APICBASE_BSP; 4443 - apic_base_msr.host_initiated = true; 4444 - kvm_set_apic_base(vcpu, &apic_base_msr); 4445 - } 4446 4419 4447 4420 vmx_segment_cache_clear(vmx); 4448 4421 ··· 4469 4436 vmcs_write32(GUEST_LDTR_LIMIT, 0xffff); 4470 4437 vmcs_write32(GUEST_LDTR_AR_BYTES, 0x00082); 4471 4438 4472 - if (!init_event) { 4473 - vmcs_write32(GUEST_SYSENTER_CS, 0); 4474 - vmcs_writel(GUEST_SYSENTER_ESP, 0); 4475 - vmcs_writel(GUEST_SYSENTER_EIP, 0); 4476 - vmcs_write64(GUEST_IA32_DEBUGCTL, 0); 4477 - } 4478 - 4479 - kvm_set_rflags(vcpu, X86_EFLAGS_FIXED); 4480 - kvm_rip_write(vcpu, 0xfff0); 4481 - 4482 4439 vmcs_writel(GUEST_GDTR_BASE, 0); 4483 4440 vmcs_write32(GUEST_GDTR_LIMIT, 0xffff); 4484 4441 ··· 4481 4458 if (kvm_mpx_supported()) 4482 4459 vmcs_write64(GUEST_BNDCFGS, 0); 4483 4460 4484 - setup_msrs(vmx); 4485 - 4486 4461 vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); /* 22.2.1 */ 4487 - 4488 - if (cpu_has_vmx_tpr_shadow() && !init_event) { 4489 - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0); 4490 - if (cpu_need_tpr_shadow(vcpu)) 4491 - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 4492 - __pa(vcpu->arch.apic->regs)); 4493 - vmcs_write32(TPR_THRESHOLD, 0); 4494 - } 4495 4462 4496 4463 kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu); 4497 4464 4498 - cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET; 4499 - vmx->vcpu.arch.cr0 = cr0; 4500 - vmx_set_cr0(vcpu, cr0); /* enter rmode */ 4501 - vmx_set_cr4(vcpu, 0); 4502 - vmx_set_efer(vcpu, 0); 4503 - 4504 - vmx_update_exception_bitmap(vcpu); 4505 - 4506 4465 vpid_sync_context(vmx->vpid); 4507 - if (init_event) 4508 - vmx_clear_hlt(vcpu); 4509 4466 } 4510 4467 4511 4468 static void vmx_enable_irq_window(struct kvm_vcpu *vcpu) ··· 4999 4996 return kvm_complete_insn_gp(vcpu, err); 5000 4997 case 3: 5001 4998 WARN_ON_ONCE(enable_unrestricted_guest); 4999 + 5002 5000 err = kvm_set_cr3(vcpu, val); 5003 5001 return kvm_complete_insn_gp(vcpu, err); 5004 5002 case 4: ··· 5025 5021 } 
5026 5022 break; 5027 5023 case 2: /* clts */ 5028 - WARN_ONCE(1, "Guest should always own CR0.TS"); 5029 - vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS)); 5030 - trace_kvm_cr_write(0, kvm_read_cr0(vcpu)); 5031 - return kvm_skip_emulated_instruction(vcpu); 5024 + KVM_BUG(1, vcpu->kvm, "Guest always owns CR0.TS"); 5025 + return -EIO; 5032 5026 case 1: /*mov from cr*/ 5033 5027 switch (cr) { 5034 5028 case 3: 5035 5029 WARN_ON_ONCE(enable_unrestricted_guest); 5030 + 5036 5031 val = kvm_read_cr3(vcpu); 5037 5032 kvm_register_write(vcpu, reg, val); 5038 5033 trace_kvm_cr_read(cr, val); ··· 5132 5129 5133 5130 vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT; 5134 5131 exec_controls_setbit(to_vmx(vcpu), CPU_BASED_MOV_DR_EXITING); 5132 + 5133 + /* 5134 + * exc_debug expects dr6 to be cleared after it runs, avoid that it sees 5135 + * a stale dr6 from the guest. 5136 + */ 5137 + set_debugreg(DR6_RESERVED, 6); 5135 5138 } 5136 5139 5137 5140 static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val) ··· 5347 5338 5348 5339 static int handle_nmi_window(struct kvm_vcpu *vcpu) 5349 5340 { 5350 - WARN_ON_ONCE(!enable_vnmi); 5341 + if (KVM_BUG_ON(!enable_vnmi, vcpu->kvm)) 5342 + return -EIO; 5343 + 5351 5344 exec_controls_clearbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING); 5352 5345 ++vcpu->stat.nmi_window_exits; 5353 5346 kvm_make_request(KVM_REQ_EVENT, vcpu); ··· 5907 5896 * below) should never happen as that means we incorrectly allowed a 5908 5897 * nested VM-Enter with an invalid vmcs12. 
5909 5898 */ 5910 - WARN_ON_ONCE(vmx->nested.nested_run_pending); 5899 + if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm)) 5900 + return -EIO; 5911 5901 5912 5902 /* If guest state is invalid, start emulating */ 5913 5903 if (vmx->emulation_required) ··· 6201 6189 } 6202 6190 secondary_exec_controls_set(vmx, sec_exec_control); 6203 6191 6204 - vmx_update_msr_bitmap(vcpu); 6192 + vmx_update_msr_bitmap_x2apic(vcpu); 6205 6193 } 6206 6194 6207 6195 static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu) ··· 6286 6274 int max_irr; 6287 6275 bool max_irr_updated; 6288 6276 6289 - WARN_ON(!vcpu->arch.apicv_active); 6277 + if (KVM_BUG_ON(!vcpu->arch.apicv_active, vcpu->kvm)) 6278 + return -EIO; 6279 + 6290 6280 if (pi_test_on(&vmx->pi_desc)) { 6291 6281 pi_clear_on(&vmx->pi_desc); 6292 6282 /* ··· 6371 6357 unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK; 6372 6358 gate_desc *desc = (gate_desc *)host_idt_base + vector; 6373 6359 6374 - if (WARN_ONCE(!is_external_intr(intr_info), 6360 + if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm, 6375 6361 "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info)) 6376 6362 return; 6377 6363 ··· 6381 6367 static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu) 6382 6368 { 6383 6369 struct vcpu_vmx *vmx = to_vmx(vcpu); 6370 + 6371 + if (vmx->emulation_required) 6372 + return; 6384 6373 6385 6374 if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT) 6386 6375 handle_external_interrupt_irqoff(vcpu); ··· 6656 6639 vmx->loaded_vmcs->host_state.cr4 = cr4; 6657 6640 } 6658 6641 6642 + /* When KVM_DEBUGREG_WONT_EXIT, dr6 is accessible in guest. */ 6643 + if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) 6644 + set_debugreg(vcpu->arch.dr6, 6); 6645 + 6659 6646 /* When single-stepping over STI and MOV SS, we must clear the 6660 6647 * corresponding interruptibility bits in the guest state. 
Otherwise 6661 6648 * vmentry fails as it then expects bit 14 (BS) in pending debug ··· 6859 6838 vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R); 6860 6839 vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R); 6861 6840 } 6862 - vmx->msr_bitmap_mode = 0; 6863 6841 6864 6842 vmx->loaded_vmcs = &vmx->vmcs01; 6865 6843 cpu = get_cpu(); ··· 7017 6997 return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat; 7018 6998 } 7019 6999 7020 - static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx) 7000 + static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl) 7021 7001 { 7022 7002 /* 7023 7003 * These bits in the secondary execution controls field ··· 7031 7011 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | 7032 7012 SECONDARY_EXEC_DESC; 7033 7013 7034 - u32 new_ctl = vmx->secondary_exec_control; 7035 7014 u32 cur_ctl = secondary_exec_controls_get(vmx); 7036 7015 7037 7016 secondary_exec_controls_set(vmx, (new_ctl & ~mask) | (cur_ctl & mask)); ··· 7173 7154 /* xsaves_enabled is recomputed in vmx_compute_secondary_exec_control(). */ 7174 7155 vcpu->arch.xsaves_enabled = false; 7175 7156 7176 - if (cpu_has_secondary_exec_ctrls()) { 7177 - vmx_compute_secondary_exec_control(vmx); 7178 - vmcs_set_secondary_exec_control(vmx); 7179 - } 7157 + vmx_setup_uret_msrs(vmx); 7158 + 7159 + if (cpu_has_secondary_exec_ctrls()) 7160 + vmcs_set_secondary_exec_control(vmx, 7161 + vmx_secondary_exec_control(vmx)); 7180 7162 7181 7163 if (nested_vmx_allowed(vcpu)) 7182 7164 to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |= ··· 7823 7803 ept_lpage_level = PG_LEVEL_2M; 7824 7804 else 7825 7805 ept_lpage_level = PG_LEVEL_4K; 7826 - kvm_configure_mmu(enable_ept, vmx_get_max_tdp_level(), ept_lpage_level); 7806 + kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(), 7807 + ept_lpage_level); 7827 7808 7828 7809 /* 7829 7810 * Only enable PML when hardware supports PML feature, and both EPT
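The reworked `vmx_update_msr_bitmap_x2apic()` in this file computes the x2APIC bitmap mode itself and returns early when it matches the cached value. A minimal standalone sketch of that "cache the mode, skip redundant rebuilds" pattern follows. All names here (`demo_bitmap_state`, `DEMO_MODE_*`, `demo_update_bitmap`) are hypothetical stand-ins, not kernel identifiers, and the rebuild counter only marks where the real bitmap update would go.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mode bits, standing in for the MSR_BITMAP_MODE_* flags. */
#define DEMO_MODE_X2APIC       0x1
#define DEMO_MODE_X2APIC_APICV 0x2

struct demo_bitmap_state {
    uint8_t cached_mode;   /* last mode the bitmap was built for */
    unsigned int rebuilds; /* how often the expensive path ran */
};

/* Derive the mode from the current (hypothetical) feature state. */
static uint8_t demo_compute_mode(bool x2apic_virt, bool apicv_active)
{
    uint8_t mode = 0;

    if (x2apic_virt) {
        mode = DEMO_MODE_X2APIC;
        if (apicv_active)
            mode |= DEMO_MODE_X2APIC_APICV;
    }
    return mode;
}

/* Rebuild only when the mode changed; returns true if a rebuild happened. */
static bool demo_update_bitmap(struct demo_bitmap_state *st,
                               bool x2apic_virt, bool apicv_active)
{
    uint8_t mode = demo_compute_mode(x2apic_virt, apicv_active);

    assert(st != NULL);
    if (mode == st->cached_mode)
        return false;

    st->cached_mode = mode;
    st->rebuilds++; /* placeholder for the real bitmap update */
    return true;
}
```

Keeping the comparison inside the update function means callers no longer need to track a separate "changed" mask, which appears to be the simplification the diff is after.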

+7 -31

arch/x86/kvm/vmx/vmx.h

···
227 227 struct vcpu_vmx {
228 228 struct kvm_vcpu vcpu;
229 229 u8 fail;
230 - u8 msr_bitmap_mode;
230 + u8 x2apic_msr_bitmap_mode;
231 231
232 232 /*
233 233 * If true, host state has been stored in vmx->loaded_vmcs for
···
262 262
263 263 u64 spec_ctrl;
264 264 u32 msr_ia32_umwait_control;
265 -
266 - u32 secondary_exec_control;
267 265
268 266 /*
269 267 * loaded_vmcs points to the VMCS currently used in this vcpu. For a
···
369 371 void set_cr4_guest_host_mask(struct vcpu_vmx *vmx);
370 372 void ept_save_pdptrs(struct kvm_vcpu *vcpu);
371 373 void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
372 - void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
374 + void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
373 375 u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
374 376
375 377 bool vmx_guest_inject_ac(struct kvm_vcpu *vcpu);
376 378 void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
377 - void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
378 379 bool vmx_nmi_blocked(struct kvm_vcpu *vcpu);
379 380 bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu);
380 381 bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
···
416 419 vmx->loaded_vmcs->controls_shadow.lname = val; \
417 420 } \
418 421 } \
422 + static inline u32 __##lname##_controls_get(struct loaded_vmcs *vmcs) \
423 + { \
424 + return vmcs->controls_shadow.lname; \
425 + } \
419 426 static inline u32 lname##_controls_get(struct vcpu_vmx *vmx) \
420 427 { \
421 - return vmx->loaded_vmcs->controls_shadow.lname; \
428 + return __##lname##_controls_get(vmx->loaded_vmcs); \
422 429 } \
423 430 static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u32 val) \
424 431 { \
···
451 450 | (1 << VCPU_EXREG_EXIT_INFO_2));
452 451 vcpu->arch.regs_dirty = 0;
453 452 }
454 -
455 - static inline u32 vmx_vmentry_ctrl(void)
456 - {
457 - u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
458 - if (vmx_pt_mode_is_system())
459 - vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
460 - VM_ENTRY_LOAD_IA32_RTIT_CTL);
461 - /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
462 - return vmentry_ctrl &
463 - ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_EFER);
464 - }
465 -
466 - static inline u32 vmx_vmexit_ctrl(void)
467 - {
468 - u32 vmexit_ctrl = vmcs_config.vmexit_ctrl;
469 - if (vmx_pt_mode_is_system())
470 - vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
471 - VM_EXIT_CLEAR_IA32_RTIT_CTL);
472 - /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
473 - return vmexit_ctrl &
474 - ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
475 - }
476 -
477 - u32 vmx_exec_control(struct vcpu_vmx *vmx);
478 - u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx);
479 453
480 454 static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
481 455 {
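The `__##lname##_controls_get` helper added in this header is generated by token pasting inside the accessor-building macro, with the existing vCPU-level getter forwarding to it. A minimal standalone sketch of that pattern follows. The macro name `BUILD_CONTROLS_SHADOW_DEMO` and the types `demo_vmcs` and `demo_vcpu` are hypothetical stand-ins for illustration only, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct loaded_vmcs and struct vcpu_vmx. */
struct demo_vmcs {
    struct {
        uint32_t pin;
        uint32_t exec;
    } controls_shadow;
};

struct demo_vcpu {
    struct demo_vmcs *loaded_vmcs;
};

/*
 * For each control name, generate a low-level getter that takes the VMCS
 * directly, plus a wrapper that takes the vCPU and forwards to it.
 */
#define BUILD_CONTROLS_SHADOW_DEMO(lname)                                  \
static inline uint32_t __##lname##_controls_get(struct demo_vmcs *vmcs)   \
{                                                                          \
    return vmcs->controls_shadow.lname;                                    \
}                                                                          \
static inline uint32_t lname##_controls_get(struct demo_vcpu *vcpu)       \
{                                                                          \
    assert(vcpu->loaded_vmcs != NULL);                                     \
    return __##lname##_controls_get(vcpu->loaded_vmcs);                    \
}

BUILD_CONTROLS_SHADOW_DEMO(pin)
BUILD_CONTROLS_SHADOW_DEMO(exec)
```

Splitting the getter this way lets code that holds only a VMCS pointer, and not the vCPU that currently has it loaded, read the shadowed value through the same generated accessor.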

+1 -3

arch/x86/kvm/vmx/vmx_ops.h

···
4 4
5 5 #include <linux/nospec.h>
6 6
7 - #include <asm/kvm_host.h>
8 7 #include <asm/vmx.h>
9 8
10 9 #include "evmcs.h"
11 10 #include "vmcs.h"
12 -
13 - #define __ex(x) __kvm_handle_fault_on_reboot(x)
11 + #include "x86.h"
14 12
15 13 asmlinkage void vmread_error(unsigned long field, bool fault);
16 14 __attribute__((regparm(0))) void vmread_error_trampoline(unsigned long field,

+118 -69

arch/x86/kvm/x86.c

··· 233 233 STATS_DESC_COUNTER(VM, mmu_recycled), 234 234 STATS_DESC_COUNTER(VM, mmu_cache_miss), 235 235 STATS_DESC_ICOUNTER(VM, mmu_unsync), 236 - STATS_DESC_ICOUNTER(VM, lpages), 236 + STATS_DESC_ICOUNTER(VM, pages_4k), 237 + STATS_DESC_ICOUNTER(VM, pages_2m), 238 + STATS_DESC_ICOUNTER(VM, pages_1g), 237 239 STATS_DESC_ICOUNTER(VM, nx_lpage_splits), 240 + STATS_DESC_PCOUNTER(VM, max_mmu_rmap_size), 238 241 STATS_DESC_PCOUNTER(VM, max_mmu_page_hash_collisions) 239 242 }; 240 - static_assert(ARRAY_SIZE(kvm_vm_stats_desc) == 241 - sizeof(struct kvm_vm_stat) / sizeof(u64)); 242 243 243 244 const struct kvm_stats_header kvm_vm_stats_header = { 244 245 .name_size = KVM_STATS_NAME_SIZE, ··· 279 278 STATS_DESC_COUNTER(VCPU, directed_yield_successful), 280 279 STATS_DESC_ICOUNTER(VCPU, guest_mode) 281 280 }; 282 - static_assert(ARRAY_SIZE(kvm_vcpu_stats_desc) == 283 - sizeof(struct kvm_vcpu_stat) / sizeof(u64)); 284 281 285 282 const struct kvm_stats_header kvm_vcpu_stats_header = { 286 283 .name_size = KVM_STATS_NAME_SIZE, ··· 484 485 } 485 486 EXPORT_SYMBOL_GPL(kvm_set_apic_base); 486 487 487 - asmlinkage __visible noinstr void kvm_spurious_fault(void) 488 + /* 489 + * Handle a fault on a hardware virtualization (VMX or SVM) instruction. 490 + * 491 + * Hardware virtualization extension instructions may fault if a reboot turns 492 + * off virtualization while processes are running. Usually after catching the 493 + * fault we just panic; during reboot instead the instruction is ignored. 494 + */ 495 + noinstr void kvm_spurious_fault(void) 488 496 { 489 497 /* Fault while not rebooting. We want the trace. 
*/ 490 498 BUG_ON(!kvm_rebooting); ··· 1186 1180 if (!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)) { 1187 1181 for (i = 0; i < KVM_NR_DB_REGS; i++) 1188 1182 vcpu->arch.eff_db[i] = vcpu->arch.db[i]; 1189 - vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD; 1190 1183 } 1191 1184 } 1192 1185 ··· 3321 3316 if (!msr_info->host_initiated) { 3322 3317 s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr; 3323 3318 adjust_tsc_offset_guest(vcpu, adj); 3319 + /* Before back to guest, tsc_timestamp must be adjusted 3320 + * as well, otherwise guest's percpu pvclock time could jump. 3321 + */ 3322 + kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); 3324 3323 } 3325 3324 vcpu->arch.ia32_tsc_adjust_msr = data; 3326 3325 } ··· 4319 4310 4320 4311 static_call(kvm_x86_vcpu_put)(vcpu); 4321 4312 vcpu->arch.last_host_tsc = rdtsc(); 4322 - /* 4323 - * If userspace has set any breakpoints or watchpoints, dr6 is restored 4324 - * on every vmexit, but if not, we might have a stale dr6 from the 4325 - * guest. do_debug expects dr6 to be cleared after it runs, do the same. 4326 - */ 4327 - set_debugreg(0, 6); 4328 4313 } 4329 4314 4330 4315 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, ··· 6570 6567 * there is no pkey in EPT page table for L1 guest or EPT 6571 6568 * shadow page table for L2 guest. 
6572 6569 */ 6573 - if (vcpu_match_mmio_gva(vcpu, gva) 6574 - && !permission_fault(vcpu, vcpu->arch.walk_mmu, 6575 - vcpu->arch.mmio_access, 0, access)) { 6570 + if (vcpu_match_mmio_gva(vcpu, gva) && (!is_paging(vcpu) || 6571 + !permission_fault(vcpu, vcpu->arch.walk_mmu, 6572 + vcpu->arch.mmio_access, 0, access))) { 6576 6573 *gpa = vcpu->arch.mmio_gfn << PAGE_SHIFT | 6577 6574 (gva & (PAGE_SIZE - 1)); 6578 6575 trace_vcpu_match_mmio(gva, *gpa, write, false); ··· 8581 8578 8582 8579 static void kvm_apicv_init(struct kvm *kvm) 8583 8580 { 8581 + mutex_init(&kvm->arch.apicv_update_lock); 8582 + 8584 8583 if (enable_apicv) 8585 8584 clear_bit(APICV_INHIBIT_REASON_DISABLE, 8586 8585 &kvm->arch.apicv_inhibit_reasons); ··· 8895 8890 kvm_inject_exception(vcpu); 8896 8891 can_inject = false; 8897 8892 } 8893 + 8894 + /* Don't inject interrupts if the user asked to avoid doing so */ 8895 + if (vcpu->guest_debug & KVM_GUESTDBG_BLOCKIRQ) 8896 + return 0; 8898 8897 8899 8898 /* 8900 8899 * Finally, inject interrupt events. If an event cannot be injected ··· 9245 9236 9246 9237 void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu) 9247 9238 { 9239 + bool activate; 9240 + 9248 9241 if (!lapic_in_kernel(vcpu)) 9249 9242 return; 9250 9243 9251 - vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm); 9244 + mutex_lock(&vcpu->kvm->arch.apicv_update_lock); 9245 + 9246 + activate = kvm_apicv_activated(vcpu->kvm); 9247 + if (vcpu->arch.apicv_active == activate) 9248 + goto out; 9249 + 9250 + vcpu->arch.apicv_active = activate; 9252 9251 kvm_apic_update_apicv(vcpu); 9253 9252 static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu); 9254 9253 ··· 9268 9251 */ 9269 9252 if (!vcpu->arch.apicv_active) 9270 9253 kvm_make_request(KVM_REQ_EVENT, vcpu); 9254 + 9255 + out: 9256 + mutex_unlock(&vcpu->kvm->arch.apicv_update_lock); 9271 9257 } 9272 9258 EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv); 9273 9259 9274 - /* 9275 - * NOTE: Do not hold any lock prior to calling this. 
9276 - * 9277 - * In particular, kvm_request_apicv_update() expects kvm->srcu not to be 9278 - * locked, because it calls __x86_set_memory_region() which does 9279 - * synchronize_srcu(&kvm->srcu). 9280 - */ 9281 - void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit) 9260 + void __kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit) 9282 9261 { 9283 - struct kvm_vcpu *except; 9284 - unsigned long old, new, expected; 9262 + unsigned long old, new; 9285 9263 9286 9264 if (!kvm_x86_ops.check_apicv_inhibit_reasons || 9287 9265 !static_call(kvm_x86_check_apicv_inhibit_reasons)(bit)) 9288 9266 return; 9289 9267 9290 - old = READ_ONCE(kvm->arch.apicv_inhibit_reasons); 9291 - do { 9292 - expected = new = old; 9293 - if (activate) 9294 - __clear_bit(bit, &new); 9295 - else 9296 - __set_bit(bit, &new); 9297 - if (new == old) 9298 - break; 9299 - old = cmpxchg(&kvm->arch.apicv_inhibit_reasons, expected, new); 9300 - } while (old != expected); 9268 + old = new = kvm->arch.apicv_inhibit_reasons; 9301 9269 9302 - if (!!old == !!new) 9303 - return; 9270 + if (activate) 9271 + __clear_bit(bit, &new); 9272 + else 9273 + __set_bit(bit, &new); 9304 9274 9305 - trace_kvm_apicv_update_request(activate, bit); 9306 - if (kvm_x86_ops.pre_update_apicv_exec_ctrl) 9307 - static_call(kvm_x86_pre_update_apicv_exec_ctrl)(kvm, activate); 9275 + if (!!old != !!new) { 9276 + trace_kvm_apicv_update_request(activate, bit); 9277 + kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE); 9278 + kvm->arch.apicv_inhibit_reasons = new; 9279 + if (new) { 9280 + unsigned long gfn = gpa_to_gfn(APIC_DEFAULT_PHYS_BASE); 9281 + kvm_zap_gfn_range(kvm, gfn, gfn+1); 9282 + } 9283 + } else 9284 + kvm->arch.apicv_inhibit_reasons = new; 9285 + } 9286 + EXPORT_SYMBOL_GPL(__kvm_request_apicv_update); 9308 9287 9309 - /* 9310 - * Sending request to update APICV for all other vcpus, 9311 - * while update the calling vcpu immediately instead of 9312 - * waiting for another #VMEXIT to 
handle the request. 9313 - */ 9314 - except = kvm_get_running_vcpu(); 9315 - kvm_make_all_cpus_request_except(kvm, KVM_REQ_APICV_UPDATE, 9316 - except); 9317 - if (except) 9318 - kvm_vcpu_update_apicv(except); 9288 + void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit) 9289 + { 9290 + mutex_lock(&kvm->arch.apicv_update_lock); 9291 + __kvm_request_apicv_update(kvm, activate, bit); 9292 + mutex_unlock(&kvm->arch.apicv_update_lock); 9319 9293 } 9320 9294 EXPORT_SYMBOL_GPL(kvm_request_apicv_update); 9321 9295 ··· 9403 9395 } 9404 9396 9405 9397 if (kvm_request_pending(vcpu)) { 9398 + if (kvm_check_request(KVM_REQ_VM_BUGGED, vcpu)) { 9399 + r = -EIO; 9400 + goto out; 9401 + } 9406 9402 if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) { 9407 9403 if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) { 9408 9404 r = 0; ··· 9620 9608 set_debugreg(vcpu->arch.eff_db[1], 1); 9621 9609 set_debugreg(vcpu->arch.eff_db[2], 2); 9622 9610 set_debugreg(vcpu->arch.eff_db[3], 3); 9623 - set_debugreg(vcpu->arch.dr6, 6); 9624 - vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD; 9625 9611 } else if (unlikely(hw_breakpoint_active())) { 9626 9612 set_debugreg(0, 7); 9627 9613 } ··· 9649 9639 static_call(kvm_x86_sync_dirty_debug_regs)(vcpu); 9650 9640 kvm_update_dr0123(vcpu); 9651 9641 kvm_update_dr7(vcpu); 9652 - vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD; 9653 9642 } 9654 9643 9655 9644 /* ··· 9985 9976 goto out; 9986 9977 } 9987 9978 9988 - if (kvm_run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) { 9979 + if ((kvm_run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) || 9980 + (kvm_run->kvm_dirty_regs & ~KVM_SYNC_X86_VALID_FIELDS)) { 9989 9981 r = -EINVAL; 9990 9982 goto out; 9991 9983 } ··· 10591 10581 10592 10582 static int sync_regs(struct kvm_vcpu *vcpu) 10593 10583 { 10594 - if (vcpu->run->kvm_dirty_regs & ~KVM_SYNC_X86_VALID_FIELDS) 10595 - return -EINVAL; 10596 - 10597 10584 if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) { 
10598 10585 __set_regs(vcpu, &vcpu->run->s.regs.regs); 10599 10586 vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS; ··· 10806 10799 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) 10807 10800 { 10808 10801 unsigned long old_cr0 = kvm_read_cr0(vcpu); 10802 + unsigned long new_cr0; 10803 + u32 eax, dummy; 10809 10804 10810 10805 kvm_lapic_reset(vcpu, init_event); 10811 10806 ··· 10874 10865 vcpu->arch.regs_avail = ~0; 10875 10866 vcpu->arch.regs_dirty = ~0; 10876 10867 10868 + /* 10869 + * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon) 10870 + * if no CPUID match is found. Note, it's impossible to get a match at 10871 + * RESET since KVM emulates RESET before exposing the vCPU to userspace, 10872 + * i.e. it'simpossible for kvm_cpuid() to find a valid entry on RESET. 10873 + * But, go through the motions in case that's ever remedied. 10874 + */ 10875 + eax = 1; 10876 + if (!kvm_cpuid(vcpu, &eax, &dummy, &dummy, &dummy, true)) 10877 + eax = 0x600; 10878 + kvm_rdx_write(vcpu, eax); 10879 + 10877 10880 vcpu->arch.ia32_xss = 0; 10878 10881 10879 10882 static_call(kvm_x86_vcpu_reset)(vcpu, init_event); 10883 + 10884 + kvm_set_rflags(vcpu, X86_EFLAGS_FIXED); 10885 + kvm_rip_write(vcpu, 0xfff0); 10886 + 10887 + /* 10888 + * CR0.CD/NW are set on RESET, preserved on INIT. Note, some versions 10889 + * of Intel's SDM list CD/NW as being set on INIT, but they contradict 10890 + * (or qualify) that with a footnote stating that CD/NW are preserved. 
10891 + */ 10892 + new_cr0 = X86_CR0_ET; 10893 + if (init_event) 10894 + new_cr0 |= (old_cr0 & (X86_CR0_NW | X86_CR0_CD)); 10895 + else 10896 + new_cr0 |= X86_CR0_NW | X86_CR0_CD; 10897 + 10898 + static_call(kvm_x86_set_cr0)(vcpu, new_cr0); 10899 + static_call(kvm_x86_set_cr4)(vcpu, 0); 10900 + static_call(kvm_x86_set_efer)(vcpu, 0); 10901 + static_call(kvm_x86_update_exception_bitmap)(vcpu); 10880 10902 10881 10903 /* 10882 10904 * Reset the MMU context if paging was enabled prior to INIT (which is ··· 10919 10879 */ 10920 10880 if (old_cr0 & X86_CR0_PG) 10921 10881 kvm_mmu_reset_context(vcpu); 10882 + 10883 + /* 10884 + * Intel's SDM states that all TLB entries are flushed on INIT. AMD's 10885 + * APM states the TLBs are untouched by INIT, but it also states that 10886 + * the TLBs are flushed on "External initialization of the processor." 10887 + * Flush the guest TLB regardless of vendor, there is no meaningful 10888 + * benefit in relying on the guest to flush the TLB immediately after 10889 + * INIT. A spurious TLB flush is benign and likely negligible from a 10890 + * performance perspective. 
10891 + */ 10892 + if (init_event) 10893 + kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu); 10922 10894 } 10895 + EXPORT_SYMBOL_GPL(kvm_vcpu_reset); 10923 10896 10924 10897 void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector) 10925 10898 { ··· 11176 11123 kvm_hv_init_vm(kvm); 11177 11124 kvm_page_track_init(kvm); 11178 11125 kvm_mmu_init_vm(kvm); 11126 + kvm_xen_init_vm(kvm); 11179 11127 11180 11128 return static_call(kvm_x86_vm_init)(kvm); 11181 11129 } ··· 11366 11312 11367 11313 for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) { 11368 11314 int level = i + 1; 11369 - int lpages = gfn_to_index(slot->base_gfn + npages - 1, 11370 - slot->base_gfn, level) + 1; 11315 + int lpages = __kvm_mmu_slot_lpages(slot, npages, level); 11371 11316 11372 11317 WARN_ON(slot->arch.rmap[i]); 11373 11318 ··· 11449 11396 int lpages; 11450 11397 int level = i + 1; 11451 11398 11452 - lpages = gfn_to_index(slot->base_gfn + npages - 1, 11453 - slot->base_gfn, level) + 1; 11399 + lpages = __kvm_mmu_slot_lpages(slot, npages, level); 11454 11400 11455 11401 linfo = kvcalloc(lpages, sizeof(*linfo), GFP_KERNEL_ACCOUNT); 11456 11402 if (!linfo) ··· 11533 11481 11534 11482 static void kvm_mmu_slot_apply_flags(struct kvm *kvm, 11535 11483 struct kvm_memory_slot *old, 11536 - struct kvm_memory_slot *new, 11484 + const struct kvm_memory_slot *new, 11537 11485 enum kvm_mr_change change) 11538 11486 { 11539 11487 bool log_dirty_pages = new->flags & KVM_MEM_LOG_DIRTY_PAGES; ··· 11613 11561 kvm_mmu_change_mmu_pages(kvm, 11614 11562 kvm_mmu_calculate_default_mmu_pages(kvm)); 11615 11563 11616 - /* 11617 - * FIXME: const-ify all uses of struct kvm_memory_slot. 11618 - */ 11619 - kvm_mmu_slot_apply_flags(kvm, old, (struct kvm_memory_slot *) new, change); 11564 + kvm_mmu_slot_apply_flags(kvm, old, new, change); 11620 11565 11621 11566 /* Free the arrays associated with the old memslot. */ 11622 11567 if (change == KVM_MR_MOVE)
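The CR0 handling moved into `kvm_vcpu_reset()` in this file always sets ET, sets CD/NW on RESET, and preserves the previous CD/NW on INIT. A small standalone sketch of that computation follows, assuming the architectural bit positions for ET (bit 4), NW (bit 29), and CD (bit 30). The `DEMO_CR0_*` macros and `demo_reset_cr0()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR0 bit positions; the DEMO_ names are illustrative. */
#define DEMO_CR0_ET (UINT64_C(1) << 4)
#define DEMO_CR0_NW (UINT64_C(1) << 29)
#define DEMO_CR0_CD (UINT64_C(1) << 30)

/*
 * Compute the CR0 value after RESET (init_event == false) or INIT
 * (init_event == true), given the CR0 value before the event.
 */
static uint64_t demo_reset_cr0(uint64_t old_cr0, bool init_event)
{
    uint64_t new_cr0 = DEMO_CR0_ET;

    if (init_event)
        new_cr0 |= old_cr0 & (DEMO_CR0_NW | DEMO_CR0_CD); /* preserved */
    else
        new_cr0 |= DEMO_CR0_NW | DEMO_CR0_CD;             /* forced on */

    assert(new_cr0 & DEMO_CR0_ET); /* ET is set in both cases */
    return new_cr0;
}
```

For example, a guest running with paging enabled and caches on (CD and NW clear) would come out of INIT with only ET set, while the same guest after RESET would have ET, NW, and CD set.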

+2

arch/x86/kvm/x86.h

···
8 8 #include "kvm_cache_regs.h"
9 9 #include "kvm_emulate.h"
10 10
11 + void kvm_spurious_fault(void);
12 +
11 13 static __always_inline void kvm_guest_enter_irqoff(void)
12 14 {
13 15 /*

+12 -11

arch/x86/kvm/xen.c

···
25 25 {
26 26 gpa_t gpa = gfn_to_gpa(gfn);
27 27 int wc_ofs, sec_hi_ofs;
28 - int ret;
28 + int ret = 0;
29 29 int idx = srcu_read_lock(&kvm->srcu);
30 30
31 - ret = kvm_gfn_to_hva_cache_init(kvm, &kvm->arch.xen.shinfo_cache,
32 - gpa, PAGE_SIZE);
33 - if (ret)
31 + if (kvm_is_error_hva(gfn_to_hva(kvm, gfn))) {
32 + ret = -EFAULT;
34 33 goto out;
35 -
36 - kvm->arch.xen.shinfo_set = true;
34 + }
35 + kvm->arch.xen.shinfo_gfn = gfn;
37 36
38 37 /* Paranoia checks on the 32-bit struct layout */
39 38 BUILD_BUG_ON(offsetof(struct compat_shared_info, wc) != 0x900);
···
244 245
245 246 case KVM_XEN_ATTR_TYPE_SHARED_INFO:
246 247 if (data->u.shared_info.gfn == GPA_INVALID) {
247 - kvm->arch.xen.shinfo_set = false;
248 + kvm->arch.xen.shinfo_gfn = GPA_INVALID;
248 249 r = 0;
249 250 break;
250 251 }
···
282 283 break;
283 284
284 285 case KVM_XEN_ATTR_TYPE_SHARED_INFO:
285 - if (kvm->arch.xen.shinfo_set)
286 - data->u.shared_info.gfn = gpa_to_gfn(kvm->arch.xen.shinfo_cache.gpa);
287 - else
288 - data->u.shared_info.gfn = GPA_INVALID;
286 + data->u.shared_info.gfn = gpa_to_gfn(kvm->arch.xen.shinfo_gfn);
289 287 r = 0;
290 288 break;
291 289
···
640 644
641 645 mutex_unlock(&kvm->lock);
642 646 return 0;
647 + }
648 +
649 + void kvm_xen_init_vm(struct kvm *kvm)
650 + {
651 + kvm->arch.xen.shinfo_gfn = GPA_INVALID;
643 652 }
644 653
645 654 void kvm_xen_destroy_vm(struct kvm *kvm)
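The hunks in this file drop the separate `shinfo_set` flag and instead store an invalid-GFN sentinel whenever no shared-info page is configured, with a new init hook establishing that state at VM creation. A minimal standalone sketch of the sentinel approach follows. The names `demo_xen` and `DEMO_GPA_INVALID` are hypothetical, and the all-ones sentinel value is an assumption made for this sketch rather than something taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel playing the role GPA_INVALID has in the diff. */
#define DEMO_GPA_INVALID (~(uint64_t)0)

struct demo_xen {
    uint64_t shinfo_gfn; /* DEMO_GPA_INVALID means "not configured" */
};

/* Init hook: a fresh VM starts out with no shared-info page. */
static void demo_xen_init(struct demo_xen *xen)
{
    assert(xen != NULL);
    xen->shinfo_gfn = DEMO_GPA_INVALID;
}

/* Storing the sentinel itself is how a caller clears the page. */
static void demo_xen_set_shinfo(struct demo_xen *xen, uint64_t gfn)
{
    assert(xen != NULL);
    xen->shinfo_gfn = gfn;
}

/* Validity is derived from the value, so no separate flag can go stale. */
static bool demo_xen_shinfo_valid(const struct demo_xen *xen)
{
    return xen->shinfo_gfn != DEMO_GPA_INVALID;
}
```

With a single field, the "is it set" question and the "where is it" question cannot disagree, which is why an explicit init step that writes the sentinel becomes necessary: a zero-initialized structure would otherwise look like a valid GFN of 0.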

+5

arch/x86/kvm/xen.h

···
21 21 int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
22 22 int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data);
23 23 int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc);
24 + void kvm_xen_init_vm(struct kvm *kvm);
24 25 void kvm_xen_destroy_vm(struct kvm *kvm);
25 26
26 27 static inline bool kvm_xen_msr_enabled(struct kvm *kvm)
···
49 48 static inline int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
50 49 {
51 50 return 1;
51 + }
52 +
53 + static inline void kvm_xen_init_vm(struct kvm *kvm)
54 + {
52 55 }
53 56
54 57 static inline void kvm_xen_destroy_vm(struct kvm *kvm)

+5 -1

include/linux/entry-kvm.h

··· 2 2 #ifndef __LINUX_ENTRYKVM_H 3 3 #define __LINUX_ENTRYKVM_H 4 4 5 - #include <linux/entry-common.h> 5 + #include <linux/static_call_types.h> 6 + #include <linux/tracehook.h> 7 + #include <linux/syscalls.h> 8 + #include <linux/seccomp.h> 9 + #include <linux/sched.h> 6 10 #include <linux/tick.h> 7 11 8 12 /* Transfer to guest mode work */

+193 -43

include/linux/kvm_host.h

··· 150 150 #define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 151 151 #define KVM_REQ_UNBLOCK 2 152 152 #define KVM_REQ_UNHALT 3 153 + #define KVM_REQ_VM_BUGGED (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) 153 154 #define KVM_REQUEST_ARCH_BASE 8 154 155 155 156 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \ ··· 158 157 (unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \ 159 158 }) 160 159 #define KVM_ARCH_REQ(nr) KVM_ARCH_REQ_FLAGS(nr, 0) 160 + 161 + bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req, 162 + struct kvm_vcpu *except, 163 + unsigned long *vcpu_bitmap, cpumask_var_t tmp); 164 + bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req); 165 + bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req, 166 + struct kvm_vcpu *except); 167 + bool kvm_make_cpus_request_mask(struct kvm *kvm, unsigned int req, 168 + unsigned long *vcpu_bitmap); 161 169 162 170 #define KVM_USERSPACE_IRQ_SOURCE_ID 0 163 171 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID 1 ··· 354 344 struct kvm_vcpu_stat stat; 355 345 char stats_id[KVM_STATS_NAME_SIZE]; 356 346 struct kvm_dirty_ring dirty_ring; 347 + 348 + /* 349 + * The index of the most recently used memslot by this vCPU. It's ok 350 + * if this becomes stale due to memslot changes since we always check 351 + * it is a valid slot. 352 + */ 353 + int last_used_slot; 357 354 }; 358 355 359 356 /* must be called with irqs disabled */ ··· 529 512 u64 generation; 530 513 /* The mapping table from slot id to the index in memslots[]. */ 531 514 short id_to_index[KVM_MEM_SLOTS_NUM]; 532 - atomic_t lru_slot; 515 + atomic_t last_used_slot; 533 516 int used_slots; 534 517 struct kvm_memory_slot memslots[]; 535 518 }; ··· 554 537 struct mm_struct *mm; /* userspace tied to this vm */ 555 538 struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM]; 556 539 struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; 540 + 541 + /* Used to wait for completion of MMU notifiers. 
*/ 542 + spinlock_t mn_invalidate_lock; 543 + unsigned long mn_active_invalidate_count; 544 + struct rcuwait mn_memslots_update_rcuwait; 557 545 558 546 /* 559 547 * created_vcpus is protected by kvm->lock, and is incremented ··· 618 596 pid_t userspace_pid; 619 597 unsigned int max_halt_poll_ns; 620 598 u32 dirty_ring_size; 599 + bool vm_bugged; 621 600 622 601 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER 623 602 struct notifier_block pm_notifier; ··· 651 628 ## __VA_ARGS__) 652 629 #define vcpu_err(vcpu, fmt, ...) \ 653 630 kvm_err("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__) 631 + 632 + static inline void kvm_vm_bugged(struct kvm *kvm) 633 + { 634 + kvm->vm_bugged = true; 635 + kvm_make_all_cpus_request(kvm, KVM_REQ_VM_BUGGED); 636 + } 637 + 638 + #define KVM_BUG(cond, kvm, fmt...) \ 639 + ({ \ 640 + int __ret = (cond); \ 641 + \ 642 + if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt)) \ 643 + kvm_vm_bugged(kvm); \ 644 + unlikely(__ret); \ 645 + }) 646 + 647 + #define KVM_BUG_ON(cond, kvm) \ 648 + ({ \ 649 + int __ret = (cond); \ 650 + \ 651 + if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \ 652 + kvm_vm_bugged(kvm); \ 653 + unlikely(__ret); \ 654 + }) 654 655 655 656 static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm) 656 657 { ··· 767 720 void kvm_exit(void); 768 721 769 722 void kvm_get_kvm(struct kvm *kvm); 723 + bool kvm_get_kvm_safe(struct kvm *kvm); 770 724 void kvm_put_kvm(struct kvm *kvm); 771 725 bool file_is_kvm(struct file *file); 772 726 void kvm_put_kvm_no_destroy(struct kvm *kvm); ··· 872 824 void kvm_release_pfn_dirty(kvm_pfn_t pfn); 873 825 void kvm_set_pfn_dirty(kvm_pfn_t pfn); 874 826 void kvm_set_pfn_accessed(kvm_pfn_t pfn); 875 - void kvm_get_pfn(kvm_pfn_t pfn); 876 827 877 828 void kvm_release_pfn(kvm_pfn_t pfn, bool dirty, struct gfn_to_pfn_cache *cache); 878 829 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, ··· 990 943 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); 991 944 
#endif 992 945 993 - bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req, 994 - struct kvm_vcpu *except, 995 - unsigned long *vcpu_bitmap, cpumask_var_t tmp); 996 - bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req); 997 - bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req, 998 - struct kvm_vcpu *except); 999 - bool kvm_make_cpus_request_mask(struct kvm *kvm, unsigned int req, 1000 - unsigned long *vcpu_bitmap); 946 + void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start, 947 + unsigned long end); 948 + void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start, 949 + unsigned long end); 1001 950 1002 951 long kvm_arch_dev_ioctl(struct file *filp, 1003 952 unsigned int ioctl, unsigned long arg); ··· 1077 1034 bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); 1078 1035 int kvm_arch_post_init_vm(struct kvm *kvm); 1079 1036 void kvm_arch_pre_destroy_vm(struct kvm *kvm); 1037 + int kvm_arch_create_vm_debugfs(struct kvm *kvm); 1080 1038 1081 1039 #ifndef __KVM_HAVE_ARCH_VM_ALLOC 1082 1040 /* ··· 1201 1157 bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args); 1202 1158 1203 1159 /* 1204 - * search_memslots() and __gfn_to_memslot() are here because they are 1205 - * used in non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c. 1206 - * gfn_to_memslot() itself isn't here as an inline because that would 1207 - * bloat other code too much. 1160 + * Returns a pointer to the memslot at slot_index if it contains gfn. 1161 + * Otherwise returns NULL. 1162 + */ 1163 + static inline struct kvm_memory_slot * 1164 + try_get_memslot(struct kvm_memslots *slots, int slot_index, gfn_t gfn) 1165 + { 1166 + struct kvm_memory_slot *slot; 1167 + 1168 + if (slot_index < 0 || slot_index >= slots->used_slots) 1169 + return NULL; 1170 + 1171 + /* 1172 + * slot_index can come from vcpu->last_used_slot which is not kept 1173 + * in sync with userspace-controllable memslot deletion. 
So use nospec 1174 + * to prevent the CPU from speculating past the end of memslots[]. 1175 + */ 1176 + slot_index = array_index_nospec(slot_index, slots->used_slots); 1177 + slot = &slots->memslots[slot_index]; 1178 + 1179 + if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages) 1180 + return slot; 1181 + else 1182 + return NULL; 1183 + } 1184 + 1185 + /* 1186 + * Returns a pointer to the memslot that contains gfn and records the index of 1187 + * the slot in index. Otherwise returns NULL. 1208 1188 * 1209 1189 * IMPORTANT: Slots are sorted from highest GFN to lowest GFN! 1210 1190 */ 1211 1191 static inline struct kvm_memory_slot * 1212 - search_memslots(struct kvm_memslots *slots, gfn_t gfn) 1192 + search_memslots(struct kvm_memslots *slots, gfn_t gfn, int *index) 1213 1193 { 1214 1194 int start = 0, end = slots->used_slots; 1215 - int slot = atomic_read(&slots->lru_slot); 1216 1195 struct kvm_memory_slot *memslots = slots->memslots; 1196 + struct kvm_memory_slot *slot; 1217 1197 1218 1198 if (unlikely(!slots->used_slots)) 1219 1199 return NULL; 1220 1200 1221 - if (gfn >= memslots[slot].base_gfn && 1222 - gfn < memslots[slot].base_gfn + memslots[slot].npages) 1223 - return &memslots[slot]; 1224 - 1225 1201 while (start < end) { 1226 - slot = start + (end - start) / 2; 1202 + int slot = start + (end - start) / 2; 1227 1203 1228 1204 if (gfn >= memslots[slot].base_gfn) 1229 1205 end = slot; ··· 1251 1187 start = slot + 1; 1252 1188 } 1253 1189 1254 - if (start < slots->used_slots && gfn >= memslots[start].base_gfn && 1255 - gfn < memslots[start].base_gfn + memslots[start].npages) { 1256 - atomic_set(&slots->lru_slot, start); 1257 - return &memslots[start]; 1190 + slot = try_get_memslot(slots, start, gfn); 1191 + if (slot) { 1192 + *index = start; 1193 + return slot; 1258 1194 } 1259 1195 1260 1196 return NULL; 1261 1197 } 1262 1198 1199 + /* 1200 + * __gfn_to_memslot() and its descendants are here because it is called from 1201 + * non-modular code in 
arch/powerpc/kvm/book3s_64_vio{,_hv}.c. gfn_to_memslot() 1202 + * itself isn't here as an inline because that would bloat other code too much. 1203 + */ 1263 1204 static inline struct kvm_memory_slot * 1264 1205 __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn) 1265 1206 { 1266 - return search_memslots(slots, gfn); 1207 + struct kvm_memory_slot *slot; 1208 + int slot_index = atomic_read(&slots->last_used_slot); 1209 + 1210 + slot = try_get_memslot(slots, slot_index, gfn); 1211 + if (slot) 1212 + return slot; 1213 + 1214 + slot = search_memslots(slots, gfn, &slot_index); 1215 + if (slot) { 1216 + atomic_set(&slots->last_used_slot, slot_index); 1217 + return slot; 1218 + } 1219 + 1220 + return NULL; 1267 1221 } 1268 1222 1269 1223 static inline unsigned long ··· 1355 1273 char name[KVM_STATS_NAME_SIZE]; 1356 1274 }; 1357 1275 1358 - #define STATS_DESC_COMMON(type, unit, base, exp) \ 1276 + #define STATS_DESC_COMMON(type, unit, base, exp, sz, bsz) \ 1359 1277 .flags = type | unit | base | \ 1360 1278 BUILD_BUG_ON_ZERO(type & ~KVM_STATS_TYPE_MASK) | \ 1361 1279 BUILD_BUG_ON_ZERO(unit & ~KVM_STATS_UNIT_MASK) | \ 1362 1280 BUILD_BUG_ON_ZERO(base & ~KVM_STATS_BASE_MASK), \ 1363 1281 .exponent = exp, \ 1364 - .size = 1 1282 + .size = sz, \ 1283 + .bucket_size = bsz 1365 1284 1366 - #define VM_GENERIC_STATS_DESC(stat, type, unit, base, exp) \ 1285 + #define VM_GENERIC_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \ 1367 1286 { \ 1368 1287 { \ 1369 - STATS_DESC_COMMON(type, unit, base, exp), \ 1288 + STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \ 1370 1289 .offset = offsetof(struct kvm_vm_stat, generic.stat) \ 1371 1290 }, \ 1372 1291 .name = #stat, \ 1373 1292 } 1374 - #define VCPU_GENERIC_STATS_DESC(stat, type, unit, base, exp) \ 1293 + #define VCPU_GENERIC_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \ 1375 1294 { \ 1376 1295 { \ 1377 - STATS_DESC_COMMON(type, unit, base, exp), \ 1296 + STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \ 1378 1297 
.offset = offsetof(struct kvm_vcpu_stat, generic.stat) \ 1379 1298 }, \ 1380 1299 .name = #stat, \ 1381 1300 } 1382 - #define VM_STATS_DESC(stat, type, unit, base, exp) \ 1301 + #define VM_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \ 1383 1302 { \ 1384 1303 { \ 1385 - STATS_DESC_COMMON(type, unit, base, exp), \ 1304 + STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \ 1386 1305 .offset = offsetof(struct kvm_vm_stat, stat) \ 1387 1306 }, \ 1388 1307 .name = #stat, \ 1389 1308 } 1390 - #define VCPU_STATS_DESC(stat, type, unit, base, exp) \ 1309 + #define VCPU_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \ 1391 1310 { \ 1392 1311 { \ 1393 - STATS_DESC_COMMON(type, unit, base, exp), \ 1312 + STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \ 1394 1313 .offset = offsetof(struct kvm_vcpu_stat, stat) \ 1395 1314 }, \ 1396 1315 .name = #stat, \ 1397 1316 } 1398 1317 /* SCOPE: VM, VM_GENERIC, VCPU, VCPU_GENERIC */ 1399 - #define STATS_DESC(SCOPE, stat, type, unit, base, exp) \ 1400 - SCOPE##_STATS_DESC(stat, type, unit, base, exp) 1318 + #define STATS_DESC(SCOPE, stat, type, unit, base, exp, sz, bsz) \ 1319 + SCOPE##_STATS_DESC(stat, type, unit, base, exp, sz, bsz) 1401 1320 1402 1321 #define STATS_DESC_CUMULATIVE(SCOPE, name, unit, base, exponent) \ 1403 - STATS_DESC(SCOPE, name, KVM_STATS_TYPE_CUMULATIVE, unit, base, exponent) 1322 + STATS_DESC(SCOPE, name, KVM_STATS_TYPE_CUMULATIVE, \ 1323 + unit, base, exponent, 1, 0) 1404 1324 #define STATS_DESC_INSTANT(SCOPE, name, unit, base, exponent) \ 1405 - STATS_DESC(SCOPE, name, KVM_STATS_TYPE_INSTANT, unit, base, exponent) 1325 + STATS_DESC(SCOPE, name, KVM_STATS_TYPE_INSTANT, \ 1326 + unit, base, exponent, 1, 0) 1406 1327 #define STATS_DESC_PEAK(SCOPE, name, unit, base, exponent) \ 1407 - STATS_DESC(SCOPE, name, KVM_STATS_TYPE_PEAK, unit, base, exponent) 1328 + STATS_DESC(SCOPE, name, KVM_STATS_TYPE_PEAK, \ 1329 + unit, base, exponent, 1, 0) 1330 + #define STATS_DESC_LINEAR_HIST(SCOPE, name, unit, base, 
exponent, sz, bsz) \ 1331 + STATS_DESC(SCOPE, name, KVM_STATS_TYPE_LINEAR_HIST, \ 1332 + unit, base, exponent, sz, bsz) 1333 + #define STATS_DESC_LOG_HIST(SCOPE, name, unit, base, exponent, sz) \ 1334 + STATS_DESC(SCOPE, name, KVM_STATS_TYPE_LOG_HIST, \ 1335 + unit, base, exponent, sz, 0) 1408 1336 1409 1337 /* Cumulative counter, read/write */ 1410 1338 #define STATS_DESC_COUNTER(SCOPE, name) \ ··· 1433 1341 #define STATS_DESC_TIME_NSEC(SCOPE, name) \ 1434 1342 STATS_DESC_CUMULATIVE(SCOPE, name, KVM_STATS_UNIT_SECONDS, \ 1435 1343 KVM_STATS_BASE_POW10, -9) 1344 + /* Linear histogram for time in nanosecond */ 1345 + #define STATS_DESC_LINHIST_TIME_NSEC(SCOPE, name, sz, bsz) \ 1346 + STATS_DESC_LINEAR_HIST(SCOPE, name, KVM_STATS_UNIT_SECONDS, \ 1347 + KVM_STATS_BASE_POW10, -9, sz, bsz) 1348 + /* Logarithmic histogram for time in nanosecond */ 1349 + #define STATS_DESC_LOGHIST_TIME_NSEC(SCOPE, name, sz) \ 1350 + STATS_DESC_LOG_HIST(SCOPE, name, KVM_STATS_UNIT_SECONDS, \ 1351 + KVM_STATS_BASE_POW10, -9, sz) 1436 1352 1437 1353 #define KVM_GENERIC_VM_STATS() \ 1438 - STATS_DESC_COUNTER(VM_GENERIC, remote_tlb_flush) 1354 + STATS_DESC_COUNTER(VM_GENERIC, remote_tlb_flush), \ 1355 + STATS_DESC_COUNTER(VM_GENERIC, remote_tlb_flush_requests) 1439 1356 1440 1357 #define KVM_GENERIC_VCPU_STATS() \ 1441 1358 STATS_DESC_COUNTER(VCPU_GENERIC, halt_successful_poll), \ ··· 1452 1351 STATS_DESC_COUNTER(VCPU_GENERIC, halt_poll_invalid), \ 1453 1352 STATS_DESC_COUNTER(VCPU_GENERIC, halt_wakeup), \ 1454 1353 STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_poll_success_ns), \ 1455 - STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_poll_fail_ns) 1354 + STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_poll_fail_ns), \ 1355 + STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_wait_ns), \ 1356 + STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_poll_success_hist, \ 1357 + HALT_POLL_HIST_COUNT), \ 1358 + STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_poll_fail_hist, \ 1359 + HALT_POLL_HIST_COUNT), \ 1360 + 
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_wait_hist, \ 1361 + HALT_POLL_HIST_COUNT) 1456 1362 1457 1363 extern struct dentry *kvm_debugfs_dir; 1364 + 1458 1365 ssize_t kvm_stats_read(char *id, const struct kvm_stats_header *header, 1459 1366 const struct _kvm_stats_desc *desc, 1460 1367 void *stats, size_t size_stats, 1461 1368 char __user *user_buffer, size_t size, loff_t *offset); 1369 + 1370 + /** 1371 + * kvm_stats_linear_hist_update() - Update bucket value for linear histogram 1372 + * statistics data. 1373 + * 1374 + * @data: start address of the stats data 1375 + * @size: the number of bucket of the stats data 1376 + * @value: the new value used to update the linear histogram's bucket 1377 + * @bucket_size: the size (width) of a bucket 1378 + */ 1379 + static inline void kvm_stats_linear_hist_update(u64 *data, size_t size, 1380 + u64 value, size_t bucket_size) 1381 + { 1382 + size_t index = div64_u64(value, bucket_size); 1383 + 1384 + index = min(index, size - 1); 1385 + ++data[index]; 1386 + } 1387 + 1388 + /** 1389 + * kvm_stats_log_hist_update() - Update bucket value for logarithmic histogram 1390 + * statistics data. 
1391 + * 1392 + * @data: start address of the stats data 1393 + * @size: the number of bucket of the stats data 1394 + * @value: the new value used to update the logarithmic histogram's bucket 1395 + */ 1396 + static inline void kvm_stats_log_hist_update(u64 *data, size_t size, u64 value) 1397 + { 1398 + size_t index = fls64(value); 1399 + 1400 + index = min(index, size - 1); 1401 + ++data[index]; 1402 + } 1403 + 1404 + #define KVM_STATS_LINEAR_HIST_UPDATE(array, value, bsize) \ 1405 + kvm_stats_linear_hist_update(array, ARRAY_SIZE(array), value, bsize) 1406 + #define KVM_STATS_LOG_HIST_UPDATE(array, value) \ 1407 + kvm_stats_log_hist_update(array, ARRAY_SIZE(array), value) 1408 + 1409 + 1462 1410 extern const struct kvm_stats_header kvm_vm_stats_header; 1463 1411 extern const struct _kvm_stats_desc kvm_vm_stats_desc[]; 1464 1412 extern const struct kvm_stats_header kvm_vcpu_stats_header;
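The reworked lookup in this hunk tries a cached index first and falls back to a binary search over slots sorted from highest base GFN to lowest. A minimal userspace sketch of that control flow follows. `struct slot`, `try_get_slot`, `search_slots` and `lookup` are hypothetical stand-ins, not kernel symbols, and the `array_index_nospec()` hardening from the real code is left out because it has no userspace equivalent here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct kvm_memory_slot. */
struct slot {
	uint64_t base_gfn;
	uint64_t npages;
};

/* Return the slot at idx if it contains gfn, otherwise NULL. */
static const struct slot *try_get_slot(const struct slot *slots, int used,
				       int idx, uint64_t gfn)
{
	if (idx < 0 || idx >= used)
		return NULL;
	if (gfn >= slots[idx].base_gfn &&
	    gfn < slots[idx].base_gfn + slots[idx].npages)
		return &slots[idx];
	return NULL;
}

/*
 * Binary search over slots sorted from highest base_gfn to lowest.
 * On a hit, the slot index is stored in *index; on a miss *index is
 * left untouched.
 */
static const struct slot *search_slots(const struct slot *slots, int used,
				       uint64_t gfn, int *index)
{
	int start = 0, end = used;
	const struct slot *s;

	while (start < end) {
		int mid = start + (end - start) / 2;

		if (gfn >= slots[mid].base_gfn)
			end = mid;
		else
			start = mid + 1;
	}

	s = try_get_slot(slots, used, start, gfn);
	if (s)
		*index = start;
	return s;
}

/* Try the cached index first, then fall back to the search. */
static const struct slot *lookup(const struct slot *slots, int used,
				 uint64_t gfn, int *last_used)
{
	const struct slot *s = try_get_slot(slots, used, *last_used, gfn);

	if (s)
		return s;
	return search_slots(slots, used, gfn, last_used);
}
```

Because a stale cached index is range-checked and then verified against the GFN before use, it can only cost an extra search, which is the property the `last_used_slot` comment in the hunk relies on.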

+7

include/linux/kvm_types.h

··· 76 76 }; 77 77 #endif 78 78 79 + #define HALT_POLL_HIST_COUNT 32 80 + 79 81 struct kvm_vm_stat_generic { 80 82 u64 remote_tlb_flush; 83 + u64 remote_tlb_flush_requests; 81 84 }; 82 85 83 86 struct kvm_vcpu_stat_generic { ··· 90 87 u64 halt_wakeup; 91 88 u64 halt_poll_success_ns; 92 89 u64 halt_poll_fail_ns; 90 + u64 halt_wait_ns; 91 + u64 halt_poll_success_hist[HALT_POLL_HIST_COUNT]; 92 + u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT]; 93 + u64 halt_wait_hist[HALT_POLL_HIST_COUNT]; 93 94 }; 94 95 95 96 #define KVM_STATS_NAME_SIZE 48
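The histogram arrays added here are filled by `kvm_stats_linear_hist_update()` and `kvm_stats_log_hist_update()` from kvm_host.h. As a rough illustration of which bucket a sample lands in, here is a userspace sketch of the index computation only. `fls64_sketch`, `linear_hist_index` and `log_hist_index` are hypothetical names, and the loop-based bit scan stands in for the kernel's `fls64()`.

```c
#include <stddef.h>
#include <stdint.h>

/* 1-based position of the most significant set bit; 0 when x == 0. */
static size_t fls64_sketch(uint64_t x)
{
	size_t n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Linear histogram: fixed-width buckets, overflow goes to the last one. */
static size_t linear_hist_index(size_t size, uint64_t value,
				uint64_t bucket_size)
{
	uint64_t index = value / bucket_size;

	return index < size - 1 ? (size_t)index : size - 1;
}

/*
 * Logarithmic histogram: bucket 0 holds the value 0, bucket n (n >= 1)
 * holds values in [2^(n-1), 2^n), and overflow goes to the last bucket.
 */
static size_t log_hist_index(size_t size, uint64_t value)
{
	size_t index = fls64_sketch(value);

	return index < size - 1 ? index : size - 1;
}
```

With `HALT_POLL_HIST_COUNT` set to 32 and values in nanoseconds, a 1000 ns sample falls in bucket 10, and anything at or above 2^30 ns is clamped into bucket 31.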

-37

include/linux/page-flags.h

··· 633 633 } 634 634 635 635 /* 636 - * PageTransCompoundMap is the same as PageTransCompound, but it also 637 - * guarantees the primary MMU has the entire compound page mapped 638 - * through pmd_trans_huge, which in turn guarantees the secondary MMUs 639 - * can also map the entire compound page. This allows the secondary 640 - * MMUs to call get_user_pages() only once for each compound page and 641 - * to immediately map the entire compound page with a single secondary 642 - * MMU fault. If there will be a pmd split later, the secondary MMUs 643 - * will get an update through the MMU notifier invalidation through 644 - * split_huge_pmd(). 645 - * 646 - * Unlike PageTransCompound, this is safe to be called only while 647 - * split_huge_pmd() cannot run from under us, like if protected by the 648 - * MMU notifier, otherwise it may result in page->_mapcount check false 649 - * positives. 650 - * 651 - * We have to treat page cache THP differently since every subpage of it 652 - * would get _mapcount inc'ed once it is PMD mapped. But, it may be PTE 653 - * mapped in the current process so comparing subpage's _mapcount to 654 - * compound_mapcount to filter out PTE mapped case. 655 - */ 656 - static inline int PageTransCompoundMap(struct page *page) 657 - { 658 - struct page *head; 659 - 660 - if (!PageTransCompound(page)) 661 - return 0; 662 - 663 - if (PageAnon(page)) 664 - return atomic_read(&page->_mapcount) < 0; 665 - 666 - head = compound_head(page); 667 - /* File THP is PMD mapped and not PTE mapped */ 668 - return atomic_read(&page->_mapcount) == 669 - atomic_read(compound_mapcount_ptr(head)); 670 - } 671 - 672 - /* 673 636 * PageTransTail returns true for both transparent huge pages 674 637 * and hugetlbfs pages, so it should only be called when it's known 675 638 * that hugetlbfs pages aren't involved.

+7 -4

include/uapi/linux/kvm.h

··· 1965 1965 #define KVM_STATS_TYPE_CUMULATIVE (0x0 << KVM_STATS_TYPE_SHIFT) 1966 1966 #define KVM_STATS_TYPE_INSTANT (0x1 << KVM_STATS_TYPE_SHIFT) 1967 1967 #define KVM_STATS_TYPE_PEAK (0x2 << KVM_STATS_TYPE_SHIFT) 1968 - #define KVM_STATS_TYPE_MAX KVM_STATS_TYPE_PEAK 1968 + #define KVM_STATS_TYPE_LINEAR_HIST (0x3 << KVM_STATS_TYPE_SHIFT) 1969 + #define KVM_STATS_TYPE_LOG_HIST (0x4 << KVM_STATS_TYPE_SHIFT) 1970 + #define KVM_STATS_TYPE_MAX KVM_STATS_TYPE_LOG_HIST 1969 1971 1970 1972 #define KVM_STATS_UNIT_SHIFT 4 1971 1973 #define KVM_STATS_UNIT_MASK (0xF << KVM_STATS_UNIT_SHIFT) ··· 1990 1988 * @size: The number of data items for this stats. 1991 1989 * Every data item is of type __u64. 1992 1990 * @offset: The offset of the stats to the start of stat structure in 1993 - * struture kvm or kvm_vcpu. 1994 - * @unused: Unused field for future usage. Always 0 for now. 1991 + * structure kvm or kvm_vcpu. 1992 + * @bucket_size: A parameter value used for histogram stats. It is only used 1993 + * for linear histogram stats, specifying the size of the bucket; 1995 1994 * @name: The name string for the stats. Its size is indicated by the 1996 1995 * &kvm_stats_header->name_size. 1997 1996 */ ··· 2001 1998 __s16 exponent; 2002 1999 __u16 size; 2003 2000 __u32 offset; 2004 - __u32 unused; 2001 + __u32 bucket_size; 2005 2002 char name[]; 2006 2003 }; 2007 2004
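For readers consuming the binary stats descriptors from userspace, the sketch below decodes the three 4-bit fields of `flags` and applies the `bucket_size` rule that the selftest in this merge checks (non-zero for linear histograms, zero otherwise). The `SK_` constants are local copies of the values shown in the diff, with the type shift of 0 taken from the documentation's "bits 0-3"; `stats_type`, `stats_unit`, `stats_base` and `bucket_size_valid` are hypothetical helper names, not part of the uAPI.

```c
#include <stdint.h>

/* Local copies of the uAPI layout; the SK_ names are illustrative only. */
#define SK_TYPE_SHIFT	0
#define SK_UNIT_SHIFT	4
#define SK_BASE_SHIFT	8
#define SK_FIELD_MASK	0xFu

#define SK_TYPE_LINEAR_HIST	0x3u
#define SK_TYPE_LOG_HIST	0x4u

static uint32_t stats_type(uint32_t flags)
{
	return (flags >> SK_TYPE_SHIFT) & SK_FIELD_MASK;
}

static uint32_t stats_unit(uint32_t flags)
{
	return (flags >> SK_UNIT_SHIFT) & SK_FIELD_MASK;
}

static uint32_t stats_base(uint32_t flags)
{
	return (flags >> SK_BASE_SHIFT) & SK_FIELD_MASK;
}

/*
 * bucket_size is only meaningful for linear histograms: it must be
 * non-zero there and zero for every other type.
 */
static int bucket_size_valid(uint32_t flags, uint32_t bucket_size)
{
	if (stats_type(flags) == SK_TYPE_LINEAR_HIST)
		return bucket_size != 0;
	return bucket_size == 0;
}
```

A logarithmic histogram needs no width parameter because its bucket boundaries are powers of two, which is why only the linear type uses the former `unused` field.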

+1

tools/testing/selftests/kvm/.gitignore

··· 1 1 # SPDX-License-Identifier: GPL-2.0-only 2 2 /aarch64/debug-exceptions 3 3 /aarch64/get-reg-list 4 + /aarch64/psci_cpu_on_test 4 5 /aarch64/vgic_init 5 6 /s390x/memop 6 7 /s390x/resets

+1

tools/testing/selftests/kvm/Makefile

··· 86 86 87 87 TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions 88 88 TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list 89 + TEST_GEN_PROGS_aarch64 += aarch64/psci_cpu_on_test 89 90 TEST_GEN_PROGS_aarch64 += aarch64/vgic_init 90 91 TEST_GEN_PROGS_aarch64 += demand_paging_test 91 92 TEST_GEN_PROGS_aarch64 += dirty_log_test

+121

tools/testing/selftests/kvm/aarch64/psci_cpu_on_test.c

··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * psci_cpu_on_test - Test that the observable state of a vCPU targeted by the 4 + * CPU_ON PSCI call matches what the caller requested. 5 + * 6 + * Copyright (c) 2021 Google LLC. 7 + * 8 + * This is a regression test for a race between KVM servicing the PSCI call and 9 + * userspace reading the vCPUs registers. 10 + */ 11 + 12 + #define _GNU_SOURCE 13 + 14 + #include <linux/psci.h> 15 + 16 + #include "kvm_util.h" 17 + #include "processor.h" 18 + #include "test_util.h" 19 + 20 + #define VCPU_ID_SOURCE 0 21 + #define VCPU_ID_TARGET 1 22 + 23 + #define CPU_ON_ENTRY_ADDR 0xfeedf00dul 24 + #define CPU_ON_CONTEXT_ID 0xdeadc0deul 25 + 26 + static uint64_t psci_cpu_on(uint64_t target_cpu, uint64_t entry_addr, 27 + uint64_t context_id) 28 + { 29 + register uint64_t x0 asm("x0") = PSCI_0_2_FN64_CPU_ON; 30 + register uint64_t x1 asm("x1") = target_cpu; 31 + register uint64_t x2 asm("x2") = entry_addr; 32 + register uint64_t x3 asm("x3") = context_id; 33 + 34 + asm("hvc #0" 35 + : "=r"(x0) 36 + : "r"(x0), "r"(x1), "r"(x2), "r"(x3) 37 + : "memory"); 38 + 39 + return x0; 40 + } 41 + 42 + static uint64_t psci_affinity_info(uint64_t target_affinity, 43 + uint64_t lowest_affinity_level) 44 + { 45 + register uint64_t x0 asm("x0") = PSCI_0_2_FN64_AFFINITY_INFO; 46 + register uint64_t x1 asm("x1") = target_affinity; 47 + register uint64_t x2 asm("x2") = lowest_affinity_level; 48 + 49 + asm("hvc #0" 50 + : "=r"(x0) 51 + : "r"(x0), "r"(x1), "r"(x2) 52 + : "memory"); 53 + 54 + return x0; 55 + } 56 + 57 + static void guest_main(uint64_t target_cpu) 58 + { 59 + GUEST_ASSERT(!psci_cpu_on(target_cpu, CPU_ON_ENTRY_ADDR, CPU_ON_CONTEXT_ID)); 60 + uint64_t target_state; 61 + 62 + do { 63 + target_state = psci_affinity_info(target_cpu, 0); 64 + 65 + GUEST_ASSERT((target_state == PSCI_0_2_AFFINITY_LEVEL_ON) || 66 + (target_state == PSCI_0_2_AFFINITY_LEVEL_OFF)); 67 + } while (target_state != PSCI_0_2_AFFINITY_LEVEL_ON); 68 + 69 + 
GUEST_DONE(); 70 + } 71 + 72 + int main(void) 73 + { 74 + uint64_t target_mpidr, obs_pc, obs_x0; 75 + struct kvm_vcpu_init init; 76 + struct kvm_vm *vm; 77 + struct ucall uc; 78 + 79 + vm = vm_create(VM_MODE_DEFAULT, DEFAULT_GUEST_PHY_PAGES, O_RDWR); 80 + kvm_vm_elf_load(vm, program_invocation_name); 81 + ucall_init(vm, NULL); 82 + 83 + vm_ioctl(vm, KVM_ARM_PREFERRED_TARGET, &init); 84 + init.features[0] |= (1 << KVM_ARM_VCPU_PSCI_0_2); 85 + 86 + aarch64_vcpu_add_default(vm, VCPU_ID_SOURCE, &init, guest_main); 87 + 88 + /* 89 + * make sure the target is already off when executing the test. 90 + */ 91 + init.features[0] |= (1 << KVM_ARM_VCPU_POWER_OFF); 92 + aarch64_vcpu_add_default(vm, VCPU_ID_TARGET, &init, guest_main); 93 + 94 + get_reg(vm, VCPU_ID_TARGET, ARM64_SYS_REG(MPIDR_EL1), &target_mpidr); 95 + vcpu_args_set(vm, VCPU_ID_SOURCE, 1, target_mpidr & MPIDR_HWID_BITMASK); 96 + vcpu_run(vm, VCPU_ID_SOURCE); 97 + 98 + switch (get_ucall(vm, VCPU_ID_SOURCE, &uc)) { 99 + case UCALL_DONE: 100 + break; 101 + case UCALL_ABORT: 102 + TEST_FAIL("%s at %s:%ld", (const char *)uc.args[0], __FILE__, 103 + uc.args[1]); 104 + break; 105 + default: 106 + TEST_FAIL("Unhandled ucall: %lu", uc.cmd); 107 + } 108 + 109 + get_reg(vm, VCPU_ID_TARGET, ARM64_CORE_REG(regs.pc), &obs_pc); 110 + get_reg(vm, VCPU_ID_TARGET, ARM64_CORE_REG(regs.regs[0]), &obs_x0); 111 + 112 + TEST_ASSERT(obs_pc == CPU_ON_ENTRY_ADDR, 113 + "unexpected target cpu pc: %lx (expected: %lx)", 114 + obs_pc, CPU_ON_ENTRY_ADDR); 115 + TEST_ASSERT(obs_x0 == CPU_ON_CONTEXT_ID, 116 + "unexpected target context id: %lx (expected: %lx)", 117 + obs_x0, CPU_ON_CONTEXT_ID); 118 + 119 + kvm_vm_free(vm); 120 + return 0; 121 + }

+1 -3

tools/testing/selftests/kvm/access_tracking_perf_test.c

··· 222 222 int vcpu_id = vcpu_args->vcpu_id; 223 223 int current_iteration = -1; 224 224 225 - vcpu_args_set(vm, vcpu_id, 1, vcpu_id); 226 - 227 225 while (spin_wait_for_next_iteration(&current_iteration)) { 228 226 switch (READ_ONCE(iteration_work)) { 229 227 case ITERATION_ACCESS_MEMORY: ··· 331 333 pthread_t *vcpu_threads; 332 334 int vcpus = params->vcpus; 333 335 334 - vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 336 + vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 1, 335 337 params->backing_src); 336 338 337 339 perf_test_setup_vcpus(vm, vcpus, params->vcpu_memory_bytes,

+1 -2

tools/testing/selftests/kvm/demand_paging_test.c

··· 52 52 struct timespec start; 53 53 struct timespec ts_diff; 54 54 55 - vcpu_args_set(vm, vcpu_id, 1, vcpu_id); 56 55 run = vcpu_state(vm, vcpu_id); 57 56 58 57 clock_gettime(CLOCK_MONOTONIC, &start); ··· 292 293 int vcpu_id; 293 294 int r; 294 295 295 - vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 296 + vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 296 297 p->src_type); 297 298 298 299 perf_test_args.wr_fract = 1;

+65 -12

tools/testing/selftests/kvm/dirty_log_perf_test.c

··· 44 44 struct perf_test_vcpu_args *vcpu_args = (struct perf_test_vcpu_args *)data; 45 45 int vcpu_id = vcpu_args->vcpu_id; 46 46 47 - vcpu_args_set(vm, vcpu_id, 1, vcpu_id); 48 47 run = vcpu_state(vm, vcpu_id); 49 48 50 49 while (!READ_ONCE(host_quit)) { ··· 93 94 int wr_fract; 94 95 bool partition_vcpu_memory_access; 95 96 enum vm_mem_backing_src_type backing_src; 97 + int slots; 96 98 }; 99 + 100 + static void toggle_dirty_logging(struct kvm_vm *vm, int slots, bool enable) 101 + { 102 + int i; 103 + 104 + for (i = 0; i < slots; i++) { 105 + int slot = PERF_TEST_MEM_SLOT_INDEX + i; 106 + int flags = enable ? KVM_MEM_LOG_DIRTY_PAGES : 0; 107 + 108 + vm_mem_region_set_flags(vm, slot, flags); 109 + } 110 + } 111 + 112 + static inline void enable_dirty_logging(struct kvm_vm *vm, int slots) 113 + { 114 + toggle_dirty_logging(vm, slots, true); 115 + } 116 + 117 + static inline void disable_dirty_logging(struct kvm_vm *vm, int slots) 118 + { 119 + toggle_dirty_logging(vm, slots, false); 120 + } 121 + 122 + static void get_dirty_log(struct kvm_vm *vm, int slots, unsigned long *bitmap, 123 + uint64_t nr_pages) 124 + { 125 + uint64_t slot_pages = nr_pages / slots; 126 + int i; 127 + 128 + for (i = 0; i < slots; i++) { 129 + int slot = PERF_TEST_MEM_SLOT_INDEX + i; 130 + unsigned long *slot_bitmap = bitmap + i * slot_pages; 131 + 132 + kvm_vm_get_dirty_log(vm, slot, slot_bitmap); 133 + } 134 + } 135 + 136 + static void clear_dirty_log(struct kvm_vm *vm, int slots, unsigned long *bitmap, 137 + uint64_t nr_pages) 138 + { 139 + uint64_t slot_pages = nr_pages / slots; 140 + int i; 141 + 142 + for (i = 0; i < slots; i++) { 143 + int slot = PERF_TEST_MEM_SLOT_INDEX + i; 144 + unsigned long *slot_bitmap = bitmap + i * slot_pages; 145 + 146 + kvm_vm_clear_dirty_log(vm, slot, slot_bitmap, 0, slot_pages); 147 + } 148 + } 97 149 98 150 static void run_test(enum vm_guest_mode mode, void *arg) 99 151 { ··· 164 114 struct timespec clear_dirty_log_total = (struct timespec){0}; 165 115 
166 116 vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 167 - p->backing_src); 117 + p->slots, p->backing_src); 168 118 169 119 perf_test_args.wr_fract = p->wr_fract; 170 120 ··· 213 163 214 164 /* Enable dirty logging */ 215 165 clock_gettime(CLOCK_MONOTONIC, &start); 216 - vm_mem_region_set_flags(vm, PERF_TEST_MEM_SLOT_INDEX, 217 - KVM_MEM_LOG_DIRTY_PAGES); 166 + enable_dirty_logging(vm, p->slots); 218 167 ts_diff = timespec_elapsed(start); 219 168 pr_info("Enabling dirty logging time: %ld.%.9lds\n\n", 220 169 ts_diff.tv_sec, ts_diff.tv_nsec); ··· 239 190 iteration, ts_diff.tv_sec, ts_diff.tv_nsec); 240 191 241 192 clock_gettime(CLOCK_MONOTONIC, &start); 242 - kvm_vm_get_dirty_log(vm, PERF_TEST_MEM_SLOT_INDEX, bmap); 243 - 193 + get_dirty_log(vm, p->slots, bmap, host_num_pages); 244 194 ts_diff = timespec_elapsed(start); 245 195 get_dirty_log_total = timespec_add(get_dirty_log_total, 246 196 ts_diff); ··· 248 200 249 201 if (dirty_log_manual_caps) { 250 202 clock_gettime(CLOCK_MONOTONIC, &start); 251 - kvm_vm_clear_dirty_log(vm, PERF_TEST_MEM_SLOT_INDEX, bmap, 0, 252 - host_num_pages); 253 - 203 + clear_dirty_log(vm, p->slots, bmap, host_num_pages); 254 204 ts_diff = timespec_elapsed(start); 255 205 clear_dirty_log_total = timespec_add(clear_dirty_log_total, 256 206 ts_diff); ··· 259 213 260 214 /* Disable dirty logging */ 261 215 clock_gettime(CLOCK_MONOTONIC, &start); 262 - vm_mem_region_set_flags(vm, PERF_TEST_MEM_SLOT_INDEX, 0); 216 + disable_dirty_logging(vm, p->slots); 263 217 ts_diff = timespec_elapsed(start); 264 218 pr_info("Disabling dirty logging time: %ld.%.9lds\n", 265 219 ts_diff.tv_sec, ts_diff.tv_nsec); ··· 290 244 { 291 245 puts(""); 292 246 printf("usage: %s [-h] [-i iterations] [-p offset] " 293 - "[-m mode] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]\n", name); 247 + "[-m mode] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]" 248 + "[-x memslots]\n", name); 294 249 puts(""); 295 250 printf(" -i: specify iteration counts 
(default: %"PRIu64")\n", 296 251 TEST_HOST_LOOP_N); ··· 310 263 " them into a separate region of memory for each vCPU.\n"); 311 264 printf(" -s: specify the type of memory that should be used to\n" 312 265 " back the guest data region.\n\n"); 266 + printf(" -x: Split the memory region into this number of memslots.\n" 267 + " (default: 1)"); 313 268 backing_src_help(); 314 269 puts(""); 315 270 exit(0); ··· 325 276 .wr_fract = 1, 326 277 .partition_vcpu_memory_access = true, 327 278 .backing_src = VM_MEM_SRC_ANONYMOUS, 279 + .slots = 1, 328 280 }; 329 281 int opt; 330 282 ··· 336 286 337 287 guest_modes_append_default(); 338 288 339 - while ((opt = getopt(argc, argv, "hi:p:m:b:f:v:os:")) != -1) { 289 + while ((opt = getopt(argc, argv, "hi:p:m:b:f:v:os:x:")) != -1) { 340 290 switch (opt) { 341 291 case 'i': 342 292 p.iterations = atoi(optarg); ··· 365 315 break; 366 316 case 's': 367 317 p.backing_src = parse_backing_src_type(optarg); 318 + break; 319 + case 'x': 320 + p.slots = atoi(optarg); 368 321 break; 369 322 case 'h': 370 323 default:
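One detail of `get_dirty_log()` and `clear_dirty_log()` above is worth checking when running with more than one slot: `bitmap + i * slot_pages` is arithmetic on an `unsigned long *`, so it advances by `slot_pages` words rather than `slot_pages` bits. If the bitmap holds one bit per page (an assumption, since the allocation of `bmap` is not part of this hunk), a per-slot offset in words could be sketched as below. `BITS_PER_LONG_SKETCH` and `slot_bitmap` are hypothetical names, and the sketch assumes `slot_pages` is a multiple of the word size in bits.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LONG_SKETCH	(sizeof(unsigned long) * CHAR_BIT)

/*
 * First word of slot i's share of a bitmap with one bit per page.
 * Assumes slot_pages is a multiple of BITS_PER_LONG_SKETCH, so every
 * slot starts on a word boundary.
 */
static unsigned long *slot_bitmap(unsigned long *bitmap, uint64_t slot_pages,
				  int i)
{
	return bitmap + ((uint64_t)i * slot_pages) / BITS_PER_LONG_SKETCH;
}
```

An alternative that avoids the alignment assumption is to allocate one bitmap per slot and index an array of pointers.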

+3

tools/testing/selftests/kvm/include/aarch64/processor.h

··· 17 17 #define CPACR_EL1 3, 0, 1, 0, 2 18 18 #define TCR_EL1 3, 0, 2, 0, 2 19 19 #define MAIR_EL1 3, 0, 10, 2, 0 20 + #define MPIDR_EL1 3, 0, 0, 0, 5 20 21 #define TTBR0_EL1 3, 0, 2, 0, 0 21 22 #define SCTLR_EL1 3, 0, 1, 0, 0 22 23 #define VBAR_EL1 3, 0, 12, 0, 0 ··· 40 39 (0x44ul << (3 * 8)) | \ 41 40 (0xfful << (4 * 8)) | \ 42 41 (0xbbul << (5 * 8))) 42 + 43 + #define MPIDR_HWID_BITMASK (0xff00fffffful) 43 44 44 45 static inline void get_reg(struct kvm_vm *vm, uint32_t vcpuid, uint64_t id, uint64_t *addr) 45 46 {
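`MPIDR_HWID_BITMASK` keeps only the affinity fields of MPIDR_EL1, which is the form the PSCI CPU_ON target argument takes in the new test. The sketch below shows the field layout as defined by the Arm architecture (Aff0 in bits 7:0, Aff1 in 15:8, Aff2 in 23:16, Aff3 in 39:32). `MPIDR_HWID_SKETCH` and `mpidr_aff` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Same value as MPIDR_HWID_BITMASK, spelled with a fixed-width constant. */
#define MPIDR_HWID_SKETCH	UINT64_C(0xff00ffffff)

/* Extract affinity level 0..3 from an MPIDR_EL1 value. */
static unsigned int mpidr_aff(uint64_t mpidr, int level)
{
	static const unsigned int shift[] = { 0, 8, 16, 32 };

	return (unsigned int)((mpidr >> shift[level]) & 0xff);
}
```

Masking drops the non-affinity bits (for example the RES1 bit 31 and the U and MT flags), so the value passed to the guest identifies the target vCPU only.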

+1 -1

tools/testing/selftests/kvm/include/perf_test_util.h

··· 44 44 extern uint64_t guest_test_phys_mem; 45 45 46 46 struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, 47 - uint64_t vcpu_memory_bytes, 47 + uint64_t vcpu_memory_bytes, int slots, 48 48 enum vm_mem_backing_src_type backing_src); 49 49 void perf_test_destroy_vm(struct kvm_vm *vm); 50 50 void perf_test_setup_vcpus(struct kvm_vm *vm, int vcpus,

+12

tools/testing/selftests/kvm/kvm_binary_stats_test.c

···
 		/* Check size field, which should not be zero */
 		TEST_ASSERT(pdesc->size, "KVM descriptor(%s) with size of 0",
 				pdesc->name);
+		/* Check bucket_size field */
+		switch (pdesc->flags & KVM_STATS_TYPE_MASK) {
+		case KVM_STATS_TYPE_LINEAR_HIST:
+			TEST_ASSERT(pdesc->bucket_size,
+			    "Bucket size of Linear Histogram stats (%s) is zero",
+			    pdesc->name);
+			break;
+		default:
+			TEST_ASSERT(!pdesc->bucket_size,
+			    "Bucket size of stats (%s) is not zero",
+			    pdesc->name);
+		}
 		size_data += pdesc->size * sizeof(*stats_data);
 	}
 	/* Check overlap */
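
The rule enforced by the new check is that `bucket_size` must be non-zero for linear histograms and zero for every other stats type, including logarithmic histograms. A standalone sketch of that rule follows. The constants are local copies of the values quoted in the api.rst hunk; `KVM_STATS_TYPE_SHIFT` and `KVM_STATS_TYPE_MASK` are assumed from the statement that bits 0-3 of `flags` encode the type, and `bucket_size_ok()` is a hypothetical helper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies; SHIFT and MASK are assumptions (type lives in bits 0-3). */
#define KVM_STATS_TYPE_SHIFT		0
#define KVM_STATS_TYPE_MASK		(0xF << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_TYPE_CUMULATIVE	(0x0 << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_TYPE_LINEAR_HIST	(0x3 << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_TYPE_LOG_HIST		(0x4 << KVM_STATS_TYPE_SHIFT)

/* Hypothetical helper: is bucket_size consistent with the stats type? */
static bool bucket_size_ok(uint32_t flags, uint32_t bucket_size)
{
	switch (flags & KVM_STATS_TYPE_MASK) {
	case KVM_STATS_TYPE_LINEAR_HIST:
		return bucket_size != 0;	/* linear buckets need a width */
	default:
		return bucket_size == 0;	/* field is unused otherwise */
	}
}

/* Usage example covering both branches. */
static void bucket_size_example(void)
{
	assert(bucket_size_ok(KVM_STATS_TYPE_LINEAR_HIST, 50));
	assert(!bucket_size_ok(KVM_STATS_TYPE_LINEAR_HIST, 0));
	assert(bucket_size_ok(KVM_STATS_TYPE_LOG_HIST, 0));
	assert(!bucket_size_ok(KVM_STATS_TYPE_CUMULATIVE, 4));
}
```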

+17 -5

tools/testing/selftests/kvm/lib/perf_test_util.c

···
 }

 struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
-				   uint64_t vcpu_memory_bytes,
+				   uint64_t vcpu_memory_bytes, int slots,
 				   enum vm_mem_backing_src_type backing_src)
 {
 	struct kvm_vm *vm;
 	uint64_t guest_num_pages;
+	int i;

 	pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));

···
 		    "Guest memory size is not host page size aligned.");
 	TEST_ASSERT(vcpu_memory_bytes % perf_test_args.guest_page_size == 0,
 		    "Guest memory size is not guest page size aligned.");
+	TEST_ASSERT(guest_num_pages % slots == 0,
+		    "Guest memory cannot be evenly divided into %d slots.",
+		    slots);

 	vm = vm_create_with_vcpus(mode, vcpus, DEFAULT_GUEST_PHY_PAGES,
 				  (vcpus * vcpu_memory_bytes) / perf_test_args.guest_page_size,
···
 #endif
 	pr_info("guest physical test memory offset: 0x%lx\n", guest_test_phys_mem);

-	/* Add an extra memory slot for testing */
-	vm_userspace_mem_region_add(vm, backing_src, guest_test_phys_mem,
-				    PERF_TEST_MEM_SLOT_INDEX,
-				    guest_num_pages, 0);
+	/* Add extra memory slots for testing */
+	for (i = 0; i < slots; i++) {
+		uint64_t region_pages = guest_num_pages / slots;
+		vm_paddr_t region_start = guest_test_phys_mem +
+			region_pages * perf_test_args.guest_page_size * i;
+
+		vm_userspace_mem_region_add(vm, backing_src, region_start,
+					    PERF_TEST_MEM_SLOT_INDEX + i,
+					    region_pages, 0);
+	}

 	/* Do mapping for the demand paging memory slot */
 	virt_map(vm, guest_test_virt_mem, guest_test_phys_mem, guest_num_pages);
···
 				perf_test_args.guest_page_size;
 			vcpu_gpa = guest_test_phys_mem;
 		}
+
+		vcpu_args_set(vm, vcpu_id, 1, vcpu_id);

 		pr_debug("Added VCPU %d with test mem gpa [%lx, %lx)\n",
 			 vcpu_id, vcpu_gpa, vcpu_gpa +
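
The slot loop splits one contiguous guest-physical range into `slots` equal regions laid out back to back. The arithmetic can be sketched outside the selftest framework as below; `struct slot_region` and `slot_region()` are hypothetical names, and plain `uint64_t` stands in for `vm_paddr_t`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: where region i starts and how big it is. */
struct slot_region {
	uint64_t start;	/* guest physical address */
	uint64_t pages;	/* size in guest pages */
};

/* Mirrors the region_pages / region_start computation in the patch. */
static struct slot_region slot_region(uint64_t base, uint64_t num_pages,
				      uint64_t page_size, int slots, int i)
{
	struct slot_region r;

	/* Same precondition as the new TEST_ASSERT. */
	assert(slots > 0 && num_pages % slots == 0);
	assert(i >= 0 && i < slots);

	r.pages = num_pages / slots;
	r.start = base + r.pages * page_size * i;
	return r;
}

/* Usage example: 1024 pages of 4 KiB split into 4 slots of 1 MiB each. */
static void slot_region_example(void)
{
	struct slot_region first = slot_region(0x100000, 1024, 4096, 4, 0);
	struct slot_region last = slot_region(0x100000, 1024, 4096, 4, 3);

	assert(first.start == 0x100000 && first.pages == 256);
	assert(last.start == 0x400000 && last.pages == 256);
	/* The last region ends exactly where the original single slot did. */
	assert(last.start + last.pages * 4096 == 0x100000 + 1024 * 4096);
}
```

Because the regions tile the original range exactly, the single `virt_map()` call over `guest_num_pages` that follows the loop in the patch still covers all of the test memory.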

+1 -2

tools/testing/selftests/kvm/memslot_modification_stress_test.c

···
 	struct kvm_vm *vm = perf_test_args.vm;
 	struct kvm_run *run;

-	vcpu_args_set(vm, vcpu_id, 1, vcpu_id);
 	run = vcpu_state(vm, vcpu_id);

 	/* Let the guest access its memory until a stop signal is received */
···
 	struct kvm_vm *vm;
 	int vcpu_id;

-	vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+	vm = perf_test_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 VM_MEM_SRC_ANONYMOUS);

 	perf_test_args.wr_fract = 1;

+21 -3

tools/testing/selftests/kvm/x86_64/debug_regs.c

···
 #include <string.h>
 #include "kvm_util.h"
 #include "processor.h"
+#include "apic.h"

 #define VCPU_ID 0

 #define DR6_BD (1 << 13)
 #define DR7_GD (1 << 13)
+
+#define IRQ_VECTOR 0xAA

 /* For testing data access debug BP */
 uint32_t guest_value;
···

 static void guest_code(void)
 {
+	/* Create a pending interrupt on current vCPU */
+	x2apic_enable();
+	x2apic_write_reg(APIC_ICR, APIC_DEST_SELF | APIC_INT_ASSERT |
+			 APIC_DM_FIXED | IRQ_VECTOR);
+
 	/*
 	 * Software BP tests.
 	 *
···
 		     "mov %%rax,%0;\n\t write_data:"
 		     : "=m" (guest_value) : : "rax");

-	/* Single step test, covers 2 basic instructions and 2 emulated */
+	/*
+	 * Single step test, covers 2 basic instructions and 2 emulated
+	 *
+	 * Enable interrupts during the single stepping to see that
+	 * pending interrupt we raised is not handled due to KVM_GUESTDBG_BLOCKIRQ
+	 */
 	asm volatile("ss_start: "
+		     "sti\n\t"
 		     "xor %%eax,%%eax\n\t"
 		     "cpuid\n\t"
 		     "movl $0x1a0,%%ecx\n\t"
 		     "rdmsr\n\t"
+		     "cli\n\t"
 		     : : : "eax", "ebx", "ecx", "edx");

 	/* DR6.BD test */
···
 	uint64_t cmd;
 	int i;
 	/* Instruction lengths starting at ss_start */
-	int ss_size[4] = {
+	int ss_size[6] = {
+		1, /* sti*/
 		2, /* xor */
 		2, /* cpuid */
 		5, /* mov */
 		2, /* rdmsr */
+		1, /* cli */
 	};

 	if (!kvm_check_cap(KVM_CAP_SET_GUEST_DEBUG)) {
···
 	for (i = 0; i < (sizeof(ss_size) / sizeof(ss_size[0])); i++) {
 		target_rip += ss_size[i];
 		CLEAR_DEBUG();
-		debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
+		debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP |
+				KVM_GUESTDBG_BLOCKIRQ;
 		debug.arch.debugreg[7] = 0x00000400;
 		APPLY_DEBUG();
 		vcpu_run(vm, VCPU_ID);
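
The single-step loop predicts the guest RIP after each step by accumulating instruction lengths from `ss_size`, so adding `sti` and `cli` shifts every expected address. A sketch of that accumulation, using the lengths from the patch and a hypothetical helper `ss_target_rip()`; the start address in the example is made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Instruction lengths starting at ss_start, copied from the patch. */
static const int ss_size[6] = {
	1,	/* sti */
	2,	/* xor */
	2,	/* cpuid */
	5,	/* mov */
	2,	/* rdmsr */
	1,	/* cli */
};

/* Hypothetical helper: expected RIP after single-stepping `step` + 1 times. */
static uint64_t ss_target_rip(uint64_t ss_start, size_t step)
{
	uint64_t rip = ss_start;
	size_t i;

	for (i = 0; i <= step && i < sizeof(ss_size) / sizeof(ss_size[0]); i++)
		rip += ss_size[i];
	return rip;
}

/* Usage example with an arbitrary ss_start of 0x1000. */
static void ss_target_rip_example(void)
{
	assert(ss_target_rip(0x1000, 0) == 0x1001);	/* after sti */
	assert(ss_target_rip(0x1000, 1) == 0x1003);	/* after xor */
	assert(ss_target_rip(0x1000, 5) == 0x100d);	/* after cli, 13 bytes in */
}
```

If the pending interrupt were delivered while stepping, the vCPU would land in a handler instead of at one of these addresses, which is how the test can tell that `KVM_GUESTDBG_BLOCKIRQ` held the interrupt back.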

-2

virt/kvm/binary_stats.c

···
 		src = stats + pos - header->data_offset;
 		if (copy_to_user(dest, src, copylen))
 			return -EFAULT;
-		remain -= copylen;
 		pos += copylen;
-		dest += copylen;
 	}

 	*offset = pos;

-5

virt/kvm/dirty_ring.c

···
 	gfn->flags = KVM_DIRTY_GFN_F_DIRTY;
 }

-static inline bool kvm_dirty_gfn_invalid(struct kvm_dirty_gfn *gfn)
-{
-	return gfn->flags == 0;
-}
-
 static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
 {
 	return gfn->flags & KVM_DIRTY_GFN_F_RESET;

+147 -50

virt/kvm/kvm_main.c

···
 	return true;
 }

-bool kvm_is_transparent_hugepage(kvm_pfn_t pfn)
-{
-	struct page *page = pfn_to_page(pfn);
-
-	if (!PageTransCompoundMap(page))
-		return false;
-
-	return is_transparent_hugepage(compound_head(page));
-}
-
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
  */
···
 	 */
 	long dirty_count = smp_load_acquire(&kvm->tlbs_dirty);

+	++kvm->stat.generic.remote_tlb_flush_requests;
 	/*
 	 * We want to publish modifications to the page tables before reading
 	 * mode. Pairs with a memory barrier in arch-specific code.
···
 	vcpu->preempted = false;
 	vcpu->ready = false;
 	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
+	vcpu->last_used_slot = 0;
 }

 void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
···

 	idx = srcu_read_lock(&kvm->srcu);

-	/* The on_lock() path does not yet support lock elision. */
-	if (!IS_KVM_NULL_FN(range->on_lock)) {
-		locked = true;
-		KVM_MMU_LOCK(kvm);
-
-		range->on_lock(kvm, range->start, range->end);
-
-		if (IS_KVM_NULL_FN(range->handler))
-			goto out_unlock;
-	}
-
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
 		kvm_for_each_memslot(slot, slots) {
···
 			if (!locked) {
 				locked = true;
 				KVM_MMU_LOCK(kvm);
+				if (!IS_KVM_NULL_FN(range->on_lock))
+					range->on_lock(kvm, range->start, range->end);
+				if (IS_KVM_NULL_FN(range->handler))
+					break;
 			}
 			ret |= range->handler(kvm, &gfn_range);
 		}
···
 	if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
 		kvm_flush_remote_tlbs(kvm);

-out_unlock:
 	if (locked)
 		KVM_MMU_UNLOCK(kvm);

···
 	trace_kvm_set_spte_hva(address);

 	/*
-	 * .change_pte() must be surrounded by .invalidate_range_{start,end}(),
-	 * and so always runs with an elevated notifier count. This obviates
-	 * the need to bump the sequence count.
+	 * .change_pte() must be surrounded by .invalidate_range_{start,end}().
+	 * If mmu_notifier_count is zero, then no in-progress invalidations,
+	 * including this one, found a relevant memslot at start(); rechecking
+	 * memslots here is unnecessary. Note, a false positive (count elevated
+	 * by a different invalidation) is sub-optimal but functionally ok.
 	 */
-	WARN_ON_ONCE(!kvm->mmu_notifier_count);
+	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+	if (!READ_ONCE(kvm->mmu_notifier_count))
+		return;

 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }

-static void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
+void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
 				   unsigned long end)
 {
 	/*
···

 	trace_kvm_unmap_hva_range(range->start, range->end);

+	/*
+	 * Prevent memslot modification between range_start() and range_end()
+	 * so that conditionally locking provides the same result in both
+	 * functions. Without that guarantee, the mmu_notifier_count
+	 * adjustments will be imbalanced.
+	 *
+	 * Pairs with the decrement in range_end().
+	 */
+	spin_lock(&kvm->mn_invalidate_lock);
+	kvm->mn_active_invalidate_count++;
+	spin_unlock(&kvm->mn_invalidate_lock);
+
 	__kvm_handle_hva_range(kvm, &hva_range);

 	return 0;
 }

-static void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
+void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
 				   unsigned long end)
 {
 	/*
···
 		.flush_on_ret = false,
 		.may_block = mmu_notifier_range_blockable(range),
 	};
+	bool wake;

 	__kvm_handle_hva_range(kvm, &hva_range);
+
+	/* Pairs with the increment in range_start(). */
+	spin_lock(&kvm->mn_invalidate_lock);
+	wake = (--kvm->mn_active_invalidate_count == 0);
+	spin_unlock(&kvm->mn_invalidate_lock);
+
+	/*
+	 * There can only be one waiter, since the wait happens under
+	 * slots_lock.
+	 */
+	if (wake)
+		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);

 	BUG_ON(kvm->mmu_notifier_count < 0);
 }
···
 	char dir_name[ITOA_MAX_LEN * 2];
 	struct kvm_stat_data *stat_data;
 	const struct _kvm_stats_desc *pdesc;
-	int i;
+	int i, ret;
 	int kvm_debugfs_num_entries = kvm_vm_stats_header.num_desc +
 				      kvm_vcpu_stats_header.num_desc;

···
 				    kvm->debugfs_dentry, stat_data,
 				    &stat_fops_per_vm);
 	}
+
+	ret = kvm_arch_create_vm_debugfs(kvm);
+	if (ret) {
+		kvm_destroy_vm_debugfs(kvm);
+		return i;
+	}
+
 	return 0;
 }

···
 {
 }

+/*
+ * Called after per-vm debugfs created. When called kvm->debugfs_dentry should
+ * be setup already, so we can create arch-specific debugfs entries under it.
+ * Cleanup should be automatic done in kvm_destroy_vm_debugfs() recursively, so
+ * a per-arch destroy interface is not needed.
+ */
+int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
+{
+	return 0;
+}
+
 static struct kvm *kvm_create_vm(unsigned long type)
 {
 	struct kvm *kvm = kvm_arch_alloc_vm();
···
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
 	mutex_init(&kvm->slots_arch_lock);
+	spin_lock_init(&kvm->mn_invalidate_lock);
+	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
+
 	INIT_LIST_HEAD(&kvm->devices);

 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
···
 	kvm_coalesced_mmio_free(kvm);
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+	/*
+	 * At this point, pending calls to invalidate_range_start()
+	 * have completed but no more MMU notifiers will run, so
+	 * mn_active_invalidate_count may remain unbalanced.
+	 * No threads can be waiting in install_new_memslots as the
+	 * last reference on KVM has been dropped, but freeing
+	 * memslots would deadlock without this manual intervention.
+	 */
+	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+	kvm->mn_active_invalidate_count = 0;
 #else
 	kvm_arch_flush_shadow_all(kvm);
 #endif
···
 	refcount_inc(&kvm->users_count);
 }
 EXPORT_SYMBOL_GPL(kvm_get_kvm);
+
+/*
+ * Make sure the vm is not during destruction, which is a safe version of
+ * kvm_get_kvm(). Return true if kvm referenced successfully, false otherwise.
+ */
+bool kvm_get_kvm_safe(struct kvm *kvm)
+{
+	return refcount_inc_not_zero(&kvm->users_count);
+}
+EXPORT_SYMBOL_GPL(kvm_get_kvm_safe);

 void kvm_put_kvm(struct kvm *kvm)
 {
···

 	slots->used_slots--;

-	if (atomic_read(&slots->lru_slot) >= slots->used_slots)
-		atomic_set(&slots->lru_slot, 0);
+	if (atomic_read(&slots->last_used_slot) >= slots->used_slots)
+		atomic_set(&slots->last_used_slot, 0);

 	for (i = slots->id_to_index[memslot->id]; i < slots->used_slots; i++) {
 		mslots[i] = mslots[i + 1];
···
 	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
 	slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;

+	/*
+	 * Do not store the new memslots while there are invalidations in
+	 * progress, otherwise the locking in invalidate_range_start and
+	 * invalidate_range_end will be unbalanced.
+	 */
+	spin_lock(&kvm->mn_invalidate_lock);
+	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
+	while (kvm->mn_active_invalidate_count) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		spin_unlock(&kvm->mn_invalidate_lock);
+		schedule();
+		spin_lock(&kvm->mn_invalidate_lock);
+	}
+	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
 	rcu_assign_pointer(kvm->memslots[as_id], slots);
+	spin_unlock(&kvm->mn_invalidate_lock);

 	/*
 	 * Acquired in kvm_set_memslot. Must be released before synchronize
···

 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	return __gfn_to_memslot(kvm_vcpu_memslots(vcpu), gfn);
+	struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
+	struct kvm_memory_slot *slot;
+	int slot_index;
+
+	slot = try_get_memslot(slots, vcpu->last_used_slot, gfn);
+	if (slot)
+		return slot;
+
+	/*
+	 * Fall back to searching all memslots. We purposely use
+	 * search_memslots() instead of __gfn_to_memslot() to avoid
+	 * thrashing the VM-wide last_used_index in kvm_memslots.
+	 */
+	slot = search_memslots(slots, gfn, &slot_index);
+	if (slot) {
+		vcpu->last_used_slot = slot_index;
+		return slot;
+	}
+
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);

···
 	 * Get a reference here because callers of *hva_to_pfn* and
 	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
 	 * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
-	 * set, but the kvm_get_pfn/kvm_release_pfn_clean pair will
+	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
 	 * simply do nothing for reserved pfns.
 	 *
 	 * Whoever called remap_pfn_range is also going to call e.g.
···
 	mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
-
-void kvm_get_pfn(kvm_pfn_t pfn)
-{
-	if (!kvm_is_reserved_pfn(pfn))
-		get_page(pfn_to_page(pfn));
-}
-EXPORT_SYMBOL_GPL(kvm_get_pfn);

 static int next_segment(unsigned long len, int offset)
 {
···
 				++vcpu->stat.generic.halt_successful_poll;
 				if (!vcpu_valid_wakeup(vcpu))
 					++vcpu->stat.generic.halt_poll_invalid;
+
+				KVM_STATS_LOG_HIST_UPDATE(
+				      vcpu->stat.generic.halt_poll_success_hist,
+				      ktime_to_ns(ktime_get()) -
+				      ktime_to_ns(start));
 				goto out;
 			}
 			cpu_relax();
 			poll_end = cur = ktime_get();
 		} while (kvm_vcpu_can_poll(cur, stop));
+
+		KVM_STATS_LOG_HIST_UPDATE(
+				vcpu->stat.generic.halt_poll_fail_hist,
+				ktime_to_ns(ktime_get()) - ktime_to_ns(start));
 	}
+

 	prepare_to_rcuwait(&vcpu->wait);
 	for (;;) {
···
 	}
 	finish_rcuwait(&vcpu->wait);
 	cur = ktime_get();
+	if (waited) {
+		vcpu->stat.generic.halt_wait_ns +=
+			ktime_to_ns(cur) - ktime_to_ns(poll_end);
+		KVM_STATS_LOG_HIST_UPDATE(vcpu->stat.generic.halt_wait_hist,
+				ktime_to_ns(cur) - ktime_to_ns(poll_end));
+	}
 out:
 	kvm_arch_vcpu_unblocking(vcpu);
 	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
···
 	struct kvm_fpu *fpu = NULL;
 	struct kvm_sregs *kvm_sregs = NULL;

-	if (vcpu->kvm->mm != current->mm)
+	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
 		return -EIO;

 	if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
···
 	void __user *argp = compat_ptr(arg);
 	int r;

-	if (vcpu->kvm->mm != current->mm)
+	if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_bugged)
 		return -EIO;

 	switch (ioctl) {
···
 {
 	struct kvm_device *dev = filp->private_data;

-	if (dev->kvm->mm != current->mm)
+	if (dev->kvm->mm != current->mm || dev->kvm->vm_bugged)
 		return -EIO;

 	switch (ioctl) {
···
 	void __user *argp = (void __user *)arg;
 	int r;

-	if (kvm->mm != current->mm)
+	if (kvm->mm != current->mm || kvm->vm_bugged)
 		return -EIO;
 	switch (ioctl) {
 	case KVM_CREATE_VCPU:
···
 	struct kvm *kvm = filp->private_data;
 	int r;

-	if (kvm->mm != current->mm)
+	if (kvm->mm != current->mm || kvm->vm_bugged)
 		return -EIO;
 	switch (ioctl) {
 #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
···
 	struct kvm_stat_data *stat_data = (struct kvm_stat_data *)
 					  inode->i_private;

-	/* The debugfs files are a reference to the kvm struct which
-	 * is still valid when kvm_destroy_vm is called.
-	 * To avoid the race between open and the removal of the debugfs
-	 * directory we test against the users count.
+	/*
+	 * The debugfs files are a reference to the kvm struct which
+	 * is still valid when kvm_destroy_vm is called. kvm_get_kvm_safe
+	 * avoids the race between open and the removal of the debugfs directory.
 	 */
-	if (!refcount_inc_not_zero(&stat_data->kvm->users_count))
+	if (!kvm_get_kvm_safe(stat_data->kvm))
 		return -ENOENT;

 	if (simple_attr_open(inode, file, get,
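
`kvm_get_kvm_safe()` wraps `refcount_inc_not_zero()`: it takes a reference only if the count has not already dropped to zero, so a late debugfs open cannot revive a VM that is being destroyed. The idea can be sketched in userspace with C11 atomics. Everything below (`struct obj`, `obj_get_safe()`) is a hypothetical stand-in, and unlike the kernel's `refcount_t` it does no overflow saturation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object with a users_count. */
struct obj {
	atomic_uint users_count;
};

/* Take a reference unless the count already reached zero. */
static bool obj_get_safe(struct obj *o)
{
	unsigned int old = atomic_load(&o->users_count);

	while (old != 0) {
		/* On failure, `old` is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->users_count, &old, old + 1))
			return true;
	}
	return false;	/* object is already on its way out */
}

/* Usage example: a live object can be pinned, a dead one cannot. */
static void obj_get_safe_example(void)
{
	struct obj o;

	atomic_init(&o.users_count, 1);
	assert(obj_get_safe(&o));
	assert(atomic_load(&o.users_count) == 2);

	atomic_store(&o.users_count, 0);
	assert(!obj_get_safe(&o));
	assert(atomic_load(&o.users_count) == 0);
}
```

The compare-and-exchange loop is what makes the zero check and the increment a single atomic step; a separate "check, then increment" would reopen the race that the debugfs comment describes.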
