Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

+17

Documentation/virtual/kvm/api.txt

@@ -45,6 +45,23 @@
 and one vcpu per thread.
 
 
+It is important to note that althought VM ioctls may only be issued from
+the process that created the VM, a VM's lifecycle is associated with its
+file descriptor, not its creator (process). In other words, the VM and
+its resources, *including the associated address space*, are not freed
+until the last reference to the VM's file descriptor has been released.
+For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM will
+not be freed until both the parent (original) process and its child have
+put their references to the VM's file descriptor.
+
+Because a VM's resources are not freed until the last reference to its
+file descriptor is released, creating additional references to a VM via
+via fork(), dup(), etc... without careful consideration is strongly
+discouraged and may have unwanted side effects, e.g. memory allocated
+by and on behalf of the VM's process may not be freed/unaccounted when
+the VM is shut down.
+
+
 3. Extensions
 -------------
 
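The lifetime rule in the new api.txt text can be modeled as plain reference counting on the VM file descriptor. The sketch below is a hypothetical user-space model, not KVM code. `struct vm_model`, `vm_create()`, `vm_get()` and `vm_put()` are invented names that stand in for ioctl(KVM_CREATE_VM), fork()/dup() and close() respectively.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a VM whose lifetime follows the number of open
 * references to its file descriptor, not the process that created it. */
struct vm_model {
	int fd_refs;   /* references held by the creator, fork() children, dup() copies */
	bool freed;    /* set once the last reference has been dropped */
};

/* ioctl(KVM_CREATE_VM) analogue: the creator holds the first reference. */
static void vm_create(struct vm_model *vm)
{
	vm->fd_refs = 1;
	vm->freed = false;
}

/* fork() or dup() analogue: another reference to the same VM fd. */
static void vm_get(struct vm_model *vm)
{
	assert(!vm->freed);
	vm->fd_refs++;
}

/* close() or process-exit analogue: the VM and its resources are released
 * only when the final reference goes away. */
static void vm_put(struct vm_model *vm)
{
	assert(vm->fd_refs > 0);
	if (--vm->fd_refs == 0)
		vm->freed = true;
}
```

In this model, suppose a parent forks after creating the VM and then closes its descriptor. The VM stays alive until the child also drops its copy. This is the side effect the documentation warns about.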

+23 -14

Documentation/virtual/kvm/halt-polling.txt

+9 -32

Documentation/virtual/kvm/mmu.txt

@@ -224,10 +224,6 @@
     A bitmap indicating which sptes in spt point (directly or indirectly) at
     pages that may be unsynchronized. Used to quickly locate all unsychronized
     pages reachable from a given page.
-  mmu_valid_gen:
-    Generation number of the page. It is compared with kvm->arch.mmu_valid_gen
-    during hash table lookup, and used to skip invalidated shadow pages (see
-    "Zapping all pages" below.)
   clear_spte_count:
     Only present on 32-bit hosts, where a 64-bit spte cannot be written
     atomically. The reader uses this while running out of the MMU lock
@@ -398,27 +402,6 @@
 a large spte. The frames at the end of an unaligned memory slot have
 artificially inflated ->disallow_lpages so they can never be instantiated.
 
-Zapping all pages (page generation count)
-=========================================
-
-For the large memory guests, walking and zapping all pages is really slow
-(because there are a lot of pages), and also blocks memory accesses of
-all VCPUs because it needs to hold the MMU lock.
-
-To make it be more scalable, kvm maintains a global generation number
-which is stored in kvm->arch.mmu_valid_gen. Every shadow page stores
-the current global generation-number into sp->mmu_valid_gen when it
-is created. Pages with a mismatching generation number are "obsolete".
-
-When KVM need zap all shadow pages sptes, it just simply increases the global
-generation-number then reload root shadow pages on all vcpus. As the VCPUs
-create new shadow page tables, the old pages are not used because of the
-mismatching generation number.
-
-KVM then walks through all pages and zaps obsolete pages. While the zap
-operation needs to take the MMU lock, the lock can be released periodically
-so that the VCPUs can make progress.
-
 Fast invalidation of MMIO sptes
 ===============================
 
@@ -410,8 +435,7 @@
 MMIO sptes have a few spare bits, which are used to store a
 generation number. The global generation number is stored in
 kvm_memslots(kvm)->generation, and increased whenever guest memory info
-changes. This generation number is distinct from the one described in
-the previous section.
+changes.
 
 When KVM finds an MMIO spte, it checks the generation number of the spte.
 If the generation number of the spte does not equal the global generation
@@ -426,13 +452,16 @@
 out-of-date information, but with an up-to-date generation number.
 
 To avoid this, the generation number is incremented again after synchronize_srcu
-returns; thus, the low bit of kvm_memslots(kvm)->generation is only 1 during a
+returns; thus, bit 63 of kvm_memslots(kvm)->generation set to 1 only during a
 memslot update, while some SRCU readers might be using the old copy. We do not
 want to use an MMIO sptes created with an odd generation number, and we can do
-this without losing a bit in the MMIO spte. The low bit of the generation
-is not stored in MMIO spte, and presumed zero when it is extracted out of the
-spte. If KVM is unlucky and creates an MMIO spte while the low bit is 1,
-the next access to the spte will always be a cache miss.
+this without losing a bit in the MMIO spte. The "update in-progress" bit of the
+generation is not stored in MMIO spte, and is so is implicitly zero when the
+generation is extracted out of the spte. If KVM is unlucky and creates an MMIO
+spte while an update is in-progress, the next access to the spte will always be
+a cache miss. For example, a subsequent access during the update window will
+miss due to the in-progress flag diverging, while an access after the update
+window closes will have a higher generation number (as compared to the spte).
 
 
 Further reading
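The updated mmu.txt text describes an "update in-progress" scheme, which can be sketched as follows. This is a simplified model under stated assumptions. `global_gen` stands in for `kvm_memslots(kvm)->generation`, and the helper names are hypothetical. Real MMIO sptes keep only a few spare bits of the generation rather than the full value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 63 of the memslot generation flags an in-progress update. */
#define GEN_UPDATE_IN_PROGRESS (UINT64_C(1) << 63)

/* Hypothetical stand-in for kvm_memslots(kvm)->generation. */
static uint64_t global_gen;

/* Start of a memslot update: the new slots are published with bit 63 set. */
static void memslot_update_begin(void)
{
	global_gen |= GEN_UPDATE_IN_PROGRESS;
}

/* After synchronize_srcu() returns: clear bit 63 and advance the generation. */
static void memslot_update_end(void)
{
	global_gen = (global_gen & ~GEN_UPDATE_IN_PROGRESS) + 1;
}

/* Creating an MMIO spte: bit 63 is never stored, so it reads back as zero. */
static uint64_t mmio_spte_create(void)
{
	return global_gen & ~GEN_UPDATE_IN_PROGRESS;
}

/* A cached MMIO spte is trusted only if its generation matches exactly. */
static bool mmio_spte_is_fresh(uint64_t spte_gen)
{
	return spte_gen == global_gen;
}
```

An spte created while bit 63 is set misses in two cases. During the update window, bit 63 differs. After the window closes, the generation has advanced. These are the two cases in the text.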

+8 -11

MAINTAINERS

@@ -8461,6 +8461,7 @@
 F:	include/kvm/iodev.h
 F:	virt/kvm/*
 F:	tools/kvm/
+F:	tools/testing/selftests/kvm/
 
 KERNEL VIRTUAL MACHINE FOR AMD-V (KVM/amd)
 M:	Joerg Roedel <joro@8bytes.org>
@@ -8471,29 +8470,25 @@
 F:	arch/x86/include/asm/svm.h
 F:	arch/x86/kvm/svm.c
 
-KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)
+KERNEL VIRTUAL MACHINE FOR ARM/ARM64 (KVM/arm, KVM/arm64)
 M:	Christoffer Dall <christoffer.dall@arm.com>
 M:	Marc Zyngier <marc.zyngier@arm.com>
+R:	James Morse <james.morse@arm.com>
+R:	Julien Thierry <julien.thierry@arm.com>
+R:	Suzuki K Pouloze <suzuki.poulose@arm.com>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 L:	kvmarm@lists.cs.columbia.edu
 W:	http://systems.cs.columbia.edu/projects/kvm-arm
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git
-S:	Supported
+S:	Maintained
 F:	arch/arm/include/uapi/asm/kvm*
 F:	arch/arm/include/asm/kvm*
 F:	arch/arm/kvm/
-F:	virt/kvm/arm/
-F:	include/kvm/arm_*
-
-KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
-M:	Christoffer Dall <christoffer.dall@arm.com>
-M:	Marc Zyngier <marc.zyngier@arm.com>
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-L:	kvmarm@lists.cs.columbia.edu
-S:	Maintained
 F:	arch/arm64/include/uapi/asm/kvm*
 F:	arch/arm64/include/asm/kvm*
 F:	arch/arm64/kvm/
+F:	virt/kvm/arm/
+F:	include/kvm/arm_*
 
 KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips)
 M:	James Hogan <jhogan@kernel.org>

+2 -2

arch/arm/include/asm/arch_gicv3.h

@@ -55,7 +55,7 @@
 #define ICH_VTR __ACCESS_CP15(c12, 4, c11, 1)
 #define ICH_MISR __ACCESS_CP15(c12, 4, c11, 2)
 #define ICH_EISR __ACCESS_CP15(c12, 4, c11, 3)
-#define ICH_ELSR __ACCESS_CP15(c12, 4, c11, 5)
+#define ICH_ELRSR __ACCESS_CP15(c12, 4, c11, 5)
 #define ICH_VMCR __ACCESS_CP15(c12, 4, c11, 7)
 
 #define __LR0(x) __ACCESS_CP15(c12, 4, c12, x)
@@ -152,7 +152,7 @@
 CPUIF_MAP(ICH_VTR, ICH_VTR_EL2)
 CPUIF_MAP(ICH_MISR, ICH_MISR_EL2)
 CPUIF_MAP(ICH_EISR, ICH_EISR_EL2)
-CPUIF_MAP(ICH_ELSR, ICH_ELSR_EL2)
+CPUIF_MAP(ICH_ELRSR, ICH_ELRSR_EL2)
 CPUIF_MAP(ICH_VMCR, ICH_VMCR_EL2)
 CPUIF_MAP(ICH_AP0R3, ICH_AP0R3_EL2)
 CPUIF_MAP(ICH_AP0R2, ICH_AP0R2_EL2)

+8

arch/arm/include/asm/kvm_emulate.h

@@ -265,6 +265,14 @@
 	}
 }
 
+static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
+{
+	if (kvm_vcpu_trap_is_iabt(vcpu))
+		return false;
+
+	return kvm_vcpu_dabt_iswrite(vcpu);
+}
+
 static inline u32 kvm_vcpu_hvc_get_imm(struct kvm_vcpu *vcpu)
 {
 	return kvm_vcpu_get_hsr(vcpu) & HSR_HVC_IMM_MASK;

+46 -7

arch/arm/include/asm/kvm_host.h

@@ -26,6 +26,7 @@
 #include <asm/kvm_asm.h>
 #include <asm/kvm_mmio.h>
 #include <asm/fpstate.h>
+#include <asm/smp_plat.h>
 #include <kvm/arm_arch_timer.h>
 
 #define __KVM_HAVE_ARCH_INTC_INITIALIZED
@@ -58,10 +57,13 @@
 int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
 void kvm_reset_coprocs(struct kvm_vcpu *vcpu);
 
-struct kvm_arch {
-	/* VTTBR value associated with below pgd and vmid */
-	u64 vttbr;
+struct kvm_vmid {
+	/* The VMID generation used for the virt. memory system */
+	u64 vmid_gen;
+	u32 vmid;
+};
 
+struct kvm_arch {
 	/* The last vcpu id that ran on each physical CPU */
 	int __percpu *last_vcpu_ran;
 
@@ -74,11 +70,11 @@
 	 */
 
 	/* The VMID generation used for the virt. memory system */
-	u64 vmid_gen;
-	u32 vmid;
+	struct kvm_vmid vmid;
 
 	/* Stage-2 page table */
 	pgd_t *pgd;
+	phys_addr_t pgd_phys;
 
 	/* Interrupt controller */
 	struct vgic_dist vgic;
@@ -151,6 +147,13 @@
 };
 
 typedef struct kvm_cpu_context kvm_cpu_context_t;
+
+static inline void kvm_init_host_cpu_context(kvm_cpu_context_t *cpu_ctxt,
+					     int cpu)
+{
+	/* The host's MPIDR is immutable, so let's set it up at boot time */
+	cpu_ctxt->cp15[c0_MPIDR] = cpu_logical_map(cpu);
+}
 
 struct vcpu_reset_state {
 	unsigned long pc;
@@ -235,7 +224,35 @@
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
 int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
-unsigned long kvm_call_hyp(void *hypfn, ...);
+
+unsigned long __kvm_call_hyp(void *hypfn, ...);
+
+/*
+ * The has_vhe() part doesn't get emitted, but is used for type-checking.
+ */
+#define kvm_call_hyp(f, ...)						\
+	do {								\
+		if (has_vhe()) {					\
+			f(__VA_ARGS__);					\
+		} else {						\
+			__kvm_call_hyp(kvm_ksym_ref(f), ##__VA_ARGS__); \
+		}							\
+	} while(0)
+
+#define kvm_call_hyp_ret(f, ...)					\
+	({								\
+		typeof(f(__VA_ARGS__)) ret;				\
+									\
+		if (has_vhe()) {					\
+			ret = f(__VA_ARGS__);				\
+		} else {						\
+			ret = __kvm_call_hyp(kvm_ksym_ref(f),		\
+					     ##__VA_ARGS__);		\
+		}							\
+									\
+		ret;							\
+	})
+
 void force_vm_exit(const cpumask_t *mask);
 int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 			      struct kvm_vcpu_events *events);
@@ -314,7 +275,7 @@
 	 * compliant with the PCS!).
 	 */
 
-	kvm_call_hyp((void*)hyp_stack_ptr, vector_ptr, pgd_ptr);
+	__kvm_call_hyp((void*)hyp_stack_ptr, vector_ptr, pgd_ptr);
 }
 
 static inline void __cpu_init_stage2(void)

+4

arch/arm/include/asm/kvm_hyp.h

@@ -40,6 +40,7 @@
 #define TTBR1 __ACCESS_CP15_64(1, c2)
 #define VTTBR __ACCESS_CP15_64(6, c2)
 #define PAR __ACCESS_CP15_64(0, c7)
+#define CNTP_CVAL __ACCESS_CP15_64(2, c14)
 #define CNTV_CVAL __ACCESS_CP15_64(3, c14)
 #define CNTVOFF __ACCESS_CP15_64(4, c14)
 
@@ -86,6 +85,7 @@
 #define TID_PRIV __ACCESS_CP15(c13, 0, c0, 4)
 #define HTPIDR __ACCESS_CP15(c13, 4, c0, 2)
 #define CNTKCTL __ACCESS_CP15(c14, 0, c1, 0)
+#define CNTP_CTL __ACCESS_CP15(c14, 0, c2, 1)
 #define CNTV_CTL __ACCESS_CP15(c14, 0, c3, 1)
 #define CNTHCTL __ACCESS_CP15(c14, 4, c1, 0)
 
@@ -96,6 +94,8 @@
 #define read_sysreg_el0(r) read_sysreg(r##_el0)
 #define write_sysreg_el0(v, r) write_sysreg(v, r##_el0)
 
+#define cntp_ctl_el0 CNTP_CTL
+#define cntp_cval_el0 CNTP_CVAL
 #define cntv_ctl_el0 CNTV_CTL
 #define cntv_cval_el0 CNTV_CVAL
 #define cntvoff_el2 CNTVOFF

+7 -2

arch/arm/include/asm/kvm_mmu.h

@@ -421,9 +421,14 @@
 
 static inline void kvm_set_ipa_limit(void) {}
 
-static inline bool kvm_cpu_has_cnp(void)
+static __always_inline u64 kvm_get_vttbr(struct kvm *kvm)
 {
-	return false;
+	struct kvm_vmid *vmid = &kvm->arch.vmid;
+	u64 vmid_field, baddr;
+
+	baddr = kvm->arch.pgd_phys;
+	vmid_field = (u64)vmid->vmid << VTTBR_VMID_SHIFT;
+	return kvm_phys_to_vttbr(baddr) | vmid_field;
 }
 
 #endif /* !__ASSEMBLY__ */

+2 -3

arch/arm/kvm/Makefile

@@ -8,9 +8,8 @@
 plus_virt_def := -DREQUIRES_VIRT=1
 endif
 
-ccflags-y += -Iarch/arm/kvm -Ivirt/kvm/arm/vgic
-CFLAGS_arm.o := -I. $(plus_virt_def)
-CFLAGS_mmu.o := -I.
+ccflags-y += -I $(srctree)/$(src) -I $(srctree)/virt/kvm/arm/vgic
+CFLAGS_arm.o := $(plus_virt_def)
 
 AFLAGS_init.o := -Wa,-march=armv7-a$(plus_virt)
 AFLAGS_interrupts.o := -Wa,-march=armv7-a$(plus_virt)

+14 -9

arch/arm/kvm/coproc.c

@@ -293,15 +293,16 @@
 			   const struct coproc_params *p,
 			   const struct coproc_reg *r)
 {
-	u64 now = kvm_phys_timer_read();
-	u64 val;
+	u32 val;
 
 	if (p->is_write) {
 		val = *vcpu_reg(vcpu, p->Rt1);
-		kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL, val + now);
+		kvm_arm_timer_write_sysreg(vcpu,
+					   TIMER_PTIMER, TIMER_REG_TVAL, val);
 	} else {
-		val = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL);
-		*vcpu_reg(vcpu, p->Rt1) = val - now;
+		val = kvm_arm_timer_read_sysreg(vcpu,
+						TIMER_PTIMER, TIMER_REG_TVAL);
+		*vcpu_reg(vcpu, p->Rt1) = val;
 	}
 
 	return true;
@@ -316,9 +315,11 @@
 
 	if (p->is_write) {
 		val = *vcpu_reg(vcpu, p->Rt1);
-		kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CTL, val);
+		kvm_arm_timer_write_sysreg(vcpu,
+					   TIMER_PTIMER, TIMER_REG_CTL, val);
 	} else {
-		val = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CTL);
+		val = kvm_arm_timer_read_sysreg(vcpu,
+						TIMER_PTIMER, TIMER_REG_CTL);
 		*vcpu_reg(vcpu, p->Rt1) = val;
 	}
 
@@ -336,9 +333,11 @@
 	if (p->is_write) {
 		val = (u64)*vcpu_reg(vcpu, p->Rt2) << 32;
 		val |= *vcpu_reg(vcpu, p->Rt1);
-		kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL, val);
+		kvm_arm_timer_write_sysreg(vcpu,
+					   TIMER_PTIMER, TIMER_REG_CVAL, val);
 	} else {
-		val = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL);
+		val = kvm_arm_timer_read_sysreg(vcpu,
+						TIMER_PTIMER, TIMER_REG_CVAL);
 		*vcpu_reg(vcpu, p->Rt1) = val;
 		*vcpu_reg(vcpu, p->Rt2) = val >> 32;
 	}
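The removed lines converted between TVAL and CVAL inline, using `val + now` on write and `val - now` on read. The new code passes TIMER_REG_TVAL to the timer helpers instead. The sketch below shows the architectural arithmetic those helpers are expected to perform, treating TVAL as a signed 32-bit offset from the physical counter. It is a hedged illustration, and `tval_to_cval()` and `cval_to_tval()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Write path: CVAL = counter + SignExtend(TVAL). The sign extension is
 * written out explicitly to avoid implementation-defined conversions. */
static uint64_t tval_to_cval(uint32_t tval, uint64_t now)
{
	int64_t delta = (tval & UINT32_C(0x80000000))
			? (int64_t)tval - INT64_C(0x100000000)
			: (int64_t)tval;

	return now + (uint64_t)delta;   /* unsigned wrap-around is well defined */
}

/* Read path: TVAL is the low 32 bits of (CVAL - counter). */
static uint32_t cval_to_tval(uint64_t cval, uint64_t now)
{
	return (uint32_t)(cval - now);
}
```

The old inline conversion added the 32-bit register value to the counter without any sign extension, so it could not express a negative offset. A TVAL of 0xFFFFFFFF should mean "one tick ago", not roughly 2^32 ticks in the future.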

-1

arch/arm/kvm/hyp/cp15-sr.c

@@ -27,7 +27,6 @@
 
 void __hyp_text __sysreg_save_state(struct kvm_cpu_context *ctxt)
 {
-	ctxt->cp15[c0_MPIDR] = read_sysreg(VMPIDR);
 	ctxt->cp15[c0_CSSELR] = read_sysreg(CSSELR);
 	ctxt->cp15[c1_SCTLR] = read_sysreg(SCTLR);
 	ctxt->cp15[c1_CPACR] = read_sysreg(CPACR);

+1 -1

arch/arm/kvm/hyp/hyp-entry.S

@@ -176,7 +176,7 @@
 	msr spsr_cxsf, lr
 	ldr lr, =panic
 	msr ELR_hyp, lr
-	ldr lr, =kvm_call_hyp
+	ldr lr, =__kvm_call_hyp
 	clrex
 	eret
 ENDPROC(__hyp_do_panic)

+1 -1

arch/arm/kvm/hyp/switch.c

@@ -77,7 +77,7 @@
 static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = kern_hyp_va(vcpu->kvm);
-	write_sysreg(kvm->arch.vttbr, VTTBR);
+	write_sysreg(kvm_get_vttbr(kvm), VTTBR);
 	write_sysreg(vcpu->arch.midr, VPIDR);
 }
 

+2 -2

arch/arm/kvm/hyp/tlb.c

@@ -41,7 +41,7 @@
 
 	/* Switch to requested VMID */
 	kvm = kern_hyp_va(kvm);
-	write_sysreg(kvm->arch.vttbr, VTTBR);
+	write_sysreg(kvm_get_vttbr(kvm), VTTBR);
 	isb();
 
 	write_sysreg(0, TLBIALLIS);
@@ -61,7 +61,7 @@
 	struct kvm *kvm = kern_hyp_va(kern_hyp_va(vcpu)->kvm);
 
 	/* Switch to requested VMID */
-	write_sysreg(kvm->arch.vttbr, VTTBR);
+	write_sysreg(kvm_get_vttbr(kvm), VTTBR);
 	isb();
 
 	write_sysreg(0, TLBIALL);

+2 -2

arch/arm/kvm/interrupts.S

@@ -42,7 +42,7 @@
  * r12: caller save
  * rest: callee save
  */
-ENTRY(kvm_call_hyp)
+ENTRY(__kvm_call_hyp)
 	hvc #0
 	bx lr
-ENDPROC(kvm_call_hyp)
+ENDPROC(__kvm_call_hyp)

+12

arch/arm64/include/asm/kvm_emulate.h

@@ -77,6 +77,10 @@
 	 */
 	if (!vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TID3;
+
+	if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
+	    vcpu_el1_is_32bit(vcpu))
+		vcpu->arch.hcr_el2 |= HCR_TID2;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
@@ -333,6 +329,14 @@
 {
 	u32 esr = kvm_vcpu_get_hsr(vcpu);
 	return ESR_ELx_SYS64_ISS_RT(esr);
+}
+
+static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
+{
+	if (kvm_vcpu_trap_is_iabt(vcpu))
+		return false;
+
+	return kvm_vcpu_dabt_iswrite(vcpu);
 }
 
 static inline unsigned long kvm_vcpu_get_mpidr_aff(struct kvm_vcpu *vcpu)

+44 -4

arch/arm64/include/asm/kvm_host.h

@@ -31,6 +31,7 @@
 #include <asm/kvm.h>
 #include <asm/kvm_asm.h>
 #include <asm/kvm_mmio.h>
+#include <asm/smp_plat.h>
 #include <asm/thread_info.h>
 
 #define __KVM_HAVE_ARCH_INTC_INITIALIZED
@@ -59,16 +58,19 @@
 int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext);
 void __extended_idmap_trampoline(phys_addr_t boot_pgd, phys_addr_t idmap_start);
 
-struct kvm_arch {
+struct kvm_vmid {
 	/* The VMID generation used for the virt. memory system */
 	u64 vmid_gen;
 	u32 vmid;
+};
+
+struct kvm_arch {
+	struct kvm_vmid vmid;
 
 	/* stage2 entry level table */
 	pgd_t *pgd;
+	phys_addr_t pgd_phys;
 
-	/* VTTBR value associated with above pgd and vmid */
-	u64 vttbr;
 	/* VTCR_EL2 value for this VM */
 	u64 vtcr;
 
@@ -386,7 +382,36 @@
 void kvm_arm_resume_guest(struct kvm *kvm);
 
 u64 __kvm_call_hyp(void *hypfn, ...);
-#define kvm_call_hyp(f, ...) __kvm_call_hyp(kvm_ksym_ref(f), ##__VA_ARGS__)
+
+/*
+ * The couple of isb() below are there to guarantee the same behaviour
+ * on VHE as on !VHE, where the eret to EL1 acts as a context
+ * synchronization event.
+ */
+#define kvm_call_hyp(f, ...)						\
+	do {								\
+		if (has_vhe()) {					\
+			f(__VA_ARGS__);					\
+			isb();						\
+		} else {						\
+			__kvm_call_hyp(kvm_ksym_ref(f), ##__VA_ARGS__); \
+		}							\
+	} while(0)
+
+#define kvm_call_hyp_ret(f, ...)					\
+	({								\
+		typeof(f(__VA_ARGS__)) ret;				\
+									\
+		if (has_vhe()) {					\
+			ret = f(__VA_ARGS__);				\
+			isb();						\
+		} else {						\
+			ret = __kvm_call_hyp(kvm_ksym_ref(f),		\
+					     ##__VA_ARGS__);		\
+		}							\
+									\
+		ret;							\
+	})
 
 void force_vm_exit(const cpumask_t *mask);
 void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
@@ -433,6 +400,13 @@
 struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
 
 DECLARE_PER_CPU(kvm_cpu_context_t, kvm_host_cpu_state);
+
+static inline void kvm_init_host_cpu_context(kvm_cpu_context_t *cpu_ctxt,
+					     int cpu)
+{
+	/* The host's MPIDR is immutable, so let's set it up at boot time */
+	cpu_ctxt->sys_regs[MPIDR_EL1] = cpu_logical_map(cpu);
+}
 
 void __kvm_enable_ssbs(void);
 
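Both kvm_call_hyp() and kvm_call_hyp_ret() rely on two GNU C features. First, `typeof(f(__VA_ARGS__))` type-checks the call and fixes the return type without evaluating `f`. Second, the statement expression `({ ... })` yields `ret` as the macro's value.

The following user-space sketch reproduces that shape and builds with GCC or Clang. `vhe_enabled`, `fake_kvm_call_hyp()` and `hyp_add()` are hypothetical stand-ins. The non-VHE path only simulates the trampoline: the real `__kvm_call_hyp()` receives the function pointer and arguments and runs the function at EL2 via HVC.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for this sketch (not kernel APIs). */
static bool vhe_enabled;       /* plays the role of has_vhe() */
static int trampoline_calls;   /* counts the simulated HVC path */

static bool has_vhe(void)
{
	return vhe_enabled;
}

/* Stand-in for isb(): only a compiler barrier in user space. */
#define isb() __asm__ __volatile__("" ::: "memory")

/* Simulated trampoline: records that the non-VHE path was taken. */
static unsigned long fake_kvm_call_hyp(unsigned long result)
{
	trampoline_calls++;
	return result;
}

/* Same shape as kvm_call_hyp_ret(): __typeof__ is not evaluated, so it
 * only type-checks f(args) and picks the type of the result. */
#define call_hyp_ret(f, ...)						\
	({								\
		__typeof__(f(__VA_ARGS__)) ret;				\
		if (has_vhe()) {					\
			ret = f(__VA_ARGS__);				\
			isb();						\
		} else {						\
			ret = fake_kvm_call_hyp(f(__VA_ARGS__));	\
		}							\
		ret;							\
	})

/* Example "hyp" function used to exercise both paths. */
static int hyp_add(int a, int b)
{
	return a + b;
}
```

With `vhe_enabled` set, `call_hyp_ret(hyp_add, 2, 3)` calls the function directly and then issues the barrier. With it cleared, the call is routed through the simulated trampoline. In both cases the caller gets an `int` back, because the result type comes from `__typeof__`.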

+2 -1

arch/arm64/include/asm/kvm_hyp.h

@@ -21,6 +21,7 @@
 #include <linux/compiler.h>
 #include <linux/kvm_host.h>
 #include <asm/alternative.h>
+#include <asm/kvm_mmu.h>
 #include <asm/sysreg.h>
 
 #define __hyp_text __section(.hyp.text) notrace
@@ -164,7 +163,7 @@
 static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
 {
 	write_sysreg(kvm->arch.vtcr, vtcr_el2);
-	write_sysreg(kvm->arch.vttbr, vttbr_el2);
+	write_sysreg(kvm_get_vttbr(kvm), vttbr_el2);
 
 	/*
 	 * ARM erratum 1165522 requires the actual execution of the above

+10 -3

arch/arm64/include/asm/kvm_mmu.h

@@ -138,7 +138,8 @@
 })
 
 /*
- * We currently only support a 40bit IPA.
+ * We currently support using a VM-specified IPA size. For backward
+ * compatibility, the default IPA size is fixed to 40bits.
  */
 #define KVM_PHYS_SHIFT (40)
 
@@ -592,9 +591,15 @@
 	return vttbr_baddr_mask(kvm_phys_shift(kvm), kvm_stage2_levels(kvm));
 }
 
-static inline bool kvm_cpu_has_cnp(void)
+static __always_inline u64 kvm_get_vttbr(struct kvm *kvm)
 {
-	return system_supports_cnp();
+	struct kvm_vmid *vmid = &kvm->arch.vmid;
+	u64 vmid_field, baddr;
+	u64 cnp = system_supports_cnp() ? VTTBR_CNP_BIT : 0;
+
+	baddr = kvm->arch.pgd_phys;
+	vmid_field = (u64)vmid->vmid << VTTBR_VMID_SHIFT;
+	return kvm_phys_to_vttbr(baddr) | vmid_field | cnp;
 }
 
 #endif /* __ASSEMBLY__ */
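With `vttbr` dropped from `struct kvm_arch`, the VTTBR value is assembled at each use from three parts: `pgd_phys`, the VMID and, on arm64, the CnP bit. The sketch below shows that composition. It assumes a physical address below 2^48, so the address conversion is the identity, and uses hypothetical `*_model` names. The VMID shift of 48 and the CnP bit in position 0 mirror the arm64 definitions used in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VTTBR_VMID_SHIFT 48
#define VTTBR_CNP_BIT    UINT64_C(1)

/* Mirrors struct kvm_vmid from the diff. */
struct vmid_model {
	uint64_t vmid_gen;
	uint32_t vmid;
};

/* Assumption: PA below 2^48, so this is the identity; the real
 * kvm_phys_to_vttbr() also handles 52-bit physical addresses. */
static uint64_t phys_to_vttbr_model(uint64_t pa)
{
	return pa;
}

/* Builds the VTTBR on demand, in the same way as kvm_get_vttbr(). */
static uint64_t get_vttbr_model(uint64_t pgd_phys,
				const struct vmid_model *vmid, bool has_cnp)
{
	uint64_t vmid_field = (uint64_t)vmid->vmid << VTTBR_VMID_SHIFT;
	uint64_t cnp = has_cnp ? VTTBR_CNP_BIT : 0;

	return phys_to_vttbr_model(pgd_phys) | vmid_field | cnp;
}
```

One plausible benefit of computing the value at each use, rather than caching it, is that the register contents cannot fall out of step with the VMID currently recorded in `struct kvm_vmid`.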

+6 -1

arch/arm64/include/asm/sysreg.h

@@ -361,6 +361,7 @@
 
 #define SYS_CNTKCTL_EL1 sys_reg(3, 0, 14, 1, 0)
 
+#define SYS_CCSIDR_EL1 sys_reg(3, 1, 0, 0, 0)
 #define SYS_CLIDR_EL1 sys_reg(3, 1, 0, 0, 1)
 #define SYS_AIDR_EL1 sys_reg(3, 1, 0, 0, 7)
 
@@ -392,6 +391,10 @@
 #define SYS_CNTP_TVAL_EL0 sys_reg(3, 3, 14, 2, 0)
 #define SYS_CNTP_CTL_EL0 sys_reg(3, 3, 14, 2, 1)
 #define SYS_CNTP_CVAL_EL0 sys_reg(3, 3, 14, 2, 2)
+
+#define SYS_AARCH32_CNTP_TVAL sys_reg(0, 0, 14, 2, 0)
+#define SYS_AARCH32_CNTP_CTL sys_reg(0, 0, 14, 2, 1)
+#define SYS_AARCH32_CNTP_CVAL sys_reg(0, 2, 0, 14, 0)
 
 #define __PMEV_op2(n) ((n) & 0x7)
 #define __CNTR_CRm(n) (0x8 | (((n) >> 3) & 0x3))
@@ -431,7 +426,7 @@
 #define SYS_ICH_VTR_EL2 sys_reg(3, 4, 12, 11, 1)
 #define SYS_ICH_MISR_EL2 sys_reg(3, 4, 12, 11, 2)
 #define SYS_ICH_EISR_EL2 sys_reg(3, 4, 12, 11, 3)
-#define SYS_ICH_ELSR_EL2 sys_reg(3, 4, 12, 11, 5)
+#define SYS_ICH_ELRSR_EL2 sys_reg(3, 4, 12, 11, 5)
 #define SYS_ICH_VMCR_EL2 sys_reg(3, 4, 12, 11, 7)
 
 #define __SYS__LR0_EL2(x) sys_reg(3, 4, 12, 12, x)

+1 -3

arch/arm64/kvm/Makefile

@@ -3,9 +3,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-ccflags-y += -Iarch/arm64/kvm -Ivirt/kvm/arm/vgic
-CFLAGS_arm.o := -I.
-CFLAGS_mmu.o := -I.
+ccflags-y += -I $(srctree)/$(src) -I $(srctree)/virt/kvm/arm/vgic
 
 KVM=../../../virt/kvm
 

+1 -1

arch/arm64/kvm/debug.c

@@ -76,7 +76,7 @@
 
 void kvm_arm_init_debug(void)
 {
-	__this_cpu_write(mdcr_el2, kvm_call_hyp(__kvm_get_mdcr_el2));
+	__this_cpu_write(mdcr_el2, kvm_call_hyp_ret(__kvm_get_mdcr_el2));
 }
 
 /**

-3

arch/arm64/kvm/hyp.S

@@ -40,9 +40,6 @@
  * arch/arm64/kernel/hyp_stub.S.
  */
 ENTRY(__kvm_call_hyp)
-alternative_if_not ARM64_HAS_VIRT_HOST_EXTN
 	hvc #0
 	ret
-alternative_else_nop_endif
-	b __vhe_hyp_call
 ENDPROC(__kvm_call_hyp)

-12

arch/arm64/kvm/hyp/hyp-entry.S

@@ -43,18 +43,6 @@
 	ldr lr, [sp], #16
 .endm
 
-ENTRY(__vhe_hyp_call)
-	do_el2_call
-	/*
-	 * We used to rely on having an exception return to get
-	 * an implicit isb. In the E2H case, we don't have it anymore.
-	 * rather than changing all the leaf functions, just do it here
-	 * before returning to the rest of the kernel.
-	 */
-	isb
-	ret
-ENDPROC(__vhe_hyp_call)
-
 el1_sync: // Guest trapped into EL2
 
 	mrs x0, esr_el2

-1

arch/arm64/kvm/hyp/sysreg-sr.c

@@ -53,7 +53,6 @@
 
 static void __hyp_text __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
 {
-	ctxt->sys_regs[MPIDR_EL1] = read_sysreg(vmpidr_el2);
 	ctxt->sys_regs[CSSELR_EL1] = read_sysreg(csselr_el1);
 	ctxt->sys_regs[SCTLR_EL1] = read_sysreg_el1(sctlr);
 	ctxt->sys_regs[ACTLR_EL1] = read_sysreg(actlr_el1);

+112 -56

arch/arm64/kvm/sys_regs.c

@@ -982,6 +982,10 @@
 	return true;
 }
 
+#define reg_to_encoding(x)						\
+	sys_reg((u32)(x)->Op0, (u32)(x)->Op1,				\
+		(u32)(x)->CRn, (u32)(x)->CRm, (u32)(x)->Op2);
+
 /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
 #define DBG_BCR_BVR_WCR_WVR_EL1(n)					\
 	{ SYS_DESC(SYS_DBGBVRn_EL1(n)),					\
@@ -1007,44 +1003,38 @@
 	{ SYS_DESC(SYS_PMEVTYPERn_EL0(n)),				\
 	  access_pmu_evtyper, reset_unknown, (PMEVTYPER0_EL0 + n), }
 
-static bool access_cntp_tval(struct kvm_vcpu *vcpu,
-		struct sys_reg_params *p,
-		const struct sys_reg_desc *r)
+static bool access_arch_timer(struct kvm_vcpu *vcpu,
+			      struct sys_reg_params *p,
+			      const struct sys_reg_desc *r)
 {
-	u64 now = kvm_phys_timer_read();
-	u64 cval;
+	enum kvm_arch_timers tmr;
+	enum kvm_arch_timer_regs treg;
+	u64 reg = reg_to_encoding(r);
 
-	if (p->is_write) {
-		kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL,
-				      p->regval + now);
-	} else {
-		cval = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL);
-		p->regval = cval - now;
+	switch (reg) {
+	case SYS_CNTP_TVAL_EL0:
+	case SYS_AARCH32_CNTP_TVAL:
+		tmr = TIMER_PTIMER;
+		treg = TIMER_REG_TVAL;
+		break;
+	case SYS_CNTP_CTL_EL0:
+	case SYS_AARCH32_CNTP_CTL:
+		tmr = TIMER_PTIMER;
+		treg = TIMER_REG_CTL;
+		break;
+	case SYS_CNTP_CVAL_EL0:
+	case SYS_AARCH32_CNTP_CVAL:
+		tmr = TIMER_PTIMER;
+		treg = TIMER_REG_CVAL;
+		break;
+	default:
+		BUG();
 	}
 
-	return true;
-}
-
-static bool access_cntp_ctl(struct kvm_vcpu *vcpu,
-		struct sys_reg_params *p,
-		const struct sys_reg_desc *r)
-{
 	if (p->is_write)
-		kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CTL, p->regval);
+		kvm_arm_timer_write_sysreg(vcpu, tmr, treg, p->regval);
 	else
-		p->regval = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CTL);
-
-	return true;
-}
-
-static bool access_cntp_cval(struct kvm_vcpu *vcpu,
-		struct sys_reg_params *p,
-		const struct sys_reg_desc *r)
-{
-	if (p->is_write)
-		kvm_arm_timer_set_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL, p->regval);
-	else
-		p->regval = kvm_arm_timer_get_reg(vcpu, KVM_REG_ARM_PTIMER_CVAL);
+		p->regval = kvm_arm_timer_read_sysreg(vcpu, tmr, treg);
 
 	return true;
 }
@@ -1156,6 +1158,64 @@
 		      const struct kvm_one_reg *reg, void __user *uaddr)
 {
 	return __set_id_reg(rd, uaddr, true);
+}
+
+static bool access_ctr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+		       const struct sys_reg_desc *r)
+{
+	if (p->is_write)
+		return write_to_read_only(vcpu, p, r);
+
+	p->regval = read_sanitised_ftr_reg(SYS_CTR_EL0);
+	return true;
+}
+
+static bool access_clidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+			 const struct sys_reg_desc *r)
+{
+	if (p->is_write)
+		return write_to_read_only(vcpu, p, r);
+
+	p->regval = read_sysreg(clidr_el1);
+	return true;
+}
+
+static bool access_csselr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+			  const struct sys_reg_desc *r)
+{
+	if (p->is_write)
+		vcpu_write_sys_reg(vcpu, p->regval, r->reg);
+	else
+		p->regval = vcpu_read_sys_reg(vcpu, r->reg);
+	return true;
+}
+
+static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+			  const struct sys_reg_desc *r)
+{
+	u32 csselr;
+
+	if (p->is_write)
+		return write_to_read_only(vcpu, p, r);
+
+	csselr = vcpu_read_sys_reg(vcpu, CSSELR_EL1);
+	p->regval = get_ccsidr(csselr);
+
+	/*
+	 * Guests should not be doing cache operations by set/way at all, and
+	 * for this reason, we trap them and attempt to infer the intent, so
+	 * that we can flush the entire guest's address space at the appropriate
+	 * time.
+	 * To prevent this trapping from causing performance problems, let's
+	 * expose the geometry of all data and unified caches (which are
+	 * guaranteed to be PIPT and thus non-aliasing) as 1 set and 1 way.
+	 * [If guests should attempt to infer aliasing properties from the
+	 * geometry (which is not permitted by the architecture), they would
+	 * only do so for virtually indexed caches.]
+	 */
+	if (!(csselr & 1)) // data or unified cache
+		p->regval &= ~GENMASK(27, 3);
+	return true;
 }
 
 /* sys_reg_desc initialiser for known cpufeature ID registers */
@@ -1433,7 +1377,10 @@
 
 	{ SYS_DESC(SYS_CNTKCTL_EL1), NULL, reset_val, CNTKCTL_EL1, 0},
 
-	{ SYS_DESC(SYS_CSSELR_EL1), NULL, reset_unknown, CSSELR_EL1 },
+	{ SYS_DESC(SYS_CCSIDR_EL1), access_ccsidr },
+	{ SYS_DESC(SYS_CLIDR_EL1), access_clidr },
+	{ SYS_DESC(SYS_CSSELR_EL1), access_csselr, reset_unknown, CSSELR_EL1 },
+	{ SYS_DESC(SYS_CTR_EL0), access_ctr },
 
 	{ SYS_DESC(SYS_PMCR_EL0), access_pmcr, reset_pmcr, },
 	{ SYS_DESC(SYS_PMCNTENSET_EL0), access_pmcnten, reset_unknown, PMCNTENSET_EL0 },
@@ -1459,9 +1400,9 @@
 	{ SYS_DESC(SYS_TPIDR_EL0), NULL, reset_unknown, TPIDR_EL0 },
 	{ SYS_DESC(SYS_TPIDRRO_EL0), NULL, reset_unknown, TPIDRRO_EL0 },
 
-	{ SYS_DESC(SYS_CNTP_TVAL_EL0), access_cntp_tval },
-	{ SYS_DESC(SYS_CNTP_CTL_EL0), access_cntp_ctl },
-	{ SYS_DESC(SYS_CNTP_CVAL_EL0), access_cntp_cval },
+	{ SYS_DESC(SYS_CNTP_TVAL_EL0), access_arch_timer },
+	{ SYS_DESC(SYS_CNTP_CTL_EL0), access_arch_timer },
+	{ SYS_DESC(SYS_CNTP_CVAL_EL0), access_arch_timer },
 
 	/* PMEVCNTRn_EL0 */
 	PMU_PMEVCNTR_EL0(0),
@@ -1535,7 +1476,7 @@
 
 	{ SYS_DESC(SYS_DACR32_EL2), NULL, reset_unknown, DACR32_EL2 },
 	{ SYS_DESC(SYS_IFSR32_EL2), NULL, reset_unknown, IFSR32_EL2 },
-	{ SYS_DESC(SYS_FPEXC32_EL2), NULL, reset_val, FPEXC32_EL2, 0x70 },
+	{ SYS_DESC(SYS_FPEXC32_EL2), NULL, reset_val, FPEXC32_EL2, 0x700 },
 };
 
 static bool trap_dbgidr(struct kvm_vcpu *vcpu,
@@ -1736,6 +1677,7 @@
  * register).
  */
 static const struct sys_reg_desc cp15_regs[] = {
+	{ Op1( 0), CRn( 0), CRm( 0), Op2( 1), access_ctr },
 	{ Op1( 0), CRn( 1), CRm( 0), Op2( 0), access_vm_reg, NULL, c1_SCTLR },
 	{ Op1( 0), CRn( 2), CRm( 0), Op2( 0), access_vm_reg, NULL, c2_TTBR0 },
 	{ Op1( 0), CRn( 2), CRm( 0), Op2( 1), access_vm_reg, NULL, c2_TTBR1 },
@@ -1783,10 +1723,9 @@
 
 	{ Op1( 0), CRn(13), CRm( 0), Op2( 1), access_vm_reg, NULL, c13_CID },
 
-	/* CNTP_TVAL */
-	{ Op1( 0), CRn(14), CRm( 2), Op2( 0), access_cntp_tval },
-	/* CNTP_CTL */
-	{ Op1( 0), CRn(14), CRm( 2), Op2( 1), access_cntp_ctl },
+	/* Arch Tmers */
+	{ SYS_DESC(SYS_AARCH32_CNTP_TVAL), access_arch_timer },
+	{ SYS_DESC(SYS_AARCH32_CNTP_CTL), access_arch_timer },
 
 	/* PMEVCNTRn */
 	PMU_PMEVCNTR(0),
@@ -1853,6 +1794,10 @@
 	PMU_PMEVTYPER(30),
 	/* PMCCFILTR */
 	{ Op1(0), CRn(14), CRm(15), Op2(7), access_pmu_evtyper },
+
+	{ Op1(1), CRn( 0), CRm( 0), Op2(0), access_ccsidr },
+	{ Op1(1), CRn( 0), CRm( 0), Op2(1), access_clidr },
+	{ Op1(2), CRn( 0), CRm( 0), Op2(0), access_csselr, NULL, c0_CSSELR },
 };
 
 static const struct sys_reg_desc cp15_64_regs[] = {
@@ -1866,7 +1803,7 @@
 	{ Op1( 1), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR1 },
 	{ Op1( 1), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, /* ICC_ASGI1R */
 	{ Op1( 2), CRn( 0), CRm(12), Op2( 0), access_gic_sgi }, /* ICC_SGI0R */
-	{ Op1( 2), CRn( 0), CRm(14), Op2( 0), access_cntp_cval },
+	{ SYS_DESC(SYS_AARCH32_CNTP_CVAL), access_arch_timer },
 };
 
 /* Target specific emulation tables */
@@ -1895,30 +1832,19 @@
 	}
 }
 
-#define reg_to_match_value(x)						\
-	({								\
-		unsigned long val;					\
-		val = (x)->Op0 << 14;					\
-		val |= (x)->Op1 << 11;					\
-		val |= (x)->CRn << 7;					\
-		val |= (x)->CRm << 3;					\
-		val |= (x)->Op2;					\
-		val;							\
-	 })
-
 static int match_sys_reg(const void *key, const void *elt)
 {
 	const unsigned long pval = (unsigned long)key;
 	const struct sys_reg_desc *r = elt;
 
-	return pval - reg_to_match_value(r);
+	return pval - reg_to_encoding(r);
 }
 
 static const struct sys_reg_desc *find_reg(const struct sys_reg_params *params,
 					 const struct sys_reg_desc table[],
 					 unsigned int num)
 {
-	unsigned long pval = reg_to_match_value(params);
+	unsigned long pval = reg_to_encoding(params);
 
 	return bsearch((void *)pval, table, num, sizeof(table[0]), match_sys_reg);
 }
@@ -2270,10 +2218,14 @@
 }
 
 FUNCTION_INVARIANT(midr_el1)
-FUNCTION_INVARIANT(ctr_el0)
 FUNCTION_INVARIANT(revidr_el1)
 FUNCTION_INVARIANT(clidr_el1)
 FUNCTION_INVARIANT(aidr_el1)
+
+static void get_ctr_el0(struct kvm_vcpu *v, const struct sys_reg_desc *r)
+{
+	((struct sys_reg_desc *)r)->val = read_sanitised_ftr_reg(SYS_CTR_EL0);
+}
 
 /* ->val is filled in by kvm_sys_reg_table_init() */
 static struct sys_reg_desc invariant_sys_regs[] = {
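`reg_to_encoding()` replaces the local `reg_to_match_value()` packing with the architectural `sys_reg()` encoding. One value can then serve both the `switch` in `access_arch_timer()` and the `bsearch()` lookup in `find_reg()`. The sketch below shows that lookup in isolation. The shifts follow the arm64 `sys_reg()` macro (Op0 at bit 19, Op1 at 16, CRn at 12, CRm at 8, Op2 at 5), while `struct reg_desc`, the trimmed table and the helper names are hypothetical. As with the kernel tables, the entries must be sorted by encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Field layout of the arm64 sys_reg() encoding. */
#define sys_reg(op0, op1, crn, crm, op2)				\
	(((uint32_t)(op0) << 19) | ((uint32_t)(op1) << 16) |		\
	 ((uint32_t)(crn) << 12) | ((uint32_t)(crm) << 8) |		\
	 ((uint32_t)(op2) << 5))

/* Hypothetical, trimmed-down descriptor: encoding fields plus a name. */
struct reg_desc {
	uint8_t Op0, Op1, CRn, CRm, Op2;
	const char *name;
};

#define reg_to_encoding(x) \
	sys_reg((x)->Op0, (x)->Op1, (x)->CRn, (x)->CRm, (x)->Op2)

/* Sorted by ascending encoding, as bsearch() requires. */
static const struct reg_desc table[] = {
	{ 3, 0, 14, 1, 0, "CNTKCTL_EL1" },
	{ 3, 1,  0, 0, 0, "CCSIDR_EL1" },
	{ 3, 1,  0, 0, 1, "CLIDR_EL1" },
	{ 3, 2,  0, 0, 0, "CSSELR_EL1" },
	{ 3, 3,  0, 0, 1, "CTR_EL0" },
	{ 3, 3, 14, 2, 0, "CNTP_TVAL_EL0" },
};

/* Three-way comparison between the key and a table entry's encoding. */
static int match_reg(const void *key, const void *elt)
{
	uint32_t k = (uint32_t)(uintptr_t)key;
	uint32_t e = reg_to_encoding((const struct reg_desc *)elt);

	return (k > e) - (k < e);
}

static const struct reg_desc *find_reg(uint32_t encoding)
{
	return bsearch((void *)(uintptr_t)encoding, table,
		       sizeof(table) / sizeof(table[0]), sizeof(table[0]),
		       match_reg);
}
```

Like the kernel's `find_reg()`, the sketch passes the encoding itself as the `bsearch()` key. The kernel's comparator returns `pval - reg_to_encoding(r)` directly, which is safe because encodings fit in 21 bits. The explicit three-way comparison here simply avoids relying on that.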

+1 -1

arch/mips/include/asm/kvm_host.h

··· 1134 1134 static inline void kvm_arch_sync_events(struct kvm *kvm) {} 1135 1135 static inline void kvm_arch_free_memslot(struct kvm *kvm, 1136 1136 struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {} 1137 - static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {} 1137 + static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} 1138 1138 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} 1139 1139 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} 1140 1140 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

+4 -1

arch/powerpc/include/asm/kvm_host.h

··· 99 99 100 100 struct kvm_vm_stat { 101 101 ulong remote_tlb_flush; 102 + ulong num_2M_pages; 103 + ulong num_1G_pages; 102 104 }; 103 105 104 106 struct kvm_vcpu_stat { ··· 379 377 void (*slbmte)(struct kvm_vcpu *vcpu, u64 rb, u64 rs); 380 378 u64 (*slbmfee)(struct kvm_vcpu *vcpu, u64 slb_nr); 381 379 u64 (*slbmfev)(struct kvm_vcpu *vcpu, u64 slb_nr); 380 + int (*slbfee)(struct kvm_vcpu *vcpu, gva_t eaddr, ulong *ret_slb); 382 381 void (*slbie)(struct kvm_vcpu *vcpu, u64 slb_nr); 383 382 void (*slbia)(struct kvm_vcpu *vcpu); 384 383 /* book3s */ ··· 840 837 static inline void kvm_arch_hardware_disable(void) {} 841 838 static inline void kvm_arch_hardware_unsetup(void) {} 842 839 static inline void kvm_arch_sync_events(struct kvm *kvm) {} 843 - static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {} 840 + static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} 844 841 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {} 845 842 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} 846 843 static inline void kvm_arch_exit(void) {}

+14

arch/powerpc/include/asm/kvm_ppc.h

··· 36 36 #endif 37 37 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER 38 38 #include <asm/paca.h> 39 + #include <asm/xive.h> 40 + #include <asm/cpu_has_feature.h> 39 41 #endif 40 42 41 43 /* ··· 618 616 int level, bool line_status) { return -ENODEV; } 619 617 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { } 620 618 #endif /* CONFIG_KVM_XIVE */ 619 + 620 + #if defined(CONFIG_PPC_POWERNV) && defined(CONFIG_KVM_BOOK3S_64_HANDLER) 621 + static inline bool xics_on_xive(void) 622 + { 623 + return xive_enabled() && cpu_has_feature(CPU_FTR_HVMODE); 624 + } 625 + #else 626 + static inline bool xics_on_xive(void) 627 + { 628 + return false; 629 + } 630 + #endif 621 631 622 632 /* 623 633 * Prototypes for functions called only from assembler code.

+2

arch/powerpc/include/uapi/asm/kvm.h

··· 463 463 #define KVM_PPC_CPU_CHAR_BR_HINT_HONOURED (1ULL << 58) 464 464 #define KVM_PPC_CPU_CHAR_MTTRIG_THR_RECONF (1ULL << 57) 465 465 #define KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS (1ULL << 56) 466 + #define KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST (1ull << 54) 466 467 467 468 #define KVM_PPC_CPU_BEHAV_FAVOUR_SECURITY (1ULL << 63) 468 469 #define KVM_PPC_CPU_BEHAV_L1D_FLUSH_PR (1ULL << 62) 469 470 #define KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR (1ULL << 61) 471 + #define KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE (1ull << 58) 470 472 471 473 /* Per-vcpu XICS interrupt controller state */ 472 474 #define KVM_REG_PPC_ICP_STATE (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x8c)

+8 -5

arch/powerpc/kvm/book3s.c

··· 39 39 #include "book3s.h" 40 40 #include "trace.h" 41 41 42 + #define VM_STAT(x) offsetof(struct kvm, stat.x), KVM_STAT_VM 42 43 #define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU 43 44 44 45 /* #define EXIT_DEBUG */ ··· 72 71 { "pthru_all", VCPU_STAT(pthru_all) }, 73 72 { "pthru_host", VCPU_STAT(pthru_host) }, 74 73 { "pthru_bad_aff", VCPU_STAT(pthru_bad_aff) }, 74 + { "largepages_2M", VM_STAT(num_2M_pages) }, 75 + { "largepages_1G", VM_STAT(num_1G_pages) }, 75 76 { NULL } 76 77 }; 77 78 ··· 645 642 r = -ENXIO; 646 643 break; 647 644 } 648 - if (xive_enabled()) 645 + if (xics_on_xive()) 649 646 *val = get_reg_val(id, kvmppc_xive_get_icp(vcpu)); 650 647 else 651 648 *val = get_reg_val(id, kvmppc_xics_get_icp(vcpu)); ··· 718 715 r = -ENXIO; 719 716 break; 720 717 } 721 - if (xive_enabled()) 718 + if (xics_on_xive()) 722 719 r = kvmppc_xive_set_icp(vcpu, set_reg_val(id, *val)); 723 720 else 724 721 r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, *val)); ··· 994 991 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level, 995 992 bool line_status) 996 993 { 997 - if (xive_enabled()) 994 + if (xics_on_xive()) 998 995 return kvmppc_xive_set_irq(kvm, irq_source_id, irq, level, 999 996 line_status); 1000 997 else ··· 1047 1044 1048 1045 #ifdef CONFIG_KVM_XICS 1049 1046 #ifdef CONFIG_KVM_XIVE 1050 - if (xive_enabled()) { 1047 + if (xics_on_xive()) { 1051 1048 kvmppc_xive_init_module(); 1052 1049 kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS); 1053 1050 } else ··· 1060 1057 static void kvmppc_book3s_exit(void) 1061 1058 { 1062 1059 #ifdef CONFIG_KVM_XICS 1063 - if (xive_enabled()) 1060 + if (xics_on_xive()) 1064 1061 kvmppc_xive_exit_module(); 1065 1062 #endif 1066 1063 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
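The `VM_STAT()` macro added to book3s.c expands to two initializers: a byte offset into `struct kvm` and a kind tag. This lets the generic stats code read each counter by offset. Below is a minimal sketch of that pattern. `struct demo_kvm`, `DEMO_VM_STAT()` and `demo_read_stat()` are hypothetical and only mirror the shape of the `{ name, offset, kind }` entries in the table above.

```c
#include <assert.h>
#include <stddef.h>

enum demo_stat_kind { DEMO_STAT_VM, DEMO_STAT_VCPU };

/* Hypothetical stand-ins for struct kvm_vm_stat and struct kvm. */
struct demo_vm_stat {
	unsigned long remote_tlb_flush;
	unsigned long num_2M_pages;
	unsigned long num_1G_pages;
};

struct demo_kvm {
	int other_state;
	struct demo_vm_stat stat;
};

/* Expands to two initializers: the field offset and its kind. */
#define DEMO_VM_STAT(x) offsetof(struct demo_kvm, stat.x), DEMO_STAT_VM

struct demo_stats_item {
	const char *name;
	size_t offset;
	enum demo_stat_kind kind;
};

static const struct demo_stats_item demo_items[] = {
	{ "remote_tlb_flush", DEMO_VM_STAT(remote_tlb_flush) },
	{ "largepages_2M", DEMO_VM_STAT(num_2M_pages) },
	{ "largepages_1G", DEMO_VM_STAT(num_1G_pages) },
	{ NULL }	/* sentinel, like the { NULL } closing the kernel table */
};

/* Read a VM-wide counter through its recorded offset. */
static unsigned long demo_read_stat(const struct demo_kvm *kvm,
				    const struct demo_stats_item *item)
{
	assert(item->kind == DEMO_STAT_VM);
	return *(const unsigned long *)((const char *)kvm + item->offset);
}
```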

+1

arch/powerpc/kvm/book3s_32_mmu.c

··· 425 425 mmu->slbmte = NULL; 426 426 mmu->slbmfee = NULL; 427 427 mmu->slbmfev = NULL; 428 + mmu->slbfee = NULL; 428 429 mmu->slbie = NULL; 429 430 mmu->slbia = NULL; 430 431 }

+14

arch/powerpc/kvm/book3s_64_mmu.c

··· 435 435 kvmppc_mmu_map_segment(vcpu, esid << SID_SHIFT); 436 436 } 437 437 438 + static int kvmppc_mmu_book3s_64_slbfee(struct kvm_vcpu *vcpu, gva_t eaddr, 439 + ulong *ret_slb) 440 + { 441 + struct kvmppc_slb *slbe = kvmppc_mmu_book3s_64_find_slbe(vcpu, eaddr); 442 + 443 + if (slbe) { 444 + *ret_slb = slbe->origv; 445 + return 0; 446 + } 447 + *ret_slb = 0; 448 + return -ENOENT; 449 + } 450 + 438 451 static u64 kvmppc_mmu_book3s_64_slbmfee(struct kvm_vcpu *vcpu, u64 slb_nr) 439 452 { 440 453 struct kvmppc_slb *slbe; ··· 683 670 mmu->slbmte = kvmppc_mmu_book3s_64_slbmte; 684 671 mmu->slbmfee = kvmppc_mmu_book3s_64_slbmfee; 685 672 mmu->slbmfev = kvmppc_mmu_book3s_64_slbmfev; 673 + mmu->slbfee = kvmppc_mmu_book3s_64_slbfee; 686 674 mmu->slbie = kvmppc_mmu_book3s_64_slbie; 687 675 mmu->slbia = kvmppc_mmu_book3s_64_slbia; 688 676 mmu->xlate = kvmppc_mmu_book3s_64_xlate;

+18

arch/powerpc/kvm/book3s_64_mmu_hv.c

··· 442 442 u32 last_inst; 443 443 444 444 /* 445 + * Fast path - check if the guest physical address corresponds to a 446 + * device on the FAST_MMIO_BUS, if so we can avoid loading the 447 + * instruction all together, then we can just handle it and return. 448 + */ 449 + if (is_store) { 450 + int idx, ret; 451 + 452 + idx = srcu_read_lock(&vcpu->kvm->srcu); 453 + ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, (gpa_t) gpa, 0, 454 + NULL); 455 + srcu_read_unlock(&vcpu->kvm->srcu, idx); 456 + if (!ret) { 457 + kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4); 458 + return RESUME_GUEST; 459 + } 460 + } 461 + 462 + /* 445 463 * If we fail, we just return to the guest and try executing it again. 446 464 */ 447 465 if (kvmppc_get_last_inst(vcpu, INST_GENERIC, &last_inst) !=
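The new fast path in book3s_64_mmu_hv.c tries a zero-length write on `KVM_FAST_MMIO_BUS` before fetching the faulting instruction. If a device (for example an ioeventfd registered on that bus) claims the address, the store is complete, and the guest PC only needs to step over the 4-byte instruction. The sketch below models that decision in userspace. The flat address array that stands in for the bus and every `demo_` name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_resume { DEMO_RESUME_GUEST, DEMO_SLOW_PATH };

/* Hypothetical stand-in for KVM_FAST_MMIO_BUS: registered guest addresses. */
struct demo_bus {
	const uint64_t *addrs;
	size_t n;
};

/* Mirrors the kvm_io_bus_write() convention: 0 means a device handled it. */
static int demo_bus_write(const struct demo_bus *bus, uint64_t gpa)
{
	for (size_t i = 0; i < bus->n; i++)
		if (bus->addrs[i] == gpa)
			return 0;
	return -1;
}

static enum demo_resume demo_mmio_fault(const struct demo_bus *fast_bus,
					uint64_t gpa, int is_store,
					uint64_t *pc)
{
	assert(fast_bus != NULL && pc != NULL);
	/* Only stores take the fast path, as in the hunk above. */
	if (is_store && demo_bus_write(fast_bus, gpa) == 0) {
		*pc += 4;	/* skip the fixed-size instruction */
		return DEMO_RESUME_GUEST;
	}
	/* Otherwise fall back to loading and emulating the instruction. */
	return DEMO_SLOW_PATH;
}
```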

+14 -1

arch/powerpc/kvm/book3s_64_mmu_radix.c

··· 403 403 if (!memslot) 404 404 return; 405 405 } 406 - if (shift) 406 + if (shift) { /* 1GB or 2MB page */ 407 407 page_size = 1ul << shift; 408 + if (shift == PMD_SHIFT) 409 + kvm->stat.num_2M_pages--; 410 + else if (shift == PUD_SHIFT) 411 + kvm->stat.num_1G_pages--; 412 + } 408 413 409 414 gpa &= ~(page_size - 1); 410 415 hpa = old & PTE_RPN_MASK; ··· 881 876 if (!ret && (pte_val(pte) & _PAGE_WRITE)) 882 877 set_page_dirty_lock(page); 883 878 put_page(page); 879 + } 880 + 881 + /* Increment number of large pages if we (successfully) inserted one */ 882 + if (!ret) { 883 + if (level == 1) 884 + kvm->stat.num_2M_pages++; 885 + else if (level == 2) 886 + kvm->stat.num_1G_pages++; 884 887 } 885 888 886 889 return ret;
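The radix hunk increments a counter when a large page is inserted successfully (level 1 or 2). It decrements the same counter when the PTE is unmapped (shift equal to `PMD_SHIFT` or `PUD_SHIFT`), so `largepages_2M` and `largepages_1G` track live mappings. Below is a minimal sketch of that symmetric accounting. It assumes the 21-bit (2 MiB) and 30-bit (1 GiB) shifts that the stat names imply, and the `demo_` names are hypothetical.

```c
#include <assert.h>

/* Assumed shifts implied by the stat names: 2 MiB and 1 GiB. */
#define DEMO_PMD_SHIFT 21
#define DEMO_PUD_SHIFT 30

struct demo_vm_stat {
	unsigned long num_2M_pages;
	unsigned long num_1G_pages;
};

/* Called after a successful insert; level 0 is a base page. */
static void demo_count_insert(struct demo_vm_stat *s, int level)
{
	if (level == 1)
		s->num_2M_pages++;
	else if (level == 2)
		s->num_1G_pages++;
}

/* Called on unmap with the mapping's shift; 0 means a base page. */
static void demo_count_unmap(struct demo_vm_stat *s, unsigned int shift)
{
	if (shift == DEMO_PMD_SHIFT) {
		assert(s->num_2M_pages > 0);
		s->num_2M_pages--;
	} else if (shift == DEMO_PUD_SHIFT) {
		assert(s->num_1G_pages > 0);
		s->num_1G_pages--;
	}
}
```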

+4 -4

arch/powerpc/kvm/book3s_64_vio.c

··· 133 133 continue; 134 134 135 135 kref_put(&stit->kref, kvm_spapr_tce_liobn_put); 136 - return; 137 136 } 138 137 } 139 138 } ··· 337 338 } 338 339 } 339 340 341 + kvm_get_kvm(kvm); 340 342 if (!ret) 341 343 ret = anon_inode_getfd("kvm-spapr-tce", &kvm_spapr_tce_fops, 342 344 stt, O_RDWR | O_CLOEXEC); 343 345 344 - if (ret >= 0) { 346 + if (ret >= 0) 345 347 list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables); 346 - kvm_get_kvm(kvm); 347 - } 348 + else 349 + kvm_put_kvm(kvm); 348 350 349 351 mutex_unlock(&kvm->lock); 350 352
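In the book3s_64_vio.c hunk, the VM reference that the new file descriptor will own is now taken before `anon_inode_getfd()` rather than after it. A likely reason: once the fd is installed, userspace can close it, and the release path then drops that reference, so taking the reference afterwards could race with an early close. If the attempt fails, the reference is dropped again. The single-threaded sketch below shows the ordering, with a plain counter standing in for the kvm refcount. All `demo_` names are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for struct kvm and its reference count. */
struct demo_vm {
	int refcount;
};

static void demo_get_vm(struct demo_vm *vm)
{
	vm->refcount++;
}

static void demo_put_vm(struct demo_vm *vm)
{
	assert(vm->refcount > 0);
	vm->refcount--;
}

/*
 * prior_err models the earlier error checks; getfd_ret models what
 * anon_inode_getfd() would return (an fd or a negative errno).
 */
static int demo_create_table_fd(struct demo_vm *vm, int prior_err,
				int getfd_ret)
{
	int ret = prior_err;

	demo_get_vm(vm);	/* owned by the fd before it becomes visible */
	if (!ret)
		ret = getfd_ret;

	if (ret < 0)
		demo_put_vm(vm);	/* no fd was created: undo the get */
	return ret;
}
```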

+18

arch/powerpc/kvm/book3s_emulate.c

··· 47 47 #define OP_31_XOP_SLBMFEV 851 48 48 #define OP_31_XOP_EIOIO 854 49 49 #define OP_31_XOP_SLBMFEE 915 50 + #define OP_31_XOP_SLBFEE 979 50 51 51 52 #define OP_31_XOP_TBEGIN 654 52 53 #define OP_31_XOP_TABORT 910 ··· 416 415 return EMULATE_FAIL; 417 416 418 417 vcpu->arch.mmu.slbia(vcpu); 418 + break; 419 + case OP_31_XOP_SLBFEE: 420 + if (!(inst & 1) || !vcpu->arch.mmu.slbfee) { 421 + return EMULATE_FAIL; 422 + } else { 423 + ulong b, t; 424 + ulong cr = kvmppc_get_cr(vcpu) & ~CR0_MASK; 425 + 426 + b = kvmppc_get_gpr(vcpu, rb); 427 + if (!vcpu->arch.mmu.slbfee(vcpu, b, &t)) 428 + cr |= 2 << CR0_SHIFT; 429 + kvmppc_set_gpr(vcpu, rt, t); 430 + /* copy XER[SO] bit to CR0[SO] */ 431 + cr |= (vcpu->arch.regs.xer & 0x80000000) >> 432 + (31 - CR0_SHIFT); 433 + kvmppc_set_cr(vcpu, cr); 434 + } 419 435 break; 420 436 case OP_31_XOP_SLBMFEE: 421 437 if (!vcpu->arch.mmu.slbmfee) {
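The `OP_31_XOP_SLBFEE` case above emulates only the record form (`slbfee.`, Rc = 1). It clears CR0, sets CR0[EQ] when an SLB entry matches, and copies XER[SO] into CR0[SO]. The sketch below reproduces that bit arithmetic on 32-bit CR and XER values. It assumes that CR0 is the top nibble of CR (shift 28, holding LT, GT, EQ, SO from high to low) and that XER[SO] is 0x80000000. The `DEMO_` constants are local definitions, not the kernel's `CR0_SHIFT`/`CR0_MASK`.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CR0_SHIFT	28
#define DEMO_CR0_FIELD	(0xFu << DEMO_CR0_SHIFT)	/* LT GT EQ SO */
#define DEMO_CR_EQ	2u				/* EQ within a CR field */
#define DEMO_XER_SO	0x80000000u

/* Compute the CR value after "slbfee." given whether an entry was found. */
static uint32_t demo_slbfee_cr(uint32_t cr, uint32_t xer, int found)
{
	assert(found == 0 || found == 1);
	cr &= ~DEMO_CR0_FIELD;				/* start from a clean CR0 */
	if (found)
		cr |= DEMO_CR_EQ << DEMO_CR0_SHIFT;	/* CR0[EQ] = match */
	/* XER[SO] (bit 31 here) lands on CR0[SO] (bit 28). */
	cr |= (xer & DEMO_XER_SO) >> (31 - DEMO_CR0_SHIFT);
	return cr;
}
```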

+17 -16

arch/powerpc/kvm/book3s_hv.c

··· 922 922 case H_IPOLL: 923 923 case H_XIRR_X: 924 924 if (kvmppc_xics_enabled(vcpu)) { 925 - if (xive_enabled()) { 925 + if (xics_on_xive()) { 926 926 ret = H_NOT_AVAILABLE; 927 927 return RESUME_GUEST; 928 928 } ··· 937 937 ret = kvmppc_h_set_xdabr(vcpu, kvmppc_get_gpr(vcpu, 4), 938 938 kvmppc_get_gpr(vcpu, 5)); 939 939 break; 940 + #ifdef CONFIG_SPAPR_TCE_IOMMU 940 941 case H_GET_TCE: 941 942 ret = kvmppc_h_get_tce(vcpu, kvmppc_get_gpr(vcpu, 4), 942 943 kvmppc_get_gpr(vcpu, 5)); ··· 967 966 if (ret == H_TOO_HARD) 968 967 return RESUME_HOST; 969 968 break; 969 + #endif 970 970 case H_RANDOM: 971 971 if (!powernv_get_random_long(&vcpu->arch.regs.gpr[4])) 972 972 ret = H_HARDWARE; ··· 1447 1445 case BOOK3S_INTERRUPT_HV_RM_HARD: 1448 1446 vcpu->arch.trap = 0; 1449 1447 r = RESUME_GUEST; 1450 - if (!xive_enabled()) 1448 + if (!xics_on_xive()) 1451 1449 kvmppc_xics_rm_complete(vcpu, 0); 1452 1450 break; 1453 1451 default: ··· 3650 3648 3651 3649 static void grow_halt_poll_ns(struct kvmppc_vcore *vc) 3652 3650 { 3653 - /* 10us base */ 3654 - if (vc->halt_poll_ns == 0 && halt_poll_ns_grow) 3655 - vc->halt_poll_ns = 10000; 3656 - else 3657 - vc->halt_poll_ns *= halt_poll_ns_grow; 3651 + if (!halt_poll_ns_grow) 3652 + return; 3653 + 3654 + vc->halt_poll_ns *= halt_poll_ns_grow; 3655 + if (vc->halt_poll_ns < halt_poll_ns_grow_start) 3656 + vc->halt_poll_ns = halt_poll_ns_grow_start; 3658 3657 } 3659 3658 3660 3659 static void shrink_halt_poll_ns(struct kvmppc_vcore *vc) ··· 3669 3666 #ifdef CONFIG_KVM_XICS 3670 3667 static inline bool xive_interrupt_pending(struct kvm_vcpu *vcpu) 3671 3668 { 3672 - if (!xive_enabled()) 3669 + if (!xics_on_xive()) 3673 3670 return false; 3674 3671 return vcpu->arch.irq_pending || vcpu->arch.xive_saved_state.pipr < 3675 3672 vcpu->arch.xive_saved_state.cppr; ··· 4229 4226 vcpu->arch.fault_dar, vcpu->arch.fault_dsisr); 4230 4227 srcu_read_unlock(&kvm->srcu, srcu_idx); 4231 4228 } else if (r == RESUME_PASSTHROUGH) { 4232 - if 
(WARN_ON(xive_enabled())) 4229 + if (WARN_ON(xics_on_xive())) 4233 4230 r = H_SUCCESS; 4234 4231 else 4235 4232 r = kvmppc_xics_rm_complete(vcpu, 0); ··· 4753 4750 * If xive is enabled, we route 0x500 interrupts directly 4754 4751 * to the guest. 4755 4752 */ 4756 - if (xive_enabled()) 4753 + if (xics_on_xive()) 4757 4754 lpcr |= LPCR_LPES; 4758 4755 } 4759 4756 ··· 4989 4986 if (i == pimap->n_mapped) 4990 4987 pimap->n_mapped++; 4991 4988 4992 - if (xive_enabled()) 4989 + if (xics_on_xive()) 4993 4990 rc = kvmppc_xive_set_mapped(kvm, guest_gsi, desc); 4994 4991 else 4995 4992 kvmppc_xics_set_mapped(kvm, guest_gsi, desc->irq_data.hwirq); ··· 5030 5027 return -ENODEV; 5031 5028 } 5032 5029 5033 - if (xive_enabled()) 5030 + if (xics_on_xive()) 5034 5031 rc = kvmppc_xive_clr_mapped(kvm, guest_gsi, pimap->mapped[i].desc); 5035 5032 else 5036 5033 kvmppc_xics_clr_mapped(kvm, guest_gsi, pimap->mapped[i].r_hwirq); ··· 5362 5359 continue; 5363 5360 5364 5361 sibling_subcore_state = 5365 - kmalloc_node(sizeof(struct sibling_subcore_state), 5362 + kzalloc_node(sizeof(struct sibling_subcore_state), 5366 5363 GFP_KERNEL, node); 5367 5364 if (!sibling_subcore_state) 5368 5365 return -ENOMEM; 5369 5366 5370 - memset(sibling_subcore_state, 0, 5371 - sizeof(struct sibling_subcore_state)); 5372 5367 5373 5368 for (j = 0; j < threads_per_core; j++) { 5374 5369 int cpu = first_cpu + j; ··· 5407 5406 * indirectly, via OPAL. 5408 5407 */ 5409 5408 #ifdef CONFIG_SMP 5410 - if (!xive_enabled() && !kvmhv_on_pseries() && 5409 + if (!xics_on_xive() && !kvmhv_on_pseries() && 5411 5410 !local_paca->kvm_hstate.xics_phys) { 5412 5411 struct device_node *np; 5413 5412
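The reworked `grow_halt_poll_ns()` in book3s_hv.c no longer hard-codes a 10 µs starting point. A grow factor of 0 now disables growth; otherwise the window is multiplied and then raised to at least `halt_poll_ns_grow_start`. Below is a sketch of that arithmetic as a pure function. `demo_grow()` is hypothetical, and the 10000 ns in the checks is only an example value.

```c
#include <assert.h>

/* Pure-function model of the new grow_halt_poll_ns() logic. */
static unsigned int demo_grow(unsigned int cur, unsigned int grow,
			      unsigned int grow_start)
{
	if (!grow)
		return cur;		/* growing disabled: keep the window */

	cur *= grow;
	if (cur < grow_start)
		cur = grow_start;	/* also covers starting from 0 */
	return cur;
}
```

Folding the zero case into the clamp removes the special-cased `vc->halt_poll_ns == 0` branch from the old code.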

+7 -7

arch/powerpc/kvm/book3s_hv_builtin.c

··· 257 257 } 258 258 259 259 /* We should never reach this */ 260 - if (WARN_ON_ONCE(xive_enabled())) 260 + if (WARN_ON_ONCE(xics_on_xive())) 261 261 return; 262 262 263 263 /* Else poke the target with an IPI */ ··· 577 577 { 578 578 if (!kvmppc_xics_enabled(vcpu)) 579 579 return H_TOO_HARD; 580 - if (xive_enabled()) { 580 + if (xics_on_xive()) { 581 581 if (is_rm()) 582 582 return xive_rm_h_xirr(vcpu); 583 583 if (unlikely(!__xive_vm_h_xirr)) ··· 592 592 if (!kvmppc_xics_enabled(vcpu)) 593 593 return H_TOO_HARD; 594 594 vcpu->arch.regs.gpr[5] = get_tb(); 595 - if (xive_enabled()) { 595 + if (xics_on_xive()) { 596 596 if (is_rm()) 597 597 return xive_rm_h_xirr(vcpu); 598 598 if (unlikely(!__xive_vm_h_xirr)) ··· 606 606 { 607 607 if (!kvmppc_xics_enabled(vcpu)) 608 608 return H_TOO_HARD; 609 - if (xive_enabled()) { 609 + if (xics_on_xive()) { 610 610 if (is_rm()) 611 611 return xive_rm_h_ipoll(vcpu, server); 612 612 if (unlikely(!__xive_vm_h_ipoll)) ··· 621 621 { 622 622 if (!kvmppc_xics_enabled(vcpu)) 623 623 return H_TOO_HARD; 624 - if (xive_enabled()) { 624 + if (xics_on_xive()) { 625 625 if (is_rm()) 626 626 return xive_rm_h_ipi(vcpu, server, mfrr); 627 627 if (unlikely(!__xive_vm_h_ipi)) ··· 635 635 { 636 636 if (!kvmppc_xics_enabled(vcpu)) 637 637 return H_TOO_HARD; 638 - if (xive_enabled()) { 638 + if (xics_on_xive()) { 639 639 if (is_rm()) 640 640 return xive_rm_h_cppr(vcpu, cppr); 641 641 if (unlikely(!__xive_vm_h_cppr)) ··· 649 649 { 650 650 if (!kvmppc_xics_enabled(vcpu)) 651 651 return H_TOO_HARD; 652 - if (xive_enabled()) { 652 + if (xics_on_xive()) { 653 653 if (is_rm()) 654 654 return xive_rm_h_eoi(vcpu, xirr); 655 655 if (unlikely(!__xive_vm_h_eoi))

+7

arch/powerpc/kvm/book3s_hv_rm_xics.c

··· 144 144 return; 145 145 } 146 146 147 + if (xive_enabled() && kvmhv_on_pseries()) { 148 + /* No XICS access or hypercalls available, too hard */ 149 + this_icp->rm_action |= XICS_RM_KICK_VCPU; 150 + this_icp->rm_kick_target = vcpu; 151 + return; 152 + } 153 + 147 154 /* 148 155 * Check if the core is loaded, 149 156 * if not, find an available host core to post to wake the VCPU,

+10

arch/powerpc/kvm/book3s_hv_rmhandlers.S

··· 2272 2272 .long DOTSYM(kvmppc_h_clear_mod) - hcall_real_table 2273 2273 .long DOTSYM(kvmppc_h_clear_ref) - hcall_real_table 2274 2274 .long DOTSYM(kvmppc_h_protect) - hcall_real_table 2275 + #ifdef CONFIG_SPAPR_TCE_IOMMU 2275 2276 .long DOTSYM(kvmppc_h_get_tce) - hcall_real_table 2276 2277 .long DOTSYM(kvmppc_rm_h_put_tce) - hcall_real_table 2278 + #else 2279 + .long 0 /* 0x1c */ 2280 + .long 0 /* 0x20 */ 2281 + #endif 2277 2282 .long 0 /* 0x24 - H_SET_SPRG0 */ 2278 2283 .long DOTSYM(kvmppc_h_set_dabr) - hcall_real_table 2279 2284 .long 0 /* 0x2c */ ··· 2356 2351 .long 0 /* 0x12c */ 2357 2352 .long 0 /* 0x130 */ 2358 2353 .long DOTSYM(kvmppc_h_set_xdabr) - hcall_real_table 2354 + #ifdef CONFIG_SPAPR_TCE_IOMMU 2359 2355 .long DOTSYM(kvmppc_rm_h_stuff_tce) - hcall_real_table 2360 2356 .long DOTSYM(kvmppc_rm_h_put_tce_indirect) - hcall_real_table 2357 + #else 2358 + .long 0 /* 0x138 */ 2359 + .long 0 /* 0x13c */ 2360 + #endif 2361 2361 .long 0 /* 0x140 */ 2362 2362 .long 0 /* 0x144 */ 2363 2363 .long 0 /* 0x148 */
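The `hcall_real_table` hunk keeps two `.long 0` placeholders in each `#else` branch. The reason is that entries are positioned by hcall number, which advances in steps of 4 (see the `0x24`, `0x2c` comments), so dropping entries would shift every later handler. A zero entry appears to mean that no real-mode handler exists for that hcall. The C sketch below models the same idea with a function-pointer array. `DEMO_HAVE_TCE`, the handlers and the 0x1c base are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HAVE_TCE 1		/* stands in for CONFIG_SPAPR_TCE_IOMMU */
#define DEMO_TABLE_BASE 0x1c	/* first hcall number modelled below */

typedef long (*demo_hcall_fn)(void);

static long demo_get_tce(void) { return 1; }
static long demo_put_tce(void) { return 2; }
static long demo_set_dabr(void) { return 3; }

static const demo_hcall_fn demo_hcalls[] = {
#if DEMO_HAVE_TCE
	demo_get_tce,		/* 0x1c H_GET_TCE */
	demo_put_tce,		/* 0x20 H_PUT_TCE */
#else
	NULL,			/* 0x1c: slot kept so later entries stay put */
	NULL,			/* 0x20 */
#endif
	NULL,			/* 0x24 H_SET_SPRG0: no handler */
	demo_set_dabr,		/* 0x28 H_SET_DABR */
};

static demo_hcall_fn demo_lookup(unsigned long nr)
{
	size_t idx;

	if (nr < DEMO_TABLE_BASE || nr % 4)
		return NULL;
	idx = (nr - DEMO_TABLE_BASE) / 4;
	if (idx >= sizeof(demo_hcalls) / sizeof(demo_hcalls[0]))
		return NULL;
	return demo_hcalls[idx];	/* NULL: no handler in this table */
}
```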

+4 -4

arch/powerpc/kvm/book3s_rtas.c

··· 33 33 server = be32_to_cpu(args->args[1]); 34 34 priority = be32_to_cpu(args->args[2]); 35 35 36 - if (xive_enabled()) 36 + if (xics_on_xive()) 37 37 rc = kvmppc_xive_set_xive(vcpu->kvm, irq, server, priority); 38 38 else 39 39 rc = kvmppc_xics_set_xive(vcpu->kvm, irq, server, priority); ··· 56 56 irq = be32_to_cpu(args->args[0]); 57 57 58 58 server = priority = 0; 59 - if (xive_enabled()) 59 + if (xics_on_xive()) 60 60 rc = kvmppc_xive_get_xive(vcpu->kvm, irq, &server, &priority); 61 61 else 62 62 rc = kvmppc_xics_get_xive(vcpu->kvm, irq, &server, &priority); ··· 83 83 84 84 irq = be32_to_cpu(args->args[0]); 85 85 86 - if (xive_enabled()) 86 + if (xics_on_xive()) 87 87 rc = kvmppc_xive_int_off(vcpu->kvm, irq); 88 88 else 89 89 rc = kvmppc_xics_int_off(vcpu->kvm, irq); ··· 105 105 106 106 irq = be32_to_cpu(args->args[0]); 107 107 108 - if (xive_enabled()) 108 + if (xics_on_xive()) 109 109 rc = kvmppc_xive_int_on(vcpu->kvm, irq); 110 110 else 111 111 rc = kvmppc_xics_int_on(vcpu->kvm, irq);

+16 -6

arch/powerpc/kvm/powerpc.c

··· 748 748 kvmppc_mpic_disconnect_vcpu(vcpu->arch.mpic, vcpu); 749 749 break; 750 750 case KVMPPC_IRQ_XICS: 751 - if (xive_enabled()) 751 + if (xics_on_xive()) 752 752 kvmppc_xive_cleanup_vcpu(vcpu); 753 753 else 754 754 kvmppc_xics_free_icp(vcpu); ··· 1931 1931 r = -EPERM; 1932 1932 dev = kvm_device_from_filp(f.file); 1933 1933 if (dev) { 1934 - if (xive_enabled()) 1934 + if (xics_on_xive()) 1935 1935 r = kvmppc_xive_connect_vcpu(dev, vcpu, cap->args[1]); 1936 1936 else 1937 1937 r = kvmppc_xics_connect_vcpu(dev, vcpu, cap->args[1]); ··· 2189 2189 KVM_PPC_CPU_CHAR_L1D_THREAD_PRIV | 2190 2190 KVM_PPC_CPU_CHAR_BR_HINT_HONOURED | 2191 2191 KVM_PPC_CPU_CHAR_MTTRIG_THR_RECONF | 2192 - KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS; 2192 + KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS | 2193 + KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST; 2193 2194 cp->behaviour_mask = KVM_PPC_CPU_BEHAV_FAVOUR_SECURITY | 2194 2195 KVM_PPC_CPU_BEHAV_L1D_FLUSH_PR | 2195 - KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR; 2196 + KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR | 2197 + KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE; 2196 2198 } 2197 2199 return 0; 2198 2200 } ··· 2253 2251 if (have_fw_feat(fw_features, "enabled", 2254 2252 "fw-count-cache-disabled")) 2255 2253 cp->character |= KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS; 2254 + if (have_fw_feat(fw_features, "enabled", 2255 + "fw-count-cache-flush-bcctr2,0,0")) 2256 + cp->character |= KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST; 2256 2257 cp->character_mask = KVM_PPC_CPU_CHAR_SPEC_BAR_ORI31 | 2257 2258 KVM_PPC_CPU_CHAR_BCCTRL_SERIALISED | 2258 2259 KVM_PPC_CPU_CHAR_L1D_FLUSH_ORI30 | 2259 2260 KVM_PPC_CPU_CHAR_L1D_FLUSH_TRIG2 | 2260 2261 KVM_PPC_CPU_CHAR_L1D_THREAD_PRIV | 2261 - KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS; 2262 + KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS | 2263 + KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST; 2262 2264 2263 2265 if (have_fw_feat(fw_features, "enabled", 2264 2266 "speculation-policy-favor-security")) ··· 2273 2267 if (!have_fw_feat(fw_features, "disabled", 2274 2268 "needs-spec-barrier-for-bound-checks")) 2275 
2269 cp->behaviour |= KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR; 2270 + if (have_fw_feat(fw_features, "enabled", 2271 + "needs-count-cache-flush-on-context-switch")) 2272 + cp->behaviour |= KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE; 2276 2273 cp->behaviour_mask = KVM_PPC_CPU_BEHAV_FAVOUR_SECURITY | 2277 2274 KVM_PPC_CPU_BEHAV_L1D_FLUSH_PR | 2278 - KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR; 2275 + KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR | 2276 + KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE; 2279 2277 2280 2278 of_node_put(fw_features); 2281 2279 }

+1

arch/s390/include/asm/cio.h

··· 331 331 /* Function from drivers/s390/cio/chsc.c */ 332 332 int chsc_sstpc(void *page, unsigned int op, u16 ctrl, u64 *clock_delta); 333 333 int chsc_sstpi(void *page, void *result, size_t size); 334 + int chsc_sgib(u32 origin); 334 335 335 336 #endif

+1

arch/s390/include/asm/irq.h

··· 62 62 IRQIO_MSI, 63 63 IRQIO_VIR, 64 64 IRQIO_VAI, 65 + IRQIO_GAL, 65 66 NMI_NMI, 66 67 CPU_RST, 67 68 NR_ARCH_IRQS

+1

arch/s390/include/asm/isc.h

··· 21 21 /* Adapter interrupts. */ 22 22 #define QDIO_AIRQ_ISC IO_SCH_ISC /* I/O subchannel in qdio mode */ 23 23 #define PCI_ISC 2 /* PCI I/O subchannels */ 24 + #define GAL_ISC 5 /* GIB alert */ 24 25 #define AP_ISC 6 /* adjunct processor (crypto) devices */ 25 26 26 27 /* Functions for registration of I/O interruption subclasses */

+35 -4

arch/s390/include/asm/kvm_host.h

··· 591 591 struct kvm_s390_mchk_info mchk; 592 592 struct kvm_s390_ext_info srv_signal; 593 593 int next_rr_cpu; 594 - unsigned long idle_mask[BITS_TO_LONGS(KVM_MAX_VCPUS)]; 595 594 struct mutex ais_lock; 596 595 u8 simm; 597 596 u8 nimm; ··· 711 712 struct kvm_s390_cpu_model { 712 713 /* facility mask supported by kvm & hosting machine */ 713 714 __u64 fac_mask[S390_ARCH_FAC_LIST_SIZE_U64]; 715 + struct kvm_s390_vm_cpu_subfunc subfuncs; 714 716 /* facility list requested by guest (in dma page) */ 715 717 __u64 *fac_list; 716 718 u64 cpuid; ··· 782 782 u8 reserved03[11]; 783 783 u32 airq_count; 784 784 } g1; 785 + struct { 786 + u64 word[4]; 787 + } u64; 785 788 }; 789 + }; 790 + 791 + struct kvm_s390_gib { 792 + u32 alert_list_origin; 793 + u32 reserved01; 794 + u8:5; 795 + u8 nisc:3; 796 + u8 reserved03[3]; 797 + u32 reserved04[5]; 786 798 }; 787 799 788 800 /* ··· 805 793 __u64 fac_list[S390_ARCH_FAC_LIST_SIZE_U64]; /* 0x0000 */ 806 794 struct kvm_s390_crypto_cb crycb; /* 0x0800 */ 807 795 struct kvm_s390_gisa gisa; /* 0x0900 */ 808 - u8 reserved920[0x1000 - 0x920]; /* 0x0920 */ 796 + struct kvm *kvm; /* 0x0920 */ 797 + u8 reserved928[0x1000 - 0x928]; /* 0x0928 */ 809 798 }; 810 799 811 800 struct kvm_s390_vsie { ··· 815 802 int page_count; 816 803 int next; 817 804 struct page *pages[KVM_MAX_VCPUS]; 805 + }; 806 + 807 + struct kvm_s390_gisa_iam { 808 + u8 mask; 809 + spinlock_t ref_lock; 810 + u32 ref_count[MAX_ISC + 1]; 811 + }; 812 + 813 + struct kvm_s390_gisa_interrupt { 814 + struct kvm_s390_gisa *origin; 815 + struct kvm_s390_gisa_iam alert; 816 + struct hrtimer timer; 817 + u64 expires; 818 + DECLARE_BITMAP(kicked_mask, KVM_MAX_VCPUS); 818 819 }; 819 820 820 821 struct kvm_arch{ ··· 864 837 atomic64_t cmma_dirty_pages; 865 838 /* subset of available cpu features enabled by user space */ 866 839 DECLARE_BITMAP(cpu_feat, KVM_S390_VM_CPU_FEAT_NR_BITS); 867 - struct kvm_s390_gisa *gisa; 840 + DECLARE_BITMAP(idle_mask, KVM_MAX_VCPUS); 841 + struct 
kvm_s390_gisa_interrupt gisa_int; 868 842 }; 869 843 870 844 #define KVM_HVA_ERR_BAD (-1UL) ··· 899 871 extern int sie64a(struct kvm_s390_sie_block *, u64 *); 900 872 extern char sie_exit; 901 873 874 + extern int kvm_s390_gisc_register(struct kvm *kvm, u32 gisc); 875 + extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc); 876 + 902 877 static inline void kvm_arch_hardware_disable(void) {} 903 878 static inline void kvm_arch_check_processor_compat(void *rtn) {} 904 879 static inline void kvm_arch_sync_events(struct kvm *kvm) {} ··· 909 878 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} 910 879 static inline void kvm_arch_free_memslot(struct kvm *kvm, 911 880 struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {} 912 - static inline void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) {} 881 + static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} 913 882 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {} 914 883 static inline void kvm_arch_flush_shadow_memslot(struct kvm *kvm, 915 884 struct kvm_memory_slot *slot) {}

+1

arch/s390/kernel/irq.c

··· 88 88 {.irq = IRQIO_MSI, .name = "MSI", .desc = "[I/O] MSI Interrupt" }, 89 89 {.irq = IRQIO_VIR, .name = "VIR", .desc = "[I/O] Virtual I/O Devices"}, 90 90 {.irq = IRQIO_VAI, .name = "VAI", .desc = "[I/O] Virtual I/O Devices AI"}, 91 + {.irq = IRQIO_GAL, .name = "GAL", .desc = "[I/O] GIB Alert"}, 91 92 {.irq = NMI_NMI, .name = "NMI", .desc = "[NMI] Machine Check"}, 92 93 {.irq = CPU_RST, .name = "RST", .desc = "[CPU] CPU Restart"}, 93 94 };

+390 -41

arch/s390/kvm/interrupt.c

··· 7 7 * Author(s): Carsten Otte <cotte@de.ibm.com> 8 8 */ 9 9 10 + #define KMSG_COMPONENT "kvm-s390" 11 + #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt 12 + 10 13 #include <linux/interrupt.h> 11 14 #include <linux/kvm_host.h> 12 15 #include <linux/hrtimer.h> ··· 26 23 #include <asm/gmap.h> 27 24 #include <asm/switch_to.h> 28 25 #include <asm/nmi.h> 26 + #include <asm/airq.h> 29 27 #include "kvm-s390.h" 30 28 #include "gaccess.h" 31 29 #include "trace-s390.h" ··· 34 30 #define PFAULT_INIT 0x0600 35 31 #define PFAULT_DONE 0x0680 36 32 #define VIRTIO_PARAM 0x0d00 33 + 34 + static struct kvm_s390_gib *gib; 37 35 38 36 /* handle external calls via sigp interpretation facility */ 39 37 static int sca_ext_call_pending(struct kvm_vcpu *vcpu, int *src_id) ··· 223 217 */ 224 218 #define IPM_BIT_OFFSET (offsetof(struct kvm_s390_gisa, ipm) * BITS_PER_BYTE) 225 219 226 - static inline void kvm_s390_gisa_set_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) 220 + /** 221 + * gisa_set_iam - change the GISA interruption alert mask 222 + * 223 + * @gisa: gisa to operate on 224 + * @iam: new IAM value to use 225 + * 226 + * Change the IAM atomically with the next alert address and the IPM 227 + * of the GISA if the GISA is not part of the GIB alert list. All three 228 + * fields are located in the first long word of the GISA. 
229 + * 230 + * Returns: 0 on success 231 + * -EBUSY in case the gisa is part of the alert list 232 + */ 233 + static inline int gisa_set_iam(struct kvm_s390_gisa *gisa, u8 iam) 234 + { 235 + u64 word, _word; 236 + 237 + do { 238 + word = READ_ONCE(gisa->u64.word[0]); 239 + if ((u64)gisa != word >> 32) 240 + return -EBUSY; 241 + _word = (word & ~0xffUL) | iam; 242 + } while (cmpxchg(&gisa->u64.word[0], word, _word) != word); 243 + 244 + return 0; 245 + } 246 + 247 + /** 248 + * gisa_clear_ipm - clear the GISA interruption pending mask 249 + * 250 + * @gisa: gisa to operate on 251 + * 252 + * Clear the IPM atomically with the next alert address and the IAM 253 + * of the GISA unconditionally. All three fields are located in the 254 + * first long word of the GISA. 255 + */ 256 + static inline void gisa_clear_ipm(struct kvm_s390_gisa *gisa) 257 + { 258 + u64 word, _word; 259 + 260 + do { 261 + word = READ_ONCE(gisa->u64.word[0]); 262 + _word = word & ~(0xffUL << 24); 263 + } while (cmpxchg(&gisa->u64.word[0], word, _word) != word); 264 + } 265 + 266 + /** 267 + * gisa_get_ipm_or_restore_iam - return IPM or restore GISA IAM 268 + * 269 + * @gi: gisa interrupt struct to work on 270 + * 271 + * Atomically restores the interruption alert mask if none of the 272 + * relevant ISCs are pending and return the IPM. 
273 + * 274 + * Returns: the relevant pending ISCs 275 + */ 276 + static inline u8 gisa_get_ipm_or_restore_iam(struct kvm_s390_gisa_interrupt *gi) 277 + { 278 + u8 pending_mask, alert_mask; 279 + u64 word, _word; 280 + 281 + do { 282 + word = READ_ONCE(gi->origin->u64.word[0]); 283 + alert_mask = READ_ONCE(gi->alert.mask); 284 + pending_mask = (u8)(word >> 24) & alert_mask; 285 + if (pending_mask) 286 + return pending_mask; 287 + _word = (word & ~0xffUL) | alert_mask; 288 + } while (cmpxchg(&gi->origin->u64.word[0], word, _word) != word); 289 + 290 + return 0; 291 + } 292 + 293 + static inline int gisa_in_alert_list(struct kvm_s390_gisa *gisa) 294 + { 295 + return READ_ONCE(gisa->next_alert) != (u32)(u64)gisa; 296 + } 297 + 298 + static inline void gisa_set_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) 227 299 { 228 300 set_bit_inv(IPM_BIT_OFFSET + gisc, (unsigned long *) gisa); 229 301 } 230 302 231 - static inline u8 kvm_s390_gisa_get_ipm(struct kvm_s390_gisa *gisa) 303 + static inline u8 gisa_get_ipm(struct kvm_s390_gisa *gisa) 232 304 { 233 305 return READ_ONCE(gisa->ipm); 234 306 } 235 307 236 - static inline void kvm_s390_gisa_clear_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) 308 + static inline void gisa_clear_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) 237 309 { 238 310 clear_bit_inv(IPM_BIT_OFFSET + gisc, (unsigned long *) gisa); 239 311 } 240 312 241 - static inline int kvm_s390_gisa_tac_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) 313 + static inline int gisa_tac_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) 242 314 { 243 315 return test_and_clear_bit_inv(IPM_BIT_OFFSET + gisc, (unsigned long *) gisa); 244 316 } ··· 329 245 330 246 static inline unsigned long pending_irqs(struct kvm_vcpu *vcpu) 331 247 { 332 - return pending_irqs_no_gisa(vcpu) | 333 - kvm_s390_gisa_get_ipm(vcpu->kvm->arch.gisa) << IRQ_PEND_IO_ISC_7; 248 + struct kvm_s390_gisa_interrupt *gi = &vcpu->kvm->arch.gisa_int; 249 + unsigned long pending_mask; 250 + 251 + pending_mask = 
pending_irqs_no_gisa(vcpu); 252 + if (gi->origin) 253 + pending_mask |= gisa_get_ipm(gi->origin) << IRQ_PEND_IO_ISC_7; 254 + return pending_mask; 334 255 } 335 256 336 257 static inline int isc_to_irq_type(unsigned long isc) ··· 407 318 static void __set_cpu_idle(struct kvm_vcpu *vcpu) 408 319 { 409 320 kvm_s390_set_cpuflags(vcpu, CPUSTAT_WAIT); 410 - set_bit(vcpu->vcpu_id, vcpu->kvm->arch.float_int.idle_mask); 321 + set_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask); 411 322 } 412 323 413 324 static void __unset_cpu_idle(struct kvm_vcpu *vcpu) 414 325 { 415 326 kvm_s390_clear_cpuflags(vcpu, CPUSTAT_WAIT); 416 - clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.float_int.idle_mask); 327 + clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask); 417 328 } 418 329 419 330 static void __reset_intercept_indicators(struct kvm_vcpu *vcpu) ··· 434 345 { 435 346 if (!(pending_irqs_no_gisa(vcpu) & IRQ_PEND_IO_MASK)) 436 347 return; 437 - else if (psw_ioint_disabled(vcpu)) 348 + if (psw_ioint_disabled(vcpu)) 438 349 kvm_s390_set_cpuflags(vcpu, CPUSTAT_IO_INT); 439 350 else 440 351 vcpu->arch.sie_block->lctl |= LCTL_CR6; ··· 442 353 443 354 static void set_intercept_indicators_ext(struct kvm_vcpu *vcpu) 444 355 { 445 - if (!(pending_irqs(vcpu) & IRQ_PEND_EXT_MASK)) 356 + if (!(pending_irqs_no_gisa(vcpu) & IRQ_PEND_EXT_MASK)) 446 357 return; 447 358 if (psw_extint_disabled(vcpu)) 448 359 kvm_s390_set_cpuflags(vcpu, CPUSTAT_EXT_INT); ··· 452 363 453 364 static void set_intercept_indicators_mchk(struct kvm_vcpu *vcpu) 454 365 { 455 - if (!(pending_irqs(vcpu) & IRQ_PEND_MCHK_MASK)) 366 + if (!(pending_irqs_no_gisa(vcpu) & IRQ_PEND_MCHK_MASK)) 456 367 return; 457 368 if (psw_mchk_disabled(vcpu)) 458 369 vcpu->arch.sie_block->ictl |= ICTL_LPSW; ··· 1045 956 { 1046 957 struct list_head *isc_list; 1047 958 struct kvm_s390_float_interrupt *fi; 959 + struct kvm_s390_gisa_interrupt *gi = &vcpu->kvm->arch.gisa_int; 1048 960 struct kvm_s390_interrupt_info *inti = NULL; 1049 961 struct kvm_s390_io_info 
io; 1050 962 u32 isc; ··· 1088 998 goto out; 1089 999 } 1090 1000 1091 - if (vcpu->kvm->arch.gisa && 1092 - kvm_s390_gisa_tac_ipm_gisc(vcpu->kvm->arch.gisa, isc)) { 1001 + if (gi->origin && gisa_tac_ipm_gisc(gi->origin, isc)) { 1093 1002 /* 1094 1003 * in case an adapter interrupt was not delivered 1095 1004 * in SIE context KVM will handle the delivery ··· 1178 1089 1179 1090 int kvm_s390_handle_wait(struct kvm_vcpu *vcpu) 1180 1091 { 1092 + struct kvm_s390_gisa_interrupt *gi = &vcpu->kvm->arch.gisa_int; 1181 1093 u64 sltime; 1182 1094 1183 1095 vcpu->stat.exit_wait_state++; ··· 1191 1101 VCPU_EVENT(vcpu, 3, "%s", "disabled wait"); 1192 1102 return -EOPNOTSUPP; /* disabled wait */ 1193 1103 } 1104 + 1105 + if (gi->origin && 1106 + (gisa_get_ipm_or_restore_iam(gi) & 1107 + vcpu->arch.sie_block->gcr[6] >> 24)) 1108 + return 0; 1194 1109 1195 1110 if (!ckc_interrupts_enabled(vcpu) && 1196 1111 !cpu_timer_interrupts_enabled(vcpu)) { ··· 1628 1533 1629 1534 static int get_top_gisa_isc(struct kvm *kvm, u64 isc_mask, u32 schid) 1630 1535 { 1536 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 1631 1537 unsigned long active_mask; 1632 1538 int isc; 1633 1539 1634 1540 if (schid) 1635 1541 goto out; 1636 - if (!kvm->arch.gisa) 1542 + if (!gi->origin) 1637 1543 goto out; 1638 1544 1639 - active_mask = (isc_mask & kvm_s390_gisa_get_ipm(kvm->arch.gisa) << 24) << 32; 1545 + active_mask = (isc_mask & gisa_get_ipm(gi->origin) << 24) << 32; 1640 1546 while (active_mask) { 1641 1547 isc = __fls(active_mask) ^ (BITS_PER_LONG - 1); 1642 - if (kvm_s390_gisa_tac_ipm_gisc(kvm->arch.gisa, isc)) 1548 + if (gisa_tac_ipm_gisc(gi->origin, isc)) 1643 1549 return isc; 1644 1550 clear_bit_inv(isc, &active_mask); 1645 1551 } ··· 1663 1567 struct kvm_s390_interrupt_info *kvm_s390_get_io_int(struct kvm *kvm, 1664 1568 u64 isc_mask, u32 schid) 1665 1569 { 1570 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 1666 1571 struct kvm_s390_interrupt_info *inti, *tmp_inti; 1667 1572 
int isc; 1668 1573 ··· 1681 1584 /* both types of interrupts present */ 1682 1585 if (int_word_to_isc(inti->io.io_int_word) <= isc) { 1683 1586 /* classical IO int with higher priority */ 1684 - kvm_s390_gisa_set_ipm_gisc(kvm->arch.gisa, isc); 1587 + gisa_set_ipm_gisc(gi->origin, isc); 1685 1588 goto out; 1686 1589 } 1687 1590 gisa_out: ··· 1693 1596 kvm_s390_reinject_io_int(kvm, inti); 1694 1597 inti = tmp_inti; 1695 1598 } else 1696 - kvm_s390_gisa_set_ipm_gisc(kvm->arch.gisa, isc); 1599 + gisa_set_ipm_gisc(gi->origin, isc); 1697 1600 out: 1698 1601 return inti; 1699 1602 } ··· 1782 1685 1783 1686 static int __inject_io(struct kvm *kvm, struct kvm_s390_interrupt_info *inti) 1784 1687 { 1688 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 1785 1689 struct kvm_s390_float_interrupt *fi; 1786 1690 struct list_head *list; 1787 1691 int isc; ··· 1790 1692 kvm->stat.inject_io++; 1791 1693 isc = int_word_to_isc(inti->io.io_int_word); 1792 1694 1793 - if (kvm->arch.gisa && inti->type & KVM_S390_INT_IO_AI_MASK) { 1695 + if (gi->origin && inti->type & KVM_S390_INT_IO_AI_MASK) { 1794 1696 VM_EVENT(kvm, 4, "%s isc %1u", "inject: I/O (AI/gisa)", isc); 1795 - kvm_s390_gisa_set_ipm_gisc(kvm->arch.gisa, isc); 1697 + gisa_set_ipm_gisc(gi->origin, isc); 1796 1698 kfree(inti); 1797 1699 return 0; 1798 1700 } ··· 1824 1726 */ 1825 1727 static void __floating_irq_kick(struct kvm *kvm, u64 type) 1826 1728 { 1827 - struct kvm_s390_float_interrupt *fi = &kvm->arch.float_int; 1828 1729 struct kvm_vcpu *dst_vcpu; 1829 1730 int sigcpu, online_vcpus, nr_tries = 0; 1830 1731 ··· 1832 1735 return; 1833 1736 1834 1737 /* find idle VCPUs first, then round robin */ 1835 - sigcpu = find_first_bit(fi->idle_mask, online_vcpus); 1738 + sigcpu = find_first_bit(kvm->arch.idle_mask, online_vcpus); 1836 1739 if (sigcpu == online_vcpus) { 1837 1740 do { 1838 - sigcpu = fi->next_rr_cpu; 1839 - fi->next_rr_cpu = (fi->next_rr_cpu + 1) % online_vcpus; 1741 + sigcpu = 
kvm->arch.float_int.next_rr_cpu++; 1742 + kvm->arch.float_int.next_rr_cpu %= online_vcpus; 1840 1743 /* avoid endless loops if all vcpus are stopped */ 1841 1744 if (nr_tries++ >= online_vcpus) 1842 1745 return; ··· 1850 1753 kvm_s390_set_cpuflags(dst_vcpu, CPUSTAT_STOP_INT); 1851 1754 break; 1852 1755 case KVM_S390_INT_IO_MIN...KVM_S390_INT_IO_MAX: 1853 - if (!(type & KVM_S390_INT_IO_AI_MASK && kvm->arch.gisa)) 1756 + if (!(type & KVM_S390_INT_IO_AI_MASK && 1757 + kvm->arch.gisa_int.origin)) 1854 1758 kvm_s390_set_cpuflags(dst_vcpu, CPUSTAT_IO_INT); 1855 1759 break; 1856 1760 default: ··· 2101 2003 2102 2004 static int get_all_floating_irqs(struct kvm *kvm, u8 __user *usrbuf, u64 len) 2103 2005 { 2006 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 2104 2007 struct kvm_s390_interrupt_info *inti; 2105 2008 struct kvm_s390_float_interrupt *fi; 2106 2009 struct kvm_s390_irq *buf; ··· 2125 2026 2126 2027 max_irqs = len / sizeof(struct kvm_s390_irq); 2127 2028 2128 - if (kvm->arch.gisa && 2129 - kvm_s390_gisa_get_ipm(kvm->arch.gisa)) { 2029 + if (gi->origin && gisa_get_ipm(gi->origin)) { 2130 2030 for (i = 0; i <= MAX_ISC; i++) { 2131 2031 if (n == max_irqs) { 2132 2032 /* signal userspace to try again */ 2133 2033 ret = -ENOMEM; 2134 2034 goto out_nolock; 2135 2035 } 2136 - if (kvm_s390_gisa_tac_ipm_gisc(kvm->arch.gisa, i)) { 2036 + if (gisa_tac_ipm_gisc(gi->origin, i)) { 2137 2037 irq = (struct kvm_s390_irq *) &buf[n]; 2138 2038 irq->type = KVM_S390_INT_IO(1, 0, 0, 0); 2139 2039 irq->u.io.io_int_word = isc_to_int_word(i); ··· 2929 2831 int kvm_s390_get_irq_state(struct kvm_vcpu *vcpu, __u8 __user *buf, int len) 2930 2832 { 2931 2833 int scn; 2932 - unsigned long sigp_emerg_pending[BITS_TO_LONGS(KVM_MAX_VCPUS)]; 2834 + DECLARE_BITMAP(sigp_emerg_pending, KVM_MAX_VCPUS); 2933 2835 struct kvm_s390_local_interrupt *li = &vcpu->arch.local_int; 2934 2836 unsigned long pending_irqs; 2935 2837 struct kvm_s390_irq irq; ··· 2982 2884 return n; 2983 2885 } 2984 2886 
2887 + static void __airqs_kick_single_vcpu(struct kvm *kvm, u8 deliverable_mask) 2888 + { 2889 + int vcpu_id, online_vcpus = atomic_read(&kvm->online_vcpus); 2890 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 2891 + struct kvm_vcpu *vcpu; 2892 + 2893 + for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) { 2894 + vcpu = kvm_get_vcpu(kvm, vcpu_id); 2895 + if (psw_ioint_disabled(vcpu)) 2896 + continue; 2897 + deliverable_mask &= (u8)(vcpu->arch.sie_block->gcr[6] >> 24); 2898 + if (deliverable_mask) { 2899 + /* lately kicked but not yet running */ 2900 + if (test_and_set_bit(vcpu_id, gi->kicked_mask)) 2901 + return; 2902 + kvm_s390_vcpu_wakeup(vcpu); 2903 + return; 2904 + } 2905 + } 2906 + } 2907 + 2908 + static enum hrtimer_restart gisa_vcpu_kicker(struct hrtimer *timer) 2909 + { 2910 + struct kvm_s390_gisa_interrupt *gi = 2911 + container_of(timer, struct kvm_s390_gisa_interrupt, timer); 2912 + struct kvm *kvm = 2913 + container_of(gi->origin, struct sie_page2, gisa)->kvm; 2914 + u8 pending_mask; 2915 + 2916 + pending_mask = gisa_get_ipm_or_restore_iam(gi); 2917 + if (pending_mask) { 2918 + __airqs_kick_single_vcpu(kvm, pending_mask); 2919 + hrtimer_forward_now(timer, ns_to_ktime(gi->expires)); 2920 + return HRTIMER_RESTART; 2921 + }; 2922 + 2923 + return HRTIMER_NORESTART; 2924 + } 2925 + 2926 + #define NULL_GISA_ADDR 0x00000000UL 2927 + #define NONE_GISA_ADDR 0x00000001UL 2928 + #define GISA_ADDR_MASK 0xfffff000UL 2929 + 2930 + static void process_gib_alert_list(void) 2931 + { 2932 + struct kvm_s390_gisa_interrupt *gi; 2933 + struct kvm_s390_gisa *gisa; 2934 + struct kvm *kvm; 2935 + u32 final, origin = 0UL; 2936 + 2937 + do { 2938 + /* 2939 + * If the NONE_GISA_ADDR is still stored in the alert list 2940 + * origin, we will leave the outer loop. No further GISA has 2941 + * been added to the alert list by millicode while processing 2942 + * the current alert list. 
2943 + */ 2944 + final = (origin & NONE_GISA_ADDR); 2945 + /* 2946 + * Cut off the alert list and store the NONE_GISA_ADDR in the 2947 + * alert list origin to avoid further GAL interruptions. 2948 + * A new alert list can be build up by millicode in parallel 2949 + * for guests not in the yet cut-off alert list. When in the 2950 + * final loop, store the NULL_GISA_ADDR instead. This will re- 2951 + * enable GAL interruptions on the host again. 2952 + */ 2953 + origin = xchg(&gib->alert_list_origin, 2954 + (!final) ? NONE_GISA_ADDR : NULL_GISA_ADDR); 2955 + /* 2956 + * Loop through the just cut-off alert list and start the 2957 + * gisa timers to kick idle vcpus to consume the pending 2958 + * interruptions asap. 2959 + */ 2960 + while (origin & GISA_ADDR_MASK) { 2961 + gisa = (struct kvm_s390_gisa *)(u64)origin; 2962 + origin = gisa->next_alert; 2963 + gisa->next_alert = (u32)(u64)gisa; 2964 + kvm = container_of(gisa, struct sie_page2, gisa)->kvm; 2965 + gi = &kvm->arch.gisa_int; 2966 + if (hrtimer_active(&gi->timer)) 2967 + hrtimer_cancel(&gi->timer); 2968 + hrtimer_start(&gi->timer, 0, HRTIMER_MODE_REL); 2969 + } 2970 + } while (!final); 2971 + 2972 + } 2973 + 2985 2974 void kvm_s390_gisa_clear(struct kvm *kvm) 2986 2975 { 2987 - if (kvm->arch.gisa) { 2988 - memset(kvm->arch.gisa, 0, sizeof(struct kvm_s390_gisa)); 2989 - kvm->arch.gisa->next_alert = (u32)(u64)kvm->arch.gisa; 2990 - VM_EVENT(kvm, 3, "gisa 0x%pK cleared", kvm->arch.gisa); 2991 - } 2976 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 2977 + 2978 + if (!gi->origin) 2979 + return; 2980 + gisa_clear_ipm(gi->origin); 2981 + VM_EVENT(kvm, 3, "gisa 0x%pK cleared", gi->origin); 2992 2982 } 2993 2983 2994 2984 void kvm_s390_gisa_init(struct kvm *kvm) 2995 2985 { 2996 - if (css_general_characteristics.aiv) { 2997 - kvm->arch.gisa = &kvm->arch.sie_page2->gisa; 2998 - VM_EVENT(kvm, 3, "gisa 0x%pK initialized", kvm->arch.gisa); 2999 - kvm_s390_gisa_clear(kvm); 3000 - } 2986 + struct 
kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 2987 + 2988 + if (!css_general_characteristics.aiv) 2989 + return; 2990 + gi->origin = &kvm->arch.sie_page2->gisa; 2991 + gi->alert.mask = 0; 2992 + spin_lock_init(&gi->alert.ref_lock); 2993 + gi->expires = 50 * 1000; /* 50 usec */ 2994 + hrtimer_init(&gi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); 2995 + gi->timer.function = gisa_vcpu_kicker; 2996 + memset(gi->origin, 0, sizeof(struct kvm_s390_gisa)); 2997 + gi->origin->next_alert = (u32)(u64)gi->origin; 2998 + VM_EVENT(kvm, 3, "gisa 0x%pK initialized", gi->origin); 3001 2999 } 3002 3000 3003 3001 void kvm_s390_gisa_destroy(struct kvm *kvm) 3004 3002 { 3005 - if (!kvm->arch.gisa) 3003 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 3004 + 3005 + if (!gi->origin) 3006 3006 return; 3007 - kvm->arch.gisa = NULL; 3007 + if (gi->alert.mask) 3008 + KVM_EVENT(3, "vm 0x%pK has unexpected iam 0x%02x", 3009 + kvm, gi->alert.mask); 3010 + while (gisa_in_alert_list(gi->origin)) 3011 + cpu_relax(); 3012 + hrtimer_cancel(&gi->timer); 3013 + gi->origin = NULL; 3014 + } 3015 + 3016 + /** 3017 + * kvm_s390_gisc_register - register a guest ISC 3018 + * 3019 + * @kvm: the kernel vm to work with 3020 + * @gisc: the guest interruption sub class to register 3021 + * 3022 + * The function extends the vm specific alert mask to use. 3023 + * The effective IAM mask in the GISA is updated as well 3024 + * in case the GISA is not part of the GIB alert list. 3025 + * It will be updated latest when the IAM gets restored 3026 + * by gisa_get_ipm_or_restore_iam(). 3027 + * 3028 + * Returns: the nonspecific ISC (NISC) the gib alert mechanism 3029 + * has registered with the channel subsystem. 
3030 + * -ENODEV in case the vm uses no GISA 3031 + * -ERANGE in case the guest ISC is invalid 3032 + */ 3033 + int kvm_s390_gisc_register(struct kvm *kvm, u32 gisc) 3034 + { 3035 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 3036 + 3037 + if (!gi->origin) 3038 + return -ENODEV; 3039 + if (gisc > MAX_ISC) 3040 + return -ERANGE; 3041 + 3042 + spin_lock(&gi->alert.ref_lock); 3043 + gi->alert.ref_count[gisc]++; 3044 + if (gi->alert.ref_count[gisc] == 1) { 3045 + gi->alert.mask |= 0x80 >> gisc; 3046 + gisa_set_iam(gi->origin, gi->alert.mask); 3047 + } 3048 + spin_unlock(&gi->alert.ref_lock); 3049 + 3050 + return gib->nisc; 3051 + } 3052 + EXPORT_SYMBOL_GPL(kvm_s390_gisc_register); 3053 + 3054 + /** 3055 + * kvm_s390_gisc_unregister - unregister a guest ISC 3056 + * 3057 + * @kvm: the kernel vm to work with 3058 + * @gisc: the guest interruption sub class to register 3059 + * 3060 + * The function reduces the vm specific alert mask to use. 3061 + * The effective IAM mask in the GISA is updated as well 3062 + * in case the GISA is not part of the GIB alert list. 3063 + * It will be updated latest when the IAM gets restored 3064 + * by gisa_get_ipm_or_restore_iam(). 3065 + * 3066 + * Returns: the nonspecific ISC (NISC) the gib alert mechanism 3067 + * has registered with the channel subsystem. 
3068 + * -ENODEV in case the vm uses no GISA 3069 + * -ERANGE in case the guest ISC is invalid 3070 + * -EINVAL in case the guest ISC is not registered 3071 + */ 3072 + int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc) 3073 + { 3074 + struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; 3075 + int rc = 0; 3076 + 3077 + if (!gi->origin) 3078 + return -ENODEV; 3079 + if (gisc > MAX_ISC) 3080 + return -ERANGE; 3081 + 3082 + spin_lock(&gi->alert.ref_lock); 3083 + if (gi->alert.ref_count[gisc] == 0) { 3084 + rc = -EINVAL; 3085 + goto out; 3086 + } 3087 + gi->alert.ref_count[gisc]--; 3088 + if (gi->alert.ref_count[gisc] == 0) { 3089 + gi->alert.mask &= ~(0x80 >> gisc); 3090 + gisa_set_iam(gi->origin, gi->alert.mask); 3091 + } 3092 + out: 3093 + spin_unlock(&gi->alert.ref_lock); 3094 + 3095 + return rc; 3096 + } 3097 + EXPORT_SYMBOL_GPL(kvm_s390_gisc_unregister); 3098 + 3099 + static void gib_alert_irq_handler(struct airq_struct *airq) 3100 + { 3101 + inc_irq_stat(IRQIO_GAL); 3102 + process_gib_alert_list(); 3103 + } 3104 + 3105 + static struct airq_struct gib_alert_irq = { 3106 + .handler = gib_alert_irq_handler, 3107 + .lsi_ptr = &gib_alert_irq.lsi_mask, 3108 + }; 3109 + 3110 + void kvm_s390_gib_destroy(void) 3111 + { 3112 + if (!gib) 3113 + return; 3114 + chsc_sgib(0); 3115 + unregister_adapter_interrupt(&gib_alert_irq); 3116 + free_page((unsigned long)gib); 3117 + gib = NULL; 3118 + } 3119 + 3120 + int kvm_s390_gib_init(u8 nisc) 3121 + { 3122 + int rc = 0; 3123 + 3124 + if (!css_general_characteristics.aiv) { 3125 + KVM_EVENT(3, "%s", "gib not initialized, no AIV facility"); 3126 + goto out; 3127 + } 3128 + 3129 + gib = (struct kvm_s390_gib *)get_zeroed_page(GFP_KERNEL | GFP_DMA); 3130 + if (!gib) { 3131 + rc = -ENOMEM; 3132 + goto out; 3133 + } 3134 + 3135 + gib_alert_irq.isc = nisc; 3136 + if (register_adapter_interrupt(&gib_alert_irq)) { 3137 + pr_err("Registering the GIB alert interruption handler failed\n"); 3138 + rc = -EIO; 3139 + goto out_free_gib; 
3140 + } 3141 + 3142 + gib->nisc = nisc; 3143 + if (chsc_sgib((u32)(u64)gib)) { 3144 + pr_err("Associating the GIB with the AIV facility failed\n"); 3145 + free_page((unsigned long)gib); 3146 + gib = NULL; 3147 + rc = -EIO; 3148 + goto out_unreg_gal; 3149 + } 3150 + 3151 + KVM_EVENT(3, "gib 0x%pK (nisc=%d) initialized", gib, gib->nisc); 3152 + goto out; 3153 + 3154 + out_unreg_gal: 3155 + unregister_adapter_interrupt(&gib_alert_irq); 3156 + out_free_gib: 3157 + free_page((unsigned long)gib); 3158 + gib = NULL; 3159 + out: 3160 + return rc; 3008 3161 }
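The `kvm_s390_gisc_register()`/`kvm_s390_gisc_unregister()` pair keeps a per-ISC reference count. It only changes the alert mask when a count goes from 0 to 1 or from 1 to 0. ISC 0 maps to the most significant bit of the 8-bit mask (`0x80 >> gisc`).

The following is a minimal user-space sketch of that bookkeeping. It rests on these assumptions:

- A pthread mutex replaces the kernel spinlock.
- A hypothetical `struct alert_state` replaces `struct kvm_s390_gisa_interrupt`.
- The GISA/IAM hardware update is omitted entirely.
- Registration returns 0 instead of the NISC.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_ISC 7  /* ISCs 0..7, matching the s390 code above */

/* Hypothetical stand-in for the alert part of struct kvm_s390_gisa_interrupt. */
struct alert_state {
	pthread_mutex_t lock;            /* plays the role of alert.ref_lock */
	uint32_t ref_count[MAX_ISC + 1]; /* users per guest ISC */
	uint8_t mask;                    /* bit (0x80 >> isc) set while isc is in use */
};

/* Take a reference on an ISC; set its mask bit on the first reference. */
static int gisc_register(struct alert_state *a, uint32_t gisc)
{
	if (gisc > MAX_ISC)
		return -ERANGE;
	pthread_mutex_lock(&a->lock);
	if (++a->ref_count[gisc] == 1)
		a->mask |= 0x80 >> gisc;
	pthread_mutex_unlock(&a->lock);
	return 0;
}

/* Drop a reference; clear the mask bit when the last reference goes away. */
static int gisc_unregister(struct alert_state *a, uint32_t gisc)
{
	int rc = 0;

	if (gisc > MAX_ISC)
		return -ERANGE;
	pthread_mutex_lock(&a->lock);
	if (a->ref_count[gisc] == 0)
		rc = -EINVAL;            /* ISC was never registered */
	else if (--a->ref_count[gisc] == 0)
		a->mask &= (uint8_t)~(0x80 >> gisc);
	pthread_mutex_unlock(&a->lock);
	return rc;
}
```

The mask bit follows the first registration and the last unregistration, not every call. Two users of the same ISC can therefore come and go independently.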

+173 -17

arch/s390/kvm/kvm-s390.c

··· 432 432 /* Register floating interrupt controller interface. */ 433 433 rc = kvm_register_device_ops(&kvm_flic_ops, KVM_DEV_TYPE_FLIC); 434 434 if (rc) { 435 - pr_err("Failed to register FLIC rc=%d\n", rc); 435 + pr_err("A FLIC registration call failed with rc=%d\n", rc); 436 436 goto out_debug_unreg; 437 437 } 438 + 439 + rc = kvm_s390_gib_init(GAL_ISC); 440 + if (rc) 441 + goto out_gib_destroy; 442 + 438 443 return 0; 439 444 445 + out_gib_destroy: 446 + kvm_s390_gib_destroy(); 440 447 out_debug_unreg: 441 448 debug_unregister(kvm_s390_dbf); 442 449 return rc; ··· 451 444 452 445 void kvm_arch_exit(void) 453 446 { 447 + kvm_s390_gib_destroy(); 454 448 debug_unregister(kvm_s390_dbf); 455 449 } 456 450 ··· 1266 1258 static int kvm_s390_set_processor_subfunc(struct kvm *kvm, 1267 1259 struct kvm_device_attr *attr) 1268 1260 { 1269 - /* 1270 - * Once supported by kernel + hw, we have to store the subfunctions 1271 - * in kvm->arch and remember that user space configured them. 1272 - */ 1273 - return -ENXIO; 1261 + mutex_lock(&kvm->lock); 1262 + if (kvm->created_vcpus) { 1263 + mutex_unlock(&kvm->lock); 1264 + return -EBUSY; 1265 + } 1266 + 1267 + if (copy_from_user(&kvm->arch.model.subfuncs, (void __user *)attr->addr, 1268 + sizeof(struct kvm_s390_vm_cpu_subfunc))) { 1269 + mutex_unlock(&kvm->lock); 1270 + return -EFAULT; 1271 + } 1272 + mutex_unlock(&kvm->lock); 1273 + 1274 + VM_EVENT(kvm, 3, "SET: guest PLO subfunc 0x%16.16lx.%16.16lx.%16.16lx.%16.16lx", 1275 + ((unsigned long *) &kvm->arch.model.subfuncs.plo)[0], 1276 + ((unsigned long *) &kvm->arch.model.subfuncs.plo)[1], 1277 + ((unsigned long *) &kvm->arch.model.subfuncs.plo)[2], 1278 + ((unsigned long *) &kvm->arch.model.subfuncs.plo)[3]); 1279 + VM_EVENT(kvm, 3, "SET: guest PTFF subfunc 0x%16.16lx.%16.16lx", 1280 + ((unsigned long *) &kvm->arch.model.subfuncs.ptff)[0], 1281 + ((unsigned long *) &kvm->arch.model.subfuncs.ptff)[1]); 1282 + VM_EVENT(kvm, 3, "SET: guest KMAC subfunc 0x%16.16lx.%16.16lx", 1283 
+ ((unsigned long *) &kvm->arch.model.subfuncs.kmac)[0], 1284 + ((unsigned long *) &kvm->arch.model.subfuncs.kmac)[1]); 1285 + VM_EVENT(kvm, 3, "SET: guest KMC subfunc 0x%16.16lx.%16.16lx", 1286 + ((unsigned long *) &kvm->arch.model.subfuncs.kmc)[0], 1287 + ((unsigned long *) &kvm->arch.model.subfuncs.kmc)[1]); 1288 + VM_EVENT(kvm, 3, "SET: guest KM subfunc 0x%16.16lx.%16.16lx", 1289 + ((unsigned long *) &kvm->arch.model.subfuncs.km)[0], 1290 + ((unsigned long *) &kvm->arch.model.subfuncs.km)[1]); 1291 + VM_EVENT(kvm, 3, "SET: guest KIMD subfunc 0x%16.16lx.%16.16lx", 1292 + ((unsigned long *) &kvm->arch.model.subfuncs.kimd)[0], 1293 + ((unsigned long *) &kvm->arch.model.subfuncs.kimd)[1]); 1294 + VM_EVENT(kvm, 3, "SET: guest KLMD subfunc 0x%16.16lx.%16.16lx", 1295 + ((unsigned long *) &kvm->arch.model.subfuncs.klmd)[0], 1296 + ((unsigned long *) &kvm->arch.model.subfuncs.klmd)[1]); 1297 + VM_EVENT(kvm, 3, "SET: guest PCKMO subfunc 0x%16.16lx.%16.16lx", 1298 + ((unsigned long *) &kvm->arch.model.subfuncs.pckmo)[0], 1299 + ((unsigned long *) &kvm->arch.model.subfuncs.pckmo)[1]); 1300 + VM_EVENT(kvm, 3, "SET: guest KMCTR subfunc 0x%16.16lx.%16.16lx", 1301 + ((unsigned long *) &kvm->arch.model.subfuncs.kmctr)[0], 1302 + ((unsigned long *) &kvm->arch.model.subfuncs.kmctr)[1]); 1303 + VM_EVENT(kvm, 3, "SET: guest KMF subfunc 0x%16.16lx.%16.16lx", 1304 + ((unsigned long *) &kvm->arch.model.subfuncs.kmf)[0], 1305 + ((unsigned long *) &kvm->arch.model.subfuncs.kmf)[1]); 1306 + VM_EVENT(kvm, 3, "SET: guest KMO subfunc 0x%16.16lx.%16.16lx", 1307 + ((unsigned long *) &kvm->arch.model.subfuncs.kmo)[0], 1308 + ((unsigned long *) &kvm->arch.model.subfuncs.kmo)[1]); 1309 + VM_EVENT(kvm, 3, "SET: guest PCC subfunc 0x%16.16lx.%16.16lx", 1310 + ((unsigned long *) &kvm->arch.model.subfuncs.pcc)[0], 1311 + ((unsigned long *) &kvm->arch.model.subfuncs.pcc)[1]); 1312 + VM_EVENT(kvm, 3, "SET: guest PPNO subfunc 0x%16.16lx.%16.16lx", 1313 + ((unsigned long *) 
&kvm->arch.model.subfuncs.ppno)[0], 1314 + ((unsigned long *) &kvm->arch.model.subfuncs.ppno)[1]); 1315 + VM_EVENT(kvm, 3, "SET: guest KMA subfunc 0x%16.16lx.%16.16lx", 1316 + ((unsigned long *) &kvm->arch.model.subfuncs.kma)[0], 1317 + ((unsigned long *) &kvm->arch.model.subfuncs.kma)[1]); 1318 + 1319 + return 0; 1274 1320 } 1275 1321 1276 1322 static int kvm_s390_set_cpu_model(struct kvm *kvm, struct kvm_device_attr *attr) ··· 1443 1381 static int kvm_s390_get_processor_subfunc(struct kvm *kvm, 1444 1382 struct kvm_device_attr *attr) 1445 1383 { 1446 - /* 1447 - * Once we can actually configure subfunctions (kernel + hw support), 1448 - * we have to check if they were already set by user space, if so copy 1449 - * them from kvm->arch. 1450 - */ 1451 - return -ENXIO; 1384 + if (copy_to_user((void __user *)attr->addr, &kvm->arch.model.subfuncs, 1385 + sizeof(struct kvm_s390_vm_cpu_subfunc))) 1386 + return -EFAULT; 1387 + 1388 + VM_EVENT(kvm, 3, "GET: guest PLO subfunc 0x%16.16lx.%16.16lx.%16.16lx.%16.16lx", 1389 + ((unsigned long *) &kvm->arch.model.subfuncs.plo)[0], 1390 + ((unsigned long *) &kvm->arch.model.subfuncs.plo)[1], 1391 + ((unsigned long *) &kvm->arch.model.subfuncs.plo)[2], 1392 + ((unsigned long *) &kvm->arch.model.subfuncs.plo)[3]); 1393 + VM_EVENT(kvm, 3, "GET: guest PTFF subfunc 0x%16.16lx.%16.16lx", 1394 + ((unsigned long *) &kvm->arch.model.subfuncs.ptff)[0], 1395 + ((unsigned long *) &kvm->arch.model.subfuncs.ptff)[1]); 1396 + VM_EVENT(kvm, 3, "GET: guest KMAC subfunc 0x%16.16lx.%16.16lx", 1397 + ((unsigned long *) &kvm->arch.model.subfuncs.kmac)[0], 1398 + ((unsigned long *) &kvm->arch.model.subfuncs.kmac)[1]); 1399 + VM_EVENT(kvm, 3, "GET: guest KMC subfunc 0x%16.16lx.%16.16lx", 1400 + ((unsigned long *) &kvm->arch.model.subfuncs.kmc)[0], 1401 + ((unsigned long *) &kvm->arch.model.subfuncs.kmc)[1]); 1402 + VM_EVENT(kvm, 3, "GET: guest KM subfunc 0x%16.16lx.%16.16lx", 1403 + ((unsigned long *) &kvm->arch.model.subfuncs.km)[0], 1404 + ((unsigned 
long *) &kvm->arch.model.subfuncs.km)[1]); 1405 + VM_EVENT(kvm, 3, "GET: guest KIMD subfunc 0x%16.16lx.%16.16lx", 1406 + ((unsigned long *) &kvm->arch.model.subfuncs.kimd)[0], 1407 + ((unsigned long *) &kvm->arch.model.subfuncs.kimd)[1]); 1408 + VM_EVENT(kvm, 3, "GET: guest KLMD subfunc 0x%16.16lx.%16.16lx", 1409 + ((unsigned long *) &kvm->arch.model.subfuncs.klmd)[0], 1410 + ((unsigned long *) &kvm->arch.model.subfuncs.klmd)[1]); 1411 + VM_EVENT(kvm, 3, "GET: guest PCKMO subfunc 0x%16.16lx.%16.16lx", 1412 + ((unsigned long *) &kvm->arch.model.subfuncs.pckmo)[0], 1413 + ((unsigned long *) &kvm->arch.model.subfuncs.pckmo)[1]); 1414 + VM_EVENT(kvm, 3, "GET: guest KMCTR subfunc 0x%16.16lx.%16.16lx", 1415 + ((unsigned long *) &kvm->arch.model.subfuncs.kmctr)[0], 1416 + ((unsigned long *) &kvm->arch.model.subfuncs.kmctr)[1]); 1417 + VM_EVENT(kvm, 3, "GET: guest KMF subfunc 0x%16.16lx.%16.16lx", 1418 + ((unsigned long *) &kvm->arch.model.subfuncs.kmf)[0], 1419 + ((unsigned long *) &kvm->arch.model.subfuncs.kmf)[1]); 1420 + VM_EVENT(kvm, 3, "GET: guest KMO subfunc 0x%16.16lx.%16.16lx", 1421 + ((unsigned long *) &kvm->arch.model.subfuncs.kmo)[0], 1422 + ((unsigned long *) &kvm->arch.model.subfuncs.kmo)[1]); 1423 + VM_EVENT(kvm, 3, "GET: guest PCC subfunc 0x%16.16lx.%16.16lx", 1424 + ((unsigned long *) &kvm->arch.model.subfuncs.pcc)[0], 1425 + ((unsigned long *) &kvm->arch.model.subfuncs.pcc)[1]); 1426 + VM_EVENT(kvm, 3, "GET: guest PPNO subfunc 0x%16.16lx.%16.16lx", 1427 + ((unsigned long *) &kvm->arch.model.subfuncs.ppno)[0], 1428 + ((unsigned long *) &kvm->arch.model.subfuncs.ppno)[1]); 1429 + VM_EVENT(kvm, 3, "GET: guest KMA subfunc 0x%16.16lx.%16.16lx", 1430 + ((unsigned long *) &kvm->arch.model.subfuncs.kma)[0], 1431 + ((unsigned long *) &kvm->arch.model.subfuncs.kma)[1]); 1432 + 1433 + return 0; 1452 1434 } 1453 1435 1454 1436 static int kvm_s390_get_machine_subfunc(struct kvm *kvm, ··· 1501 1395 if (copy_to_user((void __user *)attr->addr, 
&kvm_s390_available_subfunc, 1502 1396 sizeof(struct kvm_s390_vm_cpu_subfunc))) 1503 1397 return -EFAULT; 1398 + 1399 + VM_EVENT(kvm, 3, "GET: host PLO subfunc 0x%16.16lx.%16.16lx.%16.16lx.%16.16lx", 1400 + ((unsigned long *) &kvm_s390_available_subfunc.plo)[0], 1401 + ((unsigned long *) &kvm_s390_available_subfunc.plo)[1], 1402 + ((unsigned long *) &kvm_s390_available_subfunc.plo)[2], 1403 + ((unsigned long *) &kvm_s390_available_subfunc.plo)[3]); 1404 + VM_EVENT(kvm, 3, "GET: host PTFF subfunc 0x%16.16lx.%16.16lx", 1405 + ((unsigned long *) &kvm_s390_available_subfunc.ptff)[0], 1406 + ((unsigned long *) &kvm_s390_available_subfunc.ptff)[1]); 1407 + VM_EVENT(kvm, 3, "GET: host KMAC subfunc 0x%16.16lx.%16.16lx", 1408 + ((unsigned long *) &kvm_s390_available_subfunc.kmac)[0], 1409 + ((unsigned long *) &kvm_s390_available_subfunc.kmac)[1]); 1410 + VM_EVENT(kvm, 3, "GET: host KMC subfunc 0x%16.16lx.%16.16lx", 1411 + ((unsigned long *) &kvm_s390_available_subfunc.kmc)[0], 1412 + ((unsigned long *) &kvm_s390_available_subfunc.kmc)[1]); 1413 + VM_EVENT(kvm, 3, "GET: host KM subfunc 0x%16.16lx.%16.16lx", 1414 + ((unsigned long *) &kvm_s390_available_subfunc.km)[0], 1415 + ((unsigned long *) &kvm_s390_available_subfunc.km)[1]); 1416 + VM_EVENT(kvm, 3, "GET: host KIMD subfunc 0x%16.16lx.%16.16lx", 1417 + ((unsigned long *) &kvm_s390_available_subfunc.kimd)[0], 1418 + ((unsigned long *) &kvm_s390_available_subfunc.kimd)[1]); 1419 + VM_EVENT(kvm, 3, "GET: host KLMD subfunc 0x%16.16lx.%16.16lx", 1420 + ((unsigned long *) &kvm_s390_available_subfunc.klmd)[0], 1421 + ((unsigned long *) &kvm_s390_available_subfunc.klmd)[1]); 1422 + VM_EVENT(kvm, 3, "GET: host PCKMO subfunc 0x%16.16lx.%16.16lx", 1423 + ((unsigned long *) &kvm_s390_available_subfunc.pckmo)[0], 1424 + ((unsigned long *) &kvm_s390_available_subfunc.pckmo)[1]); 1425 + VM_EVENT(kvm, 3, "GET: host KMCTR subfunc 0x%16.16lx.%16.16lx", 1426 + ((unsigned long *) &kvm_s390_available_subfunc.kmctr)[0], 1427 + ((unsigned long 
*) &kvm_s390_available_subfunc.kmctr)[1]); 1428 + VM_EVENT(kvm, 3, "GET: host KMF subfunc 0x%16.16lx.%16.16lx", 1429 + ((unsigned long *) &kvm_s390_available_subfunc.kmf)[0], 1430 + ((unsigned long *) &kvm_s390_available_subfunc.kmf)[1]); 1431 + VM_EVENT(kvm, 3, "GET: host KMO subfunc 0x%16.16lx.%16.16lx", 1432 + ((unsigned long *) &kvm_s390_available_subfunc.kmo)[0], 1433 + ((unsigned long *) &kvm_s390_available_subfunc.kmo)[1]); 1434 + VM_EVENT(kvm, 3, "GET: host PCC subfunc 0x%16.16lx.%16.16lx", 1435 + ((unsigned long *) &kvm_s390_available_subfunc.pcc)[0], 1436 + ((unsigned long *) &kvm_s390_available_subfunc.pcc)[1]); 1437 + VM_EVENT(kvm, 3, "GET: host PPNO subfunc 0x%16.16lx.%16.16lx", 1438 + ((unsigned long *) &kvm_s390_available_subfunc.ppno)[0], 1439 + ((unsigned long *) &kvm_s390_available_subfunc.ppno)[1]); 1440 + VM_EVENT(kvm, 3, "GET: host KMA subfunc 0x%16.16lx.%16.16lx", 1441 + ((unsigned long *) &kvm_s390_available_subfunc.kma)[0], 1442 + ((unsigned long *) &kvm_s390_available_subfunc.kma)[1]); 1443 + 1504 1444 return 0; 1505 1445 } 1446 + 1506 1447 static int kvm_s390_get_cpu_model(struct kvm *kvm, struct kvm_device_attr *attr) 1507 1448 { 1508 1449 int ret = -ENXIO; ··· 1667 1514 case KVM_S390_VM_CPU_PROCESSOR_FEAT: 1668 1515 case KVM_S390_VM_CPU_MACHINE_FEAT: 1669 1516 case KVM_S390_VM_CPU_MACHINE_SUBFUNC: 1517 + case KVM_S390_VM_CPU_PROCESSOR_SUBFUNC: 1670 1518 ret = 0; 1671 1519 break; 1672 - /* configuring subfunctions is not supported yet */ 1673 - case KVM_S390_VM_CPU_PROCESSOR_SUBFUNC: 1674 1520 default: 1675 1521 ret = -ENXIO; 1676 1522 break; ··· 2361 2209 if (!kvm->arch.sie_page2) 2362 2210 goto out_err; 2363 2211 2212 + kvm->arch.sie_page2->kvm = kvm; 2364 2213 kvm->arch.model.fac_list = kvm->arch.sie_page2->fac_list; 2365 2214 2366 2215 for (i = 0; i < kvm_s390_fac_size(); i++) { ··· 2371 2218 kvm->arch.model.fac_list[i] = S390_lowcore.stfle_fac_list[i] & 2372 2219 kvm_s390_fac_base[i]; 2373 2220 } 2221 + kvm->arch.model.subfuncs = 
kvm_s390_available_subfunc; 2374 2222 2375 2223 /* we are always in czam mode - even on pre z14 machines */ 2376 2224 set_kvm_facility(kvm->arch.model.fac_mask, 138); ··· 2966 2812 2967 2813 vcpu->arch.sie_block->icpua = id; 2968 2814 spin_lock_init(&vcpu->arch.local_int.lock); 2969 - vcpu->arch.sie_block->gd = (u32)(u64)kvm->arch.gisa; 2815 + vcpu->arch.sie_block->gd = (u32)(u64)kvm->arch.gisa_int.origin; 2970 2816 if (vcpu->arch.sie_block->gd && sclp.has_gisaf) 2971 2817 vcpu->arch.sie_block->gd |= GISA_FORMAT1; 2972 2818 seqcount_init(&vcpu->arch.cputm_seqcount); ··· 3611 3457 kvm_s390_backup_guest_per_regs(vcpu); 3612 3458 kvm_s390_patch_guest_per_regs(vcpu); 3613 3459 } 3460 + 3461 + clear_bit(vcpu->vcpu_id, vcpu->kvm->arch.gisa_int.kicked_mask); 3614 3462 3615 3463 vcpu->arch.sie_block->icptcode = 0; 3616 3464 cpuflags = atomic_read(&vcpu->arch.sie_block->cpuflags); ··· 4449 4293 int i; 4450 4294 4451 4295 if (!sclp.has_sief2) { 4452 - pr_info("SIE not available\n"); 4296 + pr_info("SIE is not available\n"); 4453 4297 return -ENODEV; 4454 4298 } 4455 4299 4456 4300 if (nested && hpage) { 4457 - pr_info("nested (vSIE) and hpage (huge page backing) can currently not be activated concurrently"); 4301 + pr_info("A KVM host that supports nesting cannot back its KVM guests with huge pages\n"); 4458 4302 return -EINVAL; 4459 4303 } 4460 4304
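With this change `KVM_S390_VM_CPU_PROCESSOR_SUBFUNC` becomes settable. `kvm_s390_set_processor_subfunc()` copies the user-supplied block into `kvm->arch.model.subfuncs` only while no vCPU has been created, and returns `-EBUSY` otherwise. Because `kvm_arch_init_vm()` now seeds `kvm->arch.model.subfuncs` from `kvm_s390_available_subfunc`, a VM that never sets the attribute reports the host's subfunctions.

The sketch below shows this "configure before the first vCPU" rule in user space. It assumes the following:

- `struct subfuncs`, `struct vm_model`, `vm_set_subfuncs()` and `vm_create_vcpu()` are hypothetical names.
- `struct subfuncs` keeps only two of the real fields.
- `memcpy()` stands in for `copy_from_user()`.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct kvm_s390_vm_cpu_subfunc. */
struct subfuncs {
	uint8_t plo[32];  /* 256-bit PLO query block */
	uint8_t kma[16];  /* 128-bit KMA query block */
};

/* Hypothetical stand-in for the relevant parts of struct kvm. */
struct vm_model {
	pthread_mutex_t lock;       /* plays the role of kvm->lock */
	unsigned int created_vcpus; /* incremented under lock on vCPU creation */
	struct subfuncs subfuncs;   /* guest-visible subfunctions */
};

/* Accept a new subfunction block only while no vCPU exists yet. */
static int vm_set_subfuncs(struct vm_model *vm, const struct subfuncs *src)
{
	pthread_mutex_lock(&vm->lock);
	if (vm->created_vcpus) {
		pthread_mutex_unlock(&vm->lock);
		return -EBUSY;
	}
	memcpy(&vm->subfuncs, src, sizeof(*src)); /* copy_from_user() in the kernel */
	pthread_mutex_unlock(&vm->lock);
	return 0;
}

/* Creating a vCPU freezes the CPU model from then on. */
static void vm_create_vcpu(struct vm_model *vm)
{
	pthread_mutex_lock(&vm->lock);
	vm->created_vcpus++;
	pthread_mutex_unlock(&vm->lock);
}
```

Holding the same lock across both the `created_vcpus` check and the copy prevents a concurrent vCPU creation from slipping in between them.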

+3 -1

arch/s390/kvm/kvm-s390.h

··· 67 67 68 68 static inline int is_vcpu_idle(struct kvm_vcpu *vcpu) 69 69 { 70 - return test_bit(vcpu->vcpu_id, vcpu->kvm->arch.float_int.idle_mask); 70 + return test_bit(vcpu->vcpu_id, vcpu->kvm->arch.idle_mask); 71 71 } 72 72 73 73 static inline int kvm_is_ucontrol(struct kvm *kvm) ··· 381 381 void kvm_s390_gisa_init(struct kvm *kvm); 382 382 void kvm_s390_gisa_clear(struct kvm *kvm); 383 383 void kvm_s390_gisa_destroy(struct kvm *kvm); 384 + int kvm_s390_gib_init(u8 nisc); 385 + void kvm_s390_gib_destroy(void); 384 386 385 387 /* implemented in guestdbg.c */ 386 388 void kvm_s390_backup_guest_per_regs(struct kvm_vcpu *vcpu);
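`is_vcpu_idle()` now reads the VM-wide `kvm->arch.idle_mask`. This is the same bitmap `__floating_irq_kick()` scans to pick a target: it takes the first idle vCPU if there is one, and otherwise the next vCPU in round-robin order.

The sketch below shows that selection over a single 64-bit word. It assumes the following:

- There are at most 64 vCPUs.
- `struct vm_sched` and `pick_vcpu()` are hypothetical names.
- A `stopped_mask` stands in for `is_vcpu_stopped()`.

`__builtin_ctzll` is a GCC/Clang builtin used here in place of `find_first_bit()`.

```c
#include <stdint.h>

/* Hypothetical per-VM state; bit n of each mask describes vCPU n. */
struct vm_sched {
	uint64_t idle_mask;       /* vCPUs currently waiting for interrupts */
	uint64_t stopped_mask;    /* vCPUs that must not be chosen round-robin */
	unsigned int next_rr_cpu; /* round-robin cursor */
};

/*
 * Return the vCPU to signal, or -1 if none is eligible (assumes
 * online_vcpus <= 64). Prefer the lowest-numbered idle vCPU; otherwise
 * walk the round-robin cursor, trying each vCPU at most once.
 */
static int pick_vcpu(struct vm_sched *s, unsigned int online_vcpus)
{
	uint64_t online = online_vcpus >= 64 ? ~0ULL
					     : (1ULL << online_vcpus) - 1;
	uint64_t idle = s->idle_mask & online;
	unsigned int tries;

	if (idle)
		return __builtin_ctzll(idle);  /* like find_first_bit() */

	for (tries = 0; tries < online_vcpus; tries++) {
		unsigned int cpu = s->next_rr_cpu++;

		s->next_rr_cpu %= online_vcpus;
		if (!(s->stopped_mask & (1ULL << cpu)))
			return (int)cpu;
	}
	return -1;  /* all candidates stopped */
}
```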

+19 -23

arch/x86/include/asm/kvm_host.h

··· 35 35 #include <asm/msr-index.h> 36 36 #include <asm/asm.h> 37 37 #include <asm/kvm_page_track.h> 38 + #include <asm/kvm_vcpu_regs.h> 38 39 #include <asm/hyperv-tlfs.h> 39 40 40 41 #define KVM_MAX_VCPUS 288 ··· 138 137 #define ASYNC_PF_PER_VCPU 64 139 138 140 139 enum kvm_reg { 141 - VCPU_REGS_RAX = 0, 142 - VCPU_REGS_RCX = 1, 143 - VCPU_REGS_RDX = 2, 144 - VCPU_REGS_RBX = 3, 145 - VCPU_REGS_RSP = 4, 146 - VCPU_REGS_RBP = 5, 147 - VCPU_REGS_RSI = 6, 148 - VCPU_REGS_RDI = 7, 140 + VCPU_REGS_RAX = __VCPU_REGS_RAX, 141 + VCPU_REGS_RCX = __VCPU_REGS_RCX, 142 + VCPU_REGS_RDX = __VCPU_REGS_RDX, 143 + VCPU_REGS_RBX = __VCPU_REGS_RBX, 144 + VCPU_REGS_RSP = __VCPU_REGS_RSP, 145 + VCPU_REGS_RBP = __VCPU_REGS_RBP, 146 + VCPU_REGS_RSI = __VCPU_REGS_RSI, 147 + VCPU_REGS_RDI = __VCPU_REGS_RDI, 149 148 #ifdef CONFIG_X86_64 150 - VCPU_REGS_R8 = 8, 151 - VCPU_REGS_R9 = 9, 152 - VCPU_REGS_R10 = 10, 153 - VCPU_REGS_R11 = 11, 154 - VCPU_REGS_R12 = 12, 155 - VCPU_REGS_R13 = 13, 156 - VCPU_REGS_R14 = 14, 157 - VCPU_REGS_R15 = 15, 149 + VCPU_REGS_R8 = __VCPU_REGS_R8, 150 + VCPU_REGS_R9 = __VCPU_REGS_R9, 151 + VCPU_REGS_R10 = __VCPU_REGS_R10, 152 + VCPU_REGS_R11 = __VCPU_REGS_R11, 153 + VCPU_REGS_R12 = __VCPU_REGS_R12, 154 + VCPU_REGS_R13 = __VCPU_REGS_R13, 155 + VCPU_REGS_R14 = __VCPU_REGS_R14, 156 + VCPU_REGS_R15 = __VCPU_REGS_R15, 158 157 #endif 159 158 VCPU_REGS_RIP, 160 159 NR_VCPU_REGS ··· 320 319 struct list_head link; 321 320 struct hlist_node hash_link; 322 321 bool unsync; 322 + bool mmio_cached; 323 323 324 324 /* 325 325 * The following two entries are used to key the shadow page in the ··· 335 333 int root_count; /* Currently serving as active root */ 336 334 unsigned int unsync_children; 337 335 struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ 338 - 339 - /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. 
*/ 340 - unsigned long mmu_valid_gen; 341 - 342 336 DECLARE_BITMAP(unsync_child_bitmap, 512); 343 337 344 338 #ifdef CONFIG_X86_32 ··· 846 848 unsigned int n_requested_mmu_pages; 847 849 unsigned int n_max_mmu_pages; 848 850 unsigned int indirect_shadow_pages; 849 - unsigned long mmu_valid_gen; 850 851 struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; 851 852 /* 852 853 * Hash table of struct kvm_mmu_page. 853 854 */ 854 855 struct list_head active_mmu_pages; 855 - struct list_head zapped_obsolete_pages; 856 856 struct kvm_page_track_notifier_node mmu_sp_tracker; 857 857 struct kvm_page_track_notifier_head track_notifier_head; 858 858 ··· 1251 1255 struct kvm_memory_slot *slot, 1252 1256 gfn_t gfn_offset, unsigned long mask); 1253 1257 void kvm_mmu_zap_all(struct kvm *kvm); 1254 - void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots); 1258 + void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen); 1255 1259 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm); 1256 1260 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages); 1257 1261

+25

arch/x86/include/asm/kvm_vcpu_regs.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _ASM_X86_KVM_VCPU_REGS_H 3 + #define _ASM_X86_KVM_VCPU_REGS_H 4 + 5 + #define __VCPU_REGS_RAX 0 6 + #define __VCPU_REGS_RCX 1 7 + #define __VCPU_REGS_RDX 2 8 + #define __VCPU_REGS_RBX 3 9 + #define __VCPU_REGS_RSP 4 10 + #define __VCPU_REGS_RBP 5 11 + #define __VCPU_REGS_RSI 6 12 + #define __VCPU_REGS_RDI 7 13 + 14 + #ifdef CONFIG_X86_64 15 + #define __VCPU_REGS_R8 8 16 + #define __VCPU_REGS_R9 9 17 + #define __VCPU_REGS_R10 10 18 + #define __VCPU_REGS_R11 11 19 + #define __VCPU_REGS_R12 12 20 + #define __VCPU_REGS_R13 13 21 + #define __VCPU_REGS_R14 14 22 + #define __VCPU_REGS_R15 15 23 + #endif 24 + 25 + #endif /* _ASM_X86_KVM_VCPU_REGS_H */
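The new `asm/kvm_vcpu_regs.h` holds the register indices as plain `#define`s, and `enum kvm_reg` in kvm_host.h is now defined in terms of them. Presumably this lets assembly sources, which cannot see a C `enum`, use the same numbers.

The illustration below is not KVM code. It shows two things:

- how such an index becomes a byte offset into an array of `unsigned long` registers;
- how `_Static_assert` can tie the C layout to the macro-derived offset.

`struct vcpu_regs`, `REG_OFFSET()` and `NR_DEMO_REGS` are hypothetical names, and only four registers are mirrored.

```c
#include <stddef.h>

/* Plain macros: usable from C and, via the preprocessor, from assembly. */
#define __VCPU_REGS_RAX 0
#define __VCPU_REGS_RCX 1
#define __VCPU_REGS_RDX 2
#define __VCPU_REGS_RBX 3

/* Trimmed copy of the C-side enum, mirroring the macros as kvm_host.h now does. */
enum kvm_reg {
	VCPU_REGS_RAX = __VCPU_REGS_RAX,
	VCPU_REGS_RCX = __VCPU_REGS_RCX,
	VCPU_REGS_RDX = __VCPU_REGS_RDX,
	VCPU_REGS_RBX = __VCPU_REGS_RBX,
	NR_DEMO_REGS  /* hypothetical count for this sketch */
};

/* Hypothetical register file indexed by enum kvm_reg. */
struct vcpu_regs {
	unsigned long regs[NR_DEMO_REGS];
};

/* Byte offset of a register slot; assembly would use a word-size constant. */
#define REG_OFFSET(idx) ((idx) * sizeof(unsigned long))

/* Fail the build if the macro-derived offset ever disagrees with the C layout. */
_Static_assert(REG_OFFSET(__VCPU_REGS_RBX) ==
	       offsetof(struct vcpu_regs, regs[VCPU_REGS_RBX]),
	       "macro-derived offset must match the C layout");
```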

+15 -5

arch/x86/kernel/kvmclock.c

··· 104 104 105 105 static inline void kvm_sched_clock_init(bool stable) 106 106 { 107 - if (!stable) { 108 - pv_ops.time.sched_clock = kvm_clock_read; 107 + if (!stable) 109 108 clear_sched_clock_stable(); 110 - return; 111 - } 112 - 113 109 kvm_sched_clock_offset = kvm_clock_read(); 114 110 pv_ops.time.sched_clock = kvm_sched_clock_read; 115 111 ··· 351 355 machine_ops.crash_shutdown = kvm_crash_shutdown; 352 356 #endif 353 357 kvm_get_preset_lpj(); 358 + 359 + /* 360 + * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate 361 + * with P/T states and does not stop in deep C-states. 362 + * 363 + * Invariant TSC exposed by host means kvmclock is not necessary: 364 + * can use TSC as clocksource. 365 + * 366 + */ 367 + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && 368 + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && 369 + !check_tsc_unstable()) 370 + kvm_clock.rating = 299; 371 + 354 372 clocksource_register_hz(&kvm_clock, NSEC_PER_SEC); 355 373 pv_info.name = "KVM"; 356 374 }
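The new check in kvmclock.c lowers `kvm_clock.rating` to 299 under three conditions:

- the host exposes `X86_FEATURE_CONSTANT_TSC`;
- the host exposes `X86_FEATURE_NONSTOP_TSC`;
- the TSC has not been marked unstable.

The clocksource core prefers the highest-rated registered clocksource, so this lets the TSC clocksource take over. The ratings used below are assumptions taken from mainline kernels of that period, not from this diff: 400 for kvm-clock's default and 300 for tsc.

The sketch below shows the selection rule. `struct cs`, `adjust_kvm_clock()` and `pick_best()` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down clocksource descriptor. */
struct cs {
	const char *name;
	int rating;  /* higher is preferred */
};

/* Mirror of the new kvmclock rule: demote kvm-clock just below tsc. */
static void adjust_kvm_clock(struct cs *kvm_clock, bool constant_tsc,
			     bool nonstop_tsc, bool tsc_unstable)
{
	if (constant_tsc && nonstop_tsc && !tsc_unstable)
		kvm_clock->rating = 299;
}

/* Pick the highest-rated clocksource from a list. */
static const struct cs *pick_best(const struct cs *list, size_t n)
{
	const struct cs *best = NULL;

	for (size_t i = 0; i < n; i++)
		if (!best || list[i].rating > best->rating)
			best = &list[i];
	return best;
}
```

Choosing 299 rather than a much lower value keeps kvm-clock as the next choice if the TSC clocksource is later disqualified.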

+1 -1

arch/x86/kvm/cpuid.c

··· 405 405
 		F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ |
 		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
 		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
-		F(CLDEMOTE);
+		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B);
 
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
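MOVDIRI and MOVDIR64B are enumerated in CPUID leaf 7, subleaf 0, ECX, next to CLDEMOTE. The sketch decodes those three bits from a raw ECX value. The bit positions used (CLDEMOTE = 25, MOVDIRI = 27, MOVDIR64B = 28) follow the x86 cpufeatures definitions and are stated here as assumptions; the `demo_*` names are hypothetical. In user space on x86, GCC and Clang can fetch the raw value with `__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`. Roughly, KVM only reports a bit as supported when it appears in this mask and the host CPU has it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed CPUID.(EAX=7,ECX=0):ECX bit positions. */
#define DEMO_CPUID7_ECX_CLDEMOTE  25
#define DEMO_CPUID7_ECX_MOVDIRI   27
#define DEMO_CPUID7_ECX_MOVDIR64B 28

struct demo_leaf7_caps {
	bool cldemote;
	bool movdiri;
	bool movdir64b;
};

static bool demo_bit(uint32_t reg, unsigned int bit)
{
	assert(bit < 32);
	return (reg >> bit) & 1u;
}

/* Decode the three features from a raw leaf-7 ECX value. */
static struct demo_leaf7_caps demo_decode_leaf7_ecx(uint32_t ecx)
{
	struct demo_leaf7_caps caps = {
		.cldemote  = demo_bit(ecx, DEMO_CPUID7_ECX_CLDEMOTE),
		.movdiri   = demo_bit(ecx, DEMO_CPUID7_ECX_MOVDIRI),
		.movdir64b = demo_bit(ecx, DEMO_CPUID7_ECX_MOVDIR64B),
	};
	return caps;
}
```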

+1 -1

arch/x86/kvm/hyperv.c

··· 1729 1729
 
 	mutex_lock(&hv->hv_lock);
 	ret = idr_alloc(&hv->conn_to_evt, eventfd, conn_id, conn_id + 1,
-			GFP_KERNEL);
+			GFP_KERNEL_ACCOUNT);
 	mutex_unlock(&hv->hv_lock);
 
 	if (ret >= 0)

+1 -1

arch/x86/kvm/i8254.c

··· 653 653
 	pid_t pid_nr;
 	int ret;
 
-	pit = kzalloc(sizeof(struct kvm_pit), GFP_KERNEL);
+	pit = kzalloc(sizeof(struct kvm_pit), GFP_KERNEL_ACCOUNT);
 	if (!pit)
 		return NULL;
 

+1 -1

arch/x86/kvm/i8259.c

··· 583 583
 	struct kvm_pic *s;
 	int ret;
 
-	s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL);
+	s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL_ACCOUNT);
 	if (!s)
 		return -ENOMEM;
 	spin_lock_init(&s->lock);

+1 -1

arch/x86/kvm/ioapic.c

··· 622 622
 	struct kvm_ioapic *ioapic;
 	int ret;
 
-	ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL);
+	ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL_ACCOUNT);
 	if (!ioapic)
 		return -ENOMEM;
 	spin_lock_init(&ioapic->lock);

+4 -3

arch/x86/kvm/lapic.c

··· 181 181
 		max_id = max(max_id, kvm_x2apic_id(vcpu->arch.apic));
 
 	new = kvzalloc(sizeof(struct kvm_apic_map) +
-	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1), GFP_KERNEL);
+	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1),
+			   GFP_KERNEL_ACCOUNT);
 
 	if (!new)
 		goto out;
··· 2260 2259
 	ASSERT(vcpu != NULL);
 	apic_debug("apic_init %d\n", vcpu->vcpu_id);
 
-	apic = kzalloc(sizeof(*apic), GFP_KERNEL);
+	apic = kzalloc(sizeof(*apic), GFP_KERNEL_ACCOUNT);
 	if (!apic)
 		goto nomem;
 
 	vcpu->arch.apic = apic;
 
-	apic->regs = (void *)get_zeroed_page(GFP_KERNEL);
+	apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 	if (!apic->regs) {
 		printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
 		       vcpu->vcpu_id);

+227 -253

arch/x86/kvm/mmu.c

··· 109 109
 	(((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
 
 
-#define PT64_BASE_ADDR_MASK __sme_clr((((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
-#define PT64_DIR_BASE_ADDR_MASK \
-	(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1))
+#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
+#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
+#else
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#endif
 #define PT64_LVL_ADDR_MASK(level) \
 	(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
 						* PT64_LEVEL_BITS))) - 1))
··· 332 330
 }
 
 /*
- * the low bit of the generation number is always presumed to be zero.
- * This disables mmio caching during memslot updates. The concept is
- * similar to a seqcount but instead of retrying the access we just punt
- * and ignore the cache.
+ * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
+ * the memslots generation and is derived as follows:
  *
- * spte bits 3-11 are used as bits 1-9 of the generation number,
- * the bits 52-61 are used as bits 10-19 of the generation number.
+ * Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
+ * Bits 9-18 of the MMIO generation are propagated to spte bits 52-61
+ *
+ * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
+ * the MMIO generation number, as doing so would require stealing a bit from
+ * the "real" generation number and thus effectively halve the maximum number
+ * of MMIO generations that can be handled before encountering a wrap (which
+ * requires a full MMU zap). The flag is instead explicitly queried when
+ * checking for MMIO spte cache hits.
  */
-#define MMIO_SPTE_GEN_LOW_SHIFT 2
-#define MMIO_SPTE_GEN_HIGH_SHIFT 52
+#define MMIO_SPTE_GEN_MASK GENMASK_ULL(18, 0)
 
-#define MMIO_GEN_SHIFT 20
-#define MMIO_GEN_LOW_SHIFT 10
-#define MMIO_GEN_LOW_MASK ((1 << MMIO_GEN_LOW_SHIFT) - 2)
-#define MMIO_GEN_MASK ((1 << MMIO_GEN_SHIFT) - 1)
+#define MMIO_SPTE_GEN_LOW_START 3
+#define MMIO_SPTE_GEN_LOW_END 11
+#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
+						    MMIO_SPTE_GEN_LOW_START)
 
-static u64 generation_mmio_spte_mask(unsigned int gen)
+#define MMIO_SPTE_GEN_HIGH_START 52
+#define MMIO_SPTE_GEN_HIGH_END 61
+#define MMIO_SPTE_GEN_HIGH_MASK GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
+						    MMIO_SPTE_GEN_HIGH_START)
+static u64 generation_mmio_spte_mask(u64 gen)
 {
 	u64 mask;
 
-	WARN_ON(gen & ~MMIO_GEN_MASK);
+	WARN_ON(gen & ~MMIO_SPTE_GEN_MASK);
 
-	mask = (gen & MMIO_GEN_LOW_MASK) << MMIO_SPTE_GEN_LOW_SHIFT;
-	mask |= ((u64)gen >> MMIO_GEN_LOW_SHIFT) << MMIO_SPTE_GEN_HIGH_SHIFT;
+	mask = (gen << MMIO_SPTE_GEN_LOW_START) & MMIO_SPTE_GEN_LOW_MASK;
+	mask |= (gen << MMIO_SPTE_GEN_HIGH_START) & MMIO_SPTE_GEN_HIGH_MASK;
 	return mask;
 }
 
-static unsigned int get_mmio_spte_generation(u64 spte)
+static u64 get_mmio_spte_generation(u64 spte)
 {
-	unsigned int gen;
+	u64 gen;
 
 	spte &= ~shadow_mmio_mask;
 
-	gen = (spte >> MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_GEN_LOW_MASK;
-	gen |= (spte >> MMIO_SPTE_GEN_HIGH_SHIFT) << MMIO_GEN_LOW_SHIFT;
+	gen = (spte & MMIO_SPTE_GEN_LOW_MASK) >> MMIO_SPTE_GEN_LOW_START;
+	gen |= (spte & MMIO_SPTE_GEN_HIGH_MASK) >> MMIO_SPTE_GEN_HIGH_START;
 	return gen;
-}
-
-static unsigned int kvm_current_mmio_generation(struct kvm_vcpu *vcpu)
-{
-	return kvm_vcpu_memslots(vcpu)->generation & MMIO_GEN_MASK;
 }
 
 static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
 			   unsigned access)
 {
-	unsigned int gen = kvm_current_mmio_generation(vcpu);
+	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
 	u64 mask = generation_mmio_spte_mask(gen);
 	u64 gpa = gfn << PAGE_SHIFT;
 
··· 390 385
 	mask |= gpa | shadow_nonpresent_or_rsvd_mask;
 	mask |= (gpa & shadow_nonpresent_or_rsvd_mask)
 		<< shadow_nonpresent_or_rsvd_mask_len;
+
+	page_header(__pa(sptep))->mmio_cached = true;
 
 	trace_mark_mmio_spte(sptep, gfn, access, gen);
 	mmu_spte_set(sptep, mask);
··· 414 407
 
 static unsigned get_mmio_spte_access(u64 spte)
 {
-	u64 mask = generation_mmio_spte_mask(MMIO_GEN_MASK) | shadow_mmio_mask;
+	u64 mask = generation_mmio_spte_mask(MMIO_SPTE_GEN_MASK) | shadow_mmio_mask;
 	return (spte & ~mask) & ~PAGE_MASK;
 }
 
··· 431 424
 
 static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
 {
-	unsigned int kvm_gen, spte_gen;
+	u64 kvm_gen, spte_gen, gen;
 
-	kvm_gen = kvm_current_mmio_generation(vcpu);
+	gen = kvm_vcpu_memslots(vcpu)->generation;
+	if (unlikely(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS))
+		return false;
+
+	kvm_gen = gen & MMIO_SPTE_GEN_MASK;
 	spte_gen = get_mmio_spte_generation(spte);
 
 	trace_check_mmio_spte(spte, kvm_gen, spte_gen);
··· 970 959
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+		obj = kmem_cache_zalloc(base_cache, GFP_KERNEL_ACCOUNT);
 		if (!obj)
 			return cache->nobjs >= min ? 0 : -ENOMEM;
 		cache->objects[cache->nobjs++] = obj;
··· 2060 2049
 	if (!direct)
 		sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
-
-	/*
-	 * The active_mmu_pages list is the FIFO list, do not move the
-	 * page until it is zapped. kvm_zap_obsolete_pages depends on
-	 * this feature. See the comments in kvm_zap_obsolete_pages().
-	 */
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 	return sp;
··· 2200 2195
 	--kvm->stat.mmu_unsync;
 }
 
-static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-				    struct list_head *invalid_list);
+static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				     struct list_head *invalid_list);
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 				    struct list_head *invalid_list);
 
-/*
- * NOTE: we should pay more attention on the zapped-obsolete page
- * (is_obsolete_sp(sp) && sp->role.invalid) when you do hash list walk
- * since it has been deleted from active_mmu_pages but still can be found
- * at hast list.
- *
- * for_each_valid_sp() has skipped that kind of pages.
- */
 #define for_each_valid_sp(_kvm, _sp, _gfn) \
 	hlist_for_each_entry(_sp, \
 	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
-		if (is_obsolete_sp((_kvm), (_sp)) || (_sp)->role.invalid) { \
+		if ((_sp)->role.invalid) { \
 		} else
 
 #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
··· 2228 2231
 	return true;
 }
 
+static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
+					struct list_head *invalid_list,
+					bool remote_flush)
+{
+	if (!remote_flush && !list_empty(invalid_list))
+		return false;
+
+	if (!list_empty(invalid_list))
+		kvm_mmu_commit_zap_page(kvm, invalid_list);
+	else
+		kvm_flush_remote_tlbs(kvm);
+	return true;
+}
+
 static void kvm_mmu_flush_or_zap(struct kvm_vcpu *vcpu,
 				 struct list_head *invalid_list,
 				 bool remote_flush, bool local_flush)
 {
-	if (!list_empty(invalid_list)) {
-		kvm_mmu_commit_zap_page(vcpu->kvm, invalid_list);
+	if (kvm_mmu_remote_flush_or_zap(vcpu->kvm, invalid_list, remote_flush))
 		return;
-	}
 
-	if (remote_flush)
-		kvm_flush_remote_tlbs(vcpu->kvm);
-	else if (local_flush)
+	if (local_flush)
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 }
 
··· 2259 2252
 static void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point) { }
 static void mmu_audit_disable(void) { }
 #endif
-
-static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
-	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
-}
 
 static bool kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			 struct list_head *invalid_list)
··· 2484 2482
 		if (level > PT_PAGE_TABLE_LEVEL && need_sync)
 			flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
 	}
-	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
 	clear_page(sp->spt);
 	trace_kvm_mmu_get_page(sp, true);
 
··· 2669 2668
 	return zapped;
 }
 
-static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-				    struct list_head *invalid_list)
+static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
+				       struct kvm_mmu_page *sp,
+				       struct list_head *invalid_list,
+				       int *nr_zapped)
 {
-	int ret;
+	bool list_unstable;
 
 	trace_kvm_mmu_prepare_zap_page(sp);
 	++kvm->stat.mmu_shadow_zapped;
-	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
+	*nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list);
 	kvm_mmu_page_unlink_children(kvm, sp);
 	kvm_mmu_unlink_parents(kvm, sp);
+
+	/* Zapping children means active_mmu_pages has become unstable. */
+	list_unstable = *nr_zapped;
 
 	if (!sp->role.invalid && !sp->role.direct)
 		unaccount_shadowed(kvm, sp);
··· 2692 2686
 		kvm_unlink_unsync_page(kvm, sp);
 	if (!sp->root_count) {
 		/* Count self */
-		ret++;
+		(*nr_zapped)++;
 		list_move(&sp->link, invalid_list);
 		kvm_mod_used_mmu_pages(kvm, -1);
 	} else {
 		list_move(&sp->link, &kvm->arch.active_mmu_pages);
 
-		/*
-		 * The obsolete pages can not be used on any vcpus.
-		 * See the comments in kvm_mmu_invalidate_zap_all_pages().
-		 */
-		if (!sp->role.invalid && !is_obsolete_sp(kvm, sp))
+		if (!sp->role.invalid)
 			kvm_reload_remote_mmus(kvm);
 	}
 
 	sp->role.invalid = 1;
-	return ret;
+	return list_unstable;
+}
+
+static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				     struct list_head *invalid_list)
+{
+	int nr_zapped;
+
+	__kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
+	return nr_zapped;
 }
 
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
··· 3714 3703
 
 		u64 *lm_root;
 
-		lm_root = (void*)get_zeroed_page(GFP_KERNEL);
+		lm_root = (void*)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 		if (lm_root == NULL)
 			return 1;
 
··· 4215 4204
 		return false;
 
 	if (cached_root_available(vcpu, new_cr3, new_role)) {
-		/*
-		 * It is possible that the cached previous root page is
-		 * obsolete because of a change in the MMU
-		 * generation number. However, that is accompanied by
-		 * KVM_REQ_MMU_RELOAD, which will free the root that we
-		 * have set here and allocate a new one.
-		 */
-
 		kvm_make_request(KVM_REQ_LOAD_CR3, vcpu);
 		if (!skip_tlb_flush) {
 			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
··· 5489 5486
 }
 EXPORT_SYMBOL_GPL(kvm_disable_tdp);
 
-static void free_mmu_pages(struct kvm_vcpu *vcpu)
-{
-	free_page((unsigned long)vcpu->arch.mmu->pae_root);
-	free_page((unsigned long)vcpu->arch.mmu->lm_root);
-}
-
-static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
-{
-	struct page *page;
-	int i;
-
-	if (tdp_enabled)
-		return 0;
-
-	/*
-	 * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
-	 * Therefore we need to allocate shadow page tables in the first
-	 * 4GB of memory, which happens to fit the DMA32 zone.
-	 */
-	page = alloc_page(GFP_KERNEL | __GFP_DMA32);
-	if (!page)
-		return -ENOMEM;
-
-	vcpu->arch.mmu->pae_root = page_address(page);
-	for (i = 0; i < 4; ++i)
-		vcpu->arch.mmu->pae_root[i] = INVALID_PAGE;
-
-	return 0;
-}
-
-int kvm_mmu_create(struct kvm_vcpu *vcpu)
-{
-	uint i;
-
-	vcpu->arch.mmu = &vcpu->arch.root_mmu;
-	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
-
-	vcpu->arch.root_mmu.root_hpa = INVALID_PAGE;
-	vcpu->arch.root_mmu.root_cr3 = 0;
-	vcpu->arch.root_mmu.translate_gpa = translate_gpa;
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
-		vcpu->arch.root_mmu.prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
-
-	vcpu->arch.guest_mmu.root_hpa = INVALID_PAGE;
-	vcpu->arch.guest_mmu.root_cr3 = 0;
-	vcpu->arch.guest_mmu.translate_gpa = translate_gpa;
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
-		vcpu->arch.guest_mmu.prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
-
-	vcpu->arch.nested_mmu.translate_gpa = translate_nested_gpa;
-	return alloc_mmu_pages(vcpu);
-}
-
-static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
-			struct kvm_memory_slot *slot,
-			struct kvm_page_track_notifier_node *node)
-{
-	kvm_mmu_invalidate_zap_all_pages(kvm);
-}
-
-void kvm_mmu_init_vm(struct kvm *kvm)
-{
-	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
-
-	node->track_write = kvm_mmu_pte_write;
-	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
-	kvm_page_track_register_notifier(kvm, node);
-}
-
-void kvm_mmu_uninit_vm(struct kvm *kvm)
-{
-	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
-
-	kvm_page_track_unregister_notifier(kvm, node);
-}
 
 /* The return value indicates if tlb flush on all vcpus is needed. */
 typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_head);
··· 5559 5631
 				 PT_PAGE_TABLE_LEVEL, lock_flush_tlb);
 }
 
+static void free_mmu_pages(struct kvm_vcpu *vcpu)
+{
+	free_page((unsigned long)vcpu->arch.mmu->pae_root);
+	free_page((unsigned long)vcpu->arch.mmu->lm_root);
+}
+
+static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
+{
+	struct page *page;
+	int i;
+
+	if (tdp_enabled)
+		return 0;
+
+	/*
+	 * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
+	 * Therefore we need to allocate shadow page tables in the first
+	 * 4GB of memory, which happens to fit the DMA32 zone.
+	 */
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
+	if (!page)
+		return -ENOMEM;
+
+	vcpu->arch.mmu->pae_root = page_address(page);
+	for (i = 0; i < 4; ++i)
+		vcpu->arch.mmu->pae_root[i] = INVALID_PAGE;
+
+	return 0;
+}
+
+int kvm_mmu_create(struct kvm_vcpu *vcpu)
+{
+	uint i;
+
+	vcpu->arch.mmu = &vcpu->arch.root_mmu;
+	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+
+	vcpu->arch.root_mmu.root_hpa = INVALID_PAGE;
+	vcpu->arch.root_mmu.root_cr3 = 0;
+	vcpu->arch.root_mmu.translate_gpa = translate_gpa;
+	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+		vcpu->arch.root_mmu.prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
+
+	vcpu->arch.guest_mmu.root_hpa = INVALID_PAGE;
+	vcpu->arch.guest_mmu.root_cr3 = 0;
+	vcpu->arch.guest_mmu.translate_gpa = translate_gpa;
+	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+		vcpu->arch.guest_mmu.prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
+
+	vcpu->arch.nested_mmu.translate_gpa = translate_nested_gpa;
+	return alloc_mmu_pages(vcpu);
+}
+
+static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
+			struct kvm_memory_slot *slot,
+			struct kvm_page_track_notifier_node *node)
+{
+	struct kvm_mmu_page *sp;
+	LIST_HEAD(invalid_list);
+	unsigned long i;
+	bool flush;
+	gfn_t gfn;
+
+	spin_lock(&kvm->mmu_lock);
+
+	if (list_empty(&kvm->arch.active_mmu_pages))
+		goto out_unlock;
+
+	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, false);
+
+	for (i = 0; i < slot->npages; i++) {
+		gfn = slot->base_gfn + i;
+
+		for_each_valid_sp(kvm, sp, gfn) {
+			if (sp->gfn != gfn)
+				continue;
+
+			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+		}
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+			flush = false;
+			cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+
+out_unlock:
+	spin_unlock(&kvm->mmu_lock);
+}
+
+void kvm_mmu_init_vm(struct kvm *kvm)
+{
+	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+
+	node->track_write = kvm_mmu_pte_write;
+	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
+	kvm_page_track_register_notifier(kvm, node);
+}
+
+void kvm_mmu_uninit_vm(struct kvm *kvm)
+{
+	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+
+	kvm_page_track_unregister_notifier(kvm, node);
+}
+
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
-	bool flush_tlb = true;
-	bool flush = false;
 	int i;
-
-	if (kvm_available_flush_tlb_with_range())
-		flush_tlb = false;
 
 	spin_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
··· 5683 5653
 			if (start >= end)
 				continue;
 
-			flush |= slot_handle_level_range(kvm, memslot,
-					kvm_zap_rmapp, PT_PAGE_TABLE_LEVEL,
-					PT_MAX_HUGEPAGE_LEVEL, start,
-					end - 1, flush_tlb);
+			slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
+						PT_PAGE_TABLE_LEVEL, PT_MAX_HUGEPAGE_LEVEL,
+						start, end - 1, true);
 		}
 	}
-
-	if (flush)
-		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
-				gfn_end - gfn_start + 1);
 
 	spin_unlock(&kvm->mmu_lock);
 }
··· 5840 5815
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
 
-#define BATCH_ZAP_PAGES 10
-static void kvm_zap_obsolete_pages(struct kvm *kvm)
+static void __kvm_mmu_zap_all(struct kvm *kvm, bool mmio_only)
 {
 	struct kvm_mmu_page *sp, *node;
-	int batch = 0;
+	LIST_HEAD(invalid_list);
+	int ign;
 
+	spin_lock(&kvm->mmu_lock);
 restart:
-	list_for_each_entry_safe_reverse(sp, node,
-	      &kvm->arch.active_mmu_pages, link) {
-		int ret;
-
-		/*
-		 * No obsolete page exists before new created page since
-		 * active_mmu_pages is the FIFO list.
-		 */
-		if (!is_obsolete_sp(kvm, sp))
-			break;
-
-		/*
-		 * Since we are reversely walking the list and the invalid
-		 * list will be moved to the head, skip the invalid page
-		 * can help us to avoid the infinity list walking.
-		 */
-		if (sp->role.invalid)
+	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
+		if (mmio_only && !sp->mmio_cached)
 			continue;
-
-		/*
-		 * Need not flush tlb since we only zap the sp with invalid
-		 * generation number.
-		 */
-		if (batch >= BATCH_ZAP_PAGES &&
-		      cond_resched_lock(&kvm->mmu_lock)) {
-			batch = 0;
+		if (sp->role.invalid && sp->root_count)
+			continue;
+		if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign)) {
+			WARN_ON_ONCE(mmio_only);
 			goto restart;
 		}
-
-		ret = kvm_mmu_prepare_zap_page(kvm, sp,
-				&kvm->arch.zapped_obsolete_pages);
-		batch += ret;
-
-		if (ret)
+		if (cond_resched_lock(&kvm->mmu_lock))
 			goto restart;
 	}
 
-	/*
-	 * Should flush tlb before free page tables since lockless-walking
-	 * may use the pages.
-	 */
-	kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
-}
-
-/*
- * Fast invalidate all shadow pages and use lock-break technique
- * to zap obsolete pages.
- *
- * It's required when memslot is being deleted or VM is being
- * destroyed, in these cases, we should ensure that KVM MMU does
- * not use any resource of the being-deleted slot or all slots
- * after calling the function.
- */
-void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
-{
-	spin_lock(&kvm->mmu_lock);
-	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
-	kvm->arch.mmu_valid_gen++;
-
-	/*
-	 * Notify all vcpus to reload its shadow page table
-	 * and flush TLB. Then all vcpus will switch to new
-	 * shadow page table with the new mmu_valid_gen.
-	 *
-	 * Note: we should do this under the protection of
-	 * mmu-lock, otherwise, vcpu would purge shadow page
-	 * but miss tlb flush.
-	 */
-	kvm_reload_remote_mmus(kvm);
-
-	kvm_zap_obsolete_pages(kvm);
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 	spin_unlock(&kvm->mmu_lock);
 }
 
-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
+void kvm_mmu_zap_all(struct kvm *kvm)
 {
-	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
+	return __kvm_mmu_zap_all(kvm, false);
 }
 
-void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots)
+void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 {
+	WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
+
+	gen &= MMIO_SPTE_GEN_MASK;
+
 	/*
-	 * The very rare case: if the generation-number is round,
+	 * Generation numbers are incremented in multiples of the number of
+	 * address spaces in order to provide unique generations across all
+	 * address spaces. Strip what is effectively the address space
+	 * modifier prior to checking for a wrap of the MMIO generation so
+	 * that a wrap in any address space is detected.
+	 */
+	gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
+
+	/*
+	 * The very rare case: if the MMIO generation number has wrapped,
 	 * zap all shadow pages.
 	 */
-	if (unlikely((slots->generation & MMIO_GEN_MASK) == 0)) {
+	if (unlikely(gen == 0)) {
 		kvm_debug_ratelimited("kvm: zapping shadow pages for mmio generation wraparound\n");
-		kvm_mmu_invalidate_zap_all_pages(kvm);
+		__kvm_mmu_zap_all(kvm, true);
 	}
 }
 
··· 5922 5940
 		 * want to shrink a VM that only started to populate its MMU
 		 * anyway.
 		 */
-		if (!kvm->arch.n_used_mmu_pages &&
-		    !kvm_has_zapped_obsolete_pages(kvm))
+		if (!kvm->arch.n_used_mmu_pages)
 			continue;
 
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
 
-		if (kvm_has_zapped_obsolete_pages(kvm)) {
-			kvm_mmu_commit_zap_page(kvm,
-			      &kvm->arch.zapped_obsolete_pages);
-			goto unlock;
-		}
-
 		if (prepare_zap_oldest_mmu_page(kvm, &invalid_list))
 			freed++;
 		kvm_mmu_commit_zap_page(kvm, &invalid_list);
 
-unlock:
 		spin_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
 
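The comment above `MMIO_SPTE_GEN_MASK` describes how a 19-bit MMIO generation is split across two spte fields, and `kvm_mmu_invalidate_mmio_sptes()` describes how a wrap is detected. The sketch below models both in user space. It reimplements `GENMASK_ULL` locally, assumes `KVM_ADDRESS_SPACE_NUM` is 2 (x86 with SMM), and uses hypothetical `demo_*` names. Read literally, the hunk's high-half expression `(gen << MMIO_SPTE_GEN_HIGH_START) & MMIO_SPTE_GEN_HIGH_MASK` keeps generation bits 0-9 rather than 9-18, so the sketch shifts by 52 - 9 = 43 instead, to match the layout the comment states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the kernel's GENMASK_ULL(h, l): bits l..h set. */
#define DEMO_GENMASK(h, l) ((~0ULL >> (63 - (h))) & (~0ULL << (l)))

#define DEMO_GEN_MASK   DEMO_GENMASK(18, 0)	/* 19-bit MMIO generation */
#define DEMO_LOW_START  3
#define DEMO_LOW_END    11
#define DEMO_LOW_BITS   (DEMO_LOW_END - DEMO_LOW_START + 1)	/* 9 */
#define DEMO_LOW_MASK   DEMO_GENMASK(DEMO_LOW_END, DEMO_LOW_START)
#define DEMO_HIGH_START 52
#define DEMO_HIGH_END   61
#define DEMO_HIGH_MASK  DEMO_GENMASK(DEMO_HIGH_END, DEMO_HIGH_START)

/* Assumption: x86 has two address spaces (normal and SMM). */
#define DEMO_ADDRESS_SPACE_NUM 2ULL

/* gen[8:0] -> spte[11:3], gen[18:9] -> spte[61:52]. */
static uint64_t demo_gen_to_spte(uint64_t gen)
{
	uint64_t mask;

	assert((gen & ~DEMO_GEN_MASK) == 0);
	mask = (gen << DEMO_LOW_START) & DEMO_LOW_MASK;
	mask |= (gen << (DEMO_HIGH_START - DEMO_LOW_BITS)) & DEMO_HIGH_MASK;
	return mask;
}

/* Inverse of demo_gen_to_spte(). */
static uint64_t demo_spte_to_gen(uint64_t spte)
{
	uint64_t gen;

	gen = (spte & DEMO_LOW_MASK) >> DEMO_LOW_START;
	gen |= (spte & DEMO_HIGH_MASK) >> (DEMO_HIGH_START - DEMO_LOW_BITS);
	return gen;
}

/* Wrap check modelled on kvm_mmu_invalidate_mmio_sptes(): keep only the
 * MMIO generation bits, drop the per-address-space low bits, and treat
 * zero as a wrap that requires zapping MMIO sptes. */
static bool demo_mmio_gen_wrapped(uint64_t memslots_gen)
{
	uint64_t gen = memslots_gen & DEMO_GEN_MASK;

	gen &= ~(DEMO_ADDRESS_SPACE_NUM - 1);
	return gen == 0;
}
```

A generation of `(1 << 19) + 1` counts as wrapped in this model: its MMIO bits are 1, and clearing the address-space bit leaves 0.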

-1

arch/x86/kvm/mmu.h

··· 203 203
 	return -(u32)fault & errcode;
 }
 
-void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);

+9 -33

arch/x86/kvm/mmutrace.h

··· 8 8
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM kvmmmu
 
-#define KVM_MMU_PAGE_FIELDS \
-	__field(unsigned long, mmu_valid_gen) \
-	__field(__u64, gfn) \
-	__field(__u32, role) \
-	__field(__u32, root_count) \
+#define KVM_MMU_PAGE_FIELDS \
+	__field(__u64, gfn) \
+	__field(__u32, role) \
+	__field(__u32, root_count) \
 	__field(bool, unsync)
 
-#define KVM_MMU_PAGE_ASSIGN(sp) \
-	__entry->mmu_valid_gen = sp->mmu_valid_gen; \
-	__entry->gfn = sp->gfn; \
-	__entry->role = sp->role.word; \
-	__entry->root_count = sp->root_count; \
+#define KVM_MMU_PAGE_ASSIGN(sp) \
+	__entry->gfn = sp->gfn; \
+	__entry->role = sp->role.word; \
+	__entry->root_count = sp->root_count; \
 	__entry->unsync = sp->unsync;
 
 #define KVM_MMU_PAGE_PRINTK() ({ \
··· 29 31
 	\
 	role.word = __entry->role; \
 	\
-	trace_seq_printf(p, "sp gen %lx gfn %llx l%u%s q%u%s %s%s" \
+	trace_seq_printf(p, "sp gfn %llx l%u%s q%u%s %s%s" \
 			 " %snxe %sad root %u %s%c", \
-			 __entry->mmu_valid_gen, \
 			 __entry->gfn, role.level, \
 			 role.cr4_pae ? " pae" : "", \
 			 role.quadrant, \
··· 278 281
 		  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
 	)
 );
-
-TRACE_EVENT(
-	kvm_mmu_invalidate_zap_all_pages,
-	TP_PROTO(struct kvm *kvm),
-	TP_ARGS(kvm),
-
-	TP_STRUCT__entry(
-		__field(unsigned long, mmu_valid_gen)
-		__field(unsigned int, mmu_used_pages)
-	),
-
-	TP_fast_assign(
-		__entry->mmu_valid_gen = kvm->arch.mmu_valid_gen;
-		__entry->mmu_used_pages = kvm->arch.n_used_mmu_pages;
-	),
-
-	TP_printk("kvm-mmu-valid-gen %lx used_pages %x",
-		  __entry->mmu_valid_gen, __entry->mmu_used_pages
-	)
-);
-
 
 TRACE_EVENT(
 	check_mmio_spte,

+1 -1

arch/x86/kvm/page_track.c

··· 42 42
 	for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
 		slot->arch.gfn_track[i] =
 			kvcalloc(npages, sizeof(*slot->arch.gfn_track[i]),
-				 GFP_KERNEL);
+				 GFP_KERNEL_ACCOUNT);
 		if (!slot->arch.gfn_track[i])
 			goto track_free;
 	}

+60 -60

arch/x86/kvm/svm.c

··· 145 145
 
 	/* Struct members for AVIC */
 	u32 avic_vm_id;
-	u32 ldr_mode;
 	struct page *avic_logical_id_table_page;
 	struct page *avic_physical_id_table_page;
 	struct hlist_node hnode;
··· 235 236
 	bool nrips_enabled : 1;
 
 	u32 ldr_reg;
+	u32 dfr_reg;
 	struct page *avic_backing_page;
 	u64 *avic_physical_id_cache;
 	bool avic_is_running;
··· 1795 1795
 	/* Avoid using vmalloc for smaller buffers. */
 	size = npages * sizeof(struct page *);
 	if (size > PAGE_SIZE)
-		pages = vmalloc(size);
+		pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO,
+				  PAGE_KERNEL);
 	else
-		pages = kmalloc(size, GFP_KERNEL);
+		pages = kmalloc(size, GFP_KERNEL_ACCOUNT);
 
 	if (!pages)
 		return NULL;
··· 1866 1865
 
 static struct kvm *svm_vm_alloc(void)
 {
-	struct kvm_svm *kvm_svm = vzalloc(sizeof(struct kvm_svm));
+	struct kvm_svm *kvm_svm = __vmalloc(sizeof(struct kvm_svm),
+					    GFP_KERNEL_ACCOUNT | __GFP_ZERO,
+					    PAGE_KERNEL);
 	return &kvm_svm->kvm;
 }
 
··· 1943 1940
 		return 0;
 
 	/* Allocating physical APIC ID table (4KB) */
-	p_page = alloc_page(GFP_KERNEL);
+	p_page = alloc_page(GFP_KERNEL_ACCOUNT);
 	if (!p_page)
 		goto free_avic;
 
··· 1951 1948
 	clear_page(page_address(p_page));
 
 	/* Allocating logical APIC ID table (4KB) */
-	l_page = alloc_page(GFP_KERNEL);
+	l_page = alloc_page(GFP_KERNEL_ACCOUNT);
 	if (!l_page)
 		goto free_avic;
 
··· 2109 2106
 
 	INIT_LIST_HEAD(&svm->ir_list);
 	spin_lock_init(&svm->ir_list_lock);
+	svm->dfr_reg = APIC_DFR_FLAT;
 
 	return ret;
 }
··· 2123 2119
 	struct page *nested_msrpm_pages;
 	int err;
 
-	svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
+	svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
 	if (!svm) {
 		err = -ENOMEM;
 		goto out;
 	}
 
-	svm->vcpu.arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache, GFP_KERNEL);
+	svm->vcpu.arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache,
+						     GFP_KERNEL_ACCOUNT);
 	if (!svm->vcpu.arch.guest_fpu) {
 		printk(KERN_ERR "kvm: failed to allocate vcpu's fpu\n");
 		err = -ENOMEM;
··· 2142 2137
 		goto free_svm;
 
 	err = -ENOMEM;
-	page = alloc_page(GFP_KERNEL);
+	page = alloc_page(GFP_KERNEL_ACCOUNT);
 	if (!page)
 		goto uninit;
 
-	msrpm_pages = alloc_pages(GFP_KERNEL, MSRPM_ALLOC_ORDER);
+	msrpm_pages = alloc_pages(GFP_KERNEL_ACCOUNT, MSRPM_ALLOC_ORDER);
 	if (!msrpm_pages)
 		goto free_page1;
 
-	nested_msrpm_pages = alloc_pages(GFP_KERNEL, MSRPM_ALLOC_ORDER);
+	nested_msrpm_pages = alloc_pages(GFP_KERNEL_ACCOUNT, MSRPM_ALLOC_ORDER);
 	if (!nested_msrpm_pages)
 		goto free_page2;
 
-	hsave_page = alloc_page(GFP_KERNEL);
+	hsave_page = alloc_page(GFP_KERNEL_ACCOUNT);
 	if (!hsave_page)
 		goto free_page3;
 
··· 4570 4565
 	return &logical_apic_id_table[index];
 }
 
-static int avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr,
-			  bool valid)
+static int avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr)
 {
 	bool flat;
 	u32 *entry, new_entry;
··· 4583 4579
 	new_entry = READ_ONCE(*entry);
 	new_entry &= ~AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK;
 	new_entry |= (g_physical_id & AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK);
-	if (valid)
-		new_entry |= AVIC_LOGICAL_ID_ENTRY_VALID_MASK;
-	else
-		new_entry &= ~AVIC_LOGICAL_ID_ENTRY_VALID_MASK;
+	new_entry |= AVIC_LOGICAL_ID_ENTRY_VALID_MASK;
 	WRITE_ONCE(*entry, new_entry);
 
 	return 0;
 }
 
+static void avic_invalidate_logical_id_entry(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	bool flat = svm->dfr_reg == APIC_DFR_FLAT;
+	u32 *entry = avic_get_logical_id_entry(vcpu, svm->ldr_reg, flat);
+
+	if (entry)
+		WRITE_ONCE(*entry, (u32) ~AVIC_LOGICAL_ID_ENTRY_VALID_MASK);
+}
+
 static int avic_handle_ldr_update(struct kvm_vcpu *vcpu)
 {
-	int ret;
+	int ret = 0;
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u32 ldr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_LDR);
 
-	if (!ldr)
-		return 1;
+	if (ldr == svm->ldr_reg)
+		return 0;
 
-	ret = avic_ldr_write(vcpu, vcpu->vcpu_id, ldr, true);
-	if (ret && svm->ldr_reg) {
-		avic_ldr_write(vcpu, 0, svm->ldr_reg, false);
-		svm->ldr_reg = 0;
-	} else {
+	avic_invalidate_logical_id_entry(vcpu);
+
+	if (ldr)
+		ret = avic_ldr_write(vcpu, vcpu->vcpu_id, ldr);
+
+	if (!ret)
 		svm->ldr_reg = ldr;
-	}
+
 	return ret;
 }
 
··· 4649 4637
 	return 0;
 }
 
-static int avic_handle_dfr_update(struct kvm_vcpu *vcpu)
+static void avic_handle_dfr_update(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
-	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	u32 dfr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR);
-	u32 mod = (dfr >> 28) & 0xf;
 
-	/*
-	 * We assume that all local APICs are using the same type.
-	 * If this changes, we need to flush the AVIC logical
-	 * APID id table.
-	 */
-	if (kvm_svm->ldr_mode == mod)
-		return 0;
+	if (svm->dfr_reg == dfr)
+		return;
 
-	clear_page(page_address(kvm_svm->avic_logical_id_table_page));
-	kvm_svm->ldr_mode = mod;
-
-	if (svm->ldr_reg)
-		avic_handle_ldr_update(vcpu);
-	return 0;
+	avic_invalidate_logical_id_entry(vcpu);
+	svm->dfr_reg = dfr;
 }
 
 static int avic_unaccel_trap_write(struct vcpu_svm *svm)
··· 5126 5125
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct vmcb *vmcb = svm->vmcb;
 
-	if (!kvm_vcpu_apicv_active(&svm->vcpu))
-		return;
-
-	vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK;
-	mark_dirty(vmcb, VMCB_INTR);
+	if (kvm_vcpu_apicv_active(vcpu))
+		vmcb->control.int_ctl |= AVIC_ENABLE_MASK;
+	else
+		vmcb->control.int_ctl &= ~AVIC_ENABLE_MASK;
+	mark_dirty(vmcb, VMCB_AVIC);
 }
 
 static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
··· 5196 5195
 		 * Allocating new amd_iommu_pi_data, which will get
 		 * add to the per-vcpu ir_list.
5198 5197 */ 5199 - ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_KERNEL); 5198 + ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_KERNEL_ACCOUNT); 5200 5199 if (!ir) { 5201 5200 ret = -ENOMEM; 5202 5201 goto out; ··· 6164 6163 { 6165 6164 if (avic_handle_apic_id_update(vcpu) != 0) 6166 6165 return; 6167 - if (avic_handle_dfr_update(vcpu) != 0) 6168 - return; 6166 + avic_handle_dfr_update(vcpu); 6169 6167 avic_handle_ldr_update(vcpu); 6170 6168 } 6171 6169 ··· 6311 6311 if (ret) 6312 6312 return ret; 6313 6313 6314 - data = kzalloc(sizeof(*data), GFP_KERNEL); 6314 + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); 6315 6315 if (!data) 6316 6316 return -ENOMEM; 6317 6317 ··· 6361 6361 if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params))) 6362 6362 return -EFAULT; 6363 6363 6364 - start = kzalloc(sizeof(*start), GFP_KERNEL); 6364 + start = kzalloc(sizeof(*start), GFP_KERNEL_ACCOUNT); 6365 6365 if (!start) 6366 6366 return -ENOMEM; 6367 6367 ··· 6458 6458 if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params))) 6459 6459 return -EFAULT; 6460 6460 6461 - data = kzalloc(sizeof(*data), GFP_KERNEL); 6461 + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); 6462 6462 if (!data) 6463 6463 return -ENOMEM; 6464 6464 ··· 6535 6535 if (copy_from_user(&params, measure, sizeof(params))) 6536 6536 return -EFAULT; 6537 6537 6538 - data = kzalloc(sizeof(*data), GFP_KERNEL); 6538 + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); 6539 6539 if (!data) 6540 6540 return -ENOMEM; 6541 6541 ··· 6597 6597 if (!sev_guest(kvm)) 6598 6598 return -ENOTTY; 6599 6599 6600 - data = kzalloc(sizeof(*data), GFP_KERNEL); 6600 + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); 6601 6601 if (!data) 6602 6602 return -ENOMEM; 6603 6603 ··· 6618 6618 if (!sev_guest(kvm)) 6619 6619 return -ENOTTY; 6620 6620 6621 - data = kzalloc(sizeof(*data), GFP_KERNEL); 6621 + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); 6622 6622 if (!data) 6623 
6623 return -ENOMEM; 6624 6624 ··· 6646 6646 struct sev_data_dbg *data; 6647 6647 int ret; 6648 6648 6649 - data = kzalloc(sizeof(*data), GFP_KERNEL); 6649 + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); 6650 6650 if (!data) 6651 6651 return -ENOMEM; 6652 6652 ··· 6901 6901 } 6902 6902 6903 6903 ret = -ENOMEM; 6904 - data = kzalloc(sizeof(*data), GFP_KERNEL); 6904 + data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT); 6905 6905 if (!data) 6906 6906 goto e_unpin_memory; 6907 6907 ··· 7007 7007 if (range->addr > ULONG_MAX || range->size > ULONG_MAX) 7008 7008 return -EINVAL; 7009 7009 7010 - region = kzalloc(sizeof(*region), GFP_KERNEL); 7010 + region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT); 7011 7011 if (!region) 7012 7012 return -ENOMEM; 7013 7013
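The reworked AVIC logical-ID handling in svm.c follows four steps. If the LDR is unchanged, it returns early. Otherwise it invalidates the entry that belongs to the cached LDR, installs a new entry only when the new LDR is non-zero, and updates the cache only on success. The userspace sketch below models that flow under simplifying assumptions. The `model_*` names are hypothetical, only flat DFR mode is modelled (the lowest set bit of LDR[31:24] selects one of eight slots), and the bit layout is illustrative rather than the hardware's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants for this model only; not the hardware layout. */
#define MODEL_TABLE_SIZE  8u
#define MODEL_VALID_MASK  (1u << 31)
#define MODEL_ID_MASK     0xffu

struct model_vcpu {
	uint32_t id;       /* stands in for vcpu->vcpu_id */
	uint32_t ldr_reg;  /* last LDR that was installed, 0 if none */
	uint32_t *table;   /* MODEL_TABLE_SIZE logical-ID entries */
};

/* Flat mode only: the lowest set bit of LDR[31:24] selects the slot. */
static uint32_t *model_entry(uint32_t *table, uint32_t ldr)
{
	uint32_t dlid = ldr >> 24;
	unsigned int index = 0;

	if (!dlid)
		return NULL;
	while (!(dlid & 1u)) {
		dlid >>= 1;
		index++;
	}
	return &table[index];   /* dlid <= 0xff, so index <= 7 */
}

/* Mirrors avic_invalidate_logical_id_entry(): drop the VALID bit of the old slot. */
static void model_invalidate(struct model_vcpu *v)
{
	uint32_t *entry = model_entry(v->table, v->ldr_reg);

	if (entry)
		*entry = (uint32_t)~MODEL_VALID_MASK;
}

/* Mirrors avic_ldr_write(): install the vCPU ID and mark the slot valid. */
static int model_write(struct model_vcpu *v, uint32_t ldr)
{
	uint32_t *entry = model_entry(v->table, ldr);
	uint32_t new_entry;

	if (!entry)
		return 1;
	new_entry = *entry;
	new_entry &= ~MODEL_ID_MASK;
	new_entry |= v->id & MODEL_ID_MASK;
	new_entry |= MODEL_VALID_MASK;
	*entry = new_entry;
	return 0;
}

/* Mirrors the reworked avic_handle_ldr_update(). */
static int model_handle_ldr_update(struct model_vcpu *v, uint32_t ldr)
{
	int ret = 0;

	assert(v != NULL && v->table != NULL);
	if (ldr == v->ldr_reg)
		return 0;             /* unchanged: nothing to do */

	model_invalidate(v);          /* always drop the stale slot first */

	if (ldr)
		ret = model_write(v, ldr);

	if (!ret)
		v->ldr_reg = ldr;     /* cache only what was actually installed */
	return ret;
}
```

Invalidating only the calling vCPU's stale entry keeps the other vCPUs' entries intact. The old avic_handle_dfr_update() instead cleared the whole shared table on a DFR change. In the sketch, clearing the LDR to 0 leaves no valid slot behind for that vCPU.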

+79 -50

arch/x86/kvm/vmx/nested.c

··· 211 211 if (!vmx->nested.vmxon && !vmx->nested.smm.vmxon) 212 212 return; 213 213 214 - hrtimer_cancel(&vmx->nested.preemption_timer); 215 214 vmx->nested.vmxon = false; 216 215 vmx->nested.smm.vmxon = false; 217 216 free_vpid(vmx->nested.vpid02); ··· 273 274 void nested_vmx_free_vcpu(struct kvm_vcpu *vcpu) 274 275 { 275 276 vcpu_load(vcpu); 277 + vmx_leave_nested(vcpu); 276 278 vmx_switch_vmcs(vcpu, &to_vmx(vcpu)->vmcs01); 277 279 free_nested(vcpu); 278 280 vcpu_put(vcpu); ··· 1980 1980 prepare_vmcs02_early_full(vmx, vmcs12); 1981 1981 1982 1982 /* 1983 - * HOST_RSP is normally set correctly in vmx_vcpu_run() just before 1984 - * entry, but only if the current (host) sp changed from the value 1985 - * we wrote last (vmx->host_rsp). This cache is no longer relevant 1986 - * if we switch vmcs, and rather than hold a separate cache per vmcs, 1987 - * here we just force the write to happen on entry. host_rsp will 1988 - * also be written unconditionally by nested_vmx_check_vmentry_hw() 1989 - * if we are doing early consistency checks via hardware. 
1990 - */ 1991 - vmx->host_rsp = 0; 1992 - 1993 - /* 1994 1983 * PIN CONTROLS 1995 1984 */ 1996 1985 exec_control = vmcs12->pin_based_vm_exec_control; ··· 2277 2288 vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.vmcs01_debugctl); 2278 2289 } 2279 2290 vmx_set_rflags(vcpu, vmcs12->guest_rflags); 2280 - 2281 - vmx->nested.preemption_timer_expired = false; 2282 - if (nested_cpu_has_preemption_timer(vmcs12)) 2283 - vmx_start_preemption_timer(vcpu); 2284 2291 2285 2292 /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the 2286 2293 * bitwise-or of what L1 wants to trap for L2, and what we want to ··· 2707 2722 { 2708 2723 struct vcpu_vmx *vmx = to_vmx(vcpu); 2709 2724 unsigned long cr3, cr4; 2725 + bool vm_fail; 2710 2726 2711 2727 if (!nested_early_check) 2712 2728 return 0; ··· 2741 2755 vmx->loaded_vmcs->host_state.cr4 = cr4; 2742 2756 } 2743 2757 2744 - vmx->__launched = vmx->loaded_vmcs->launched; 2745 - 2746 2758 asm( 2747 - /* Set HOST_RSP */ 2748 2759 "sub $%c[wordsize], %%" _ASM_SP "\n\t" /* temporarily adjust RSP for CALL */ 2749 - __ex("vmwrite %%" _ASM_SP ", %%" _ASM_DX) "\n\t" 2750 - "mov %%" _ASM_SP ", %c[host_rsp](%1)\n\t" 2760 + "cmp %%" _ASM_SP ", %c[host_state_rsp](%[loaded_vmcs]) \n\t" 2761 + "je 1f \n\t" 2762 + __ex("vmwrite %%" _ASM_SP ", %[HOST_RSP]") "\n\t" 2763 + "mov %%" _ASM_SP ", %c[host_state_rsp](%[loaded_vmcs]) \n\t" 2764 + "1: \n\t" 2751 2765 "add $%c[wordsize], %%" _ASM_SP "\n\t" /* un-adjust RSP */ 2752 2766 2753 2767 /* Check if vmlaunch or vmresume is needed */ 2754 - "cmpl $0, %c[launched](%% " _ASM_CX")\n\t" 2768 + "cmpb $0, %c[launched](%[loaded_vmcs])\n\t" 2755 2769 2770 + /* 2771 + * VMLAUNCH and VMRESUME clear RFLAGS.{CF,ZF} on VM-Exit, set 2772 + * RFLAGS.CF on VM-Fail Invalid and set RFLAGS.ZF on VM-Fail 2773 + * Valid. vmx_vmenter() directly "returns" RFLAGS, and so the 2774 + * results of VM-Enter is captured via CC_{SET,OUT} to vm_fail. 
2775 + */ 2756 2776 "call vmx_vmenter\n\t" 2757 2777 2758 - /* Set vmx->fail accordingly */ 2759 - "setbe %c[fail](%% " _ASM_CX")\n\t" 2760 - : ASM_CALL_CONSTRAINT 2761 - : "c"(vmx), "d"((unsigned long)HOST_RSP), 2762 - [launched]"i"(offsetof(struct vcpu_vmx, __launched)), 2763 - [fail]"i"(offsetof(struct vcpu_vmx, fail)), 2764 - [host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp)), 2778 + CC_SET(be) 2779 + : ASM_CALL_CONSTRAINT, CC_OUT(be) (vm_fail) 2780 + : [HOST_RSP]"r"((unsigned long)HOST_RSP), 2781 + [loaded_vmcs]"r"(vmx->loaded_vmcs), 2782 + [launched]"i"(offsetof(struct loaded_vmcs, launched)), 2783 + [host_state_rsp]"i"(offsetof(struct loaded_vmcs, host_state.rsp)), 2765 2784 [wordsize]"i"(sizeof(ulong)) 2766 - : "rax", "cc", "memory" 2785 + : "cc", "memory" 2767 2786 ); 2768 2787 2769 2788 preempt_enable(); ··· 2778 2787 if (vmx->msr_autoload.guest.nr) 2779 2788 vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr); 2780 2789 2781 - if (vmx->fail) { 2790 + if (vm_fail) { 2782 2791 WARN_ON_ONCE(vmcs_read32(VM_INSTRUCTION_ERROR) != 2783 2792 VMXERR_ENTRY_INVALID_CONTROL_FIELD); 2784 - vmx->fail = 0; 2785 2793 return 1; 2786 2794 } 2787 2795 ··· 2803 2813 2804 2814 return 0; 2805 2815 } 2806 - STACK_FRAME_NON_STANDARD(nested_vmx_check_vmentry_hw); 2807 - 2808 2816 2809 2817 static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, 2810 2818 struct vmcs12 *vmcs12); ··· 3017 3029 */ 3018 3030 if (unlikely(evaluate_pending_interrupts)) 3019 3031 kvm_make_request(KVM_REQ_EVENT, vcpu); 3032 + 3033 + /* 3034 + * Do not start the preemption timer hrtimer until after we know 3035 + * we are successful, so that only nested_vmx_vmexit needs to cancel 3036 + * the timer. 3037 + */ 3038 + vmx->nested.preemption_timer_expired = false; 3039 + if (nested_cpu_has_preemption_timer(vmcs12)) 3040 + vmx_start_preemption_timer(vcpu); 3020 3041 3021 3042 /* 3022 3043 * Note no nested_vmx_succeed or nested_vmx_fail here. 
At this point ··· 3447 3450 else 3448 3451 vmcs12->guest_activity_state = GUEST_ACTIVITY_ACTIVE; 3449 3452 3450 - if (nested_cpu_has_preemption_timer(vmcs12)) { 3451 - if (vmcs12->vm_exit_controls & 3452 - VM_EXIT_SAVE_VMX_PREEMPTION_TIMER) 3453 + if (nested_cpu_has_preemption_timer(vmcs12) && 3454 + vmcs12->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER) 3453 3455 vmcs12->vmx_preemption_timer_value = 3454 3456 vmx_get_preemption_timer_value(vcpu); 3455 - hrtimer_cancel(&to_vmx(vcpu)->nested.preemption_timer); 3456 - } 3457 3457 3458 3458 /* 3459 3459 * In some cases (usually, nested EPT), L2 is allowed to change its ··· 3858 3864 3859 3865 leave_guest_mode(vcpu); 3860 3866 3867 + if (nested_cpu_has_preemption_timer(vmcs12)) 3868 + hrtimer_cancel(&to_vmx(vcpu)->nested.preemption_timer); 3869 + 3861 3870 if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING) 3862 3871 vcpu->arch.tsc_offset -= vmcs12->tsc_offset; 3863 3872 ··· 3911 3914 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { 3912 3915 vmx_flush_tlb(vcpu, true); 3913 3916 } 3914 - 3915 - /* This is needed for same reason as it was needed in prepare_vmcs02 */ 3916 - vmx->host_rsp = 0; 3917 3917 3918 3918 /* Unpin physical memory we referred to in vmcs02 */ 3919 3919 if (vmx->nested.apic_access_page) { ··· 4029 4035 /* Addr = segment_base + offset */ 4030 4036 /* offset = base + [index * scale] + displacement */ 4031 4037 off = exit_qualification; /* holds the displacement */ 4038 + if (addr_size == 1) 4039 + off = (gva_t)sign_extend64(off, 31); 4040 + else if (addr_size == 0) 4041 + off = (gva_t)sign_extend64(off, 15); 4032 4042 if (base_is_valid) 4033 4043 off += kvm_register_read(vcpu, base_reg); 4034 4044 if (index_is_valid) 4035 4045 off += kvm_register_read(vcpu, index_reg)<<scaling; 4036 4046 vmx_get_segment(vcpu, &s, seg_reg); 4037 - *ret = s.base + off; 4038 4047 4048 + /* 4049 + * The effective address, i.e. 
@off, of a memory operand is truncated 4050 + * based on the address size of the instruction. Note that this is 4051 + * the *effective address*, i.e. the address prior to accounting for 4052 + * the segment's base. 4053 + */ 4039 4054 if (addr_size == 1) /* 32 bit */ 4040 - *ret &= 0xffffffff; 4055 + off &= 0xffffffff; 4056 + else if (addr_size == 0) /* 16 bit */ 4057 + off &= 0xffff; 4041 4058 4042 4059 /* Checks for #GP/#SS exceptions. */ 4043 4060 exn = false; 4044 4061 if (is_long_mode(vcpu)) { 4062 + /* 4063 + * The virtual/linear address is never truncated in 64-bit 4064 + * mode, e.g. a 32-bit address size can yield a 64-bit virtual 4065 + * address when using FS/GS with a non-zero base. 4066 + */ 4067 + *ret = s.base + off; 4068 + 4045 4069 /* Long mode: #GP(0)/#SS(0) if the memory address is in a 4046 4070 * non-canonical form. This is the only check on the memory 4047 4071 * destination for long mode! 4048 4072 */ 4049 4073 exn = is_noncanonical_address(*ret, vcpu); 4050 - } else if (is_protmode(vcpu)) { 4074 + } else { 4075 + /* 4076 + * When not in long mode, the virtual/linear address is 4077 + * unconditionally truncated to 32 bits regardless of the 4078 + * address size. 4079 + */ 4080 + *ret = (s.base + off) & 0xffffffff; 4081 + 4051 4082 /* Protected mode: apply checks for segment validity in the 4052 4083 * following order: 4053 4084 * - segment type check (#GP(0) may be thrown) ··· 4096 4077 /* Protected mode: #GP(0)/#SS(0) if the segment is unusable. 4097 4078 */ 4098 4079 exn = (s.unusable != 0); 4099 - /* Protected mode: #GP(0)/#SS(0) if the memory 4100 - * operand is outside the segment limit. 4080 + 4081 + /* 4082 + * Protected mode: #GP(0)/#SS(0) if the memory operand is 4083 + * outside the segment limit. All CPUs that support VMX ignore 4084 + * limit checks for flat segments, i.e. segments with base==0, 4085 + * limit==0xffffffff and of type expand-up data or code. 
4101 4086 */ 4102 - exn = exn || (off + sizeof(u64) > s.limit); 4087 + if (!(s.base == 0 && s.limit == 0xffffffff && 4088 + ((s.type & 8) || !(s.type & 4)))) 4089 + exn = exn || (off + sizeof(u64) > s.limit); 4103 4090 } 4104 4091 if (exn) { 4105 4092 kvm_queue_exception_e(vcpu, ··· 4170 4145 if (r < 0) 4171 4146 goto out_vmcs02; 4172 4147 4173 - vmx->nested.cached_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL); 4148 + vmx->nested.cached_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL_ACCOUNT); 4174 4149 if (!vmx->nested.cached_vmcs12) 4175 4150 goto out_cached_vmcs12; 4176 4151 4177 - vmx->nested.cached_shadow_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL); 4152 + vmx->nested.cached_shadow_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL_ACCOUNT); 4178 4153 if (!vmx->nested.cached_shadow_vmcs12) 4179 4154 goto out_cached_shadow_vmcs12; 4180 4155 ··· 5721 5696 enable_shadow_vmcs = 0; 5722 5697 if (enable_shadow_vmcs) { 5723 5698 for (i = 0; i < VMX_BITMAP_NR; i++) { 5699 + /* 5700 + * The vmx_bitmap is not tied to a VM and so should 5701 + * not be charged to a memcg. 5702 + */ 5724 5703 vmx_bitmap[i] = (unsigned long *) 5725 5704 __get_free_page(GFP_KERNEL); 5726 5705 if (!vmx_bitmap[i]) {

+1

arch/x86/kvm/vmx/vmcs.h

··· 34 34 unsigned long cr4; /* May not match real cr4 */ 35 35 unsigned long gs_base; 36 36 unsigned long fs_base; 37 + unsigned long rsp; 37 38 38 39 u16 fs_sel, gs_sel, ldt_sel; 39 40 #ifdef CONFIG_X86_64

+167

arch/x86/kvm/vmx/vmenter.S

··· 1 1 /* SPDX-License-Identifier: GPL-2.0 */ 2 2 #include <linux/linkage.h> 3 3 #include <asm/asm.h> 4 + #include <asm/bitsperlong.h> 5 + #include <asm/kvm_vcpu_regs.h> 6 + 7 + #define WORD_SIZE (BITS_PER_LONG / 8) 8 + 9 + #define VCPU_RAX __VCPU_REGS_RAX * WORD_SIZE 10 + #define VCPU_RCX __VCPU_REGS_RCX * WORD_SIZE 11 + #define VCPU_RDX __VCPU_REGS_RDX * WORD_SIZE 12 + #define VCPU_RBX __VCPU_REGS_RBX * WORD_SIZE 13 + /* Intentionally omit RSP as it's context switched by hardware */ 14 + #define VCPU_RBP __VCPU_REGS_RBP * WORD_SIZE 15 + #define VCPU_RSI __VCPU_REGS_RSI * WORD_SIZE 16 + #define VCPU_RDI __VCPU_REGS_RDI * WORD_SIZE 17 + 18 + #ifdef CONFIG_X86_64 19 + #define VCPU_R8 __VCPU_REGS_R8 * WORD_SIZE 20 + #define VCPU_R9 __VCPU_REGS_R9 * WORD_SIZE 21 + #define VCPU_R10 __VCPU_REGS_R10 * WORD_SIZE 22 + #define VCPU_R11 __VCPU_REGS_R11 * WORD_SIZE 23 + #define VCPU_R12 __VCPU_REGS_R12 * WORD_SIZE 24 + #define VCPU_R13 __VCPU_REGS_R13 * WORD_SIZE 25 + #define VCPU_R14 __VCPU_REGS_R14 * WORD_SIZE 26 + #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE 27 + #endif 4 28 5 29 .text 6 30 ··· 79 55 ENTRY(vmx_vmexit) 80 56 ret 81 57 ENDPROC(vmx_vmexit) 58 + 59 + /** 60 + * __vmx_vcpu_run - Run a vCPU via a transition to VMX guest mode 61 + * @vmx: struct vcpu_vmx * 62 + * @regs: unsigned long * (to guest registers) 63 + * @launched: %true if the VMCS has been launched 64 + * 65 + * Returns: 66 + * 0 on VM-Exit, 1 on VM-Fail 67 + */ 68 + ENTRY(__vmx_vcpu_run) 69 + push %_ASM_BP 70 + mov %_ASM_SP, %_ASM_BP 71 + #ifdef CONFIG_X86_64 72 + push %r15 73 + push %r14 74 + push %r13 75 + push %r12 76 + #else 77 + push %edi 78 + push %esi 79 + #endif 80 + push %_ASM_BX 81 + 82 + /* 83 + * Save @regs, _ASM_ARG2 may be modified by vmx_update_host_rsp() and 84 + * @regs is needed after VM-Exit to save the guest's register values. 85 + */ 86 + push %_ASM_ARG2 87 + 88 + /* Copy @launched to BL, _ASM_ARG3 is volatile. 
*/ 89 + mov %_ASM_ARG3B, %bl 90 + 91 + /* Adjust RSP to account for the CALL to vmx_vmenter(). */ 92 + lea -WORD_SIZE(%_ASM_SP), %_ASM_ARG2 93 + call vmx_update_host_rsp 94 + 95 + /* Load @regs to RAX. */ 96 + mov (%_ASM_SP), %_ASM_AX 97 + 98 + /* Check if vmlaunch or vmresume is needed */ 99 + cmpb $0, %bl 100 + 101 + /* Load guest registers. Don't clobber flags. */ 102 + mov VCPU_RBX(%_ASM_AX), %_ASM_BX 103 + mov VCPU_RCX(%_ASM_AX), %_ASM_CX 104 + mov VCPU_RDX(%_ASM_AX), %_ASM_DX 105 + mov VCPU_RSI(%_ASM_AX), %_ASM_SI 106 + mov VCPU_RDI(%_ASM_AX), %_ASM_DI 107 + mov VCPU_RBP(%_ASM_AX), %_ASM_BP 108 + #ifdef CONFIG_X86_64 109 + mov VCPU_R8 (%_ASM_AX), %r8 110 + mov VCPU_R9 (%_ASM_AX), %r9 111 + mov VCPU_R10(%_ASM_AX), %r10 112 + mov VCPU_R11(%_ASM_AX), %r11 113 + mov VCPU_R12(%_ASM_AX), %r12 114 + mov VCPU_R13(%_ASM_AX), %r13 115 + mov VCPU_R14(%_ASM_AX), %r14 116 + mov VCPU_R15(%_ASM_AX), %r15 117 + #endif 118 + /* Load guest RAX. This kills the vmx_vcpu pointer! */ 119 + mov VCPU_RAX(%_ASM_AX), %_ASM_AX 120 + 121 + /* Enter guest mode */ 122 + call vmx_vmenter 123 + 124 + /* Jump on VM-Fail. */ 125 + jbe 2f 126 + 127 + /* Temporarily save guest's RAX. */ 128 + push %_ASM_AX 129 + 130 + /* Reload @regs to RAX. 
*/ 131 + mov WORD_SIZE(%_ASM_SP), %_ASM_AX 132 + 133 + /* Save all guest registers, including RAX from the stack */ 134 + __ASM_SIZE(pop) VCPU_RAX(%_ASM_AX) 135 + mov %_ASM_BX, VCPU_RBX(%_ASM_AX) 136 + mov %_ASM_CX, VCPU_RCX(%_ASM_AX) 137 + mov %_ASM_DX, VCPU_RDX(%_ASM_AX) 138 + mov %_ASM_SI, VCPU_RSI(%_ASM_AX) 139 + mov %_ASM_DI, VCPU_RDI(%_ASM_AX) 140 + mov %_ASM_BP, VCPU_RBP(%_ASM_AX) 141 + #ifdef CONFIG_X86_64 142 + mov %r8, VCPU_R8 (%_ASM_AX) 143 + mov %r9, VCPU_R9 (%_ASM_AX) 144 + mov %r10, VCPU_R10(%_ASM_AX) 145 + mov %r11, VCPU_R11(%_ASM_AX) 146 + mov %r12, VCPU_R12(%_ASM_AX) 147 + mov %r13, VCPU_R13(%_ASM_AX) 148 + mov %r14, VCPU_R14(%_ASM_AX) 149 + mov %r15, VCPU_R15(%_ASM_AX) 150 + #endif 151 + 152 + /* Clear RAX to indicate VM-Exit (as opposed to VM-Fail). */ 153 + xor %eax, %eax 154 + 155 + /* 156 + * Clear all general purpose registers except RSP and RAX to prevent 157 + * speculative use of the guest's values, even those that are reloaded 158 + * via the stack. In theory, an L1 cache miss when restoring registers 159 + * could lead to speculative execution with the guest's values. 160 + * Zeroing XORs are dirt cheap, i.e. the extra paranoia is essentially 161 + * free. RSP and RAX are exempt as RSP is restored by hardware during 162 + * VM-Exit and RAX is explicitly loaded with 0 or 1 to return VM-Fail. 163 + */ 164 + 1: xor %ebx, %ebx 165 + xor %ecx, %ecx 166 + xor %edx, %edx 167 + xor %esi, %esi 168 + xor %edi, %edi 169 + xor %ebp, %ebp 170 + #ifdef CONFIG_X86_64 171 + xor %r8d, %r8d 172 + xor %r9d, %r9d 173 + xor %r10d, %r10d 174 + xor %r11d, %r11d 175 + xor %r12d, %r12d 176 + xor %r13d, %r13d 177 + xor %r14d, %r14d 178 + xor %r15d, %r15d 179 + #endif 180 + 181 + /* "POP" @regs. */ 182 + add $WORD_SIZE, %_ASM_SP 183 + pop %_ASM_BX 184 + 185 + #ifdef CONFIG_X86_64 186 + pop %r12 187 + pop %r13 188 + pop %r14 189 + pop %r15 190 + #else 191 + pop %esi 192 + pop %edi 193 + #endif 194 + pop %_ASM_BP 195 + ret 196 + 197 + /* VM-Fail. 
Out-of-line to avoid a taken Jcc after VM-Exit. */ 198 + 2: mov $1, %eax 199 + jmp 1b 200 + ENDPROC(__vmx_vcpu_run)
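__vmx_vcpu_run() addresses guest registers as byte offsets into @regs (VCPU_RBX is __VCPU_REGS_RBX * WORD_SIZE, and so on). It also turns RFLAGS after VMLAUNCH/VMRESUME into a 0/1 return value. The C sketch below models just those two pieces. The register index order is assumed to follow x86 encoding order, and the M_* and model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed x86 encoding order for the register indices (cf. __VCPU_REGS_*). */
enum { M_RAX, M_RCX, M_RDX, M_RBX, M_RSP, M_RBP, M_RSI, M_RDI, M_NR_REGS };

/* Byte offset of a register in @regs, computed like VCPU_RBX in vmenter.S. */
#define M_WORD_SIZE      sizeof(unsigned long)
#define M_VCPU_OFF(idx)  ((idx) * M_WORD_SIZE)

/*
 * VMLAUNCH/VMRESUME leave CF=ZF=0 on VM-Exit, set CF on VM-Fail Invalid and
 * set ZF on VM-Fail Valid. "jbe" is taken when CF=1 or ZF=1, so both failure
 * kinds return 1 and a VM-Exit returns 0.
 */
static unsigned long model_vmx_vcpu_run_ret(bool cf, bool zf)
{
	assert(!(cf && zf));   /* the two VM-Fail flavours are exclusive */
	return (cf || zf) ? 1 : 0;
}
```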

+35 -153

arch/x86/kvm/vmx/vmx.c

··· 246 246 247 247 if (l1tf != VMENTER_L1D_FLUSH_NEVER && !vmx_l1d_flush_pages && 248 248 !boot_cpu_has(X86_FEATURE_FLUSH_L1D)) { 249 + /* 250 + * This allocation for vmx_l1d_flush_pages is not tied to a VM 251 + * lifetime and so should not be charged to a memcg. 252 + */ 249 253 page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER); 250 254 if (!page) 251 255 return -ENOMEM; ··· 2391 2387 return 0; 2392 2388 } 2393 2389 2394 - struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu) 2390 + struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu, gfp_t flags) 2395 2391 { 2396 2392 int node = cpu_to_node(cpu); 2397 2393 struct page *pages; 2398 2394 struct vmcs *vmcs; 2399 2395 2400 - pages = __alloc_pages_node(node, GFP_KERNEL, vmcs_config.order); 2396 + pages = __alloc_pages_node(node, flags, vmcs_config.order); 2401 2397 if (!pages) 2402 2398 return NULL; 2403 2399 vmcs = page_address(pages); ··· 2444 2440 loaded_vmcs_init(loaded_vmcs); 2445 2441 2446 2442 if (cpu_has_vmx_msr_bitmap()) { 2447 - loaded_vmcs->msr_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL); 2443 + loaded_vmcs->msr_bitmap = (unsigned long *) 2444 + __get_free_page(GFP_KERNEL_ACCOUNT); 2448 2445 if (!loaded_vmcs->msr_bitmap) 2449 2446 goto out_vmcs; 2450 2447 memset(loaded_vmcs->msr_bitmap, 0xff, PAGE_SIZE); ··· 2486 2481 for_each_possible_cpu(cpu) { 2487 2482 struct vmcs *vmcs; 2488 2483 2489 - vmcs = alloc_vmcs_cpu(false, cpu); 2484 + vmcs = alloc_vmcs_cpu(false, cpu, GFP_KERNEL); 2490 2485 if (!vmcs) { 2491 2486 free_kvm_area(); 2492 2487 return -ENOMEM; ··· 6365 6360 vmx->loaded_vmcs->hv_timer_armed = false; 6366 6361 } 6367 6362 6368 - static void __vmx_vcpu_run(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx) 6363 + void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp) 6369 6364 { 6370 - unsigned long evmcs_rsp; 6371 - 6372 - vmx->__launched = vmx->loaded_vmcs->launched; 6373 - 6374 - evmcs_rsp = static_branch_unlikely(&enable_evmcs) ? 
6375 - (unsigned long)&current_evmcs->host_rsp : 0; 6376 - 6377 - if (static_branch_unlikely(&vmx_l1d_should_flush)) 6378 - vmx_l1d_flush(vcpu); 6379 - 6380 - asm( 6381 - /* Store host registers */ 6382 - "push %%" _ASM_DX "; push %%" _ASM_BP ";" 6383 - "push %%" _ASM_CX " \n\t" /* placeholder for guest rcx */ 6384 - "push %%" _ASM_CX " \n\t" 6385 - "sub $%c[wordsize], %%" _ASM_SP "\n\t" /* temporarily adjust RSP for CALL */ 6386 - "cmp %%" _ASM_SP ", %c[host_rsp](%%" _ASM_CX ") \n\t" 6387 - "je 1f \n\t" 6388 - "mov %%" _ASM_SP ", %c[host_rsp](%%" _ASM_CX ") \n\t" 6389 - /* Avoid VMWRITE when Enlightened VMCS is in use */ 6390 - "test %%" _ASM_SI ", %%" _ASM_SI " \n\t" 6391 - "jz 2f \n\t" 6392 - "mov %%" _ASM_SP ", (%%" _ASM_SI ") \n\t" 6393 - "jmp 1f \n\t" 6394 - "2: \n\t" 6395 - __ex("vmwrite %%" _ASM_SP ", %%" _ASM_DX) "\n\t" 6396 - "1: \n\t" 6397 - "add $%c[wordsize], %%" _ASM_SP "\n\t" /* un-adjust RSP */ 6398 - 6399 - /* Reload cr2 if changed */ 6400 - "mov %c[cr2](%%" _ASM_CX "), %%" _ASM_AX " \n\t" 6401 - "mov %%cr2, %%" _ASM_DX " \n\t" 6402 - "cmp %%" _ASM_AX ", %%" _ASM_DX " \n\t" 6403 - "je 3f \n\t" 6404 - "mov %%" _ASM_AX", %%cr2 \n\t" 6405 - "3: \n\t" 6406 - /* Check if vmlaunch or vmresume is needed */ 6407 - "cmpl $0, %c[launched](%%" _ASM_CX ") \n\t" 6408 - /* Load guest registers. Don't clobber flags. 
*/ 6409 - "mov %c[rax](%%" _ASM_CX "), %%" _ASM_AX " \n\t" 6410 - "mov %c[rbx](%%" _ASM_CX "), %%" _ASM_BX " \n\t" 6411 - "mov %c[rdx](%%" _ASM_CX "), %%" _ASM_DX " \n\t" 6412 - "mov %c[rsi](%%" _ASM_CX "), %%" _ASM_SI " \n\t" 6413 - "mov %c[rdi](%%" _ASM_CX "), %%" _ASM_DI " \n\t" 6414 - "mov %c[rbp](%%" _ASM_CX "), %%" _ASM_BP " \n\t" 6415 - #ifdef CONFIG_X86_64 6416 - "mov %c[r8](%%" _ASM_CX "), %%r8 \n\t" 6417 - "mov %c[r9](%%" _ASM_CX "), %%r9 \n\t" 6418 - "mov %c[r10](%%" _ASM_CX "), %%r10 \n\t" 6419 - "mov %c[r11](%%" _ASM_CX "), %%r11 \n\t" 6420 - "mov %c[r12](%%" _ASM_CX "), %%r12 \n\t" 6421 - "mov %c[r13](%%" _ASM_CX "), %%r13 \n\t" 6422 - "mov %c[r14](%%" _ASM_CX "), %%r14 \n\t" 6423 - "mov %c[r15](%%" _ASM_CX "), %%r15 \n\t" 6424 - #endif 6425 - /* Load guest RCX. This kills the vmx_vcpu pointer! */ 6426 - "mov %c[rcx](%%" _ASM_CX "), %%" _ASM_CX " \n\t" 6427 - 6428 - /* Enter guest mode */ 6429 - "call vmx_vmenter\n\t" 6430 - 6431 - /* Save guest's RCX to the stack placeholder (see above) */ 6432 - "mov %%" _ASM_CX ", %c[wordsize](%%" _ASM_SP ") \n\t" 6433 - 6434 - /* Load host's RCX, i.e. 
the vmx_vcpu pointer */ 6435 - "pop %%" _ASM_CX " \n\t" 6436 - 6437 - /* Set vmx->fail based on EFLAGS.{CF,ZF} */ 6438 - "setbe %c[fail](%%" _ASM_CX ")\n\t" 6439 - 6440 - /* Save all guest registers, including RCX from the stack */ 6441 - "mov %%" _ASM_AX ", %c[rax](%%" _ASM_CX ") \n\t" 6442 - "mov %%" _ASM_BX ", %c[rbx](%%" _ASM_CX ") \n\t" 6443 - __ASM_SIZE(pop) " %c[rcx](%%" _ASM_CX ") \n\t" 6444 - "mov %%" _ASM_DX ", %c[rdx](%%" _ASM_CX ") \n\t" 6445 - "mov %%" _ASM_SI ", %c[rsi](%%" _ASM_CX ") \n\t" 6446 - "mov %%" _ASM_DI ", %c[rdi](%%" _ASM_CX ") \n\t" 6447 - "mov %%" _ASM_BP ", %c[rbp](%%" _ASM_CX ") \n\t" 6448 - #ifdef CONFIG_X86_64 6449 - "mov %%r8, %c[r8](%%" _ASM_CX ") \n\t" 6450 - "mov %%r9, %c[r9](%%" _ASM_CX ") \n\t" 6451 - "mov %%r10, %c[r10](%%" _ASM_CX ") \n\t" 6452 - "mov %%r11, %c[r11](%%" _ASM_CX ") \n\t" 6453 - "mov %%r12, %c[r12](%%" _ASM_CX ") \n\t" 6454 - "mov %%r13, %c[r13](%%" _ASM_CX ") \n\t" 6455 - "mov %%r14, %c[r14](%%" _ASM_CX ") \n\t" 6456 - "mov %%r15, %c[r15](%%" _ASM_CX ") \n\t" 6457 - /* 6458 - * Clear host registers marked as clobbered to prevent 6459 - * speculative use. 
6460 - */ 6461 - "xor %%r8d, %%r8d \n\t" 6462 - "xor %%r9d, %%r9d \n\t" 6463 - "xor %%r10d, %%r10d \n\t" 6464 - "xor %%r11d, %%r11d \n\t" 6465 - "xor %%r12d, %%r12d \n\t" 6466 - "xor %%r13d, %%r13d \n\t" 6467 - "xor %%r14d, %%r14d \n\t" 6468 - "xor %%r15d, %%r15d \n\t" 6469 - #endif 6470 - "mov %%cr2, %%" _ASM_AX " \n\t" 6471 - "mov %%" _ASM_AX ", %c[cr2](%%" _ASM_CX ") \n\t" 6472 - 6473 - "xor %%eax, %%eax \n\t" 6474 - "xor %%ebx, %%ebx \n\t" 6475 - "xor %%esi, %%esi \n\t" 6476 - "xor %%edi, %%edi \n\t" 6477 - "pop %%" _ASM_BP "; pop %%" _ASM_DX " \n\t" 6478 - : ASM_CALL_CONSTRAINT 6479 - : "c"(vmx), "d"((unsigned long)HOST_RSP), "S"(evmcs_rsp), 6480 - [launched]"i"(offsetof(struct vcpu_vmx, __launched)), 6481 - [fail]"i"(offsetof(struct vcpu_vmx, fail)), 6482 - [host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp)), 6483 - [rax]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RAX])), 6484 - [rbx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RBX])), 6485 - [rcx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RCX])), 6486 - [rdx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RDX])), 6487 - [rsi]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RSI])), 6488 - [rdi]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RDI])), 6489 - [rbp]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RBP])), 6490 - #ifdef CONFIG_X86_64 6491 - [r8]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R8])), 6492 - [r9]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R9])), 6493 - [r10]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R10])), 6494 - [r11]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R11])), 6495 - [r12]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R12])), 6496 - [r13]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R13])), 6497 - [r14]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R14])), 6498 - [r15]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R15])), 6499 - #endif 6500 - 
[cr2]"i"(offsetof(struct vcpu_vmx, vcpu.arch.cr2)), 6501 - [wordsize]"i"(sizeof(ulong)) 6502 - : "cc", "memory" 6503 - #ifdef CONFIG_X86_64 6504 - , "rax", "rbx", "rdi" 6505 - , "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15" 6506 - #else 6507 - , "eax", "ebx", "edi" 6508 - #endif 6509 - ); 6365 + if (unlikely(host_rsp != vmx->loaded_vmcs->host_state.rsp)) { 6366 + vmx->loaded_vmcs->host_state.rsp = host_rsp; 6367 + vmcs_writel(HOST_RSP, host_rsp); 6368 + } 6510 6369 } 6511 - STACK_FRAME_NON_STANDARD(__vmx_vcpu_run); 6370 + 6371 + bool __vmx_vcpu_run(struct vcpu_vmx *vmx, unsigned long *regs, bool launched); 6512 6372 6513 6373 static void vmx_vcpu_run(struct kvm_vcpu *vcpu) 6514 6374 { ··· 6442 6572 */ 6443 6573 x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0); 6444 6574 6445 - __vmx_vcpu_run(vcpu, vmx); 6575 + if (static_branch_unlikely(&vmx_l1d_should_flush)) 6576 + vmx_l1d_flush(vcpu); 6577 + 6578 + if (vcpu->arch.cr2 != read_cr2()) 6579 + write_cr2(vcpu->arch.cr2); 6580 + 6581 + vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs, 6582 + vmx->loaded_vmcs->launched); 6583 + 6584 + vcpu->arch.cr2 = read_cr2(); 6446 6585 6447 6586 /* 6448 6587 * We do not use IBRS in the kernel. 
If this vCPU has used the ··· 6536 6657 6537 6658 static struct kvm *vmx_vm_alloc(void) 6538 6659 { 6539 - struct kvm_vmx *kvm_vmx = vzalloc(sizeof(struct kvm_vmx)); 6660 + struct kvm_vmx *kvm_vmx = __vmalloc(sizeof(struct kvm_vmx), 6661 + GFP_KERNEL_ACCOUNT | __GFP_ZERO, 6662 + PAGE_KERNEL); 6540 6663 return &kvm_vmx->kvm; 6541 6664 } 6542 6665 ··· 6554 6673 if (enable_pml) 6555 6674 vmx_destroy_pml_buffer(vmx); 6556 6675 free_vpid(vmx->vpid); 6557 - leave_guest_mode(vcpu); 6558 6676 nested_vmx_free_vcpu(vcpu); 6559 6677 free_loaded_vmcs(vmx->loaded_vmcs); 6560 6678 kfree(vmx->guest_msrs); ··· 6565 6685 static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id) 6566 6686 { 6567 6687 int err; 6568 - struct vcpu_vmx *vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); 6688 + struct vcpu_vmx *vmx; 6569 6689 unsigned long *msr_bitmap; 6570 6690 int cpu; 6571 6691 6692 + vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT); 6572 6693 if (!vmx) 6573 6694 return ERR_PTR(-ENOMEM); 6574 6695 6575 - vmx->vcpu.arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache, GFP_KERNEL); 6696 + vmx->vcpu.arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache, 6697 + GFP_KERNEL_ACCOUNT); 6576 6698 if (!vmx->vcpu.arch.guest_fpu) { 6577 6699 printk(KERN_ERR "kvm: failed to allocate vcpu's fpu\n"); 6578 6700 err = -ENOMEM; ··· 6596 6714 * for the guest, etc. 6597 6715 */ 6598 6716 if (enable_pml) { 6599 - vmx->pml_pg = alloc_page(GFP_KERNEL | __GFP_ZERO); 6717 + vmx->pml_pg = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO); 6600 6718 if (!vmx->pml_pg) 6601 6719 goto uninit_vcpu; 6602 6720 } 6603 6721 6604 - vmx->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL); 6722 + vmx->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL_ACCOUNT); 6605 6723 BUILD_BUG_ON(ARRAY_SIZE(vmx_msr_index) * sizeof(vmx->guest_msrs[0]) 6606 6724 > PAGE_SIZE); 6607 6725

+11 -9

arch/x86/kvm/vmx/vmx.h

···
 
 struct vcpu_vmx {
 	struct kvm_vcpu vcpu;
-	unsigned long host_rsp;
 	u8 fail;
 	u8 msr_bitmap_mode;
 	u32 exit_intr_info;
···
 	struct loaded_vmcs vmcs01;
 	struct loaded_vmcs *loaded_vmcs;
 	struct loaded_vmcs *loaded_cpu_state;
-	bool __launched; /* temporary, used in vmx_vcpu_run */
+
 	struct msr_autoload {
 		struct vmx_msrs guest;
 		struct vmx_msrs host;
···
 
 static inline void pi_set_sn(struct pi_desc *pi_desc)
 {
-	return set_bit(POSTED_INTR_SN,
-			(unsigned long *)&pi_desc->control);
+	set_bit(POSTED_INTR_SN,
+		(unsigned long *)&pi_desc->control);
 }
 
 static inline void pi_set_on(struct pi_desc *pi_desc)
···
 {
 	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
 	if (pt_mode == PT_MODE_SYSTEM)
-		vmentry_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP | VM_EXIT_CLEAR_IA32_RTIT_CTL);
+		vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
+				  VM_ENTRY_LOAD_IA32_RTIT_CTL);
 	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
 	return vmentry_ctrl &
 		~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_EFER);
···
 {
 	u32 vmexit_ctrl = vmcs_config.vmexit_ctrl;
 	if (pt_mode == PT_MODE_SYSTEM)
-		vmexit_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP | VM_ENTRY_LOAD_IA32_RTIT_CTL);
+		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
+				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
 	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
-	return vmcs_config.vmexit_ctrl &
+	return vmexit_ctrl &
 		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
 }
 
···
 	return &(to_vmx(vcpu)->pi_desc);
 }
 
-struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu);
+struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu, gfp_t flags);
 void free_vmcs(struct vmcs *vmcs);
 int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs);
 void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs);
···
 
 static inline struct vmcs *alloc_vmcs(bool shadow)
 {
-	return alloc_vmcs_cpu(shadow, raw_smp_processor_id());
+	return alloc_vmcs_cpu(shadow, raw_smp_processor_id(),
+			      GFP_KERNEL_ACCOUNT);
 }
 
 u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa);
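The `vmx_vmexit_ctrl()` hunk fixes two problems: the Intel PT mask used VM-entry constant names, and the function returned `vmcs_config.vmexit_ctrl` rather than the locally masked copy, so bits cleared under `PT_MODE_SYSTEM` were silently restored. The user-space sketch below reproduces only the second, control-flow bug. All bit names and values here are hypothetical placeholders, not the real VMCS encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments, chosen only for illustration. */
#define EXIT_PT_CONCEAL_PIP   (1u << 0)
#define EXIT_CLEAR_RTIT_CTL   (1u << 1)
#define EXIT_LOAD_PERF_GLOBAL (1u << 2)
#define EXIT_LOAD_EFER        (1u << 3)
#define EXIT_OTHER            (1u << 4)

/* Stand-in for vmcs_config.vmexit_ctrl: all five bits enabled. */
static const uint32_t config_vmexit_ctrl = 0x1fu;

/* Old shape: masks a local copy, then returns the unmasked config value. */
static uint32_t vmexit_ctrl_buggy(int pt_system_mode)
{
	uint32_t ctrl = config_vmexit_ctrl;

	if (pt_system_mode)
		ctrl &= ~(EXIT_PT_CONCEAL_PIP | EXIT_CLEAR_RTIT_CTL);
	(void)ctrl; /* the masked copy is computed but never used */
	return config_vmexit_ctrl & ~(EXIT_LOAD_PERF_GLOBAL | EXIT_LOAD_EFER);
}

/* Fixed shape: the masked local copy is what gets returned. */
static uint32_t vmexit_ctrl_fixed(int pt_system_mode)
{
	uint32_t ctrl = config_vmexit_ctrl;

	if (pt_system_mode)
		ctrl &= ~(EXIT_PT_CONCEAL_PIP | EXIT_CLEAR_RTIT_CTL);
	return ctrl & ~(EXIT_LOAD_PERF_GLOBAL | EXIT_LOAD_EFER);
}
```

With PT in system mode, the buggy variant still reports the two PT bits, but the fixed one does not. When PT is not in system mode, both variants agree.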

+20 -12

arch/x86/kvm/x86.c

···
 		r = -EINVAL;
 		if (!lapic_in_kernel(vcpu))
 			goto out;
-		u.lapic = kzalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL);
+		u.lapic = kzalloc(sizeof(struct kvm_lapic_state),
+				GFP_KERNEL_ACCOUNT);
 
 		r = -ENOMEM;
 		if (!u.lapic)
···
 		break;
 	}
 	case KVM_GET_XSAVE: {
-		u.xsave = kzalloc(sizeof(struct kvm_xsave), GFP_KERNEL);
+		u.xsave = kzalloc(sizeof(struct kvm_xsave), GFP_KERNEL_ACCOUNT);
 		r = -ENOMEM;
 		if (!u.xsave)
 			break;
···
 		break;
 	}
 	case KVM_GET_XCRS: {
-		u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL);
+		u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL_ACCOUNT);
 		r = -ENOMEM;
 		if (!u.xcrs)
 			break;
···
 
 void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu)
 {
+	if (!lapic_in_kernel(vcpu)) {
+		WARN_ON_ONCE(vcpu->arch.apicv_active);
+		return;
+	}
+	if (!vcpu->arch.apicv_active)
+		return;
+
 	vcpu->arch.apicv_active = false;
 	kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
 }
···
 	struct page *page;
 	int r;
 
-	vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
 	vcpu->arch.emulate_ctxt.ops = &emulate_ops;
 	if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
 		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
···
 		goto fail_free_pio_data;
 
 	if (irqchip_in_kernel(vcpu->kvm)) {
+		vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
 		r = kvm_create_lapic(vcpu);
 		if (r < 0)
 			goto fail_mmu_destroy;
···
 		static_key_slow_inc(&kvm_no_apic_vcpu);
 
 	vcpu->arch.mce_banks = kzalloc(KVM_MAX_MCE_BANKS * sizeof(u64) * 4,
-				       GFP_KERNEL);
+				       GFP_KERNEL_ACCOUNT);
 	if (!vcpu->arch.mce_banks) {
 		r = -ENOMEM;
 		goto fail_free_lapic;
 	}
 	vcpu->arch.mcg_cap = KVM_MAX_MCE_BANKS;
 
-	if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL)) {
+	if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask,
+				GFP_KERNEL_ACCOUNT)) {
 		r = -ENOMEM;
 		goto fail_free_mce_banks;
 	}
···
 
 	INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
-	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
 	INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
 	atomic_set(&kvm->arch.noncoherent_dma_count, 0);
 
···
 
 		slot->arch.rmap[i] =
 			kvcalloc(lpages, sizeof(*slot->arch.rmap[i]),
-				 GFP_KERNEL);
+				 GFP_KERNEL_ACCOUNT);
 		if (!slot->arch.rmap[i])
 			goto out_free;
 		if (i == 0)
 			continue;
 
-		linfo = kvcalloc(lpages, sizeof(*linfo), GFP_KERNEL);
+		linfo = kvcalloc(lpages, sizeof(*linfo), GFP_KERNEL_ACCOUNT);
 		if (!linfo)
 			goto out_free;
 
···
 	return -ENOMEM;
 }
 
-void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 {
 	/*
 	 * memslots->generation has been incremented.
 	 * mmio generation may have reached its maximum value.
 	 */
-	kvm_mmu_invalidate_mmio_sptes(kvm, slots);
+	kvm_mmu_invalidate_mmio_sptes(kvm, gen);
 }
 
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
···
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_mmu_invalidate_zap_all_pages(kvm);
+	kvm_mmu_zap_all(kvm);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
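The new guards in `kvm_vcpu_deactivate_apicv()` turn it into a no-op in two cases: when the vCPU has no in-kernel local APIC, or when APICv is already off. As a result, the vendor `refresh_apicv_exec_ctrl` hook runs at most once per deactivation. Relatedly, `apicv_active` is now only initialised when the irqchip is in the kernel.

Below is a minimal user-space model of that control flow. `struct fake_vcpu` and the call counter are hypothetical stand-ins for the KVM vCPU state and the `kvm_x86_ops` callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant vCPU state. */
struct fake_vcpu {
	bool lapic_in_kernel;
	bool apicv_active;
	int refresh_calls;	/* counts calls to the "refresh" hook */
};

static void refresh_apicv_exec_ctrl(struct fake_vcpu *vcpu)
{
	vcpu->refresh_calls++;
}

/* Mirrors the control flow of the patched kvm_vcpu_deactivate_apicv(). */
static void deactivate_apicv(struct fake_vcpu *vcpu)
{
	if (!vcpu->lapic_in_kernel) {
		/* The kernel additionally WARN_ON_ONCE()s if APICv is active. */
		return;
	}
	if (!vcpu->apicv_active)
		return;

	vcpu->apicv_active = false;
	refresh_apicv_exec_ctrl(vcpu);
}
```

Calling `deactivate_apicv()` twice on an active vCPU refreshes only once. A vCPU without an in-kernel LAPIC is never refreshed.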

+6 -1

arch/x86/kvm/x86.h

···
 static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
 					gva_t gva, gfn_t gfn, unsigned access)
 {
+	u64 gen = kvm_memslots(vcpu->kvm)->generation;
+
+	if (unlikely(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS))
+		return;
+
 	/*
 	 * If this is a shadow nested page table, the "GVA" is
 	 * actually a nGPA.
···
 	vcpu->arch.mmio_gva = mmu_is_nested(vcpu) ? 0 : gva & PAGE_MASK;
 	vcpu->arch.access = access;
 	vcpu->arch.mmio_gfn = gfn;
-	vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
+	vcpu->arch.mmio_gen = gen;
 }
 
 static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)

+9 -2

drivers/clocksource/arm_arch_timer.c

···
 	return ARCH_TIMER_PHYS_SECURE_PPI;
 }
 
+static void __init arch_timer_populate_kvm_info(void)
+{
+	arch_timer_kvm_info.virtual_irq = arch_timer_ppi[ARCH_TIMER_VIRT_PPI];
+	if (is_kernel_in_hyp_mode())
+		arch_timer_kvm_info.physical_irq = arch_timer_ppi[ARCH_TIMER_PHYS_NONSECURE_PPI];
+}
+
 static int __init arch_timer_of_init(struct device_node *np)
 {
 	int i, ret;
···
 	for (i = ARCH_TIMER_PHYS_SECURE_PPI; i < ARCH_TIMER_MAX_TIMER_PPI; i++)
 		arch_timer_ppi[i] = irq_of_parse_and_map(np, i);
 
-	arch_timer_kvm_info.virtual_irq = arch_timer_ppi[ARCH_TIMER_VIRT_PPI];
+	arch_timer_populate_kvm_info();
 
 	rate = arch_timer_get_cntfrq();
 	arch_timer_of_configure_rate(rate, np);
···
 	arch_timer_ppi[ARCH_TIMER_HYP_PPI] =
 		acpi_gtdt_map_ppi(ARCH_TIMER_HYP_PPI);
 
-	arch_timer_kvm_info.virtual_irq = arch_timer_ppi[ARCH_TIMER_VIRT_PPI];
+	arch_timer_populate_kvm_info();
 
 	/*
 	 * When probing via ACPI, we have no mechanism to override the sysreg

+37

drivers/s390/cio/chsc.c

···
 	return chsc_error_from_response(brinfo_area->response.code);
 }
 EXPORT_SYMBOL_GPL(chsc_pnso_brinfo);
+
+int chsc_sgib(u32 origin)
+{
+	struct {
+		struct chsc_header request;
+		u16 op;
+		u8 reserved01[2];
+		u8 reserved02:4;
+		u8 fmt:4;
+		u8 reserved03[7];
+		/* operation data area begin */
+		u8 reserved04[4];
+		u32 gib_origin;
+		u8 reserved05[10];
+		u8 aix;
+		u8 reserved06[4029];
+		struct chsc_header response;
+		u8 reserved07[4];
+	} *sgib_area;
+	int ret;
+
+	spin_lock_irq(&chsc_page_lock);
+	memset(chsc_page, 0, PAGE_SIZE);
+	sgib_area = chsc_page;
+	sgib_area->request.length = 0x0fe0;
+	sgib_area->request.code = 0x0021;
+	sgib_area->op = 0x1;
+	sgib_area->gib_origin = origin;
+
+	ret = chsc(sgib_area);
+	if (ret == 0)
+		ret = chsc_error_from_response(sgib_area->response.code);
+	spin_unlock_irq(&chsc_page_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(chsc_sgib);
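In `chsc_sgib()`, the response header sits immediately after the 0x0fe0-byte request, which is why `request.length` is set to 0x0fe0. This layout can be checked at compile time with `offsetof`.

The sketch below mirrors the anonymous structure using `<stdint.h>` types, under three assumptions:
- `struct chsc_header` is two 16-bit fields.
- The two 4-bit `uint8_t` bit-fields pack into a single byte, as GCC and Clang do.
- The block must fit in the one page that `chsc_page` provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed mirror of the kernel's struct chsc_header: two 16-bit fields. */
struct chsc_header {
	uint16_t length;
	uint16_t code;
};

/* User-space mirror of the anonymous request/response block in chsc_sgib(). */
struct sgib_area {
	struct chsc_header request;
	uint16_t op;
	uint8_t reserved01[2];
	uint8_t reserved02:4;	/* uint8_t bit-fields: a GCC/Clang extension */
	uint8_t fmt:4;
	uint8_t reserved03[7];
	/* operation data area begin */
	uint8_t reserved04[4];
	uint32_t gib_origin;
	uint8_t reserved05[10];
	uint8_t aix;
	uint8_t reserved06[4029];
	struct chsc_header response;
	uint8_t reserved07[4];
};

/* The response header must start exactly where the 0x0fe0-byte request ends. */
static_assert(offsetof(struct sgib_area, response) == 0x0fe0,
	      "response must follow the 0x0fe0-byte request");

/* The whole block must fit in a single 4 KiB page. */
static_assert(sizeof(struct sgib_area) <= 4096, "block must fit in one page");
```

If a field is ever added or resized, the first `static_assert` fails at build time rather than letting the request length silently disagree with the layout.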

+1

drivers/s390/cio/chsc.h

···
 int chsc_ssqd(struct subchannel_id schid, struct chsc_ssqd_area *ssqd);
 int chsc_sadc(struct subchannel_id schid, struct chsc_scssc_area *scssc,
 	      u64 summary_indicator_addr, u64 subchannel_indicator_addr);
+int chsc_sgib(u32 origin);
 int chsc_error_from_response(int response);
 
 int chsc_siosl(struct subchannel_id schid);

+1

include/clocksource/arm_arch_timer.h

···
 struct arch_timer_kvm_info {
 	struct timecounter timecounter;
 	int virtual_irq;
+	int physical_irq;
 };
 
 struct arch_timer_mem_frame {

+51 -21

include/kvm/arm_arch_timer.h

···
 #include <linux/clocksource.h>
 #include <linux/hrtimer.h>
 
+enum kvm_arch_timers {
+	TIMER_PTIMER,
+	TIMER_VTIMER,
+	NR_KVM_TIMERS
+};
+
+enum kvm_arch_timer_regs {
+	TIMER_REG_CNT,
+	TIMER_REG_CVAL,
+	TIMER_REG_TVAL,
+	TIMER_REG_CTL,
+};
+
 struct arch_timer_context {
+	struct kvm_vcpu *vcpu;
+
 	/* Registers: control register, timer value */
 	u32 cnt_ctl;
 	u64 cnt_cval;
···
 	/* Timer IRQ */
 	struct kvm_irq_level irq;
 
-	/*
-	 * We have multiple paths which can save/restore the timer state
-	 * onto the hardware, so we need some way of keeping track of
-	 * where the latest state is.
-	 *
-	 * loaded == true: State is loaded on the hardware registers.
-	 * loaded == false: State is stored in memory.
-	 */
-	bool loaded;
-
 	/* Virtual offset */
-	u64 cntvoff;
+	u64 cntvoff;
+
+	/* Emulated Timer (may be unused) */
+	struct hrtimer hrtimer;
+
+	/*
+	 * We have multiple paths which can save/restore the timer state onto
+	 * the hardware, so we need some way of keeping track of where the
+	 * latest state is.
+	 */
+	bool loaded;
+
+	/* Duplicated state from arch_timer.c for convenience */
+	u32 host_timer_irq;
+	u32 host_timer_irq_flags;
+};
+
+struct timer_map {
+	struct arch_timer_context *direct_vtimer;
+	struct arch_timer_context *direct_ptimer;
+	struct arch_timer_context *emul_ptimer;
 };
 
 struct arch_timer_cpu {
-	struct arch_timer_context vtimer;
-	struct arch_timer_context ptimer;
+	struct arch_timer_context timers[NR_KVM_TIMERS];
 
 	/* Background timer used when the guest is not running */
 	struct hrtimer bg_timer;
-
-	/* Physical timer emulation */
-	struct hrtimer phys_timer;
 
 	/* Is the timer enabled */
 	bool enabled;
···
 
 bool kvm_timer_is_pending(struct kvm_vcpu *vcpu);
 
-void kvm_timer_schedule(struct kvm_vcpu *vcpu);
-void kvm_timer_unschedule(struct kvm_vcpu *vcpu);
-
 u64 kvm_phys_timer_read(void);
 
 void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);
···
 
 bool kvm_arch_timer_get_input_level(int vintid);
 
-#define vcpu_vtimer(v) (&(v)->arch.timer_cpu.vtimer)
-#define vcpu_ptimer(v) (&(v)->arch.timer_cpu.ptimer)
+#define vcpu_timer(v) (&(v)->arch.timer_cpu)
+#define vcpu_get_timer(v,t) (&vcpu_timer(v)->timers[(t)])
+#define vcpu_vtimer(v) (&(v)->arch.timer_cpu.timers[TIMER_VTIMER])
+#define vcpu_ptimer(v) (&(v)->arch.timer_cpu.timers[TIMER_PTIMER])
+
+#define arch_timer_ctx_index(ctx) ((ctx) - vcpu_timer((ctx)->vcpu)->timers)
+
+u64 kvm_arm_timer_read_sysreg(struct kvm_vcpu *vcpu,
+			      enum kvm_arch_timers tmr,
+			      enum kvm_arch_timer_regs treg);
+void kvm_arm_timer_write_sysreg(struct kvm_vcpu *vcpu,
+				enum kvm_arch_timers tmr,
+				enum kvm_arch_timer_regs treg,
+				u64 val);
 
 #endif
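`arch_timer_ctx_index()` recovers a context's `enum kvm_arch_timers` index by pointer subtraction. Each `arch_timer_context` now carries a back-pointer to its vCPU, so subtracting the base of that vCPU's `timers[]` array from the context's address gives its element index.

The user-space sketch below shows the same arithmetic. `struct vcpu`, `struct timer_ctx` and `vcpu_init_timers()` are hypothetical simplifications of `struct kvm_vcpu` and `struct arch_timer_context`.

```c
#include <assert.h>
#include <stddef.h>

enum timers { TIMER_PTIMER, TIMER_VTIMER, NR_TIMERS };

struct vcpu;	/* forward declaration for the back-pointer */

/* Simplified stand-in for struct arch_timer_context. */
struct timer_ctx {
	struct vcpu *vcpu;
	unsigned int cnt_ctl;
};

/* Simplified stand-in for a vCPU embedding its timer array. */
struct vcpu {
	int id;
	struct timer_ctx timers[NR_TIMERS];
};

/* Same arithmetic as arch_timer_ctx_index(): element address minus array base. */
#define ctx_index(ctx) ((ctx) - (ctx)->vcpu->timers)

static void vcpu_init_timers(struct vcpu *v)
{
	for (size_t i = 0; i < NR_TIMERS; i++)
		v->timers[i].vcpu = v;	/* the back-pointer makes the lookup possible */
}
```

This is why the patched save/restore helpers can take a bare context pointer and still `switch` on which timer it is.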

+23 -1

include/linux/kvm_host.h

···
  */
 #define KVM_MEMSLOT_INVALID (1UL << 16)
 
+/*
+ * Bit 63 of the memslot generation number is an "update in-progress flag",
+ * e.g. is temporarily set for the duration of install_new_memslots().
+ * This flag effectively creates a unique generation number that is used to
+ * mark cached memslot data, e.g. MMIO accesses, as potentially being stale,
+ * i.e. may (or may not) have come from the previous memslots generation.
+ *
+ * This is necessary because the actual memslots update is not atomic with
+ * respect to the generation number update. Updating the generation number
+ * first would allow a vCPU to cache a spte from the old memslots using the
+ * new generation number, and updating the generation number after switching
+ * to the new memslots would allow cache hits using the old generation number
+ * to reference the defunct memslots.
+ *
+ * This mechanism is used to prevent getting hits in KVM's caches while a
+ * memslot update is in-progress, and to prevent cache hits *after* updating
+ * the actual generation number against accesses that were inserted into the
+ * cache *before* the memslots were updated.
+ */
+#define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS BIT_ULL(63)
+
 /* Two fragments for cross MMIO pages. */
 #define KVM_MAX_MMIO_FRAGMENTS 2
 
···
 				struct kvm_memory_slot *dont);
 int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 			    unsigned long npages);
-void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots);
+void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				struct kvm_memory_slot *memslot,
 				const struct kvm_userspace_memory_region *mem,
···
 
 extern unsigned int halt_poll_ns;
 extern unsigned int halt_poll_ns_grow;
+extern unsigned int halt_poll_ns_grow_start;
 extern unsigned int halt_poll_ns_shrink;
 
 struct kvm_device {
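The comment above describes a two-step update:
1. Set the in-progress bit and install the new memslots.
2. Advance the generation with the bit cleared.

Readers, such as the patched `vcpu_cache_mmio_info()` in `x86.h`, refuse to cache while the bit is set.

The single-threaded sketch below models both sides. A `reader` callback simulates a lookup racing with the update at each intermediate step. All names are hypothetical, and the real `install_new_memslots()` differs in detail, for example in how far it advances the generation and in how it synchronizes with concurrent readers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GEN_UPDATE_IN_PROGRESS (UINT64_C(1) << 63)

/* Hypothetical, simplified stand-in for struct kvm_memslots. */
struct slots {
	uint64_t generation;
	int layout;		/* stands in for the actual slot array */
};

/* A cached lookup, tagged with the generation seen when it was filled. */
struct cache_entry {
	bool valid;
	uint64_t gen;
	int layout;
};

/* Reader side: never cache while an update is in flight. */
static void cache_fill(struct slots *s, void *arg)
{
	struct cache_entry *e = arg;

	if (s->generation & GEN_UPDATE_IN_PROGRESS)
		return;
	e->valid = true;
	e->gen = s->generation;
	e->layout = s->layout;
}

/* A hit requires the cached generation to equal the current one. */
static bool cache_hit(const struct cache_entry *e, const struct slots *s)
{
	return e->valid && e->gen == s->generation;
}

/* Writer side: flag the generation, swap the data, publish a new generation. */
static void install_new_slots(struct slots *s, int new_layout,
			      void (*reader)(struct slots *, void *), void *arg)
{
	uint64_t gen = s->generation & ~GEN_UPDATE_IN_PROGRESS;

	s->generation = gen | GEN_UPDATE_IN_PROGRESS;	/* step 1: mark in progress */
	if (reader)
		reader(s, arg);				/* racing lookup: old data */
	s->layout = new_layout;				/* swap the memslots */
	if (reader)
		reader(s, arg);				/* racing lookup: new data */
	s->generation = gen + 1;			/* step 2: bit cleared */
}
```

An entry cached before the update misses afterwards because its generation is stale. A lookup that races with the update never populates the cache, which is the property the comment relies on.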

+1

tools/testing/selftests/kvm/.gitignore

···
 /x86_64/platform_info_test
 /x86_64/set_sregs_test
 /x86_64/sync_regs_test
+/x86_64/vmx_close_while_nested_test
 /x86_64/vmx_tsc_adjust_test
 /x86_64/state_test
 /dirty_log_test

+1

tools/testing/selftests/kvm/Makefile

···
 TEST_GEN_PROGS_x86_64 += x86_64/state_test
 TEST_GEN_PROGS_x86_64 += x86_64/evmcs_test
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_cpuid
+TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
 

+95

tools/testing/selftests/kvm/x86_64/vmx_close_while_nested_test.c

···
+/*
+ * vmx_close_while_nested
+ *
+ * Copyright (C) 2019, Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Verify that nothing bad happens if a KVM user exits with open
+ * file descriptors while executing a nested guest.
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "vmx.h"
+
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include "kselftest.h"
+
+#define VCPU_ID 5
+
+enum {
+	PORT_L0_EXIT = 0x2000,
+};
+
+/* The virtual machine object. */
+static struct kvm_vm *vm;
+
+static void l2_guest_code(void)
+{
+	/* Exit to L0 */
+	asm volatile("inb %%dx, %%al"
+		     : : [port] "d" (PORT_L0_EXIT) : "rax");
+}
+
+static void l1_guest_code(struct vmx_pages *vmx_pages)
+{
+#define L2_GUEST_STACK_SIZE 64
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+	uint32_t control;
+	uintptr_t save_cr3;
+
+	GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
+	GUEST_ASSERT(load_vmcs(vmx_pages));
+
+	/* Prepare the VMCS for L2 execution. */
+	prepare_vmcs(vmx_pages, l2_guest_code,
+		     &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+	GUEST_ASSERT(!vmlaunch());
+	GUEST_ASSERT(0);
+}
+
+int main(int argc, char *argv[])
+{
+	struct vmx_pages *vmx_pages;
+	vm_vaddr_t vmx_pages_gva;
+	struct kvm_cpuid_entry2 *entry = kvm_get_supported_cpuid_entry(1);
+
+	if (!(entry->ecx & CPUID_VMX)) {
+		fprintf(stderr, "nested VMX not enabled, skipping test\n");
+		exit(KSFT_SKIP);
+	}
+
+	vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
+	vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+
+	/* Allocate VMX pages and shared descriptors (vmx_pages). */
+	vmx_pages = vcpu_alloc_vmx(vm, &vmx_pages_gva);
+	vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
+
+	for (;;) {
+		volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+		struct ucall uc;
+
+		vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
+			    run->exit_reason,
+			    exit_reason_str(run->exit_reason));
+
+		if (run->io.port == PORT_L0_EXIT)
+			break;
+
+		switch (get_ucall(vm, VCPU_ID, &uc)) {
+		case UCALL_ABORT:
+			TEST_ASSERT(false, "%s", (const char *)uc.args[0]);
+			/* NOT REACHED */
+		default:
+			TEST_ASSERT(false, "Unknown ucall 0x%x.", uc.cmd);
+		}
+	}
+}

+430 -186

virt/kvm/arm/arch_timer.c

···
 
 #include <clocksource/arm_arch_timer.h>
 #include <asm/arch_timer.h>
+#include <asm/kvm_emulate.h>
 #include <asm/kvm_hyp.h>
 
 #include <kvm/arm_vgic.h>
···
 
 static struct timecounter *timecounter;
 static unsigned int host_vtimer_irq;
+static unsigned int host_ptimer_irq;
 static u32 host_vtimer_irq_flags;
+static u32 host_ptimer_irq_flags;
 
 static DEFINE_STATIC_KEY_FALSE(has_gic_active_state);
 
···
 static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
 				 struct arch_timer_context *timer_ctx);
 static bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx);
+static void kvm_arm_timer_write(struct kvm_vcpu *vcpu,
+				struct arch_timer_context *timer,
+				enum kvm_arch_timer_regs treg,
+				u64 val);
+static u64 kvm_arm_timer_read(struct kvm_vcpu *vcpu,
+			      struct arch_timer_context *timer,
+			      enum kvm_arch_timer_regs treg);
 
 u64 kvm_phys_timer_read(void)
 {
 	return timecounter->cc->read(timecounter->cc);
+}
+
+static void get_timer_map(struct kvm_vcpu *vcpu, struct timer_map *map)
+{
+	if (has_vhe()) {
+		map->direct_vtimer = vcpu_vtimer(vcpu);
+		map->direct_ptimer = vcpu_ptimer(vcpu);
+		map->emul_ptimer = NULL;
+	} else {
+		map->direct_vtimer = vcpu_vtimer(vcpu);
+		map->direct_ptimer = NULL;
+		map->emul_ptimer = vcpu_ptimer(vcpu);
+	}
+
+	trace_kvm_get_timer_map(vcpu->vcpu_id, map);
 }
 
 static inline bool userspace_irqchip(struct kvm *kvm)
···
 static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
 {
 	struct kvm_vcpu *vcpu = *(struct kvm_vcpu **)dev_id;
-	struct arch_timer_context *vtimer;
+	struct arch_timer_context *ctx;
+	struct timer_map map;
 
 	/*
 	 * We may see a timer interrupt after vcpu_put() has been called which
 	 * sets the CPU's vcpu pointer to NULL, because even though the timer
-	 * has been disabled in vtimer_save_state(), the hardware interrupt
+	 * has been disabled in timer_save_state(), the hardware interrupt
 	 * signal may not have been retired from the interrupt controller yet.
 	 */
 	if (!vcpu)
 		return IRQ_HANDLED;
 
-	vtimer = vcpu_vtimer(vcpu);
-	if (kvm_timer_should_fire(vtimer))
-		kvm_timer_update_irq(vcpu, true, vtimer);
+	get_timer_map(vcpu, &map);
+
+	if (irq == host_vtimer_irq)
+		ctx = map.direct_vtimer;
+	else
+		ctx = map.direct_ptimer;
+
+	if (kvm_timer_should_fire(ctx))
+		kvm_timer_update_irq(vcpu, true, ctx);
 
 	if (userspace_irqchip(vcpu->kvm) &&
 	    !static_branch_unlikely(&has_gic_active_state))
···
 
 static bool kvm_timer_irq_can_fire(struct arch_timer_context *timer_ctx)
 {
-	return !(timer_ctx->cnt_ctl & ARCH_TIMER_CTRL_IT_MASK) &&
+	WARN_ON(timer_ctx && timer_ctx->loaded);
+	return timer_ctx &&
+		!(timer_ctx->cnt_ctl & ARCH_TIMER_CTRL_IT_MASK) &&
 		(timer_ctx->cnt_ctl & ARCH_TIMER_CTRL_ENABLE);
 }
 
···
  */
 static u64 kvm_timer_earliest_exp(struct kvm_vcpu *vcpu)
 {
-	u64 min_virt = ULLONG_MAX, min_phys = ULLONG_MAX;
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
-	struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
+	u64 min_delta = ULLONG_MAX;
+	int i;
 
-	if (kvm_timer_irq_can_fire(vtimer))
-		min_virt = kvm_timer_compute_delta(vtimer);
+	for (i = 0; i < NR_KVM_TIMERS; i++) {
+		struct arch_timer_context *ctx = &vcpu->arch.timer_cpu.timers[i];
 
-	if (kvm_timer_irq_can_fire(ptimer))
-		min_phys = kvm_timer_compute_delta(ptimer);
+		WARN(ctx->loaded, "timer %d loaded\n", i);
+		if (kvm_timer_irq_can_fire(ctx))
+			min_delta = min(min_delta, kvm_timer_compute_delta(ctx));
+	}
 
 	/* If none of timers can fire, then return 0 */
-	if ((min_virt == ULLONG_MAX) && (min_phys == ULLONG_MAX))
+	if (min_delta == ULLONG_MAX)
 		return 0;
 
-	return min(min_virt, min_phys);
+	return min_delta;
 }
 
 static enum hrtimer_restart kvm_bg_timer_expire(struct hrtimer *hrt)
···
 	return HRTIMER_NORESTART;
 }
 
-static enum hrtimer_restart kvm_phys_timer_expire(struct hrtimer *hrt)
+static enum hrtimer_restart kvm_hrtimer_expire(struct hrtimer *hrt)
 {
-	struct arch_timer_context *ptimer;
-	struct arch_timer_cpu *timer;
+	struct arch_timer_context *ctx;
 	struct kvm_vcpu *vcpu;
 	u64 ns;
 
-	timer = container_of(hrt, struct arch_timer_cpu, phys_timer);
-	vcpu = container_of(timer, struct kvm_vcpu, arch.timer_cpu);
-	ptimer = vcpu_ptimer(vcpu);
+	ctx = container_of(hrt, struct arch_timer_context, hrtimer);
+	vcpu = ctx->vcpu;
+
+	trace_kvm_timer_hrtimer_expire(ctx);
 
 	/*
 	 * Check that the timer has really expired from the guest's
 	 * PoV (NTP on the host may have forced it to expire
 	 * early). If not ready, schedule for a later time.
 	 */
-	ns = kvm_timer_compute_delta(ptimer);
+	ns = kvm_timer_compute_delta(ctx);
 	if (unlikely(ns)) {
 		hrtimer_forward_now(hrt, ns_to_ktime(ns));
 		return HRTIMER_RESTART;
 	}
 
-	kvm_timer_update_irq(vcpu, true, ptimer);
+	kvm_timer_update_irq(vcpu, true, ctx);
 	return HRTIMER_NORESTART;
 }
 
 static bool kvm_timer_should_fire(struct arch_timer_context *timer_ctx)
 {
+	enum kvm_arch_timers index;
 	u64 cval, now;
 
-	if (timer_ctx->loaded) {
-		u32 cnt_ctl;
+	if (!timer_ctx)
+		return false;
 
-		/* Only the virtual timer can be loaded so far */
-		cnt_ctl = read_sysreg_el0(cntv_ctl);
+	index = arch_timer_ctx_index(timer_ctx);
+
+	if (timer_ctx->loaded) {
+		u32 cnt_ctl = 0;
+
+		switch (index) {
+		case TIMER_VTIMER:
+			cnt_ctl = read_sysreg_el0(cntv_ctl);
+			break;
+		case TIMER_PTIMER:
+			cnt_ctl = read_sysreg_el0(cntp_ctl);
+			break;
+		case NR_KVM_TIMERS:
+			/* GCC is braindead */
+			cnt_ctl = 0;
+			break;
+		}
+
 		return (cnt_ctl & ARCH_TIMER_CTRL_ENABLE) &&
 		       (cnt_ctl & ARCH_TIMER_CTRL_IT_STAT) &&
 		       !(cnt_ctl & ARCH_TIMER_CTRL_IT_MASK);
···
 
 bool kvm_timer_is_pending(struct kvm_vcpu *vcpu)
 {
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
-	struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
+	struct timer_map map;
 
-	if (kvm_timer_should_fire(vtimer))
-		return true;
+	get_timer_map(vcpu, &map);
 
-	return kvm_timer_should_fire(ptimer);
+	return kvm_timer_should_fire(map.direct_vtimer) ||
+	       kvm_timer_should_fire(map.direct_ptimer) ||
+	       kvm_timer_should_fire(map.emul_ptimer);
 }
 
 /*
···
 	}
 }
 
-/* Schedule the background timer for the emulated timer. */
-static void phys_timer_emulate(struct kvm_vcpu *vcpu)
+static void timer_emulate(struct arch_timer_context *ctx)
 {
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-	struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
+	bool should_fire = kvm_timer_should_fire(ctx);
+
+	trace_kvm_timer_emulate(ctx, should_fire);
+
+	if (should_fire) {
+		kvm_timer_update_irq(ctx->vcpu, true, ctx);
+		return;
+	}
 
 	/*
 	 * If the timer can fire now, we don't need to have a soft timer
 	 * scheduled for the future. If the timer cannot fire at all,
 	 * then we also don't need a soft timer.
 	 */
-	if (kvm_timer_should_fire(ptimer) || !kvm_timer_irq_can_fire(ptimer)) {
-		soft_timer_cancel(&timer->phys_timer);
+	if (!kvm_timer_irq_can_fire(ctx)) {
+		soft_timer_cancel(&ctx->hrtimer);
 		return;
 	}
 
-	soft_timer_start(&timer->phys_timer, kvm_timer_compute_delta(ptimer));
+	soft_timer_start(&ctx->hrtimer, kvm_timer_compute_delta(ctx));
 }
 
-/*
- * Check if there was a change in the timer state, so that we should either
- * raise or lower the line level to the GIC or schedule a background timer to
- * emulate the physical timer.
- */
-static void kvm_timer_update_state(struct kvm_vcpu *vcpu)
+static void timer_save_state(struct arch_timer_context *ctx)
 {
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
-	struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
-	bool level;
-
-	if (unlikely(!timer->enabled))
-		return;
-
-	/*
-	 * The vtimer virtual interrupt is a 'mapped' interrupt, meaning part
-	 * of its lifecycle is offloaded to the hardware, and we therefore may
-	 * not have lowered the irq.level value before having to signal a new
-	 * interrupt, but have to signal an interrupt every time the level is
-	 * asserted.
-	 */
-	level = kvm_timer_should_fire(vtimer);
-	kvm_timer_update_irq(vcpu, level, vtimer);
-
-	phys_timer_emulate(vcpu);
-
-	if (kvm_timer_should_fire(ptimer) != ptimer->irq.level)
-		kvm_timer_update_irq(vcpu, !ptimer->irq.level, ptimer);
-}
-
-static void vtimer_save_state(struct kvm_vcpu *vcpu)
-{
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
+	struct arch_timer_cpu *timer = vcpu_timer(ctx->vcpu);
+	enum kvm_arch_timers index = arch_timer_ctx_index(ctx);
 	unsigned long flags;
+
+	if (!timer->enabled)
+		return;
 
 	local_irq_save(flags);
 
-	if (!vtimer->loaded)
+	if (!ctx->loaded)
 		goto out;
 
-	if (timer->enabled) {
-		vtimer->cnt_ctl = read_sysreg_el0(cntv_ctl);
-		vtimer->cnt_cval = read_sysreg_el0(cntv_cval);
+	switch (index) {
+	case TIMER_VTIMER:
+		ctx->cnt_ctl = read_sysreg_el0(cntv_ctl);
+		ctx->cnt_cval = read_sysreg_el0(cntv_cval);
+
+		/* Disable the timer */
+		write_sysreg_el0(0, cntv_ctl);
+		isb();
+
+		break;
+	case TIMER_PTIMER:
+		ctx->cnt_ctl = read_sysreg_el0(cntp_ctl);
+		ctx->cnt_cval = read_sysreg_el0(cntp_cval);
+
+		/* Disable the timer */
+		write_sysreg_el0(0, cntp_ctl);
+		isb();
+
+		break;
+	case NR_KVM_TIMERS:
+		BUG();
 	}
 
-	/* Disable the virtual timer */
-	write_sysreg_el0(0, cntv_ctl);
-	isb();
+	trace_kvm_timer_save_state(ctx);
 
-	vtimer->loaded = false;
+	ctx->loaded = false;
 out:
 	local_irq_restore(flags);
 }
···
  * thread is removed from its waitqueue and made runnable when there's a timer
  * interrupt to handle.
  */
-void kvm_timer_schedule(struct kvm_vcpu *vcpu)
+static void kvm_timer_blocking(struct kvm_vcpu *vcpu)
 {
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
-	struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
+	struct arch_timer_cpu *timer = vcpu_timer(vcpu);
+	struct timer_map map;
 
-	vtimer_save_state(vcpu);
-
-	/*
-	 * No need to schedule a background timer if any guest timer has
-	 * already expired, because kvm_vcpu_block will return before putting
-	 * the thread to sleep.
-	 */
-	if (kvm_timer_should_fire(vtimer) || kvm_timer_should_fire(ptimer))
-		return;
+	get_timer_map(vcpu, &map);
 
 	/*
-	 * If both timers are not capable of raising interrupts (disabled or
+	 * If no timers are capable of raising interrupts (disabled or
 	 * masked), then there's no more work for us to do.
 	 */
-	if (!kvm_timer_irq_can_fire(vtimer) && !kvm_timer_irq_can_fire(ptimer))
+	if (!kvm_timer_irq_can_fire(map.direct_vtimer) &&
+	    !kvm_timer_irq_can_fire(map.direct_ptimer) &&
+	    !kvm_timer_irq_can_fire(map.emul_ptimer))
 		return;
 
 	/*
-	 * The guest timers have not yet expired, schedule a background timer.
+	 * At least one guest time will expire. Schedule a background timer.
 	 * Set the earliest expiration time among the guest timers.
 	 */
 	soft_timer_start(&timer->bg_timer, kvm_timer_earliest_exp(vcpu));
 }
 
-static void vtimer_restore_state(struct kvm_vcpu *vcpu)
+static void kvm_timer_unblocking(struct kvm_vcpu *vcpu)
 {
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
+	struct arch_timer_cpu *timer = vcpu_timer(vcpu);
+
+	soft_timer_cancel(&timer->bg_timer);
+}
+
+static void timer_restore_state(struct arch_timer_context *ctx)
+{
+	struct arch_timer_cpu *timer = vcpu_timer(ctx->vcpu);
+	enum kvm_arch_timers index = arch_timer_ctx_index(ctx);
 	unsigned long flags;
+
+	if (!timer->enabled)
+		return;
 
 	local_irq_save(flags);
 
-	if (vtimer->loaded)
+	if (ctx->loaded)
 		goto out;
 
-	if (timer->enabled) {
-		write_sysreg_el0(vtimer->cnt_cval, cntv_cval);
+	switch (index) {
+	case TIMER_VTIMER:
+		write_sysreg_el0(ctx->cnt_cval, cntv_cval);
 		isb();
-		write_sysreg_el0(vtimer->cnt_ctl, cntv_ctl);
+		write_sysreg_el0(ctx->cnt_ctl, cntv_ctl);
+		break;
+	case TIMER_PTIMER:
+		write_sysreg_el0(ctx->cnt_cval, cntp_cval);
+		isb();
+		write_sysreg_el0(ctx->cnt_ctl, cntp_ctl);
+		break;
+	case NR_KVM_TIMERS:
+		BUG();
 	}
 
-	vtimer->loaded = true;
+	trace_kvm_timer_restore_state(ctx);
+
+	ctx->loaded = true;
 out:
 	local_irq_restore(flags);
-}
-
-void kvm_timer_unschedule(struct kvm_vcpu *vcpu)
-{
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-
-	vtimer_restore_state(vcpu);
-
-	soft_timer_cancel(&timer->bg_timer);
 }
 
 static void set_cntvoff(u64 cntvoff)
···
 	kvm_call_hyp(__kvm_timer_set_cntvoff, low, high);
 }
 
-static inline void set_vtimer_irq_phys_active(struct kvm_vcpu *vcpu, bool active)
+static inline void set_timer_irq_phys_active(struct arch_timer_context *ctx, bool active)
 {
 	int r;
-	r = irq_set_irqchip_state(host_vtimer_irq, IRQCHIP_STATE_ACTIVE, active);
+	r = irq_set_irqchip_state(ctx->host_timer_irq, IRQCHIP_STATE_ACTIVE, active);
 	WARN_ON(r);
 }
 
-static void kvm_timer_vcpu_load_gic(struct kvm_vcpu *vcpu)
+static void kvm_timer_vcpu_load_gic(struct arch_timer_context *ctx)
 {
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
-	bool phys_active;
+	struct kvm_vcpu *vcpu = ctx->vcpu;
+	bool phys_active = false;
+
+	/*
+	 * Update the timer output so that it is likely to match the
+	 * state we're about to restore. If the timer expires between
+	 * this point and the register restoration, we'll take the
+	 * interrupt anyway.
+	 */
+	kvm_timer_update_irq(ctx->vcpu, kvm_timer_should_fire(ctx), ctx);
 
 	if (irqchip_in_kernel(vcpu->kvm))
-		phys_active = kvm_vgic_map_is_active(vcpu, vtimer->irq.irq);
-	else
-		phys_active = vtimer->irq.level;
-	set_vtimer_irq_phys_active(vcpu, phys_active);
+		phys_active = kvm_vgic_map_is_active(vcpu, ctx->irq.irq);
+
+	phys_active |= ctx->irq.level;
+
+	set_timer_irq_phys_active(ctx, phys_active);
 }
 
 static void kvm_timer_vcpu_load_nogic(struct kvm_vcpu *vcpu)
···
 
 void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
 {
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
-	struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
+	struct arch_timer_cpu *timer = vcpu_timer(vcpu);
+	struct timer_map map;
 
 	if (unlikely(!timer->enabled))
 		return;
 
-	if (static_branch_likely(&has_gic_active_state))
-		kvm_timer_vcpu_load_gic(vcpu);
-	else
+	get_timer_map(vcpu, &map);
+
+	if (static_branch_likely(&has_gic_active_state)) {
+		kvm_timer_vcpu_load_gic(map.direct_vtimer);
+		if (map.direct_ptimer)
+			kvm_timer_vcpu_load_gic(map.direct_ptimer);
+	} else {
 		kvm_timer_vcpu_load_nogic(vcpu);
+	}
 
-	set_cntvoff(vtimer->cntvoff);
+	set_cntvoff(map.direct_vtimer->cntvoff);
 
-	vtimer_restore_state(vcpu);
+	kvm_timer_unblocking(vcpu);
 
-	/* Set the background timer for the physical timer emulation. */
-	phys_timer_emulate(vcpu);
+	timer_restore_state(map.direct_vtimer);
+	if (map.direct_ptimer)
+		timer_restore_state(map.direct_ptimer);
 
-	/* If the timer fired while we weren't running, inject it now */
-	if (kvm_timer_should_fire(ptimer) != ptimer->irq.level)
-		kvm_timer_update_irq(vcpu, !ptimer->irq.level, ptimer);
+	if (map.emul_ptimer)
+		timer_emulate(map.emul_ptimer);
 }
 
 bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu)
···
 
 void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
 {
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
+	struct arch_timer_cpu *timer = vcpu_timer(vcpu);
+	struct timer_map map;
 
 	if (unlikely(!timer->enabled))
 		return;
 
-	vtimer_save_state(vcpu);
+	get_timer_map(vcpu, &map);
+
+	timer_save_state(map.direct_vtimer);
+	if (map.direct_ptimer)
+		timer_save_state(map.direct_ptimer);
 
 	/*
-	 * Cancel the physical timer emulation, because the only case where we
+	 * Cancel soft timer emulation, because the only case where we
 	 * need it after a vcpu_put is in the context of a sleeping VCPU, and
 	 * in that case we already factor in the deadline for the physical
 	 * timer when scheduling the bg_timer.
···
 	 * In any case, we re-schedule the hrtimer for the physical timer when
 	 * coming back to the VCPU thread in kvm_timer_vcpu_load().
 	 */
-	soft_timer_cancel(&timer->phys_timer);
+	if (map.emul_ptimer)
+		soft_timer_cancel(&map.emul_ptimer->hrtimer);
+
+	if (swait_active(kvm_arch_vcpu_wq(vcpu)))
+		kvm_timer_blocking(vcpu);
 
 	/*
 	 * The kernel may decide to run userspace after calling vcpu_put, so
···
 	 * counter of non-VHE case. For VHE, the virtual counter uses a fixed
 	 * virtual offset of zero, so no need to zero CNTVOFF_EL2 register.
 	 */
-	if (!has_vhe())
-		set_cntvoff(0);
+	set_cntvoff(0);
 }
 
 /*
···
 	if (!kvm_timer_should_fire(vtimer)) {
 		kvm_timer_update_irq(vcpu, false, vtimer);
 		if (static_branch_likely(&has_gic_active_state))
-			set_vtimer_irq_phys_active(vcpu, false);
+			set_timer_irq_phys_active(vtimer, false);
 		else
 			enable_percpu_irq(host_vtimer_irq, host_vtimer_irq_flags);
 	}
···
 
 void kvm_timer_sync_hwstate(struct kvm_vcpu *vcpu)
 {
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
+	struct arch_timer_cpu *timer = vcpu_timer(vcpu);
 
 	if (unlikely(!timer->enabled))
 		return;
···
 
 int kvm_timer_vcpu_reset(struct kvm_vcpu *vcpu)
 {
-	struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
-	struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
-	struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
+	struct arch_timer_cpu *timer = vcpu_timer(vcpu);
+	struct timer_map map;
+
+	get_timer_map(vcpu, &map);
 
 	/*
 	 * The bits in CNTV_CTL are architecturally reset to UNKNOWN for ARMv8
···
 	 * resets the timer to be disabled and unmasked and is compliant with
 	 * the ARMv7 architecture.
653 581 */ 654 - vtimer->cnt_ctl = 0; 655 - ptimer->cnt_ctl = 0; 656 - kvm_timer_update_state(vcpu); 582 + vcpu_vtimer(vcpu)->cnt_ctl = 0; 583 + vcpu_ptimer(vcpu)->cnt_ctl = 0; 657 584 658 - if (timer->enabled && irqchip_in_kernel(vcpu->kvm)) 659 - kvm_vgic_reset_mapped_irq(vcpu, vtimer->irq.irq); 585 + if (timer->enabled) { 586 + kvm_timer_update_irq(vcpu, false, vcpu_vtimer(vcpu)); 587 + kvm_timer_update_irq(vcpu, false, vcpu_ptimer(vcpu)); 588 + 589 + if (irqchip_in_kernel(vcpu->kvm)) { 590 + kvm_vgic_reset_mapped_irq(vcpu, map.direct_vtimer->irq.irq); 591 + if (map.direct_ptimer) 592 + kvm_vgic_reset_mapped_irq(vcpu, map.direct_ptimer->irq.irq); 593 + } 594 + } 595 + 596 + if (map.emul_ptimer) 597 + soft_timer_cancel(&map.emul_ptimer->hrtimer); 660 598 661 599 return 0; 662 600 } ··· 692 610 693 611 void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu) 694 612 { 695 - struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 613 + struct arch_timer_cpu *timer = vcpu_timer(vcpu); 696 614 struct arch_timer_context *vtimer = vcpu_vtimer(vcpu); 697 615 struct arch_timer_context *ptimer = vcpu_ptimer(vcpu); 698 616 699 617 /* Synchronize cntvoff across all vtimers of a VM. 
*/ 700 618 update_vtimer_cntvoff(vcpu, kvm_phys_timer_read()); 701 - vcpu_ptimer(vcpu)->cntvoff = 0; 619 + ptimer->cntvoff = 0; 702 620 703 621 hrtimer_init(&timer->bg_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); 704 622 timer->bg_timer.function = kvm_bg_timer_expire; 705 623 706 - hrtimer_init(&timer->phys_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); 707 - timer->phys_timer.function = kvm_phys_timer_expire; 624 + hrtimer_init(&vtimer->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); 625 + hrtimer_init(&ptimer->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); 626 + vtimer->hrtimer.function = kvm_hrtimer_expire; 627 + ptimer->hrtimer.function = kvm_hrtimer_expire; 708 628 709 629 vtimer->irq.irq = default_vtimer_irq.irq; 710 630 ptimer->irq.irq = default_ptimer_irq.irq; 631 + 632 + vtimer->host_timer_irq = host_vtimer_irq; 633 + ptimer->host_timer_irq = host_ptimer_irq; 634 + 635 + vtimer->host_timer_irq_flags = host_vtimer_irq_flags; 636 + ptimer->host_timer_irq_flags = host_ptimer_irq_flags; 637 + 638 + vtimer->vcpu = vcpu; 639 + ptimer->vcpu = vcpu; 711 640 } 712 641 713 642 static void kvm_timer_init_interrupt(void *info) 714 643 { 715 644 enable_percpu_irq(host_vtimer_irq, host_vtimer_irq_flags); 645 + enable_percpu_irq(host_ptimer_irq, host_ptimer_irq_flags); 716 646 } 717 647 718 648 int kvm_arm_timer_set_reg(struct kvm_vcpu *vcpu, u64 regid, u64 value) 719 649 { 720 - struct arch_timer_context *vtimer = vcpu_vtimer(vcpu); 721 - struct arch_timer_context *ptimer = vcpu_ptimer(vcpu); 650 + struct arch_timer_context *timer; 651 + bool level; 722 652 723 653 switch (regid) { 724 654 case KVM_REG_ARM_TIMER_CTL: 725 - vtimer->cnt_ctl = value & ~ARCH_TIMER_CTRL_IT_STAT; 655 + timer = vcpu_vtimer(vcpu); 656 + kvm_arm_timer_write(vcpu, timer, TIMER_REG_CTL, value); 726 657 break; 727 658 case KVM_REG_ARM_TIMER_CNT: 659 + timer = vcpu_vtimer(vcpu); 728 660 update_vtimer_cntvoff(vcpu, kvm_phys_timer_read() - value); 729 661 break; 730 662 case KVM_REG_ARM_TIMER_CVAL: 731 - 
vtimer->cnt_cval = value; 663 + timer = vcpu_vtimer(vcpu); 664 + kvm_arm_timer_write(vcpu, timer, TIMER_REG_CVAL, value); 732 665 break; 733 666 case KVM_REG_ARM_PTIMER_CTL: 734 - ptimer->cnt_ctl = value & ~ARCH_TIMER_CTRL_IT_STAT; 667 + timer = vcpu_ptimer(vcpu); 668 + kvm_arm_timer_write(vcpu, timer, TIMER_REG_CTL, value); 735 669 break; 736 670 case KVM_REG_ARM_PTIMER_CVAL: 737 - ptimer->cnt_cval = value; 671 + timer = vcpu_ptimer(vcpu); 672 + kvm_arm_timer_write(vcpu, timer, TIMER_REG_CVAL, value); 738 673 break; 739 674 740 675 default: 741 676 return -1; 742 677 } 743 678 744 - kvm_timer_update_state(vcpu); 679 + level = kvm_timer_should_fire(timer); 680 + kvm_timer_update_irq(vcpu, level, timer); 681 + timer_emulate(timer); 682 + 745 683 return 0; 746 684 } 747 685 ··· 781 679 782 680 u64 kvm_arm_timer_get_reg(struct kvm_vcpu *vcpu, u64 regid) 783 681 { 784 - struct arch_timer_context *ptimer = vcpu_ptimer(vcpu); 785 - struct arch_timer_context *vtimer = vcpu_vtimer(vcpu); 786 - 787 682 switch (regid) { 788 683 case KVM_REG_ARM_TIMER_CTL: 789 - return read_timer_ctl(vtimer); 684 + return kvm_arm_timer_read(vcpu, 685 + vcpu_vtimer(vcpu), TIMER_REG_CTL); 790 686 case KVM_REG_ARM_TIMER_CNT: 791 - return kvm_phys_timer_read() - vtimer->cntvoff; 687 + return kvm_arm_timer_read(vcpu, 688 + vcpu_vtimer(vcpu), TIMER_REG_CNT); 792 689 case KVM_REG_ARM_TIMER_CVAL: 793 - return vtimer->cnt_cval; 690 + return kvm_arm_timer_read(vcpu, 691 + vcpu_vtimer(vcpu), TIMER_REG_CVAL); 794 692 case KVM_REG_ARM_PTIMER_CTL: 795 - return read_timer_ctl(ptimer); 796 - case KVM_REG_ARM_PTIMER_CVAL: 797 - return ptimer->cnt_cval; 693 + return kvm_arm_timer_read(vcpu, 694 + vcpu_ptimer(vcpu), TIMER_REG_CTL); 798 695 case KVM_REG_ARM_PTIMER_CNT: 799 - return kvm_phys_timer_read(); 696 + return kvm_arm_timer_read(vcpu, 697 + vcpu_vtimer(vcpu), TIMER_REG_CNT); 698 + case KVM_REG_ARM_PTIMER_CVAL: 699 + return kvm_arm_timer_read(vcpu, 700 + vcpu_ptimer(vcpu), TIMER_REG_CVAL); 800 701 } 801 
702 return (u64)-1; 703 + } 704 + 705 + static u64 kvm_arm_timer_read(struct kvm_vcpu *vcpu, 706 + struct arch_timer_context *timer, 707 + enum kvm_arch_timer_regs treg) 708 + { 709 + u64 val; 710 + 711 + switch (treg) { 712 + case TIMER_REG_TVAL: 713 + val = kvm_phys_timer_read() - timer->cntvoff - timer->cnt_cval; 714 + break; 715 + 716 + case TIMER_REG_CTL: 717 + val = read_timer_ctl(timer); 718 + break; 719 + 720 + case TIMER_REG_CVAL: 721 + val = timer->cnt_cval; 722 + break; 723 + 724 + case TIMER_REG_CNT: 725 + val = kvm_phys_timer_read() - timer->cntvoff; 726 + break; 727 + 728 + default: 729 + BUG(); 730 + } 731 + 732 + return val; 733 + } 734 + 735 + u64 kvm_arm_timer_read_sysreg(struct kvm_vcpu *vcpu, 736 + enum kvm_arch_timers tmr, 737 + enum kvm_arch_timer_regs treg) 738 + { 739 + u64 val; 740 + 741 + preempt_disable(); 742 + kvm_timer_vcpu_put(vcpu); 743 + 744 + val = kvm_arm_timer_read(vcpu, vcpu_get_timer(vcpu, tmr), treg); 745 + 746 + kvm_timer_vcpu_load(vcpu); 747 + preempt_enable(); 748 + 749 + return val; 750 + } 751 + 752 + static void kvm_arm_timer_write(struct kvm_vcpu *vcpu, 753 + struct arch_timer_context *timer, 754 + enum kvm_arch_timer_regs treg, 755 + u64 val) 756 + { 757 + switch (treg) { 758 + case TIMER_REG_TVAL: 759 + timer->cnt_cval = val - kvm_phys_timer_read() - timer->cntvoff; 760 + break; 761 + 762 + case TIMER_REG_CTL: 763 + timer->cnt_ctl = val & ~ARCH_TIMER_CTRL_IT_STAT; 764 + break; 765 + 766 + case TIMER_REG_CVAL: 767 + timer->cnt_cval = val; 768 + break; 769 + 770 + default: 771 + BUG(); 772 + } 773 + } 774 + 775 + void kvm_arm_timer_write_sysreg(struct kvm_vcpu *vcpu, 776 + enum kvm_arch_timers tmr, 777 + enum kvm_arch_timer_regs treg, 778 + u64 val) 779 + { 780 + preempt_disable(); 781 + kvm_timer_vcpu_put(vcpu); 782 + 783 + kvm_arm_timer_write(vcpu, vcpu_get_timer(vcpu, tmr), treg, val); 784 + 785 + kvm_timer_vcpu_load(vcpu); 786 + preempt_enable(); 802 787 } 803 788 804 789 static int kvm_timer_starting_cpu(unsigned 
int cpu) ··· 913 724 return -ENODEV; 914 725 } 915 726 727 + /* First, do the virtual EL1 timer irq */ 728 + 916 729 if (info->virtual_irq <= 0) { 917 730 kvm_err("kvm_arch_timer: invalid virtual timer IRQ: %d\n", 918 731 info->virtual_irq); ··· 925 734 host_vtimer_irq_flags = irq_get_trigger_type(host_vtimer_irq); 926 735 if (host_vtimer_irq_flags != IRQF_TRIGGER_HIGH && 927 736 host_vtimer_irq_flags != IRQF_TRIGGER_LOW) { 928 - kvm_err("Invalid trigger for IRQ%d, assuming level low\n", 737 + kvm_err("Invalid trigger for vtimer IRQ%d, assuming level low\n", 929 738 host_vtimer_irq); 930 739 host_vtimer_irq_flags = IRQF_TRIGGER_LOW; 931 740 } 932 741 933 742 err = request_percpu_irq(host_vtimer_irq, kvm_arch_timer_handler, 934 - "kvm guest timer", kvm_get_running_vcpus()); 743 + "kvm guest vtimer", kvm_get_running_vcpus()); 935 744 if (err) { 936 - kvm_err("kvm_arch_timer: can't request interrupt %d (%d)\n", 745 + kvm_err("kvm_arch_timer: can't request vtimer interrupt %d (%d)\n", 937 746 host_vtimer_irq, err); 938 747 return err; 939 748 } ··· 951 760 952 761 kvm_debug("virtual timer IRQ%d\n", host_vtimer_irq); 953 762 763 + /* Now let's do the physical EL1 timer irq */ 764 + 765 + if (info->physical_irq > 0) { 766 + host_ptimer_irq = info->physical_irq; 767 + host_ptimer_irq_flags = irq_get_trigger_type(host_ptimer_irq); 768 + if (host_ptimer_irq_flags != IRQF_TRIGGER_HIGH && 769 + host_ptimer_irq_flags != IRQF_TRIGGER_LOW) { 770 + kvm_err("Invalid trigger for ptimer IRQ%d, assuming level low\n", 771 + host_ptimer_irq); 772 + host_ptimer_irq_flags = IRQF_TRIGGER_LOW; 773 + } 774 + 775 + err = request_percpu_irq(host_ptimer_irq, kvm_arch_timer_handler, 776 + "kvm guest ptimer", kvm_get_running_vcpus()); 777 + if (err) { 778 + kvm_err("kvm_arch_timer: can't request ptimer interrupt %d (%d)\n", 779 + host_ptimer_irq, err); 780 + return err; 781 + } 782 + 783 + if (has_gic) { 784 + err = irq_set_vcpu_affinity(host_ptimer_irq, 785 + kvm_get_running_vcpus()); 786 + if 
(err) { 787 + kvm_err("kvm_arch_timer: error setting vcpu affinity\n"); 788 + goto out_free_irq; 789 + } 790 + } 791 + 792 + kvm_debug("physical timer IRQ%d\n", host_ptimer_irq); 793 + } else if (has_vhe()) { 794 + kvm_err("kvm_arch_timer: invalid physical timer IRQ: %d\n", 795 + info->physical_irq); 796 + err = -ENODEV; 797 + goto out_free_irq; 798 + } 799 + 954 800 cpuhp_setup_state(CPUHP_AP_KVM_ARM_TIMER_STARTING, 955 801 "kvm/arm/timer:starting", kvm_timer_starting_cpu, 956 802 kvm_timer_dying_cpu); ··· 999 771 1000 772 void kvm_timer_vcpu_terminate(struct kvm_vcpu *vcpu) 1001 773 { 1002 - struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 774 + struct arch_timer_cpu *timer = vcpu_timer(vcpu); 1003 775 1004 776 soft_timer_cancel(&timer->bg_timer); 1005 777 } ··· 1035 807 1036 808 if (vintid == vcpu_vtimer(vcpu)->irq.irq) 1037 809 timer = vcpu_vtimer(vcpu); 810 + else if (vintid == vcpu_ptimer(vcpu)->irq.irq) 811 + timer = vcpu_ptimer(vcpu); 1038 812 else 1039 - BUG(); /* We only map the vtimer so far */ 813 + BUG(); 1040 814 1041 815 return kvm_timer_should_fire(timer); 1042 816 } 1043 817 1044 818 int kvm_timer_enable(struct kvm_vcpu *vcpu) 1045 819 { 1046 - struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu; 1047 - struct arch_timer_context *vtimer = vcpu_vtimer(vcpu); 820 + struct arch_timer_cpu *timer = vcpu_timer(vcpu); 821 + struct timer_map map; 1048 822 int ret; 1049 823 1050 824 if (timer->enabled) ··· 1064 834 return -EINVAL; 1065 835 } 1066 836 1067 - ret = kvm_vgic_map_phys_irq(vcpu, host_vtimer_irq, vtimer->irq.irq, 837 + get_timer_map(vcpu, &map); 838 + 839 + ret = kvm_vgic_map_phys_irq(vcpu, 840 + map.direct_vtimer->host_timer_irq, 841 + map.direct_vtimer->irq.irq, 1068 842 kvm_arch_timer_get_input_level); 843 + if (ret) 844 + return ret; 845 + 846 + if (map.direct_ptimer) { 847 + ret = kvm_vgic_map_phys_irq(vcpu, 848 + map.direct_ptimer->host_timer_irq, 849 + map.direct_ptimer->irq.irq, 850 + kvm_arch_timer_get_input_level); 851 + } 852 + 
1069 853 if (ret) 1070 854 return ret; 1071 855 ··· 1089 845 } 1090 846 1091 847 /* 1092 - * On VHE system, we only need to configure trap on physical timer and counter 1093 - * accesses in EL0 and EL1 once, not for every world switch. 848 + * On VHE system, we only need to configure the EL2 timer trap register once, 849 + * not for every world switch. 1094 850 * The host kernel runs at EL2 with HCR_EL2.TGE == 1, 1095 851 * and this makes those bits have no effect for the host kernel execution. 1096 852 */ ··· 1101 857 u64 val; 1102 858 1103 859 /* 1104 - * Disallow physical timer access for the guest. 1105 - * Physical counter access is allowed. 860 + * VHE systems allow the guest direct access to the EL1 physical 861 + * timer/counter. 1106 862 */ 1107 863 val = read_sysreg(cnthctl_el2); 1108 - val &= ~(CNTHCTL_EL1PCEN << cnthctl_shift); 864 + val |= (CNTHCTL_EL1PCEN << cnthctl_shift); 1109 865 val |= (CNTHCTL_EL1PCTEN << cnthctl_shift); 1110 866 write_sysreg(val, cnthctl_el2); 1111 867 }

+23 -41

virt/kvm/arm/arm.c

··· 65 65 /* The VMID used in the VTTBR */ 66 66 static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); 67 67 static u32 kvm_next_vmid; 68 - static unsigned int kvm_vmid_bits __read_mostly; 69 68 static DEFINE_SPINLOCK(kvm_vmid_lock); 70 69 71 70 static bool vgic_present; ··· 141 142 kvm_vgic_early_init(kvm); 142 143 143 144 /* Mark the initial VMID generation invalid */ 144 - kvm->arch.vmid_gen = 0; 145 + kvm->arch.vmid.vmid_gen = 0; 145 146 146 147 /* The maximum number of VCPUs is limited by the host's GIC model */ 147 148 kvm->arch.max_vcpus = vgic_present ? ··· 335 336 336 337 void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) 337 338 { 338 - kvm_timer_schedule(vcpu); 339 339 kvm_vgic_v4_enable_doorbell(vcpu); 340 340 } 341 341 342 342 void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) 343 343 { 344 - kvm_timer_unschedule(vcpu); 345 344 kvm_vgic_v4_disable_doorbell(vcpu); 346 345 } 347 346 ··· 469 472 470 473 /** 471 474 * need_new_vmid_gen - check that the VMID is still valid 472 - * @kvm: The VM's VMID to check 475 + * @vmid: The VMID to check 473 476 * 474 477 * return true if there is a new generation of VMIDs being used 475 478 * 476 - * The hardware supports only 256 values with the value zero reserved for the 477 - * host, so we check if an assigned value belongs to a previous generation, 478 - * which which requires us to assign a new value. If we're the first to use a 479 - * VMID for the new generation, we must flush necessary caches and TLBs on all 480 - * CPUs. 479 + * The hardware supports a limited set of values with the value zero reserved 480 + * for the host, so we check if an assigned value belongs to a previous 481 + * generation, which which requires us to assign a new value. If we're the 482 + * first to use a VMID for the new generation, we must flush necessary caches 483 + * and TLBs on all CPUs. 
481 484 */ 482 - static bool need_new_vmid_gen(struct kvm *kvm) 485 + static bool need_new_vmid_gen(struct kvm_vmid *vmid) 483 486 { 484 487 u64 current_vmid_gen = atomic64_read(&kvm_vmid_gen); 485 488 smp_rmb(); /* Orders read of kvm_vmid_gen and kvm->arch.vmid */ 486 - return unlikely(READ_ONCE(kvm->arch.vmid_gen) != current_vmid_gen); 489 + return unlikely(READ_ONCE(vmid->vmid_gen) != current_vmid_gen); 487 490 } 488 491 489 492 /** 490 - * update_vttbr - Update the VTTBR with a valid VMID before the guest runs 491 - * @kvm The guest that we are about to run 492 - * 493 - * Called from kvm_arch_vcpu_ioctl_run before entering the guest to ensure the 494 - * VM has a valid VMID, otherwise assigns a new one and flushes corresponding 495 - * caches and TLBs. 493 + * update_vmid - Update the vmid with a valid VMID for the current generation 494 + * @kvm: The guest that struct vmid belongs to 495 + * @vmid: The stage-2 VMID information struct 496 496 */ 497 - static void update_vttbr(struct kvm *kvm) 497 + static void update_vmid(struct kvm_vmid *vmid) 498 498 { 499 - phys_addr_t pgd_phys; 500 - u64 vmid, cnp = kvm_cpu_has_cnp() ? VTTBR_CNP_BIT : 0; 501 - 502 - if (!need_new_vmid_gen(kvm)) 499 + if (!need_new_vmid_gen(vmid)) 503 500 return; 504 501 505 502 spin_lock(&kvm_vmid_lock); ··· 503 512 * already allocated a valid vmid for this vm, then this vcpu should 504 513 * use the same vmid. 
505 514 */ 506 - if (!need_new_vmid_gen(kvm)) { 515 + if (!need_new_vmid_gen(vmid)) { 507 516 spin_unlock(&kvm_vmid_lock); 508 517 return; 509 518 } ··· 527 536 kvm_call_hyp(__kvm_flush_vm_context); 528 537 } 529 538 530 - kvm->arch.vmid = kvm_next_vmid; 539 + vmid->vmid = kvm_next_vmid; 531 540 kvm_next_vmid++; 532 - kvm_next_vmid &= (1 << kvm_vmid_bits) - 1; 533 - 534 - /* update vttbr to be used with the new vmid */ 535 - pgd_phys = virt_to_phys(kvm->arch.pgd); 536 - BUG_ON(pgd_phys & ~kvm_vttbr_baddr_mask(kvm)); 537 - vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits); 538 - kvm->arch.vttbr = kvm_phys_to_vttbr(pgd_phys) | vmid | cnp; 541 + kvm_next_vmid &= (1 << kvm_get_vmid_bits()) - 1; 539 542 540 543 smp_wmb(); 541 - WRITE_ONCE(kvm->arch.vmid_gen, atomic64_read(&kvm_vmid_gen)); 544 + WRITE_ONCE(vmid->vmid_gen, atomic64_read(&kvm_vmid_gen)); 542 545 543 546 spin_unlock(&kvm_vmid_lock); 544 547 } ··· 685 700 */ 686 701 cond_resched(); 687 702 688 - update_vttbr(vcpu->kvm); 703 + update_vmid(&vcpu->kvm->arch.vmid); 689 704 690 705 check_vcpu_requests(vcpu); 691 706 ··· 734 749 */ 735 750 smp_store_mb(vcpu->mode, IN_GUEST_MODE); 736 751 737 - if (ret <= 0 || need_new_vmid_gen(vcpu->kvm) || 752 + if (ret <= 0 || need_new_vmid_gen(&vcpu->kvm->arch.vmid) || 738 753 kvm_request_pending(vcpu)) { 739 754 vcpu->mode = OUTSIDE_GUEST_MODE; 740 755 isb(); /* Ensure work in x_flush_hwstate is committed */ ··· 760 775 ret = kvm_vcpu_run_vhe(vcpu); 761 776 kvm_arm_vhe_guest_exit(); 762 777 } else { 763 - ret = kvm_call_hyp(__kvm_vcpu_run_nvhe, vcpu); 778 + ret = kvm_call_hyp_ret(__kvm_vcpu_run_nvhe, vcpu); 764 779 } 765 780 766 781 vcpu->mode = OUTSIDE_GUEST_MODE; ··· 1412 1427 1413 1428 static int init_common_resources(void) 1414 1429 { 1415 - /* set size of VMID supported by CPU */ 1416 - kvm_vmid_bits = kvm_get_vmid_bits(); 1417 - kvm_info("%d-bit VMID\n", kvm_vmid_bits); 1418 - 1419 1430 kvm_set_ipa_limit(); 1420 1431 1421 1432 return 0; 
··· 1552 1571 kvm_cpu_context_t *cpu_ctxt; 1553 1572 1554 1573 cpu_ctxt = per_cpu_ptr(&kvm_host_cpu_state, cpu); 1574 + kvm_init_host_cpu_context(cpu_ctxt, cpu); 1555 1575 err = create_hyp_mappings(cpu_ctxt, cpu_ctxt + 1, PAGE_HYP); 1556 1576 1557 1577 if (err) { ··· 1563 1581 1564 1582 err = hyp_map_aux_data(); 1565 1583 if (err) 1566 - kvm_err("Cannot map host auxilary data: %d\n", err); 1584 + kvm_err("Cannot map host auxiliary data: %d\n", err); 1567 1585 1568 1586 return 0; 1569 1587
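`need_new_vmid_gen()` and `update_vmid()` implement a generation-based VMID allocator. Each VM caches the generation its VMID was handed out in. When the VMID counter wraps, the global generation is bumped, so every previously assigned VMID becomes stale at once.

The sketch below shows that idea single-threaded. It assumes no concurrency (the kernel serialises this with `kvm_vmid_lock` and the `smp_rmb()`/`smp_wmb()` pairing). It also uses a deliberately tiny 2-bit VMID space so the rollover is easy to see. The rollover branch stands in for the part of `update_vmid()` that flushes caches and TLBs via `__kvm_flush_vm_context`. All names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-threaded stand-ins for kvm_vmid_gen / kvm_next_vmid. */
static uint64_t vmid_gen = 1;		/* generation 0 is never "current" */
static uint32_t next_vmid = 1;		/* VMID 0 is reserved for the host */
static unsigned int vmid_bits = 2;	/* 8 or 16 on real hardware; tiny to force a rollover */
static unsigned int rollovers;		/* where the real code would flush TLBs */

struct vmid {
	uint64_t vmid_gen;	/* 0 marks "never allocated" */
	uint32_t vmid;
};

static bool need_new_vmid_gen(const struct vmid *v)
{
	return v->vmid_gen != vmid_gen;
}

static void update_vmid(struct vmid *v)
{
	if (!need_new_vmid_gen(v))
		return;

	/* Counter wrapped: start a new generation, invalidating all old VMIDs. */
	if (next_vmid == 0) {
		vmid_gen++;
		next_vmid = 1;
		rollovers++;
	}

	v->vmid = next_vmid;
	next_vmid = (next_vmid + 1) & ((1u << vmid_bits) - 1);
	v->vmid_gen = vmid_gen;
}
```

With three usable VMIDs, a fourth VM triggers a new generation. Any VM still holding an old-generation VMID then reallocates on its next entry instead of reusing a value another VM may now own.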

+1 -1

virt/kvm/arm/hyp/vgic-v3-sr.c

··· 226 226 int i; 227 227 u32 elrsr; 228 228 229 - elrsr = read_gicreg(ICH_ELSR_EL2); 229 + elrsr = read_gicreg(ICH_ELRSR_EL2); 230 230 231 231 write_gicreg(cpu_if->vgic_hcr & ~ICH_HCR_EN, ICH_HCR_EL2); 232 232
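`ICH_ELRSR_EL2` (the Empty List Register Status Register) has one bit per list register. A bit is set when that LR holds no valid interrupt. On guest exit, only LRs whose bit is clear still carry state worth reading back. The hypothetical helper below computes that mask for the first `used_lrs` registers. It is a sketch of the bit arithmetic only, not the hyp save code itself.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bit i of elrsr is set when LR i is empty.
 * Returns the mask of in-use LRs (among the first used_lrs) that still
 * hold an interrupt and so must be saved.
 */
static uint32_t lrs_to_save(uint32_t elrsr, unsigned int used_lrs)
{
	uint32_t used_mask = (used_lrs >= 32) ? UINT32_MAX
					      : ((1u << used_lrs) - 1);

	return ~elrsr & used_mask;
}
```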

+9 -11

virt/kvm/arm/mmu.c

··· 908 908 */ 909 909 int kvm_alloc_stage2_pgd(struct kvm *kvm) 910 910 { 911 + phys_addr_t pgd_phys; 911 912 pgd_t *pgd; 912 913 913 914 if (kvm->arch.pgd != NULL) { ··· 921 920 if (!pgd) 922 921 return -ENOMEM; 923 922 923 + pgd_phys = virt_to_phys(pgd); 924 + if (WARN_ON(pgd_phys & ~kvm_vttbr_baddr_mask(kvm))) 925 + return -EINVAL; 926 + 924 927 kvm->arch.pgd = pgd; 928 + kvm->arch.pgd_phys = pgd_phys; 925 929 return 0; 926 930 } 927 931 ··· 1014 1008 unmap_stage2_range(kvm, 0, kvm_phys_size(kvm)); 1015 1009 pgd = READ_ONCE(kvm->arch.pgd); 1016 1010 kvm->arch.pgd = NULL; 1011 + kvm->arch.pgd_phys = 0; 1017 1012 } 1018 1013 spin_unlock(&kvm->mmu_lock); 1019 1014 ··· 1403 1396 return false; 1404 1397 } 1405 1398 1406 - static bool kvm_is_write_fault(struct kvm_vcpu *vcpu) 1407 - { 1408 - if (kvm_vcpu_trap_is_iabt(vcpu)) 1409 - return false; 1410 - 1411 - return kvm_vcpu_dabt_iswrite(vcpu); 1412 - } 1413 - 1414 1399 /** 1415 1400 * stage2_wp_ptes - write protect PMD range 1416 1401 * @pmd: pointer to pmd entry ··· 1597 1598 static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot, 1598 1599 unsigned long hva) 1599 1600 { 1600 - gpa_t gpa_start, gpa_end; 1601 + gpa_t gpa_start; 1601 1602 hva_t uaddr_start, uaddr_end; 1602 1603 size_t size; 1603 1604 1604 1605 size = memslot->npages * PAGE_SIZE; 1605 1606 1606 1607 gpa_start = memslot->base_gfn << PAGE_SHIFT; 1607 - gpa_end = gpa_start + size; 1608 1608 1609 1609 uaddr_start = memslot->userspace_addr; 1610 1610 uaddr_end = uaddr_start + size; ··· 2351 2353 return 0; 2352 2354 } 2353 2355 2354 - void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) 2356 + void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) 2355 2357 { 2356 2358 } 2357 2359
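The `fault_supports_stage2_pmd_mappings()` hunk only drops the unused `gpa_end`, but the function's job is worth spelling out. A fault may be mapped with a stage-2 block (PMD) entry only if two conditions hold:

- The memslot's IPA and userspace address share the same offset within a block.
- The whole block containing the faulting HVA lies inside the memslot.

A minimal sketch of those two checks follows. It assumes 4 KiB pages and 2 MiB blocks and uses a hypothetical trimmed-down memslot type.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define S2_PMD_SIZE	(1ULL << 21)		/* 2 MiB with 4 KiB pages (assumption) */
#define S2_PMD_MASK	(~(S2_PMD_SIZE - 1))

/* Hypothetical, trimmed-down memslot. */
struct memslot {
	uint64_t base_gfn;
	uint64_t npages;
	uint64_t userspace_addr;
};

static bool supports_pmd_mapping(const struct memslot *slot, uint64_t hva)
{
	uint64_t size = slot->npages << PAGE_SHIFT;
	uint64_t gpa_start = slot->base_gfn << PAGE_SHIFT;
	uint64_t uaddr_start = slot->userspace_addr;
	uint64_t uaddr_end = uaddr_start + size;

	/* IPA and HVA must sit at the same offset within a 2 MiB block... */
	if ((gpa_start & ~S2_PMD_MASK) != (uaddr_start & ~S2_PMD_MASK))
		return false;

	/* ...and the whole block containing hva must lie inside the memslot. */
	return (hva & S2_PMD_MASK) >= uaddr_start &&
	       (hva & S2_PMD_MASK) + S2_PMD_SIZE <= uaddr_end;
}
```

If the offsets differ, a block mapping would translate the IPA block onto the wrong host pages. That is why the first check rejects the slot outright rather than per fault.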

+106 -1

virt/kvm/arm/trace.h

··· 2 2 #if !defined(_TRACE_KVM_H) || defined(TRACE_HEADER_MULTI_READ) 3 3 #define _TRACE_KVM_H 4 4 5 + #include <kvm/arm_arch_timer.h> 5 6 #include <linux/tracepoint.h> 6 7 7 8 #undef TRACE_SYSTEM ··· 263 262 __entry->vcpu_id, __entry->irq, __entry->level) 264 263 ); 265 264 265 + TRACE_EVENT(kvm_get_timer_map, 266 + TP_PROTO(unsigned long vcpu_id, struct timer_map *map), 267 + TP_ARGS(vcpu_id, map), 268 + 269 + TP_STRUCT__entry( 270 + __field( unsigned long, vcpu_id ) 271 + __field( int, direct_vtimer ) 272 + __field( int, direct_ptimer ) 273 + __field( int, emul_ptimer ) 274 + ), 275 + 276 + TP_fast_assign( 277 + __entry->vcpu_id = vcpu_id; 278 + __entry->direct_vtimer = arch_timer_ctx_index(map->direct_vtimer); 279 + __entry->direct_ptimer = 280 + (map->direct_ptimer) ? arch_timer_ctx_index(map->direct_ptimer) : -1; 281 + __entry->emul_ptimer = 282 + (map->emul_ptimer) ? arch_timer_ctx_index(map->emul_ptimer) : -1; 283 + ), 284 + 285 + TP_printk("VCPU: %ld, dv: %d, dp: %d, ep: %d", 286 + __entry->vcpu_id, 287 + __entry->direct_vtimer, 288 + __entry->direct_ptimer, 289 + __entry->emul_ptimer) 290 + ); 291 + 292 + TRACE_EVENT(kvm_timer_save_state, 293 + TP_PROTO(struct arch_timer_context *ctx), 294 + TP_ARGS(ctx), 295 + 296 + TP_STRUCT__entry( 297 + __field( unsigned long, ctl ) 298 + __field( unsigned long long, cval ) 299 + __field( int, timer_idx ) 300 + ), 301 + 302 + TP_fast_assign( 303 + __entry->ctl = ctx->cnt_ctl; 304 + __entry->cval = ctx->cnt_cval; 305 + __entry->timer_idx = arch_timer_ctx_index(ctx); 306 + ), 307 + 308 + TP_printk(" CTL: %#08lx CVAL: %#16llx arch_timer_ctx_index: %d", 309 + __entry->ctl, 310 + __entry->cval, 311 + __entry->timer_idx) 312 + ); 313 + 314 + TRACE_EVENT(kvm_timer_restore_state, 315 + TP_PROTO(struct arch_timer_context *ctx), 316 + TP_ARGS(ctx), 317 + 318 + TP_STRUCT__entry( 319 + __field( unsigned long, ctl ) 320 + __field( unsigned long long, cval ) 321 + __field( int, timer_idx ) 322 + ), 323 + 324 + TP_fast_assign( 325 
+ __entry->ctl = ctx->cnt_ctl; 326 + __entry->cval = ctx->cnt_cval; 327 + __entry->timer_idx = arch_timer_ctx_index(ctx); 328 + ), 329 + 330 + TP_printk("CTL: %#08lx CVAL: %#16llx arch_timer_ctx_index: %d", 331 + __entry->ctl, 332 + __entry->cval, 333 + __entry->timer_idx) 334 + ); 335 + 336 + TRACE_EVENT(kvm_timer_hrtimer_expire, 337 + TP_PROTO(struct arch_timer_context *ctx), 338 + TP_ARGS(ctx), 339 + 340 + TP_STRUCT__entry( 341 + __field( int, timer_idx ) 342 + ), 343 + 344 + TP_fast_assign( 345 + __entry->timer_idx = arch_timer_ctx_index(ctx); 346 + ), 347 + 348 + TP_printk("arch_timer_ctx_index: %d", __entry->timer_idx) 349 + ); 350 + 351 + TRACE_EVENT(kvm_timer_emulate, 352 + TP_PROTO(struct arch_timer_context *ctx, bool should_fire), 353 + TP_ARGS(ctx, should_fire), 354 + 355 + TP_STRUCT__entry( 356 + __field( int, timer_idx ) 357 + __field( bool, should_fire ) 358 + ), 359 + 360 + TP_fast_assign( 361 + __entry->timer_idx = arch_timer_ctx_index(ctx); 362 + __entry->should_fire = should_fire; 363 + ), 364 + 365 + TP_printk("arch_timer_ctx_index: %d (should_fire: %d)", 366 + __entry->timer_idx, __entry->should_fire) 367 + ); 368 + 266 369 #endif /* _TRACE_KVM_H */ 267 370 268 371 #undef TRACE_INCLUDE_PATH 269 - #define TRACE_INCLUDE_PATH ../../../virt/kvm/arm 372 + #define TRACE_INCLUDE_PATH ../../virt/kvm/arm 270 373 #undef TRACE_INCLUDE_FILE 271 374 #define TRACE_INCLUDE_FILE trace 272 375
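The new `kvm_get_timer_map` tracepoint records which timer contexts are handled directly by hardware and which are emulated, printing -1 for an absent entry. The diff suggests how the map is chosen: VHE hosts grant the guest `CNTHCTL_EL1PCEN` and require a physical timer IRQ, while `direct_ptimer` is optional elsewhere. The map is therefore assumed here to put both EL1 timers in the direct slots on VHE, and to emulate the physical timer otherwise. The sketch below models that selection and the tracepoint's index encoding. The enum, struct and function names are hypothetical, and the index values are illustrative rather than the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical timer indices; the kernel's own ordering may differ. */
enum timer_idx { TIMER_VTIMER = 0, TIMER_PTIMER = 1, NR_TIMERS };

struct timer_ctx { enum timer_idx idx; };

struct timer_map {
	struct timer_ctx *direct_vtimer;
	struct timer_ctx *direct_ptimer;
	struct timer_ctx *emul_ptimer;
};

/*
 * Assumed policy: with VHE the guest accesses both EL1 timers directly;
 * without it, only the virtual timer is direct and the physical timer
 * is emulated (hrtimer-backed in the kernel).
 */
static void get_timer_map(struct timer_ctx timers[NR_TIMERS], bool vhe,
			  struct timer_map *map)
{
	map->direct_vtimer = &timers[TIMER_VTIMER];
	map->direct_ptimer = vhe ? &timers[TIMER_PTIMER] : NULL;
	map->emul_ptimer = vhe ? NULL : &timers[TIMER_PTIMER];
}

/* Mirrors the tracepoint's dv/dp/ep fields: index, or -1 when absent. */
static int trace_index(const struct timer_ctx *ctx)
{
	return ctx ? (int)ctx->idx : -1;
}
```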

+2 -2

virt/kvm/arm/vgic/vgic-v3.c

··· 589 589 */ 590 590 int vgic_v3_probe(const struct gic_kvm_info *info) 591 591 { 592 - u32 ich_vtr_el2 = kvm_call_hyp(__vgic_v3_get_ich_vtr_el2); 592 + u32 ich_vtr_el2 = kvm_call_hyp_ret(__vgic_v3_get_ich_vtr_el2); 593 593 int ret; 594 594 595 595 /* ··· 679 679 struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3; 680 680 681 681 if (likely(cpu_if->vgic_sre)) 682 - cpu_if->vgic_vmcr = kvm_call_hyp(__vgic_v3_read_vmcr); 682 + cpu_if->vgic_vmcr = kvm_call_hyp_ret(__vgic_v3_read_vmcr); 683 683 684 684 kvm_call_hyp(__vgic_v3_save_aprs, vcpu); 685 685

+2 -1

virt/kvm/coalesced_mmio.c

··· 144 144 if (zone->pio != 1 && zone->pio != 0) 145 145 return -EINVAL; 146 146 147 - dev = kzalloc(sizeof(struct kvm_coalesced_mmio_dev), GFP_KERNEL); 147 + dev = kzalloc(sizeof(struct kvm_coalesced_mmio_dev), 148 + GFP_KERNEL_ACCOUNT); 148 149 if (!dev) 149 150 return -ENOMEM; 150 151

+4 -3

virt/kvm/eventfd.c

··· 297 297 if (!kvm_arch_intc_initialized(kvm)) 298 298 return -EAGAIN; 299 299 300 - irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL); 300 + irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL_ACCOUNT); 301 301 if (!irqfd) 302 302 return -ENOMEM; 303 303 ··· 345 345 } 346 346 347 347 if (!irqfd->resampler) { 348 - resampler = kzalloc(sizeof(*resampler), GFP_KERNEL); 348 + resampler = kzalloc(sizeof(*resampler), 349 + GFP_KERNEL_ACCOUNT); 349 350 if (!resampler) { 350 351 ret = -ENOMEM; 351 352 mutex_unlock(&kvm->irqfds.resampler_lock); ··· 798 797 if (IS_ERR(eventfd)) 799 798 return PTR_ERR(eventfd); 800 799 801 - p = kzalloc(sizeof(*p), GFP_KERNEL); 800 + p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT); 802 801 if (!p) { 803 802 ret = -ENOMEM; 804 803 goto fail;

+2 -2

virt/kvm/irqchip.c

··· 196 196 nr_rt_entries += 1; 197 197 198 198 new = kzalloc(sizeof(*new) + (nr_rt_entries * sizeof(struct hlist_head)), 199 - GFP_KERNEL); 199 + GFP_KERNEL_ACCOUNT); 200 200 201 201 if (!new) 202 202 return -ENOMEM; ··· 208 208 209 209 for (i = 0; i < nr; ++i) { 210 210 r = -ENOMEM; 211 - e = kzalloc(sizeof(*e), GFP_KERNEL); 211 + e = kzalloc(sizeof(*e), GFP_KERNEL_ACCOUNT); 212 212 if (!e) 213 213 goto out; 214 214

+54 -49

virt/kvm/kvm_main.c

··· 81 81 module_param(halt_poll_ns_grow, uint, 0644); 82 82 EXPORT_SYMBOL_GPL(halt_poll_ns_grow); 83 83 84 + /* The start value to grow halt_poll_ns from */ 85 + unsigned int halt_poll_ns_grow_start = 10000; /* 10us */ 86 + module_param(halt_poll_ns_grow_start, uint, 0644); 87 + EXPORT_SYMBOL_GPL(halt_poll_ns_grow_start); 88 + 84 89 /* Default resets per-vcpu halt_poll_ns . */ 85 90 unsigned int halt_poll_ns_shrink; 86 91 module_param(halt_poll_ns_shrink, uint, 0644); ··· 530 525 int i; 531 526 struct kvm_memslots *slots; 532 527 533 - slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL); 528 + slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT); 534 529 if (!slots) 535 530 return NULL; 536 531 ··· 606 601 607 602 kvm->debugfs_stat_data = kcalloc(kvm_debugfs_num_entries, 608 603 sizeof(*kvm->debugfs_stat_data), 609 - GFP_KERNEL); 604 + GFP_KERNEL_ACCOUNT); 610 605 if (!kvm->debugfs_stat_data) 611 606 return -ENOMEM; 612 607 613 608 for (p = debugfs_entries; p->name; p++) { 614 - stat_data = kzalloc(sizeof(*stat_data), GFP_KERNEL); 609 + stat_data = kzalloc(sizeof(*stat_data), GFP_KERNEL_ACCOUNT); 615 610 if (!stat_data) 616 611 return -ENOMEM; 617 612 ··· 661 656 struct kvm_memslots *slots = kvm_alloc_memslots(); 662 657 if (!slots) 663 658 goto out_err_no_srcu; 664 - /* 665 - * Generations must be different for each address space. 666 - * Init kvm generation close to the maximum to easily test the 667 - * code of handling generation number wrap-around. 668 - */ 669 - slots->generation = i * 2 - 150; 659 + /* Generations must be different for each address space. 
*/ 660 + slots->generation = i; 670 661 rcu_assign_pointer(kvm->memslots[i], slots); 671 662 } 672 663 ··· 672 671 goto out_err_no_irq_srcu; 673 672 for (i = 0; i < KVM_NR_BUSES; i++) { 674 673 rcu_assign_pointer(kvm->buses[i], 675 - kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL)); 674 + kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL_ACCOUNT)); 676 675 if (!kvm->buses[i]) 677 676 goto out_err; 678 677 } ··· 790 789 { 791 790 unsigned long dirty_bytes = 2 * kvm_dirty_bitmap_bytes(memslot); 792 791 793 - memslot->dirty_bitmap = kvzalloc(dirty_bytes, GFP_KERNEL); 792 + memslot->dirty_bitmap = kvzalloc(dirty_bytes, GFP_KERNEL_ACCOUNT); 794 793 if (!memslot->dirty_bitmap) 795 794 return -ENOMEM; 796 795 ··· 875 874 int as_id, struct kvm_memslots *slots) 876 875 { 877 876 struct kvm_memslots *old_memslots = __kvm_memslots(kvm, as_id); 877 + u64 gen = old_memslots->generation; 878 878 879 - /* 880 - * Set the low bit in the generation, which disables SPTE caching 881 - * until the end of synchronize_srcu_expedited. 882 - */ 883 - WARN_ON(old_memslots->generation & 1); 884 - slots->generation = old_memslots->generation + 1; 879 + WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS); 880 + slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS; 885 881 886 882 rcu_assign_pointer(kvm->memslots[as_id], slots); 887 883 synchronize_srcu_expedited(&kvm->srcu); 888 884 889 885 /* 890 - * Increment the new memslot generation a second time. This prevents 891 - * vm exits that race with memslot updates from caching a memslot 892 - * generation that will (potentially) be valid forever. 893 - * 886 + * Increment the new memslot generation a second time, dropping the 887 + * update in-progress flag and incrementing then generation based on 888 + * the number of address spaces. This provides a unique and easily 889 + * identifiable generation number while the memslots are in flux. 
890 + */ 891 + gen = slots->generation & ~KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS; 892 + 893 + /* 894 894 * Generations must be unique even across address spaces. We do not need 895 895 * a global counter for that, instead the generation space is evenly split 896 896 * across address spaces. For example, with two address spaces, address 897 - * space 0 will use generations 0, 4, 8, ... while * address space 1 will 898 - * use generations 2, 6, 10, 14, ... 897 + * space 0 will use generations 0, 2, 4, ... while address space 1 will 898 + * use generations 1, 3, 5, ... 899 899 */ 900 - slots->generation += KVM_ADDRESS_SPACE_NUM * 2 - 1; 900 + gen += KVM_ADDRESS_SPACE_NUM; 901 901 902 - kvm_arch_memslots_updated(kvm, slots); 902 + kvm_arch_memslots_updated(kvm, gen); 903 + 904 + slots->generation = gen; 903 905 904 906 return old_memslots; 905 907 } ··· 1022 1018 goto out_free; 1023 1019 } 1024 1020 1025 - slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL); 1021 + slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT); 1026 1022 if (!slots) 1027 1023 goto out_free; 1028 1024 memcpy(slots, __kvm_memslots(kvm, as_id), sizeof(struct kvm_memslots)); ··· 1205 1201 mask = xchg(&dirty_bitmap[i], 0); 1206 1202 dirty_bitmap_buffer[i] = mask; 1207 1203 1208 - if (mask) { 1209 - offset = i * BITS_PER_LONG; 1210 - kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, 1211 - offset, mask); 1212 - } 1204 + offset = i * BITS_PER_LONG; 1205 + kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, 1206 + offset, mask); 1213 1207 } 1214 1208 spin_unlock(&kvm->mmu_lock); 1215 1209 } ··· 2187 2185 2188 2186 static void grow_halt_poll_ns(struct kvm_vcpu *vcpu) 2189 2187 { 2190 - unsigned int old, val, grow; 2188 + unsigned int old, val, grow, grow_start; 2191 2189 2192 2190 old = val = vcpu->halt_poll_ns; 2191 + grow_start = READ_ONCE(halt_poll_ns_grow_start); 2193 2192 grow = READ_ONCE(halt_poll_ns_grow); 2194 - /* 10us base */ 2195 - if (val == 0 && grow) 2196 - val = 10000; 
2197 - else 2198 - val *= grow; 2193 + if (!grow) 2194 + goto out; 2195 + 2196 + val *= grow; 2197 + if (val < grow_start) 2198 + val = grow_start; 2199 2199 2200 2200 if (val > halt_poll_ns) 2201 2201 val = halt_poll_ns; 2202 2202 2203 2203 vcpu->halt_poll_ns = val; 2204 + out: 2204 2205 trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old); 2205 2206 } 2206 2207 ··· 2688 2683 struct kvm_regs *kvm_regs; 2689 2684 2690 2685 r = -ENOMEM; 2691 - kvm_regs = kzalloc(sizeof(struct kvm_regs), GFP_KERNEL); 2686 + kvm_regs = kzalloc(sizeof(struct kvm_regs), GFP_KERNEL_ACCOUNT); 2692 2687 if (!kvm_regs) 2693 2688 goto out; 2694 2689 r = kvm_arch_vcpu_ioctl_get_regs(vcpu, kvm_regs); ··· 2716 2711 break; 2717 2712 } 2718 2713 case KVM_GET_SREGS: { 2719 - kvm_sregs = kzalloc(sizeof(struct kvm_sregs), GFP_KERNEL); 2714 + kvm_sregs = kzalloc(sizeof(struct kvm_sregs), 2715 + GFP_KERNEL_ACCOUNT); 2720 2716 r = -ENOMEM; 2721 2717 if (!kvm_sregs) 2722 2718 goto out; ··· 2809 2803 break; 2810 2804 } 2811 2805 case KVM_GET_FPU: { 2812 - fpu = kzalloc(sizeof(struct kvm_fpu), GFP_KERNEL); 2806 + fpu = kzalloc(sizeof(struct kvm_fpu), GFP_KERNEL_ACCOUNT); 2813 2807 r = -ENOMEM; 2814 2808 if (!fpu) 2815 2809 goto out; ··· 2986 2980 if (test) 2987 2981 return 0; 2988 2982 2989 - dev = kzalloc(sizeof(*dev), GFP_KERNEL); 2983 + dev = kzalloc(sizeof(*dev), GFP_KERNEL_ACCOUNT); 2990 2984 if (!dev) 2991 2985 return -ENOMEM; 2992 2986 ··· 3631 3625 r = __kvm_io_bus_write(vcpu, bus, &range, val); 3632 3626 return r < 0 ? 
r : 0; 3633 3627 } 3628 + EXPORT_SYMBOL_GPL(kvm_io_bus_write); 3634 3629 3635 3630 /* kvm_io_bus_write_cookie - called under kvm->slots_lock */ 3636 3631 int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, ··· 3682 3675 3683 3676 return -EOPNOTSUPP; 3684 3677 } 3685 - EXPORT_SYMBOL_GPL(kvm_io_bus_write); 3686 3678 3687 3679 /* kvm_io_bus_read - called under kvm->slots_lock */ 3688 3680 int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, ··· 3703 3697 return r < 0 ? r : 0; 3704 3698 } 3705 3699 3706 - 3707 3700 /* Caller must hold slots_lock. */ 3708 3701 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, 3709 3702 int len, struct kvm_io_device *dev) ··· 3719 3714 if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1) 3720 3715 return -ENOSPC; 3721 3716 3722 - new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count + 1) * 3723 - sizeof(struct kvm_io_range)), GFP_KERNEL); 3717 + new_bus = kmalloc(struct_size(bus, range, bus->dev_count + 1), 3718 + GFP_KERNEL_ACCOUNT); 3724 3719 if (!new_bus) 3725 3720 return -ENOMEM; 3726 3721 ··· 3765 3760 if (i == bus->dev_count) 3766 3761 return; 3767 3762 3768 - new_bus = kmalloc(sizeof(*bus) + ((bus->dev_count - 1) * 3769 - sizeof(struct kvm_io_range)), GFP_KERNEL); 3763 + new_bus = kmalloc(struct_size(bus, range, bus->dev_count - 1), 3764 + GFP_KERNEL_ACCOUNT); 3770 3765 if (!new_bus) { 3771 3766 pr_err("kvm: failed to shrink bus, removing it completely\n"); 3772 3767 goto broken; ··· 4034 4029 active = kvm_active_vms; 4035 4030 spin_unlock(&kvm_lock); 4036 4031 4037 - env = kzalloc(sizeof(*env), GFP_KERNEL); 4032 + env = kzalloc(sizeof(*env), GFP_KERNEL_ACCOUNT); 4038 4033 if (!env) 4039 4034 return; 4040 4035 ··· 4050 4045 add_uevent_var(env, "PID=%d", kvm->userspace_pid); 4051 4046 4052 4047 if (!IS_ERR_OR_NULL(kvm->debugfs_dentry)) { 4053 - char *tmp, *p = kmalloc(PATH_MAX, GFP_KERNEL); 4048 + char *tmp, *p = kmalloc(PATH_MAX, GFP_KERNEL_ACCOUNT); 
4054 4049 4055 4050 if (p) { 4056 4051 tmp = dentry_path_raw(kvm->debugfs_dentry, p, PATH_MAX);

+2 -2

virt/kvm/vfio.c

··· 219 219 } 220 220 } 221 221 222 - kvg = kzalloc(sizeof(*kvg), GFP_KERNEL); 222 + kvg = kzalloc(sizeof(*kvg), GFP_KERNEL_ACCOUNT); 223 223 if (!kvg) { 224 224 mutex_unlock(&kv->lock); 225 225 kvm_vfio_group_put_external_user(vfio_group); ··· 405 405 if (tmp->ops == &kvm_vfio_ops) 406 406 return -EBUSY; 407 407 408 - kv = kzalloc(sizeof(*kv), GFP_KERNEL); 408 + kv = kzalloc(sizeof(*kv), GFP_KERNEL_ACCOUNT); 409 409 if (!kv) 410 410 return -ENOMEM; 411 411
