Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
"A collection of x86 and ARM bugfixes, and some improvements to
documentation.

On top of this, a cleanup of kvm_para.h headers, which were exported
by some architectures even though they do not support KVM at all. This is
responsible for all the Kbuild changes in the diffstat"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGION
KVM: doc: Document the life cycle of a VM and its resources
KVM: selftests: complete IO before migrating guest state
KVM: selftests: disable stack protector for all KVM tests
KVM: selftests: explicitly disable PIE for tests
KVM: selftests: assert on exit reason in CR4/cpuid sync test
KVM: x86: update %rip after emulating IO
x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init
kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs
KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
kvm: don't redefine flags as something else
kvm: mmu: Used range based flushing in slot_handle_level_range
KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported
KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region()
kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields
KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)
KVM: Reject device ioctls from processes other than the VM's creator
KVM: doc: Fix incorrect word ordering regarding supported use of APIs
KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'
KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT
...

+409 -201
+53 -24
Documentation/virtual/kvm/api.txt
··· 5 5 ---------------------- 6 6 7 7 The kvm API is a set of ioctls that are issued to control various aspects 8 - of a virtual machine. The ioctls belong to three classes 8 + of a virtual machine. The ioctls belong to three classes: 9 9 10 10 - System ioctls: These query and set global attributes which affect the 11 11 whole kvm subsystem. In addition a system ioctl is used to create 12 - virtual machines 12 + virtual machines. 13 13 14 14 - VM ioctls: These query and set attributes that affect an entire virtual 15 15 machine, for example memory layout. In addition a VM ioctl is used to 16 - create virtual cpus (vcpus). 16 + create virtual cpus (vcpus) and devices. 17 17 18 - Only run VM ioctls from the same process (address space) that was used 19 - to create the VM. 18 + VM ioctls must be issued from the same process (address space) that was 19 + used to create the VM. 20 20 21 21 - vcpu ioctls: These query and set attributes that control the operation 22 22 of a single virtual cpu. 23 23 24 - Only run vcpu ioctls from the same thread that was used to create the 25 - vcpu. 24 + vcpu ioctls should be issued from the same thread that was used to create 25 + the vcpu, except for asynchronous vcpu ioctl that are marked as such in 26 + the documentation. Otherwise, the first ioctl after switching threads 27 + could see a performance impact. 26 28 29 + - device ioctls: These query and set attributes that control the operation 30 + of a single device. 31 + 32 + device ioctls must be issued from the same process (address space) that 33 + was used to create the VM. 27 34 28 35 2. File descriptors 29 36 ------------------- ··· 39 32 open("/dev/kvm") obtains a handle to the kvm subsystem; this handle 40 33 can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this 41 34 handle will create a VM file descriptor which can be used to issue VM 42 - ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu 43 - and return a file descriptor pointing to it. Finally, ioctls on a vcpu 44 - fd can be used to control the vcpu, including the important task of 45 - actually running guest code. 35 + ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will 36 + create a virtual cpu or device and return a file descriptor pointing to 37 + the new resource. Finally, ioctls on a vcpu or device fd can be used 38 + to control the vcpu or device. For vcpus, this includes the important 39 + task of actually running guest code. 46 40 47 41 In general file descriptors can be migrated among processes by means 48 42 of fork() and the SCM_RIGHTS facility of unix domain socket. These 49 43 kinds of tricks are explicitly not supported by kvm. While they will 50 44 not cause harm to the host, their actual behavior is not guaranteed by 51 - the API. The only supported use is one virtual machine per process, 52 - and one vcpu per thread. 45 + the API. See "General description" for details on the ioctl usage 46 + model that is supported by KVM. 47 + 48 + It is important to note that althought VM ioctls may only be issued from 49 + the process that created the VM, a VM's lifecycle is associated with its 50 + file descriptor, not its creator (process). In other words, the VM and 51 + its resources, *including the associated address space*, are not freed 52 + until the last reference to the VM's file descriptor has been released. 
53 + For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM will 54 + not be freed until both the parent (original) process and its child have 55 + put their references to the VM's file descriptor. 56 + 57 + Because a VM's resources are not freed until the last reference to its 58 + file descriptor is released, creating additional references to a VM via 59 + via fork(), dup(), etc... without careful consideration is strongly 60 + discouraged and may have unwanted side effects, e.g. memory allocated 61 + by and on behalf of the VM's process may not be freed/unaccounted when 62 + the VM is shut down. 53 63 54 64 55 65 It is important to note that althought VM ioctls may only be issued from ··· 539 515 Note that any value for 'irq' other than the ones stated above is invalid 540 516 and incurs unexpected behavior. 541 517 518 + This is an asynchronous vcpu ioctl and can be invoked from any thread. 519 + 542 520 MIPS: 543 521 544 522 Queues an external interrupt to be injected into the virtual CPU. A negative 545 523 interrupt number dequeues the interrupt. 524 + 525 + This is an asynchronous vcpu ioctl and can be invoked from any thread. 546 526 547 527 548 528 4.17 KVM_DEBUG_GUEST ··· 1114 1086 #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) 1115 1087 #define KVM_MEM_READONLY (1UL << 1) 1116 1088 1117 - This ioctl allows the user to create or modify a guest physical memory 1118 - slot. When changing an existing slot, it may be moved in the guest 1119 - physical memory space, or its flags may be modified. It may not be 1120 - resized. Slots may not overlap in guest physical address space. 1121 - Bits 0-15 of "slot" specifies the slot id and this value should be 1122 - less than the maximum number of user memory slots supported per VM. 1123 - The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS, 1124 - if this capability is supported by the architecture. 1089 + This ioctl allows the user to create, modify or delete a guest physical 1090 + memory slot. Bits 0-15 of "slot" specify the slot id and this value 1091 + should be less than the maximum number of user memory slots supported per 1092 + VM. The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS, 1093 + if this capability is supported by the architecture. Slots may not 1094 + overlap in guest physical address space. 1125 1095 1126 1096 If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of "slot" 1127 1097 specifies the address space which is being modified. They must be ··· 1127 1101 KVM_CAP_MULTI_ADDRESS_SPACE capability. Slots in separate address spaces 1128 1102 are unrelated; the restriction on overlapping slots only applies within 1129 1103 each address space. 1104 + 1105 + Deleting a slot is done by passing zero for memory_size. When changing 1106 + an existing slot, it may be moved in the guest physical memory space, 1107 + or its flags may be modified, but it may not be resized. 1130 1108 1131 1109 Memory for the region is taken starting at the address denoted by the 1132 1110 field userspace_addr, which must point at user addressable memory for ··· 2523 2493 machine checks needing further payload are not 2524 2494 supported by this ioctl) 2525 2495 2526 - Note that the vcpu ioctl is asynchronous to vcpu execution. 2496 + This is an asynchronous vcpu ioctl and can be invoked from any thread. 
2527 2497 2528 2498 4.78 KVM_PPC_GET_HTAB_FD 2529 2499 ··· 3072 3042 KVM_S390_INT_EXTERNAL_CALL - sigp external call; parameters in .extcall 3073 3043 KVM_S390_MCHK - machine check interrupt; parameters in .mchk 3074 3044 3075 - 3076 - Note that the vcpu ioctl is asynchronous to vcpu execution. 3045 + This is an asynchronous vcpu ioctl and can be invoked from any thread. 3077 3046 3078 3047 4.94 KVM_S390_GET_IRQ_STATE 3079 3048
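
The api.txt changes above describe the KVM file-descriptor hierarchy (system, VM, vcpu and device ioctls), the rule that VM and device ioctls must be issued from the process that created the VM, and the new note that a memory slot is deleted by passing zero for memory_size. The following is a minimal userspace sketch of that flow, assuming a Linux host exposing /dev/kvm and <linux/kvm.h>; the slot number, guest physical address and sizes are illustrative only, and error handling is omitted:

#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* System ioctls are issued on the /dev/kvm handle. */
	int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);

	/* KVM_CREATE_VM returns a VM fd; VM ioctls must come from this process. */
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);

	/* Create a guest physical memory slot (deleting it later would be done
	 * by re-issuing this ioctl with memory_size set to zero). */
	void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.guest_phys_addr = 0x0,
		.memory_size = 0x1000,
		.userspace_addr = (unsigned long)mem,
	};
	ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

	/* KVM_CREATE_VCPU returns a vcpu fd; vcpu ioctls such as KVM_RUN should
	 * be issued from the thread that created the vcpu (reading exit
	 * information would additionally require mmap()ing the vcpu fd). */
	int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
	ioctl(vcpu, KVM_RUN, 0);

	/* The VM and its resources are only freed once the last reference to
	 * the VM fd, including any inherited across fork(), is released. */
	close(vcpu);
	close(vm);
	close(kvm);
	return 0;
}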
+7 -4
Documentation/virtual/kvm/mmu.txt
··· 142 142 If clear, this page corresponds to a guest page table denoted by the gfn 143 143 field. 144 144 role.quadrant: 145 - When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit 145 + When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit 146 146 sptes. That means a guest page table contains more ptes than the host, 147 147 so multiple shadow pages are needed to shadow one guest page. 148 148 For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the ··· 158 158 The page is invalid and should not be used. It is a root page that is 159 159 currently pinned (by a cpu hardware register pointing to it); once it is 160 160 unpinned it will be destroyed. 161 - role.cr4_pae: 162 - Contains the value of cr4.pae for which the page is valid (e.g. whether 163 - 32-bit or 64-bit gptes are in use). 161 + role.gpte_is_8_bytes: 162 + Reflects the size of the guest PTE for which the page is valid, i.e. '1' 163 + if 64-bit gptes are in use, '0' if 32-bit gptes are in use. 164 164 role.nxe: 165 165 Contains the value of efer.nxe for which the page is valid. 166 166 role.cr0_wp: ··· 173 173 Contains the value of cr4.smap && !cr0.wp for which the page is valid 174 174 (pages for which this is true are different from other pages; see the 175 175 treatment of cr0.wp=0 below). 176 + role.ept_sp: 177 + This is a virtual flag to denote a shadowed nested EPT page. ept_sp 178 + is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination. 176 179 role.smm: 177 180 Is 1 if the page is valid in system management mode. This field 178 181 determines which of the kvm_memslots array was used to build this
+1
arch/alpha/include/asm/Kbuild
··· 6 6 generic-y += export.h 7 7 generic-y += fb.h 8 8 generic-y += irq_work.h 9 + generic-y += kvm_para.h 9 10 generic-y += mcs_spinlock.h 10 11 generic-y += mm-arch-hooks.h 11 12 generic-y += preempt.h
-2
arch/alpha/include/uapi/asm/kvm_para.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ 2 - #include <asm-generic/kvm_para.h>
+1
arch/arc/include/asm/Kbuild
··· 11 11 generic-y += hw_irq.h 12 12 generic-y += irq_regs.h 13 13 generic-y += irq_work.h 14 + generic-y += kvm_para.h 14 15 generic-y += local.h 15 16 generic-y += local64.h 16 17 generic-y += mcs_spinlock.h
-1
arch/arc/include/uapi/asm/Kbuild
··· 1 - generic-y += kvm_para.h 2 1 generic-y += ucontext.h
+11
arch/arm/include/asm/kvm_mmu.h
··· 381 381 return ret; 382 382 } 383 383 384 + static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa, 385 + const void *data, unsigned long len) 386 + { 387 + int srcu_idx = srcu_read_lock(&kvm->srcu); 388 + int ret = kvm_write_guest(kvm, gpa, data, len); 389 + 390 + srcu_read_unlock(&kvm->srcu, srcu_idx); 391 + 392 + return ret; 393 + } 394 + 384 395 static inline void *kvm_get_hyp_vector(void) 385 396 { 386 397 switch(read_cpuid_part()) {
+2
arch/arm/include/asm/stage2_pgtable.h
··· 75 75 76 76 #define S2_PMD_MASK PMD_MASK 77 77 #define S2_PMD_SIZE PMD_SIZE 78 + #define S2_PUD_MASK PUD_MASK 79 + #define S2_PUD_SIZE PUD_SIZE 78 80 79 81 static inline bool kvm_stage2_has_pmd(struct kvm *kvm) 80 82 {
+1
arch/arm/include/uapi/asm/Kbuild
··· 3 3 generated-y += unistd-common.h 4 4 generated-y += unistd-oabi.h 5 5 generated-y += unistd-eabi.h 6 + generic-y += kvm_para.h
-2
arch/arm/include/uapi/asm/kvm_para.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ 2 - #include <asm-generic/kvm_para.h>
+11
arch/arm64/include/asm/kvm_mmu.h
··· 445 445 return ret; 446 446 } 447 447 448 + static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa, 449 + const void *data, unsigned long len) 450 + { 451 + int srcu_idx = srcu_read_lock(&kvm->srcu); 452 + int ret = kvm_write_guest(kvm, gpa, data, len); 453 + 454 + srcu_read_unlock(&kvm->srcu, srcu_idx); 455 + 456 + return ret; 457 + } 458 + 448 459 #ifdef CONFIG_KVM_INDIRECT_VECTORS 449 460 /* 450 461 * EL2 vectors can be mapped and rerouted in a number of ways,
+3 -3
arch/arm64/kvm/reset.c
··· 123 123 int ret = -EINVAL; 124 124 bool loaded; 125 125 126 + /* Reset PMU outside of the non-preemptible section */ 127 + kvm_pmu_vcpu_reset(vcpu); 128 + 126 129 preempt_disable(); 127 130 loaded = (vcpu->cpu != -1); 128 131 if (loaded) ··· 172 169 173 170 vcpu->arch.reset_state.reset = false; 174 171 } 175 - 176 - /* Reset PMU */ 177 - kvm_pmu_vcpu_reset(vcpu); 178 172 179 173 /* Default workaround setup is enabled (if supported) */ 180 174 if (kvm_arm_have_ssbd() == KVM_SSBD_KERNEL)
+1
arch/c6x/include/asm/Kbuild
··· 19 19 generic-y += kdebug.h 20 20 generic-y += kmap_types.h 21 21 generic-y += kprobes.h 22 + generic-y += kvm_para.h 22 23 generic-y += local.h 23 24 generic-y += mcs_spinlock.h 24 25 generic-y += mm-arch-hooks.h
-1
arch/c6x/include/uapi/asm/Kbuild
··· 1 - generic-y += kvm_para.h 2 1 generic-y += ucontext.h
+1
arch/h8300/include/asm/Kbuild
··· 23 23 generic-y += kdebug.h 24 24 generic-y += kmap_types.h 25 25 generic-y += kprobes.h 26 + generic-y += kvm_para.h 26 27 generic-y += linkage.h 27 28 generic-y += local.h 28 29 generic-y += local64.h
-1
arch/h8300/include/uapi/asm/Kbuild
··· 1 - generic-y += kvm_para.h 2 1 generic-y += ucontext.h
+1
arch/hexagon/include/asm/Kbuild
··· 19 19 generic-y += kdebug.h 20 20 generic-y += kmap_types.h 21 21 generic-y += kprobes.h 22 + generic-y += kvm_para.h 22 23 generic-y += local.h 23 24 generic-y += local64.h 24 25 generic-y += mcs_spinlock.h
-2
arch/hexagon/include/uapi/asm/kvm_para.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ 2 - #include <asm-generic/kvm_para.h>
+1
arch/ia64/include/asm/Kbuild
··· 2 2 generic-y += compat.h 3 3 generic-y += exec.h 4 4 generic-y += irq_work.h 5 + generic-y += kvm_para.h 5 6 generic-y += mcs_spinlock.h 6 7 generic-y += mm-arch-hooks.h 7 8 generic-y += preempt.h
-1
arch/ia64/include/uapi/asm/Kbuild
··· 1 1 generated-y += unistd_64.h 2 - generic-y += kvm_para.h
+1
arch/m68k/include/asm/Kbuild
··· 13 13 generic-y += kdebug.h 14 14 generic-y += kmap_types.h 15 15 generic-y += kprobes.h 16 + generic-y += kvm_para.h 16 17 generic-y += local.h 17 18 generic-y += local64.h 18 19 generic-y += mcs_spinlock.h
-1
arch/m68k/include/uapi/asm/Kbuild
··· 1 1 generated-y += unistd_32.h 2 - generic-y += kvm_para.h
+1
arch/microblaze/include/asm/Kbuild
··· 17 17 generic-y += kdebug.h 18 18 generic-y += kmap_types.h 19 19 generic-y += kprobes.h 20 + generic-y += kvm_para.h 20 21 generic-y += linkage.h 21 22 generic-y += local.h 22 23 generic-y += local64.h
-1
arch/microblaze/include/uapi/asm/Kbuild
··· 1 1 generated-y += unistd_32.h 2 - generic-y += kvm_para.h 3 2 generic-y += ucontext.h
+1
arch/nios2/include/asm/Kbuild
··· 23 23 generic-y += kdebug.h 24 24 generic-y += kmap_types.h 25 25 generic-y += kprobes.h 26 + generic-y += kvm_para.h 26 27 generic-y += local.h 27 28 generic-y += mcs_spinlock.h 28 29 generic-y += mm-arch-hooks.h
-1
arch/nios2/include/uapi/asm/Kbuild
··· 1 - generic-y += kvm_para.h 2 1 generic-y += ucontext.h
+1
arch/openrisc/include/asm/Kbuild
··· 20 20 generic-y += kdebug.h 21 21 generic-y += kmap_types.h 22 22 generic-y += kprobes.h 23 + generic-y += kvm_para.h 23 24 generic-y += local.h 24 25 generic-y += mcs_spinlock.h 25 26 generic-y += mm-arch-hooks.h
-1
arch/openrisc/include/uapi/asm/Kbuild
··· 1 - generic-y += kvm_para.h 2 1 generic-y += ucontext.h
+1
arch/parisc/include/asm/Kbuild
··· 11 11 generic-y += irq_work.h 12 12 generic-y += kdebug.h 13 13 generic-y += kprobes.h 14 + generic-y += kvm_para.h 14 15 generic-y += local.h 15 16 generic-y += local64.h 16 17 generic-y += mcs_spinlock.h
-1
arch/parisc/include/uapi/asm/Kbuild
··· 1 1 generated-y += unistd_32.h 2 2 generated-y += unistd_64.h 3 - generic-y += kvm_para.h
+1
arch/sh/include/asm/Kbuild
··· 9 9 generic-y += exec.h 10 10 generic-y += irq_regs.h 11 11 generic-y += irq_work.h 12 + generic-y += kvm_para.h 12 13 generic-y += local.h 13 14 generic-y += local64.h 14 15 generic-y += mcs_spinlock.h
-1
arch/sh/include/uapi/asm/Kbuild
··· 1 1 # SPDX-License-Identifier: GPL-2.0 2 2 3 3 generated-y += unistd_32.h 4 - generic-y += kvm_para.h 5 4 generic-y += ucontext.h
+1
arch/sparc/include/asm/Kbuild
··· 9 9 generic-y += export.h 10 10 generic-y += irq_regs.h 11 11 generic-y += irq_work.h 12 + generic-y += kvm_para.h 12 13 generic-y += linkage.h 13 14 generic-y += local.h 14 15 generic-y += local64.h
-2
arch/sparc/include/uapi/asm/kvm_para.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ 2 - #include <asm-generic/kvm_para.h>
+1
arch/unicore32/include/asm/Kbuild
··· 18 18 generic-y += kdebug.h 19 19 generic-y += kmap_types.h 20 20 generic-y += kprobes.h 21 + generic-y += kvm_para.h 21 22 generic-y += local.h 22 23 generic-y += mcs_spinlock.h 23 24 generic-y += mm-arch-hooks.h
-1
arch/unicore32/include/uapi/asm/Kbuild
··· 1 - generic-y += kvm_para.h 2 1 generic-y += ucontext.h
+7 -3
arch/x86/include/asm/kvm_host.h
··· 253 253 * kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used 254 254 * by indirect shadow page can not be more than 15 bits. 255 255 * 256 - * Currently, we used 14 bits that are @level, @cr4_pae, @quadrant, @access, 256 + * Currently, we used 14 bits that are @level, @gpte_is_8_bytes, @quadrant, @access, 257 257 * @nxe, @cr0_wp, @smep_andnot_wp and @smap_andnot_wp. 258 258 */ 259 259 union kvm_mmu_page_role { 260 260 u32 word; 261 261 struct { 262 262 unsigned level:4; 263 - unsigned cr4_pae:1; 263 + unsigned gpte_is_8_bytes:1; 264 264 unsigned quadrant:2; 265 265 unsigned direct:1; 266 266 unsigned access:3; ··· 350 350 }; 351 351 352 352 struct kvm_pio_request { 353 + unsigned long linear_rip; 353 354 unsigned long count; 354 355 int in; 355 356 int port; ··· 569 568 bool tpr_access_reporting; 570 569 u64 ia32_xss; 571 570 u64 microcode_version; 571 + u64 arch_capabilities; 572 572 573 573 /* 574 574 * Paging state of the vcpu ··· 1194 1192 int (*nested_enable_evmcs)(struct kvm_vcpu *vcpu, 1195 1193 uint16_t *vmcs_version); 1196 1194 uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu); 1195 + 1196 + bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu); 1197 1197 }; 1198 1198 1199 1199 struct kvm_arch_async_pf { ··· 1256 1252 gfn_t gfn_offset, unsigned long mask); 1257 1253 void kvm_mmu_zap_all(struct kvm *kvm); 1258 1254 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen); 1259 - unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm); 1255 + unsigned int kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm); 1260 1256 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages); 1261 1257 1262 1258 int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
+7 -2
arch/x86/kvm/hyperv.c
··· 526 526 new_config.enable = 0; 527 527 stimer->config.as_uint64 = new_config.as_uint64; 528 528 529 - stimer_mark_pending(stimer, false); 529 + if (stimer->config.enable) 530 + stimer_mark_pending(stimer, false); 531 + 530 532 return 0; 531 533 } 532 534 ··· 544 542 stimer->config.enable = 0; 545 543 else if (stimer->config.auto_enable) 546 544 stimer->config.enable = 1; 547 - stimer_mark_pending(stimer, false); 545 + 546 + if (stimer->config.enable) 547 + stimer_mark_pending(stimer, false); 548 + 548 549 return 0; 549 550 } 550 551
+37 -17
arch/x86/kvm/mmu.c
··· 182 182 183 183 static const union kvm_mmu_page_role mmu_base_role_mask = { 184 184 .cr0_wp = 1, 185 - .cr4_pae = 1, 185 + .gpte_is_8_bytes = 1, 186 186 .nxe = 1, 187 187 .smep_andnot_wp = 1, 188 188 .smap_andnot_wp = 1, ··· 2205 2205 static void kvm_mmu_commit_zap_page(struct kvm *kvm, 2206 2206 struct list_head *invalid_list); 2207 2207 2208 + 2208 2209 #define for_each_valid_sp(_kvm, _sp, _gfn) \ 2209 2210 hlist_for_each_entry(_sp, \ 2210 2211 &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \ ··· 2216 2215 for_each_valid_sp(_kvm, _sp, _gfn) \ 2217 2216 if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else 2218 2217 2218 + static inline bool is_ept_sp(struct kvm_mmu_page *sp) 2219 + { 2220 + return sp->role.cr0_wp && sp->role.smap_andnot_wp; 2221 + } 2222 + 2219 2223 /* @sp->gfn should be write-protected at the call site */ 2220 2224 static bool __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, 2221 2225 struct list_head *invalid_list) 2222 2226 { 2223 - if (sp->role.cr4_pae != !!is_pae(vcpu) 2224 - || vcpu->arch.mmu->sync_page(vcpu, sp) == 0) { 2227 + if ((!is_ept_sp(sp) && sp->role.gpte_is_8_bytes != !!is_pae(vcpu)) || 2228 + vcpu->arch.mmu->sync_page(vcpu, sp) == 0) { 2225 2229 kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list); 2226 2230 return false; 2227 2231 } ··· 2429 2423 role.level = level; 2430 2424 role.direct = direct; 2431 2425 if (role.direct) 2432 - role.cr4_pae = 0; 2426 + role.gpte_is_8_bytes = true; 2433 2427 role.access = access; 2434 2428 if (!vcpu->arch.mmu->direct_map 2435 2429 && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) { ··· 4800 4794 4801 4795 role.base.access = ACC_ALL; 4802 4796 role.base.nxe = !!is_nx(vcpu); 4803 - role.base.cr4_pae = !!is_pae(vcpu); 4804 4797 role.base.cr0_wp = is_write_protection(vcpu); 4805 4798 role.base.smm = is_smm(vcpu); 4806 4799 role.base.guest_mode = is_guest_mode(vcpu); ··· 4820 4815 role.base.ad_disabled = (shadow_accessed_mask == 0); 4821 4816 role.base.level = kvm_x86_ops->get_tdp_level(vcpu); 4822 4817 role.base.direct = true; 4818 + role.base.gpte_is_8_bytes = true; 4823 4819 4824 4820 return role; 4825 4821 } ··· 4885 4879 role.base.smap_andnot_wp = role.ext.cr4_smap && 4886 4880 !is_write_protection(vcpu); 4887 4881 role.base.direct = !is_paging(vcpu); 4882 + role.base.gpte_is_8_bytes = !!is_pae(vcpu); 4888 4883 4889 4884 if (!is_long_mode(vcpu)) 4890 4885 role.base.level = PT32E_ROOT_LEVEL; ··· 4925 4918 kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty, 4926 4919 bool execonly) 4927 4920 { 4928 - union kvm_mmu_role role; 4921 + union kvm_mmu_role role = {0}; 4929 4922 4930 - /* Base role is inherited from root_mmu */ 4931 - role.base.word = vcpu->arch.root_mmu.mmu_role.base.word; 4932 - role.ext = kvm_calc_mmu_role_ext(vcpu); 4923 + /* SMM flag is inherited from root_mmu */ 4924 + role.base.smm = vcpu->arch.root_mmu.mmu_role.base.smm; 4933 4925 4934 4926 role.base.level = PT64_ROOT_4LEVEL; 4927 + role.base.gpte_is_8_bytes = true; 4935 4928 role.base.direct = false; 4936 4929 role.base.ad_disabled = !accessed_dirty; 4937 4930 role.base.guest_mode = true; 4938 4931 role.base.access = ACC_ALL; 4939 4932 4933 + /* 4934 + * WP=1 and NOT_WP=1 is an impossible combination, use WP and the 4935 + * SMAP variation to denote shadow EPT entries. 
4936 + */ 4937 + role.base.cr0_wp = true; 4938 + role.base.smap_andnot_wp = true; 4939 + 4940 + role.ext = kvm_calc_mmu_role_ext(vcpu); 4940 4941 role.ext.execonly = execonly; 4941 4942 4942 4943 return role; ··· 5194 5179 gpa, bytes, sp->role.word); 5195 5180 5196 5181 offset = offset_in_page(gpa); 5197 - pte_size = sp->role.cr4_pae ? 8 : 4; 5182 + pte_size = sp->role.gpte_is_8_bytes ? 8 : 4; 5198 5183 5199 5184 /* 5200 5185 * Sometimes, the OS only writes the last one bytes to update status ··· 5218 5203 page_offset = offset_in_page(gpa); 5219 5204 level = sp->role.level; 5220 5205 *nspte = 1; 5221 - if (!sp->role.cr4_pae) { 5206 + if (!sp->role.gpte_is_8_bytes) { 5222 5207 page_offset <<= 1; /* 32->64 */ 5223 5208 /* 5224 5209 * A 32-bit pde maps 4MB while the shadow pdes map ··· 5408 5393 * This can happen if a guest gets a page-fault on data access but the HW 5409 5394 * table walker is not able to read the instruction page (e.g instruction 5410 5395 * page is not present in memory). In those cases we simply restart the 5411 - * guest. 5396 + * guest, with the exception of AMD Erratum 1096 which is unrecoverable. 5412 5397 */ 5413 - if (unlikely(insn && !insn_len)) 5414 - return 1; 5398 + if (unlikely(insn && !insn_len)) { 5399 + if (!kvm_x86_ops->need_emulation_on_page_fault(vcpu)) 5400 + return 1; 5401 + } 5415 5402 5416 5403 er = x86_emulate_instruction(vcpu, cr2, emulation_type, insn, insn_len); 5417 5404 ··· 5526 5509 5527 5510 if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { 5528 5511 if (flush && lock_flush_tlb) { 5529 - kvm_flush_remote_tlbs(kvm); 5512 + kvm_flush_remote_tlbs_with_address(kvm, 5513 + start_gfn, 5514 + iterator.gfn - start_gfn + 1); 5530 5515 flush = false; 5531 5516 } 5532 5517 cond_resched_lock(&kvm->mmu_lock); ··· 5536 5517 } 5537 5518 5538 5519 if (flush && lock_flush_tlb) { 5539 - kvm_flush_remote_tlbs(kvm); 5520 + kvm_flush_remote_tlbs_with_address(kvm, start_gfn, 5521 + end_gfn - start_gfn + 1); 5540 5522 flush = false; 5541 5523 } 5542 5524 ··· 6031 6011 /* 6032 6012 * Calculate mmu pages needed for kvm. 6033 6013 */ 6034 - unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm) 6014 + unsigned int kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm) 6035 6015 { 6036 6016 unsigned int nr_mmu_pages; 6037 6017 unsigned int nr_pages = 0;
+2 -2
arch/x86/kvm/mmutrace.h
··· 29 29 \ 30 30 role.word = __entry->role; \ 31 31 \ 32 - trace_seq_printf(p, "sp gfn %llx l%u%s q%u%s %s%s" \ 32 + trace_seq_printf(p, "sp gfn %llx l%u %u-byte q%u%s %s%s" \ 33 33 " %snxe %sad root %u %s%c", \ 34 34 __entry->gfn, role.level, \ 35 - role.cr4_pae ? " pae" : "", \ 35 + role.gpte_is_8_bytes ? 8 : 4, \ 36 36 role.quadrant, \ 37 37 role.direct ? " direct" : "", \ 38 38 access_str[role.access], \
+32
arch/x86/kvm/svm.c
··· 7098 7098 return -ENODEV; 7099 7099 } 7100 7100 7101 + static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu) 7102 + { 7103 + bool is_user, smap; 7104 + 7105 + is_user = svm_get_cpl(vcpu) == 3; 7106 + smap = !kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); 7107 + 7108 + /* 7109 + * Detect and workaround Errata 1096 Fam_17h_00_0Fh 7110 + * 7111 + * In non SEV guest, hypervisor will be able to read the guest 7112 + * memory to decode the instruction pointer when insn_len is zero 7113 + * so we return true to indicate that decoding is possible. 7114 + * 7115 + * But in the SEV guest, the guest memory is encrypted with the 7116 + * guest specific key and hypervisor will not be able to decode the 7117 + * instruction pointer so we will not able to workaround it. Lets 7118 + * print the error and request to kill the guest. 7119 + */ 7120 + if (is_user && smap) { 7121 + if (!sev_guest(vcpu->kvm)) 7122 + return true; 7123 + 7124 + pr_err_ratelimited("KVM: Guest triggered AMD Erratum 1096\n"); 7125 + kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); 7126 + } 7127 + 7128 + return false; 7129 + } 7130 + 7101 7131 static struct kvm_x86_ops svm_x86_ops __ro_after_init = { 7102 7132 .cpu_has_kvm_support = has_svm, 7103 7133 .disabled_by_bios = is_disabled, ··· 7261 7231 7262 7232 .nested_enable_evmcs = nested_enable_evmcs, 7263 7233 .nested_get_evmcs_version = nested_get_evmcs_version, 7234 + 7235 + .need_emulation_on_page_fault = svm_need_emulation_on_page_fault, 7264 7236 }; 7265 7237 7266 7238 static int __init svm_init(void)
+5
arch/x86/kvm/vmx/nested.c
··· 2585 2585 !nested_host_cr4_valid(vcpu, vmcs12->host_cr4) || 2586 2586 !nested_cr3_valid(vcpu, vmcs12->host_cr3)) 2587 2587 return -EINVAL; 2588 + 2589 + if (is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu) || 2590 + is_noncanonical_address(vmcs12->host_ia32_sysenter_eip, vcpu)) 2591 + return -EINVAL; 2592 + 2588 2593 /* 2589 2594 * If the load IA32_EFER VM-exit control is 1, bits reserved in the 2590 2595 * IA32_EFER MSR must be 0 in the field for that register. In addition,
+6 -13
arch/x86/kvm/vmx/vmx.c
··· 1683 1683 1684 1684 msr_info->data = to_vmx(vcpu)->spec_ctrl; 1685 1685 break; 1686 - case MSR_IA32_ARCH_CAPABILITIES: 1687 - if (!msr_info->host_initiated && 1688 - !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES)) 1689 - return 1; 1690 - msr_info->data = to_vmx(vcpu)->arch_capabilities; 1691 - break; 1692 1686 case MSR_IA32_SYSENTER_CS: 1693 1687 msr_info->data = vmcs_read32(GUEST_SYSENTER_CS); 1694 1688 break; ··· 1888 1894 */ 1889 1895 vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD, 1890 1896 MSR_TYPE_W); 1891 - break; 1892 - case MSR_IA32_ARCH_CAPABILITIES: 1893 - if (!msr_info->host_initiated) 1894 - return 1; 1895 - vmx->arch_capabilities = data; 1896 1897 break; 1897 1898 case MSR_IA32_CR_PAT: 1898 1899 if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) { ··· 4076 4087 vmx->guest_msrs[j].mask = -1ull; 4077 4088 ++vmx->nmsrs; 4078 4089 } 4079 - 4080 - vmx->arch_capabilities = kvm_get_arch_capabilities(); 4081 4090 4082 4091 vm_exit_controls_init(vmx, vmx_vmexit_ctrl()); 4083 4092 ··· 7396 7409 return 0; 7397 7410 } 7398 7411 7412 + static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu) 7413 + { 7414 + return 0; 7415 + } 7416 + 7399 7417 static __init int hardware_setup(void) 7400 7418 { 7401 7419 unsigned long host_bndcfgs; ··· 7703 7711 .set_nested_state = NULL, 7704 7712 .get_vmcs12_pages = NULL, 7705 7713 .nested_enable_evmcs = NULL, 7714 + .need_emulation_on_page_fault = vmx_need_emulation_on_page_fault, 7706 7715 }; 7707 7716 7708 7717 static void vmx_cleanup_l1d_flush(void)
-1
arch/x86/kvm/vmx/vmx.h
··· 190 190 u64 msr_guest_kernel_gs_base; 191 191 #endif 192 192 193 - u64 arch_capabilities; 194 193 u64 spec_ctrl; 195 194 196 195 u32 vm_entry_controls_shadow;
+42 -17
arch/x86/kvm/x86.c
··· 1125 1125 #endif 1126 1126 MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA, 1127 1127 MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX, 1128 - MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES, 1128 + MSR_IA32_SPEC_CTRL, 1129 1129 MSR_IA32_RTIT_CTL, MSR_IA32_RTIT_STATUS, MSR_IA32_RTIT_CR3_MATCH, 1130 1130 MSR_IA32_RTIT_OUTPUT_BASE, MSR_IA32_RTIT_OUTPUT_MASK, 1131 1131 MSR_IA32_RTIT_ADDR0_A, MSR_IA32_RTIT_ADDR0_B, ··· 1158 1158 1159 1159 MSR_IA32_TSC_ADJUST, 1160 1160 MSR_IA32_TSCDEADLINE, 1161 + MSR_IA32_ARCH_CAPABILITIES, 1161 1162 MSR_IA32_MISC_ENABLE, 1162 1163 MSR_IA32_MCG_STATUS, 1163 1164 MSR_IA32_MCG_CTL, ··· 2444 2443 if (msr_info->host_initiated) 2445 2444 vcpu->arch.microcode_version = data; 2446 2445 break; 2446 + case MSR_IA32_ARCH_CAPABILITIES: 2447 + if (!msr_info->host_initiated) 2448 + return 1; 2449 + vcpu->arch.arch_capabilities = data; 2450 + break; 2447 2451 case MSR_EFER: 2448 2452 return set_efer(vcpu, data); 2449 2453 case MSR_K7_HWCR: ··· 2752 2746 break; 2753 2747 case MSR_IA32_UCODE_REV: 2754 2748 msr_info->data = vcpu->arch.microcode_version; 2749 + break; 2750 + case MSR_IA32_ARCH_CAPABILITIES: 2751 + if (!msr_info->host_initiated && 2752 + !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES)) 2753 + return 1; 2754 + msr_info->data = vcpu->arch.arch_capabilities; 2755 2755 break; 2756 2756 case MSR_IA32_TSC: 2757 2757 msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + vcpu->arch.tsc_offset; ··· 6535 6523 } 6536 6524 EXPORT_SYMBOL_GPL(kvm_emulate_instruction_from_buffer); 6537 6525 6526 + static int complete_fast_pio_out(struct kvm_vcpu *vcpu) 6527 + { 6528 + vcpu->arch.pio.count = 0; 6529 + 6530 + if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip))) 6531 + return 1; 6532 + 6533 + return kvm_skip_emulated_instruction(vcpu); 6534 + } 6535 + 6538 6536 static int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, 6539 6537 unsigned short port) 6540 6538 { 6541 6539 unsigned long val = kvm_register_read(vcpu, VCPU_REGS_RAX); 6542 6540 int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt, 6543 6541 size, port, &val, 1); 6544 - /* do not return to emulator after return from userspace */ 6545 - vcpu->arch.pio.count = 0; 6542 + 6543 + if (!ret) { 6544 + vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu); 6545 + vcpu->arch.complete_userspace_io = complete_fast_pio_out; 6546 + } 6546 6547 return ret; 6547 6548 } 6548 6549 ··· 6565 6540 6566 6541 /* We should only ever be called with arch.pio.count equal to 1 */ 6567 6542 BUG_ON(vcpu->arch.pio.count != 1); 6543 + 6544 + if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip))) { 6545 + vcpu->arch.pio.count = 0; 6546 + return 1; 6547 + } 6568 6548 6569 6549 /* For size less than 4 we merge, else we zero extend */ 6570 6550 val = (vcpu->arch.pio.size < 4) ? 
kvm_register_read(vcpu, VCPU_REGS_RAX) ··· 6583 6553 vcpu->arch.pio.port, &val, 1); 6584 6554 kvm_register_write(vcpu, VCPU_REGS_RAX, val); 6585 6555 6586 - return 1; 6556 + return kvm_skip_emulated_instruction(vcpu); 6587 6557 } 6588 6558 6589 6559 static int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size, ··· 6602 6572 return ret; 6603 6573 } 6604 6574 6575 + vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu); 6605 6576 vcpu->arch.complete_userspace_io = complete_fast_pio_in; 6606 6577 6607 6578 return 0; ··· 6610 6579 6611 6580 int kvm_fast_pio(struct kvm_vcpu *vcpu, int size, unsigned short port, int in) 6612 6581 { 6613 - int ret = kvm_skip_emulated_instruction(vcpu); 6582 + int ret; 6614 6583 6615 - /* 6616 - * TODO: we might be squashing a KVM_GUESTDBG_SINGLESTEP-triggered 6617 - * KVM_EXIT_DEBUG here. 6618 - */ 6619 6584 if (in) 6620 - return kvm_fast_pio_in(vcpu, size, port) && ret; 6585 + ret = kvm_fast_pio_in(vcpu, size, port); 6621 6586 else 6622 - return kvm_fast_pio_out(vcpu, size, port) && ret; 6587 + ret = kvm_fast_pio_out(vcpu, size, port); 6588 + return ret && kvm_skip_emulated_instruction(vcpu); 6623 6589 } 6624 6590 EXPORT_SYMBOL_GPL(kvm_fast_pio); 6625 6591 ··· 8761 8733 8762 8734 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu) 8763 8735 { 8736 + vcpu->arch.arch_capabilities = kvm_get_arch_capabilities(); 8764 8737 vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT; 8765 8738 kvm_vcpu_mtrr_init(vcpu); 8766 8739 vcpu_load(vcpu); ··· 9458 9429 const struct kvm_memory_slot *new, 9459 9430 enum kvm_mr_change change) 9460 9431 { 9461 - int nr_mmu_pages = 0; 9462 - 9463 9432 if (!kvm->arch.n_requested_mmu_pages) 9464 - nr_mmu_pages = kvm_mmu_calculate_mmu_pages(kvm); 9465 - 9466 - if (nr_mmu_pages) 9467 - kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages); 9433 + kvm_mmu_change_mmu_pages(kvm, 9434 + kvm_mmu_calculate_default_mmu_pages(kvm)); 9468 9435 9469 9436 /* 9470 9437 * Dirty logging tracks sptes in 4k granularity, meaning that large
+1
arch/xtensa/include/asm/Kbuild
··· 15 15 generic-y += kdebug.h 16 16 generic-y += kmap_types.h 17 17 generic-y += kprobes.h 18 + generic-y += kvm_para.h 18 19 generic-y += local.h 19 20 generic-y += local64.h 20 21 generic-y += mcs_spinlock.h
-1
arch/xtensa/include/uapi/asm/Kbuild
··· 1 1 generated-y += unistd_32.h 2 - generic-y += kvm_para.h
+2
include/uapi/linux/Kbuild
··· 7 7 endif 8 8 9 9 ifeq ($(wildcard $(srctree)/arch/$(SRCARCH)/include/uapi/asm/kvm_para.h),) 10 + ifeq ($(wildcard $(objtree)/arch/$(SRCARCH)/include/generated/uapi/asm/kvm_para.h),) 10 11 no-export-headers += kvm_para.h 12 + endif 11 13 endif
+2 -2
tools/testing/selftests/kvm/Makefile
··· 29 29 INSTALL_HDR_PATH = $(top_srcdir)/usr 30 30 LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/ 31 31 LINUX_TOOL_INCLUDE = $(top_srcdir)/tools/include 32 - CFLAGS += -O2 -g -std=gnu99 -I$(LINUX_TOOL_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude -I$(<D) -Iinclude/$(UNAME_M) -I.. 33 - LDFLAGS += -pthread 32 + CFLAGS += -O2 -g -std=gnu99 -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude -I$(<D) -Iinclude/$(UNAME_M) -I.. 33 + LDFLAGS += -pthread -no-pie 34 34 35 35 # After inclusion, $(OUTPUT) is defined and 36 36 # $(TEST_GEN_PROGS) starts with $(OUTPUT)/
+1
tools/testing/selftests/kvm/include/kvm_util.h
··· 102 102 struct kvm_run *vcpu_state(struct kvm_vm *vm, uint32_t vcpuid); 103 103 void vcpu_run(struct kvm_vm *vm, uint32_t vcpuid); 104 104 int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid); 105 + void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid); 105 106 void vcpu_set_mp_state(struct kvm_vm *vm, uint32_t vcpuid, 106 107 struct kvm_mp_state *mp_state); 107 108 void vcpu_regs_get(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_regs *regs);
+16
tools/testing/selftests/kvm/lib/kvm_util.c
··· 1121 1121 return rc; 1122 1122 } 1123 1123 1124 + void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid) 1125 + { 1126 + struct vcpu *vcpu = vcpu_find(vm, vcpuid); 1127 + int ret; 1128 + 1129 + TEST_ASSERT(vcpu != NULL, "vcpu not found, vcpuid: %u", vcpuid); 1130 + 1131 + vcpu->state->immediate_exit = 1; 1132 + ret = ioctl(vcpu->fd, KVM_RUN, NULL); 1133 + vcpu->state->immediate_exit = 0; 1134 + 1135 + TEST_ASSERT(ret == -1 && errno == EINTR, 1136 + "KVM_RUN IOCTL didn't exit immediately, rc: %i, errno: %i", 1137 + ret, errno); 1138 + } 1139 + 1124 1140 /* 1125 1141 * VM VCPU Set MP State 1126 1142 *
+19 -16
tools/testing/selftests/kvm/x86_64/cr4_cpuid_sync_test.c
··· 87 87 while (1) { 88 88 rc = _vcpu_run(vm, VCPU_ID); 89 89 90 - if (run->exit_reason == KVM_EXIT_IO) { 91 - switch (get_ucall(vm, VCPU_ID, &uc)) { 92 - case UCALL_SYNC: 93 - /* emulate hypervisor clearing CR4.OSXSAVE */ 94 - vcpu_sregs_get(vm, VCPU_ID, &sregs); 95 - sregs.cr4 &= ~X86_CR4_OSXSAVE; 96 - vcpu_sregs_set(vm, VCPU_ID, &sregs); 97 - break; 98 - case UCALL_ABORT: 99 - TEST_ASSERT(false, "Guest CR4 bit (OSXSAVE) unsynchronized with CPUID bit."); 100 - break; 101 - case UCALL_DONE: 102 - goto done; 103 - default: 104 - TEST_ASSERT(false, "Unknown ucall 0x%x.", uc.cmd); 105 - } 90 + TEST_ASSERT(run->exit_reason == KVM_EXIT_IO, 91 + "Unexpected exit reason: %u (%s),\n", 92 + run->exit_reason, 93 + exit_reason_str(run->exit_reason)); 94 + 95 + switch (get_ucall(vm, VCPU_ID, &uc)) { 96 + case UCALL_SYNC: 97 + /* emulate hypervisor clearing CR4.OSXSAVE */ 98 + vcpu_sregs_get(vm, VCPU_ID, &sregs); 99 + sregs.cr4 &= ~X86_CR4_OSXSAVE; 100 + vcpu_sregs_set(vm, VCPU_ID, &sregs); 101 + break; 102 + case UCALL_ABORT: 103 + TEST_ASSERT(false, "Guest CR4 bit (OSXSAVE) unsynchronized with CPUID bit."); 104 + break; 105 + case UCALL_DONE: 106 + goto done; 107 + default: 108 + TEST_ASSERT(false, "Unknown ucall 0x%x.", uc.cmd); 106 109 } 107 110 } 108 111
+16 -2
tools/testing/selftests/kvm/x86_64/state_test.c
··· 134 134 135 135 struct kvm_cpuid_entry2 *entry = kvm_get_supported_cpuid_entry(1); 136 136 137 + if (!kvm_check_cap(KVM_CAP_IMMEDIATE_EXIT)) { 138 + fprintf(stderr, "immediate_exit not available, skipping test\n"); 139 + exit(KSFT_SKIP); 140 + } 141 + 137 142 /* Create VM */ 138 143 vm = vm_create_default(VCPU_ID, 0, guest_code); 139 144 vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid()); ··· 161 156 stage, run->exit_reason, 162 157 exit_reason_str(run->exit_reason)); 163 158 164 - memset(&regs1, 0, sizeof(regs1)); 165 - vcpu_regs_get(vm, VCPU_ID, &regs1); 166 159 switch (get_ucall(vm, VCPU_ID, &uc)) { 167 160 case UCALL_ABORT: 168 161 TEST_ASSERT(false, "%s at %s:%d", (const char *)uc.args[0], ··· 178 175 TEST_ASSERT(!strcmp((const char *)uc.args[0], "hello") && 179 176 uc.args[1] == stage, "Unexpected register values vmexit #%lx, got %lx", 180 177 stage, (ulong)uc.args[1]); 178 + 179 + /* 180 + * When KVM exits to userspace with KVM_EXIT_IO, KVM guarantees 181 + * guest state is consistent only after userspace re-enters the 182 + * kernel with KVM_RUN. Complete IO prior to migrating state 183 + * to a new VM. 184 + */ 185 + vcpu_run_complete_io(vm, VCPU_ID); 186 + 187 + memset(&regs1, 0, sizeof(regs1)); 188 + vcpu_regs_get(vm, VCPU_ID, &regs1); 181 189 182 190 state = vcpu_save_state(vm, VCPU_ID); 183 191 kvm_vm_release(vm);
+2 -2
virt/kvm/arm/hyp/vgic-v3-sr.c
··· 222 222 } 223 223 } 224 224 225 - if (used_lrs) { 225 + if (used_lrs || cpu_if->its_vpe.its_vm) { 226 226 int i; 227 227 u32 elrsr; 228 228 ··· 247 247 u64 used_lrs = vcpu->arch.vgic_cpu.used_lrs; 248 248 int i; 249 249 250 - if (used_lrs) { 250 + if (used_lrs || cpu_if->its_vpe.its_vm) { 251 251 write_gicreg(cpu_if->vgic_hcr, ICH_HCR_EL2); 252 252 253 253 for (i = 0; i < used_lrs; i++)
+73 -52
virt/kvm/arm/mmu.c
··· 102 102 * @addr: IPA 103 103 * @pmd: pmd pointer for IPA 104 104 * 105 - * Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs. Marks all 106 - * pages in the range dirty. 105 + * Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs. 107 106 */ 108 107 static void stage2_dissolve_pmd(struct kvm *kvm, phys_addr_t addr, pmd_t *pmd) 109 108 { ··· 120 121 * @addr: IPA 121 122 * @pud: pud pointer for IPA 122 123 * 123 - * Function clears a PUD entry, flushes addr 1st and 2nd stage TLBs. Marks all 124 - * pages in the range dirty. 124 + * Function clears a PUD entry, flushes addr 1st and 2nd stage TLBs. 125 125 */ 126 126 static void stage2_dissolve_pud(struct kvm *kvm, phys_addr_t addr, pud_t *pudp) 127 127 { ··· 897 899 * kvm_alloc_stage2_pgd - allocate level-1 table for stage-2 translation. 898 900 * @kvm: The KVM struct pointer for the VM. 899 901 * 900 - * Allocates only the stage-2 HW PGD level table(s) (can support either full 901 - * 40-bit input addresses or limited to 32-bit input addresses). Clears the 902 - * allocated pages. 902 + * Allocates only the stage-2 HW PGD level table(s) of size defined by 903 + * stage2_pgd_size(kvm). 903 904 * 904 905 * Note we don't need locking here as this is only called when the VM is 905 906 * created, which can only be done once. ··· 1064 1067 { 1065 1068 pmd_t *pmd, old_pmd; 1066 1069 1070 + retry: 1067 1071 pmd = stage2_get_pmd(kvm, cache, addr); 1068 1072 VM_BUG_ON(!pmd); 1069 1073 1070 1074 old_pmd = *pmd; 1075 + /* 1076 + * Multiple vcpus faulting on the same PMD entry, can 1077 + * lead to them sequentially updating the PMD with the 1078 + * same value. Following the break-before-make 1079 + * (pmd_clear() followed by tlb_flush()) process can 1080 + * hinder forward progress due to refaults generated 1081 + * on missing translations. 1082 + * 1083 + * Skip updating the page table if the entry is 1084 + * unchanged. 1085 + */ 1086 + if (pmd_val(old_pmd) == pmd_val(*new_pmd)) 1087 + return 0; 1088 + 1071 1089 if (pmd_present(old_pmd)) { 1072 1090 /* 1073 - * Multiple vcpus faulting on the same PMD entry, can 1074 - * lead to them sequentially updating the PMD with the 1075 - * same value. Following the break-before-make 1076 - * (pmd_clear() followed by tlb_flush()) process can 1077 - * hinder forward progress due to refaults generated 1078 - * on missing translations. 1091 + * If we already have PTE level mapping for this block, 1092 + * we must unmap it to avoid inconsistent TLB state and 1093 + * leaking the table page. We could end up in this situation 1094 + * if the memory slot was marked for dirty logging and was 1095 + * reverted, leaving PTE level mappings for the pages accessed 1096 + * during the period. So, unmap the PTE level mapping for this 1097 + * block and retry, as we could have released the upper level 1098 + * table in the process. 1079 1099 * 1080 - * Skip updating the page table if the entry is 1081 - * unchanged. 1100 + * Normal THP split/merge follows mmu_notifier callbacks and do 1101 + * get handled accordingly. 1082 1102 */ 1083 - if (pmd_val(old_pmd) == pmd_val(*new_pmd)) 1084 - return 0; 1085 - 1103 + if (!pmd_thp_or_huge(old_pmd)) { 1104 + unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE); 1105 + goto retry; 1106 + } 1086 1107 /* 1087 1108 * Mapping in huge pages should only happen through a 1088 1109 * fault. If a page is merged into a transparent huge ··· 1112 1097 * should become splitting first, unmapped, merged, 1113 1098 * and mapped back in on-demand. 
1114 1099 */ 1115 - VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd)); 1116 - 1100 + WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd)); 1117 1101 pmd_clear(pmd); 1118 1102 kvm_tlb_flush_vmid_ipa(kvm, addr); 1119 1103 } else { ··· 1128 1114 { 1129 1115 pud_t *pudp, old_pud; 1130 1116 1117 + retry: 1131 1118 pudp = stage2_get_pud(kvm, cache, addr); 1132 1119 VM_BUG_ON(!pudp); 1133 1120 ··· 1136 1121 1137 1122 /* 1138 1123 * A large number of vcpus faulting on the same stage 2 entry, 1139 - * can lead to a refault due to the 1140 - * stage2_pud_clear()/tlb_flush(). Skip updating the page 1141 - * tables if there is no change. 1124 + * can lead to a refault due to the stage2_pud_clear()/tlb_flush(). 1125 + * Skip updating the page tables if there is no change. 1142 1126 */ 1143 1127 if (pud_val(old_pud) == pud_val(*new_pudp)) 1144 1128 return 0; 1145 1129 1146 1130 if (stage2_pud_present(kvm, old_pud)) { 1131 + /* 1132 + * If we already have table level mapping for this block, unmap 1133 + * the range for this block and retry. 1134 + */ 1135 + if (!stage2_pud_huge(kvm, old_pud)) { 1136 + unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE); 1137 + goto retry; 1138 + } 1139 + 1140 + WARN_ON_ONCE(kvm_pud_pfn(old_pud) != kvm_pud_pfn(*new_pudp)); 1147 1141 stage2_pud_clear(kvm, pudp); 1148 1142 kvm_tlb_flush_vmid_ipa(kvm, addr); 1149 1143 } else { ··· 1475 1451 } 1476 1452 1477 1453 /** 1478 - * stage2_wp_puds - write protect PGD range 1479 - * @pgd: pointer to pgd entry 1480 - * @addr: range start address 1481 - * @end: range end address 1482 - * 1483 - * Process PUD entries, for a huge PUD we cause a panic. 1484 - */ 1454 + * stage2_wp_puds - write protect PGD range 1455 + * @pgd: pointer to pgd entry 1456 + * @addr: range start address 1457 + * @end: range end address 1458 + */ 1485 1459 static void stage2_wp_puds(struct kvm *kvm, pgd_t *pgd, 1486 1460 phys_addr_t addr, phys_addr_t end) 1487 1461 { ··· 1616 1594 send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current); 1617 1595 } 1618 1596 1619 - static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot, 1620 - unsigned long hva) 1597 + static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot, 1598 + unsigned long hva, 1599 + unsigned long map_size) 1621 1600 { 1622 1601 gpa_t gpa_start; 1623 1602 hva_t uaddr_start, uaddr_end; ··· 1633 1610 1634 1611 /* 1635 1612 * Pages belonging to memslots that don't have the same alignment 1636 - * within a PMD for userspace and IPA cannot be mapped with stage-2 1637 - * PMD entries, because we'll end up mapping the wrong pages. 1613 + * within a PMD/PUD for userspace and IPA cannot be mapped with stage-2 1614 + * PMD/PUD entries, because we'll end up mapping the wrong pages. 
1638 1615 * 1639 1616 * Consider a layout like the following: 1640 1617 * 1641 1618 * memslot->userspace_addr: 1642 1619 * +-----+--------------------+--------------------+---+ 1643 - * |abcde|fgh Stage-1 PMD | Stage-1 PMD tv|xyz| 1620 + * |abcde|fgh Stage-1 block | Stage-1 block tv|xyz| 1644 1621 * +-----+--------------------+--------------------+---+ 1645 1622 * 1646 1623 * memslot->base_gfn << PAGE_SIZE: 1647 1624 * +---+--------------------+--------------------+-----+ 1648 - * |abc|def Stage-2 PMD | Stage-2 PMD |tvxyz| 1625 + * |abc|def Stage-2 block | Stage-2 block |tvxyz| 1649 1626 * +---+--------------------+--------------------+-----+ 1650 1627 * 1651 - * If we create those stage-2 PMDs, we'll end up with this incorrect 1628 + * If we create those stage-2 blocks, we'll end up with this incorrect 1652 1629 * mapping: 1653 1630 * d -> f 1654 1631 * e -> g 1655 1632 * f -> h 1656 1633 */ 1657 - if ((gpa_start & ~S2_PMD_MASK) != (uaddr_start & ~S2_PMD_MASK)) 1634 + if ((gpa_start & (map_size - 1)) != (uaddr_start & (map_size - 1))) 1658 1635 return false; 1659 1636 1660 1637 /* 1661 1638 * Next, let's make sure we're not trying to map anything not covered 1662 - * by the memslot. This means we have to prohibit PMD size mappings 1663 - * for the beginning and end of a non-PMD aligned and non-PMD sized 1639 + * by the memslot. This means we have to prohibit block size mappings 1640 + * for the beginning and end of a non-block aligned and non-block sized 1664 1641 * memory slot (illustrated by the head and tail parts of the 1665 1642 * userspace view above containing pages 'abcde' and 'xyz', 1666 1643 * respectively). ··· 1669 1646 * userspace_addr or the base_gfn, as both are equally aligned (per 1670 1647 * the check above) and equally sized. 1671 1648 */ 1672 - return (hva & S2_PMD_MASK) >= uaddr_start && 1673 - (hva & S2_PMD_MASK) + S2_PMD_SIZE <= uaddr_end; 1649 + return (hva & ~(map_size - 1)) >= uaddr_start && 1650 + (hva & ~(map_size - 1)) + map_size <= uaddr_end; 1674 1651 } 1675 1652 1676 1653 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, ··· 1699 1676 return -EFAULT; 1700 1677 } 1701 1678 1702 - if (!fault_supports_stage2_pmd_mappings(memslot, hva)) 1703 - force_pte = true; 1704 - 1705 - if (logging_active) 1706 - force_pte = true; 1707 - 1708 1679 /* Let's check if we will get back a huge page backed by hugetlbfs */ 1709 1680 down_read(&current->mm->mmap_sem); 1710 1681 vma = find_vma_intersection(current->mm, hva, hva + 1); ··· 1709 1692 } 1710 1693 1711 1694 vma_pagesize = vma_kernel_pagesize(vma); 1695 + if (logging_active || 1696 + !fault_supports_stage2_huge_mapping(memslot, hva, vma_pagesize)) { 1697 + force_pte = true; 1698 + vma_pagesize = PAGE_SIZE; 1699 + } 1700 + 1712 1701 /* 1713 1702 * The stage2 has a minimum of 2 level table (For arm64 see 1714 1703 * kvm_arm_setup_stage2()). Hence, we are guaranteed that we can ··· 1722 1699 * As for PUD huge maps, we must make sure that we have at least 1723 1700 * 3 levels, i.e, PMD is not folded. 1724 1701 */ 1725 - if ((vma_pagesize == PMD_SIZE || 1726 - (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) && 1727 - !force_pte) { 1702 + if (vma_pagesize == PMD_SIZE || 1703 + (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) 1728 1704 gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT; 1729 - } 1730 1705 up_read(&current->mm->mmap_sem); 1731 1706 1732 1707 /* We need minimum second+third level pages */
+19 -12
virt/kvm/arm/vgic/vgic-its.c
··· 754 754 u64 indirect_ptr, type = GITS_BASER_TYPE(baser); 755 755 phys_addr_t base = GITS_BASER_ADDR_48_to_52(baser); 756 756 int esz = GITS_BASER_ENTRY_SIZE(baser); 757 - int index; 757 + int index, idx; 758 758 gfn_t gfn; 759 + bool ret; 759 760 760 761 switch (type) { 761 762 case GITS_BASER_TYPE_DEVICE: ··· 783 782 784 783 if (eaddr) 785 784 *eaddr = addr; 786 - return kvm_is_visible_gfn(its->dev->kvm, gfn); 785 + 786 + goto out; 787 787 } 788 788 789 789 /* calculate and check the index into the 1st level */ ··· 814 812 815 813 if (eaddr) 816 814 *eaddr = indirect_ptr; 817 - return kvm_is_visible_gfn(its->dev->kvm, gfn); 815 + 816 + out: 817 + idx = srcu_read_lock(&its->dev->kvm->srcu); 818 + ret = kvm_is_visible_gfn(its->dev->kvm, gfn); 819 + srcu_read_unlock(&its->dev->kvm->srcu, idx); 820 + return ret; 818 821 } 819 822 820 823 static int vgic_its_alloc_collection(struct vgic_its *its, ··· 1736 1729 kfree(its); 1737 1730 } 1738 1731 1739 - int vgic_its_has_attr_regs(struct kvm_device *dev, 1740 - struct kvm_device_attr *attr) 1732 + static int vgic_its_has_attr_regs(struct kvm_device *dev, 1733 + struct kvm_device_attr *attr) 1741 1734 { 1742 1735 const struct vgic_register_region *region; 1743 1736 gpa_t offset = attr->attr; ··· 1757 1750 return 0; 1758 1751 } 1759 1752 1760 - int vgic_its_attr_regs_access(struct kvm_device *dev, 1761 - struct kvm_device_attr *attr, 1762 - u64 *reg, bool is_write) 1753 + static int vgic_its_attr_regs_access(struct kvm_device *dev, 1754 + struct kvm_device_attr *attr, 1755 + u64 *reg, bool is_write) 1763 1756 { 1764 1757 const struct vgic_register_region *region; 1765 1758 struct vgic_its *its; ··· 1926 1919 ((u64)ite->irq->intid << KVM_ITS_ITE_PINTID_SHIFT) | 1927 1920 ite->collection->collection_id; 1928 1921 val = cpu_to_le64(val); 1929 - return kvm_write_guest(kvm, gpa, &val, ite_esz); 1922 + return kvm_write_guest_lock(kvm, gpa, &val, ite_esz); 1930 1923 } 1931 1924 1932 1925 /** ··· 2073 2066 (itt_addr_field << KVM_ITS_DTE_ITTADDR_SHIFT) | 2074 2067 (dev->num_eventid_bits - 1)); 2075 2068 val = cpu_to_le64(val); 2076 - return kvm_write_guest(kvm, ptr, &val, dte_esz); 2069 + return kvm_write_guest_lock(kvm, ptr, &val, dte_esz); 2077 2070 } 2078 2071 2079 2072 /** ··· 2253 2246 ((u64)collection->target_addr << KVM_ITS_CTE_RDBASE_SHIFT) | 2254 2247 collection->collection_id); 2255 2248 val = cpu_to_le64(val); 2256 - return kvm_write_guest(its->dev->kvm, gpa, &val, esz); 2249 + return kvm_write_guest_lock(its->dev->kvm, gpa, &val, esz); 2257 2250 } 2258 2251 2259 2252 static int vgic_its_restore_cte(struct vgic_its *its, gpa_t gpa, int esz) ··· 2324 2317 */ 2325 2318 val = 0; 2326 2319 BUG_ON(cte_esz > sizeof(val)); 2327 - ret = kvm_write_guest(its->dev->kvm, gpa, &val, cte_esz); 2320 + ret = kvm_write_guest_lock(its->dev->kvm, gpa, &val, cte_esz); 2328 2321 return ret; 2329 2322 } 2330 2323
+2 -2
virt/kvm/arm/vgic/vgic-v3.c
··· 358 358 if (status) { 359 359 /* clear consumed data */ 360 360 val &= ~(1 << bit_nr); 361 - ret = kvm_write_guest(kvm, ptr, &val, 1); 361 + ret = kvm_write_guest_lock(kvm, ptr, &val, 1); 362 362 if (ret) 363 363 return ret; 364 364 } ··· 409 409 else 410 410 val &= ~(1 << bit_nr); 411 411 412 - ret = kvm_write_guest(kvm, ptr, &val, 1); 412 + ret = kvm_write_guest_lock(kvm, ptr, &val, 1); 413 413 if (ret) 414 414 return ret; 415 415 }
+10 -4
virt/kvm/arm/vgic/vgic.c
··· 867 867 * either observe the new interrupt before or after doing this check, 868 868 * and introducing additional synchronization mechanism doesn't change 869 869 * this. 870 + * 871 + * Note that we still need to go through the whole thing if anything 872 + * can be directly injected (GICv4). 870 873 */ 871 - if (list_empty(&vcpu->arch.vgic_cpu.ap_list_head)) 874 + if (list_empty(&vcpu->arch.vgic_cpu.ap_list_head) && 875 + !vgic_supports_direct_msis(vcpu->kvm)) 872 876 return; 873 877 874 878 DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); 875 879 876 - raw_spin_lock(&vcpu->arch.vgic_cpu.ap_list_lock); 877 - vgic_flush_lr_state(vcpu); 878 - raw_spin_unlock(&vcpu->arch.vgic_cpu.ap_list_lock); 880 + if (!list_empty(&vcpu->arch.vgic_cpu.ap_list_head)) { 881 + raw_spin_lock(&vcpu->arch.vgic_cpu.ap_list_lock); 882 + vgic_flush_lr_state(vcpu); 883 + raw_spin_unlock(&vcpu->arch.vgic_cpu.ap_list_lock); 884 + } 879 885 880 886 if (can_access_vgic_from_kernel()) 881 887 vgic_restore_state(vcpu);
+3 -3
virt/kvm/eventfd.c
··· 214 214 215 215 if (flags & EPOLLHUP) { 216 216 /* The eventfd is closing, detach from KVM */ 217 - unsigned long flags; 217 + unsigned long iflags; 218 218 219 - spin_lock_irqsave(&kvm->irqfds.lock, flags); 219 + spin_lock_irqsave(&kvm->irqfds.lock, iflags); 220 220 221 221 /* 222 222 * We must check if someone deactivated the irqfd before ··· 230 230 if (irqfd_is_active(irqfd)) 231 231 irqfd_deactivate(irqfd); 232 232 233 - spin_unlock_irqrestore(&kvm->irqfds.lock, flags); 233 + spin_unlock_irqrestore(&kvm->irqfds.lock, iflags); 234 234 } 235 235 236 236 return 0;
+3
virt/kvm/kvm_main.c
··· 2905 2905 { 2906 2906 struct kvm_device *dev = filp->private_data; 2907 2907 2908 + if (dev->kvm->mm != current->mm) 2909 + return -EIO; 2910 + 2908 2911 switch (ioctl) { 2909 2912 case KVM_SET_DEVICE_ATTR: 2910 2913 return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
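
The kvm_main.c hunk above implements "KVM: Reject device ioctls from processes other than the VM's creator": a device ioctl issued from a different mm now fails with EIO, matching the updated api.txt text. A small sketch of what this means for userspace follows; it assumes a host where the VFIO KVM device type is available, the device type and attribute group are illustrative only, and error handling is omitted:

#include <errno.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);

	/* KVM_CREATE_DEVICE returns the device fd in cd.fd. */
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	ioctl(vm, KVM_CREATE_DEVICE, &cd);

	if (fork() == 0) {
		/* Child: the device fd is inherited, but the child's mm is not
		 * the VM creator's, so with this patch the ioctl fails (EIO). */
		struct kvm_device_attr attr = { .group = KVM_DEV_VFIO_GROUP };
		int ret = ioctl(cd.fd, KVM_HAS_DEVICE_ATTR, &attr);
		printf("child: ret=%d errno=%d\n", ret, errno);
		_exit(0);
	}
	wait(NULL);
	return 0;
}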