Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

+2

Documentation/arch/arm64/cpu-feature-registers.rst

···
     +------------------------------+---------+---------+
     | DIT                          | [51-48] |    y    |
     +------------------------------+---------+---------+
+    | MPAM                         | [43-40] |    n    |
+    +------------------------------+---------+---------+
     | SVE                          | [35-32] |    y    |
     +------------------------------+---------+---------+
     | GIC                          | [27-24] |    n    |

+64

Documentation/arch/loongarch/irq-chip-model.rst

···
         | Devices |
         +---------+

+Virtual Extended IRQ model
+==========================
+
+In this model, IPI (Inter-Processor Interrupt) and CPU Local Timer interrupts
+go to CPUINTC directly, CPU UART interrupts go to PCH-PIC, while all other
+device interrupts go to PCH-PIC/PCH-MSI and are gathered by V-EIOINTC (Virtual
+Extended I/O Interrupt Controller), and then go to CPUINTC directly::
+
+       +-----+    +-------------------+     +-------+
+       | IPI |--> | CPUINTC(0-255vcpu)| <-- | Timer |
+       +-----+    +-------------------+     +-------+
+                            ^
+                            |
+                      +-----------+
+                      | V-EIOINTC |
+                      +-----------+
+                       ^         ^
+                       |         |
+                +---------+ +---------+
+                | PCH-PIC | | PCH-MSI |
+                +---------+ +---------+
+                  ^      ^          ^
+                  |      |          |
+           +--------+ +---------+ +---------+
+           | UARTs  | | Devices | | Devices |
+           +--------+ +---------+ +---------+
+
+
+Description
+-----------
+V-EIOINTC (Virtual Extended I/O Interrupt Controller) is an extension of
+EIOINTC; it only works in VM mode, running under the KVM hypervisor. Interrupts
+can be routed to up to four vCPUs via standard EIOINTC, whereas with V-EIOINTC
+interrupts can be routed to up to 256 vCPUs.
+
+With standard EIOINTC, the interrupt routing setting includes two parts: eight
+bits for CPU selection and four bits for CPU IP (Interrupt Pin) selection.
+For CPU selection there are four bits for EIOINTC node selection and four bits
+for EIOINTC CPU selection. The bitmap method is used for CPU selection and
+CPU IP selection, so an interrupt can only be routed to CPU0 - CPU3 and
+IP0 - IP3 in one EIOINTC node.
+
+V-EIOINTC supports routing to more CPUs and CPU IPs (Interrupt Pins) through
+two newly added registers.
+
+EXTIOI_VIRT_FEATURES
+--------------------
+This register is a read-only register that indicates the features supported by
+V-EIOINTC. Features EXTIOI_HAS_INT_ENCODE and EXTIOI_HAS_CPU_ENCODE are added.
+
+Feature EXTIOI_HAS_INT_ENCODE is part of standard EIOINTC. If it is 1, it
+indicates that CPU Interrupt Pin selection can use the normal (encoded) method
+rather than the bitmap method, so interrupts can be routed to IP0 - IP15.
+
+Feature EXTIOI_HAS_CPU_ENCODE is an extension of V-EIOINTC. If it is 1, it
+indicates that CPU selection can use the normal (encoded) method rather than
+the bitmap method, so interrupts can be routed to CPU0 - CPU255.
+
+EXTIOI_VIRT_CONFIG
+------------------
+This register is a read-write register. For compatibility, interrupt routing uses
+the default method, which is the same as with standard EIOINTC. If the bit is set
+to 1, it tells HW to use the normal method rather than the bitmap method.
+
 Advanced Extended IRQ model
 ===========================

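The difference between the bitmap method and the normal (encoded) method in the V-EIOINTC description can be sketched in C. This is only an illustration: the helper names and the one-hot layout are assumptions, not taken from the kernel source or the hardware manual. The point is that a four-bit bitmap can name four targets, while an eight-bit encoded field can name 256.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; names and bit layout are assumptions for illustration. */

/* Bitmap method: one bit per CPU, so a 4-bit field can only name CPU0 - CPU3. */
static uint8_t route_cpu_bitmap(unsigned int cpu)
{
	assert(cpu < 4);
	return (uint8_t)(1u << cpu);
}

/* Encoded method: the field holds the CPU number itself (CPU0 - CPU255). */
static uint8_t route_cpu_encoded(unsigned int cpu)
{
	assert(cpu < 256);
	return (uint8_t)cpu;
}

/* Decode a one-hot 4-bit bitmap back to a CPU index; -1 if it is not one-hot. */
static int bitmap_to_cpu(uint8_t bitmap)
{
	for (int cpu = 0; cpu < 4; cpu++)
		if (bitmap == (1u << cpu))
			return cpu;
	return -1;
}
```

With these helpers, `route_cpu_bitmap(3)` yields `0x8`, while `route_cpu_encoded(200)` simply yields `200`, a target the bitmap form cannot express.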

+55

Documentation/translations/zh_CN/arch/loongarch/irq-chip-model.rst

··· 87 87 | Devices | 88 88 +---------+ 89 89 90 + 虚拟扩展IRQ模型 91 + =============== 92 + 93 + 在这种模型里面, IPI(Inter-Processor Interrupt) 和CPU本地时钟中断直接发送到CPUINTC, 94 + CPU串口 (UARTs) 中断发送到PCH-PIC, 而其他所有设备的中断则分别发送到所连接的PCH_PIC/ 95 + PCH-MSI, 然后V-EIOINTC统一收集，再直接到达CPUINTC:: 96 + 97 + +-----+ +-------------------+ +-------+ 98 + | IPI |--> | CPUINTC(0-255vcpu)| <-- | Timer | 99 + +-----+ +-------------------+ +-------+ 100 + ^ 101 + | 102 + +-----------+ 103 + | V-EIOINTC | 104 + +-----------+ 105 + ^ ^ 106 + | | 107 + +---------+ +---------+ 108 + | PCH-PIC | | PCH-MSI | 109 + +---------+ +---------+ 110 + ^ ^ ^ 111 + | | | 112 + +--------+ +---------+ +---------+ 113 + | UARTs | | Devices | | Devices | 114 + +--------+ +---------+ +---------+ 115 + 116 + V-EIOINTC 是EIOINTC的扩展, 仅工作在虚拟机模式下, 中断经EIOINTC最多可个路由到 117 + ４个虚拟CPU. 但中断经V-EIOINTC最多可个路由到256个虚拟CPU. 118 + 119 + 传统的EIOINTC中断控制器，中断路由分为两个部分：8比特用于控制路由到哪个CPU， 120 + 4比特用于控制路由到特定CPU的哪个中断管脚。控制CPU路由的8比特前4比特用于控制 121 + 路由到哪个EIOINTC节点，后4比特用于控制此节点哪个CPU。中断路由在选择CPU路由 122 + 和CPU中断管脚路由时，使用bitmap编码方式而不是正常编码方式，所以对于一个 123 + EIOINTC中断控制器节点，中断只能路由到CPU0 - CPU3，中断管脚IP0-IP3。 124 + 125 + V-EIOINTC新增了两个寄存器，支持中断路由到更多CPU个和中断管脚。 126 + 127 + V-EIOINTC功能寄存器 128 + ------------------- 129 + 功能寄存器是只读寄存器，用于显示V-EIOINTC支持的特性，目前两个支持两个特性 130 + EXTIOI_HAS_INT_ENCODE 和 EXTIOI_HAS_CPU_ENCODE。 131 + 132 + 特性EXTIOI_HAS_INT_ENCODE是传统EIOINTC中断控制器的一个特性，如果此比特为1， 133 + 显示CPU中断管脚路由方式支持正常编码，而不是bitmap编码，所以中断可以路由到 134 + 管脚IP0 - IP15。 135 + 136 + 特性EXTIOI_HAS_CPU_ENCODE是V-EIOINTC新增特性，如果此比特为1，表示CPU路由 137 + 方式支持正常编码，而不是bitmap编码，所以中断可以路由到CPU0 - CPU255。 138 + 139 + V-EIOINTC配置寄存器 140 + ------------------- 141 + 配置寄存器是可读写寄存器，为了兼容性考虑，如果不写此寄存器，中断路由采用 142 + 和传统EIOINTC相同的路由设置。如果对应比特设置为1，表示采用正常路由方式而 143 + 不是bitmap编码的路由方式。 144 + 90 145 高级扩展IRQ模型 91 146 =============== 92 147

+116 -74

Documentation/virt/kvm/api.rst

··· 7 7 1. General description 8 8 ====================== 9 9 10 - The kvm API is a set of ioctls that are issued to control various aspects 11 - of a virtual machine. The ioctls belong to the following classes: 10 + The kvm API is centered around different kinds of file descriptors 11 + and ioctls that can be issued to these file descriptors. An initial 12 + open("/dev/kvm") obtains a handle to the kvm subsystem; this handle 13 + can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this 14 + handle will create a VM file descriptor which can be used to issue VM 15 + ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will 16 + create a virtual cpu or device and return a file descriptor pointing to 17 + the new resource. 18 + 19 + In other words, the kvm API is a set of ioctls that are issued to 20 + different kinds of file descriptor in order to control various aspects of 21 + a virtual machine. Depending on the file descriptor that accepts them, 22 + ioctls belong to the following classes: 12 23 13 24 - System ioctls: These query and set global attributes which affect the 14 25 whole kvm subsystem. In addition a system ioctl is used to create ··· 46 35 device ioctls must be issued from the same process (address space) that 47 36 was used to create the VM. 48 37 49 - 2. File descriptors 50 - =================== 38 + While most ioctls are specific to one kind of file descriptor, in some 39 + cases the same ioctl can belong to more than one class. 51 40 52 - The kvm API is centered around file descriptors. An initial 53 - open("/dev/kvm") obtains a handle to the kvm subsystem; this handle 54 - can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this 55 - handle will create a VM file descriptor which can be used to issue VM 56 - ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will 57 - create a virtual cpu or device and return a file descriptor pointing to 58 - the new resource. 
Finally, ioctls on a vcpu or device fd can be used 59 - to control the vcpu or device. For vcpus, this includes the important 60 - task of actually running guest code. 41 + The KVM API grew over time. For this reason, KVM defines many constants 42 + of the form ``KVM_CAP_*``, each corresponding to a set of functionality 43 + provided by one or more ioctls. Availability of these "capabilities" can 44 + be checked with :ref:`KVM_CHECK_EXTENSION <KVM_CHECK_EXTENSION>`. Some 45 + capabilities also need to be enabled for VMs or VCPUs where their 46 + functionality is desired (see :ref:`cap_enable` and :ref:`cap_enable_vm`). 47 + 48 + 49 + 2. Restrictions 50 + =============== 61 51 62 52 In general file descriptors can be migrated among processes by means 63 53 of fork() and the SCM_RIGHTS facility of unix domain socket. These ··· 108 96 Capability: 109 97 which KVM extension provides this ioctl. Can be 'basic', 110 98 which means that is will be provided by any kernel that supports 111 - API version 12 (see section 4.1), a KVM_CAP_xyz constant, which 112 - means availability needs to be checked with KVM_CHECK_EXTENSION 113 - (see section 4.4), or 'none' which means that while not all kernels 114 - support this ioctl, there's no capability bit to check its 115 - availability: for kernels that don't support the ioctl, 116 - the ioctl returns -ENOTTY. 99 + API version 12 (see :ref:`KVM_GET_API_VERSION <KVM_GET_API_VERSION>`), 100 + or a KVM_CAP_xyz constant that can be checked with 101 + :ref:`KVM_CHECK_EXTENSION <KVM_CHECK_EXTENSION>`. 117 102 118 103 Architectures: 119 104 which instruction set architectures provide this ioctl. ··· 126 117 the return value. General error numbers (EBADF, ENOMEM, EINVAL) 127 118 are not detailed, but errors with specific meanings are. 128 119 120 + 121 + .. _KVM_GET_API_VERSION: 129 122 130 123 4.1 KVM_GET_API_VERSION 131 124 ----------------------- ··· 257 246 otherwise. 258 247 259 248 249 + .. 
_KVM_CHECK_EXTENSION: 250 + 260 251 4.4 KVM_CHECK_EXTENSION 261 252 ----------------------- 262 253 ··· 301 288 302 289 - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at 303 290 KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on 304 - KVM_CAP_DIRTY_LOG_RING, see section 8.3. 291 + KVM_CAP_DIRTY_LOG_RING, see :ref:`KVM_CAP_DIRTY_LOG_RING`. 305 292 306 293 307 294 4.7 KVM_CREATE_VCPU ··· 351 338 cpu's hardware control block. 352 339 353 340 354 - 4.8 KVM_GET_DIRTY_LOG (vm ioctl) 355 - -------------------------------- 341 + 4.8 KVM_GET_DIRTY_LOG 342 + --------------------- 356 343 357 344 :Capability: basic 358 345 :Architectures: all ··· 1311 1298 1312 1299 :Capability: KVM_CAP_DEBUGREGS 1313 1300 :Architectures: x86 1314 - :Type: vm ioctl 1301 + :Type: vcpu ioctl 1315 1302 :Parameters: struct kvm_debugregs (out) 1316 1303 :Returns: 0 on success, -1 on error 1317 1304 ··· 1333 1320 1334 1321 :Capability: KVM_CAP_DEBUGREGS 1335 1322 :Architectures: x86 1336 - :Type: vm ioctl 1323 + :Type: vcpu ioctl 1337 1324 :Parameters: struct kvm_debugregs (in) 1338 1325 :Returns: 0 on success, -1 on error 1339 1326 ··· 1441 1428 because of a quirk in the virtualization implementation (see the internals 1442 1429 documentation when it pops into existence). 1443 1430 1431 + 1432 + .. _KVM_ENABLE_CAP: 1444 1433 1445 1434 4.37 KVM_ENABLE_CAP 1446 1435 ------------------- ··· 2131 2116 2132 2117 The "bitmap" field is the userspace address of an array. This array 2133 2118 consists of a number of bits, equal to the total number of TLB entries as 2134 - determined by the last successful call to KVM_CONFIG_TLB, rounded up to the 2135 - nearest multiple of 64. 2119 + determined by the last successful call to ``KVM_ENABLE_CAP(KVM_CAP_SW_TLB)``, 2120 + rounded up to the nearest multiple of 64. 2136 2121 2137 2122 Each bit corresponds to one TLB entry, ordered the same as in the shared TLB 2138 2123 array. 
··· 2183 2168 the entries written by kernel-handled H_PUT_TCE calls, and also lets 2184 2169 userspace update the TCE table directly which is useful in some 2185 2170 circumstances. 2186 - 2187 - 2188 - 4.63 KVM_ALLOCATE_RMA 2189 - --------------------- 2190 - 2191 - :Capability: KVM_CAP_PPC_RMA 2192 - :Architectures: powerpc 2193 - :Type: vm ioctl 2194 - :Parameters: struct kvm_allocate_rma (out) 2195 - :Returns: file descriptor for mapping the allocated RMA 2196 - 2197 - This allocates a Real Mode Area (RMA) from the pool allocated at boot 2198 - time by the kernel. An RMA is a physically-contiguous, aligned region 2199 - of memory used on older POWER processors to provide the memory which 2200 - will be accessed by real-mode (MMU off) accesses in a KVM guest. 2201 - POWER processors support a set of sizes for the RMA that usually 2202 - includes 64MB, 128MB, 256MB and some larger powers of two. 2203 - 2204 - :: 2205 - 2206 - /* for KVM_ALLOCATE_RMA */ 2207 - struct kvm_allocate_rma { 2208 - __u64 rma_size; 2209 - }; 2210 - 2211 - The return value is a file descriptor which can be passed to mmap(2) 2212 - to map the allocated RMA into userspace. The mapped area can then be 2213 - passed to the KVM_SET_USER_MEMORY_REGION ioctl to establish it as the 2214 - RMA for a virtual machine. The size of the RMA in bytes (which is 2215 - fixed at host kernel boot time) is returned in the rma_size field of 2216 - the argument structure. 2217 - 2218 - The KVM_CAP_PPC_RMA capability is 1 or 2 if the KVM_ALLOCATE_RMA ioctl 2219 - is supported; 2 if the processor requires all virtual machines to have 2220 - an RMA, or 1 if the processor can use an RMA but doesn't require it, 2221 - because it supports the Virtual RMA (VRMA) facility. 2222 2171 2223 2172 2224 2173 4.64 KVM_NMI ··· 2581 2602 ======================= ========= ===== ======================================= 2582 2603 2583 2604 .. [1] These encodings are not accepted for SVE-enabled vcpus. See 2584 - KVM_ARM_VCPU_INIT. 
2605 + :ref:`KVM_ARM_VCPU_INIT`. 2585 2606 2586 2607 The equivalent register content can be accessed via bits [127:0] of 2587 2608 the corresponding SVE Zn registers instead for vcpus that have SVE ··· 3571 3592 3572 3593 This ioctl returns the guest registers that are supported for the 3573 3594 KVM_GET_ONE_REG/KVM_SET_ONE_REG calls. 3595 + 3596 + Note that s390 does not support KVM_GET_REG_LIST for historical reasons 3597 + (read: nobody cared). The set of registers in kernels 4.x and newer is: 3598 + 3599 + - KVM_REG_S390_TODPR 3600 + 3601 + - KVM_REG_S390_EPOCHDIFF 3602 + 3603 + - KVM_REG_S390_CPU_TIMER 3604 + 3605 + - KVM_REG_S390_CLOCK_COMP 3606 + 3607 + - KVM_REG_S390_PFTOKEN 3608 + 3609 + - KVM_REG_S390_PFCOMPARE 3610 + 3611 + - KVM_REG_S390_PFSELECT 3612 + 3613 + - KVM_REG_S390_PP 3614 + 3615 + - KVM_REG_S390_GBEA 3574 3616 3575 3617 3576 3618 4.85 KVM_ARM_SET_DEVICE_ADDR (deprecated) ··· 4956 4956 between coalesced mmio and pio except that coalesced pio records accesses 4957 4957 to I/O ports. 4958 4958 4959 - 4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl) 4960 - ------------------------------------ 4959 + 4.117 KVM_CLEAR_DIRTY_LOG 4960 + ------------------------- 4961 4961 4962 4962 :Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 4963 4963 :Architectures: x86, arm64, mips ··· 5093 5093 Finalizes the configuration of the specified vcpu feature. 5094 5094 5095 5095 The vcpu must already have been initialised, enabling the affected feature, by 5096 - means of a successful KVM_ARM_VCPU_INIT call with the appropriate flag set in 5097 - features[]. 5096 + means of a successful :ref:`KVM_ARM_VCPU_INIT <KVM_ARM_VCPU_INIT>` call with the 5097 + appropriate flag set in features[]. 5098 5098 5099 5099 For affected vcpu features, this is a mandatory step that must be performed 5100 5100 before the vcpu is fully usable. 
··· 5266 5266 4.123 KVM_S390_INITIAL_RESET 5267 5267 ---------------------------- 5268 5268 5269 - :Capability: none 5269 + :Capability: basic 5270 5270 :Architectures: s390 5271 5271 :Type: vcpu ioctl 5272 5272 :Parameters: none ··· 6205 6205 .. _KVM_ARM_GET_REG_WRITABLE_MASKS: 6206 6206 6207 6207 4.139 KVM_ARM_GET_REG_WRITABLE_MASKS 6208 - ------------------------------------------- 6208 + ------------------------------------ 6209 6209 6210 6210 :Capability: KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES 6211 6211 :Architectures: arm64 ··· 6442 6442 6443 6443 `flags` must currently be zero. 6444 6444 6445 + 6446 + .. _kvm_run: 6445 6447 6446 6448 5. The kvm_run structure 6447 6449 ======================== ··· 6857 6855 the guest issued a SYSTEM_RESET2 call according to v1.1 of the PSCI 6858 6856 specification. 6859 6857 6858 + - for arm64, data[0] is set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2 6859 + if the guest issued a SYSTEM_OFF2 call according to v1.3 of the PSCI 6860 + specification. 6861 + 6860 6862 - for RISC-V, data[0] is set to the value of the second argument of the 6861 6863 ``sbi_system_reset`` call. 6862 6864 ··· 6893 6887 6894 6888 - Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2 6895 6889 "Caller responsibilities" for possible return values. 6890 + 6891 + Hibernation using the PSCI SYSTEM_OFF2 call is enabled when PSCI v1.3 6892 + is enabled. If a guest invokes the PSCI SYSTEM_OFF2 function, KVM will 6893 + exit to userspace with the KVM_SYSTEM_EVENT_SHUTDOWN event type and with 6894 + data[0] set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2. The only 6895 + supported hibernate type for the SYSTEM_OFF2 function is HIBERNATE_OFF. 6896 6896 6897 6897 :: 6898 6898 ··· 7174 7162 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set. 7175 7163 7176 7164 7165 + .. _cap_enable: 7166 + 7177 7167 6. 
Capabilities that can be enabled on vCPUs 7178 7168 ============================================ 7179 7169 7180 7170 There are certain capabilities that change the behavior of the virtual CPU or 7181 - the virtual machine when enabled. To enable them, please see section 4.37. 7171 + the virtual machine when enabled. To enable them, please see 7172 + :ref:`KVM_ENABLE_CAP`. 7173 + 7182 7174 Below you can find a list of capabilities and what their effect on the vCPU or 7183 7175 the virtual machine is when enabling them. 7184 7176 ··· 7391 7375 sets are supported 7392 7376 (bitfields defined in arch/x86/include/uapi/asm/kvm.h). 7393 7377 7394 - As described above in the kvm_sync_regs struct info in section 5 (kvm_run): 7378 + As described above in the kvm_sync_regs struct info in section :ref:`kvm_run`, 7395 7379 KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers 7396 7380 without having to call SET/GET_*REGS". This reduces overhead by eliminating 7397 7381 repeated ioctl calls for setting and/or getting register values. This is ··· 7437 7421 7438 7422 This capability connects the vcpu to an in-kernel XIVE device. 7439 7423 7424 + .. _cap_enable_vm: 7425 + 7440 7426 7. Capabilities that can be enabled on VMs 7441 7427 ========================================== 7442 7428 7443 7429 There are certain capabilities that change the behavior of the virtual 7444 - machine when enabled. To enable them, please see section 4.37. Below 7445 - you can find a list of capabilities and what their effect on the VM 7446 - is when enabling them. 7430 + machine when enabled. To enable them, please see section 7431 + :ref:`KVM_ENABLE_CAP`. Below you can find a list of capabilities and 7432 + what their effect on the VM is when enabling them. 7447 7433 7448 7434 The following information is provided along with the description: 7449 7435 ··· 8125 8107 or moved memslot isn't reachable, i.e KVM 8126 8108 _may_ invalidate only SPTEs related to the 8127 8109 memslot. 
8110 + 8111 + KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the 8112 + vCPU's MSR_IA32_PERF_CAPABILITIES (0x345), 8113 + MSR_IA32_ARCH_CAPABILITIES (0x10a), 8114 + MSR_PLATFORM_INFO (0xce), and all VMX MSRs 8115 + (0x480..0x492) to the maximal capabilities 8116 + supported by KVM. KVM also sets 8117 + MSR_IA32_UCODE_REV (0x8b) to an arbitrary 8118 + value (which is different for Intel vs. 8119 + AMD). Lastly, when guest CPUID is set (by 8120 + userspace), KVM modifies select VMX MSR 8121 + fields to force consistency between guest 8122 + CPUID and L2's effective ISA. When this 8123 + quirk is disabled, KVM zeroes the vCPU's MSR 8124 + values (with two exceptions, see below), 8125 + i.e. treats the feature MSRs like CPUID 8126 + leaves and gives userspace full control of 8127 + the vCPU model definition. This quirk does 8128 + not affect VMX MSRs CR0/CR4_FIXED1 (0x487 8129 + and 0x489), as KVM does now allow them to 8130 + be set by userspace (KVM sets them based on 8131 + guest CPUID, for safety purposes). 8128 8132 =================================== ============================================ 8129 8133 8130 8134 7.32 KVM_CAP_MAX_VCPU_ID ··· 8627 8587 guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf 8628 8588 (0x40000001). Otherwise, a guest may use the paravirtual features 8629 8589 regardless of what has actually been exposed through the CPUID leaf. 8590 + 8591 + .. _KVM_CAP_DIRTY_LOG_RING: 8630 8592 8631 8593 8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL 8632 8594 ----------------------------------------------------------

+41 -39

Documentation/virt/kvm/locking.rst

···
 For direct sp, we can easily avoid it since the spte of direct sp is fixed
 to gfn. For indirect sp, we disabled fast page fault for simplicity.

-A solution for indirect sp could be to pin the gfn, for example via
-gfn_to_pfn_memslot_atomic, before the cmpxchg. After the pinning:
+A solution for indirect sp could be to pin the gfn before the cmpxchg. After
+the pinning:

 - We have held the refcount of pfn; that means the pfn can not be freed and
   be reused for another gfn.
···

 2) Dirty bit tracking

-In the origin code, the spte can be fast updated (non-atomically) if the
+In the original code, the spte can be fast updated (non-atomically) if the
 spte is read-only and the Accessed bit has already been set since the
 Accessed bit and Dirty bit can not be lost.

 But it is not true after fast page fault since the spte can be marked
 writable between reading spte and updating spte. Like below case:

-+------------------------------------------------------------------------+
-| At the beginning::                                                     |
-|                                                                        |
-|    spte.W = 0                                                          |
-|    spte.Accessed = 1                                                   |
-+------------------------------------+-----------------------------------+
-| CPU 0:                             | CPU 1:                            |
-+------------------------------------+-----------------------------------+
-| In mmu_spte_clear_track_bits()::   |                                   |
-|                                    |                                   |
-|  old_spte = *spte;                 |                                   |
-|                                    |                                   |
-|                                    |                                   |
-|  /* 'if' condition is satisfied. */|                                   |
-|  if (old_spte.Accessed == 1 &&     |                                   |
-|       old_spte.W == 0)             |                                   |
-|     spte = 0ull;                   |                                   |
-+------------------------------------+-----------------------------------+
-|                                    | on fast page fault path::         |
-|                                    |                                   |
-|                                    |    spte.W = 1                     |
-|                                    |                                   |
-|                                    | memory write on the spte::        |
-|                                    |                                   |
-|                                    |    spte.Dirty = 1                 |
-+------------------------------------+-----------------------------------+
-|  ::                                |                                   |
-|                                    |                                   |
-|   else                             |                                   |
-|     old_spte = xchg(spte, 0ull)    |                                   |
-|   if (old_spte.Accessed == 1)      |                                   |
-|     kvm_set_pfn_accessed(spte.pfn);|                                   |
-|   if (old_spte.Dirty == 1)         |                                   |
-|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
-|     OOPS!!!                        |                                   |
-+------------------------------------+-----------------------------------+
++-------------------------------------------------------------------------+
+| At the beginning::                                                      |
+|                                                                         |
+|    spte.W = 0                                                           |
+|    spte.Accessed = 1                                                    |
++-------------------------------------+-----------------------------------+
+| CPU 0:                              | CPU 1:                            |
++-------------------------------------+-----------------------------------+
+| In mmu_spte_update()::              |                                   |
+|                                     |                                   |
+|  old_spte = *spte;                  |                                   |
+|                                     |                                   |
+|                                     |                                   |
+|  /* 'if' condition is satisfied. */ |                                   |
+|  if (old_spte.Accessed == 1 &&      |                                   |
+|       old_spte.W == 0)              |                                   |
+|     spte = new_spte;                |                                   |
++-------------------------------------+-----------------------------------+
+|                                     | on fast page fault path::         |
+|                                     |                                   |
+|                                     |    spte.W = 1                     |
+|                                     |                                   |
+|                                     | memory write on the spte::        |
+|                                     |                                   |
+|                                     |    spte.Dirty = 1                 |
++-------------------------------------+-----------------------------------+
+|  ::                                 |                                   |
+|                                     |                                   |
+|   else                              |                                   |
+|     old_spte = xchg(spte, new_spte);|                                   |
+|   if (old_spte.Accessed &&          |                                   |
+|       !new_spte.Accessed)           |                                   |
+|     flush = true;                   |                                   |
+|   if (old_spte.Dirty &&             |                                   |
+|       !new_spte.Dirty)              |                                   |
+|     flush = true;                   |                                   |
+|     OOPS!!!                         |                                   |
++-------------------------------------+-----------------------------------+

 The Dirty bit is lost in this case.

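The lost Dirty bit in the table can be replayed deterministically with C11 atomics. This is a single-threaded sketch, not kernel code: the bit positions and function names are hypothetical, and "CPU 1" is simulated by a helper called at the point where the race window would be.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define SPTE_W		(UINT64_C(1) << 1)
#define SPTE_ACCESSED	(UINT64_C(1) << 5)
#define SPTE_DIRTY	(UINT64_C(1) << 6)

/* Stand-in for CPU 1: fast page fault sets W, then a guest write sets Dirty. */
static void cpu1_fault_and_write(_Atomic uint64_t *spte)
{
	atomic_fetch_or(spte, SPTE_W | SPTE_DIRTY);
}

/*
 * Racy update: CPU 0 snapshots the spte, CPU 1 runs in the window, and CPU 0
 * then overwrites the spte based on the stale snapshot.
 * Returns whether CPU 0 observed the Dirty bit.
 */
static bool update_with_plain_store(_Atomic uint64_t *spte, uint64_t new_spte)
{
	uint64_t old_spte = atomic_load(spte);

	cpu1_fault_and_write(spte);		/* lands inside the race window */
	atomic_store(spte, new_spte);
	return (old_spte & SPTE_DIRTY) != 0;
}

/*
 * Atomic update: read and write are one step, so CPU 1's change either
 * happens before the exchange (and is returned in old_spte) or after it.
 */
static bool update_with_xchg(_Atomic uint64_t *spte, uint64_t new_spte)
{
	uint64_t old_spte;

	cpu1_fault_and_write(spte);		/* ordered before the exchange */
	old_spte = atomic_exchange(spte, new_spte);
	return (old_spte & SPTE_DIRTY) != 0;
}
```

Starting from an spte with only the Accessed bit set, the plain-store variant reports that Dirty was never seen even though CPU 1 set it, while the exchange variant returns the bit.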

+12

Documentation/virt/kvm/x86/errata.rst

···
 to be present likely predates these CPUID feature bits, and therefore
 doesn't know to check for them anyway.

+``KVM_SET_VCPU_EVENTS`` issue
+-----------------------------
+
+Invalid KVM_SET_VCPU_EVENTS input with respect to error codes *may* result in
+failed VM-Entry on Intel CPUs. Pre-CET Intel CPUs require that exception
+injection through the VMCS correctly set the "error code valid" flag, e.g.
+require the flag be set when injecting a #GP, clear when injecting a #UD,
+clear when injecting a soft exception, etc. Intel CPUs that enumerate
+IA32_VMX_BASIC[56] as '1' relax VMX's consistency checks, and AMD CPUs have no
+restrictions whatsoever. KVM_SET_VCPU_EVENTS doesn't sanity check the vector
+versus "has_error_code", i.e. KVM's ABI follows AMD behavior.
+
 Nested virtualization features
 ------------------------------

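Because KVM does not check the vector against "has_error_code", userspace may want its own check before calling KVM_SET_VCPU_EVENTS. The sketch below is an assumption about how such a check could look, not something KVM provides: the function names are hypothetical, and the vector list covers the commonly documented x86 hardware exceptions that push an error code (#DF, #TS, #NP, #SS, #GP, #PF, #AC, #CP), which should be verified against the vendor manuals for the CPUs in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hardware exception vectors that deliver an error code (assumed list). */
static bool x86_vector_has_error_code(uint8_t vector)
{
	switch (vector) {
	case 8:		/* #DF */
	case 10:	/* #TS */
	case 11:	/* #NP */
	case 12:	/* #SS */
	case 13:	/* #GP */
	case 14:	/* #PF */
	case 17:	/* #AC */
	case 21:	/* #CP */
		return true;
	default:
		return false;
	}
}

/*
 * Hypothetical pre-flight check: soft exceptions never carry an error code;
 * hardware exceptions must match the per-vector expectation.
 */
static bool error_code_flag_consistent(uint8_t vector, bool has_error_code,
				       bool soft)
{
	if (soft)
		return !has_error_code;
	return has_error_code == x86_vector_has_error_code(vector);
}
```

For example, injecting #GP (vector 13) with the flag set passes the check, while injecting #UD (vector 6) with the flag set does not.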

+1

arch/arm64/include/asm/cpu.h

···
 	u64		reg_revidr;
 	u64		reg_gmid;
 	u64		reg_smidr;
+	u64		reg_mpamidr;

 	u64		reg_id_aa64dfr0;
 	u64		reg_id_aa64dfr1;

+5

arch/arm64/include/asm/cpucaps.h

···
 		return IS_ENABLED(CONFIG_ARM64_WORKAROUND_REPEAT_TLBI);
 	case ARM64_WORKAROUND_SPECULATIVE_SSBS:
 		return IS_ENABLED(CONFIG_ARM64_ERRATUM_3194386);
+	case ARM64_MPAM:
+		/*
+		 * KVM MPAM support doesn't rely on the host kernel supporting MPAM.
+		 */
+		return true;
 	}

 	return true;

+17

arch/arm64/include/asm/cpufeature.h

···
 	return val > 0;
 }

+static inline bool id_aa64pfr0_mpam(u64 pfr0)
+{
+	u32 val = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT);
+
+	return val > 0;
+}
+
 static inline bool id_aa64pfr1_mte(u64 pfr1)
 {
 	u32 val = cpuid_feature_extract_unsigned_field(pfr1, ID_AA64PFR1_EL1_MTE_SHIFT);
···
 {
 	return IS_ENABLED(CONFIG_ARM64_HAFT) &&
 		cpus_have_final_cap(ARM64_HAFT);
+}
+
+static __always_inline bool system_supports_mpam(void)
+{
+	return alternative_has_cap_unlikely(ARM64_MPAM);
+}
+
+static __always_inline bool system_supports_mpam_hcr(void)
+{
+	return alternative_has_cap_unlikely(ARM64_MPAM_HCR);
 }

 int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
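The `id_aa64pfr0_mpam()` helper added above reduces to extracting a 4-bit unsigned field and testing it for non-zero. A standalone sketch of that idea follows. The function and macro names here are hypothetical stand-ins rather than the kernel's, and the shift of 40 is taken from the `[43-40]` entry for MPAM in the cpu-feature-registers table earlier in this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ID_AA64PFR0_EL1.MPAM occupies bits [43:40] per the documentation table. */
#define PFR0_MPAM_SHIFT	40
#define PFR0_FIELD_WIDTH	4

/* Extract an unsigned field of 'width' bits starting at bit 'shift'. */
static uint32_t extract_unsigned_field(uint64_t reg, unsigned int shift,
				       unsigned int width)
{
	assert(width > 0 && width <= 32 && shift + width <= 64);
	return (uint32_t)((reg >> shift) & ((UINT64_C(1) << width) - 1));
}

/* Any non-zero value in the MPAM field means the feature is implemented. */
static bool pfr0_has_mpam(uint64_t pfr0)
{
	return extract_unsigned_field(pfr0, PFR0_MPAM_SHIFT,
				      PFR0_FIELD_WIDTH) > 0;
}
```

A register value with only bit 40 set reports MPAM as present; a value with bits set only in the neighbouring SVE field `[35:32]` does not.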

+14

arch/arm64/include/asm/el2_setup.h

···
 	msr	spsr_el2, x0
 .endm

+.macro __init_el2_mpam
+	/* Memory Partitioning And Monitoring: disable EL2 traps */
+	mrs	x1, id_aa64pfr0_el1
+	ubfx	x0, x1, #ID_AA64PFR0_EL1_MPAM_SHIFT, #4
+	cbz	x0, .Lskip_mpam_\@		// skip if no MPAM
+	msr_s	SYS_MPAM2_EL2, xzr		// use the default partition
+						// and disable lower traps
+	mrs_s	x0, SYS_MPAMIDR_EL1
+	tbz	x0, #MPAMIDR_EL1_HAS_HCR_SHIFT, .Lskip_mpam_\@	// skip if no MPAMHCR reg
+	msr_s	SYS_MPAMHCR_EL2, xzr		// clear TRAP_MPAMIDR_EL1 -> EL2
+.Lskip_mpam_\@:
+.endm
+
 /**
  * Initialize EL2 registers to sane values. This should be called early on all
  * cores that were booted in EL2. Note that everything gets initialised as
···
 	__init_el2_stage2
 	__init_el2_gicv3
 	__init_el2_hstr
+	__init_el2_mpam
 	__init_el2_nvhe_idregs
 	__init_el2_cptr
 	__init_el2_fgt

+1 -29

arch/arm64/include/asm/kvm_arm.h

···
 #define HCR_HOST_VHE_FLAGS (HCR_RW | HCR_TGE | HCR_E2H)

 #define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En | HCRX_EL2_EnFPM)
+#define MPAMHCR_HOST_FLAGS	0

 /* TCR_EL2 Registers bits */
 #define TCR_EL2_DS		(1UL << 32)
···
 				 GENMASK(23, 22) | \
 				 GENMASK(19, 18) | \
 				 GENMASK(15, 0))
-
-/* Hyp Debug Configuration Register bits */
-#define MDCR_EL2_E2TB_MASK	(UL(0x3))
-#define MDCR_EL2_E2TB_SHIFT	(UL(24))
-#define MDCR_EL2_HPMFZS		(UL(1) << 36)
-#define MDCR_EL2_HPMFZO		(UL(1) << 29)
-#define MDCR_EL2_MTPME		(UL(1) << 28)
-#define MDCR_EL2_TDCC		(UL(1) << 27)
-#define MDCR_EL2_HLP		(UL(1) << 26)
-#define MDCR_EL2_HCCD		(UL(1) << 23)
-#define MDCR_EL2_TTRF		(UL(1) << 19)
-#define MDCR_EL2_HPMD		(UL(1) << 17)
-#define MDCR_EL2_TPMS		(UL(1) << 14)
-#define MDCR_EL2_E2PB_MASK	(UL(0x3))
-#define MDCR_EL2_E2PB_SHIFT	(UL(12))
-#define MDCR_EL2_TDRA		(UL(1) << 11)
-#define MDCR_EL2_TDOSA		(UL(1) << 10)
-#define MDCR_EL2_TDA		(UL(1) << 9)
-#define MDCR_EL2_TDE		(UL(1) << 8)
-#define MDCR_EL2_HPME		(UL(1) << 7)
-#define MDCR_EL2_TPM		(UL(1) << 6)
-#define MDCR_EL2_TPMCR		(UL(1) << 5)
-#define MDCR_EL2_HPMN_MASK	(UL(0x1F))
-#define MDCR_EL2_RES0		(GENMASK(63, 37) | \
-				 GENMASK(35, 30) | \
-				 GENMASK(25, 24) | \
-				 GENMASK(22, 20) | \
-				 BIT(18) | \
-				 GENMASK(16, 15))

 /*
  * FGT register definitions

-1

arch/arm64/include/asm/kvm_asm.h

···
 	__KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
 	__KVM_HOST_SMCCC_FUNC___vgic_v3_save_vmcr_aprs,
 	__KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs,
-	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_init_traps,
 	__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
 	__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
 	__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,

+9

arch/arm64/include/asm/kvm_emulate.h

···
 	return vcpu_has_nv(vcpu) && __is_hyp_ctxt(&vcpu->arch.ctxt);
 }

+static inline bool vcpu_is_host_el0(const struct kvm_vcpu *vcpu)
+{
+	return is_hyp_ctxt(vcpu) && !vcpu_is_el2(vcpu);
+}
+
 /*
  * The layout of SPSR for an AArch32 state is different when observed from an
  * AArch64 SPSR_ELx or an AArch32 SPSR_*. This function generates the AArch32
···
 	return __guest_hyp_cptr_xen_trap_enabled(vcpu, ZEN);
 }

+static inline void kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
+{
+	vcpu_set_flag(vcpu, GUEST_HAS_PTRAUTH);
+}
 #endif /* __ARM64_KVM_EMULATE_H__ */

+36 -10

arch/arm64/include/asm/kvm_host.h

··· 74 74 static inline enum kvm_mode kvm_get_mode(void) { return KVM_MODE_NONE; }; 75 75 #endif 76 76 77 - DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use); 78 - 79 77 extern unsigned int __ro_after_init kvm_sve_max_vl; 80 78 extern unsigned int __ro_after_init kvm_host_sve_max_vl; 81 79 int __init kvm_arm_init_sve(void); ··· 372 374 373 375 u64 ctr_el0; 374 376 375 - /* Masks for VNCR-baked sysregs */ 377 + /* Masks for VNCR-backed and general EL2 sysregs */ 376 378 struct kvm_sysreg_masks *sysreg_masks; 377 379 378 380 /* ··· 405 407 __before_##r, \ 406 408 r = __VNCR_START__ + ((VNCR_ ## r) / 8), \ 407 409 __after_##r = __MAX__(__before_##r - 1, r) 410 + 411 + #define MARKER(m) \ 412 + m, __after_##m = m - 1 408 413 409 414 enum vcpu_sysreg { 410 415 __INVALID_SYSREG__, /* 0 is reserved as an invalid value */ ··· 469 468 /* EL2 registers */ 470 469 SCTLR_EL2, /* System Control Register (EL2) */ 471 470 ACTLR_EL2, /* Auxiliary Control Register (EL2) */ 472 - MDCR_EL2, /* Monitor Debug Configuration Register (EL2) */ 473 471 CPTR_EL2, /* Architectural Feature Trap Register (EL2) */ 474 472 HACR_EL2, /* Hypervisor Auxiliary Control Register */ 475 473 ZCR_EL2, /* SVE Control Register (EL2) */ 476 474 TTBR0_EL2, /* Translation Table Base Register 0 (EL2) */ 477 475 TTBR1_EL2, /* Translation Table Base Register 1 (EL2) */ 478 476 TCR_EL2, /* Translation Control Register (EL2) */ 477 + PIRE0_EL2, /* Permission Indirection Register 0 (EL2) */ 478 + PIR_EL2, /* Permission Indirection Register 1 (EL2) */ 479 + POR_EL2, /* Permission Overlay Register 2 (EL2) */ 479 480 SPSR_EL2, /* EL2 saved program status register */ 480 481 ELR_EL2, /* EL2 exception link register */ 481 482 AFSR0_EL2, /* Auxiliary Fault Status Register 0 (EL2) */ ··· 497 494 CNTHV_CTL_EL2, 498 495 CNTHV_CVAL_EL2, 499 496 500 - __VNCR_START__, /* Any VNCR-capable reg goes after this point */ 497 + /* Anything from this can be RES0/RES1 sanitised */ 498 + MARKER(__SANITISED_REG_START__), 499 + 
TCR2_EL2, /* Extended Translation Control Register (EL2) */ 500 + MDCR_EL2, /* Monitor Debug Configuration Register (EL2) */ 501 + 502 + /* Any VNCR-capable reg goes after this point */ 503 + MARKER(__VNCR_START__), 501 504 502 505 VNCR(SCTLR_EL1),/* System Control Register */ 503 506 VNCR(ACTLR_EL1),/* Auxiliary Control Register */ ··· 563 554 struct { 564 555 u64 res0; 565 556 u64 res1; 566 - } mask[NR_SYS_REGS - __VNCR_START__]; 557 + } mask[NR_SYS_REGS - __SANITISED_REG_START__]; 567 558 }; 568 559 569 560 struct kvm_cpu_context { ··· 1011 1002 1012 1003 #define ctxt_sys_reg(c,r) (*__ctxt_sys_reg(c,r)) 1013 1004 1014 - u64 kvm_vcpu_sanitise_vncr_reg(const struct kvm_vcpu *, enum vcpu_sysreg); 1005 + u64 kvm_vcpu_apply_reg_masks(const struct kvm_vcpu *, enum vcpu_sysreg, u64); 1015 1006 #define __vcpu_sys_reg(v,r) \ 1016 1007 (*({ \ 1017 1008 const struct kvm_cpu_context *ctxt = &(v)->arch.ctxt; \ 1018 1009 u64 *__r = __ctxt_sys_reg(ctxt, (r)); \ 1019 - if (vcpu_has_nv((v)) && (r) >= __VNCR_START__) \ 1020 - *__r = kvm_vcpu_sanitise_vncr_reg((v), (r)); \ 1010 + if (vcpu_has_nv((v)) && (r) >= __SANITISED_REG_START__) \ 1011 + *__r = kvm_vcpu_apply_reg_masks((v), (r), *__r);\ 1021 1012 __r; \ 1022 1013 })) 1023 1014 ··· 1046 1037 case TTBR0_EL1: *val = read_sysreg_s(SYS_TTBR0_EL12); break; 1047 1038 case TTBR1_EL1: *val = read_sysreg_s(SYS_TTBR1_EL12); break; 1048 1039 case TCR_EL1: *val = read_sysreg_s(SYS_TCR_EL12); break; 1040 + case TCR2_EL1: *val = read_sysreg_s(SYS_TCR2_EL12); break; 1041 + case PIR_EL1: *val = read_sysreg_s(SYS_PIR_EL12); break; 1042 + case PIRE0_EL1: *val = read_sysreg_s(SYS_PIRE0_EL12); break; 1043 + case POR_EL1: *val = read_sysreg_s(SYS_POR_EL12); break; 1049 1044 case ESR_EL1: *val = read_sysreg_s(SYS_ESR_EL12); break; 1050 1045 case AFSR0_EL1: *val = read_sysreg_s(SYS_AFSR0_EL12); break; 1051 1046 case AFSR1_EL1: *val = read_sysreg_s(SYS_AFSR1_EL12); break; ··· 1096 1083 case TTBR0_EL1: write_sysreg_s(val, SYS_TTBR0_EL12); break; 1097 
1084 case TTBR1_EL1: write_sysreg_s(val, SYS_TTBR1_EL12); break; 1098 1085 case TCR_EL1: write_sysreg_s(val, SYS_TCR_EL12); break; 1086 + case TCR2_EL1: write_sysreg_s(val, SYS_TCR2_EL12); break; 1087 + case PIR_EL1: write_sysreg_s(val, SYS_PIR_EL12); break; 1088 + case PIRE0_EL1: write_sysreg_s(val, SYS_PIRE0_EL12); break; 1089 + case POR_EL1: write_sysreg_s(val, SYS_POR_EL12); break; 1099 1090 case ESR_EL1: write_sysreg_s(val, SYS_ESR_EL12); break; 1100 1091 case AFSR0_EL1: write_sysreg_s(val, SYS_AFSR0_EL12); break; 1101 1092 case AFSR1_EL1: write_sysreg_s(val, SYS_AFSR1_EL12); break; ··· 1157 1140 void kvm_arm_halt_guest(struct kvm *kvm); 1158 1141 void kvm_arm_resume_guest(struct kvm *kvm); 1159 1142 1160 - #define vcpu_has_run_once(vcpu) !!rcu_access_pointer((vcpu)->pid) 1143 + #define vcpu_has_run_once(vcpu) (!!READ_ONCE((vcpu)->pid)) 1161 1144 1162 1145 #ifndef __KVM_NVHE_HYPERVISOR__ 1163 1146 #define kvm_call_hyp_nvhe(f, ...) \ ··· 1519 1502 #define kvm_has_fpmr(k) \ 1520 1503 (system_supports_fpmr() && \ 1521 1504 kvm_has_feat((k), ID_AA64PFR2_EL1, FPMR, IMP)) 1505 + 1506 + #define kvm_has_tcr2(k) \ 1507 + (kvm_has_feat((k), ID_AA64MMFR3_EL1, TCRX, IMP)) 1508 + 1509 + #define kvm_has_s1pie(k) \ 1510 + (kvm_has_feat((k), ID_AA64MMFR3_EL1, S1PIE, IMP)) 1511 + 1512 + #define kvm_has_s1poe(k) \ 1513 + (kvm_has_feat((k), ID_AA64MMFR3_EL1, S1POE, IMP)) 1522 1514 1523 1515 #endif /* __ARM64_KVM_HOST_H__ */
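The `MARKER()` macro and the resized `mask[]` array in the hunk above can be read as a small enum trick: the marker takes the next enum value, then a throwaway enumerator rewinds the counter, so the first real register after the marker shares its value. The following is a minimal standalone sketch of that idea, not the kernel code. All `demo_`/`DEMO_` names are hypothetical, and the masking rule (clear the RES0 bits, set the RES1 bits) is an assumption about what `kvm_vcpu_apply_reg_masks()` does, since its body is not part of this diff.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as MARKER() in the hunk above, with non-reserved names. */
#define MARKER(m) m, after_##m = m - 1

enum demo_sysreg {
	DEMO_INVALID,                 /* 0 */
	DEMO_PLAIN_A,                 /* 1: never sanitised */
	DEMO_PLAIN_B,                 /* 2: never sanitised */
	MARKER(DEMO_SANITISED_START), /* 3, then the counter is rewound to 2 */
	DEMO_SAN_A,                   /* 3: same value as the marker */
	DEMO_SAN_B,                   /* 4 */
	DEMO_NR_REGS,                 /* 5 */
};

/* One res0/res1 pair per register at or after the marker. */
struct demo_masks {
	struct {
		uint64_t res0;
		uint64_t res1;
	} mask[DEMO_NR_REGS - DEMO_SANITISED_START];
};

/* Assumed masking rule: RES0 bits read as zero, RES1 bits read as one. */
static uint64_t demo_apply_reg_masks(const struct demo_masks *m,
				     enum demo_sysreg r, uint64_t v)
{
	if (r < DEMO_SANITISED_START)
		return v; /* registers before the marker are left alone */

	unsigned int i = r - DEMO_SANITISED_START;

	v &= ~m->mask[i].res0;
	v |= m->mask[i].res1;
	return v;
}
```

Because the marker does not consume an enum slot, placing `MDCR_EL2` and `TCR2_EL2` after it costs only their own array entries.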

+1 -3

arch/arm64/include/asm/kvm_pgtable.h

··· 674 674 * 675 675 * If there is a valid, leaf page-table entry used to translate @addr, then 676 676 * set the access flag in that entry. 677 - * 678 - * Return: The old page-table entry prior to setting the flag, 0 on failure. 679 677 */ 680 - kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr); 678 + void kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr); 681 679 682 680 /** 683 681 * kvm_pgtable_stage2_test_clear_young() - Test and optionally clear the access

-12

arch/arm64/include/asm/sysreg.h

···
542 542
543 543 #define SYS_MAIR_EL2 sys_reg(3, 4, 10, 2, 0)
544 544 #define SYS_AMAIR_EL2 sys_reg(3, 4, 10, 3, 0)
545 - #define SYS_MPAMHCR_EL2 sys_reg(3, 4, 10, 4, 0)
546 - #define SYS_MPAMVPMV_EL2 sys_reg(3, 4, 10, 4, 1)
547 - #define SYS_MPAM2_EL2 sys_reg(3, 4, 10, 5, 0)
548 - #define __SYS__MPAMVPMx_EL2(x) sys_reg(3, 4, 10, 6, x)
549 - #define SYS_MPAMVPM0_EL2 __SYS__MPAMVPMx_EL2(0)
550 - #define SYS_MPAMVPM1_EL2 __SYS__MPAMVPMx_EL2(1)
551 - #define SYS_MPAMVPM2_EL2 __SYS__MPAMVPMx_EL2(2)
552 - #define SYS_MPAMVPM3_EL2 __SYS__MPAMVPMx_EL2(3)
553 - #define SYS_MPAMVPM4_EL2 __SYS__MPAMVPMx_EL2(4)
554 - #define SYS_MPAMVPM5_EL2 __SYS__MPAMVPMx_EL2(5)
555 - #define SYS_MPAMVPM6_EL2 __SYS__MPAMVPMx_EL2(6)
556 - #define SYS_MPAMVPM7_EL2 __SYS__MPAMVPMx_EL2(7)
557 545
558 546 #define SYS_VBAR_EL2 sys_reg(3, 4, 12, 0, 0)
559 547 #define SYS_RVBAR_EL2 sys_reg(3, 4, 12, 0, 1)

-1

arch/arm64/include/asm/vncr_mapping.h

···
50 50 #define VNCR_VBAR_EL1 0x250
51 51 #define VNCR_TCR2_EL1 0x270
52 52 #define VNCR_PIRE0_EL1 0x290
53 - #define VNCR_PIRE0_EL2 0x298
54 53 #define VNCR_PIR_EL1 0x2A0
55 54 #define VNCR_POR_EL1 0x2A8
56 55 #define VNCR_ICH_LR0_EL2 0x400

+6

arch/arm64/include/uapi/asm/kvm.h

···
484 484 */
485 485 #define KVM_SYSTEM_EVENT_RESET_FLAG_PSCI_RESET2 (1ULL << 0)
486 486
487 + /*
488 + * Shutdown caused by a PSCI v1.3 SYSTEM_OFF2 call.
489 + * Valid only when the system event has a type of KVM_SYSTEM_EVENT_SHUTDOWN.
490 + */
491 + #define KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2 (1ULL << 0)
492 +
487 493 /* run->fail_entry.hardware_entry_failure_reason codes. */
488 494 #define KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED (1ULL << 0)
489 495

+96

arch/arm64/kernel/cpufeature.c

··· 688 688 ARM64_FTR_END, 689 689 }; 690 690 691 + static const struct arm64_ftr_bits ftr_mpamidr[] = { 692 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, MPAMIDR_EL1_PMG_MAX_SHIFT, MPAMIDR_EL1_PMG_MAX_WIDTH, 0), 693 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, MPAMIDR_EL1_VPMR_MAX_SHIFT, MPAMIDR_EL1_VPMR_MAX_WIDTH, 0), 694 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, MPAMIDR_EL1_HAS_HCR_SHIFT, 1, 0), 695 + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, MPAMIDR_EL1_PARTID_MAX_SHIFT, MPAMIDR_EL1_PARTID_MAX_WIDTH, 0), 696 + ARM64_FTR_END, 697 + }; 698 + 691 699 /* 692 700 * Common ftr bits for a 32bit register with all hidden, strict 693 701 * attributes, with 4bit feature fields and a default safe value of ··· 815 807 &id_aa64mmfr2_override), 816 808 ARM64_FTR_REG(SYS_ID_AA64MMFR3_EL1, ftr_id_aa64mmfr3), 817 809 ARM64_FTR_REG(SYS_ID_AA64MMFR4_EL1, ftr_id_aa64mmfr4), 810 + 811 + /* Op1 = 0, CRn = 10, CRm = 4 */ 812 + ARM64_FTR_REG(SYS_MPAMIDR_EL1, ftr_mpamidr), 818 813 819 814 /* Op1 = 1, CRn = 0, CRm = 0 */ 820 815 ARM64_FTR_REG(SYS_GMID_EL1, ftr_gmid), ··· 1178 1167 cpacr_restore(cpacr); 1179 1168 } 1180 1169 1170 + if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) 1171 + init_cpu_ftr_reg(SYS_MPAMIDR_EL1, info->reg_mpamidr); 1172 + 1181 1173 if (id_aa64pfr1_mte(info->reg_id_aa64pfr1)) 1182 1174 init_cpu_ftr_reg(SYS_GMID_EL1, info->reg_gmid); 1183 1175 } ··· 1435 1421 vec_update_vq_map(ARM64_VEC_SME); 1436 1422 1437 1423 cpacr_restore(cpacr); 1424 + } 1425 + 1426 + if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) { 1427 + taint |= check_update_ftr_reg(SYS_MPAMIDR_EL1, cpu, 1428 + info->reg_mpamidr, boot->reg_mpamidr); 1438 1429 } 1439 1430 1440 1431 /* ··· 2408 2389 return !!(cap->type & ARM64_CPUCAP_PANIC_ON_CONFLICT); 2409 2390 } 2410 2391 2392 + static bool 2393 + test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope) 2394 + { 2395 + if (!has_cpuid_feature(entry, scope)) 2396 + return false; 2397 + 2398 + 
/* Check firmware actually enabled MPAM on this cpu. */ 2399 + return (read_sysreg_s(SYS_MPAM1_EL1) & MPAM1_EL1_MPAMEN); 2400 + } 2401 + 2402 + static void 2403 + cpu_enable_mpam(const struct arm64_cpu_capabilities *entry) 2404 + { 2405 + /* 2406 + * Access by the kernel (at EL1) should use the reserved PARTID 2407 + * which is configured unrestricted. This avoids priority-inversion 2408 + * where latency sensitive tasks have to wait for a task that has 2409 + * been throttled to release the lock. 2410 + */ 2411 + write_sysreg_s(0, SYS_MPAM1_EL1); 2412 + } 2413 + 2414 + static bool 2415 + test_has_mpam_hcr(const struct arm64_cpu_capabilities *entry, int scope) 2416 + { 2417 + u64 idr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); 2418 + 2419 + return idr & MPAMIDR_EL1_HAS_HCR; 2420 + } 2421 + 2411 2422 static const struct arm64_cpu_capabilities arm64_features[] = { 2412 2423 { 2413 2424 .capability = ARM64_ALWAYS_BOOT, ··· 2949 2900 #endif 2950 2901 }, 2951 2902 #endif 2903 + { 2904 + .desc = "Memory Partitioning And Monitoring", 2905 + .type = ARM64_CPUCAP_SYSTEM_FEATURE, 2906 + .capability = ARM64_MPAM, 2907 + .matches = test_has_mpam, 2908 + .cpu_enable = cpu_enable_mpam, 2909 + ARM64_CPUID_FIELDS(ID_AA64PFR0_EL1, MPAM, 1) 2910 + }, 2911 + { 2912 + .desc = "Memory Partitioning And Monitoring Virtualisation", 2913 + .type = ARM64_CPUCAP_SYSTEM_FEATURE, 2914 + .capability = ARM64_MPAM_HCR, 2915 + .matches = test_has_mpam_hcr, 2916 + }, 2952 2917 { 2953 2918 .desc = "NV1", 2954 2919 .capability = ARM64_HAS_HCR_NV1, ··· 3499 3436 } 3500 3437 } 3501 3438 3439 + static void verify_mpam_capabilities(void) 3440 + { 3441 + u64 cpu_idr = read_cpuid(ID_AA64PFR0_EL1); 3442 + u64 sys_idr = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1); 3443 + u16 cpu_partid_max, cpu_pmg_max, sys_partid_max, sys_pmg_max; 3444 + 3445 + if (FIELD_GET(ID_AA64PFR0_EL1_MPAM_MASK, cpu_idr) != 3446 + FIELD_GET(ID_AA64PFR0_EL1_MPAM_MASK, sys_idr)) { 3447 + pr_crit("CPU%d: MPAM version mismatch\n", 
smp_processor_id()); 3448 + cpu_die_early(); 3449 + } 3450 + 3451 + cpu_idr = read_cpuid(MPAMIDR_EL1); 3452 + sys_idr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); 3453 + if (FIELD_GET(MPAMIDR_EL1_HAS_HCR, cpu_idr) != 3454 + FIELD_GET(MPAMIDR_EL1_HAS_HCR, sys_idr)) { 3455 + pr_crit("CPU%d: Missing MPAM HCR\n", smp_processor_id()); 3456 + cpu_die_early(); 3457 + } 3458 + 3459 + cpu_partid_max = FIELD_GET(MPAMIDR_EL1_PARTID_MAX, cpu_idr); 3460 + cpu_pmg_max = FIELD_GET(MPAMIDR_EL1_PMG_MAX, cpu_idr); 3461 + sys_partid_max = FIELD_GET(MPAMIDR_EL1_PARTID_MAX, sys_idr); 3462 + sys_pmg_max = FIELD_GET(MPAMIDR_EL1_PMG_MAX, sys_idr); 3463 + if (cpu_partid_max < sys_partid_max || cpu_pmg_max < sys_pmg_max) { 3464 + pr_crit("CPU%d: MPAM PARTID/PMG max values are mismatched\n", smp_processor_id()); 3465 + cpu_die_early(); 3466 + } 3467 + } 3468 + 3502 3469 /* 3503 3470 * Run through the enabled system capabilities and enable() it on this CPU. 3504 3471 * The capabilities were decided based on the available CPUs at the boot time. ··· 3555 3462 3556 3463 if (is_hyp_mode_available()) 3557 3464 verify_hyp_capabilities(); 3465 + 3466 + if (system_supports_mpam()) 3467 + verify_mpam_capabilities(); 3558 3468 } 3559 3469 3560 3470 void check_local_cpu_capabilities(void)
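The late-CPU check in `verify_mpam_capabilities()` above boils down to two rules on MPAMIDR_EL1: the HAS_HCR bit must match the sanitised system value, and the CPU's PARTID/PMG maxima must not be smaller than the system's. The following is a minimal standalone sketch of those two rules, omitting the ID_AA64PFR0_EL1 version comparison. The `demo_`/`DEMO_` names are hypothetical, and the field positions (PARTID_MAX in bits [15:0], HAS_HCR at bit 17, PMG_MAX in bits [39:32]) are assumptions, since the diff only shows the symbolic `MPAMIDR_EL1_*` names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed MPAMIDR_EL1 layout; the diff only uses symbolic names. */
#define DEMO_PARTID_MAX_SHIFT 0
#define DEMO_PARTID_MAX_MASK  0xffffULL
#define DEMO_HAS_HCR_SHIFT    17
#define DEMO_HAS_HCR_MASK     0x1ULL
#define DEMO_PMG_MAX_SHIFT    32
#define DEMO_PMG_MAX_MASK     0xffULL

/* Extract a field given its shift and its (unshifted) mask. */
static uint64_t demo_field(uint64_t reg, unsigned int shift, uint64_t mask)
{
	return (reg >> shift) & mask;
}

/*
 * Returns true if a late CPU with register value cpu_idr is compatible
 * with the system-wide sanitised value sys_idr.
 */
static bool demo_late_cpu_mpam_ok(uint64_t cpu_idr, uint64_t sys_idr)
{
	/* HAS_HCR must match exactly. */
	if (demo_field(cpu_idr, DEMO_HAS_HCR_SHIFT, DEMO_HAS_HCR_MASK) !=
	    demo_field(sys_idr, DEMO_HAS_HCR_SHIFT, DEMO_HAS_HCR_MASK))
		return false;

	/* The CPU may support more PARTIDs/PMGs than the system, never fewer. */
	if (demo_field(cpu_idr, DEMO_PARTID_MAX_SHIFT, DEMO_PARTID_MAX_MASK) <
	    demo_field(sys_idr, DEMO_PARTID_MAX_SHIFT, DEMO_PARTID_MAX_MASK))
		return false;

	if (demo_field(cpu_idr, DEMO_PMG_MAX_SHIFT, DEMO_PMG_MAX_MASK) <
	    demo_field(sys_idr, DEMO_PMG_MAX_SHIFT, DEMO_PMG_MAX_MASK))
		return false;

	return true;
}
```

The asymmetry mirrors the `FTR_LOWER_SAFE` entries in `ftr_mpamidr[]`: the sanitised value is the lowest maximum seen at boot, so a later CPU with a larger range is harmless, while one with a smaller range could be handed an identifier it cannot represent.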

+3

arch/arm64/kernel/cpuinfo.c

···
479 479 if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0))
480 480 __cpuinfo_store_cpu_32bit(&info->aarch32);
481 481
482 + if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0))
483 + info->reg_mpamidr = read_cpuid(MPAMIDR_EL1);
484 +
482 485 cpuinfo_detect_icache_policy(info);
483 486 }
484 487

+1 -2

arch/arm64/kvm/arch_timer.c

···
206 206
207 207 static inline bool userspace_irqchip(struct kvm *kvm)
208 208 {
209 - return static_branch_unlikely(&userspace_irqchip_in_use) &&
210 - unlikely(!irqchip_in_kernel(kvm));
209 + return unlikely(!irqchip_in_kernel(kvm));
211 210 }
212 211
213 212 static void soft_timer_start(struct hrtimer *hrt, u64 ns)

+3 -23

arch/arm64/kvm/arm.c

···
69 69 static bool vgic_present, kvm_arm_initialised;
70 70
71 71 static DEFINE_PER_CPU(unsigned char, kvm_hyp_initialized);
72 - DEFINE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
73 72
74 73 bool is_kvm_arm_initialised(void)
75 74 {
···
502 503
503 504 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
504 505 {
505 - if (vcpu_has_run_once(vcpu) && unlikely(!irqchip_in_kernel(vcpu->kvm)))
506 - static_branch_dec(&userspace_irqchip_in_use);
507 -
508 506 kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_cache);
509 507 kvm_timer_vcpu_terminate(vcpu);
510 508 kvm_pmu_vcpu_destroy(vcpu);
···
844 848 return ret;
845 849 }
846 850
847 - if (!irqchip_in_kernel(kvm)) {
848 - /*
849 - * Tell the rest of the code that there are userspace irqchip
850 - * VMs in the wild.
851 - */
852 - static_branch_inc(&userspace_irqchip_in_use);
853 - }
854 -
855 - /*
856 - * Initialize traps for protected VMs.
857 - * NOTE: Move to run in EL2 directly, rather than via a hypercall, once
858 - * the code is in place for first run initialization at EL2.
859 - */
860 - if (kvm_vm_is_protected(kvm))
861 - kvm_call_hyp_nvhe(__pkvm_vcpu_init_traps, vcpu);
862 -
863 851 mutex_lock(&kvm->arch.config_lock);
864 852 set_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags);
865 853 mutex_unlock(&kvm->arch.config_lock);
···
1057 1077 * state gets updated in kvm_timer_update_run and
1058 1078 * kvm_pmu_update_run below).
1059 1079 */
1060 - if (static_branch_unlikely(&userspace_irqchip_in_use)) {
1080 + if (unlikely(!irqchip_in_kernel(vcpu->kvm))) {
1061 1081 if (kvm_timer_should_notify_user(vcpu) ||
1062 1082 kvm_pmu_should_notify_user(vcpu)) {
1063 1083 *ret = -EINTR;
···
1179 1199 vcpu->mode = OUTSIDE_GUEST_MODE;
1180 1200 isb(); /* Ensure work in x_flush_hwstate is committed */
1181 1201 kvm_pmu_sync_hwstate(vcpu);
1182 - if (static_branch_unlikely(&userspace_irqchip_in_use))
1202 + if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
1183 1203 kvm_timer_sync_user(vcpu);
1184 1204 kvm_vgic_sync_hwstate(vcpu);
1185 1205 local_irq_enable();
···
1225 1245 * we don't want vtimer interrupts to race with syncing the
1226 1246 * timer virtual interrupt state.
1227 1247 */
1228 - if (static_branch_unlikely(&userspace_irqchip_in_use))
1248 + if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
1229 1249 kvm_timer_sync_user(vcpu);
1230 1250
1231 1251 kvm_arch_vcpu_ctxsync_fp(vcpu);

+411 -73

arch/arm64/kvm/at.c

··· 24 24 unsigned int txsz; 25 25 int sl; 26 26 bool hpd; 27 + bool e0poe; 28 + bool poe; 29 + bool pan; 27 30 bool be; 28 31 bool s2; 29 32 }; ··· 40 37 u8 APTable; 41 38 bool UXNTable; 42 39 bool PXNTable; 40 + bool uwxn; 41 + bool uov; 42 + bool ur; 43 + bool uw; 44 + bool ux; 45 + bool pwxn; 46 + bool pov; 47 + bool pr; 48 + bool pw; 49 + bool px; 43 50 }; 44 51 struct { 45 52 u8 fst; ··· 100 87 } 101 88 } 102 89 90 + static bool s1pie_enabled(struct kvm_vcpu *vcpu, enum trans_regime regime) 91 + { 92 + if (!kvm_has_s1pie(vcpu->kvm)) 93 + return false; 94 + 95 + switch (regime) { 96 + case TR_EL2: 97 + case TR_EL20: 98 + return vcpu_read_sys_reg(vcpu, TCR2_EL2) & TCR2_EL2_PIE; 99 + case TR_EL10: 100 + return (__vcpu_sys_reg(vcpu, HCRX_EL2) & HCRX_EL2_TCR2En) && 101 + (__vcpu_sys_reg(vcpu, TCR2_EL1) & TCR2_EL1x_PIE); 102 + default: 103 + BUG(); 104 + } 105 + } 106 + 107 + static void compute_s1poe(struct kvm_vcpu *vcpu, struct s1_walk_info *wi) 108 + { 109 + u64 val; 110 + 111 + if (!kvm_has_s1poe(vcpu->kvm)) { 112 + wi->poe = wi->e0poe = false; 113 + return; 114 + } 115 + 116 + switch (wi->regime) { 117 + case TR_EL2: 118 + case TR_EL20: 119 + val = vcpu_read_sys_reg(vcpu, TCR2_EL2); 120 + wi->poe = val & TCR2_EL2_POE; 121 + wi->e0poe = (wi->regime == TR_EL20) && (val & TCR2_EL2_E0POE); 122 + break; 123 + case TR_EL10: 124 + if (__vcpu_sys_reg(vcpu, HCRX_EL2) & HCRX_EL2_TCR2En) { 125 + wi->poe = wi->e0poe = false; 126 + return; 127 + } 128 + 129 + val = __vcpu_sys_reg(vcpu, TCR2_EL1); 130 + wi->poe = val & TCR2_EL1x_POE; 131 + wi->e0poe = val & TCR2_EL1x_E0POE; 132 + } 133 + } 134 + 103 135 static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi, 104 136 struct s1_walk_result *wr, u64 va) 105 137 { ··· 156 98 157 99 wi->regime = compute_translation_regime(vcpu, op); 158 100 as_el0 = (op == OP_AT_S1E0R || op == OP_AT_S1E0W); 101 + wi->pan = (op == OP_AT_S1E1RP || op == OP_AT_S1E1WP) && 102 + (*vcpu_cpsr(vcpu) & PSR_PAN_BIT); 159 103 160 
104 va55 = va & BIT(55); 161 105 ··· 240 180 (va55 ? 241 181 FIELD_GET(TCR_HPD1, tcr) : 242 182 FIELD_GET(TCR_HPD0, tcr))); 183 + /* R_JHSVW */ 184 + wi->hpd |= s1pie_enabled(vcpu, wi->regime); 185 + 186 + /* Do we have POE? */ 187 + compute_s1poe(vcpu, wi); 188 + 189 + /* R_BVXDG */ 190 + wi->hpd |= (wi->poe || wi->e0poe); 243 191 244 192 /* Someone was silly enough to encode TG0/TG1 differently */ 245 193 if (va55) { ··· 480 412 u64 ttbr1; 481 413 u64 tcr; 482 414 u64 mair; 415 + u64 tcr2; 416 + u64 pir; 417 + u64 pire0; 418 + u64 por_el0; 419 + u64 por_el1; 483 420 u64 sctlr; 484 421 u64 vttbr; 485 422 u64 vtcr; ··· 497 424 config->ttbr1 = read_sysreg_el1(SYS_TTBR1); 498 425 config->tcr = read_sysreg_el1(SYS_TCR); 499 426 config->mair = read_sysreg_el1(SYS_MAIR); 427 + if (cpus_have_final_cap(ARM64_HAS_TCR2)) { 428 + config->tcr2 = read_sysreg_el1(SYS_TCR2); 429 + if (cpus_have_final_cap(ARM64_HAS_S1PIE)) { 430 + config->pir = read_sysreg_el1(SYS_PIR); 431 + config->pire0 = read_sysreg_el1(SYS_PIRE0); 432 + } 433 + if (system_supports_poe()) { 434 + config->por_el1 = read_sysreg_el1(SYS_POR); 435 + config->por_el0 = read_sysreg_s(SYS_POR_EL0); 436 + } 437 + } 500 438 config->sctlr = read_sysreg_el1(SYS_SCTLR); 501 439 config->vttbr = read_sysreg(vttbr_el2); 502 440 config->vtcr = read_sysreg(vtcr_el2); ··· 528 444 write_sysreg_el1(config->ttbr1, SYS_TTBR1); 529 445 write_sysreg_el1(config->tcr, SYS_TCR); 530 446 write_sysreg_el1(config->mair, SYS_MAIR); 447 + if (cpus_have_final_cap(ARM64_HAS_TCR2)) { 448 + write_sysreg_el1(config->tcr2, SYS_TCR2); 449 + if (cpus_have_final_cap(ARM64_HAS_S1PIE)) { 450 + write_sysreg_el1(config->pir, SYS_PIR); 451 + write_sysreg_el1(config->pire0, SYS_PIRE0); 452 + } 453 + if (system_supports_poe()) { 454 + write_sysreg_el1(config->por_el1, SYS_POR); 455 + write_sysreg_s(config->por_el0, SYS_POR_EL0); 456 + } 457 + } 531 458 write_sysreg_el1(config->sctlr, SYS_SCTLR); 532 459 write_sysreg(config->vttbr, vttbr_el2); 533 460 
write_sysreg(config->vtcr, vtcr_el2); ··· 834 739 if (!kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, PAN, PAN3)) 835 740 return false; 836 741 742 + if (s1pie_enabled(vcpu, regime)) 743 + return true; 744 + 837 745 if (regime == TR_EL10) 838 746 sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1); 839 747 else ··· 845 747 return sctlr & SCTLR_EL1_EPAN; 846 748 } 847 749 750 + static void compute_s1_direct_permissions(struct kvm_vcpu *vcpu, 751 + struct s1_walk_info *wi, 752 + struct s1_walk_result *wr) 753 + { 754 + bool wxn; 755 + 756 + /* Non-hierarchical part of AArch64.S1DirectBasePermissions() */ 757 + if (wi->regime != TR_EL2) { 758 + switch (FIELD_GET(PTE_USER | PTE_RDONLY, wr->desc)) { 759 + case 0b00: 760 + wr->pr = wr->pw = true; 761 + wr->ur = wr->uw = false; 762 + break; 763 + case 0b01: 764 + wr->pr = wr->pw = wr->ur = wr->uw = true; 765 + break; 766 + case 0b10: 767 + wr->pr = true; 768 + wr->pw = wr->ur = wr->uw = false; 769 + break; 770 + case 0b11: 771 + wr->pr = wr->ur = true; 772 + wr->pw = wr->uw = false; 773 + break; 774 + } 775 + 776 + /* We don't use px for anything yet, but hey... 
*/ 777 + wr->px = !((wr->desc & PTE_PXN) || wr->uw); 778 + wr->ux = !(wr->desc & PTE_UXN); 779 + } else { 780 + wr->ur = wr->uw = wr->ux = false; 781 + 782 + if (!(wr->desc & PTE_RDONLY)) { 783 + wr->pr = wr->pw = true; 784 + } else { 785 + wr->pr = true; 786 + wr->pw = false; 787 + } 788 + 789 + /* XN maps to UXN */ 790 + wr->px = !(wr->desc & PTE_UXN); 791 + } 792 + 793 + switch (wi->regime) { 794 + case TR_EL2: 795 + case TR_EL20: 796 + wxn = (vcpu_read_sys_reg(vcpu, SCTLR_EL2) & SCTLR_ELx_WXN); 797 + break; 798 + case TR_EL10: 799 + wxn = (__vcpu_sys_reg(vcpu, SCTLR_EL1) & SCTLR_ELx_WXN); 800 + break; 801 + } 802 + 803 + wr->pwxn = wr->uwxn = wxn; 804 + wr->pov = wi->poe; 805 + wr->uov = wi->e0poe; 806 + } 807 + 808 + static void compute_s1_hierarchical_permissions(struct kvm_vcpu *vcpu, 809 + struct s1_walk_info *wi, 810 + struct s1_walk_result *wr) 811 + { 812 + /* Hierarchical part of AArch64.S1DirectBasePermissions() */ 813 + if (wi->regime != TR_EL2) { 814 + switch (wr->APTable) { 815 + case 0b00: 816 + break; 817 + case 0b01: 818 + wr->ur = wr->uw = false; 819 + break; 820 + case 0b10: 821 + wr->pw = wr->uw = false; 822 + break; 823 + case 0b11: 824 + wr->pw = wr->ur = wr->uw = false; 825 + break; 826 + } 827 + 828 + wr->px &= !wr->PXNTable; 829 + wr->ux &= !wr->UXNTable; 830 + } else { 831 + if (wr->APTable & BIT(1)) 832 + wr->pw = false; 833 + 834 + /* XN maps to UXN */ 835 + wr->px &= !wr->UXNTable; 836 + } 837 + } 838 + 839 + #define perm_idx(v, r, i) ((vcpu_read_sys_reg((v), (r)) >> ((i) * 4)) & 0xf) 840 + 841 + #define set_priv_perms(wr, r, w, x) \ 842 + do { \ 843 + (wr)->pr = (r); \ 844 + (wr)->pw = (w); \ 845 + (wr)->px = (x); \ 846 + } while (0) 847 + 848 + #define set_unpriv_perms(wr, r, w, x) \ 849 + do { \ 850 + (wr)->ur = (r); \ 851 + (wr)->uw = (w); \ 852 + (wr)->ux = (x); \ 853 + } while (0) 854 + 855 + #define set_priv_wxn(wr, v) \ 856 + do { \ 857 + (wr)->pwxn = (v); \ 858 + } while (0) 859 + 860 + #define set_unpriv_wxn(wr, v) \ 861 + 
do { \ 862 + (wr)->uwxn = (v); \ 863 + } while (0) 864 + 865 + /* Similar to AArch64.S1IndirectBasePermissions(), without GCS */ 866 + #define set_perms(w, wr, ip) \ 867 + do { \ 868 + /* R_LLZDZ */ \ 869 + switch ((ip)) { \ 870 + case 0b0000: \ 871 + set_ ## w ## _perms((wr), false, false, false); \ 872 + break; \ 873 + case 0b0001: \ 874 + set_ ## w ## _perms((wr), true , false, false); \ 875 + break; \ 876 + case 0b0010: \ 877 + set_ ## w ## _perms((wr), false, false, true ); \ 878 + break; \ 879 + case 0b0011: \ 880 + set_ ## w ## _perms((wr), true , false, true ); \ 881 + break; \ 882 + case 0b0100: \ 883 + set_ ## w ## _perms((wr), false, false, false); \ 884 + break; \ 885 + case 0b0101: \ 886 + set_ ## w ## _perms((wr), true , true , false); \ 887 + break; \ 888 + case 0b0110: \ 889 + set_ ## w ## _perms((wr), true , true , true ); \ 890 + break; \ 891 + case 0b0111: \ 892 + set_ ## w ## _perms((wr), true , true , true ); \ 893 + break; \ 894 + case 0b1000: \ 895 + set_ ## w ## _perms((wr), true , false, false); \ 896 + break; \ 897 + case 0b1001: \ 898 + set_ ## w ## _perms((wr), true , false, false); \ 899 + break; \ 900 + case 0b1010: \ 901 + set_ ## w ## _perms((wr), true , false, true ); \ 902 + break; \ 903 + case 0b1011: \ 904 + set_ ## w ## _perms((wr), false, false, false); \ 905 + break; \ 906 + case 0b1100: \ 907 + set_ ## w ## _perms((wr), true , true , false); \ 908 + break; \ 909 + case 0b1101: \ 910 + set_ ## w ## _perms((wr), false, false, false); \ 911 + break; \ 912 + case 0b1110: \ 913 + set_ ## w ## _perms((wr), true , true , true ); \ 914 + break; \ 915 + case 0b1111: \ 916 + set_ ## w ## _perms((wr), false, false, false); \ 917 + break; \ 918 + } \ 919 + \ 920 + /* R_HJYGR */ \ 921 + set_ ## w ## _wxn((wr), ((ip) == 0b0110)); \ 922 + \ 923 + } while (0) 924 + 925 + static void compute_s1_indirect_permissions(struct kvm_vcpu *vcpu, 926 + struct s1_walk_info *wi, 927 + struct s1_walk_result *wr) 928 + { 929 + u8 up, pp, idx; 930 + 931 + 
idx = pte_pi_index(wr->desc); 932 + 933 + switch (wi->regime) { 934 + case TR_EL10: 935 + pp = perm_idx(vcpu, PIR_EL1, idx); 936 + up = perm_idx(vcpu, PIRE0_EL1, idx); 937 + break; 938 + case TR_EL20: 939 + pp = perm_idx(vcpu, PIR_EL2, idx); 940 + up = perm_idx(vcpu, PIRE0_EL2, idx); 941 + break; 942 + case TR_EL2: 943 + pp = perm_idx(vcpu, PIR_EL2, idx); 944 + up = 0; 945 + break; 946 + } 947 + 948 + set_perms(priv, wr, pp); 949 + 950 + if (wi->regime != TR_EL2) 951 + set_perms(unpriv, wr, up); 952 + else 953 + set_unpriv_perms(wr, false, false, false); 954 + 955 + wr->pov = wi->poe && !(pp & BIT(3)); 956 + wr->uov = wi->e0poe && !(up & BIT(3)); 957 + 958 + /* R_VFPJF */ 959 + if (wr->px && wr->uw) { 960 + set_priv_perms(wr, false, false, false); 961 + set_unpriv_perms(wr, false, false, false); 962 + } 963 + } 964 + 965 + static void compute_s1_overlay_permissions(struct kvm_vcpu *vcpu, 966 + struct s1_walk_info *wi, 967 + struct s1_walk_result *wr) 968 + { 969 + u8 idx, pov_perms, uov_perms; 970 + 971 + idx = FIELD_GET(PTE_PO_IDX_MASK, wr->desc); 972 + 973 + switch (wi->regime) { 974 + case TR_EL10: 975 + pov_perms = perm_idx(vcpu, POR_EL1, idx); 976 + uov_perms = perm_idx(vcpu, POR_EL0, idx); 977 + break; 978 + case TR_EL20: 979 + pov_perms = perm_idx(vcpu, POR_EL2, idx); 980 + uov_perms = perm_idx(vcpu, POR_EL0, idx); 981 + break; 982 + case TR_EL2: 983 + pov_perms = perm_idx(vcpu, POR_EL2, idx); 984 + uov_perms = 0; 985 + break; 986 + } 987 + 988 + if (pov_perms & ~POE_RXW) 989 + pov_perms = POE_NONE; 990 + 991 + if (wi->poe && wr->pov) { 992 + wr->pr &= pov_perms & POE_R; 993 + wr->px &= pov_perms & POE_X; 994 + wr->pw &= pov_perms & POE_W; 995 + } 996 + 997 + if (uov_perms & ~POE_RXW) 998 + uov_perms = POE_NONE; 999 + 1000 + if (wi->e0poe && wr->uov) { 1001 + wr->ur &= uov_perms & POE_R; 1002 + wr->ux &= uov_perms & POE_X; 1003 + wr->uw &= uov_perms & POE_W; 1004 + } 1005 + } 1006 + 1007 + static void compute_s1_permissions(struct kvm_vcpu *vcpu, 1008 + 
struct s1_walk_info *wi, 1009 + struct s1_walk_result *wr) 1010 + { 1011 + bool pan; 1012 + 1013 + if (!s1pie_enabled(vcpu, wi->regime)) 1014 + compute_s1_direct_permissions(vcpu, wi, wr); 1015 + else 1016 + compute_s1_indirect_permissions(vcpu, wi, wr); 1017 + 1018 + if (!wi->hpd) 1019 + compute_s1_hierarchical_permissions(vcpu, wi, wr); 1020 + 1021 + if (wi->poe || wi->e0poe) 1022 + compute_s1_overlay_permissions(vcpu, wi, wr); 1023 + 1024 + /* R_QXXPC */ 1025 + if (wr->pwxn) { 1026 + if (!wr->pov && wr->pw) 1027 + wr->px = false; 1028 + if (wr->pov && wr->px) 1029 + wr->pw = false; 1030 + } 1031 + 1032 + /* R_NPBXC */ 1033 + if (wr->uwxn) { 1034 + if (!wr->uov && wr->uw) 1035 + wr->ux = false; 1036 + if (wr->uov && wr->ux) 1037 + wr->uw = false; 1038 + } 1039 + 1040 + pan = wi->pan && (wr->ur || wr->uw || 1041 + (pan3_enabled(vcpu, wi->regime) && wr->ux)); 1042 + wr->pw &= !pan; 1043 + wr->pr &= !pan; 1044 + } 1045 + 848 1046 static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr) 849 1047 { 850 - bool perm_fail, ur, uw, ux, pr, pw, px; 851 1048 struct s1_walk_result wr = {}; 852 1049 struct s1_walk_info wi = {}; 1050 + bool perm_fail = false; 853 1051 int ret, idx; 854 1052 855 1053 ret = setup_s1_walk(vcpu, op, &wi, &wr, vaddr); ··· 1164 770 if (ret) 1165 771 goto compute_par; 1166 772 1167 - /* FIXME: revisit when adding indirect permission support */ 1168 - /* AArch64.S1DirectBasePermissions() */ 1169 - if (wi.regime != TR_EL2) { 1170 - switch (FIELD_GET(PTE_USER | PTE_RDONLY, wr.desc)) { 1171 - case 0b00: 1172 - pr = pw = true; 1173 - ur = uw = false; 1174 - break; 1175 - case 0b01: 1176 - pr = pw = ur = uw = true; 1177 - break; 1178 - case 0b10: 1179 - pr = true; 1180 - pw = ur = uw = false; 1181 - break; 1182 - case 0b11: 1183 - pr = ur = true; 1184 - pw = uw = false; 1185 - break; 1186 - } 1187 - 1188 - switch (wr.APTable) { 1189 - case 0b00: 1190 - break; 1191 - case 0b01: 1192 - ur = uw = false; 1193 - break; 1194 - case 0b10: 1195 - pw = 
uw = false; 1196 - break; 1197 - case 0b11: 1198 - pw = ur = uw = false; 1199 - break; 1200 - } 1201 - 1202 - /* We don't use px for anything yet, but hey... */ 1203 - px = !((wr.desc & PTE_PXN) || wr.PXNTable || uw); 1204 - ux = !((wr.desc & PTE_UXN) || wr.UXNTable); 1205 - 1206 - if (op == OP_AT_S1E1RP || op == OP_AT_S1E1WP) { 1207 - bool pan; 1208 - 1209 - pan = *vcpu_cpsr(vcpu) & PSR_PAN_BIT; 1210 - pan &= ur || uw || (pan3_enabled(vcpu, wi.regime) && ux); 1211 - pw &= !pan; 1212 - pr &= !pan; 1213 - } 1214 - } else { 1215 - ur = uw = ux = false; 1216 - 1217 - if (!(wr.desc & PTE_RDONLY)) { 1218 - pr = pw = true; 1219 - } else { 1220 - pr = true; 1221 - pw = false; 1222 - } 1223 - 1224 - if (wr.APTable & BIT(1)) 1225 - pw = false; 1226 - 1227 - /* XN maps to UXN */ 1228 - px = !((wr.desc & PTE_UXN) || wr.UXNTable); 1229 - } 1230 - 1231 - perm_fail = false; 773 + compute_s1_permissions(vcpu, &wi, &wr); 1232 774 1233 775 switch (op) { 1234 776 case OP_AT_S1E1RP: 1235 777 case OP_AT_S1E1R: 1236 778 case OP_AT_S1E2R: 1237 - perm_fail = !pr; 779 + perm_fail = !wr.pr; 1238 780 break; 1239 781 case OP_AT_S1E1WP: 1240 782 case OP_AT_S1E1W: 1241 783 case OP_AT_S1E2W: 1242 - perm_fail = !pw; 784 + perm_fail = !wr.pw; 1243 785 break; 1244 786 case OP_AT_S1E0R: 1245 - perm_fail = !ur; 787 + perm_fail = !wr.ur; 1246 788 break; 1247 789 case OP_AT_S1E0W: 1248 - perm_fail = !uw; 790 + perm_fail = !wr.uw; 1249 791 break; 1250 792 case OP_AT_S1E1A: 1251 793 case OP_AT_S1E2A: ··· 1244 914 write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR1_EL1), SYS_TTBR1); 1245 915 write_sysreg_el1(vcpu_read_sys_reg(vcpu, TCR_EL1), SYS_TCR); 1246 916 write_sysreg_el1(vcpu_read_sys_reg(vcpu, MAIR_EL1), SYS_MAIR); 917 + if (kvm_has_tcr2(vcpu->kvm)) { 918 + write_sysreg_el1(vcpu_read_sys_reg(vcpu, TCR2_EL1), SYS_TCR2); 919 + if (kvm_has_s1pie(vcpu->kvm)) { 920 + write_sysreg_el1(vcpu_read_sys_reg(vcpu, PIR_EL1), SYS_PIR); 921 + write_sysreg_el1(vcpu_read_sys_reg(vcpu, PIRE0_EL1), SYS_PIRE0); 922 + } 
923 + if (kvm_has_s1poe(vcpu->kvm)) { 924 + write_sysreg_el1(vcpu_read_sys_reg(vcpu, POR_EL1), SYS_POR); 925 + write_sysreg_s(vcpu_read_sys_reg(vcpu, POR_EL0), SYS_POR_EL0); 926 + } 927 + } 1247 928 write_sysreg_el1(vcpu_read_sys_reg(vcpu, SCTLR_EL1), SYS_SCTLR); 1248 929 __load_stage2(mmu, mmu->arch); 1249 930 ··· 1333 992 * switching context behind everybody's back, disable interrupts... 1334 993 */ 1335 994 scoped_guard(write_lock_irqsave, &vcpu->kvm->mmu_lock) { 1336 - struct kvm_s2_mmu *mmu; 1337 995 u64 val, hcr; 1338 996 bool fail; 1339 - 1340 - mmu = &vcpu->kvm->arch.mmu; 1341 997 1342 998 val = hcr = read_sysreg(hcr_el2); 1343 999 val &= ~HCR_TGE;
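The `perm_idx()` macro and `compute_s1_overlay_permissions()` in the at.c hunk above treat each permission register as sixteen 4-bit fields, pick one by index, and then use it to narrow the base permissions. The following is a minimal standalone sketch of that step for a single privilege level. The `demo_`/`DEMO_` names are hypothetical, and the overlay encodings (R = 1, X = 2, W = 4, with values above 7 treated as "no access") are assumptions, since the `POE_*` definitions are not part of this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed overlay encodings; the diff only uses the symbolic POE_* names. */
#define DEMO_POE_NONE 0x0u
#define DEMO_POE_R    0x1u
#define DEMO_POE_X    0x2u
#define DEMO_POE_W    0x4u
#define DEMO_POE_RXW  0x7u

struct demo_perms {
	bool r;
	bool w;
	bool x;
};

/* Select the 4-bit field at position idx, as perm_idx() does. */
static unsigned int demo_perm_idx(uint64_t reg, unsigned int idx)
{
	return (unsigned int)((reg >> (idx * 4)) & 0xf);
}

/* Narrow the base permissions in *p by the overlay selected from por. */
static void demo_apply_overlay(struct demo_perms *p, uint64_t por,
			       unsigned int idx)
{
	unsigned int ov = demo_perm_idx(por, idx);

	/* Encodings outside R/X/W are treated as no access. */
	if (ov & ~DEMO_POE_RXW)
		ov = DEMO_POE_NONE;

	/* An overlay can only remove permissions, never grant them. */
	p->r = p->r && (ov & DEMO_POE_R) != 0;
	p->x = p->x && (ov & DEMO_POE_X) != 0;
	p->w = p->w && (ov & DEMO_POE_W) != 0;
}
```

The sketch combines each flag with `&&` and an explicit `!= 0` test, which avoids depending on how a multi-bit mask such as the X or W bit interacts with a `bool` under `&=`.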

+179 -122

arch/arm64/kvm/emulate-nested.c

··· 16 16 17 17 enum trap_behaviour { 18 18 BEHAVE_HANDLE_LOCALLY = 0, 19 + 19 20 BEHAVE_FORWARD_READ = BIT(0), 20 21 BEHAVE_FORWARD_WRITE = BIT(1), 21 - BEHAVE_FORWARD_ANY = BEHAVE_FORWARD_READ | BEHAVE_FORWARD_WRITE, 22 + BEHAVE_FORWARD_RW = BEHAVE_FORWARD_READ | BEHAVE_FORWARD_WRITE, 23 + 24 + /* Traps that take effect in Host EL0, this is rare! */ 25 + BEHAVE_FORWARD_IN_HOST_EL0 = BIT(2), 22 26 }; 23 27 24 28 struct trap_bits { ··· 83 79 CGT_MDCR_E2TB, 84 80 CGT_MDCR_TDCC, 85 81 86 - CGT_CPACR_E0POE, 87 82 CGT_CPTR_TAM, 88 83 CGT_CPTR_TCPAC, 89 84 ··· 109 106 CGT_HCR_TPU_TOCU, 110 107 CGT_HCR_NV1_nNV2_ENSCXT, 111 108 CGT_MDCR_TPM_TPMCR, 109 + CGT_MDCR_TPM_HPMN, 112 110 CGT_MDCR_TDE_TDA, 113 111 CGT_MDCR_TDE_TDOSA, 114 112 CGT_MDCR_TDE_TDRA, ··· 126 122 CGT_CNTHCTL_EL1PTEN, 127 123 128 124 CGT_CPTR_TTA, 125 + CGT_MDCR_HPMN, 129 126 130 127 /* Must be last */ 131 128 __NR_CGT_GROUP_IDS__ ··· 143 138 .index = HCR_EL2, 144 139 .value = HCR_TID2, 145 140 .mask = HCR_TID2, 146 - .behaviour = BEHAVE_FORWARD_ANY, 141 + .behaviour = BEHAVE_FORWARD_RW, 147 142 }, 148 143 [CGT_HCR_TID3] = { 149 144 .index = HCR_EL2, ··· 167 162 .index = HCR_EL2, 168 163 .value = HCR_TIDCP, 169 164 .mask = HCR_TIDCP, 170 - .behaviour = BEHAVE_FORWARD_ANY, 165 + .behaviour = BEHAVE_FORWARD_RW, 171 166 }, 172 167 [CGT_HCR_TACR] = { 173 168 .index = HCR_EL2, 174 169 .value = HCR_TACR, 175 170 .mask = HCR_TACR, 176 - .behaviour = BEHAVE_FORWARD_ANY, 171 + .behaviour = BEHAVE_FORWARD_RW, 177 172 }, 178 173 [CGT_HCR_TSW] = { 179 174 .index = HCR_EL2, 180 175 .value = HCR_TSW, 181 176 .mask = HCR_TSW, 182 - .behaviour = BEHAVE_FORWARD_ANY, 177 + .behaviour = BEHAVE_FORWARD_RW, 183 178 }, 184 179 [CGT_HCR_TPC] = { /* Also called TCPC when FEAT_DPB is implemented */ 185 180 .index = HCR_EL2, 186 181 .value = HCR_TPC, 187 182 .mask = HCR_TPC, 188 - .behaviour = BEHAVE_FORWARD_ANY, 183 + .behaviour = BEHAVE_FORWARD_RW, 189 184 }, 190 185 [CGT_HCR_TPU] = { 191 186 .index = HCR_EL2, 192 187 .value = 
HCR_TPU, 193 188 .mask = HCR_TPU, 194 - .behaviour = BEHAVE_FORWARD_ANY, 189 + .behaviour = BEHAVE_FORWARD_RW, 195 190 }, 196 191 [CGT_HCR_TTLB] = { 197 192 .index = HCR_EL2, 198 193 .value = HCR_TTLB, 199 194 .mask = HCR_TTLB, 200 - .behaviour = BEHAVE_FORWARD_ANY, 195 + .behaviour = BEHAVE_FORWARD_RW, 201 196 }, 202 197 [CGT_HCR_TVM] = { 203 198 .index = HCR_EL2, ··· 209 204 .index = HCR_EL2, 210 205 .value = HCR_TDZ, 211 206 .mask = HCR_TDZ, 212 - .behaviour = BEHAVE_FORWARD_ANY, 207 + .behaviour = BEHAVE_FORWARD_RW, 213 208 }, 214 209 [CGT_HCR_TRVM] = { 215 210 .index = HCR_EL2, ··· 221 216 .index = HCR_EL2, 222 217 .value = HCR_TLOR, 223 218 .mask = HCR_TLOR, 224 - .behaviour = BEHAVE_FORWARD_ANY, 219 + .behaviour = BEHAVE_FORWARD_RW, 225 220 }, 226 221 [CGT_HCR_TERR] = { 227 222 .index = HCR_EL2, 228 223 .value = HCR_TERR, 229 224 .mask = HCR_TERR, 230 - .behaviour = BEHAVE_FORWARD_ANY, 225 + .behaviour = BEHAVE_FORWARD_RW, 231 226 }, 232 227 [CGT_HCR_APK] = { 233 228 .index = HCR_EL2, 234 229 .value = 0, 235 230 .mask = HCR_APK, 236 - .behaviour = BEHAVE_FORWARD_ANY, 231 + .behaviour = BEHAVE_FORWARD_RW, 237 232 }, 238 233 [CGT_HCR_NV] = { 239 234 .index = HCR_EL2, 240 235 .value = HCR_NV, 241 236 .mask = HCR_NV, 242 - .behaviour = BEHAVE_FORWARD_ANY, 237 + .behaviour = BEHAVE_FORWARD_RW, 243 238 }, 244 239 [CGT_HCR_NV_nNV2] = { 245 240 .index = HCR_EL2, 246 241 .value = HCR_NV, 247 242 .mask = HCR_NV | HCR_NV2, 248 - .behaviour = BEHAVE_FORWARD_ANY, 243 + .behaviour = BEHAVE_FORWARD_RW, 249 244 }, 250 245 [CGT_HCR_NV1_nNV2] = { 251 246 .index = HCR_EL2, 252 247 .value = HCR_NV | HCR_NV1, 253 248 .mask = HCR_NV | HCR_NV1 | HCR_NV2, 254 - .behaviour = BEHAVE_FORWARD_ANY, 249 + .behaviour = BEHAVE_FORWARD_RW, 255 250 }, 256 251 [CGT_HCR_AT] = { 257 252 .index = HCR_EL2, 258 253 .value = HCR_AT, 259 254 .mask = HCR_AT, 260 - .behaviour = BEHAVE_FORWARD_ANY, 255 + .behaviour = BEHAVE_FORWARD_RW, 261 256 }, 262 257 [CGT_HCR_nFIEN] = { 263 258 .index = HCR_EL2, 
264 259 .value = 0, 265 260 .mask = HCR_FIEN, 266 - .behaviour = BEHAVE_FORWARD_ANY, 261 + .behaviour = BEHAVE_FORWARD_RW, 267 262 }, 268 263 [CGT_HCR_TID4] = { 269 264 .index = HCR_EL2, 270 265 .value = HCR_TID4, 271 266 .mask = HCR_TID4, 272 - .behaviour = BEHAVE_FORWARD_ANY, 267 + .behaviour = BEHAVE_FORWARD_RW, 273 268 }, 274 269 [CGT_HCR_TICAB] = { 275 270 .index = HCR_EL2, 276 271 .value = HCR_TICAB, 277 272 .mask = HCR_TICAB, 278 - .behaviour = BEHAVE_FORWARD_ANY, 273 + .behaviour = BEHAVE_FORWARD_RW, 279 274 }, 280 275 [CGT_HCR_TOCU] = { 281 276 .index = HCR_EL2, 282 277 .value = HCR_TOCU, 283 278 .mask = HCR_TOCU, 284 - .behaviour = BEHAVE_FORWARD_ANY, 279 + .behaviour = BEHAVE_FORWARD_RW, 285 280 }, 286 281 [CGT_HCR_ENSCXT] = { 287 282 .index = HCR_EL2, 288 283 .value = 0, 289 284 .mask = HCR_ENSCXT, 290 - .behaviour = BEHAVE_FORWARD_ANY, 285 + .behaviour = BEHAVE_FORWARD_RW, 291 286 }, 292 287 [CGT_HCR_TTLBIS] = { 293 288 .index = HCR_EL2, 294 289 .value = HCR_TTLBIS, 295 290 .mask = HCR_TTLBIS, 296 - .behaviour = BEHAVE_FORWARD_ANY, 291 + .behaviour = BEHAVE_FORWARD_RW, 297 292 }, 298 293 [CGT_HCR_TTLBOS] = { 299 294 .index = HCR_EL2, 300 295 .value = HCR_TTLBOS, 301 296 .mask = HCR_TTLBOS, 302 - .behaviour = BEHAVE_FORWARD_ANY, 297 + .behaviour = BEHAVE_FORWARD_RW, 303 298 }, 304 299 [CGT_MDCR_TPMCR] = { 305 300 .index = MDCR_EL2, 306 301 .value = MDCR_EL2_TPMCR, 307 302 .mask = MDCR_EL2_TPMCR, 308 - .behaviour = BEHAVE_FORWARD_ANY, 303 + .behaviour = BEHAVE_FORWARD_RW | 304 + BEHAVE_FORWARD_IN_HOST_EL0, 309 305 }, 310 306 [CGT_MDCR_TPM] = { 311 307 .index = MDCR_EL2, 312 308 .value = MDCR_EL2_TPM, 313 309 .mask = MDCR_EL2_TPM, 314 - .behaviour = BEHAVE_FORWARD_ANY, 310 + .behaviour = BEHAVE_FORWARD_RW | 311 + BEHAVE_FORWARD_IN_HOST_EL0, 315 312 }, 316 313 [CGT_MDCR_TDE] = { 317 314 .index = MDCR_EL2, 318 315 .value = MDCR_EL2_TDE, 319 316 .mask = MDCR_EL2_TDE, 320 - .behaviour = BEHAVE_FORWARD_ANY, 317 + .behaviour = BEHAVE_FORWARD_RW, 321 318 }, 322 
319 [CGT_MDCR_TDA] = { 323 320 .index = MDCR_EL2, 324 321 .value = MDCR_EL2_TDA, 325 322 .mask = MDCR_EL2_TDA, 326 - .behaviour = BEHAVE_FORWARD_ANY, 323 + .behaviour = BEHAVE_FORWARD_RW, 327 324 }, 328 325 [CGT_MDCR_TDOSA] = { 329 326 .index = MDCR_EL2, 330 327 .value = MDCR_EL2_TDOSA, 331 328 .mask = MDCR_EL2_TDOSA, 332 - .behaviour = BEHAVE_FORWARD_ANY, 329 + .behaviour = BEHAVE_FORWARD_RW, 333 330 }, 334 331 [CGT_MDCR_TDRA] = { 335 332 .index = MDCR_EL2, 336 333 .value = MDCR_EL2_TDRA, 337 334 .mask = MDCR_EL2_TDRA, 338 - .behaviour = BEHAVE_FORWARD_ANY, 335 + .behaviour = BEHAVE_FORWARD_RW, 339 336 }, 340 337 [CGT_MDCR_E2PB] = { 341 338 .index = MDCR_EL2, 342 339 .value = 0, 343 340 .mask = BIT(MDCR_EL2_E2PB_SHIFT), 344 - .behaviour = BEHAVE_FORWARD_ANY, 341 + .behaviour = BEHAVE_FORWARD_RW, 345 342 }, 346 343 [CGT_MDCR_TPMS] = { 347 344 .index = MDCR_EL2, 348 345 .value = MDCR_EL2_TPMS, 349 346 .mask = MDCR_EL2_TPMS, 350 - .behaviour = BEHAVE_FORWARD_ANY, 347 + .behaviour = BEHAVE_FORWARD_RW, 351 348 }, 352 349 [CGT_MDCR_TTRF] = { 353 350 .index = MDCR_EL2, 354 351 .value = MDCR_EL2_TTRF, 355 352 .mask = MDCR_EL2_TTRF, 356 - .behaviour = BEHAVE_FORWARD_ANY, 353 + .behaviour = BEHAVE_FORWARD_RW, 357 354 }, 358 355 [CGT_MDCR_E2TB] = { 359 356 .index = MDCR_EL2, 360 357 .value = 0, 361 358 .mask = BIT(MDCR_EL2_E2TB_SHIFT), 362 - .behaviour = BEHAVE_FORWARD_ANY, 359 + .behaviour = BEHAVE_FORWARD_RW, 363 360 }, 364 361 [CGT_MDCR_TDCC] = { 365 362 .index = MDCR_EL2, 366 363 .value = MDCR_EL2_TDCC, 367 364 .mask = MDCR_EL2_TDCC, 368 - .behaviour = BEHAVE_FORWARD_ANY, 369 - }, 370 - [CGT_CPACR_E0POE] = { 371 - .index = CPTR_EL2, 372 - .value = CPACR_ELx_E0POE, 373 - .mask = CPACR_ELx_E0POE, 374 - .behaviour = BEHAVE_FORWARD_ANY, 365 + .behaviour = BEHAVE_FORWARD_RW, 375 366 }, 376 367 [CGT_CPTR_TAM] = { 377 368 .index = CPTR_EL2, 378 369 .value = CPTR_EL2_TAM, 379 370 .mask = CPTR_EL2_TAM, 380 - .behaviour = BEHAVE_FORWARD_ANY, 371 + .behaviour = BEHAVE_FORWARD_RW, 
381 372 }, 382 373 [CGT_CPTR_TCPAC] = { 383 374 .index = CPTR_EL2, 384 375 .value = CPTR_EL2_TCPAC, 385 376 .mask = CPTR_EL2_TCPAC, 386 - .behaviour = BEHAVE_FORWARD_ANY, 377 + .behaviour = BEHAVE_FORWARD_RW, 387 378 }, 388 379 [CGT_HCRX_EnFPM] = { 389 380 .index = HCRX_EL2, 390 381 .value = 0, 391 382 .mask = HCRX_EL2_EnFPM, 392 - .behaviour = BEHAVE_FORWARD_ANY, 383 + .behaviour = BEHAVE_FORWARD_RW, 393 384 }, 394 385 [CGT_HCRX_TCR2En] = { 395 386 .index = HCRX_EL2, 396 387 .value = 0, 397 388 .mask = HCRX_EL2_TCR2En, 398 - .behaviour = BEHAVE_FORWARD_ANY, 389 + .behaviour = BEHAVE_FORWARD_RW, 399 390 }, 400 391 [CGT_ICH_HCR_TC] = { 401 392 .index = ICH_HCR_EL2, 402 393 .value = ICH_HCR_TC, 403 394 .mask = ICH_HCR_TC, 404 - .behaviour = BEHAVE_FORWARD_ANY, 395 + .behaviour = BEHAVE_FORWARD_RW, 405 396 }, 406 397 [CGT_ICH_HCR_TALL0] = { 407 398 .index = ICH_HCR_EL2, 408 399 .value = ICH_HCR_TALL0, 409 400 .mask = ICH_HCR_TALL0, 410 - .behaviour = BEHAVE_FORWARD_ANY, 401 + .behaviour = BEHAVE_FORWARD_RW, 411 402 }, 412 403 [CGT_ICH_HCR_TALL1] = { 413 404 .index = ICH_HCR_EL2, 414 405 .value = ICH_HCR_TALL1, 415 406 .mask = ICH_HCR_TALL1, 416 - .behaviour = BEHAVE_FORWARD_ANY, 407 + .behaviour = BEHAVE_FORWARD_RW, 417 408 }, 418 409 [CGT_ICH_HCR_TDIR] = { 419 410 .index = ICH_HCR_EL2, 420 411 .value = ICH_HCR_TDIR, 421 412 .mask = ICH_HCR_TDIR, 422 - .behaviour = BEHAVE_FORWARD_ANY, 413 + .behaviour = BEHAVE_FORWARD_RW, 423 414 }, 424 415 }; 425 416 ··· 436 435 MCB(CGT_HCR_TPU_TOCU, CGT_HCR_TPU, CGT_HCR_TOCU), 437 436 MCB(CGT_HCR_NV1_nNV2_ENSCXT, CGT_HCR_NV1_nNV2, CGT_HCR_ENSCXT), 438 437 MCB(CGT_MDCR_TPM_TPMCR, CGT_MDCR_TPM, CGT_MDCR_TPMCR), 438 + MCB(CGT_MDCR_TPM_HPMN, CGT_MDCR_TPM, CGT_MDCR_HPMN), 439 439 MCB(CGT_MDCR_TDE_TDA, CGT_MDCR_TDE, CGT_MDCR_TDA), 440 440 MCB(CGT_MDCR_TDE_TDOSA, CGT_MDCR_TDE, CGT_MDCR_TDOSA), 441 441 MCB(CGT_MDCR_TDE_TDRA, CGT_MDCR_TDE, CGT_MDCR_TDRA), ··· 476 474 if (get_sanitized_cnthctl(vcpu) & (CNTHCTL_EL1PCTEN << 10)) 477 475 return 
BEHAVE_HANDLE_LOCALLY; 478 476 479 - return BEHAVE_FORWARD_ANY; 477 + return BEHAVE_FORWARD_RW; 480 478 } 481 479 482 480 static enum trap_behaviour check_cnthctl_el1pten(struct kvm_vcpu *vcpu) ··· 484 482 if (get_sanitized_cnthctl(vcpu) & (CNTHCTL_EL1PCEN << 10)) 485 483 return BEHAVE_HANDLE_LOCALLY; 486 484 487 - return BEHAVE_FORWARD_ANY; 485 + return BEHAVE_FORWARD_RW; 488 486 } 489 487 490 488 static enum trap_behaviour check_cptr_tta(struct kvm_vcpu *vcpu) ··· 495 493 val = translate_cptr_el2_to_cpacr_el1(val); 496 494 497 495 if (val & CPACR_ELx_TTA) 498 - return BEHAVE_FORWARD_ANY; 496 + return BEHAVE_FORWARD_RW; 497 + 498 + return BEHAVE_HANDLE_LOCALLY; 499 + } 500 + 501 + static enum trap_behaviour check_mdcr_hpmn(struct kvm_vcpu *vcpu) 502 + { 503 + u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu)); 504 + unsigned int idx; 505 + 506 + 507 + switch (sysreg) { 508 + case SYS_PMEVTYPERn_EL0(0) ... SYS_PMEVTYPERn_EL0(30): 509 + case SYS_PMEVCNTRn_EL0(0) ... SYS_PMEVCNTRn_EL0(30): 510 + idx = (sys_reg_CRm(sysreg) & 0x3) << 3 | sys_reg_Op2(sysreg); 511 + break; 512 + case SYS_PMXEVTYPER_EL0: 513 + case SYS_PMXEVCNTR_EL0: 514 + idx = SYS_FIELD_GET(PMSELR_EL0, SEL, 515 + __vcpu_sys_reg(vcpu, PMSELR_EL0)); 516 + break; 517 + default: 518 + /* Someone used this trap helper for something else... 
*/ 519 + KVM_BUG_ON(1, vcpu->kvm); 520 + return BEHAVE_HANDLE_LOCALLY; 521 + } 522 + 523 + if (kvm_pmu_counter_is_hyp(vcpu, idx)) 524 + return BEHAVE_FORWARD_RW | BEHAVE_FORWARD_IN_HOST_EL0; 499 525 500 526 return BEHAVE_HANDLE_LOCALLY; 501 527 } ··· 535 505 CCC(CGT_CNTHCTL_EL1PCTEN, check_cnthctl_el1pcten), 536 506 CCC(CGT_CNTHCTL_EL1PTEN, check_cnthctl_el1pten), 537 507 CCC(CGT_CPTR_TTA, check_cptr_tta), 508 + CCC(CGT_MDCR_HPMN, check_mdcr_hpmn), 538 509 }; 539 510 540 511 /* ··· 742 711 SR_TRAP(SYS_MAIR_EL1, CGT_HCR_TVM_TRVM), 743 712 SR_TRAP(SYS_AMAIR_EL1, CGT_HCR_TVM_TRVM), 744 713 SR_TRAP(SYS_CONTEXTIDR_EL1, CGT_HCR_TVM_TRVM), 714 + SR_TRAP(SYS_PIR_EL1, CGT_HCR_TVM_TRVM), 715 + SR_TRAP(SYS_PIRE0_EL1, CGT_HCR_TVM_TRVM), 716 + SR_TRAP(SYS_POR_EL0, CGT_HCR_TVM_TRVM), 717 + SR_TRAP(SYS_POR_EL1, CGT_HCR_TVM_TRVM), 745 718 SR_TRAP(SYS_TCR2_EL1, CGT_HCR_TVM_TRVM_HCRX_TCR2En), 746 719 SR_TRAP(SYS_DC_ZVA, CGT_HCR_TDZ), 747 720 SR_TRAP(SYS_DC_GVA, CGT_HCR_TDZ), ··· 954 919 SR_TRAP(SYS_PMOVSCLR_EL0, CGT_MDCR_TPM), 955 920 SR_TRAP(SYS_PMCEID0_EL0, CGT_MDCR_TPM), 956 921 SR_TRAP(SYS_PMCEID1_EL0, CGT_MDCR_TPM), 957 - SR_TRAP(SYS_PMXEVTYPER_EL0, CGT_MDCR_TPM), 922 + SR_TRAP(SYS_PMXEVTYPER_EL0, CGT_MDCR_TPM_HPMN), 958 923 SR_TRAP(SYS_PMSWINC_EL0, CGT_MDCR_TPM), 959 924 SR_TRAP(SYS_PMSELR_EL0, CGT_MDCR_TPM), 960 - SR_TRAP(SYS_PMXEVCNTR_EL0, CGT_MDCR_TPM), 925 + SR_TRAP(SYS_PMXEVCNTR_EL0, CGT_MDCR_TPM_HPMN), 961 926 SR_TRAP(SYS_PMCCNTR_EL0, CGT_MDCR_TPM), 962 927 SR_TRAP(SYS_PMUSERENR_EL0, CGT_MDCR_TPM), 963 928 SR_TRAP(SYS_PMINTENSET_EL1, CGT_MDCR_TPM), 964 929 SR_TRAP(SYS_PMINTENCLR_EL1, CGT_MDCR_TPM), 965 930 SR_TRAP(SYS_PMMIR_EL1, CGT_MDCR_TPM), 966 - SR_TRAP(SYS_PMEVCNTRn_EL0(0), CGT_MDCR_TPM), 967 - SR_TRAP(SYS_PMEVCNTRn_EL0(1), CGT_MDCR_TPM), 968 - SR_TRAP(SYS_PMEVCNTRn_EL0(2), CGT_MDCR_TPM), 969 - SR_TRAP(SYS_PMEVCNTRn_EL0(3), CGT_MDCR_TPM), 970 - SR_TRAP(SYS_PMEVCNTRn_EL0(4), CGT_MDCR_TPM), 971 - SR_TRAP(SYS_PMEVCNTRn_EL0(5), CGT_MDCR_TPM), 972 - 
SR_TRAP(SYS_PMEVCNTRn_EL0(6), CGT_MDCR_TPM), 973 - SR_TRAP(SYS_PMEVCNTRn_EL0(7), CGT_MDCR_TPM), 974 - SR_TRAP(SYS_PMEVCNTRn_EL0(8), CGT_MDCR_TPM), 975 - SR_TRAP(SYS_PMEVCNTRn_EL0(9), CGT_MDCR_TPM), 976 - SR_TRAP(SYS_PMEVCNTRn_EL0(10), CGT_MDCR_TPM), 977 - SR_TRAP(SYS_PMEVCNTRn_EL0(11), CGT_MDCR_TPM), 978 - SR_TRAP(SYS_PMEVCNTRn_EL0(12), CGT_MDCR_TPM), 979 - SR_TRAP(SYS_PMEVCNTRn_EL0(13), CGT_MDCR_TPM), 980 - SR_TRAP(SYS_PMEVCNTRn_EL0(14), CGT_MDCR_TPM), 981 - SR_TRAP(SYS_PMEVCNTRn_EL0(15), CGT_MDCR_TPM), 982 - SR_TRAP(SYS_PMEVCNTRn_EL0(16), CGT_MDCR_TPM), 983 - SR_TRAP(SYS_PMEVCNTRn_EL0(17), CGT_MDCR_TPM), 984 - SR_TRAP(SYS_PMEVCNTRn_EL0(18), CGT_MDCR_TPM), 985 - SR_TRAP(SYS_PMEVCNTRn_EL0(19), CGT_MDCR_TPM), 986 - SR_TRAP(SYS_PMEVCNTRn_EL0(20), CGT_MDCR_TPM), 987 - SR_TRAP(SYS_PMEVCNTRn_EL0(21), CGT_MDCR_TPM), 988 - SR_TRAP(SYS_PMEVCNTRn_EL0(22), CGT_MDCR_TPM), 989 - SR_TRAP(SYS_PMEVCNTRn_EL0(23), CGT_MDCR_TPM), 990 - SR_TRAP(SYS_PMEVCNTRn_EL0(24), CGT_MDCR_TPM), 991 - SR_TRAP(SYS_PMEVCNTRn_EL0(25), CGT_MDCR_TPM), 992 - SR_TRAP(SYS_PMEVCNTRn_EL0(26), CGT_MDCR_TPM), 993 - SR_TRAP(SYS_PMEVCNTRn_EL0(27), CGT_MDCR_TPM), 994 - SR_TRAP(SYS_PMEVCNTRn_EL0(28), CGT_MDCR_TPM), 995 - SR_TRAP(SYS_PMEVCNTRn_EL0(29), CGT_MDCR_TPM), 996 - SR_TRAP(SYS_PMEVCNTRn_EL0(30), CGT_MDCR_TPM), 997 - SR_TRAP(SYS_PMEVTYPERn_EL0(0), CGT_MDCR_TPM), 998 - SR_TRAP(SYS_PMEVTYPERn_EL0(1), CGT_MDCR_TPM), 999 - SR_TRAP(SYS_PMEVTYPERn_EL0(2), CGT_MDCR_TPM), 1000 - SR_TRAP(SYS_PMEVTYPERn_EL0(3), CGT_MDCR_TPM), 1001 - SR_TRAP(SYS_PMEVTYPERn_EL0(4), CGT_MDCR_TPM), 1002 - SR_TRAP(SYS_PMEVTYPERn_EL0(5), CGT_MDCR_TPM), 1003 - SR_TRAP(SYS_PMEVTYPERn_EL0(6), CGT_MDCR_TPM), 1004 - SR_TRAP(SYS_PMEVTYPERn_EL0(7), CGT_MDCR_TPM), 1005 - SR_TRAP(SYS_PMEVTYPERn_EL0(8), CGT_MDCR_TPM), 1006 - SR_TRAP(SYS_PMEVTYPERn_EL0(9), CGT_MDCR_TPM), 1007 - SR_TRAP(SYS_PMEVTYPERn_EL0(10), CGT_MDCR_TPM), 1008 - SR_TRAP(SYS_PMEVTYPERn_EL0(11), CGT_MDCR_TPM), 1009 - SR_TRAP(SYS_PMEVTYPERn_EL0(12), CGT_MDCR_TPM), 1010 - 
SR_TRAP(SYS_PMEVTYPERn_EL0(13), CGT_MDCR_TPM), 1011 - SR_TRAP(SYS_PMEVTYPERn_EL0(14), CGT_MDCR_TPM), 1012 - SR_TRAP(SYS_PMEVTYPERn_EL0(15), CGT_MDCR_TPM), 1013 - SR_TRAP(SYS_PMEVTYPERn_EL0(16), CGT_MDCR_TPM), 1014 - SR_TRAP(SYS_PMEVTYPERn_EL0(17), CGT_MDCR_TPM), 1015 - SR_TRAP(SYS_PMEVTYPERn_EL0(18), CGT_MDCR_TPM), 1016 - SR_TRAP(SYS_PMEVTYPERn_EL0(19), CGT_MDCR_TPM), 1017 - SR_TRAP(SYS_PMEVTYPERn_EL0(20), CGT_MDCR_TPM), 1018 - SR_TRAP(SYS_PMEVTYPERn_EL0(21), CGT_MDCR_TPM), 1019 - SR_TRAP(SYS_PMEVTYPERn_EL0(22), CGT_MDCR_TPM), 1020 - SR_TRAP(SYS_PMEVTYPERn_EL0(23), CGT_MDCR_TPM), 1021 - SR_TRAP(SYS_PMEVTYPERn_EL0(24), CGT_MDCR_TPM), 1022 - SR_TRAP(SYS_PMEVTYPERn_EL0(25), CGT_MDCR_TPM), 1023 - SR_TRAP(SYS_PMEVTYPERn_EL0(26), CGT_MDCR_TPM), 1024 - SR_TRAP(SYS_PMEVTYPERn_EL0(27), CGT_MDCR_TPM), 1025 - SR_TRAP(SYS_PMEVTYPERn_EL0(28), CGT_MDCR_TPM), 1026 - SR_TRAP(SYS_PMEVTYPERn_EL0(29), CGT_MDCR_TPM), 1027 - SR_TRAP(SYS_PMEVTYPERn_EL0(30), CGT_MDCR_TPM), 931 + SR_TRAP(SYS_PMEVCNTRn_EL0(0), CGT_MDCR_TPM_HPMN), 932 + SR_TRAP(SYS_PMEVCNTRn_EL0(1), CGT_MDCR_TPM_HPMN), 933 + SR_TRAP(SYS_PMEVCNTRn_EL0(2), CGT_MDCR_TPM_HPMN), 934 + SR_TRAP(SYS_PMEVCNTRn_EL0(3), CGT_MDCR_TPM_HPMN), 935 + SR_TRAP(SYS_PMEVCNTRn_EL0(4), CGT_MDCR_TPM_HPMN), 936 + SR_TRAP(SYS_PMEVCNTRn_EL0(5), CGT_MDCR_TPM_HPMN), 937 + SR_TRAP(SYS_PMEVCNTRn_EL0(6), CGT_MDCR_TPM_HPMN), 938 + SR_TRAP(SYS_PMEVCNTRn_EL0(7), CGT_MDCR_TPM_HPMN), 939 + SR_TRAP(SYS_PMEVCNTRn_EL0(8), CGT_MDCR_TPM_HPMN), 940 + SR_TRAP(SYS_PMEVCNTRn_EL0(9), CGT_MDCR_TPM_HPMN), 941 + SR_TRAP(SYS_PMEVCNTRn_EL0(10), CGT_MDCR_TPM_HPMN), 942 + SR_TRAP(SYS_PMEVCNTRn_EL0(11), CGT_MDCR_TPM_HPMN), 943 + SR_TRAP(SYS_PMEVCNTRn_EL0(12), CGT_MDCR_TPM_HPMN), 944 + SR_TRAP(SYS_PMEVCNTRn_EL0(13), CGT_MDCR_TPM_HPMN), 945 + SR_TRAP(SYS_PMEVCNTRn_EL0(14), CGT_MDCR_TPM_HPMN), 946 + SR_TRAP(SYS_PMEVCNTRn_EL0(15), CGT_MDCR_TPM_HPMN), 947 + SR_TRAP(SYS_PMEVCNTRn_EL0(16), CGT_MDCR_TPM_HPMN), 948 + SR_TRAP(SYS_PMEVCNTRn_EL0(17), CGT_MDCR_TPM_HPMN), 949 + 
SR_TRAP(SYS_PMEVCNTRn_EL0(18), CGT_MDCR_TPM_HPMN), 950 + SR_TRAP(SYS_PMEVCNTRn_EL0(19), CGT_MDCR_TPM_HPMN), 951 + SR_TRAP(SYS_PMEVCNTRn_EL0(20), CGT_MDCR_TPM_HPMN), 952 + SR_TRAP(SYS_PMEVCNTRn_EL0(21), CGT_MDCR_TPM_HPMN), 953 + SR_TRAP(SYS_PMEVCNTRn_EL0(22), CGT_MDCR_TPM_HPMN), 954 + SR_TRAP(SYS_PMEVCNTRn_EL0(23), CGT_MDCR_TPM_HPMN), 955 + SR_TRAP(SYS_PMEVCNTRn_EL0(24), CGT_MDCR_TPM_HPMN), 956 + SR_TRAP(SYS_PMEVCNTRn_EL0(25), CGT_MDCR_TPM_HPMN), 957 + SR_TRAP(SYS_PMEVCNTRn_EL0(26), CGT_MDCR_TPM_HPMN), 958 + SR_TRAP(SYS_PMEVCNTRn_EL0(27), CGT_MDCR_TPM_HPMN), 959 + SR_TRAP(SYS_PMEVCNTRn_EL0(28), CGT_MDCR_TPM_HPMN), 960 + SR_TRAP(SYS_PMEVCNTRn_EL0(29), CGT_MDCR_TPM_HPMN), 961 + SR_TRAP(SYS_PMEVCNTRn_EL0(30), CGT_MDCR_TPM_HPMN), 962 + SR_TRAP(SYS_PMEVTYPERn_EL0(0), CGT_MDCR_TPM_HPMN), 963 + SR_TRAP(SYS_PMEVTYPERn_EL0(1), CGT_MDCR_TPM_HPMN), 964 + SR_TRAP(SYS_PMEVTYPERn_EL0(2), CGT_MDCR_TPM_HPMN), 965 + SR_TRAP(SYS_PMEVTYPERn_EL0(3), CGT_MDCR_TPM_HPMN), 966 + SR_TRAP(SYS_PMEVTYPERn_EL0(4), CGT_MDCR_TPM_HPMN), 967 + SR_TRAP(SYS_PMEVTYPERn_EL0(5), CGT_MDCR_TPM_HPMN), 968 + SR_TRAP(SYS_PMEVTYPERn_EL0(6), CGT_MDCR_TPM_HPMN), 969 + SR_TRAP(SYS_PMEVTYPERn_EL0(7), CGT_MDCR_TPM_HPMN), 970 + SR_TRAP(SYS_PMEVTYPERn_EL0(8), CGT_MDCR_TPM_HPMN), 971 + SR_TRAP(SYS_PMEVTYPERn_EL0(9), CGT_MDCR_TPM_HPMN), 972 + SR_TRAP(SYS_PMEVTYPERn_EL0(10), CGT_MDCR_TPM_HPMN), 973 + SR_TRAP(SYS_PMEVTYPERn_EL0(11), CGT_MDCR_TPM_HPMN), 974 + SR_TRAP(SYS_PMEVTYPERn_EL0(12), CGT_MDCR_TPM_HPMN), 975 + SR_TRAP(SYS_PMEVTYPERn_EL0(13), CGT_MDCR_TPM_HPMN), 976 + SR_TRAP(SYS_PMEVTYPERn_EL0(14), CGT_MDCR_TPM_HPMN), 977 + SR_TRAP(SYS_PMEVTYPERn_EL0(15), CGT_MDCR_TPM_HPMN), 978 + SR_TRAP(SYS_PMEVTYPERn_EL0(16), CGT_MDCR_TPM_HPMN), 979 + SR_TRAP(SYS_PMEVTYPERn_EL0(17), CGT_MDCR_TPM_HPMN), 980 + SR_TRAP(SYS_PMEVTYPERn_EL0(18), CGT_MDCR_TPM_HPMN), 981 + SR_TRAP(SYS_PMEVTYPERn_EL0(19), CGT_MDCR_TPM_HPMN), 982 + SR_TRAP(SYS_PMEVTYPERn_EL0(20), CGT_MDCR_TPM_HPMN), 983 + SR_TRAP(SYS_PMEVTYPERn_EL0(21), 
CGT_MDCR_TPM_HPMN), 984 + SR_TRAP(SYS_PMEVTYPERn_EL0(22), CGT_MDCR_TPM_HPMN), 985 + SR_TRAP(SYS_PMEVTYPERn_EL0(23), CGT_MDCR_TPM_HPMN), 986 + SR_TRAP(SYS_PMEVTYPERn_EL0(24), CGT_MDCR_TPM_HPMN), 987 + SR_TRAP(SYS_PMEVTYPERn_EL0(25), CGT_MDCR_TPM_HPMN), 988 + SR_TRAP(SYS_PMEVTYPERn_EL0(26), CGT_MDCR_TPM_HPMN), 989 + SR_TRAP(SYS_PMEVTYPERn_EL0(27), CGT_MDCR_TPM_HPMN), 990 + SR_TRAP(SYS_PMEVTYPERn_EL0(28), CGT_MDCR_TPM_HPMN), 991 + SR_TRAP(SYS_PMEVTYPERn_EL0(29), CGT_MDCR_TPM_HPMN), 992 + SR_TRAP(SYS_PMEVTYPERn_EL0(30), CGT_MDCR_TPM_HPMN), 1028 993 SR_TRAP(SYS_PMCCFILTR_EL0, CGT_MDCR_TPM), 1029 994 SR_TRAP(SYS_MDCCSR_EL0, CGT_MDCR_TDCC_TDE_TDA), 1030 995 SR_TRAP(SYS_MDCCINT_EL1, CGT_MDCR_TDCC_TDE_TDA), ··· 1176 1141 SR_TRAP(SYS_AMEVTYPER1_EL0(13), CGT_CPTR_TAM), 1177 1142 SR_TRAP(SYS_AMEVTYPER1_EL0(14), CGT_CPTR_TAM), 1178 1143 SR_TRAP(SYS_AMEVTYPER1_EL0(15), CGT_CPTR_TAM), 1179 - SR_TRAP(SYS_POR_EL0, CGT_CPACR_E0POE), 1180 1144 /* op0=2, op1=1, and CRn<0b1000 */ 1181 1145 SR_RANGE_TRAP(sys_reg(2, 1, 0, 0, 0), 1182 1146 sys_reg(2, 1, 7, 15, 7), CGT_CPTR_TTA), ··· 2055 2021 cgids = coarse_control_combo[id - __MULTIPLE_CONTROL_BITS__]; 2056 2022 2057 2023 for (int i = 0; cgids[i] != __RESERVED__; i++) { 2058 - if (cgids[i] >= __MULTIPLE_CONTROL_BITS__) { 2024 + if (cgids[i] >= __MULTIPLE_CONTROL_BITS__ && 2025 + cgids[i] < __COMPLEX_CONDITIONS__) { 2059 2026 kvm_err("Recursive MCB %d/%d\n", id, cgids[i]); 2060 2027 ret = -EINVAL; 2061 2028 } ··· 2161 2126 return masks->mask[sr - __VNCR_START__].res0; 2162 2127 } 2163 2128 2164 - static bool check_fgt_bit(struct kvm *kvm, bool is_read, 2129 + static bool check_fgt_bit(struct kvm_vcpu *vcpu, bool is_read, 2165 2130 u64 val, const union trap_config tc) 2166 2131 { 2132 + struct kvm *kvm = vcpu->kvm; 2167 2133 enum vcpu_sysreg sr; 2134 + 2135 + /* 2136 + * KVM doesn't know about any FGTs that apply to the host, and hopefully 2137 + * that'll remain the case. 
2138 + */ 2139 + if (is_hyp_ctxt(vcpu)) 2140 + return false; 2168 2141 2169 2142 if (tc.pol) 2170 2143 return (val & BIT(tc.bit)); ··· 2250 2207 * If we're not nesting, immediately return to the caller, with the 2251 2208 * sysreg index, should we have it. 2252 2209 */ 2253 - if (!vcpu_has_nv(vcpu) || is_hyp_ctxt(vcpu)) 2210 + if (!vcpu_has_nv(vcpu)) 2211 + goto local; 2212 + 2213 + /* 2214 + * There are a few traps that take effect InHost, but are constrained 2215 + * to EL0. Don't bother with computing the trap behaviour if the vCPU 2216 + * isn't in EL0. 2217 + */ 2218 + if (is_hyp_ctxt(vcpu) && !vcpu_is_host_el0(vcpu)) 2254 2219 goto local; 2255 2220 2256 2221 switch ((enum fgt_group_id)tc.fgt) { ··· 2304 2253 goto local; 2305 2254 } 2306 2255 2307 - if (tc.fgt != __NO_FGT_GROUP__ && check_fgt_bit(vcpu->kvm, is_read, 2308 - val, tc)) 2256 + if (tc.fgt != __NO_FGT_GROUP__ && check_fgt_bit(vcpu, is_read, val, tc)) 2309 2257 goto inject; 2310 2258 2311 2259 b = compute_trap_behaviour(vcpu, tc); 2260 + 2261 + if (!(b & BEHAVE_FORWARD_IN_HOST_EL0) && vcpu_is_host_el0(vcpu)) 2262 + goto local; 2312 2263 2313 2264 if (((b & BEHAVE_FORWARD_READ) && is_read) || 2314 2265 ((b & BEHAVE_FORWARD_WRITE) && !is_read)) ··· 2446 2393 2447 2394 kvm_arch_vcpu_load(vcpu, smp_processor_id()); 2448 2395 preempt_enable(); 2396 + 2397 + kvm_pmu_nested_transition(vcpu); 2449 2398 } 2450 2399 2451 2400 static void kvm_inject_el2_exception(struct kvm_vcpu *vcpu, u64 esr_el2, ··· 2529 2474 2530 2475 kvm_arch_vcpu_load(vcpu, smp_processor_id()); 2531 2476 preempt_enable(); 2477 + 2478 + kvm_pmu_nested_transition(vcpu); 2532 2479 2533 2480 return 1; 2534 2481 }
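The final forwarding decision in this file's trap-routing hunk can be sketched in isolation. This is a minimal model, not the kernel code: `should_forward()` and its `in_host_el0` parameter are hypothetical stand-ins for the inline checks around `compute_trap_behaviour()` and `vcpu_is_host_el0()`, and `BIT(n)` is written out as a shift.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the enum in the diff, with BIT(n) spelled out. */
enum trap_behaviour {
	BEHAVE_HANDLE_LOCALLY		= 0,
	BEHAVE_FORWARD_READ		= 1 << 0,
	BEHAVE_FORWARD_WRITE		= 1 << 1,
	BEHAVE_FORWARD_RW		= BEHAVE_FORWARD_READ | BEHAVE_FORWARD_WRITE,
	/* Traps that also take effect in host EL0 */
	BEHAVE_FORWARD_IN_HOST_EL0	= 1 << 2,
};

/*
 * Hypothetical helper: true when the trapped access would be forwarded
 * to the guest hypervisor, false when it is handled locally.
 */
static bool should_forward(unsigned int b, bool is_read, bool in_host_el0)
{
	/* Most traps do not apply while the vCPU runs in host EL0. */
	if (!(b & BEHAVE_FORWARD_IN_HOST_EL0) && in_host_el0)
		return false;

	return ((b & BEHAVE_FORWARD_READ) && is_read) ||
	       ((b & BEHAVE_FORWARD_WRITE) && !is_read);
}

/* Usage: an ordinary RW trap versus one flagged for host EL0. */
static void demo_should_forward(void)
{
	assert(should_forward(BEHAVE_FORWARD_RW, true, false));
	assert(!should_forward(BEHAVE_FORWARD_RW, true, true));
	assert(should_forward(BEHAVE_FORWARD_RW | BEHAVE_FORWARD_IN_HOST_EL0,
			      false, true));
}
```

Keeping the host-EL0 property as a third flag bit, rather than as a separate table, lets the existing read/write test stay unchanged; only the PMU-related entries opt in.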

+6 -8

arch/arm64/kvm/guest.c

···
 	}
 
 	while (length > 0) {
-		kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+		struct page *page = __gfn_to_page(kvm, gfn, write);
 		void *maddr;
 		unsigned long num_tags;
-		struct page *page;
 		struct folio *folio;
 
-		if (is_error_noslot_pfn(pfn)) {
+		if (!page) {
 			ret = -EFAULT;
 			goto out;
 		}
 
-		page = pfn_to_online_page(pfn);
-		if (!page) {
+		if (!pfn_to_online_page(page_to_pfn(page))) {
 			/* Reject ZONE_DEVICE memory */
-			kvm_release_pfn_clean(pfn);
+			kvm_release_page_unused(page);
 			ret = -EFAULT;
 			goto out;
 		}
···
 			/* No tags in memory, so write zeros */
 			num_tags = MTE_GRANULES_PER_PAGE -
 				clear_user(tags, MTE_GRANULES_PER_PAGE);
-			kvm_release_pfn_clean(pfn);
+			kvm_release_page_clean(page);
 		} else {
 			/*
 			 * Only locking to serialise with a concurrent
···
 			else
 				set_page_mte_tagged(page);
 
-			kvm_release_pfn_dirty(pfn);
+			kvm_release_page_dirty(page);
 		}
 
 		if (num_tags != MTE_GRANULES_PER_PAGE) {

+31

arch/arm64/kvm/hyp/include/hyp/switch.h

···
 	__deactivate_fgt(hctxt, vcpu, kvm, HAFGRTR_EL2);
 }
 
+static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu)
+{
+	u64 r = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1;
+
+	if (!system_supports_mpam())
+		return;
+
+	/* trap guest access to MPAMIDR_EL1 */
+	if (system_supports_mpam_hcr()) {
+		write_sysreg_s(MPAMHCR_EL2_TRAP_MPAMIDR_EL1, SYS_MPAMHCR_EL2);
+	} else {
+		/* From v1.1 TIDR can trap MPAMIDR, set it unconditionally */
+		r |= MPAM2_EL2_TIDR;
+	}
+
+	write_sysreg_s(r, SYS_MPAM2_EL2);
+}
+
+static inline void __deactivate_traps_mpam(void)
+{
+	if (!system_supports_mpam())
+		return;
+
+	write_sysreg_s(0, SYS_MPAM2_EL2);
+
+	if (system_supports_mpam_hcr())
+		write_sysreg_s(MPAMHCR_HOST_FLAGS, SYS_MPAMHCR_EL2);
+}
+
 static inline void __activate_traps_common(struct kvm_vcpu *vcpu)
 {
 	/* Trap on AArch32 cp15 c15 (impdef sysregs) accesses (EL1 or EL0) */
···
 	}
 
 	__activate_traps_hfgxtr(vcpu);
+	__activate_traps_mpam(vcpu);
 }
 
 static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu)
···
 		write_sysreg_s(HCRX_HOST_FLAGS, SYS_HCRX_EL2);
 
 	__deactivate_traps_hfgxtr(vcpu);
+	__deactivate_traps_mpam();
 }
 
 static inline void ___activate_traps(struct kvm_vcpu *vcpu, u64 hcr)
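How `__activate_traps_mpam()` picks the MPAM2_EL2 value can be sketched as a pure function. The bit positions below are assumptions made for illustration (they are believed to match the architectural layout, but check the kernel's generated sysreg definitions before relying on them), and `mpam2_el2_guest_value()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define MPAM2_EL2_TRAPMPAM1EL1	(UINT64_C(1) << 48)
#define MPAM2_EL2_TRAPMPAM0EL1	(UINT64_C(1) << 49)
#define MPAM2_EL2_TIDR		(UINT64_C(1) << 58)

/*
 * Hypothetical helper: the value the hunk would write to MPAM2_EL2 on
 * guest entry. Returns 0 when MPAM is absent; in the kernel no register
 * write happens at all in that case.
 */
static uint64_t mpam2_el2_guest_value(bool has_mpam, bool has_mpam_hcr)
{
	uint64_t r = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1;

	if (!has_mpam)
		return 0;

	/*
	 * With MPAMHCR_EL2 present, MPAMIDR_EL1 is trapped through that
	 * register instead; otherwise fall back to the TIDR bit.
	 */
	if (!has_mpam_hcr)
		r |= MPAM2_EL2_TIDR;

	return r;
}

/* Usage: TIDR is only set when MPAMHCR_EL2 is unavailable. */
static void demo_mpam2_el2(void)
{
	assert(mpam2_el2_guest_value(true, false) & MPAM2_EL2_TIDR);
	assert(!(mpam2_el2_guest_value(true, true) & MPAM2_EL2_TIDR));
}
```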

+6 -5

arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h

···
 		return false;
 
 	vcpu = ctxt_to_vcpu(ctxt);
-	return kvm_has_feat(kern_hyp_va(vcpu->kvm), ID_AA64MMFR3_EL1, S1PIE, IMP);
+	return kvm_has_s1pie(kern_hyp_va(vcpu->kvm));
 }
 
 static inline bool ctxt_has_tcrx(struct kvm_cpu_context *ctxt)
···
 		return false;
 
 	vcpu = ctxt_to_vcpu(ctxt);
-	return kvm_has_feat(kern_hyp_va(vcpu->kvm), ID_AA64MMFR3_EL1, TCRX, IMP);
+	return kvm_has_tcr2(kern_hyp_va(vcpu->kvm));
 }
 
 static inline bool ctxt_has_s1poe(struct kvm_cpu_context *ctxt)
···
 		return false;
 
 	vcpu = ctxt_to_vcpu(ctxt);
-	return kvm_has_feat(kern_hyp_va(vcpu->kvm), ID_AA64MMFR3_EL1, S1POE, IMP);
+	return kvm_has_s1poe(kern_hyp_va(vcpu->kvm));
 }
 
 static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
···
 	write_sysreg(ctxt_sys_reg(ctxt, TPIDRRO_EL0), tpidrro_el0);
 }
 
-static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt)
+static inline void __sysreg_restore_el1_state(struct kvm_cpu_context *ctxt,
+					      u64 mpidr)
 {
-	write_sysreg(ctxt_sys_reg(ctxt, MPIDR_EL1), vmpidr_el2);
+	write_sysreg(mpidr, vmpidr_el2);
 
 	if (has_vhe() ||
 	    !cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {

-2

arch/arm64/kvm/hyp/include/nvhe/trap_handler.h

···
 #define DECLARE_REG(type, name, ctxt, reg)	\
 				type name = (type)cpu_reg(ctxt, (reg))
 
-void __pkvm_vcpu_init_traps(struct kvm_vcpu *vcpu);
-
 #endif /* __ARM64_KVM_NVHE_TRAP_HANDLER_H__ */

+3 -9

arch/arm64/kvm/hyp/nvhe/hyp-main.c

···
 
 	hyp_vcpu->vcpu.arch.hw_mmu = host_vcpu->arch.hw_mmu;
 
-	hyp_vcpu->vcpu.arch.hcr_el2 = host_vcpu->arch.hcr_el2;
 	hyp_vcpu->vcpu.arch.mdcr_el2 = host_vcpu->arch.mdcr_el2;
+	hyp_vcpu->vcpu.arch.hcr_el2 &= ~(HCR_TWI | HCR_TWE);
+	hyp_vcpu->vcpu.arch.hcr_el2 |= READ_ONCE(host_vcpu->arch.hcr_el2) &
+						 (HCR_TWI | HCR_TWE);
 
 	hyp_vcpu->vcpu.arch.iflags = host_vcpu->arch.iflags;
 
···
 	cpu_reg(host_ctxt, 1) = __pkvm_prot_finalize();
 }
 
-static void handle___pkvm_vcpu_init_traps(struct kvm_cpu_context *host_ctxt)
-{
-	DECLARE_REG(struct kvm_vcpu *, vcpu, host_ctxt, 1);
-
-	__pkvm_vcpu_init_traps(kern_hyp_va(vcpu));
-}
-
 static void handle___pkvm_init_vm(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(struct kvm *, host_kvm, host_ctxt, 1);
···
 	HANDLE_FUNC(__kvm_timer_set_cntvoff),
 	HANDLE_FUNC(__vgic_v3_save_vmcr_aprs),
 	HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs),
-	HANDLE_FUNC(__pkvm_vcpu_init_traps),
 	HANDLE_FUNC(__pkvm_init_vm),
 	HANDLE_FUNC(__pkvm_init_vcpu),
 	HANDLE_FUNC(__pkvm_teardown_vm),
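The first hunk above stops copying the host's whole `hcr_el2` and instead takes over only the WFI/WFE trap bits. A minimal sketch of that masked merge follows; `sync_wfx_traps()` is a hypothetical name, and the bit positions are the architectural HCR_EL2.TWI (bit 13) and HCR_EL2.TWE (bit 14).

```c
#include <assert.h>
#include <stdint.h>

#define HCR_TWI	(UINT64_C(1) << 13)	/* trap WFI */
#define HCR_TWE	(UINT64_C(1) << 14)	/* trap WFE */

/*
 * Hypothetical helper: keep every hypervisor-owned bit of hyp_hcr and
 * take only TWI/TWE from the (untrusted) host value.
 */
static uint64_t sync_wfx_traps(uint64_t hyp_hcr, uint64_t host_hcr)
{
	const uint64_t mask = HCR_TWI | HCR_TWE;

	hyp_hcr &= ~mask;		/* drop the stale TWI/TWE state */
	hyp_hcr |= host_hcr & mask;	/* import only those two bits */

	return hyp_hcr;
}

/* Usage: unrelated host bits never leak into the hyp copy. */
static void demo_sync_wfx_traps(void)
{
	uint64_t hyp = HCR_TWI | UINT64_C(1);		/* bit 0 is hyp-owned */
	uint64_t host = HCR_TWE | (UINT64_C(1) << 27);	/* bit 27 is ignored */

	assert(sync_wfx_traps(hyp, host) == (HCR_TWE | UINT64_C(1)));
}
```

The design point is that the hypervisor's copy becomes the source of truth for every other HCR_EL2 bit, so a host cannot change a guest's trap configuration beyond the two scheduling hints.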

+115 -1

arch/arm64/kvm/hyp/nvhe/pkvm.c

··· 6 6 7 7 #include <linux/kvm_host.h> 8 8 #include <linux/mm.h> 9 + 10 + #include <asm/kvm_emulate.h> 11 + 9 12 #include <nvhe/fixed_config.h> 10 13 #include <nvhe/mem_protect.h> 11 14 #include <nvhe/memory.h> ··· 204 201 } 205 202 } 206 203 204 + static void pkvm_vcpu_reset_hcr(struct kvm_vcpu *vcpu) 205 + { 206 + vcpu->arch.hcr_el2 = HCR_GUEST_FLAGS; 207 + 208 + if (has_hvhe()) 209 + vcpu->arch.hcr_el2 |= HCR_E2H; 210 + 211 + if (cpus_have_final_cap(ARM64_HAS_RAS_EXTN)) { 212 + /* route synchronous external abort exceptions to EL2 */ 213 + vcpu->arch.hcr_el2 |= HCR_TEA; 214 + /* trap error record accesses */ 215 + vcpu->arch.hcr_el2 |= HCR_TERR; 216 + } 217 + 218 + if (cpus_have_final_cap(ARM64_HAS_STAGE2_FWB)) 219 + vcpu->arch.hcr_el2 |= HCR_FWB; 220 + 221 + if (cpus_have_final_cap(ARM64_HAS_EVT) && 222 + !cpus_have_final_cap(ARM64_MISMATCHED_CACHE_TYPE)) 223 + vcpu->arch.hcr_el2 |= HCR_TID4; 224 + else 225 + vcpu->arch.hcr_el2 |= HCR_TID2; 226 + 227 + if (vcpu_has_ptrauth(vcpu)) 228 + vcpu->arch.hcr_el2 |= (HCR_API | HCR_APK); 229 + } 230 + 207 231 /* 208 232 * Initialize trap register values in protected mode. 209 233 */ 210 - void __pkvm_vcpu_init_traps(struct kvm_vcpu *vcpu) 234 + static void pkvm_vcpu_init_traps(struct kvm_vcpu *vcpu) 211 235 { 236 + vcpu->arch.cptr_el2 = kvm_get_reset_cptr_el2(vcpu); 237 + vcpu->arch.mdcr_el2 = 0; 238 + 239 + pkvm_vcpu_reset_hcr(vcpu); 240 + 241 + if ((!vcpu_is_protected(vcpu))) 242 + return; 243 + 212 244 pvm_init_trap_regs(vcpu); 213 245 pvm_init_traps_aa64pfr0(vcpu); 214 246 pvm_init_traps_aa64pfr1(vcpu); ··· 327 289 hyp_spin_unlock(&vm_table_lock); 328 290 } 329 291 292 + static void pkvm_init_features_from_host(struct pkvm_hyp_vm *hyp_vm, const struct kvm *host_kvm) 293 + { 294 + struct kvm *kvm = &hyp_vm->kvm; 295 + DECLARE_BITMAP(allowed_features, KVM_VCPU_MAX_FEATURES); 296 + 297 + /* No restrictions for non-protected VMs. 
*/ 298 + if (!kvm_vm_is_protected(kvm)) { 299 + bitmap_copy(kvm->arch.vcpu_features, 300 + host_kvm->arch.vcpu_features, 301 + KVM_VCPU_MAX_FEATURES); 302 + return; 303 + } 304 + 305 + bitmap_zero(allowed_features, KVM_VCPU_MAX_FEATURES); 306 + 307 + /* 308 + * For protected VMs, always allow: 309 + * - CPU starting in poweroff state 310 + * - PSCI v0.2 311 + */ 312 + set_bit(KVM_ARM_VCPU_POWER_OFF, allowed_features); 313 + set_bit(KVM_ARM_VCPU_PSCI_0_2, allowed_features); 314 + 315 + /* 316 + * Check if remaining features are allowed: 317 + * - Performance Monitoring 318 + * - Scalable Vectors 319 + * - Pointer Authentication 320 + */ 321 + if (FIELD_GET(ARM64_FEATURE_MASK(ID_AA64DFR0_EL1_PMUVer), PVM_ID_AA64DFR0_ALLOW)) 322 + set_bit(KVM_ARM_VCPU_PMU_V3, allowed_features); 323 + 324 + if (FIELD_GET(ARM64_FEATURE_MASK(ID_AA64PFR0_EL1_SVE), PVM_ID_AA64PFR0_ALLOW)) 325 + set_bit(KVM_ARM_VCPU_SVE, allowed_features); 326 + 327 + if (FIELD_GET(ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_API), PVM_ID_AA64ISAR1_RESTRICT_UNSIGNED) && 328 + FIELD_GET(ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_APA), PVM_ID_AA64ISAR1_RESTRICT_UNSIGNED)) 329 + set_bit(KVM_ARM_VCPU_PTRAUTH_ADDRESS, allowed_features); 330 + 331 + if (FIELD_GET(ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_GPI), PVM_ID_AA64ISAR1_ALLOW) && 332 + FIELD_GET(ARM64_FEATURE_MASK(ID_AA64ISAR1_EL1_GPA), PVM_ID_AA64ISAR1_ALLOW)) 333 + set_bit(KVM_ARM_VCPU_PTRAUTH_GENERIC, allowed_features); 334 + 335 + bitmap_and(kvm->arch.vcpu_features, host_kvm->arch.vcpu_features, 336 + allowed_features, KVM_VCPU_MAX_FEATURES); 337 + } 338 + 339 + static void pkvm_vcpu_init_ptrauth(struct pkvm_hyp_vcpu *hyp_vcpu) 340 + { 341 + struct kvm_vcpu *vcpu = &hyp_vcpu->vcpu; 342 + 343 + if (vcpu_has_feature(vcpu, KVM_ARM_VCPU_PTRAUTH_ADDRESS) || 344 + vcpu_has_feature(vcpu, KVM_ARM_VCPU_PTRAUTH_GENERIC)) { 345 + kvm_vcpu_enable_ptrauth(vcpu); 346 + } else { 347 + vcpu_clear_flag(&hyp_vcpu->vcpu, GUEST_HAS_PTRAUTH); 348 + } 349 + } 350 + 330 351 static void 
unpin_host_vcpu(struct kvm_vcpu *host_vcpu) 331 352 { 332 353 if (host_vcpu) ··· 407 310 hyp_vm->host_kvm = host_kvm; 408 311 hyp_vm->kvm.created_vcpus = nr_vcpus; 409 312 hyp_vm->kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr; 313 + hyp_vm->kvm.arch.pkvm.enabled = READ_ONCE(host_kvm->arch.pkvm.enabled); 314 + pkvm_init_features_from_host(hyp_vm, host_kvm); 315 + } 316 + 317 + static void pkvm_vcpu_init_sve(struct pkvm_hyp_vcpu *hyp_vcpu, struct kvm_vcpu *host_vcpu) 318 + { 319 + struct kvm_vcpu *vcpu = &hyp_vcpu->vcpu; 320 + 321 + if (!vcpu_has_feature(vcpu, KVM_ARM_VCPU_SVE)) { 322 + vcpu_clear_flag(vcpu, GUEST_HAS_SVE); 323 + vcpu_clear_flag(vcpu, VCPU_SVE_FINALIZED); 324 + } 410 325 } 411 326 412 327 static int init_pkvm_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu, ··· 444 335 445 336 hyp_vcpu->vcpu.arch.hw_mmu = &hyp_vm->kvm.arch.mmu; 446 337 hyp_vcpu->vcpu.arch.cflags = READ_ONCE(host_vcpu->arch.cflags); 338 + hyp_vcpu->vcpu.arch.mp_state.mp_state = KVM_MP_STATE_STOPPED; 339 + 340 + pkvm_vcpu_init_sve(hyp_vcpu, host_vcpu); 341 + pkvm_vcpu_init_ptrauth(hyp_vcpu); 342 + pkvm_vcpu_init_traps(&hyp_vcpu->vcpu); 447 343 done: 448 344 if (ret) 449 345 unpin_host_vcpu(host_vcpu);
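The filtering done by `pkvm_init_features_from_host()` reduces to a bitmap intersection. The sketch below models it with a single 64-bit word: the `FEAT_*` indices, `filter_features()` and the `optional_allowed` parameter are hypothetical, standing in for the `KVM_ARM_VCPU_*` feature bits and the `PVM_ID_*_ALLOW` checks in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative feature indices, not the kernel's KVM_ARM_VCPU_* values. */
enum {
	FEAT_POWER_OFF,
	FEAT_PSCI_0_2,
	FEAT_PMU_V3,
	FEAT_SVE,
};

#define FEAT_BIT(n)	(UINT64_C(1) << (n))

/*
 * Hypothetical helper: non-protected VMs get the host's request as-is;
 * protected VMs get it intersected with an allow-list that always holds
 * POWER_OFF and PSCI_0_2 plus whatever the fixed configuration permits.
 */
static uint64_t filter_features(uint64_t host_features, bool is_protected,
				uint64_t optional_allowed)
{
	uint64_t allowed;

	if (!is_protected)
		return host_features;

	allowed = FEAT_BIT(FEAT_POWER_OFF) | FEAT_BIT(FEAT_PSCI_0_2);
	allowed |= optional_allowed;

	return host_features & allowed;
}

/* Usage: a protected VM asking for SVE when only the PMU is permitted. */
static void demo_filter_features(void)
{
	uint64_t req = FEAT_BIT(FEAT_PSCI_0_2) | FEAT_BIT(FEAT_SVE);

	assert(filter_features(req, true, FEAT_BIT(FEAT_PMU_V3)) ==
	       FEAT_BIT(FEAT_PSCI_0_2));
}
```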

+2

arch/arm64/kvm/hyp/nvhe/psci-relay.c

···
 	case PSCI_1_0_FN_PSCI_FEATURES:
 	case PSCI_1_0_FN_SET_SUSPEND_MODE:
 	case PSCI_1_1_FN64_SYSTEM_RESET2:
+	case PSCI_1_3_FN_SYSTEM_OFF2:
+	case PSCI_1_3_FN64_SYSTEM_OFF2:
 		return psci_forward(host_ctxt);
 	case PSCI_1_0_FN64_SYSTEM_SUSPEND:
 		return psci_system_suspend(func_id, host_ctxt);

+1 -19

arch/arm64/kvm/hyp/nvhe/setup.c

···
 {
 	void *start, *end, *virt = hyp_phys_to_virt(phys);
 	unsigned long pgt_size = hyp_s1_pgtable_pages() << PAGE_SHIFT;
-	enum kvm_pgtable_prot prot;
 	int ret, i;
 
 	/* Recreate the hyp page-table using the early page allocator */
···
 			return ret;
 	}
 
-	pkvm_create_host_sve_mappings();
-
-	/*
-	 * Map the host sections RO in the hypervisor, but transfer the
-	 * ownership from the host to the hypervisor itself to make sure they
-	 * can't be donated or shared with another entity.
-	 *
-	 * The ownership transition requires matching changes in the host
-	 * stage-2. This will be done later (see finalize_host_mappings()) once
-	 * the hyp_vmemmap is addressable.
-	 */
-	prot = pkvm_mkstate(PAGE_HYP_RO, PKVM_PAGE_SHARED_OWNED);
-	ret = pkvm_create_mappings(&kvm_vgic_global_state,
-				   &kvm_vgic_global_state + 1, prot);
-	if (ret)
-		return ret;
-
-	return 0;
+	return pkvm_create_host_sve_mappings();
 }
 
 static void update_nvhe_init_params(void)

+1 -1

arch/arm64/kvm/hyp/nvhe/sysreg-sr.c

···
 
 void __sysreg_restore_state_nvhe(struct kvm_cpu_context *ctxt)
 {
-	__sysreg_restore_el1_state(ctxt);
+	__sysreg_restore_el1_state(ctxt, ctxt_sys_reg(ctxt, MPIDR_EL1));
 	__sysreg_restore_common_state(ctxt);
 	__sysreg_restore_user_state(ctxt);
 	__sysreg_restore_el2_return_state(ctxt);

+2 -5

arch/arm64/kvm/hyp/pgtable.c

···
 					NULL, NULL, 0);
 }
 
-kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
+void kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
 {
-	kvm_pte_t pte = 0;
 	int ret;
 
 	ret = stage2_update_leaf_attrs(pgt, addr, 1, KVM_PTE_LEAF_ATTR_LO_S2_AF, 0,
-				       &pte, NULL,
+				       NULL, NULL,
 				       KVM_PGTABLE_WALK_HANDLE_FAULT |
 				       KVM_PGTABLE_WALK_SHARED);
 	if (!ret)
 		dsb(ishst);
-
-	return pte;
 }
 
 struct stage2_age_data {

-3

arch/arm64/kvm/hyp/vgic-v3-sr.c

@@ -1012,9 +1012,6 @@
 	val = ((vtr >> 29) & 7) << ICC_CTLR_EL1_PRI_BITS_SHIFT;
 	/* IDbits */
 	val |= ((vtr >> 23) & 7) << ICC_CTLR_EL1_ID_BITS_SHIFT;
-	/* SEIS */
-	if (kvm_vgic_global_state.ich_vtr_el2 & ICH_VTR_SEIS_MASK)
-		val |= BIT(ICC_CTLR_EL1_SEIS_SHIFT);
 	/* A3V */
 	val |= ((vtr >> 21) & 1) << ICC_CTLR_EL1_A3V_SHIFT;
 	/* EOImode */

+158 -2

arch/arm64/kvm/hyp/vhe/sysreg-sr.c

··· 15 15 #include <asm/kvm_hyp.h> 16 16 #include <asm/kvm_nested.h> 17 17 18 + static void __sysreg_save_vel2_state(struct kvm_vcpu *vcpu) 19 + { 20 + /* These registers are common with EL1 */ 21 + __vcpu_sys_reg(vcpu, PAR_EL1) = read_sysreg(par_el1); 22 + __vcpu_sys_reg(vcpu, TPIDR_EL1) = read_sysreg(tpidr_el1); 23 + 24 + __vcpu_sys_reg(vcpu, ESR_EL2) = read_sysreg_el1(SYS_ESR); 25 + __vcpu_sys_reg(vcpu, AFSR0_EL2) = read_sysreg_el1(SYS_AFSR0); 26 + __vcpu_sys_reg(vcpu, AFSR1_EL2) = read_sysreg_el1(SYS_AFSR1); 27 + __vcpu_sys_reg(vcpu, FAR_EL2) = read_sysreg_el1(SYS_FAR); 28 + __vcpu_sys_reg(vcpu, MAIR_EL2) = read_sysreg_el1(SYS_MAIR); 29 + __vcpu_sys_reg(vcpu, VBAR_EL2) = read_sysreg_el1(SYS_VBAR); 30 + __vcpu_sys_reg(vcpu, CONTEXTIDR_EL2) = read_sysreg_el1(SYS_CONTEXTIDR); 31 + __vcpu_sys_reg(vcpu, AMAIR_EL2) = read_sysreg_el1(SYS_AMAIR); 32 + 33 + /* 34 + * In VHE mode those registers are compatible between EL1 and EL2, 35 + * and the guest uses the _EL1 versions on the CPU naturally. 36 + * So we save them into their _EL2 versions here. 37 + * For nVHE mode we trap accesses to those registers, so our 38 + * _EL2 copy in sys_regs[] is always up-to-date and we don't need 39 + * to save anything here. 40 + */ 41 + if (vcpu_el2_e2h_is_set(vcpu)) { 42 + u64 val; 43 + 44 + /* 45 + * We don't save CPTR_EL2, as accesses to CPACR_EL1 46 + * are always trapped, ensuring that the in-memory 47 + * copy is always up-to-date. A small blessing... 
48 + */ 49 + __vcpu_sys_reg(vcpu, SCTLR_EL2) = read_sysreg_el1(SYS_SCTLR); 50 + __vcpu_sys_reg(vcpu, TTBR0_EL2) = read_sysreg_el1(SYS_TTBR0); 51 + __vcpu_sys_reg(vcpu, TTBR1_EL2) = read_sysreg_el1(SYS_TTBR1); 52 + __vcpu_sys_reg(vcpu, TCR_EL2) = read_sysreg_el1(SYS_TCR); 53 + 54 + if (ctxt_has_tcrx(&vcpu->arch.ctxt)) { 55 + __vcpu_sys_reg(vcpu, TCR2_EL2) = read_sysreg_el1(SYS_TCR2); 56 + 57 + if (ctxt_has_s1pie(&vcpu->arch.ctxt)) { 58 + __vcpu_sys_reg(vcpu, PIRE0_EL2) = read_sysreg_el1(SYS_PIRE0); 59 + __vcpu_sys_reg(vcpu, PIR_EL2) = read_sysreg_el1(SYS_PIR); 60 + } 61 + 62 + if (ctxt_has_s1poe(&vcpu->arch.ctxt)) 63 + __vcpu_sys_reg(vcpu, POR_EL2) = read_sysreg_el1(SYS_POR); 64 + } 65 + 66 + /* 67 + * The EL1 view of CNTKCTL_EL1 has a bunch of RES0 bits where 68 + * the interesting CNTHCTL_EL2 bits live. So preserve these 69 + * bits when reading back the guest-visible value. 70 + */ 71 + val = read_sysreg_el1(SYS_CNTKCTL); 72 + val &= CNTKCTL_VALID_BITS; 73 + __vcpu_sys_reg(vcpu, CNTHCTL_EL2) &= ~CNTKCTL_VALID_BITS; 74 + __vcpu_sys_reg(vcpu, CNTHCTL_EL2) |= val; 75 + } 76 + 77 + __vcpu_sys_reg(vcpu, SP_EL2) = read_sysreg(sp_el1); 78 + __vcpu_sys_reg(vcpu, ELR_EL2) = read_sysreg_el1(SYS_ELR); 79 + __vcpu_sys_reg(vcpu, SPSR_EL2) = read_sysreg_el1(SYS_SPSR); 80 + } 81 + 82 + static void __sysreg_restore_vel2_state(struct kvm_vcpu *vcpu) 83 + { 84 + u64 val; 85 + 86 + /* These registers are common with EL1 */ 87 + write_sysreg(__vcpu_sys_reg(vcpu, PAR_EL1), par_el1); 88 + write_sysreg(__vcpu_sys_reg(vcpu, TPIDR_EL1), tpidr_el1); 89 + 90 + write_sysreg(__vcpu_sys_reg(vcpu, MPIDR_EL1), vmpidr_el2); 91 + write_sysreg_el1(__vcpu_sys_reg(vcpu, MAIR_EL2), SYS_MAIR); 92 + write_sysreg_el1(__vcpu_sys_reg(vcpu, VBAR_EL2), SYS_VBAR); 93 + write_sysreg_el1(__vcpu_sys_reg(vcpu, CONTEXTIDR_EL2), SYS_CONTEXTIDR); 94 + write_sysreg_el1(__vcpu_sys_reg(vcpu, AMAIR_EL2), SYS_AMAIR); 95 + 96 + if (vcpu_el2_e2h_is_set(vcpu)) { 97 + /* 98 + * In VHE mode those registers are compatible 
between 99 + * EL1 and EL2. 100 + */ 101 + write_sysreg_el1(__vcpu_sys_reg(vcpu, SCTLR_EL2), SYS_SCTLR); 102 + write_sysreg_el1(__vcpu_sys_reg(vcpu, CPTR_EL2), SYS_CPACR); 103 + write_sysreg_el1(__vcpu_sys_reg(vcpu, TTBR0_EL2), SYS_TTBR0); 104 + write_sysreg_el1(__vcpu_sys_reg(vcpu, TTBR1_EL2), SYS_TTBR1); 105 + write_sysreg_el1(__vcpu_sys_reg(vcpu, TCR_EL2), SYS_TCR); 106 + write_sysreg_el1(__vcpu_sys_reg(vcpu, CNTHCTL_EL2), SYS_CNTKCTL); 107 + } else { 108 + /* 109 + * CNTHCTL_EL2 only affects EL1 when running nVHE, so 110 + * no need to restore it. 111 + */ 112 + val = translate_sctlr_el2_to_sctlr_el1(__vcpu_sys_reg(vcpu, SCTLR_EL2)); 113 + write_sysreg_el1(val, SYS_SCTLR); 114 + val = translate_cptr_el2_to_cpacr_el1(__vcpu_sys_reg(vcpu, CPTR_EL2)); 115 + write_sysreg_el1(val, SYS_CPACR); 116 + val = translate_ttbr0_el2_to_ttbr0_el1(__vcpu_sys_reg(vcpu, TTBR0_EL2)); 117 + write_sysreg_el1(val, SYS_TTBR0); 118 + val = translate_tcr_el2_to_tcr_el1(__vcpu_sys_reg(vcpu, TCR_EL2)); 119 + write_sysreg_el1(val, SYS_TCR); 120 + } 121 + 122 + if (ctxt_has_tcrx(&vcpu->arch.ctxt)) { 123 + write_sysreg_el1(__vcpu_sys_reg(vcpu, TCR2_EL2), SYS_TCR2); 124 + 125 + if (ctxt_has_s1pie(&vcpu->arch.ctxt)) { 126 + write_sysreg_el1(__vcpu_sys_reg(vcpu, PIR_EL2), SYS_PIR); 127 + write_sysreg_el1(__vcpu_sys_reg(vcpu, PIRE0_EL2), SYS_PIRE0); 128 + } 129 + 130 + if (ctxt_has_s1poe(&vcpu->arch.ctxt)) 131 + write_sysreg_el1(__vcpu_sys_reg(vcpu, POR_EL2), SYS_POR); 132 + } 133 + 134 + write_sysreg_el1(__vcpu_sys_reg(vcpu, ESR_EL2), SYS_ESR); 135 + write_sysreg_el1(__vcpu_sys_reg(vcpu, AFSR0_EL2), SYS_AFSR0); 136 + write_sysreg_el1(__vcpu_sys_reg(vcpu, AFSR1_EL2), SYS_AFSR1); 137 + write_sysreg_el1(__vcpu_sys_reg(vcpu, FAR_EL2), SYS_FAR); 138 + write_sysreg(__vcpu_sys_reg(vcpu, SP_EL2), sp_el1); 139 + write_sysreg_el1(__vcpu_sys_reg(vcpu, ELR_EL2), SYS_ELR); 140 + write_sysreg_el1(__vcpu_sys_reg(vcpu, SPSR_EL2), SYS_SPSR); 141 + } 142 + 18 143 /* 19 144 * VHE: Host and guest must save 
mdscr_el1 and sp_el0 (and the PC and 20 145 * pstate, which are handled as part of the el2 return state) on every ··· 191 66 { 192 67 struct kvm_cpu_context *guest_ctxt = &vcpu->arch.ctxt; 193 68 struct kvm_cpu_context *host_ctxt; 69 + u64 mpidr; 194 70 195 71 host_ctxt = host_data_ptr(host_ctxt); 196 72 __sysreg_save_user_state(host_ctxt); ··· 215 89 */ 216 90 __sysreg32_restore_state(vcpu); 217 91 __sysreg_restore_user_state(guest_ctxt); 218 - __sysreg_restore_el1_state(guest_ctxt); 92 + 93 + if (unlikely(__is_hyp_ctxt(guest_ctxt))) { 94 + __sysreg_restore_vel2_state(vcpu); 95 + } else { 96 + if (vcpu_has_nv(vcpu)) { 97 + /* 98 + * Use the guest hypervisor's VPIDR_EL2 when in a 99 + * nested state. The hardware value of MIDR_EL1 gets 100 + * restored on put. 101 + */ 102 + write_sysreg(ctxt_sys_reg(guest_ctxt, VPIDR_EL2), vpidr_el2); 103 + 104 + /* 105 + * As we're restoring a nested guest, set the value 106 + * provided by the guest hypervisor. 107 + */ 108 + mpidr = ctxt_sys_reg(guest_ctxt, VMPIDR_EL2); 109 + } else { 110 + mpidr = ctxt_sys_reg(guest_ctxt, MPIDR_EL1); 111 + } 112 + 113 + __sysreg_restore_el1_state(guest_ctxt, mpidr); 114 + } 219 115 220 116 vcpu_set_flag(vcpu, SYSREGS_ON_CPU); 221 117 } ··· 260 112 261 113 host_ctxt = host_data_ptr(host_ctxt); 262 114 263 - __sysreg_save_el1_state(guest_ctxt); 115 + if (unlikely(__is_hyp_ctxt(guest_ctxt))) 116 + __sysreg_save_vel2_state(vcpu); 117 + else 118 + __sysreg_save_el1_state(guest_ctxt); 119 + 264 120 __sysreg_save_user_state(guest_ctxt); 265 121 __sysreg32_save_state(vcpu); 266 122 267 123 /* Restore host user state */ 268 124 __sysreg_restore_user_state(host_ctxt); 125 + 126 + /* If leaving a nesting guest, restore MIDR_EL1 default view */ 127 + if (vcpu_has_nv(vcpu)) 128 + write_sysreg(read_cpuid_id(), vpidr_el2); 269 129 270 130 vcpu_clear_flag(vcpu, SYSREGS_ON_CPU); 271 131 }
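
Note on the CNTHCTL_EL2 handling in __sysreg_save_vel2_state() above: the hunk folds the bits read through the EL1 view of CNTKCTL into the in-memory CNTHCTL_EL2 copy and leaves the remaining bits untouched. A minimal standalone sketch of that merge follows. The helper name merge_valid_bits() and the mask values used to exercise it are hypothetical and are not kernel definitions.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): replace only the bits
 * selected by valid_mask in 'stored' with the corresponding bits of
 * 'hw'. Every other bit of 'stored' is preserved.
 */
static uint64_t merge_valid_bits(uint64_t stored, uint64_t hw,
				 uint64_t valid_mask)
{
	stored &= ~valid_mask;		/* drop the stale hardware-backed bits */
	stored |= hw & valid_mask;	/* take those bits from the hardware value */
	return stored;
}
```

With a mask of 0x00FF, merging a hardware value of 0x00AB into a stored value of 0xFF00 gives 0xFFAB: the low byte comes from hardware and the high byte stays as it was in memory.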

+2

arch/arm64/kvm/hypercalls.c

@@ -575,6 +575,8 @@
 		case KVM_ARM_PSCI_0_2:
 		case KVM_ARM_PSCI_1_0:
 		case KVM_ARM_PSCI_1_1:
+		case KVM_ARM_PSCI_1_2:
+		case KVM_ARM_PSCI_1_3:
 			if (!wants_02)
 				return -EINVAL;
 			vcpu->kvm->arch.psci_version = val;

+30 -2

arch/arm64/kvm/mmio.c

@@ -72,6 +72,31 @@
 	return data;
 }
 
+static bool kvm_pending_sync_exception(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu_get_flag(vcpu, PENDING_EXCEPTION))
+		return false;
+
+	if (vcpu_el1_is_32bit(vcpu)) {
+		switch (vcpu_get_flag(vcpu, EXCEPT_MASK)) {
+		case unpack_vcpu_flag(EXCEPT_AA32_UND):
+		case unpack_vcpu_flag(EXCEPT_AA32_IABT):
+		case unpack_vcpu_flag(EXCEPT_AA32_DABT):
+			return true;
+		default:
+			return false;
+		}
+	} else {
+		switch (vcpu_get_flag(vcpu, EXCEPT_MASK)) {
+		case unpack_vcpu_flag(EXCEPT_AA64_EL1_SYNC):
+		case unpack_vcpu_flag(EXCEPT_AA64_EL2_SYNC):
+			return true;
+		default:
+			return false;
+		}
+	}
+}
+
 /**
  * kvm_handle_mmio_return -- Handle MMIO loads after user space emulation
  *			     or in-kernel IO emulation
@@ -84,8 +109,11 @@
 	unsigned int len;
 	int mask;
 
-	/* Detect an already handled MMIO return */
-	if (unlikely(!vcpu->mmio_needed))
+	/*
+	 * Detect if the MMIO return was already handled or if userspace aborted
+	 * the MMIO access.
+	 */
+	if (unlikely(!vcpu->mmio_needed || kvm_pending_sync_exception(vcpu)))
 		return 1;
 
 	vcpu->mmio_needed = 0;

+8 -13

arch/arm64/kvm/mmu.c

··· 1451 1451 long vma_pagesize, fault_granule; 1452 1452 enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; 1453 1453 struct kvm_pgtable *pgt; 1454 + struct page *page; 1454 1455 1455 1456 if (fault_is_perm) 1456 1457 fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu); ··· 1573 1572 1574 1573 /* 1575 1574 * Read mmu_invalidate_seq so that KVM can detect if the results of 1576 - * vma_lookup() or __gfn_to_pfn_memslot() become stale prior to 1575 + * vma_lookup() or __kvm_faultin_pfn() become stale prior to 1577 1576 * acquiring kvm->mmu_lock. 1578 1577 * 1579 1578 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs ··· 1582 1581 mmu_seq = vcpu->kvm->mmu_invalidate_seq; 1583 1582 mmap_read_unlock(current->mm); 1584 1583 1585 - pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL, 1586 - write_fault, &writable, NULL); 1584 + pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0, 1585 + &writable, &page); 1587 1586 if (pfn == KVM_PFN_ERR_HWPOISON) { 1588 1587 kvm_send_hwpoison_signal(hva, vma_shift); 1589 1588 return 0; ··· 1596 1595 * If the page was identified as device early by looking at 1597 1596 * the VMA flags, vma_pagesize is already representing the 1598 1597 * largest quantity we can map. If instead it was mapped 1599 - * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE 1598 + * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE 1600 1599 * and must not be upgraded. 
1601 1600 * 1602 1601 * In both cases, we don't let transparent_hugepage_adjust() ··· 1705 1704 } 1706 1705 1707 1706 out_unlock: 1707 + kvm_release_faultin_page(kvm, page, !!ret, writable); 1708 1708 read_unlock(&kvm->mmu_lock); 1709 1709 1710 1710 /* Mark the page dirty only if the fault is handled successfully */ 1711 - if (writable && !ret) { 1712 - kvm_set_pfn_dirty(pfn); 1711 + if (writable && !ret) 1713 1712 mark_page_dirty_in_slot(kvm, memslot, gfn); 1714 - } 1715 1713 1716 - kvm_release_pfn_clean(pfn); 1717 1714 return ret != -EAGAIN ? ret : 0; 1718 1715 } 1719 1716 1720 1717 /* Resolve the access fault by making the page young again. */ 1721 1718 static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa) 1722 1719 { 1723 - kvm_pte_t pte; 1724 1720 struct kvm_s2_mmu *mmu; 1725 1721 1726 1722 trace_kvm_access_fault(fault_ipa); 1727 1723 1728 1724 read_lock(&vcpu->kvm->mmu_lock); 1729 1725 mmu = vcpu->arch.hw_mmu; 1730 - pte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa); 1726 + kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa); 1731 1727 read_unlock(&vcpu->kvm->mmu_lock); 1732 - 1733 - if (kvm_pte_valid(pte)) 1734 - kvm_set_pfn_accessed(kvm_pte_to_pfn(pte)); 1735 1728 } 1736 1729 1737 1730 /**

+73 -9

arch/arm64/kvm/nested.c

··· 917 917 ID_AA64MMFR4_EL1_E2H0_NI_NV1); 918 918 kvm_set_vm_id_reg(kvm, SYS_ID_AA64MMFR4_EL1, val); 919 919 920 - /* Only limited support for PMU, Debug, BPs and WPs */ 920 + /* Only limited support for PMU, Debug, BPs, WPs, and HPMN0 */ 921 921 val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64DFR0_EL1); 922 922 val &= (NV_FTR(DFR0, PMUVer) | 923 923 NV_FTR(DFR0, WRPs) | 924 924 NV_FTR(DFR0, BRPs) | 925 - NV_FTR(DFR0, DebugVer)); 925 + NV_FTR(DFR0, DebugVer) | 926 + NV_FTR(DFR0, HPMN0)); 926 927 927 928 /* Cap Debug to ARMv8.1 */ 928 929 tmp = FIELD_GET(NV_FTR(DFR0, DebugVer), val); ··· 934 933 kvm_set_vm_id_reg(kvm, SYS_ID_AA64DFR0_EL1, val); 935 934 } 936 935 937 - u64 kvm_vcpu_sanitise_vncr_reg(const struct kvm_vcpu *vcpu, enum vcpu_sysreg sr) 936 + u64 kvm_vcpu_apply_reg_masks(const struct kvm_vcpu *vcpu, 937 + enum vcpu_sysreg sr, u64 v) 938 938 { 939 - u64 v = ctxt_sys_reg(&vcpu->arch.ctxt, sr); 940 939 struct kvm_sysreg_masks *masks; 941 940 942 941 masks = vcpu->kvm->arch.sysreg_masks; 943 942 944 943 if (masks) { 945 - sr -= __VNCR_START__; 944 + sr -= __SANITISED_REG_START__; 946 945 947 946 v &= ~masks->mask[sr].res0; 948 947 v |= masks->mask[sr].res1; ··· 953 952 954 953 static void set_sysreg_masks(struct kvm *kvm, int sr, u64 res0, u64 res1) 955 954 { 956 - int i = sr - __VNCR_START__; 955 + int i = sr - __SANITISED_REG_START__; 956 + 957 + BUILD_BUG_ON(!__builtin_constant_p(sr)); 958 + BUILD_BUG_ON(sr < __SANITISED_REG_START__); 959 + BUILD_BUG_ON(sr >= NR_SYS_REGS); 957 960 958 961 kvm->arch.sysreg_masks->mask[i].res0 = res0; 959 962 kvm->arch.sysreg_masks->mask[i].res1 = res1; ··· 1055 1050 res0 |= HCRX_EL2_PTTWI; 1056 1051 if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, SCTLRX, IMP)) 1057 1052 res0 |= HCRX_EL2_SCTLR2En; 1058 - if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, TCRX, IMP)) 1053 + if (!kvm_has_tcr2(kvm)) 1059 1054 res0 |= HCRX_EL2_TCR2En; 1060 1055 if (!kvm_has_feat(kvm, ID_AA64ISAR2_EL1, MOPS, IMP)) 1061 1056 res0 |= (HCRX_EL2_MSCEn | HCRX_EL2_MCE2); ··· 
1106 1101 res0 |= (HFGxTR_EL2_nSMPRI_EL1 | HFGxTR_EL2_nTPIDR2_EL0); 1107 1102 if (!kvm_has_feat(kvm, ID_AA64PFR1_EL1, THE, IMP)) 1108 1103 res0 |= HFGxTR_EL2_nRCWMASK_EL1; 1109 - if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, S1PIE, IMP)) 1104 + if (!kvm_has_s1pie(kvm)) 1110 1105 res0 |= (HFGxTR_EL2_nPIRE0_EL1 | HFGxTR_EL2_nPIR_EL1); 1111 - if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, S1POE, IMP)) 1106 + if (!kvm_has_s1poe(kvm)) 1112 1107 res0 |= (HFGxTR_EL2_nPOR_EL0 | HFGxTR_EL2_nPOR_EL1); 1113 1108 if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, S2POE, IMP)) 1114 1109 res0 |= HFGxTR_EL2_nS2POR_EL1; ··· 1205 1200 res0 |= ~(res0 | res1); 1206 1201 set_sysreg_masks(kvm, HAFGRTR_EL2, res0, res1); 1207 1202 1203 + /* TCR2_EL2 */ 1204 + res0 = TCR2_EL2_RES0; 1205 + res1 = TCR2_EL2_RES1; 1206 + if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, D128, IMP)) 1207 + res0 |= (TCR2_EL2_DisCH0 | TCR2_EL2_DisCH1 | TCR2_EL2_D128); 1208 + if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, MEC, IMP)) 1209 + res0 |= TCR2_EL2_AMEC1 | TCR2_EL2_AMEC0; 1210 + if (!kvm_has_feat(kvm, ID_AA64MMFR1_EL1, HAFDBS, HAFT)) 1211 + res0 |= TCR2_EL2_HAFT; 1212 + if (!kvm_has_feat(kvm, ID_AA64PFR1_EL1, THE, IMP)) 1213 + res0 |= TCR2_EL2_PTTWI | TCR2_EL2_PnCH; 1214 + if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, AIE, IMP)) 1215 + res0 |= TCR2_EL2_AIE; 1216 + if (!kvm_has_s1poe(kvm)) 1217 + res0 |= TCR2_EL2_POE | TCR2_EL2_E0POE; 1218 + if (!kvm_has_s1pie(kvm)) 1219 + res0 |= TCR2_EL2_PIE; 1220 + if (!kvm_has_feat(kvm, ID_AA64MMFR1_EL1, VH, IMP)) 1221 + res0 |= (TCR2_EL2_E0POE | TCR2_EL2_D128 | 1222 + TCR2_EL2_AMEC1 | TCR2_EL2_DisCH0 | TCR2_EL2_DisCH1); 1223 + set_sysreg_masks(kvm, TCR2_EL2, res0, res1); 1224 + 1208 1225 /* SCTLR_EL1 */ 1209 1226 res0 = SCTLR_EL1_RES0; 1210 1227 res1 = SCTLR_EL1_RES1; 1211 1228 if (!kvm_has_feat(kvm, ID_AA64MMFR1_EL1, PAN, PAN3)) 1212 1229 res0 |= SCTLR_EL1_EPAN; 1213 1230 set_sysreg_masks(kvm, SCTLR_EL1, res0, res1); 1231 + 1232 + /* MDCR_EL2 */ 1233 + res0 = MDCR_EL2_RES0; 1234 + res1 = MDCR_EL2_RES1; 1235 
+ if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMUVer, IMP)) 1236 + res0 |= (MDCR_EL2_HPMN | MDCR_EL2_TPMCR | 1237 + MDCR_EL2_TPM | MDCR_EL2_HPME); 1238 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMSVer, IMP)) 1239 + res0 |= MDCR_EL2_E2PB | MDCR_EL2_TPMS; 1240 + if (!kvm_has_feat(kvm, ID_AA64DFR1_EL1, SPMU, IMP)) 1241 + res0 |= MDCR_EL2_EnSPM; 1242 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMUVer, V3P1)) 1243 + res0 |= MDCR_EL2_HPMD; 1244 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, TraceFilt, IMP)) 1245 + res0 |= MDCR_EL2_TTRF; 1246 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMUVer, V3P5)) 1247 + res0 |= MDCR_EL2_HCCD | MDCR_EL2_HLP; 1248 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, TraceBuffer, IMP)) 1249 + res0 |= MDCR_EL2_E2TB; 1250 + if (!kvm_has_feat(kvm, ID_AA64MMFR0_EL1, FGT, IMP)) 1251 + res0 |= MDCR_EL2_TDCC; 1252 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, MTPMU, IMP) || 1253 + kvm_has_feat(kvm, ID_AA64PFR0_EL1, EL3, IMP)) 1254 + res0 |= MDCR_EL2_MTPME; 1255 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMUVer, V3P7)) 1256 + res0 |= MDCR_EL2_HPMFZO; 1257 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMSS, IMP)) 1258 + res0 |= MDCR_EL2_PMSSE; 1259 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMSVer, V1P2)) 1260 + res0 |= MDCR_EL2_HPMFZS; 1261 + if (!kvm_has_feat(kvm, ID_AA64DFR1_EL1, EBEP, IMP)) 1262 + res0 |= MDCR_EL2_PMEE; 1263 + if (!kvm_has_feat(kvm, ID_AA64DFR0_EL1, DebugVer, V8P9)) 1264 + res0 |= MDCR_EL2_EBWE; 1265 + if (!kvm_has_feat(kvm, ID_AA64DFR2_EL1, STEP, IMP)) 1266 + res0 |= MDCR_EL2_EnSTEPOP; 1267 + set_sysreg_masks(kvm, MDCR_EL2, res0, res1); 1214 1268 1215 1269 return 0; 1216 1270 }
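
Note on the RES0/RES1 masks built in this file: kvm_vcpu_apply_reg_masks() clears every bit recorded as RES0 for a register and then forces every bit recorded as RES1. The sketch below shows the same two-step operation in isolation. struct reg_mask and apply_reg_masks() are simplified, hypothetical stand-ins for the kernel's per-register mask table, and the values in the usage note are made up.

```c
#include <stdint.h>

/* Hypothetical stand-in for one entry of the per-register mask table. */
struct reg_mask {
	uint64_t res0;	/* bits that must read as zero */
	uint64_t res1;	/* bits that must read as one */
};

/* Clear the RES0 bits first, then set the RES1 bits, as the diff does. */
static uint64_t apply_reg_masks(struct reg_mask m, uint64_t v)
{
	v &= ~m.res0;
	v |= m.res1;
	return v;
}
```

For example, with res0 = 0xF0 and res1 = 0x100, an input of 0xFF becomes 0x10F: bits 7:4 are dropped and bit 8 is forced on. This is why each feature check in the hunk only needs to OR more bits into res0 when a feature is absent.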

+124 -19

arch/arm64/kvm/pmu-emul.c

··· 89 89 90 90 static bool kvm_pmc_has_64bit_overflow(struct kvm_pmc *pmc) 91 91 { 92 - u64 val = kvm_vcpu_read_pmcr(kvm_pmc_to_vcpu(pmc)); 92 + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 93 + u64 val = kvm_vcpu_read_pmcr(vcpu); 94 + 95 + if (kvm_pmu_counter_is_hyp(vcpu, pmc->idx)) 96 + return __vcpu_sys_reg(vcpu, MDCR_EL2) & MDCR_EL2_HLP; 93 97 94 98 return (pmc->idx < ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LP)) || 95 99 (pmc->idx == ARMV8_PMU_CYCLE_IDX && (val & ARMV8_PMU_PMCR_LC)); ··· 113 109 static u32 counter_index_to_evtreg(u64 idx) 114 110 { 115 111 return (idx == ARMV8_PMU_CYCLE_IDX) ? PMCCFILTR_EL0 : PMEVTYPER0_EL0 + idx; 112 + } 113 + 114 + static u64 kvm_pmc_read_evtreg(const struct kvm_pmc *pmc) 115 + { 116 + return __vcpu_sys_reg(kvm_pmc_to_vcpu(pmc), counter_index_to_evtreg(pmc->idx)); 116 117 } 117 118 118 119 static u64 kvm_pmu_get_pmc_value(struct kvm_pmc *pmc) ··· 253 244 */ 254 245 void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu) 255 246 { 256 - unsigned long mask = kvm_pmu_valid_counter_mask(vcpu); 247 + unsigned long mask = kvm_pmu_implemented_counter_mask(vcpu); 257 248 int i; 258 249 259 250 for_each_set_bit(i, &mask, 32) ··· 274 265 irq_work_sync(&vcpu->arch.pmu.overflow_work); 275 266 } 276 267 277 - u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu) 268 + bool kvm_pmu_counter_is_hyp(struct kvm_vcpu *vcpu, unsigned int idx) 269 + { 270 + unsigned int hpmn; 271 + 272 + if (!vcpu_has_nv(vcpu) || idx == ARMV8_PMU_CYCLE_IDX) 273 + return false; 274 + 275 + /* 276 + * Programming HPMN=0 is CONSTRAINED UNPREDICTABLE if FEAT_HPMN0 isn't 277 + * implemented. Since KVM's ability to emulate HPMN=0 does not directly 278 + * depend on hardware (all PMU registers are trapped), make the 279 + * implementation choice that all counters are included in the second 280 + * range reserved for EL2/EL3. 
281 + */ 282 + hpmn = SYS_FIELD_GET(MDCR_EL2, HPMN, __vcpu_sys_reg(vcpu, MDCR_EL2)); 283 + return idx >= hpmn; 284 + } 285 + 286 + u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu) 287 + { 288 + u64 mask = kvm_pmu_implemented_counter_mask(vcpu); 289 + u64 hpmn; 290 + 291 + if (!vcpu_has_nv(vcpu) || vcpu_is_el2(vcpu)) 292 + return mask; 293 + 294 + hpmn = SYS_FIELD_GET(MDCR_EL2, HPMN, __vcpu_sys_reg(vcpu, MDCR_EL2)); 295 + return mask & ~GENMASK(vcpu->kvm->arch.pmcr_n - 1, hpmn); 296 + } 297 + 298 + u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu) 278 299 { 279 300 u64 val = FIELD_GET(ARMV8_PMU_PMCR_N, kvm_vcpu_read_pmcr(vcpu)); 280 301 ··· 613 574 kvm_pmu_set_counter_value(vcpu, ARMV8_PMU_CYCLE_IDX, 0); 614 575 615 576 if (val & ARMV8_PMU_PMCR_P) { 616 - unsigned long mask = kvm_pmu_valid_counter_mask(vcpu); 577 + unsigned long mask = kvm_pmu_accessible_counter_mask(vcpu); 617 578 mask &= ~BIT(ARMV8_PMU_CYCLE_IDX); 618 579 for_each_set_bit(i, &mask, 32) 619 580 kvm_pmu_set_pmc_value(kvm_vcpu_idx_to_pmc(vcpu, i), 0, true); ··· 624 585 static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc) 625 586 { 626 587 struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 627 - return (kvm_vcpu_read_pmcr(vcpu) & ARMV8_PMU_PMCR_E) && 628 - (__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(pmc->idx)); 588 + unsigned int mdcr = __vcpu_sys_reg(vcpu, MDCR_EL2); 589 + 590 + if (!(__vcpu_sys_reg(vcpu, PMCNTENSET_EL0) & BIT(pmc->idx))) 591 + return false; 592 + 593 + if (kvm_pmu_counter_is_hyp(vcpu, pmc->idx)) 594 + return mdcr & MDCR_EL2_HPME; 595 + 596 + return kvm_vcpu_read_pmcr(vcpu) & ARMV8_PMU_PMCR_E; 597 + } 598 + 599 + static bool kvm_pmc_counts_at_el0(struct kvm_pmc *pmc) 600 + { 601 + u64 evtreg = kvm_pmc_read_evtreg(pmc); 602 + bool nsu = evtreg & ARMV8_PMU_EXCLUDE_NS_EL0; 603 + bool u = evtreg & ARMV8_PMU_EXCLUDE_EL0; 604 + 605 + return u == nsu; 606 + } 607 + 608 + static bool kvm_pmc_counts_at_el1(struct kvm_pmc *pmc) 609 + { 610 + u64 evtreg = 
kvm_pmc_read_evtreg(pmc); 611 + bool nsk = evtreg & ARMV8_PMU_EXCLUDE_NS_EL1; 612 + bool p = evtreg & ARMV8_PMU_EXCLUDE_EL1; 613 + 614 + return p == nsk; 615 + } 616 + 617 + static bool kvm_pmc_counts_at_el2(struct kvm_pmc *pmc) 618 + { 619 + struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc); 620 + u64 mdcr = __vcpu_sys_reg(vcpu, MDCR_EL2); 621 + 622 + if (!kvm_pmu_counter_is_hyp(vcpu, pmc->idx) && (mdcr & MDCR_EL2_HPMD)) 623 + return false; 624 + 625 + return kvm_pmc_read_evtreg(pmc) & ARMV8_PMU_INCLUDE_EL2; 629 626 } 630 627 631 628 /** ··· 674 599 struct arm_pmu *arm_pmu = vcpu->kvm->arch.arm_pmu; 675 600 struct perf_event *event; 676 601 struct perf_event_attr attr; 677 - u64 eventsel, reg, data; 678 - bool p, u, nsk, nsu; 602 + u64 eventsel, evtreg; 679 603 680 - reg = counter_index_to_evtreg(pmc->idx); 681 - data = __vcpu_sys_reg(vcpu, reg); 604 + evtreg = kvm_pmc_read_evtreg(pmc); 682 605 683 606 kvm_pmu_stop_counter(pmc); 684 607 if (pmc->idx == ARMV8_PMU_CYCLE_IDX) 685 608 eventsel = ARMV8_PMUV3_PERFCTR_CPU_CYCLES; 686 609 else 687 - eventsel = data & kvm_pmu_event_mask(vcpu->kvm); 610 + eventsel = evtreg & kvm_pmu_event_mask(vcpu->kvm); 688 611 689 612 /* 690 613 * Neither SW increment nor chained events need to be backed ··· 700 627 !test_bit(eventsel, vcpu->kvm->arch.pmu_filter)) 701 628 return; 702 629 703 - p = data & ARMV8_PMU_EXCLUDE_EL1; 704 - u = data & ARMV8_PMU_EXCLUDE_EL0; 705 - nsk = data & ARMV8_PMU_EXCLUDE_NS_EL1; 706 - nsu = data & ARMV8_PMU_EXCLUDE_NS_EL0; 707 - 708 630 memset(&attr, 0, sizeof(struct perf_event_attr)); 709 631 attr.type = arm_pmu->pmu.type; 710 632 attr.size = sizeof(attr); 711 633 attr.pinned = 1; 712 634 attr.disabled = !kvm_pmu_counter_is_enabled(pmc); 713 - attr.exclude_user = (u != nsu); 714 - attr.exclude_kernel = (p != nsk); 635 + attr.exclude_user = !kvm_pmc_counts_at_el0(pmc); 715 636 attr.exclude_hv = 1; /* Don't count EL2 events */ 716 637 attr.exclude_host = 1; /* Don't count host events */ 717 638 attr.config = 
eventsel; 639 + 640 + /* 641 + * Filter events at EL1 (i.e. vEL2) when in a hyp context based on the 642 + * guest's EL2 filter. 643 + */ 644 + if (unlikely(is_hyp_ctxt(vcpu))) 645 + attr.exclude_kernel = !kvm_pmc_counts_at_el2(pmc); 646 + else 647 + attr.exclude_kernel = !kvm_pmc_counts_at_el1(pmc); 718 648 719 649 /* 720 650 * If counting with a 64bit counter, advertise it to the perf ··· 880 804 881 805 void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu) 882 806 { 883 - u64 mask = kvm_pmu_valid_counter_mask(vcpu); 807 + u64 mask = kvm_pmu_implemented_counter_mask(vcpu); 884 808 885 809 kvm_pmu_handle_pmcr(vcpu, kvm_vcpu_read_pmcr(vcpu)); 886 810 ··· 1214 1138 u64 pmcr = __vcpu_sys_reg(vcpu, PMCR_EL0); 1215 1139 1216 1140 return u64_replace_bits(pmcr, vcpu->kvm->arch.pmcr_n, ARMV8_PMU_PMCR_N); 1141 + } 1142 + 1143 + void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu) 1144 + { 1145 + bool reprogrammed = false; 1146 + unsigned long mask; 1147 + int i; 1148 + 1149 + if (!kvm_vcpu_has_pmu(vcpu)) 1150 + return; 1151 + 1152 + mask = __vcpu_sys_reg(vcpu, PMCNTENSET_EL0); 1153 + for_each_set_bit(i, &mask, 32) { 1154 + struct kvm_pmc *pmc = kvm_vcpu_idx_to_pmc(vcpu, i); 1155 + 1156 + /* 1157 + * We only need to reconfigure events where the filter is 1158 + * different at EL1 vs. EL2, as we're multiplexing the true EL1 1159 + * event filter bit for nested. 1160 + */ 1161 + if (kvm_pmc_counts_at_el1(pmc) == kvm_pmc_counts_at_el2(pmc)) 1162 + continue; 1163 + 1164 + kvm_pmu_create_perf_event(pmc); 1165 + reprogrammed = true; 1166 + } 1167 + 1168 + if (reprogrammed) 1169 + kvm_vcpu_pmu_restore_guest(vcpu); 1217 1170 }
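
Note on the HPMN split used above: MDCR_EL2.HPMN divides the event counters into a range usable below EL2 (indices below HPMN) and a range reserved for the guest hypervisor (indices from HPMN up to N-1), while the cycle counter is never treated as a hyp counter. The sketch below restates that arithmetic with plain integers. It is a simplified illustration under stated assumptions: the function names are hypothetical, the shape of implemented_mask() is assumed because that function's body is not part of the hunk, and the explicit hpmn >= n guard is an addition for the sketch only.

```c
#include <stdbool.h>
#include <stdint.h>

#define CYCLE_IDX 31u	/* index of the cycle counter (ARMV8_PMU_CYCLE_IDX) */

/* Assumed shape: event counters [n-1:0] plus the cycle counter, n <= 31. */
static uint64_t implemented_mask(unsigned int n)
{
	uint64_t cycle = 1ull << CYCLE_IDX;

	if (n == 0)
		return cycle;
	return ((1ull << n) - 1) | cycle;
}

/* A counter belongs to the hyp range when its index is at or above HPMN. */
static bool counter_is_hyp(unsigned int idx, unsigned int hpmn)
{
	if (idx == CYCLE_IDX)
		return false;
	return idx >= hpmn;
}

/* Counters visible below EL2: implemented ones minus the range [n-1:hpmn]. */
static uint64_t accessible_mask(unsigned int n, unsigned int hpmn)
{
	uint64_t mask = implemented_mask(n);
	uint64_t reserved;

	if (hpmn >= n)		/* sketch-only guard: nothing is reserved */
		return mask;

	reserved = ((1ull << n) - 1) & ~((1ull << hpmn) - 1);
	return mask & ~reserved;
}
```

With six event counters and HPMN = 4, counters 0 to 3 and the cycle counter stay accessible and counters 4 and 5 are reserved. With HPMN = 0, only the cycle counter remains, which matches the implementation choice described in the comment in kvm_pmu_counter_is_hyp().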

+43 -1

arch/arm64/kvm/psci.c

··· 194 194 kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_SHUTDOWN, 0); 195 195 } 196 196 197 + static void kvm_psci_system_off2(struct kvm_vcpu *vcpu) 198 + { 199 + kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_SHUTDOWN, 200 + KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2); 201 + } 202 + 197 203 static void kvm_psci_system_reset(struct kvm_vcpu *vcpu) 198 204 { 199 205 kvm_prepare_system_event(vcpu, KVM_SYSTEM_EVENT_RESET, 0); ··· 328 322 329 323 switch(psci_fn) { 330 324 case PSCI_0_2_FN_PSCI_VERSION: 331 - val = minor == 0 ? KVM_ARM_PSCI_1_0 : KVM_ARM_PSCI_1_1; 325 + val = PSCI_VERSION(1, minor); 332 326 break; 333 327 case PSCI_1_0_FN_PSCI_FEATURES: 334 328 arg = smccc_get_arg1(vcpu); ··· 364 358 if (minor >= 1) 365 359 val = 0; 366 360 break; 361 + case PSCI_1_3_FN_SYSTEM_OFF2: 362 + case PSCI_1_3_FN64_SYSTEM_OFF2: 363 + if (minor >= 3) 364 + val = PSCI_1_3_OFF_TYPE_HIBERNATE_OFF; 365 + break; 367 366 } 368 367 break; 369 368 case PSCI_1_0_FN_SYSTEM_SUSPEND: ··· 402 391 val = PSCI_RET_INVALID_PARAMS; 403 392 break; 404 393 } 394 + break; 395 + case PSCI_1_3_FN_SYSTEM_OFF2: 396 + kvm_psci_narrow_to_32bit(vcpu); 397 + fallthrough; 398 + case PSCI_1_3_FN64_SYSTEM_OFF2: 399 + if (minor < 3) 400 + break; 401 + 402 + arg = smccc_get_arg1(vcpu); 403 + /* 404 + * SYSTEM_OFF2 defaults to HIBERNATE_OFF if arg1 is zero. arg2 405 + * must be zero. 406 + */ 407 + if ((arg && arg != PSCI_1_3_OFF_TYPE_HIBERNATE_OFF) || 408 + smccc_get_arg2(vcpu) != 0) { 409 + val = PSCI_RET_INVALID_PARAMS; 410 + break; 411 + } 412 + kvm_psci_system_off2(vcpu); 413 + /* 414 + * We shouldn't be going back to the guest after receiving a 415 + * SYSTEM_OFF2 request. Preload a return value of 416 + * INTERNAL_FAILURE should userspace ignore the exit and resume 417 + * the vCPU. 
418 + */ 419 + val = PSCI_RET_INTERNAL_FAILURE; 420 + ret = 0; 405 421 break; 406 422 default: 407 423 return kvm_psci_0_2_call(vcpu); ··· 487 449 } 488 450 489 451 switch (version) { 452 + case KVM_ARM_PSCI_1_3: 453 + return kvm_psci_1_x_call(vcpu, 3); 454 + case KVM_ARM_PSCI_1_2: 455 + return kvm_psci_1_x_call(vcpu, 2); 490 456 case KVM_ARM_PSCI_1_1: 491 457 return kvm_psci_1_x_call(vcpu, 1); 492 458 case KVM_ARM_PSCI_1_0:
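
Note on the SYSTEM_OFF2 argument check above: the handler accepts arg1 equal to zero (which defaults to HIBERNATE_OFF) or equal to the HIBERNATE_OFF type, and requires arg2 to be zero. Anything else yields INVALID_PARAMS. The sketch below isolates that predicate. The function name is hypothetical, and the numeric value given to OFF_TYPE_HIBERNATE_OFF is an illustrative assumption, not taken from this diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative value only; the real constant lives in the PSCI uapi header. */
#define OFF_TYPE_HIBERNATE_OFF 1u

/*
 * Hypothetical predicate mirroring the check in the hunk: arg1 must be
 * zero or HIBERNATE_OFF, and arg2 must be zero.
 */
static bool system_off2_args_valid(uint64_t arg1, uint64_t arg2)
{
	if (arg1 && arg1 != OFF_TYPE_HIBERNATE_OFF)
		return false;
	return arg2 == 0;
}
```

The kernel code expresses the same condition inverted, as the test that selects the error path, and only calls kvm_psci_system_off2() when it does not trigger.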

-5

arch/arm64/kvm/reset.c

@@ -167,11 +167,6 @@
 		memset(vcpu->arch.sve_state, 0, vcpu_sve_state_size(vcpu));
 }
 
-static void kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
-{
-	vcpu_set_flag(vcpu, GUEST_HAS_PTRAUTH);
-}
-
 /**
  * kvm_reset_vcpu - sets core registers and sys_regs to reset value
  * @vcpu: The VCPU pointer

+250 -59

arch/arm64/kvm/sys_regs.c

··· 110 110 PURE_EL2_SYSREG( RVBAR_EL2 ); 111 111 PURE_EL2_SYSREG( TPIDR_EL2 ); 112 112 PURE_EL2_SYSREG( HPFAR_EL2 ); 113 + PURE_EL2_SYSREG( HCRX_EL2 ); 114 + PURE_EL2_SYSREG( HFGRTR_EL2 ); 115 + PURE_EL2_SYSREG( HFGWTR_EL2 ); 116 + PURE_EL2_SYSREG( HFGITR_EL2 ); 117 + PURE_EL2_SYSREG( HDFGRTR_EL2 ); 118 + PURE_EL2_SYSREG( HDFGWTR_EL2 ); 119 + PURE_EL2_SYSREG( HAFGRTR_EL2 ); 120 + PURE_EL2_SYSREG( CNTVOFF_EL2 ); 113 121 PURE_EL2_SYSREG( CNTHCTL_EL2 ); 114 122 MAPPED_EL2_SYSREG(SCTLR_EL2, SCTLR_EL1, 115 123 translate_sctlr_el2_to_sctlr_el1 ); ··· 134 126 MAPPED_EL2_SYSREG(ESR_EL2, ESR_EL1, NULL ); 135 127 MAPPED_EL2_SYSREG(FAR_EL2, FAR_EL1, NULL ); 136 128 MAPPED_EL2_SYSREG(MAIR_EL2, MAIR_EL1, NULL ); 129 + MAPPED_EL2_SYSREG(TCR2_EL2, TCR2_EL1, NULL ); 130 + MAPPED_EL2_SYSREG(PIR_EL2, PIR_EL1, NULL ); 131 + MAPPED_EL2_SYSREG(PIRE0_EL2, PIRE0_EL1, NULL ); 132 + MAPPED_EL2_SYSREG(POR_EL2, POR_EL1, NULL ); 137 133 MAPPED_EL2_SYSREG(AMAIR_EL2, AMAIR_EL1, NULL ); 138 134 MAPPED_EL2_SYSREG(ELR_EL2, ELR_EL1, NULL ); 139 135 MAPPED_EL2_SYSREG(SPSR_EL2, SPSR_EL1, NULL ); 140 136 MAPPED_EL2_SYSREG(ZCR_EL2, ZCR_EL1, NULL ); 137 + MAPPED_EL2_SYSREG(CONTEXTIDR_EL2, CONTEXTIDR_EL1, NULL ); 141 138 default: 142 139 return false; 143 140 } ··· 162 149 goto memory_read; 163 150 164 151 /* 152 + * CNTHCTL_EL2 requires some special treatment to 153 + * account for the bits that can be set via CNTKCTL_EL1. 154 + */ 155 + switch (reg) { 156 + case CNTHCTL_EL2: 157 + if (vcpu_el2_e2h_is_set(vcpu)) { 158 + val = read_sysreg_el1(SYS_CNTKCTL); 159 + val &= CNTKCTL_VALID_BITS; 160 + val |= __vcpu_sys_reg(vcpu, reg) & ~CNTKCTL_VALID_BITS; 161 + return val; 162 + } 163 + break; 164 + } 165 + 166 + /* 165 167 * If this register does not have an EL1 counterpart, 166 168 * then read the stored EL2 version. 167 169 */ ··· 193 165 194 166 /* Get the current version of the EL1 counterpart. 
*/ 195 167 WARN_ON(!__vcpu_read_sys_reg_from_cpu(el1r, &val)); 168 + if (reg >= __SANITISED_REG_START__) 169 + val = kvm_vcpu_apply_reg_masks(vcpu, reg, val); 170 + 196 171 return val; 197 172 } 198 173 ··· 228 197 * non-VHE guest hypervisor. 229 198 */ 230 199 __vcpu_sys_reg(vcpu, reg) = val; 200 + 201 + switch (reg) { 202 + case CNTHCTL_EL2: 203 + /* 204 + * If E2H=0, CNHTCTL_EL2 is a pure shadow register. 205 + * Otherwise, some of the bits are backed by 206 + * CNTKCTL_EL1, while the rest is kept in memory. 207 + * Yes, this is fun stuff. 208 + */ 209 + if (vcpu_el2_e2h_is_set(vcpu)) 210 + write_sysreg_el1(val, SYS_CNTKCTL); 211 + return; 212 + } 231 213 232 214 /* No EL1 counterpart? We're done here.? */ 233 215 if (reg == el1r) ··· 433 389 { 434 390 bool was_enabled = vcpu_has_cache_enabled(vcpu); 435 391 u64 val, mask, shift; 436 - 437 - if (reg_to_encoding(r) == SYS_TCR2_EL1 && 438 - !kvm_has_feat(vcpu->kvm, ID_AA64MMFR3_EL1, TCRX, IMP)) 439 - return undef_access(vcpu, p, r); 440 392 441 393 BUG_ON(!p->is_write); 442 394 ··· 1168 1128 { 1169 1129 bool set; 1170 1130 1171 - val &= kvm_pmu_valid_counter_mask(vcpu); 1131 + val &= kvm_pmu_accessible_counter_mask(vcpu); 1172 1132 1173 1133 switch (r->reg) { 1174 1134 case PMOVSSET_EL0: ··· 1191 1151 1192 1152 static int get_pmreg(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r, u64 *val) 1193 1153 { 1194 - u64 mask = kvm_pmu_valid_counter_mask(vcpu); 1154 + u64 mask = kvm_pmu_accessible_counter_mask(vcpu); 1195 1155 1196 1156 *val = __vcpu_sys_reg(vcpu, r->reg) & mask; 1197 1157 return 0; ··· 1205 1165 if (pmu_access_el0_disabled(vcpu)) 1206 1166 return false; 1207 1167 1208 - mask = kvm_pmu_valid_counter_mask(vcpu); 1168 + mask = kvm_pmu_accessible_counter_mask(vcpu); 1209 1169 if (p->is_write) { 1210 1170 val = p->regval & mask; 1211 1171 if (r->Op2 & 0x1) { ··· 1228 1188 static bool access_pminten(struct kvm_vcpu *vcpu, struct sys_reg_params *p, 1229 1189 const struct sys_reg_desc *r) 1230 1190 { 1231 - 
u64 mask = kvm_pmu_valid_counter_mask(vcpu); 1191 + u64 mask = kvm_pmu_accessible_counter_mask(vcpu); 1232 1192 1233 1193 if (check_pmu_access_disabled(vcpu, 0)) 1234 1194 return false; ··· 1252 1212 static bool access_pmovs(struct kvm_vcpu *vcpu, struct sys_reg_params *p, 1253 1213 const struct sys_reg_desc *r) 1254 1214 { 1255 - u64 mask = kvm_pmu_valid_counter_mask(vcpu); 1215 + u64 mask = kvm_pmu_accessible_counter_mask(vcpu); 1256 1216 1257 1217 if (pmu_access_el0_disabled(vcpu)) 1258 1218 return false; ··· 1282 1242 if (pmu_write_swinc_el0_disabled(vcpu)) 1283 1243 return false; 1284 1244 1285 - mask = kvm_pmu_valid_counter_mask(vcpu); 1245 + mask = kvm_pmu_accessible_counter_mask(vcpu); 1286 1246 kvm_pmu_software_increment(vcpu, p->regval & mask); 1287 1247 return true; 1288 1248 } ··· 1549 1509 } 1550 1510 } 1551 1511 1512 + static u64 sanitise_id_aa64pfr0_el1(const struct kvm_vcpu *vcpu, u64 val); 1513 + static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val); 1514 + 1552 1515 /* Read a sanitised cpufeature ID register by sys_reg_desc */ 1553 1516 static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu, 1554 1517 const struct sys_reg_desc *r) ··· 1565 1522 val = read_sanitised_ftr_reg(id); 1566 1523 1567 1524 switch (id) { 1525 + case SYS_ID_AA64DFR0_EL1: 1526 + val = sanitise_id_aa64dfr0_el1(vcpu, val); 1527 + break; 1528 + case SYS_ID_AA64PFR0_EL1: 1529 + val = sanitise_id_aa64pfr0_el1(vcpu, val); 1530 + break; 1568 1531 case SYS_ID_AA64PFR1_EL1: 1569 1532 if (!kvm_has_mte(vcpu->kvm)) 1570 1533 val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTE); ··· 1584 1535 val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTEX); 1585 1536 val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_DF2); 1586 1537 val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_PFAR); 1538 + val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MPAM_frac); 1587 1539 break; 1588 1540 case SYS_ID_AA64PFR2_EL1: 1589 1541 /* We only expose FPMR */ ··· 1742 1692 return REG_HIDDEN; 1743 1693 } 1744 1694 
1745 - static u64 read_sanitised_id_aa64pfr0_el1(struct kvm_vcpu *vcpu, 1746 - const struct sys_reg_desc *rd) 1695 + static u64 sanitise_id_aa64pfr0_el1(const struct kvm_vcpu *vcpu, u64 val) 1747 1696 { 1748 - u64 val = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1); 1749 - 1750 1697 if (!vcpu_has_sve(vcpu)) 1751 1698 val &= ~ID_AA64PFR0_EL1_SVE_MASK; 1752 1699 ··· 1771 1724 1772 1725 val &= ~ID_AA64PFR0_EL1_AMU_MASK; 1773 1726 1727 + /* 1728 + * MPAM is disabled by default as KVM also needs a set of PARTID to 1729 + * program the MPAMVPMx_EL2 PARTID remapping registers with. But some 1730 + * older kernels let the guest see the ID bit. 1731 + */ 1732 + val &= ~ID_AA64PFR0_EL1_MPAM_MASK; 1733 + 1774 1734 return val; 1775 1735 } 1776 1736 ··· 1791 1737 (val); \ 1792 1738 }) 1793 1739 1794 - static u64 read_sanitised_id_aa64dfr0_el1(struct kvm_vcpu *vcpu, 1795 - const struct sys_reg_desc *rd) 1740 + static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val) 1796 1741 { 1797 - u64 val = read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1); 1798 - 1799 1742 val = ID_REG_LIMIT_FIELD_ENUM(val, ID_AA64DFR0_EL1, DebugVer, V8P8); 1800 1743 1801 1744 /* ··· 1883 1832 return -EINVAL; 1884 1833 1885 1834 return set_id_reg(vcpu, rd, val); 1835 + } 1836 + 1837 + static int set_id_aa64pfr0_el1(struct kvm_vcpu *vcpu, 1838 + const struct sys_reg_desc *rd, u64 user_val) 1839 + { 1840 + u64 hw_val = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1); 1841 + u64 mpam_mask = ID_AA64PFR0_EL1_MPAM_MASK; 1842 + 1843 + /* 1844 + * Commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits 1845 + * in ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to 1846 + * guests, but didn't add trap handling. KVM doesn't support MPAM and 1847 + * always returns an UNDEF for these registers. The guest must see 0 1848 + * for this field. 1849 + * 1850 + * But KVM must also accept values from user-space that were provided 1851 + * by KVM. 
On CPUs that support MPAM, permit user-space to write 1852 + * the sanitizied value to ID_AA64PFR0_EL1.MPAM, but ignore this field. 1853 + */ 1854 + if ((hw_val & mpam_mask) == (user_val & mpam_mask)) 1855 + user_val &= ~ID_AA64PFR0_EL1_MPAM_MASK; 1856 + 1857 + return set_id_reg(vcpu, rd, user_val); 1858 + } 1859 + 1860 + static int set_id_aa64pfr1_el1(struct kvm_vcpu *vcpu, 1861 + const struct sys_reg_desc *rd, u64 user_val) 1862 + { 1863 + u64 hw_val = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1); 1864 + u64 mpam_mask = ID_AA64PFR1_EL1_MPAM_frac_MASK; 1865 + 1866 + /* See set_id_aa64pfr0_el1 for comment about MPAM */ 1867 + if ((hw_val & mpam_mask) == (user_val & mpam_mask)) 1868 + user_val &= ~ID_AA64PFR1_EL1_MPAM_frac_MASK; 1869 + 1870 + return set_id_reg(vcpu, rd, user_val); 1871 + } 1872 + 1873 + static int set_ctr_el0(struct kvm_vcpu *vcpu, 1874 + const struct sys_reg_desc *rd, u64 user_val) 1875 + { 1876 + u8 user_L1Ip = SYS_FIELD_GET(CTR_EL0, L1Ip, user_val); 1877 + 1878 + /* 1879 + * Both AIVIVT (0b01) and VPIPT (0b00) are documented as reserved. 1880 + * Hence only allow to set VIPT(0b10) or PIPT(0b11) for L1Ip based 1881 + * on what hardware reports. 1882 + * 1883 + * Using a VIPT software model on PIPT will lead to over invalidation, 1884 + * but still correct. Hence, we can allow downgrading PIPT to VIPT, 1885 + * but not the other way around. This is handled via arm64_ftr_safe_value() 1886 + * as CTR_EL0 ftr_bits has L1Ip field with type FTR_EXACT and safe value 1887 + * set as VIPT. 
1888 + */ 1889 + switch (user_L1Ip) { 1890 + case CTR_EL0_L1Ip_RESERVED_VPIPT: 1891 + case CTR_EL0_L1Ip_RESERVED_AIVIVT: 1892 + return -EINVAL; 1893 + case CTR_EL0_L1Ip_VIPT: 1894 + case CTR_EL0_L1Ip_PIPT: 1895 + return set_id_reg(vcpu, rd, user_val); 1896 + default: 1897 + return -ENOENT; 1898 + } 1886 1899 } 1887 1900 1888 1901 /* ··· 2219 2104 .val = v, \ 2220 2105 } 2221 2106 2107 + #define EL2_REG_FILTERED(name, acc, rst, v, filter) { \ 2108 + SYS_DESC(SYS_##name), \ 2109 + .access = acc, \ 2110 + .reset = rst, \ 2111 + .reg = name, \ 2112 + .visibility = filter, \ 2113 + .val = v, \ 2114 + } 2115 + 2222 2116 #define EL2_REG_VNCR(name, rst, v) EL2_REG(name, bad_vncr_trap, rst, v) 2223 2117 #define EL2_REG_REDIR(name, rst, v) EL2_REG(name, bad_redir_trap, rst, v) 2224 2118 ··· 2272 2148 .visibility = id_visibility, \ 2273 2149 .reset = kvm_read_sanitised_id_reg, \ 2274 2150 .val = mask, \ 2151 + } 2152 + 2153 + /* sys_reg_desc initialiser for cpufeature ID registers that need filtering */ 2154 + #define ID_FILTERED(sysreg, name, mask) { \ 2155 + ID_DESC(sysreg), \ 2156 + .set_user = set_##name, \ 2157 + .visibility = id_visibility, \ 2158 + .reset = kvm_read_sanitised_id_reg, \ 2159 + .val = (mask), \ 2275 2160 } 2276 2161 2277 2162 /* ··· 2369 2236 return __vcpu_sys_reg(vcpu, r->reg) = val; 2370 2237 } 2371 2238 2239 + static unsigned int __el2_visibility(const struct kvm_vcpu *vcpu, 2240 + const struct sys_reg_desc *rd, 2241 + unsigned int (*fn)(const struct kvm_vcpu *, 2242 + const struct sys_reg_desc *)) 2243 + { 2244 + return el2_visibility(vcpu, rd) ?: fn(vcpu, rd); 2245 + } 2246 + 2372 2247 static unsigned int sve_el2_visibility(const struct kvm_vcpu *vcpu, 2373 2248 const struct sys_reg_desc *rd) 2374 2249 { 2375 - unsigned int r; 2376 - 2377 - r = el2_visibility(vcpu, rd); 2378 - if (r) 2379 - return r; 2380 - 2381 - return sve_visibility(vcpu, rd); 2250 + return __el2_visibility(vcpu, rd, sve_visibility); 2382 2251 } 2383 2252 2384 2253 static bool 
access_zcr_el2(struct kvm_vcpu *vcpu, ··· 2408 2273 static unsigned int s1poe_visibility(const struct kvm_vcpu *vcpu, 2409 2274 const struct sys_reg_desc *rd) 2410 2275 { 2411 - if (kvm_has_feat(vcpu->kvm, ID_AA64MMFR3_EL1, S1POE, IMP)) 2276 + if (kvm_has_s1poe(vcpu->kvm)) 2412 2277 return 0; 2413 2278 2414 2279 return REG_HIDDEN; 2280 + } 2281 + 2282 + static unsigned int s1poe_el2_visibility(const struct kvm_vcpu *vcpu, 2283 + const struct sys_reg_desc *rd) 2284 + { 2285 + return __el2_visibility(vcpu, rd, s1poe_visibility); 2286 + } 2287 + 2288 + static unsigned int tcr2_visibility(const struct kvm_vcpu *vcpu, 2289 + const struct sys_reg_desc *rd) 2290 + { 2291 + if (kvm_has_tcr2(vcpu->kvm)) 2292 + return 0; 2293 + 2294 + return REG_HIDDEN; 2295 + } 2296 + 2297 + static unsigned int tcr2_el2_visibility(const struct kvm_vcpu *vcpu, 2298 + const struct sys_reg_desc *rd) 2299 + { 2300 + return __el2_visibility(vcpu, rd, tcr2_visibility); 2301 + } 2302 + 2303 + static unsigned int s1pie_visibility(const struct kvm_vcpu *vcpu, 2304 + const struct sys_reg_desc *rd) 2305 + { 2306 + if (kvm_has_s1pie(vcpu->kvm)) 2307 + return 0; 2308 + 2309 + return REG_HIDDEN; 2310 + } 2311 + 2312 + static unsigned int s1pie_el2_visibility(const struct kvm_vcpu *vcpu, 2313 + const struct sys_reg_desc *rd) 2314 + { 2315 + return __el2_visibility(vcpu, rd, s1pie_visibility); 2415 2316 } 2416 2317 2417 2318 /* ··· 2545 2374 2546 2375 /* AArch64 ID registers */ 2547 2376 /* CRm=4 */ 2548 - { SYS_DESC(SYS_ID_AA64PFR0_EL1), 2549 - .access = access_id_reg, 2550 - .get_user = get_id_reg, 2551 - .set_user = set_id_reg, 2552 - .reset = read_sanitised_id_aa64pfr0_el1, 2553 - .val = ~(ID_AA64PFR0_EL1_AMU | 2554 - ID_AA64PFR0_EL1_MPAM | 2555 - ID_AA64PFR0_EL1_SVE | 2556 - ID_AA64PFR0_EL1_RAS | 2557 - ID_AA64PFR0_EL1_AdvSIMD | 2558 - ID_AA64PFR0_EL1_FP), }, 2559 - ID_WRITABLE(ID_AA64PFR1_EL1, ~(ID_AA64PFR1_EL1_PFAR | 2377 + ID_FILTERED(ID_AA64PFR0_EL1, id_aa64pfr0_el1, 2378 + ~(ID_AA64PFR0_EL1_AMU | 
2379 + ID_AA64PFR0_EL1_MPAM | 2380 + ID_AA64PFR0_EL1_SVE | 2381 + ID_AA64PFR0_EL1_RAS | 2382 + ID_AA64PFR0_EL1_AdvSIMD | 2383 + ID_AA64PFR0_EL1_FP)), 2384 + ID_FILTERED(ID_AA64PFR1_EL1, id_aa64pfr1_el1, 2385 + ~(ID_AA64PFR1_EL1_PFAR | 2560 2386 ID_AA64PFR1_EL1_DF2 | 2561 2387 ID_AA64PFR1_EL1_MTEX | 2562 2388 ID_AA64PFR1_EL1_THE | ··· 2574 2406 ID_WRITABLE(ID_AA64FPFR0_EL1, ~ID_AA64FPFR0_EL1_RES0), 2575 2407 2576 2408 /* CRm=5 */ 2577 - { SYS_DESC(SYS_ID_AA64DFR0_EL1), 2578 - .access = access_id_reg, 2579 - .get_user = get_id_reg, 2580 - .set_user = set_id_aa64dfr0_el1, 2581 - .reset = read_sanitised_id_aa64dfr0_el1, 2582 2409 /* 2583 2410 * Prior to FEAT_Debugv8.9, the architecture defines context-aware 2584 2411 * breakpoints (CTX_CMPs) as the highest numbered breakpoints (BRPs). ··· 2586 2423 * See DDI0487K.a, section D2.8.3 Breakpoint types and linking 2587 2424 * of breakpoints for more details. 2588 2425 */ 2589 - .val = ID_AA64DFR0_EL1_DoubleLock_MASK | 2590 - ID_AA64DFR0_EL1_WRPs_MASK | 2591 - ID_AA64DFR0_EL1_PMUVer_MASK | 2592 - ID_AA64DFR0_EL1_DebugVer_MASK, }, 2426 + ID_FILTERED(ID_AA64DFR0_EL1, id_aa64dfr0_el1, 2427 + ID_AA64DFR0_EL1_DoubleLock_MASK | 2428 + ID_AA64DFR0_EL1_WRPs_MASK | 2429 + ID_AA64DFR0_EL1_PMUVer_MASK | 2430 + ID_AA64DFR0_EL1_DebugVer_MASK), 2593 2431 ID_SANITISED(ID_AA64DFR1_EL1), 2594 2432 ID_UNALLOCATED(5,2), 2595 2433 ID_UNALLOCATED(5,3), ··· 2653 2489 { SYS_DESC(SYS_TTBR0_EL1), access_vm_reg, reset_unknown, TTBR0_EL1 }, 2654 2490 { SYS_DESC(SYS_TTBR1_EL1), access_vm_reg, reset_unknown, TTBR1_EL1 }, 2655 2491 { SYS_DESC(SYS_TCR_EL1), access_vm_reg, reset_val, TCR_EL1, 0 }, 2656 - { SYS_DESC(SYS_TCR2_EL1), access_vm_reg, reset_val, TCR2_EL1, 0 }, 2492 + { SYS_DESC(SYS_TCR2_EL1), access_vm_reg, reset_val, TCR2_EL1, 0, 2493 + .visibility = tcr2_visibility }, 2657 2494 2658 2495 PTRAUTH_KEY(APIA), 2659 2496 PTRAUTH_KEY(APIB), ··· 2708 2543 { SYS_DESC(SYS_PMMIR_EL1), trap_raz_wi }, 2709 2544 2710 2545 { SYS_DESC(SYS_MAIR_EL1), 
access_vm_reg, reset_unknown, MAIR_EL1 }, 2711 - { SYS_DESC(SYS_PIRE0_EL1), NULL, reset_unknown, PIRE0_EL1 }, 2712 - { SYS_DESC(SYS_PIR_EL1), NULL, reset_unknown, PIR_EL1 }, 2546 + { SYS_DESC(SYS_PIRE0_EL1), NULL, reset_unknown, PIRE0_EL1, 2547 + .visibility = s1pie_visibility }, 2548 + { SYS_DESC(SYS_PIR_EL1), NULL, reset_unknown, PIR_EL1, 2549 + .visibility = s1pie_visibility }, 2713 2550 { SYS_DESC(SYS_POR_EL1), NULL, reset_unknown, POR_EL1, 2714 2551 .visibility = s1poe_visibility }, 2715 2552 { SYS_DESC(SYS_AMAIR_EL1), access_vm_reg, reset_amair_el1, AMAIR_EL1 }, ··· 2720 2553 { SYS_DESC(SYS_LOREA_EL1), trap_loregion }, 2721 2554 { SYS_DESC(SYS_LORN_EL1), trap_loregion }, 2722 2555 { SYS_DESC(SYS_LORC_EL1), trap_loregion }, 2556 + { SYS_DESC(SYS_MPAMIDR_EL1), undef_access }, 2723 2557 { SYS_DESC(SYS_LORID_EL1), trap_loregion }, 2724 2558 2559 + { SYS_DESC(SYS_MPAM1_EL1), undef_access }, 2560 + { SYS_DESC(SYS_MPAM0_EL1), undef_access }, 2725 2561 { SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 }, 2726 2562 { SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 }, 2727 2563 ··· 2769 2599 { SYS_DESC(SYS_CCSIDR2_EL1), undef_access }, 2770 2600 { SYS_DESC(SYS_SMIDR_EL1), undef_access }, 2771 2601 { SYS_DESC(SYS_CSSELR_EL1), access_csselr, reset_unknown, CSSELR_EL1 }, 2772 - ID_WRITABLE(CTR_EL0, CTR_EL0_DIC_MASK | 2773 - CTR_EL0_IDC_MASK | 2774 - CTR_EL0_DminLine_MASK | 2775 - CTR_EL0_IminLine_MASK), 2602 + ID_FILTERED(CTR_EL0, ctr_el0, 2603 + CTR_EL0_DIC_MASK | 2604 + CTR_EL0_IDC_MASK | 2605 + CTR_EL0_DminLine_MASK | 2606 + CTR_EL0_L1Ip_MASK | 2607 + CTR_EL0_IminLine_MASK), 2776 2608 { SYS_DESC(SYS_SVCR), undef_access, reset_val, SVCR, 0, .visibility = sme_visibility }, 2777 2609 { SYS_DESC(SYS_FPMR), undef_access, reset_val, FPMR, 0, .visibility = fp8_visibility }, 2778 2610 ··· 2990 2818 EL2_REG_VNCR(HFGITR_EL2, reset_val, 0), 2991 2819 EL2_REG_VNCR(HACR_EL2, reset_val, 0), 2992 2820 2993 - { SYS_DESC(SYS_ZCR_EL2), .access = access_zcr_el2, .reset = 
reset_val, 2994 - .visibility = sve_el2_visibility, .reg = ZCR_EL2 }, 2821 + EL2_REG_FILTERED(ZCR_EL2, access_zcr_el2, reset_val, 0, 2822 + sve_el2_visibility), 2995 2823 2996 2824 EL2_REG_VNCR(HCRX_EL2, reset_val, 0), 2997 2825 2998 2826 EL2_REG(TTBR0_EL2, access_rw, reset_val, 0), 2999 2827 EL2_REG(TTBR1_EL2, access_rw, reset_val, 0), 3000 2828 EL2_REG(TCR_EL2, access_rw, reset_val, TCR_EL2_RES1), 2829 + EL2_REG_FILTERED(TCR2_EL2, access_rw, reset_val, TCR2_EL2_RES1, 2830 + tcr2_el2_visibility), 3001 2831 EL2_REG_VNCR(VTTBR_EL2, reset_val, 0), 3002 2832 EL2_REG_VNCR(VTCR_EL2, reset_val, 0), 3003 2833 ··· 3027 2853 EL2_REG(HPFAR_EL2, access_rw, reset_val, 0), 3028 2854 3029 2855 EL2_REG(MAIR_EL2, access_rw, reset_val, 0), 2856 + EL2_REG_FILTERED(PIRE0_EL2, access_rw, reset_val, 0, 2857 + s1pie_el2_visibility), 2858 + EL2_REG_FILTERED(PIR_EL2, access_rw, reset_val, 0, 2859 + s1pie_el2_visibility), 2860 + EL2_REG_FILTERED(POR_EL2, access_rw, reset_val, 0, 2861 + s1poe_el2_visibility), 3030 2862 EL2_REG(AMAIR_EL2, access_rw, reset_val, 0), 2863 + { SYS_DESC(SYS_MPAMHCR_EL2), undef_access }, 2864 + { SYS_DESC(SYS_MPAMVPMV_EL2), undef_access }, 2865 + { SYS_DESC(SYS_MPAM2_EL2), undef_access }, 2866 + { SYS_DESC(SYS_MPAMVPM0_EL2), undef_access }, 2867 + { SYS_DESC(SYS_MPAMVPM1_EL2), undef_access }, 2868 + { SYS_DESC(SYS_MPAMVPM2_EL2), undef_access }, 2869 + { SYS_DESC(SYS_MPAMVPM3_EL2), undef_access }, 2870 + { SYS_DESC(SYS_MPAMVPM4_EL2), undef_access }, 2871 + { SYS_DESC(SYS_MPAMVPM5_EL2), undef_access }, 2872 + { SYS_DESC(SYS_MPAMVPM6_EL2), undef_access }, 2873 + { SYS_DESC(SYS_MPAMVPM7_EL2), undef_access }, 3031 2874 3032 2875 EL2_REG(VBAR_EL2, access_rw, reset_val, 0), 3033 2876 EL2_REG(RVBAR_EL2, access_rw, reset_val, 0), ··· 4910 4719 if (kvm_has_feat(kvm, ID_AA64ISAR2_EL1, MOPS, IMP)) 4911 4720 vcpu->arch.hcrx_el2 |= (HCRX_EL2_MSCEn | HCRX_EL2_MCE2); 4912 4721 4913 - if (kvm_has_feat(kvm, ID_AA64MMFR3_EL1, TCRX, IMP)) 4722 + if (kvm_has_tcr2(kvm)) 4914 4723 
vcpu->arch.hcrx_el2 |= HCRX_EL2_TCR2En; 4915 4724 4916 4725 if (kvm_has_fpmr(kvm)) ··· 4960 4769 kvm->arch.fgu[HFGITR_GROUP] |= (HFGITR_EL2_ATS1E1RP | 4961 4770 HFGITR_EL2_ATS1E1WP); 4962 4771 4963 - if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, S1PIE, IMP)) 4772 + if (!kvm_has_s1pie(kvm)) 4964 4773 kvm->arch.fgu[HFGxTR_GROUP] |= (HFGxTR_EL2_nPIRE0_EL1 | 4965 4774 HFGxTR_EL2_nPIR_EL1); 4966 4775 4967 - if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, S1POE, IMP)) 4776 + if (!kvm_has_s1poe(kvm)) 4968 4777 kvm->arch.fgu[HFGxTR_GROUP] |= (HFGxTR_EL2_nPOR_EL1 | 4969 4778 HFGxTR_EL2_nPOR_EL0); 4970 4779
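
Note on the MPAM handling in set_id_aa64pfr0_el1() above: the filtering step can be sketched on its own as plain user-space C. This is only an illustration, not the kernel code. filter_mpam_field() is a hypothetical name, and the [43:40] field position is taken from the cpu-feature-registers.rst table touched by this merge. A user value that matches the hardware MPAM field is silently cleared; any other value is left in place, so that the later set_id_reg() checks can reject it.

```c
#include <assert.h>
#include <stdint.h>

/* ID_AA64PFR0_EL1.MPAM occupies bits [43:40]. */
#define MPAM_SHIFT 40
#define MPAM_MASK  (UINT64_C(0xf) << MPAM_SHIFT)

/*
 * Hypothetical stand-in for the filtering step in set_id_aa64pfr0_el1():
 * if user space writes back the MPAM value the hardware reports, accept
 * the write but drop the field, so the guest still sees 0.
 */
static uint64_t filter_mpam_field(uint64_t hw_val, uint64_t user_val)
{
	if ((hw_val & MPAM_MASK) == (user_val & MPAM_MASK))
		user_val &= ~MPAM_MASK;

	return user_val;
}
```

A mismatching value is deliberately not rewritten here; in the kernel it falls through to the generic ID register checks, which this sketch does not model.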

+17 -15

arch/arm64/kvm/vgic/vgic-its.c

··· 782 782 783 783 ite = find_ite(its, device_id, event_id); 784 784 if (ite && its_is_collection_mapped(ite->collection)) { 785 + struct its_device *device = find_its_device(its, device_id); 786 + int ite_esz = vgic_its_get_abi(its)->ite_esz; 787 + gpa_t gpa = device->itt_addr + ite->event_id * ite_esz; 785 788 /* 786 789 * Though the spec talks about removing the pending state, we 787 790 * don't bother here since we clear the ITTE anyway and the ··· 793 790 vgic_its_invalidate_cache(its); 794 791 795 792 its_free_ite(kvm, ite); 796 - return 0; 793 + 794 + return vgic_its_write_entry_lock(its, gpa, 0, ite_esz); 797 795 } 798 796 799 797 return E_ITS_DISCARD_UNMAPPED_INTERRUPT; ··· 1143 1139 bool valid = its_cmd_get_validbit(its_cmd); 1144 1140 u8 num_eventid_bits = its_cmd_get_size(its_cmd); 1145 1141 gpa_t itt_addr = its_cmd_get_ittaddr(its_cmd); 1142 + int dte_esz = vgic_its_get_abi(its)->dte_esz; 1146 1143 struct its_device *device; 1144 + gpa_t gpa; 1147 1145 1148 - if (!vgic_its_check_id(its, its->baser_device_table, device_id, NULL)) 1146 + if (!vgic_its_check_id(its, its->baser_device_table, device_id, &gpa)) 1149 1147 return E_ITS_MAPD_DEVICE_OOR; 1150 1148 1151 1149 if (valid && num_eventid_bits > VITS_TYPER_IDBITS) ··· 1168 1162 * is an error, so we are done in any case. 
1169 1163 */ 1170 1164 if (!valid) 1171 - return 0; 1165 + return vgic_its_write_entry_lock(its, gpa, 0, dte_esz); 1172 1166 1173 1167 device = vgic_its_alloc_device(its, device_id, itt_addr, 1174 1168 num_eventid_bits); ··· 2092 2086 static int vgic_its_save_ite(struct vgic_its *its, struct its_device *dev, 2093 2087 struct its_ite *ite, gpa_t gpa, int ite_esz) 2094 2088 { 2095 - struct kvm *kvm = its->dev->kvm; 2096 2089 u32 next_offset; 2097 2090 u64 val; 2098 2091 ··· 2100 2095 ((u64)ite->irq->intid << KVM_ITS_ITE_PINTID_SHIFT) | 2101 2096 ite->collection->collection_id; 2102 2097 val = cpu_to_le64(val); 2103 - return vgic_write_guest_lock(kvm, gpa, &val, ite_esz); 2098 + 2099 + return vgic_its_write_entry_lock(its, gpa, val, ite_esz); 2104 2100 } 2105 2101 2106 2102 /** ··· 2245 2239 static int vgic_its_save_dte(struct vgic_its *its, struct its_device *dev, 2246 2240 gpa_t ptr, int dte_esz) 2247 2241 { 2248 - struct kvm *kvm = its->dev->kvm; 2249 2242 u64 val, itt_addr_field; 2250 2243 u32 next_offset; 2251 2244 ··· 2255 2250 (itt_addr_field << KVM_ITS_DTE_ITTADDR_SHIFT) | 2256 2251 (dev->num_eventid_bits - 1)); 2257 2252 val = cpu_to_le64(val); 2258 - return vgic_write_guest_lock(kvm, ptr, &val, dte_esz); 2253 + 2254 + return vgic_its_write_entry_lock(its, ptr, val, dte_esz); 2259 2255 } 2260 2256 2261 2257 /** ··· 2443 2437 ((u64)collection->target_addr << KVM_ITS_CTE_RDBASE_SHIFT) | 2444 2438 collection->collection_id); 2445 2439 val = cpu_to_le64(val); 2446 - return vgic_write_guest_lock(its->dev->kvm, gpa, &val, esz); 2440 + 2441 + return vgic_its_write_entry_lock(its, gpa, val, esz); 2447 2442 } 2448 2443 2449 2444 /* ··· 2460 2453 u64 val; 2461 2454 int ret; 2462 2455 2463 - BUG_ON(esz > sizeof(val)); 2464 - ret = kvm_read_guest_lock(kvm, gpa, &val, esz); 2456 + ret = vgic_its_read_entry_lock(its, gpa, &val, esz); 2465 2457 if (ret) 2466 2458 return ret; 2467 2459 val = le64_to_cpu(val); ··· 2498 2492 u64 baser = its->baser_coll_table; 2499 2493 gpa_t 
gpa = GITS_BASER_ADDR_48_to_52(baser); 2500 2494 struct its_collection *collection; 2501 - u64 val; 2502 2495 size_t max_size, filled = 0; 2503 2496 int ret, cte_esz = abi->cte_esz; 2504 2497 ··· 2521 2516 * table is not fully filled, add a last dummy element 2522 2517 * with valid bit unset 2523 2518 */ 2524 - val = 0; 2525 - BUG_ON(cte_esz > sizeof(val)); 2526 - ret = vgic_write_guest_lock(its->dev->kvm, gpa, &val, cte_esz); 2527 - return ret; 2519 + return vgic_its_write_entry_lock(its, gpa, 0, cte_esz); 2528 2520 } 2529 2521 2530 2522 /*

+23

arch/arm64/kvm/vgic/vgic.h

··· 146 146 return ret; 147 147 } 148 148 149 + static inline int vgic_its_read_entry_lock(struct vgic_its *its, gpa_t eaddr, 150 + u64 *eval, unsigned long esize) 151 + { 152 + struct kvm *kvm = its->dev->kvm; 153 + 154 + if (KVM_BUG_ON(esize != sizeof(*eval), kvm)) 155 + return -EINVAL; 156 + 157 + return kvm_read_guest_lock(kvm, eaddr, eval, esize); 158 + 159 + } 160 + 161 + static inline int vgic_its_write_entry_lock(struct vgic_its *its, gpa_t eaddr, 162 + u64 eval, unsigned long esize) 163 + { 164 + struct kvm *kvm = its->dev->kvm; 165 + 166 + if (KVM_BUG_ON(esize != sizeof(eval), kvm)) 167 + return -EINVAL; 168 + 169 + return vgic_write_guest_lock(kvm, eaddr, &eval, esize); 170 + } 171 + 149 172 /* 150 173 * This struct provides an intermediate representation of the fields contained 151 174 * in the GICH_VMCR and ICH_VMCR registers, such that code exporting the GIC
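
Note on the two helpers added above: both refuse any entry size other than sizeof(u64) before touching guest memory. A minimal user-space sketch of the write side follows. It assumes a flat byte buffer in place of guest memory, and write_entry() with its parameters is a hypothetical name, not a KVM API. The bounds check is an addition of the sketch; in the kernel that job belongs to the guest-memory accessors.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical analogue of vgic_its_write_entry_lock(): the entry is
 * passed by value and must be exactly 8 bytes wide.
 */
static int write_entry(uint8_t *mem, size_t mem_size, size_t offset,
		       uint64_t eval, size_t esize)
{
	/* Mirrors the KVM_BUG_ON(esize != sizeof(eval), kvm) check. */
	if (esize != sizeof(eval))
		return -EINVAL;

	/* Sketch-only bounds check on the fake guest memory. */
	if (offset > mem_size || mem_size - offset < esize)
		return -EFAULT;

	memcpy(mem + offset, &eval, esize);
	return 0;
}
```

Passing the value rather than a pointer is what lets callers such as the MAPD and DISCARD paths write a literal 0 to invalidate a table entry.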

+2

arch/arm64/tools/cpucaps

··· 62 62 KVM_HVHE 63 63 KVM_PROTECTED_MODE 64 64 MISMATCHED_CACHE_TYPE 65 + MPAM 66 + MPAM_HCR 65 67 MTE 66 68 MTE_ASYMM 67 69 SME

+242 -5

arch/arm64/tools/sysreg

··· 1200 1200 0b0001 IMP 1201 1201 0b0010 BRBE_V1P1 1202 1202 EndEnum 1203 - Enum 51:48 MTPMU 1203 + SignedEnum 51:48 MTPMU 1204 1204 0b0000 NI_IMPDEF 1205 1205 0b0001 IMP 1206 1206 0b1111 NI ··· 1208 1208 UnsignedEnum 47:44 TraceBuffer 1209 1209 0b0000 NI 1210 1210 0b0001 IMP 1211 + 0b0010 TRBE_V1P1 1211 1212 EndEnum 1212 1213 UnsignedEnum 43:40 TraceFilt 1213 1214 0b0000 NI ··· 1225 1224 0b0011 V1P2 1226 1225 0b0100 V1P3 1227 1226 0b0101 V1P4 1227 + 0b0110 V1P5 1228 1228 EndEnum 1229 1229 Field 31:28 CTX_CMPs 1230 - Res0 27:24 1230 + UnsignedEnum 27:24 SEBEP 1231 + 0b0000 NI 1232 + 0b0001 IMP 1233 + EndEnum 1231 1234 Field 23:20 WRPs 1232 - Res0 19:16 1235 + UnsignedEnum 19:16 PMSS 1236 + 0b0000 NI 1237 + 0b0001 IMP 1238 + EndEnum 1233 1239 Field 15:12 BRPs 1234 1240 UnsignedEnum 11:8 PMUVer 1235 1241 0b0000 NI ··· 1294 1286 Field 23:16 WRPs 1295 1287 Field 15:8 BRPs 1296 1288 Field 7:0 SYSPMUID 1289 + EndSysreg 1290 + 1291 + Sysreg ID_AA64DFR2_EL1 3 0 0 5 2 1292 + Res0 63:28 1293 + UnsignedEnum 27:24 TRBE_EXC 1294 + 0b0000 NI 1295 + 0b0001 IMP 1296 + EndEnum 1297 + UnsignedEnum 23:20 SPE_nVM 1298 + 0b0000 NI 1299 + 0b0001 IMP 1300 + EndEnum 1301 + UnsignedEnum 19:16 SPE_EXC 1302 + 0b0000 NI 1303 + 0b0001 IMP 1304 + EndEnum 1305 + Res0 15:8 1306 + UnsignedEnum 7:4 BWE 1307 + 0b0000 NI 1308 + 0b0001 FEAT_BWE 1309 + 0b0002 FEAT_BWE2 1310 + EndEnum 1311 + UnsignedEnum 3:0 STEP 1312 + 0b0000 NI 1313 + 0b0001 IMP 1314 + EndEnum 1297 1315 EndSysreg 1298 1316 1299 1317 Sysreg ID_AA64AFR0_EL1 3 0 0 5 4 ··· 2434 2400 Field 0 AFSR0_EL1 2435 2401 EndSysregFields 2436 2402 2403 + Sysreg MDCR_EL2 3 4 1 1 1 2404 + Res0 63:51 2405 + Field 50 EnSTEPOP 2406 + Res0 49:44 2407 + Field 43 EBWE 2408 + Res0 42 2409 + Field 41:40 PMEE 2410 + Res0 39:37 2411 + Field 36 HPMFZS 2412 + Res0 35:32 2413 + Field 31:30 PMSSE 2414 + Field 29 HPMFZO 2415 + Field 28 MTPME 2416 + Field 27 TDCC 2417 + Field 26 HLP 2418 + Field 25:24 E2TB 2419 + Field 23 HCCD 2420 + Res0 22:20 2421 + Field 19 TTRF 
2422 + Res0 18 2423 + Field 17 HPMD 2424 + Res0 16 2425 + Field 15 EnSPM 2426 + Field 14 TPMS 2427 + Field 13:12 E2PB 2428 + Field 11 TDRA 2429 + Field 10 TDOSA 2430 + Field 9 TDA 2431 + Field 8 TDE 2432 + Field 7 HPME 2433 + Field 6 TPM 2434 + Field 5 TPMCR 2435 + Field 4:0 HPMN 2436 + EndSysreg 2437 + 2437 2438 Sysreg HFGRTR_EL2 3 4 1 1 4 2438 2439 Fields HFGxTR_EL2 2439 2440 EndSysreg ··· 2818 2749 Field 0 E0HSPE 2819 2750 EndSysreg 2820 2751 2752 + Sysreg MPAMHCR_EL2 3 4 10 4 0 2753 + Res0 63:32 2754 + Field 31 TRAP_MPAMIDR_EL1 2755 + Res0 30:9 2756 + Field 8 GSTAPP_PLK 2757 + Res0 7:2 2758 + Field 1 EL1_VPMEN 2759 + Field 0 EL0_VPMEN 2760 + EndSysreg 2761 + 2762 + Sysreg MPAMVPMV_EL2 3 4 10 4 1 2763 + Res0 63:32 2764 + Field 31 VPM_V31 2765 + Field 30 VPM_V30 2766 + Field 29 VPM_V29 2767 + Field 28 VPM_V28 2768 + Field 27 VPM_V27 2769 + Field 26 VPM_V26 2770 + Field 25 VPM_V25 2771 + Field 24 VPM_V24 2772 + Field 23 VPM_V23 2773 + Field 22 VPM_V22 2774 + Field 21 VPM_V21 2775 + Field 20 VPM_V20 2776 + Field 19 VPM_V19 2777 + Field 18 VPM_V18 2778 + Field 17 VPM_V17 2779 + Field 16 VPM_V16 2780 + Field 15 VPM_V15 2781 + Field 14 VPM_V14 2782 + Field 13 VPM_V13 2783 + Field 12 VPM_V12 2784 + Field 11 VPM_V11 2785 + Field 10 VPM_V10 2786 + Field 9 VPM_V9 2787 + Field 8 VPM_V8 2788 + Field 7 VPM_V7 2789 + Field 6 VPM_V6 2790 + Field 5 VPM_V5 2791 + Field 4 VPM_V4 2792 + Field 3 VPM_V3 2793 + Field 2 VPM_V2 2794 + Field 1 VPM_V1 2795 + Field 0 VPM_V0 2796 + EndSysreg 2797 + 2798 + Sysreg MPAM2_EL2 3 4 10 5 0 2799 + Field 63 MPAMEN 2800 + Res0 62:59 2801 + Field 58 TIDR 2802 + Res0 57 2803 + Field 56 ALTSP_HFC 2804 + Field 55 ALTSP_EL2 2805 + Field 54 ALTSP_FRCD 2806 + Res0 53:51 2807 + Field 50 EnMPAMSM 2808 + Field 49 TRAPMPAM0EL1 2809 + Field 48 TRAPMPAM1EL1 2810 + Field 47:40 PMG_D 2811 + Field 39:32 PMG_I 2812 + Field 31:16 PARTID_D 2813 + Field 15:0 PARTID_I 2814 + EndSysreg 2815 + 2816 + Sysreg MPAMVPM0_EL2 3 4 10 6 0 2817 + Field 63:48 PhyPARTID3 2818 + 
Field 47:32 PhyPARTID2 2819 + Field 31:16 PhyPARTID1 2820 + Field 15:0 PhyPARTID0 2821 + EndSysreg 2822 + 2823 + Sysreg MPAMVPM1_EL2 3 4 10 6 1 2824 + Field 63:48 PhyPARTID7 2825 + Field 47:32 PhyPARTID6 2826 + Field 31:16 PhyPARTID5 2827 + Field 15:0 PhyPARTID4 2828 + EndSysreg 2829 + 2830 + Sysreg MPAMVPM2_EL2 3 4 10 6 2 2831 + Field 63:48 PhyPARTID11 2832 + Field 47:32 PhyPARTID10 2833 + Field 31:16 PhyPARTID9 2834 + Field 15:0 PhyPARTID8 2835 + EndSysreg 2836 + 2837 + Sysreg MPAMVPM3_EL2 3 4 10 6 3 2838 + Field 63:48 PhyPARTID15 2839 + Field 47:32 PhyPARTID14 2840 + Field 31:16 PhyPARTID13 2841 + Field 15:0 PhyPARTID12 2842 + EndSysreg 2843 + 2844 + Sysreg MPAMVPM4_EL2 3 4 10 6 4 2845 + Field 63:48 PhyPARTID19 2846 + Field 47:32 PhyPARTID18 2847 + Field 31:16 PhyPARTID17 2848 + Field 15:0 PhyPARTID16 2849 + EndSysreg 2850 + 2851 + Sysreg MPAMVPM5_EL2 3 4 10 6 5 2852 + Field 63:48 PhyPARTID23 2853 + Field 47:32 PhyPARTID22 2854 + Field 31:16 PhyPARTID21 2855 + Field 15:0 PhyPARTID20 2856 + EndSysreg 2857 + 2858 + Sysreg MPAMVPM6_EL2 3 4 10 6 6 2859 + Field 63:48 PhyPARTID27 2860 + Field 47:32 PhyPARTID26 2861 + Field 31:16 PhyPARTID25 2862 + Field 15:0 PhyPARTID24 2863 + EndSysreg 2864 + 2865 + Sysreg MPAMVPM7_EL2 3 4 10 6 7 2866 + Field 63:48 PhyPARTID31 2867 + Field 47:32 PhyPARTID30 2868 + Field 31:16 PhyPARTID29 2869 + Field 15:0 PhyPARTID28 2870 + EndSysreg 2871 + 2821 2872 Sysreg CONTEXTIDR_EL2 3 4 13 0 1 2822 2873 Fields CONTEXTIDR_ELx 2823 2874 EndSysreg ··· 2968 2779 2969 2780 Sysreg FAR_EL12 3 5 6 0 0 2970 2781 Field 63:0 ADDR 2782 + EndSysreg 2783 + 2784 + Sysreg MPAM1_EL12 3 5 10 5 0 2785 + Fields MPAM1_ELx 2971 2786 EndSysreg 2972 2787 2973 2788 Sysreg CONTEXTIDR_EL12 3 5 13 0 1 ··· 3024 2831 Field 12 AMEC0 3025 2832 Field 11 HAFT 3026 2833 Field 10 PTTWI 3027 - Field 9:8 SKL1 3028 - Field 7:6 SKL0 2834 + Res0 9:6 3029 2835 Field 5 D128 3030 2836 Field 4 AIE 3031 2837 Field 3 POE ··· 3087 2895 Fields PIRx_ELx 3088 2896 EndSysreg 3089 2897 2898 + 
Sysreg PIRE0_EL2 3 4 10 2 2 2899 + Fields PIRx_ELx 2900 + EndSysreg 2901 + 3090 2902 Sysreg PIR_EL1 3 0 10 2 3 3091 2903 Fields PIRx_ELx 3092 2904 EndSysreg ··· 3108 2912 EndSysreg 3109 2913 3110 2914 Sysreg POR_EL1 3 0 10 2 4 2915 + Fields PIRx_ELx 2916 + EndSysreg 2917 + 2918 + Sysreg POR_EL2 3 4 10 2 4 3111 2919 Fields PIRx_ELx 3112 2920 EndSysreg 3113 2921 ··· 3153 2953 Field 0 EN 3154 2954 EndSysreg 3155 2955 2956 + Sysreg MPAMIDR_EL1 3 0 10 4 4 2957 + Res0 63:62 2958 + Field 61 HAS_SDEFLT 2959 + Field 60 HAS_FORCE_NS 2960 + Field 59 SP4 2961 + Field 58 HAS_TIDR 2962 + Field 57 HAS_ALTSP 2963 + Res0 56:40 2964 + Field 39:32 PMG_MAX 2965 + Res0 31:21 2966 + Field 20:18 VPMR_MAX 2967 + Field 17 HAS_HCR 2968 + Res0 16 2969 + Field 15:0 PARTID_MAX 2970 + EndSysreg 2971 + 3156 2972 Sysreg LORID_EL1 3 0 10 4 7 3157 2973 Res0 63:24 3158 2974 Field 23:16 LD 3159 2975 Res0 15:8 3160 2976 Field 7:0 LR 2977 + EndSysreg 2978 + 2979 + Sysreg MPAM1_EL1 3 0 10 5 0 2980 + Field 63 MPAMEN 2981 + Res0 62:61 2982 + Field 60 FORCED_NS 2983 + Res0 59:55 2984 + Field 54 ALTSP_FRCD 2985 + Res0 53:48 2986 + Field 47:40 PMG_D 2987 + Field 39:32 PMG_I 2988 + Field 31:16 PARTID_D 2989 + Field 15:0 PARTID_I 2990 + EndSysreg 2991 + 2992 + Sysreg MPAM0_EL1 3 0 10 5 1 2993 + Res0 63:48 2994 + Field 47:40 PMG_D 2995 + Field 39:32 PMG_I 2996 + Field 31:16 PARTID_D 2997 + Field 15:0 PARTID_I 3161 2998 EndSysreg 3162 2999 3163 3000 Sysreg ISR_EL1 3 0 12 1 0
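
Note on the MPAMIDR_EL1 layout added above: the Field lines translate directly into shift-and-mask accessors. The sketch below decodes four of them (PARTID_MAX [15:0], HAS_HCR [17], VPMR_MAX [20:18], PMG_MAX [39:32]) in user-space C. sysreg_field(), struct mpamidr and decode_mpamidr() are hypothetical names; the kernel instead generates its own accessors from the sysreg file.

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits [hi:lo] of a 64-bit register value. */
static uint64_t sysreg_field(uint64_t reg, unsigned int hi, unsigned int lo)
{
	unsigned int width = hi - lo + 1;
	uint64_t mask = (width >= 64) ? ~UINT64_C(0)
				      : ((UINT64_C(1) << width) - 1);

	return (reg >> lo) & mask;
}

/* Hypothetical container for a few decoded MPAMIDR_EL1 fields. */
struct mpamidr {
	uint16_t partid_max;	/* Field 15:0  PARTID_MAX */
	uint8_t  pmg_max;	/* Field 39:32 PMG_MAX    */
	uint8_t  vpmr_max;	/* Field 20:18 VPMR_MAX   */
	int      has_hcr;	/* Field 17    HAS_HCR    */
};

static struct mpamidr decode_mpamidr(uint64_t reg)
{
	struct mpamidr m;

	m.partid_max = (uint16_t)sysreg_field(reg, 15, 0);
	m.pmg_max    = (uint8_t)sysreg_field(reg, 39, 32);
	m.vpmr_max   = (uint8_t)sysreg_field(reg, 20, 18);
	m.has_hcr    = (int)sysreg_field(reg, 17, 17);

	return m;
}
```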

+1

arch/loongarch/include/asm/irq.h

··· 65 65 extern struct acpi_vector_group msi_group[MAX_IO_PICS]; 66 66 67 67 #define CORES_PER_EIO_NODE 4 68 + #define CORES_PER_VEIO_NODE 256 68 69 69 70 #define LOONGSON_CPU_UART0_VEC 10 /* CPU UART0 */ 70 71 #define LOONGSON_CPU_THSENS_VEC 14 /* CPU Thsens */

+123

arch/loongarch/include/asm/kvm_eiointc.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (C) 2024 Loongson Technology Corporation Limited 4 + */ 5 + 6 + #ifndef __ASM_KVM_EIOINTC_H 7 + #define __ASM_KVM_EIOINTC_H 8 + 9 + #include <kvm/iodev.h> 10 + 11 + #define EIOINTC_IRQS 256 12 + #define EIOINTC_ROUTE_MAX_VCPUS 256 13 + #define EIOINTC_IRQS_U8_NUMS (EIOINTC_IRQS / 8) 14 + #define EIOINTC_IRQS_U16_NUMS (EIOINTC_IRQS_U8_NUMS / 2) 15 + #define EIOINTC_IRQS_U32_NUMS (EIOINTC_IRQS_U8_NUMS / 4) 16 + #define EIOINTC_IRQS_U64_NUMS (EIOINTC_IRQS_U8_NUMS / 8) 17 + /* map to ipnum per 32 irqs */ 18 + #define EIOINTC_IRQS_NODETYPE_COUNT 16 19 + 20 + #define EIOINTC_BASE 0x1400 21 + #define EIOINTC_SIZE 0x900 22 + 23 + #define EIOINTC_NODETYPE_START 0xa0 24 + #define EIOINTC_NODETYPE_END 0xbf 25 + #define EIOINTC_IPMAP_START 0xc0 26 + #define EIOINTC_IPMAP_END 0xc7 27 + #define EIOINTC_ENABLE_START 0x200 28 + #define EIOINTC_ENABLE_END 0x21f 29 + #define EIOINTC_BOUNCE_START 0x280 30 + #define EIOINTC_BOUNCE_END 0x29f 31 + #define EIOINTC_ISR_START 0x300 32 + #define EIOINTC_ISR_END 0x31f 33 + #define EIOINTC_COREISR_START 0x400 34 + #define EIOINTC_COREISR_END 0x41f 35 + #define EIOINTC_COREMAP_START 0x800 36 + #define EIOINTC_COREMAP_END 0x8ff 37 + 38 + #define EIOINTC_VIRT_BASE (0x40000000) 39 + #define EIOINTC_VIRT_SIZE (0x1000) 40 + 41 + #define EIOINTC_VIRT_FEATURES (0x0) 42 + #define EIOINTC_HAS_VIRT_EXTENSION (0) 43 + #define EIOINTC_HAS_ENABLE_OPTION (1) 44 + #define EIOINTC_HAS_INT_ENCODE (2) 45 + #define EIOINTC_HAS_CPU_ENCODE (3) 46 + #define EIOINTC_VIRT_HAS_FEATURES ((1U << EIOINTC_HAS_VIRT_EXTENSION) \ 47 + | (1U << EIOINTC_HAS_ENABLE_OPTION) \ 48 + | (1U << EIOINTC_HAS_INT_ENCODE) \ 49 + | (1U << EIOINTC_HAS_CPU_ENCODE)) 50 + #define EIOINTC_VIRT_CONFIG (0x4) 51 + #define EIOINTC_ENABLE (1) 52 + #define EIOINTC_ENABLE_INT_ENCODE (2) 53 + #define EIOINTC_ENABLE_CPU_ENCODE (3) 54 + 55 + #define LOONGSON_IP_NUM 8 56 + 57 + struct loongarch_eiointc { 58 + spinlock_t lock; 59 + 
struct kvm *kvm; 60 + struct kvm_io_device device; 61 + struct kvm_io_device device_vext; 62 + uint32_t num_cpu; 63 + uint32_t features; 64 + uint32_t status; 65 + 66 + /* hardware state */ 67 + union nodetype { 68 + u64 reg_u64[EIOINTC_IRQS_NODETYPE_COUNT / 4]; 69 + u32 reg_u32[EIOINTC_IRQS_NODETYPE_COUNT / 2]; 70 + u16 reg_u16[EIOINTC_IRQS_NODETYPE_COUNT]; 71 + u8 reg_u8[EIOINTC_IRQS_NODETYPE_COUNT * 2]; 72 + } nodetype; 73 + 74 + /* one bit shows the state of one irq */ 75 + union bounce { 76 + u64 reg_u64[EIOINTC_IRQS_U64_NUMS]; 77 + u32 reg_u32[EIOINTC_IRQS_U32_NUMS]; 78 + u16 reg_u16[EIOINTC_IRQS_U16_NUMS]; 79 + u8 reg_u8[EIOINTC_IRQS_U8_NUMS]; 80 + } bounce; 81 + 82 + union isr { 83 + u64 reg_u64[EIOINTC_IRQS_U64_NUMS]; 84 + u32 reg_u32[EIOINTC_IRQS_U32_NUMS]; 85 + u16 reg_u16[EIOINTC_IRQS_U16_NUMS]; 86 + u8 reg_u8[EIOINTC_IRQS_U8_NUMS]; 87 + } isr; 88 + union coreisr { 89 + u64 reg_u64[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U64_NUMS]; 90 + u32 reg_u32[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U32_NUMS]; 91 + u16 reg_u16[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U16_NUMS]; 92 + u8 reg_u8[EIOINTC_ROUTE_MAX_VCPUS][EIOINTC_IRQS_U8_NUMS]; 93 + } coreisr; 94 + union enable { 95 + u64 reg_u64[EIOINTC_IRQS_U64_NUMS]; 96 + u32 reg_u32[EIOINTC_IRQS_U32_NUMS]; 97 + u16 reg_u16[EIOINTC_IRQS_U16_NUMS]; 98 + u8 reg_u8[EIOINTC_IRQS_U8_NUMS]; 99 + } enable; 100 + 101 + /* use one byte to config ipmap for 32 irqs at once */ 102 + union ipmap { 103 + u64 reg_u64; 104 + u32 reg_u32[EIOINTC_IRQS_U32_NUMS / 4]; 105 + u16 reg_u16[EIOINTC_IRQS_U16_NUMS / 4]; 106 + u8 reg_u8[EIOINTC_IRQS_U8_NUMS / 4]; 107 + } ipmap; 108 + /* use one byte to config coremap for one irq */ 109 + union coremap { 110 + u64 reg_u64[EIOINTC_IRQS / 8]; 111 + u32 reg_u32[EIOINTC_IRQS / 4]; 112 + u16 reg_u16[EIOINTC_IRQS / 2]; 113 + u8 reg_u8[EIOINTC_IRQS]; 114 + } coremap; 115 + 116 + DECLARE_BITMAP(sw_coreisr[EIOINTC_ROUTE_MAX_VCPUS][LOONGSON_IP_NUM], EIOINTC_IRQS); 117 + uint8_t sw_coremap[EIOINTC_IRQS]; 118 + }; 
119 + 120 + int kvm_loongarch_register_eiointc_device(void); 121 + void eiointc_set_irq(struct loongarch_eiointc *s, int irq, int level); 122 + 123 + #endif /* __ASM_KVM_EIOINTC_H */
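
Note on the sizes in kvm_eiointc.h above: the derived array lengths and the feature mask can be checked with a small user-space sketch. The macros are copied from the header. irq_bitmap_set() and irq_bitmap_test() are hypothetical helpers, and the bit placement they use (IRQ n in byte n / 8, bit n % 8) is an assumption of the sketch; the header itself only says that one bit holds the state of one IRQ.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from kvm_eiointc.h. */
#define EIOINTC_IRQS			256
#define EIOINTC_IRQS_U8_NUMS		(EIOINTC_IRQS / 8)
#define EIOINTC_IRQS_U64_NUMS		(EIOINTC_IRQS_U8_NUMS / 8)
#define EIOINTC_HAS_VIRT_EXTENSION	(0)
#define EIOINTC_HAS_ENABLE_OPTION	(1)
#define EIOINTC_HAS_INT_ENCODE		(2)
#define EIOINTC_HAS_CPU_ENCODE		(3)
#define EIOINTC_VIRT_HAS_FEATURES	((1U << EIOINTC_HAS_VIRT_EXTENSION) \
					| (1U << EIOINTC_HAS_ENABLE_OPTION) \
					| (1U << EIOINTC_HAS_INT_ENCODE) \
					| (1U << EIOINTC_HAS_CPU_ENCODE))

/* Hypothetical helper: set or clear the state bit of one IRQ. */
static int irq_bitmap_set(uint8_t *bits, unsigned int irq, int level)
{
	if (irq >= EIOINTC_IRQS)
		return -1;

	if (level)
		bits[irq / 8] |= (uint8_t)(1u << (irq % 8));
	else
		bits[irq / 8] &= (uint8_t)~(1u << (irq % 8));

	return 0;
}

/* Hypothetical helper: read back the state bit of one IRQ. */
static int irq_bitmap_test(const uint8_t *bits, unsigned int irq)
{
	if (irq >= EIOINTC_IRQS)
		return 0;

	return (bits[irq / 8] >> (irq % 8)) & 1;
}
```

With 256 IRQs, each per-IRQ bitmap (enable, bounce, isr) is 32 bytes, or four u64 words, and all four advertised feature bits give a mask of 0xf.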

+17 -1

arch/loongarch/include/asm/kvm_host.h

··· 18 18 19 19 #include <asm/inst.h> 20 20 #include <asm/kvm_mmu.h> 21 + #include <asm/kvm_ipi.h> 22 + #include <asm/kvm_eiointc.h> 23 + #include <asm/kvm_pch_pic.h> 21 24 #include <asm/loongarch.h> 25 + 26 + #define __KVM_HAVE_ARCH_INTC_INITIALIZED 22 27 23 28 /* Loongarch KVM register ids */ 24 29 #define KVM_GET_IOC_CSR_IDX(id) ((id & KVM_CSR_IDX_MASK) >> LOONGARCH_REG_SHIFT) ··· 49 44 struct kvm_vm_stat_generic generic; 50 45 u64 pages; 51 46 u64 hugepages; 47 + u64 ipi_read_exits; 48 + u64 ipi_write_exits; 49 + u64 eiointc_read_exits; 50 + u64 eiointc_write_exits; 51 + u64 pch_pic_read_exits; 52 + u64 pch_pic_write_exits; 52 53 }; 53 54 54 55 struct kvm_vcpu_stat { ··· 95 84 * 96 85 * For LOONGARCH_CSR_CPUID register, max CPUID size if 512 97 86 * For IPI hardware, max destination CPUID size 1024 98 - * For extioi interrupt controller, max destination CPUID size is 256 87 + * For eiointc interrupt controller, max destination CPUID size is 256 99 88 * For msgint interrupt controller, max supported CPUID size is 65536 100 89 * 101 90 * Currently max CPUID is defined as 256 for KVM hypervisor, in future ··· 128 117 129 118 s64 time_offset; 130 119 struct kvm_context __percpu *vmcs; 120 + struct loongarch_ipi *ipi; 121 + struct loongarch_eiointc *eiointc; 122 + struct loongarch_pch_pic *pch_pic; 131 123 }; 132 124 133 125 #define CSR_MAX_NUMS 0x800 ··· 235 221 int last_sched_cpu; 236 222 /* mp state */ 237 223 struct kvm_mp_state mp_state; 224 + /* ipi state */ 225 + struct ipi_state ipi_state; 238 226 /* cpucfg */ 239 227 u32 cpucfg[KVM_MAX_CPUCFG_REGS]; 240 228

+45

arch/loongarch/include/asm/kvm_ipi.h

···
1 + /* SPDX-License-Identifier: GPL-2.0 */
2 + /*
3 +  * Copyright (C) 2024 Loongson Technology Corporation Limited
4 +  */
5 +
6 + #ifndef __ASM_KVM_IPI_H
7 + #define __ASM_KVM_IPI_H
8 +
9 + #include <kvm/iodev.h>
10 +
11 + #define LARCH_INT_IPI 12
12 +
13 + struct loongarch_ipi {
14 +     spinlock_t lock;
15 +     struct kvm *kvm;
16 +     struct kvm_io_device device;
17 + };
18 +
19 + struct ipi_state {
20 +     spinlock_t lock;
21 +     uint32_t status;
22 +     uint32_t en;
23 +     uint32_t set;
24 +     uint32_t clear;
25 +     uint64_t buf[4];
26 + };
27 +
28 + #define IOCSR_IPI_BASE 0x1000
29 + #define IOCSR_IPI_SIZE 0x160
30 +
31 + #define IOCSR_IPI_STATUS 0x000
32 + #define IOCSR_IPI_EN 0x004
33 + #define IOCSR_IPI_SET 0x008
34 + #define IOCSR_IPI_CLEAR 0x00c
35 + #define IOCSR_IPI_BUF_20 0x020
36 + #define IOCSR_IPI_BUF_28 0x028
37 + #define IOCSR_IPI_BUF_30 0x030
38 + #define IOCSR_IPI_BUF_38 0x038
39 + #define IOCSR_IPI_SEND 0x040
40 + #define IOCSR_MAIL_SEND 0x048
41 + #define IOCSR_ANY_SEND 0x158
42 +
43 + int kvm_loongarch_register_ipi_device(void);
44 +
45 + #endif
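
Note on the mailbox offsets above: struct ipi_state carries `uint64_t buf[4]`, and the header defines four BUF offsets spaced eight bytes apart. A plausible reading is that they map linearly onto `buf[0..3]`; the sketch below works under that assumption, which this header does not itself confirm. `struct demo_ipi_state` and the `demo_*` helpers are hypothetical names, and the macro values are copied from the header.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from kvm_ipi.h in this diff. */
#define IOCSR_IPI_BASE   0x1000
#define IOCSR_IPI_SIZE   0x160
#define IOCSR_IPI_BUF_20 0x020
#define IOCSR_IPI_BUF_38 0x038

/* Hypothetical stand-in for the mailbox part of struct ipi_state. */
struct demo_ipi_state {
    uint64_t buf[4];
};

/*
 * Map an IOCSR address to a mailbox slot, or return -1 if the address is
 * outside the IPI window or is not a mailbox register.
 * Assumption: the four BUF registers map linearly onto buf[0..3].
 */
static int demo_ipi_buf_index(uint64_t addr)
{
    uint64_t offset;

    if (addr < IOCSR_IPI_BASE || addr >= IOCSR_IPI_BASE + IOCSR_IPI_SIZE)
        return -1;
    offset = addr - IOCSR_IPI_BASE;
    if (offset < IOCSR_IPI_BUF_20 || offset >= IOCSR_IPI_BUF_38 + 8)
        return -1;
    return (int)((offset - IOCSR_IPI_BUF_20) >> 3);
}

/* Read the 64-bit mailbox that contains addr; addr must be a mailbox. */
static uint64_t demo_ipi_buf_read(const struct demo_ipi_state *s, uint64_t addr)
{
    int idx = demo_ipi_buf_index(addr);

    assert(idx >= 0);
    return s->buf[idx];
}
```

Under this mapping, address 0x1030 falls in slot 2, while 0x1040 (IOCSR_IPI_SEND) is rejected as a non-mailbox register.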

+62

arch/loongarch/include/asm/kvm_pch_pic.h

···
1 + /* SPDX-License-Identifier: GPL-2.0 */
2 + /*
3 +  * Copyright (C) 2024 Loongson Technology Corporation Limited
4 +  */
5 +
6 + #ifndef __ASM_KVM_PCH_PIC_H
7 + #define __ASM_KVM_PCH_PIC_H
8 +
9 + #include <kvm/iodev.h>
10 +
11 + #define PCH_PIC_SIZE 0x3e8
12 +
13 + #define PCH_PIC_INT_ID_START 0x0
14 + #define PCH_PIC_INT_ID_END 0x7
15 + #define PCH_PIC_MASK_START 0x20
16 + #define PCH_PIC_MASK_END 0x27
17 + #define PCH_PIC_HTMSI_EN_START 0x40
18 + #define PCH_PIC_HTMSI_EN_END 0x47
19 + #define PCH_PIC_EDGE_START 0x60
20 + #define PCH_PIC_EDGE_END 0x67
21 + #define PCH_PIC_CLEAR_START 0x80
22 + #define PCH_PIC_CLEAR_END 0x87
23 + #define PCH_PIC_AUTO_CTRL0_START 0xc0
24 + #define PCH_PIC_AUTO_CTRL0_END 0xc7
25 + #define PCH_PIC_AUTO_CTRL1_START 0xe0
26 + #define PCH_PIC_AUTO_CTRL1_END 0xe7
27 + #define PCH_PIC_ROUTE_ENTRY_START 0x100
28 + #define PCH_PIC_ROUTE_ENTRY_END 0x13f
29 + #define PCH_PIC_HTMSI_VEC_START 0x200
30 + #define PCH_PIC_HTMSI_VEC_END 0x23f
31 + #define PCH_PIC_INT_IRR_START 0x380
32 + #define PCH_PIC_INT_IRR_END 0x38f
33 + #define PCH_PIC_INT_ISR_START 0x3a0
34 + #define PCH_PIC_INT_ISR_END 0x3af
35 + #define PCH_PIC_POLARITY_START 0x3e0
36 + #define PCH_PIC_POLARITY_END 0x3e7
37 + #define PCH_PIC_INT_ID_VAL 0x7000000UL
38 + #define PCH_PIC_INT_ID_VER 0x1UL
39 +
40 + struct loongarch_pch_pic {
41 +     spinlock_t lock;
42 +     struct kvm *kvm;
43 +     struct kvm_io_device device;
44 +     uint64_t mask; /* 1:disable irq, 0:enable irq */
45 +     uint64_t htmsi_en; /* 1:msi */
46 +     uint64_t edge; /* 1:edge triggered, 0:level triggered */
47 +     uint64_t auto_ctrl0; /* only use default value 00b */
48 +     uint64_t auto_ctrl1; /* only use default value 00b */
49 +     uint64_t last_intirr; /* edge detection */
50 +     uint64_t irr; /* interrupt request register */
51 +     uint64_t isr; /* interrupt service register */
52 +     uint64_t polarity; /* 0: high level trigger, 1: low level trigger */
53 +     uint8_t route_entry[64]; /* default value 0, route to int0: eiointc */
54 +     uint8_t htmsi_vector[64]; /* irq route table for routing to eiointc */
55 +     uint64_t pch_pic_base;
56 + };
57 +
58 + int kvm_loongarch_register_pch_pic_device(void);
59 + void pch_pic_set_irq(struct loongarch_pch_pic *s, int irq, int level);
60 + void pch_msi_set_irq(struct kvm *kvm, int irq, int level);
61 +
62 + #endif /* __ASM_KVM_PCH_PIC_H */
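
Note on the byte tables above: the ROUTE_ENTRY and HTMSI_VEC windows are each 64 bytes wide, matching `route_entry[64]` and `htmsi_vector[64]`, so one register byte per irq seems to be the intended layout. The sketch below decodes an offset into the backing byte under that assumption. `struct demo_pch_pic` and `demo_pch_pic_byte()` are hypothetical names and are not taken from pch_pic.c; the macro values are copied from the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from kvm_pch_pic.h in this diff. */
#define PCH_PIC_ROUTE_ENTRY_START 0x100
#define PCH_PIC_ROUTE_ENTRY_END   0x13f
#define PCH_PIC_HTMSI_VEC_START   0x200
#define PCH_PIC_HTMSI_VEC_END     0x23f

/* Hypothetical stand-in for the two byte tables in struct loongarch_pch_pic. */
struct demo_pch_pic {
    uint8_t route_entry[64];
    uint8_t htmsi_vector[64];
};

/*
 * Return the byte backing a register offset, or NULL when the offset
 * belongs to neither table.
 */
static uint8_t *demo_pch_pic_byte(struct demo_pch_pic *s, unsigned int offset)
{
    assert(s != NULL);
    if (offset >= PCH_PIC_ROUTE_ENTRY_START && offset <= PCH_PIC_ROUTE_ENTRY_END)
        return &s->route_entry[offset - PCH_PIC_ROUTE_ENTRY_START];
    if (offset >= PCH_PIC_HTMSI_VEC_START && offset <= PCH_PIC_HTMSI_VEC_END)
        return &s->htmsi_vector[offset - PCH_PIC_HTMSI_VEC_START];
    return NULL;
}
```

With this decoding, a one-byte write at offset 0x205 would land in `htmsi_vector[5]`, and an offset such as 0x140 is rejected.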

+20

arch/loongarch/include/uapi/asm/kvm.h

···
8 8
9 9 #include <linux/types.h>
10 10
11 + #define __KVM_HAVE_IRQ_LINE
12 +
11 13 /*
12 14  * KVM LoongArch specific structures and definitions.
13 15  *
···
133 131 #define KVM_NR_IRQCHIPS 1
134 132 #define KVM_IRQCHIP_NUM_PINS 64
135 133 #define KVM_MAX_CORES 256
134 +
135 + #define KVM_DEV_LOONGARCH_IPI_GRP_REGS 0x40000001
136 +
137 + #define KVM_DEV_LOONGARCH_EXTIOI_GRP_REGS 0x40000002
138 +
139 + #define KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS 0x40000003
140 + #define KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_NUM_CPU 0x0
141 + #define KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_FEATURE 0x1
142 + #define KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_STATE 0x2
143 +
144 + #define KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL 0x40000004
145 + #define KVM_DEV_LOONGARCH_EXTIOI_CTRL_INIT_NUM_CPU 0x0
146 + #define KVM_DEV_LOONGARCH_EXTIOI_CTRL_INIT_FEATURE 0x1
147 + #define KVM_DEV_LOONGARCH_EXTIOI_CTRL_LOAD_FINISHED 0x3
148 +
149 + #define KVM_DEV_LOONGARCH_PCH_PIC_GRP_REGS 0x40000005
150 + #define KVM_DEV_LOONGARCH_PCH_PIC_GRP_CTRL 0x40000006
151 + #define KVM_DEV_LOONGARCH_PCH_PIC_CTRL_INIT 0
136 152
137 153 #endif /* __UAPI_ASM_LOONGARCH_KVM_H */
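
Note on the EXTIOI register group above: kvm_eiointc_regs_access() in eiointc.c (part of this diff) splits the attribute value with `cpuid = addr >> 16` and `addr &= 0xffff`. A caller would presumably build the attribute the inverse way. The sketch below shows only that packing arithmetic; `demo_regs_attr()` and `demo_regs_attr_split()` are hypothetical helpers, not part of the uapi header.

```c
#include <assert.h>
#include <stdint.h>

/* Build a register attribute: vCPU id above bit 16, register offset below. */
static uint64_t demo_regs_attr(uint32_t cpuid, uint32_t offset)
{
    assert(offset <= 0xffff);
    return ((uint64_t)cpuid << 16) | offset;
}

/* Inverse split, mirroring the shift and mask in kvm_eiointc_regs_access(). */
static void demo_regs_attr_split(uint64_t attr, uint32_t *cpuid, uint32_t *offset)
{
    *cpuid = (uint32_t)(attr >> 16);
    *offset = (uint32_t)(attr & 0xffff);
}
```

For instance, vCPU 3 and register offset 0x4c0 pack to 0x304c0, and splitting that value returns the same pair.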

+4 -1

arch/loongarch/kvm/Kconfig

···
21 21     tristate "Kernel-based Virtual Machine (KVM) support"
22 22     depends on AS_HAS_LVZ_EXTENSION
23 23     select HAVE_KVM_DIRTY_RING_ACQ_REL
24 +     select HAVE_KVM_IRQ_ROUTING
25 +     select HAVE_KVM_IRQCHIP
26 +     select HAVE_KVM_MSI
27 +     select HAVE_KVM_READONLY_MEM
24 28     select HAVE_KVM_VCPU_ASYNC_IOCTL
25 29     select KVM_COMMON
26 30     select KVM_GENERIC_DIRTYLOG_READ_PROTECT
27 31     select KVM_GENERIC_HARDWARE_ENABLING
28 32     select KVM_GENERIC_MMU_NOTIFIER
29 33     select KVM_MMIO
30 -     select HAVE_KVM_READONLY_MEM
31 34     select KVM_XFER_TO_GUEST_WORK
32 35     select SCHED_INFO
33 36     help

+4

arch/loongarch/kvm/Makefile

···
18 18 kvm-y += tlb.o
19 19 kvm-y += vcpu.o
20 20 kvm-y += vm.o
21 + kvm-y += intc/ipi.o
22 + kvm-y += intc/eiointc.o
23 + kvm-y += intc/pch_pic.o
24 + kvm-y += irqfd.o
21 25
22 26 CFLAGS_exit.o += $(call cc-option,-Wno-override-init,)

+58 -24

arch/loongarch/kvm/exit.c

···
157 157 int kvm_emu_iocsr(larch_inst inst, struct kvm_run *run, struct kvm_vcpu *vcpu)
158 158 {
159 159     int ret;
160 -     unsigned long val;
160 +     unsigned long *val;
161 161     u32 addr, rd, rj, opcode;
162 162
163 163     /*
···
170 170     ret = EMULATE_DO_IOCSR;
171 171     run->iocsr_io.phys_addr = addr;
172 172     run->iocsr_io.is_write = 0;
173 +     val = &vcpu->arch.gprs[rd];
173 174
174 175     /* LoongArch is Little endian */
175 176     switch (opcode) {
···
203 202         run->iocsr_io.is_write = 1;
204 203         break;
205 204     default:
206 -         ret = EMULATE_FAIL;
207 -         break;
205 +         return EMULATE_FAIL;
208 206     }
209 207
210 -     if (ret == EMULATE_DO_IOCSR) {
211 -         if (run->iocsr_io.is_write) {
212 -             val = vcpu->arch.gprs[rd];
213 -             memcpy(run->iocsr_io.data, &val, run->iocsr_io.len);
214 -         }
215 -         vcpu->arch.io_gpr = rd;
208 +     if (run->iocsr_io.is_write) {
209 +         if (!kvm_io_bus_write(vcpu, KVM_IOCSR_BUS, addr, run->iocsr_io.len, val))
210 +             ret = EMULATE_DONE;
211 +         else
212 +             /* Save data and let user space to write it */
213 +             memcpy(run->iocsr_io.data, val, run->iocsr_io.len);
214 +
215 +         trace_kvm_iocsr(KVM_TRACE_IOCSR_WRITE, run->iocsr_io.len, addr, val);
216 +     } else {
217 +         if (!kvm_io_bus_read(vcpu, KVM_IOCSR_BUS, addr, run->iocsr_io.len, val))
218 +             ret = EMULATE_DONE;
219 +         else
220 +             /* Save register id for iocsr read completion */
221 +             vcpu->arch.io_gpr = rd;
222 +
223 +         trace_kvm_iocsr(KVM_TRACE_IOCSR_READ, run->iocsr_io.len, addr, NULL);
216 224     }
217 225
218 226     return ret;
···
457 447     }
458 448
459 449     if (ret == EMULATE_DO_MMIO) {
450 +         trace_kvm_mmio(KVM_TRACE_MMIO_READ, run->mmio.len, run->mmio.phys_addr, NULL);
451 +
452 +         /*
453 +          * If mmio device such as PCH-PIC is emulated in KVM,
454 +          * it need not return to user space to handle the mmio
455 +          * exception.
456 +          */
457 +         ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, vcpu->arch.badv,
458 +                 run->mmio.len, &vcpu->arch.gprs[rd]);
459 +         if (!ret) {
460 +             update_pc(&vcpu->arch);
461 +             vcpu->mmio_needed = 0;
462 +             return EMULATE_DONE;
463 +         }
464 +
460 465         /* Set for kvm_complete_mmio_read() use */
461 466         vcpu->arch.io_gpr = rd;
462 467         run->mmio.is_write = 0;
463 468         vcpu->mmio_is_write = 0;
464 -         trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, run->mmio.len,
465 -                 run->mmio.phys_addr, NULL);
466 -     } else {
467 -         kvm_err("Read not supported Inst=0x%08x @%lx BadVaddr:%#lx\n",
468 -             inst.word, vcpu->arch.pc, vcpu->arch.badv);
469 -         kvm_arch_vcpu_dump_regs(vcpu);
470 -         vcpu->mmio_needed = 0;
469 +         return EMULATE_DO_MMIO;
471 470     }
471 +
472 +     kvm_err("Read not supported Inst=0x%08x @%lx BadVaddr:%#lx\n",
473 +         inst.word, vcpu->arch.pc, vcpu->arch.badv);
474 +     kvm_arch_vcpu_dump_regs(vcpu);
475 +     vcpu->mmio_needed = 0;
472 476
473 477     return ret;
474 478 }
···
624 600     }
625 601
626 602     if (ret == EMULATE_DO_MMIO) {
603 +         trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, run->mmio.len, run->mmio.phys_addr, data);
604 +
605 +         /*
606 +          * If mmio device such as PCH-PIC is emulated in KVM,
607 +          * it need not return to user space to handle the mmio
608 +          * exception.
609 +          */
610 +         ret = kvm_io_bus_write(vcpu, KVM_MMIO_BUS, vcpu->arch.badv, run->mmio.len, data);
611 +         if (!ret)
612 +             return EMULATE_DONE;
613 +
627 614         run->mmio.is_write = 1;
628 615         vcpu->mmio_needed = 1;
629 616         vcpu->mmio_is_write = 1;
630 -         trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, run->mmio.len,
631 -                 run->mmio.phys_addr, data);
632 -     } else {
633 -         vcpu->arch.pc = curr_pc;
634 -         kvm_err("Write not supported Inst=0x%08x @%lx BadVaddr:%#lx\n",
635 -             inst.word, vcpu->arch.pc, vcpu->arch.badv);
636 -         kvm_arch_vcpu_dump_regs(vcpu);
637 -         /* Rollback PC if emulation was unsuccessful */
617 +         return EMULATE_DO_MMIO;
638 618     }
619 +
620 +     vcpu->arch.pc = curr_pc;
621 +     kvm_err("Write not supported Inst=0x%08x @%lx BadVaddr:%#lx\n",
622 +         inst.word, vcpu->arch.pc, vcpu->arch.badv);
623 +     kvm_arch_vcpu_dump_regs(vcpu);
624 +     /* Rollback PC if emulation was unsuccessful */
639 625
640 626     return ret;
641 627 }
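
Note on the exit.c change above: the new IOCSR and MMIO paths first offer the access to an in-kernel device and only fall back to user space when no device claims it. The sketch below models the write side of that control flow with a mock bus. All `demo_*` and `DEMO_*` names are hypothetical, and the mock returns -1 for an unclaimed address purely for illustration; it does not reproduce the real return codes of kvm_io_bus_write().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum demo_emulate { DEMO_EMULATE_DONE, DEMO_EMULATE_DO_IOCSR };

/* Hypothetical bus hook: returns 0 when an in-kernel device took the access. */
typedef int (*demo_bus_write_fn)(uint32_t addr, int len, const void *val);

/*
 * Try the in-kernel bus first. If no device claims the address, copy the
 * register value into the buffer that user space will consume and ask the
 * caller to exit to user space.
 */
static enum demo_emulate demo_iocsr_write(demo_bus_write_fn bus_write,
                                          uint32_t addr, int len,
                                          const uint64_t *gpr, uint8_t *user_buf)
{
    assert(len > 0 && (size_t)len <= sizeof(*gpr));
    if (bus_write(addr, len, gpr) == 0)
        return DEMO_EMULATE_DONE;
    memcpy(user_buf, gpr, (size_t)len);
    return DEMO_EMULATE_DO_IOCSR;
}

/* Mock bus that only claims the IPI window 0x1000..0x115f. */
static int demo_bus_ipi_only(uint32_t addr, int len, const void *val)
{
    (void)len;
    (void)val;
    return (addr >= 0x1000 && addr < 0x1000 + 0x160) ? 0 : -1;
}
```

A write to 0x1008 is absorbed by the mock device and leaves the user buffer untouched, while a write to 0x2000 fills the buffer and reports that user space must finish the access.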

+1027

arch/loongarch/kvm/intc/eiointc.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2024 Loongson Technology Corporation Limited 4 + */ 5 + 6 + #include <asm/kvm_eiointc.h> 7 + #include <asm/kvm_vcpu.h> 8 + #include <linux/count_zeros.h> 9 + 10 + static void eiointc_set_sw_coreisr(struct loongarch_eiointc *s) 11 + { 12 + int ipnum, cpu, irq_index, irq_mask, irq; 13 + 14 + for (irq = 0; irq < EIOINTC_IRQS; irq++) { 15 + ipnum = s->ipmap.reg_u8[irq / 32]; 16 + if (!(s->status & BIT(EIOINTC_ENABLE_INT_ENCODE))) { 17 + ipnum = count_trailing_zeros(ipnum); 18 + ipnum = (ipnum >= 0 && ipnum < 4) ? ipnum : 0; 19 + } 20 + irq_index = irq / 32; 21 + irq_mask = BIT(irq & 0x1f); 22 + 23 + cpu = s->coremap.reg_u8[irq]; 24 + if (!!(s->coreisr.reg_u32[cpu][irq_index] & irq_mask)) 25 + set_bit(irq, s->sw_coreisr[cpu][ipnum]); 26 + else 27 + clear_bit(irq, s->sw_coreisr[cpu][ipnum]); 28 + } 29 + } 30 + 31 + static void eiointc_update_irq(struct loongarch_eiointc *s, int irq, int level) 32 + { 33 + int ipnum, cpu, found, irq_index, irq_mask; 34 + struct kvm_vcpu *vcpu; 35 + struct kvm_interrupt vcpu_irq; 36 + 37 + ipnum = s->ipmap.reg_u8[irq / 32]; 38 + if (!(s->status & BIT(EIOINTC_ENABLE_INT_ENCODE))) { 39 + ipnum = count_trailing_zeros(ipnum); 40 + ipnum = (ipnum >= 0 && ipnum < 4) ? 
ipnum : 0; 41 + } 42 + 43 + cpu = s->sw_coremap[irq]; 44 + vcpu = kvm_get_vcpu(s->kvm, cpu); 45 + irq_index = irq / 32; 46 + irq_mask = BIT(irq & 0x1f); 47 + 48 + if (level) { 49 + /* if not enable return false */ 50 + if (((s->enable.reg_u32[irq_index]) & irq_mask) == 0) 51 + return; 52 + s->coreisr.reg_u32[cpu][irq_index] |= irq_mask; 53 + found = find_first_bit(s->sw_coreisr[cpu][ipnum], EIOINTC_IRQS); 54 + set_bit(irq, s->sw_coreisr[cpu][ipnum]); 55 + } else { 56 + s->coreisr.reg_u32[cpu][irq_index] &= ~irq_mask; 57 + clear_bit(irq, s->sw_coreisr[cpu][ipnum]); 58 + found = find_first_bit(s->sw_coreisr[cpu][ipnum], EIOINTC_IRQS); 59 + } 60 + 61 + if (found < EIOINTC_IRQS) 62 + return; /* other irq is handling, needn't update parent irq */ 63 + 64 + vcpu_irq.irq = level ? (INT_HWI0 + ipnum) : -(INT_HWI0 + ipnum); 65 + kvm_vcpu_ioctl_interrupt(vcpu, &vcpu_irq); 66 + } 67 + 68 + static inline void eiointc_update_sw_coremap(struct loongarch_eiointc *s, 69 + int irq, void *pvalue, u32 len, bool notify) 70 + { 71 + int i, cpu; 72 + u64 val = *(u64 *)pvalue; 73 + 74 + for (i = 0; i < len; i++) { 75 + cpu = val & 0xff; 76 + val = val >> 8; 77 + 78 + if (!(s->status & BIT(EIOINTC_ENABLE_CPU_ENCODE))) { 79 + cpu = ffs(cpu) - 1; 80 + cpu = (cpu >= 4) ? 0 : cpu; 81 + } 82 + 83 + if (s->sw_coremap[irq + i] == cpu) 84 + continue; 85 + 86 + if (notify && test_bit(irq + i, (unsigned long *)s->isr.reg_u8)) { 87 + /* lower irq at old cpu and raise irq at new cpu */ 88 + eiointc_update_irq(s, irq + i, 0); 89 + s->sw_coremap[irq + i] = cpu; 90 + eiointc_update_irq(s, irq + i, 1); 91 + } else { 92 + s->sw_coremap[irq + i] = cpu; 93 + } 94 + } 95 + } 96 + 97 + void eiointc_set_irq(struct loongarch_eiointc *s, int irq, int level) 98 + { 99 + unsigned long flags; 100 + unsigned long *isr = (unsigned long *)s->isr.reg_u8; 101 + 102 + level ? 
set_bit(irq, isr) : clear_bit(irq, isr); 103 + spin_lock_irqsave(&s->lock, flags); 104 + eiointc_update_irq(s, irq, level); 105 + spin_unlock_irqrestore(&s->lock, flags); 106 + } 107 + 108 + static inline void eiointc_enable_irq(struct kvm_vcpu *vcpu, 109 + struct loongarch_eiointc *s, int index, u8 mask, int level) 110 + { 111 + u8 val; 112 + int irq; 113 + 114 + val = mask & s->isr.reg_u8[index]; 115 + irq = ffs(val); 116 + while (irq != 0) { 117 + /* 118 + * enable bit change from 0 to 1, 119 + * need to update irq by pending bits 120 + */ 121 + eiointc_update_irq(s, irq - 1 + index * 8, level); 122 + val &= ~BIT(irq - 1); 123 + irq = ffs(val); 124 + } 125 + } 126 + 127 + static int loongarch_eiointc_readb(struct kvm_vcpu *vcpu, struct loongarch_eiointc *s, 128 + gpa_t addr, int len, void *val) 129 + { 130 + int index, ret = 0; 131 + u8 data = 0; 132 + gpa_t offset; 133 + 134 + offset = addr - EIOINTC_BASE; 135 + switch (offset) { 136 + case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 137 + index = offset - EIOINTC_NODETYPE_START; 138 + data = s->nodetype.reg_u8[index]; 139 + break; 140 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 141 + index = offset - EIOINTC_IPMAP_START; 142 + data = s->ipmap.reg_u8[index]; 143 + break; 144 + case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 145 + index = offset - EIOINTC_ENABLE_START; 146 + data = s->enable.reg_u8[index]; 147 + break; 148 + case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 149 + index = offset - EIOINTC_BOUNCE_START; 150 + data = s->bounce.reg_u8[index]; 151 + break; 152 + case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 153 + index = offset - EIOINTC_COREISR_START; 154 + data = s->coreisr.reg_u8[vcpu->vcpu_id][index]; 155 + break; 156 + case EIOINTC_COREMAP_START ... 
EIOINTC_COREMAP_END: 157 + index = offset - EIOINTC_COREMAP_START; 158 + data = s->coremap.reg_u8[index]; 159 + break; 160 + default: 161 + ret = -EINVAL; 162 + break; 163 + } 164 + *(u8 *)val = data; 165 + 166 + return ret; 167 + } 168 + 169 + static int loongarch_eiointc_readw(struct kvm_vcpu *vcpu, struct loongarch_eiointc *s, 170 + gpa_t addr, int len, void *val) 171 + { 172 + int index, ret = 0; 173 + u16 data = 0; 174 + gpa_t offset; 175 + 176 + offset = addr - EIOINTC_BASE; 177 + switch (offset) { 178 + case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 179 + index = (offset - EIOINTC_NODETYPE_START) >> 1; 180 + data = s->nodetype.reg_u16[index]; 181 + break; 182 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 183 + index = (offset - EIOINTC_IPMAP_START) >> 1; 184 + data = s->ipmap.reg_u16[index]; 185 + break; 186 + case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 187 + index = (offset - EIOINTC_ENABLE_START) >> 1; 188 + data = s->enable.reg_u16[index]; 189 + break; 190 + case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 191 + index = (offset - EIOINTC_BOUNCE_START) >> 1; 192 + data = s->bounce.reg_u16[index]; 193 + break; 194 + case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 195 + index = (offset - EIOINTC_COREISR_START) >> 1; 196 + data = s->coreisr.reg_u16[vcpu->vcpu_id][index]; 197 + break; 198 + case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 199 + index = (offset - EIOINTC_COREMAP_START) >> 1; 200 + data = s->coremap.reg_u16[index]; 201 + break; 202 + default: 203 + ret = -EINVAL; 204 + break; 205 + } 206 + *(u16 *)val = data; 207 + 208 + return ret; 209 + } 210 + 211 + static int loongarch_eiointc_readl(struct kvm_vcpu *vcpu, struct loongarch_eiointc *s, 212 + gpa_t addr, int len, void *val) 213 + { 214 + int index, ret = 0; 215 + u32 data = 0; 216 + gpa_t offset; 217 + 218 + offset = addr - EIOINTC_BASE; 219 + switch (offset) { 220 + case EIOINTC_NODETYPE_START ... 
EIOINTC_NODETYPE_END: 221 + index = (offset - EIOINTC_NODETYPE_START) >> 2; 222 + data = s->nodetype.reg_u32[index]; 223 + break; 224 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 225 + index = (offset - EIOINTC_IPMAP_START) >> 2; 226 + data = s->ipmap.reg_u32[index]; 227 + break; 228 + case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 229 + index = (offset - EIOINTC_ENABLE_START) >> 2; 230 + data = s->enable.reg_u32[index]; 231 + break; 232 + case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 233 + index = (offset - EIOINTC_BOUNCE_START) >> 2; 234 + data = s->bounce.reg_u32[index]; 235 + break; 236 + case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 237 + index = (offset - EIOINTC_COREISR_START) >> 2; 238 + data = s->coreisr.reg_u32[vcpu->vcpu_id][index]; 239 + break; 240 + case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 241 + index = (offset - EIOINTC_COREMAP_START) >> 2; 242 + data = s->coremap.reg_u32[index]; 243 + break; 244 + default: 245 + ret = -EINVAL; 246 + break; 247 + } 248 + *(u32 *)val = data; 249 + 250 + return ret; 251 + } 252 + 253 + static int loongarch_eiointc_readq(struct kvm_vcpu *vcpu, struct loongarch_eiointc *s, 254 + gpa_t addr, int len, void *val) 255 + { 256 + int index, ret = 0; 257 + u64 data = 0; 258 + gpa_t offset; 259 + 260 + offset = addr - EIOINTC_BASE; 261 + switch (offset) { 262 + case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 263 + index = (offset - EIOINTC_NODETYPE_START) >> 3; 264 + data = s->nodetype.reg_u64[index]; 265 + break; 266 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 267 + index = (offset - EIOINTC_IPMAP_START) >> 3; 268 + data = s->ipmap.reg_u64; 269 + break; 270 + case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 271 + index = (offset - EIOINTC_ENABLE_START) >> 3; 272 + data = s->enable.reg_u64[index]; 273 + break; 274 + case EIOINTC_BOUNCE_START ... 
EIOINTC_BOUNCE_END: 275 + index = (offset - EIOINTC_BOUNCE_START) >> 3; 276 + data = s->bounce.reg_u64[index]; 277 + break; 278 + case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 279 + index = (offset - EIOINTC_COREISR_START) >> 3; 280 + data = s->coreisr.reg_u64[vcpu->vcpu_id][index]; 281 + break; 282 + case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 283 + index = (offset - EIOINTC_COREMAP_START) >> 3; 284 + data = s->coremap.reg_u64[index]; 285 + break; 286 + default: 287 + ret = -EINVAL; 288 + break; 289 + } 290 + *(u64 *)val = data; 291 + 292 + return ret; 293 + } 294 + 295 + static int kvm_eiointc_read(struct kvm_vcpu *vcpu, 296 + struct kvm_io_device *dev, 297 + gpa_t addr, int len, void *val) 298 + { 299 + int ret = -EINVAL; 300 + unsigned long flags; 301 + struct loongarch_eiointc *eiointc = vcpu->kvm->arch.eiointc; 302 + 303 + if (!eiointc) { 304 + kvm_err("%s: eiointc irqchip not valid!\n", __func__); 305 + return -EINVAL; 306 + } 307 + 308 + vcpu->kvm->stat.eiointc_read_exits++; 309 + spin_lock_irqsave(&eiointc->lock, flags); 310 + switch (len) { 311 + case 1: 312 + ret = loongarch_eiointc_readb(vcpu, eiointc, addr, len, val); 313 + break; 314 + case 2: 315 + ret = loongarch_eiointc_readw(vcpu, eiointc, addr, len, val); 316 + break; 317 + case 4: 318 + ret = loongarch_eiointc_readl(vcpu, eiointc, addr, len, val); 319 + break; 320 + case 8: 321 + ret = loongarch_eiointc_readq(vcpu, eiointc, addr, len, val); 322 + break; 323 + default: 324 + WARN_ONCE(1, "%s: Abnormal address access: addr 0x%llx, size %d\n", 325 + __func__, addr, len); 326 + } 327 + spin_unlock_irqrestore(&eiointc->lock, flags); 328 + 329 + return ret; 330 + } 331 + 332 + static int loongarch_eiointc_writeb(struct kvm_vcpu *vcpu, 333 + struct loongarch_eiointc *s, 334 + gpa_t addr, int len, const void *val) 335 + { 336 + int index, irq, bits, ret = 0; 337 + u8 cpu; 338 + u8 data, old_data; 339 + u8 coreisr, old_coreisr; 340 + gpa_t offset; 341 + 342 + data = *(u8 *)val; 343 + offset 
= addr - EIOINTC_BASE; 344 + 345 + switch (offset) { 346 + case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 347 + index = (offset - EIOINTC_NODETYPE_START); 348 + s->nodetype.reg_u8[index] = data; 349 + break; 350 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 351 + /* 352 + * ipmap cannot be set at runtime, can be set only at the beginning 353 + * of irqchip driver, need not update upper irq level 354 + */ 355 + index = (offset - EIOINTC_IPMAP_START); 356 + s->ipmap.reg_u8[index] = data; 357 + break; 358 + case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 359 + index = (offset - EIOINTC_ENABLE_START); 360 + old_data = s->enable.reg_u8[index]; 361 + s->enable.reg_u8[index] = data; 362 + /* 363 + * 1: enable irq. 364 + * update irq when isr is set. 365 + */ 366 + data = s->enable.reg_u8[index] & ~old_data & s->isr.reg_u8[index]; 367 + eiointc_enable_irq(vcpu, s, index, data, 1); 368 + /* 369 + * 0: disable irq. 370 + * update irq when isr is set. 371 + */ 372 + data = ~s->enable.reg_u8[index] & old_data & s->isr.reg_u8[index]; 373 + eiointc_enable_irq(vcpu, s, index, data, 0); 374 + break; 375 + case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 376 + /* do not emulate hw bounced irq routing */ 377 + index = offset - EIOINTC_BOUNCE_START; 378 + s->bounce.reg_u8[index] = data; 379 + break; 380 + case EIOINTC_COREISR_START ... 
EIOINTC_COREISR_END: 381 + index = (offset - EIOINTC_COREISR_START); 382 + /* use attrs to get current cpu index */ 383 + cpu = vcpu->vcpu_id; 384 + coreisr = data; 385 + old_coreisr = s->coreisr.reg_u8[cpu][index]; 386 + /* write 1 to clear interrupt */ 387 + s->coreisr.reg_u8[cpu][index] = old_coreisr & ~coreisr; 388 + coreisr &= old_coreisr; 389 + bits = sizeof(data) * 8; 390 + irq = find_first_bit((void *)&coreisr, bits); 391 + while (irq < bits) { 392 + eiointc_update_irq(s, irq + index * bits, 0); 393 + bitmap_clear((void *)&coreisr, irq, 1); 394 + irq = find_first_bit((void *)&coreisr, bits); 395 + } 396 + break; 397 + case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 398 + irq = offset - EIOINTC_COREMAP_START; 399 + index = irq; 400 + s->coremap.reg_u8[index] = data; 401 + eiointc_update_sw_coremap(s, irq, (void *)&data, sizeof(data), true); 402 + break; 403 + default: 404 + ret = -EINVAL; 405 + break; 406 + } 407 + 408 + return ret; 409 + } 410 + 411 + static int loongarch_eiointc_writew(struct kvm_vcpu *vcpu, 412 + struct loongarch_eiointc *s, 413 + gpa_t addr, int len, const void *val) 414 + { 415 + int i, index, irq, bits, ret = 0; 416 + u8 cpu; 417 + u16 data, old_data; 418 + u16 coreisr, old_coreisr; 419 + gpa_t offset; 420 + 421 + data = *(u16 *)val; 422 + offset = addr - EIOINTC_BASE; 423 + 424 + switch (offset) { 425 + case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 426 + index = (offset - EIOINTC_NODETYPE_START) >> 1; 427 + s->nodetype.reg_u16[index] = data; 428 + break; 429 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 430 + /* 431 + * ipmap cannot be set at runtime, can be set only at the beginning 432 + * of irqchip driver, need not update upper irq level 433 + */ 434 + index = (offset - EIOINTC_IPMAP_START) >> 1; 435 + s->ipmap.reg_u16[index] = data; 436 + break; 437 + case EIOINTC_ENABLE_START ... 
EIOINTC_ENABLE_END: 438 + index = (offset - EIOINTC_ENABLE_START) >> 1; 439 + old_data = s->enable.reg_u32[index]; 440 + s->enable.reg_u16[index] = data; 441 + /* 442 + * 1: enable irq. 443 + * update irq when isr is set. 444 + */ 445 + data = s->enable.reg_u16[index] & ~old_data & s->isr.reg_u16[index]; 446 + index = index << 1; 447 + for (i = 0; i < sizeof(data); i++) { 448 + u8 mask = (data >> (i * 8)) & 0xff; 449 + eiointc_enable_irq(vcpu, s, index + i, mask, 1); 450 + } 451 + /* 452 + * 0: disable irq. 453 + * update irq when isr is set. 454 + */ 455 + data = ~s->enable.reg_u16[index] & old_data & s->isr.reg_u16[index]; 456 + for (i = 0; i < sizeof(data); i++) { 457 + u8 mask = (data >> (i * 8)) & 0xff; 458 + eiointc_enable_irq(vcpu, s, index, mask, 0); 459 + } 460 + break; 461 + case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 462 + /* do not emulate hw bounced irq routing */ 463 + index = (offset - EIOINTC_BOUNCE_START) >> 1; 464 + s->bounce.reg_u16[index] = data; 465 + break; 466 + case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 467 + index = (offset - EIOINTC_COREISR_START) >> 1; 468 + /* use attrs to get current cpu index */ 469 + cpu = vcpu->vcpu_id; 470 + coreisr = data; 471 + old_coreisr = s->coreisr.reg_u16[cpu][index]; 472 + /* write 1 to clear interrupt */ 473 + s->coreisr.reg_u16[cpu][index] = old_coreisr & ~coreisr; 474 + coreisr &= old_coreisr; 475 + bits = sizeof(data) * 8; 476 + irq = find_first_bit((void *)&coreisr, bits); 477 + while (irq < bits) { 478 + eiointc_update_irq(s, irq + index * bits, 0); 479 + bitmap_clear((void *)&coreisr, irq, 1); 480 + irq = find_first_bit((void *)&coreisr, bits); 481 + } 482 + break; 483 + case EIOINTC_COREMAP_START ... 
EIOINTC_COREMAP_END: 484 + irq = offset - EIOINTC_COREMAP_START; 485 + index = irq >> 1; 486 + s->coremap.reg_u16[index] = data; 487 + eiointc_update_sw_coremap(s, irq, (void *)&data, sizeof(data), true); 488 + break; 489 + default: 490 + ret = -EINVAL; 491 + break; 492 + } 493 + 494 + return ret; 495 + } 496 + 497 + static int loongarch_eiointc_writel(struct kvm_vcpu *vcpu, 498 + struct loongarch_eiointc *s, 499 + gpa_t addr, int len, const void *val) 500 + { 501 + int i, index, irq, bits, ret = 0; 502 + u8 cpu; 503 + u32 data, old_data; 504 + u32 coreisr, old_coreisr; 505 + gpa_t offset; 506 + 507 + data = *(u32 *)val; 508 + offset = addr - EIOINTC_BASE; 509 + 510 + switch (offset) { 511 + case EIOINTC_NODETYPE_START ... EIOINTC_NODETYPE_END: 512 + index = (offset - EIOINTC_NODETYPE_START) >> 2; 513 + s->nodetype.reg_u32[index] = data; 514 + break; 515 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 516 + /* 517 + * ipmap cannot be set at runtime, can be set only at the beginning 518 + * of irqchip driver, need not update upper irq level 519 + */ 520 + index = (offset - EIOINTC_IPMAP_START) >> 2; 521 + s->ipmap.reg_u32[index] = data; 522 + break; 523 + case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 524 + index = (offset - EIOINTC_ENABLE_START) >> 2; 525 + old_data = s->enable.reg_u32[index]; 526 + s->enable.reg_u32[index] = data; 527 + /* 528 + * 1: enable irq. 529 + * update irq when isr is set. 530 + */ 531 + data = s->enable.reg_u32[index] & ~old_data & s->isr.reg_u32[index]; 532 + index = index << 2; 533 + for (i = 0; i < sizeof(data); i++) { 534 + u8 mask = (data >> (i * 8)) & 0xff; 535 + eiointc_enable_irq(vcpu, s, index + i, mask, 1); 536 + } 537 + /* 538 + * 0: disable irq. 539 + * update irq when isr is set. 
540 + */ 541 + data = ~s->enable.reg_u32[index] & old_data & s->isr.reg_u32[index]; 542 + for (i = 0; i < sizeof(data); i++) { 543 + u8 mask = (data >> (i * 8)) & 0xff; 544 + eiointc_enable_irq(vcpu, s, index, mask, 0); 545 + } 546 + break; 547 + case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 548 + /* do not emulate hw bounced irq routing */ 549 + index = (offset - EIOINTC_BOUNCE_START) >> 2; 550 + s->bounce.reg_u32[index] = data; 551 + break; 552 + case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 553 + index = (offset - EIOINTC_COREISR_START) >> 2; 554 + /* use attrs to get current cpu index */ 555 + cpu = vcpu->vcpu_id; 556 + coreisr = data; 557 + old_coreisr = s->coreisr.reg_u32[cpu][index]; 558 + /* write 1 to clear interrupt */ 559 + s->coreisr.reg_u32[cpu][index] = old_coreisr & ~coreisr; 560 + coreisr &= old_coreisr; 561 + bits = sizeof(data) * 8; 562 + irq = find_first_bit((void *)&coreisr, bits); 563 + while (irq < bits) { 564 + eiointc_update_irq(s, irq + index * bits, 0); 565 + bitmap_clear((void *)&coreisr, irq, 1); 566 + irq = find_first_bit((void *)&coreisr, bits); 567 + } 568 + break; 569 + case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 570 + irq = offset - EIOINTC_COREMAP_START; 571 + index = irq >> 2; 572 + s->coremap.reg_u32[index] = data; 573 + eiointc_update_sw_coremap(s, irq, (void *)&data, sizeof(data), true); 574 + break; 575 + default: 576 + ret = -EINVAL; 577 + break; 578 + } 579 + 580 + return ret; 581 + } 582 + 583 + static int loongarch_eiointc_writeq(struct kvm_vcpu *vcpu, 584 + struct loongarch_eiointc *s, 585 + gpa_t addr, int len, const void *val) 586 + { 587 + int i, index, irq, bits, ret = 0; 588 + u8 cpu; 589 + u64 data, old_data; 590 + u64 coreisr, old_coreisr; 591 + gpa_t offset; 592 + 593 + data = *(u64 *)val; 594 + offset = addr - EIOINTC_BASE; 595 + 596 + switch (offset) { 597 + case EIOINTC_NODETYPE_START ... 
EIOINTC_NODETYPE_END: 598 + index = (offset - EIOINTC_NODETYPE_START) >> 3; 599 + s->nodetype.reg_u64[index] = data; 600 + break; 601 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 602 + /* 603 + * ipmap cannot be set at runtime, can be set only at the beginning 604 + * of irqchip driver, need not update upper irq level 605 + */ 606 + index = (offset - EIOINTC_IPMAP_START) >> 3; 607 + s->ipmap.reg_u64 = data; 608 + break; 609 + case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 610 + index = (offset - EIOINTC_ENABLE_START) >> 3; 611 + old_data = s->enable.reg_u64[index]; 612 + s->enable.reg_u64[index] = data; 613 + /* 614 + * 1: enable irq. 615 + * update irq when isr is set. 616 + */ 617 + data = s->enable.reg_u64[index] & ~old_data & s->isr.reg_u64[index]; 618 + index = index << 3; 619 + for (i = 0; i < sizeof(data); i++) { 620 + u8 mask = (data >> (i * 8)) & 0xff; 621 + eiointc_enable_irq(vcpu, s, index + i, mask, 1); 622 + } 623 + /* 624 + * 0: disable irq. 625 + * update irq when isr is set. 626 + */ 627 + data = ~s->enable.reg_u64[index] & old_data & s->isr.reg_u64[index]; 628 + for (i = 0; i < sizeof(data); i++) { 629 + u8 mask = (data >> (i * 8)) & 0xff; 630 + eiointc_enable_irq(vcpu, s, index, mask, 0); 631 + } 632 + break; 633 + case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 634 + /* do not emulate hw bounced irq routing */ 635 + index = (offset - EIOINTC_BOUNCE_START) >> 3; 636 + s->bounce.reg_u64[index] = data; 637 + break; 638 + case EIOINTC_COREISR_START ... 
EIOINTC_COREISR_END: 639 + index = (offset - EIOINTC_COREISR_START) >> 3; 640 + /* use attrs to get current cpu index */ 641 + cpu = vcpu->vcpu_id; 642 + coreisr = data; 643 + old_coreisr = s->coreisr.reg_u64[cpu][index]; 644 + /* write 1 to clear interrupt */ 645 + s->coreisr.reg_u64[cpu][index] = old_coreisr & ~coreisr; 646 + coreisr &= old_coreisr; 647 + bits = sizeof(data) * 8; 648 + irq = find_first_bit((void *)&coreisr, bits); 649 + while (irq < bits) { 650 + eiointc_update_irq(s, irq + index * bits, 0); 651 + bitmap_clear((void *)&coreisr, irq, 1); 652 + irq = find_first_bit((void *)&coreisr, bits); 653 + } 654 + break; 655 + case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 656 + irq = offset - EIOINTC_COREMAP_START; 657 + index = irq >> 3; 658 + s->coremap.reg_u64[index] = data; 659 + eiointc_update_sw_coremap(s, irq, (void *)&data, sizeof(data), true); 660 + break; 661 + default: 662 + ret = -EINVAL; 663 + break; 664 + } 665 + 666 + return ret; 667 + } 668 + 669 + static int kvm_eiointc_write(struct kvm_vcpu *vcpu, 670 + struct kvm_io_device *dev, 671 + gpa_t addr, int len, const void *val) 672 + { 673 + int ret = -EINVAL; 674 + unsigned long flags; 675 + struct loongarch_eiointc *eiointc = vcpu->kvm->arch.eiointc; 676 + 677 + if (!eiointc) { 678 + kvm_err("%s: eiointc irqchip not valid!\n", __func__); 679 + return -EINVAL; 680 + } 681 + 682 + vcpu->kvm->stat.eiointc_write_exits++; 683 + spin_lock_irqsave(&eiointc->lock, flags); 684 + switch (len) { 685 + case 1: 686 + ret = loongarch_eiointc_writeb(vcpu, eiointc, addr, len, val); 687 + break; 688 + case 2: 689 + ret = loongarch_eiointc_writew(vcpu, eiointc, addr, len, val); 690 + break; 691 + case 4: 692 + ret = loongarch_eiointc_writel(vcpu, eiointc, addr, len, val); 693 + break; 694 + case 8: 695 + ret = loongarch_eiointc_writeq(vcpu, eiointc, addr, len, val); 696 + break; 697 + default: 698 + WARN_ONCE(1, "%s: Abnormal address access: addr 0x%llx, size %d\n", 699 + __func__, addr, len); 700 + } 701 
+ spin_unlock_irqrestore(&eiointc->lock, flags); 702 + 703 + return ret; 704 + } 705 + 706 + static const struct kvm_io_device_ops kvm_eiointc_ops = { 707 + .read = kvm_eiointc_read, 708 + .write = kvm_eiointc_write, 709 + }; 710 + 711 + static int kvm_eiointc_virt_read(struct kvm_vcpu *vcpu, 712 + struct kvm_io_device *dev, 713 + gpa_t addr, int len, void *val) 714 + { 715 + unsigned long flags; 716 + u32 *data = val; 717 + struct loongarch_eiointc *eiointc = vcpu->kvm->arch.eiointc; 718 + 719 + if (!eiointc) { 720 + kvm_err("%s: eiointc irqchip not valid!\n", __func__); 721 + return -EINVAL; 722 + } 723 + 724 + addr -= EIOINTC_VIRT_BASE; 725 + spin_lock_irqsave(&eiointc->lock, flags); 726 + switch (addr) { 727 + case EIOINTC_VIRT_FEATURES: 728 + *data = eiointc->features; 729 + break; 730 + case EIOINTC_VIRT_CONFIG: 731 + *data = eiointc->status; 732 + break; 733 + default: 734 + break; 735 + } 736 + spin_unlock_irqrestore(&eiointc->lock, flags); 737 + 738 + return 0; 739 + } 740 + 741 + static int kvm_eiointc_virt_write(struct kvm_vcpu *vcpu, 742 + struct kvm_io_device *dev, 743 + gpa_t addr, int len, const void *val) 744 + { 745 + int ret = 0; 746 + unsigned long flags; 747 + u32 value = *(u32 *)val; 748 + struct loongarch_eiointc *eiointc = vcpu->kvm->arch.eiointc; 749 + 750 + if (!eiointc) { 751 + kvm_err("%s: eiointc irqchip not valid!\n", __func__); 752 + return -EINVAL; 753 + } 754 + 755 + addr -= EIOINTC_VIRT_BASE; 756 + spin_lock_irqsave(&eiointc->lock, flags); 757 + switch (addr) { 758 + case EIOINTC_VIRT_FEATURES: 759 + ret = -EPERM; 760 + break; 761 + case EIOINTC_VIRT_CONFIG: 762 + /* 763 + * eiointc features can only be set at disabled status 764 + */ 765 + if ((eiointc->status & BIT(EIOINTC_ENABLE)) && value) { 766 + ret = -EPERM; 767 + break; 768 + } 769 + eiointc->status = value & eiointc->features; 770 + break; 771 + default: 772 + break; 773 + } 774 + spin_unlock_irqrestore(&eiointc->lock, flags); 775 + 776 + return ret; 777 + } 778 + 779 + 
static const struct kvm_io_device_ops kvm_eiointc_virt_ops = { 780 + .read = kvm_eiointc_virt_read, 781 + .write = kvm_eiointc_virt_write, 782 + }; 783 + 784 + static int kvm_eiointc_ctrl_access(struct kvm_device *dev, 785 + struct kvm_device_attr *attr) 786 + { 787 + int ret = 0; 788 + unsigned long flags; 789 + unsigned long type = (unsigned long)attr->attr; 790 + u32 i, start_irq; 791 + void __user *data; 792 + struct loongarch_eiointc *s = dev->kvm->arch.eiointc; 793 + 794 + data = (void __user *)attr->addr; 795 + spin_lock_irqsave(&s->lock, flags); 796 + switch (type) { 797 + case KVM_DEV_LOONGARCH_EXTIOI_CTRL_INIT_NUM_CPU: 798 + if (copy_from_user(&s->num_cpu, data, 4)) 799 + ret = -EFAULT; 800 + break; 801 + case KVM_DEV_LOONGARCH_EXTIOI_CTRL_INIT_FEATURE: 802 + if (copy_from_user(&s->features, data, 4)) 803 + ret = -EFAULT; 804 + if (!(s->features & BIT(EIOINTC_HAS_VIRT_EXTENSION))) 805 + s->status |= BIT(EIOINTC_ENABLE); 806 + break; 807 + case KVM_DEV_LOONGARCH_EXTIOI_CTRL_LOAD_FINISHED: 808 + eiointc_set_sw_coreisr(s); 809 + for (i = 0; i < (EIOINTC_IRQS / 4); i++) { 810 + start_irq = i * 4; 811 + eiointc_update_sw_coremap(s, start_irq, 812 + (void *)&s->coremap.reg_u32[i], sizeof(u32), false); 813 + } 814 + break; 815 + default: 816 + break; 817 + } 818 + spin_unlock_irqrestore(&s->lock, flags); 819 + 820 + return ret; 821 + } 822 + 823 + static int kvm_eiointc_regs_access(struct kvm_device *dev, 824 + struct kvm_device_attr *attr, 825 + bool is_write) 826 + { 827 + int addr, cpuid, offset, ret = 0; 828 + unsigned long flags; 829 + void *p = NULL; 830 + void __user *data; 831 + struct loongarch_eiointc *s; 832 + 833 + s = dev->kvm->arch.eiointc; 834 + addr = attr->attr; 835 + cpuid = addr >> 16; 836 + addr &= 0xffff; 837 + data = (void __user *)attr->addr; 838 + switch (addr) { 839 + case EIOINTC_NODETYPE_START ... 
EIOINTC_NODETYPE_END: 840 + offset = (addr - EIOINTC_NODETYPE_START) / 4; 841 + p = &s->nodetype.reg_u32[offset]; 842 + break; 843 + case EIOINTC_IPMAP_START ... EIOINTC_IPMAP_END: 844 + offset = (addr - EIOINTC_IPMAP_START) / 4; 845 + p = &s->ipmap.reg_u32[offset]; 846 + break; 847 + case EIOINTC_ENABLE_START ... EIOINTC_ENABLE_END: 848 + offset = (addr - EIOINTC_ENABLE_START) / 4; 849 + p = &s->enable.reg_u32[offset]; 850 + break; 851 + case EIOINTC_BOUNCE_START ... EIOINTC_BOUNCE_END: 852 + offset = (addr - EIOINTC_BOUNCE_START) / 4; 853 + p = &s->bounce.reg_u32[offset]; 854 + break; 855 + case EIOINTC_ISR_START ... EIOINTC_ISR_END: 856 + offset = (addr - EIOINTC_ISR_START) / 4; 857 + p = &s->isr.reg_u32[offset]; 858 + break; 859 + case EIOINTC_COREISR_START ... EIOINTC_COREISR_END: 860 + offset = (addr - EIOINTC_COREISR_START) / 4; 861 + p = &s->coreisr.reg_u32[cpuid][offset]; 862 + break; 863 + case EIOINTC_COREMAP_START ... EIOINTC_COREMAP_END: 864 + offset = (addr - EIOINTC_COREMAP_START) / 4; 865 + p = &s->coremap.reg_u32[offset]; 866 + break; 867 + default: 868 + kvm_err("%s: unknown eiointc register, addr = %d\n", __func__, addr); 869 + return -EINVAL; 870 + } 871 + 872 + spin_lock_irqsave(&s->lock, flags); 873 + if (is_write) { 874 + if (copy_from_user(p, data, 4)) 875 + ret = -EFAULT; 876 + } else { 877 + if (copy_to_user(data, p, 4)) 878 + ret = -EFAULT; 879 + } 880 + spin_unlock_irqrestore(&s->lock, flags); 881 + 882 + return ret; 883 + } 884 + 885 + static int kvm_eiointc_sw_status_access(struct kvm_device *dev, 886 + struct kvm_device_attr *attr, 887 + bool is_write) 888 + { 889 + int addr, ret = 0; 890 + unsigned long flags; 891 + void *p = NULL; 892 + void __user *data; 893 + struct loongarch_eiointc *s; 894 + 895 + s = dev->kvm->arch.eiointc; 896 + addr = attr->attr; 897 + addr &= 0xffff; 898 + 899 + data = (void __user *)attr->addr; 900 + switch (addr) { 901 + case KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_NUM_CPU: 902 + p = &s->num_cpu; 903 + break; 
904 + case KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_FEATURE: 905 + p = &s->features; 906 + break; 907 + case KVM_DEV_LOONGARCH_EXTIOI_SW_STATUS_STATE: 908 + p = &s->status; 909 + break; 910 + default: 911 + kvm_err("%s: unknown eiointc register, addr = %d\n", __func__, addr); 912 + return -EINVAL; 913 + } 914 + spin_lock_irqsave(&s->lock, flags); 915 + if (is_write) { 916 + if (copy_from_user(p, data, 4)) 917 + ret = -EFAULT; 918 + } else { 919 + if (copy_to_user(data, p, 4)) 920 + ret = -EFAULT; 921 + } 922 + spin_unlock_irqrestore(&s->lock, flags); 923 + 924 + return ret; 925 + } 926 + 927 + static int kvm_eiointc_get_attr(struct kvm_device *dev, 928 + struct kvm_device_attr *attr) 929 + { 930 + switch (attr->group) { 931 + case KVM_DEV_LOONGARCH_EXTIOI_GRP_REGS: 932 + return kvm_eiointc_regs_access(dev, attr, false); 933 + case KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS: 934 + return kvm_eiointc_sw_status_access(dev, attr, false); 935 + default: 936 + return -EINVAL; 937 + } 938 + } 939 + 940 + static int kvm_eiointc_set_attr(struct kvm_device *dev, 941 + struct kvm_device_attr *attr) 942 + { 943 + switch (attr->group) { 944 + case KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL: 945 + return kvm_eiointc_ctrl_access(dev, attr); 946 + case KVM_DEV_LOONGARCH_EXTIOI_GRP_REGS: 947 + return kvm_eiointc_regs_access(dev, attr, true); 948 + case KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS: 949 + return kvm_eiointc_sw_status_access(dev, attr, true); 950 + default: 951 + return -EINVAL; 952 + } 953 + } 954 + 955 + static int kvm_eiointc_create(struct kvm_device *dev, u32 type) 956 + { 957 + int ret; 958 + struct loongarch_eiointc *s; 959 + struct kvm_io_device *device, *device1; 960 + struct kvm *kvm = dev->kvm; 961 + 962 + /* eiointc has been created */ 963 + if (kvm->arch.eiointc) 964 + return -EINVAL; 965 + 966 + s = kzalloc(sizeof(struct loongarch_eiointc), GFP_KERNEL); 967 + if (!s) 968 + return -ENOMEM; 969 + 970 + spin_lock_init(&s->lock); 971 + s->kvm = kvm; 972 + 973 + /* 974 + * Initialize 
IOCSR device 975 + */ 976 + device = &s->device; 977 + kvm_iodevice_init(device, &kvm_eiointc_ops); 978 + mutex_lock(&kvm->slots_lock); 979 + ret = kvm_io_bus_register_dev(kvm, KVM_IOCSR_BUS, 980 + EIOINTC_BASE, EIOINTC_SIZE, device); 981 + mutex_unlock(&kvm->slots_lock); 982 + if (ret < 0) { 983 + kfree(s); 984 + return ret; 985 + } 986 + 987 + device1 = &s->device_vext; 988 + kvm_iodevice_init(device1, &kvm_eiointc_virt_ops); 989 + ret = kvm_io_bus_register_dev(kvm, KVM_IOCSR_BUS, 990 + EIOINTC_VIRT_BASE, EIOINTC_VIRT_SIZE, device1); 991 + if (ret < 0) { 992 + kvm_io_bus_unregister_dev(kvm, KVM_IOCSR_BUS, &s->device); 993 + kfree(s); 994 + return ret; 995 + } 996 + kvm->arch.eiointc = s; 997 + 998 + return 0; 999 + } 1000 + 1001 + static void kvm_eiointc_destroy(struct kvm_device *dev) 1002 + { 1003 + struct kvm *kvm; 1004 + struct loongarch_eiointc *eiointc; 1005 + 1006 + if (!dev || !dev->kvm || !dev->kvm->arch.eiointc) 1007 + return; 1008 + 1009 + kvm = dev->kvm; 1010 + eiointc = kvm->arch.eiointc; 1011 + kvm_io_bus_unregister_dev(kvm, KVM_IOCSR_BUS, &eiointc->device); 1012 + kvm_io_bus_unregister_dev(kvm, KVM_IOCSR_BUS, &eiointc->device_vext); 1013 + kfree(eiointc); 1014 + } 1015 + 1016 + static struct kvm_device_ops kvm_eiointc_dev_ops = { 1017 + .name = "kvm-loongarch-eiointc", 1018 + .create = kvm_eiointc_create, 1019 + .destroy = kvm_eiointc_destroy, 1020 + .set_attr = kvm_eiointc_set_attr, 1021 + .get_attr = kvm_eiointc_get_attr, 1022 + }; 1023 + 1024 + int kvm_loongarch_register_eiointc_device(void) 1025 + { 1026 + return kvm_register_device_ops(&kvm_eiointc_dev_ops, KVM_DEV_TYPE_LOONGARCH_EIOINTC); 1027 + }
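Reviewer note on the COREISR write path in eiointc.c above: the register is write-1-to-clear, and only bits that were actually pending cause the interrupt line to be lowered. The following is a minimal standalone sketch of that logic, not the kernel code itself. The names `coreisr_word`, `coreisr_w1c` and `cleared_irqs` are hypothetical, and the 64-bit word width is assumed from the `>> 3` index computation in the 64-bit write handler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one 64-bit COREISR word of one vCPU. */
struct coreisr_word {
	uint64_t pending;
};

/*
 * Write-1-to-clear: each bit set in `written` clears the matching
 * pending bit. Returns the bits that were pending and are now
 * cleared, i.e. the interrupts whose line has to be lowered.
 */
static uint64_t coreisr_w1c(struct coreisr_word *w, uint64_t written)
{
	uint64_t old = w->pending;

	w->pending = old & ~written;
	return written & old;
}

/*
 * Walk the cleared bits lowest-first and store the global irq
 * numbers (index * 64 + bit) in out[]. Returns how many were stored.
 */
static int cleared_irqs(uint64_t cleared, int index, int *out, int max)
{
	int n = 0;

	for (int bit = 0; bit < 64 && n < max; bit++) {
		if (cleared & (UINT64_C(1) << bit))
			out[n++] = index * 64 + bit;
	}
	return n;
}
```

Writing a 1 to a bit that is not pending is a no-op in this sketch, which mirrors the `coreisr &= old_coreisr` step before the bit walk in the diff.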

+475

arch/loongarch/kvm/intc/ipi.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2024 Loongson Technology Corporation Limited 4 + */ 5 + 6 + #include <linux/kvm_host.h> 7 + #include <asm/kvm_ipi.h> 8 + #include <asm/kvm_vcpu.h> 9 + 10 + static void ipi_send(struct kvm *kvm, uint64_t data) 11 + { 12 + int cpu, action; 13 + uint32_t status; 14 + struct kvm_vcpu *vcpu; 15 + struct kvm_interrupt irq; 16 + 17 + cpu = ((data & 0xffffffff) >> 16) & 0x3ff; 18 + vcpu = kvm_get_vcpu_by_cpuid(kvm, cpu); 19 + if (unlikely(vcpu == NULL)) { 20 + kvm_err("%s: invalid target cpu: %d\n", __func__, cpu); 21 + return; 22 + } 23 + 24 + action = BIT(data & 0x1f); 25 + spin_lock(&vcpu->arch.ipi_state.lock); 26 + status = vcpu->arch.ipi_state.status; 27 + vcpu->arch.ipi_state.status |= action; 28 + spin_unlock(&vcpu->arch.ipi_state.lock); 29 + if (status == 0) { 30 + irq.irq = LARCH_INT_IPI; 31 + kvm_vcpu_ioctl_interrupt(vcpu, &irq); 32 + } 33 + } 34 + 35 + static void ipi_clear(struct kvm_vcpu *vcpu, uint64_t data) 36 + { 37 + uint32_t status; 38 + struct kvm_interrupt irq; 39 + 40 + spin_lock(&vcpu->arch.ipi_state.lock); 41 + vcpu->arch.ipi_state.status &= ~data; 42 + status = vcpu->arch.ipi_state.status; 43 + spin_unlock(&vcpu->arch.ipi_state.lock); 44 + if (status == 0) { 45 + irq.irq = -LARCH_INT_IPI; 46 + kvm_vcpu_ioctl_interrupt(vcpu, &irq); 47 + } 48 + } 49 + 50 + static uint64_t read_mailbox(struct kvm_vcpu *vcpu, int offset, int len) 51 + { 52 + uint64_t data = 0; 53 + 54 + spin_lock(&vcpu->arch.ipi_state.lock); 55 + data = *(ulong *)((void *)vcpu->arch.ipi_state.buf + (offset - 0x20)); 56 + spin_unlock(&vcpu->arch.ipi_state.lock); 57 + 58 + switch (len) { 59 + case 1: 60 + return data & 0xff; 61 + case 2: 62 + return data & 0xffff; 63 + case 4: 64 + return data & 0xffffffff; 65 + case 8: 66 + return data; 67 + default: 68 + kvm_err("%s: unknown data len: %d\n", __func__, len); 69 + return 0; 70 + } 71 + } 72 + 73 + static void write_mailbox(struct kvm_vcpu *vcpu, int offset, uint64_t 
data, int len) 74 + { 75 + void *pbuf; 76 + 77 + spin_lock(&vcpu->arch.ipi_state.lock); 78 + pbuf = (void *)vcpu->arch.ipi_state.buf + (offset - 0x20); 79 + 80 + switch (len) { 81 + case 1: 82 + *(unsigned char *)pbuf = (unsigned char)data; 83 + break; 84 + case 2: 85 + *(unsigned short *)pbuf = (unsigned short)data; 86 + break; 87 + case 4: 88 + *(unsigned int *)pbuf = (unsigned int)data; 89 + break; 90 + case 8: 91 + *(unsigned long *)pbuf = (unsigned long)data; 92 + break; 93 + default: 94 + kvm_err("%s: unknown data len: %d\n", __func__, len); 95 + } 96 + spin_unlock(&vcpu->arch.ipi_state.lock); 97 + } 98 + 99 + static int send_ipi_data(struct kvm_vcpu *vcpu, gpa_t addr, uint64_t data) 100 + { 101 + int i, ret; 102 + uint32_t val = 0, mask = 0; 103 + 104 + /* 105 + * Bits 27-30 are the mask for byte writing. 106 + * If the mask is 0, we need not do anything. 107 + */ 108 + if ((data >> 27) & 0xf) { 109 + /* Read the old val */ 110 + ret = kvm_io_bus_read(vcpu, KVM_IOCSR_BUS, addr, sizeof(val), &val); 111 + if (unlikely(ret)) { 112 + kvm_err("%s: read data from addr %llx failed\n", __func__, addr); 113 + return ret; 114 + } 115 + /* Construct the mask by scanning bits 27-30 */ 116 + for (i = 0; i < 4; i++) { 117 + if (data & (BIT(27 + i))) 118 + mask |= (0xff << (i * 8)); 119 + } 120 + /* Save the old part of val */ 121 + val &= mask; 122 + } 123 + val |= ((uint32_t)(data >> 32) & ~mask); 124 + ret = kvm_io_bus_write(vcpu, KVM_IOCSR_BUS, addr, sizeof(val), &val); 125 + if (unlikely(ret)) 126 + kvm_err("%s: write data to addr %llx failed\n", __func__, addr); 127 + 128 + return ret; 129 + } 130 + 131 + static int mail_send(struct kvm *kvm, uint64_t data) 132 + { 133 + int cpu, mailbox, offset; 134 + struct kvm_vcpu *vcpu; 135 + 136 + cpu = ((data & 0xffffffff) >> 16) & 0x3ff; 137 + vcpu = kvm_get_vcpu_by_cpuid(kvm, cpu); 138 + if (unlikely(vcpu == NULL)) { 139 + kvm_err("%s: invalid target cpu: %d\n", __func__, cpu); 140 + return -EINVAL; 141 + } 142 + mailbox
= ((data & 0xffffffff) >> 2) & 0x7; 143 + offset = IOCSR_IPI_BASE + IOCSR_IPI_BUF_20 + mailbox * 4; 144 + 145 + return send_ipi_data(vcpu, offset, data); 146 + } 147 + 148 + static int any_send(struct kvm *kvm, uint64_t data) 149 + { 150 + int cpu, offset; 151 + struct kvm_vcpu *vcpu; 152 + 153 + cpu = ((data & 0xffffffff) >> 16) & 0x3ff; 154 + vcpu = kvm_get_vcpu_by_cpuid(kvm, cpu); 155 + if (unlikely(vcpu == NULL)) { 156 + kvm_err("%s: invalid target cpu: %d\n", __func__, cpu); 157 + return -EINVAL; 158 + } 159 + offset = data & 0xffff; 160 + 161 + return send_ipi_data(vcpu, offset, data); 162 + } 163 + 164 + static int loongarch_ipi_readl(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *val) 165 + { 166 + int ret = 0; 167 + uint32_t offset; 168 + uint64_t res = 0; 169 + 170 + offset = (uint32_t)(addr & 0x1ff); 171 + WARN_ON_ONCE(offset & (len - 1)); 172 + 173 + switch (offset) { 174 + case IOCSR_IPI_STATUS: 175 + spin_lock(&vcpu->arch.ipi_state.lock); 176 + res = vcpu->arch.ipi_state.status; 177 + spin_unlock(&vcpu->arch.ipi_state.lock); 178 + break; 179 + case IOCSR_IPI_EN: 180 + spin_lock(&vcpu->arch.ipi_state.lock); 181 + res = vcpu->arch.ipi_state.en; 182 + spin_unlock(&vcpu->arch.ipi_state.lock); 183 + break; 184 + case IOCSR_IPI_SET: 185 + res = 0; 186 + break; 187 + case IOCSR_IPI_CLEAR: 188 + res = 0; 189 + break; 190 + case IOCSR_IPI_BUF_20 ... 
IOCSR_IPI_BUF_38 + 7: 191 + if (offset + len > IOCSR_IPI_BUF_38 + 8) { 192 + kvm_err("%s: invalid offset or len: offset = %d, len = %d\n", 193 + __func__, offset, len); 194 + ret = -EINVAL; 195 + break; 196 + } 197 + res = read_mailbox(vcpu, offset, len); 198 + break; 199 + default: 200 + kvm_err("%s: unknown addr: %llx\n", __func__, addr); 201 + ret = -EINVAL; 202 + break; 203 + } 204 + *(uint64_t *)val = res; 205 + 206 + return ret; 207 + } 208 + 209 + static int loongarch_ipi_writel(struct kvm_vcpu *vcpu, gpa_t addr, int len, const void *val) 210 + { 211 + int ret = 0; 212 + uint64_t data; 213 + uint32_t offset; 214 + 215 + data = *(uint64_t *)val; 216 + 217 + offset = (uint32_t)(addr & 0x1ff); 218 + WARN_ON_ONCE(offset & (len - 1)); 219 + 220 + switch (offset) { 221 + case IOCSR_IPI_STATUS: 222 + ret = -EINVAL; 223 + break; 224 + case IOCSR_IPI_EN: 225 + spin_lock(&vcpu->arch.ipi_state.lock); 226 + vcpu->arch.ipi_state.en = data; 227 + spin_unlock(&vcpu->arch.ipi_state.lock); 228 + break; 229 + case IOCSR_IPI_SET: 230 + ret = -EINVAL; 231 + break; 232 + case IOCSR_IPI_CLEAR: 233 + /* Just clear the status of the current vcpu */ 234 + ipi_clear(vcpu, data); 235 + break; 236 + case IOCSR_IPI_BUF_20 ... 
IOCSR_IPI_BUF_38 + 7: 237 + if (offset + len > IOCSR_IPI_BUF_38 + 8) { 238 + kvm_err("%s: invalid offset or len: offset = %d, len = %d\n", 239 + __func__, offset, len); 240 + ret = -EINVAL; 241 + break; 242 + } 243 + write_mailbox(vcpu, offset, data, len); 244 + break; 245 + case IOCSR_IPI_SEND: 246 + ipi_send(vcpu->kvm, data); 247 + break; 248 + case IOCSR_MAIL_SEND: 249 + ret = mail_send(vcpu->kvm, *(uint64_t *)val); 250 + break; 251 + case IOCSR_ANY_SEND: 252 + ret = any_send(vcpu->kvm, *(uint64_t *)val); 253 + break; 254 + default: 255 + kvm_err("%s: unknown addr: %llx\n", __func__, addr); 256 + ret = -EINVAL; 257 + break; 258 + } 259 + 260 + return ret; 261 + } 262 + 263 + static int kvm_ipi_read(struct kvm_vcpu *vcpu, 264 + struct kvm_io_device *dev, 265 + gpa_t addr, int len, void *val) 266 + { 267 + int ret; 268 + struct loongarch_ipi *ipi; 269 + 270 + ipi = vcpu->kvm->arch.ipi; 271 + if (!ipi) { 272 + kvm_err("%s: ipi irqchip not valid!\n", __func__); 273 + return -EINVAL; 274 + } 275 + ipi->kvm->stat.ipi_read_exits++; 276 + ret = loongarch_ipi_readl(vcpu, addr, len, val); 277 + 278 + return ret; 279 + } 280 + 281 + static int kvm_ipi_write(struct kvm_vcpu *vcpu, 282 + struct kvm_io_device *dev, 283 + gpa_t addr, int len, const void *val) 284 + { 285 + int ret; 286 + struct loongarch_ipi *ipi; 287 + 288 + ipi = vcpu->kvm->arch.ipi; 289 + if (!ipi) { 290 + kvm_err("%s: ipi irqchip not valid!\n", __func__); 291 + return -EINVAL; 292 + } 293 + ipi->kvm->stat.ipi_write_exits++; 294 + ret = loongarch_ipi_writel(vcpu, addr, len, val); 295 + 296 + return ret; 297 + } 298 + 299 + static const struct kvm_io_device_ops kvm_ipi_ops = { 300 + .read = kvm_ipi_read, 301 + .write = kvm_ipi_write, 302 + }; 303 + 304 + static int kvm_ipi_regs_access(struct kvm_device *dev, 305 + struct kvm_device_attr *attr, 306 + bool is_write) 307 + { 308 + int len = 4; 309 + int cpu, addr; 310 + uint64_t val; 311 + void *p = NULL; 312 + struct kvm_vcpu *vcpu; 313 + 314 + cpu = 
(attr->attr >> 16) & 0x3ff; 315 + addr = attr->attr & 0xff; 316 + 317 + vcpu = kvm_get_vcpu(dev->kvm, cpu); 318 + if (unlikely(vcpu == NULL)) { 319 + kvm_err("%s: invalid target cpu: %d\n", __func__, cpu); 320 + return -EINVAL; 321 + } 322 + 323 + switch (addr) { 324 + case IOCSR_IPI_STATUS: 325 + p = &vcpu->arch.ipi_state.status; 326 + break; 327 + case IOCSR_IPI_EN: 328 + p = &vcpu->arch.ipi_state.en; 329 + break; 330 + case IOCSR_IPI_SET: 331 + p = &vcpu->arch.ipi_state.set; 332 + break; 333 + case IOCSR_IPI_CLEAR: 334 + p = &vcpu->arch.ipi_state.clear; 335 + break; 336 + case IOCSR_IPI_BUF_20: 337 + p = &vcpu->arch.ipi_state.buf[0]; 338 + len = 8; 339 + break; 340 + case IOCSR_IPI_BUF_28: 341 + p = &vcpu->arch.ipi_state.buf[1]; 342 + len = 8; 343 + break; 344 + case IOCSR_IPI_BUF_30: 345 + p = &vcpu->arch.ipi_state.buf[2]; 346 + len = 8; 347 + break; 348 + case IOCSR_IPI_BUF_38: 349 + p = &vcpu->arch.ipi_state.buf[3]; 350 + len = 8; 351 + break; 352 + default: 353 + kvm_err("%s: unknown ipi register, addr = %d\n", __func__, addr); 354 + return -EINVAL; 355 + } 356 + 357 + if (is_write) { 358 + if (len == 4) { 359 + if (get_user(val, (uint32_t __user *)attr->addr)) 360 + return -EFAULT; 361 + *(uint32_t *)p = (uint32_t)val; 362 + } else if (len == 8) { 363 + if (get_user(val, (uint64_t __user *)attr->addr)) 364 + return -EFAULT; 365 + *(uint64_t *)p = val; 366 + } 367 + } else { 368 + if (len == 4) { 369 + val = *(uint32_t *)p; 370 + return put_user(val, (uint32_t __user *)attr->addr); 371 + } else if (len == 8) { 372 + val = *(uint64_t *)p; 373 + return put_user(val, (uint64_t __user *)attr->addr); 374 + } 375 + } 376 + 377 + return 0; 378 + } 379 + 380 + static int kvm_ipi_get_attr(struct kvm_device *dev, 381 + struct kvm_device_attr *attr) 382 + { 383 + switch (attr->group) { 384 + case KVM_DEV_LOONGARCH_IPI_GRP_REGS: 385 + return kvm_ipi_regs_access(dev, attr, false); 386 + default: 387 + kvm_err("%s: unknown group (%d)\n", __func__, attr->group); 388 + 
return -EINVAL; 389 + } 390 + } 391 + 392 + static int kvm_ipi_set_attr(struct kvm_device *dev, 393 + struct kvm_device_attr *attr) 394 + { 395 + switch (attr->group) { 396 + case KVM_DEV_LOONGARCH_IPI_GRP_REGS: 397 + return kvm_ipi_regs_access(dev, attr, true); 398 + default: 399 + kvm_err("%s: unknown group (%d)\n", __func__, attr->group); 400 + return -EINVAL; 401 + } 402 + } 403 + 404 + static int kvm_ipi_create(struct kvm_device *dev, u32 type) 405 + { 406 + int ret; 407 + struct kvm *kvm; 408 + struct kvm_io_device *device; 409 + struct loongarch_ipi *s; 410 + 411 + if (!dev) { 412 + kvm_err("%s: kvm_device ptr is invalid!\n", __func__); 413 + return -EINVAL; 414 + } 415 + 416 + kvm = dev->kvm; 417 + if (kvm->arch.ipi) { 418 + kvm_err("%s: LoongArch IPI has already been created!\n", __func__); 419 + return -EINVAL; 420 + } 421 + 422 + s = kzalloc(sizeof(struct loongarch_ipi), GFP_KERNEL); 423 + if (!s) 424 + return -ENOMEM; 425 + 426 + spin_lock_init(&s->lock); 427 + s->kvm = kvm; 428 + 429 + /* 430 + * Initialize IOCSR device 431 + */ 432 + device = &s->device; 433 + kvm_iodevice_init(device, &kvm_ipi_ops); 434 + mutex_lock(&kvm->slots_lock); 435 + ret = kvm_io_bus_register_dev(kvm, KVM_IOCSR_BUS, IOCSR_IPI_BASE, IOCSR_IPI_SIZE, device); 436 + mutex_unlock(&kvm->slots_lock); 437 + if (ret < 0) { 438 + kvm_err("%s: Initialize IOCSR dev failed, ret = %d\n", __func__, ret); 439 + goto err; 440 + } 441 + 442 + kvm->arch.ipi = s; 443 + return 0; 444 + 445 + err: 446 + kfree(s); 447 + return -EFAULT; 448 + } 449 + 450 + static void kvm_ipi_destroy(struct kvm_device *dev) 451 + { 452 + struct kvm *kvm; 453 + struct loongarch_ipi *ipi; 454 + 455 + if (!dev || !dev->kvm || !dev->kvm->arch.ipi) 456 + return; 457 + 458 + kvm = dev->kvm; 459 + ipi = kvm->arch.ipi; 460 + kvm_io_bus_unregister_dev(kvm, KVM_IOCSR_BUS, &ipi->device); 461 + kfree(ipi); 462 + } 463 + 464 + static struct kvm_device_ops kvm_ipi_dev_ops = { 465 + .name = "kvm-loongarch-ipi", 466 + .create = 
kvm_ipi_create, 467 + .destroy = kvm_ipi_destroy, 468 + .set_attr = kvm_ipi_set_attr, 469 + .get_attr = kvm_ipi_get_attr, 470 + }; 471 + 472 + int kvm_loongarch_register_ipi_device(void) 473 + { 474 + return kvm_register_device_ops(&kvm_ipi_dev_ops, KVM_DEV_TYPE_LOONGARCH_IPI); 475 + }
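Reviewer note on `send_ipi_data()` in ipi.c above: bits 27-30 of the 64-bit request select bytes of the destination word that keep their old contents, and the upper 32 bits carry the new payload for the remaining bytes. The sketch below reproduces that merge in plain C so it can be checked outside the kernel. The helper names `ipi_byte_mask`, `ipi_merge` and `ipi_target_cpu` are hypothetical and only illustrate the bit layout read off the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Expand request bits 27-30 into a per-byte mask (one 0xff per set bit). */
static uint32_t ipi_byte_mask(uint64_t data)
{
	uint32_t mask = 0;

	for (int i = 0; i < 4; i++) {
		if (data & (UINT64_C(1) << (27 + i)))
			mask |= UINT32_C(0xff) << (i * 8);
	}
	return mask;
}

/*
 * Merge the payload in data[63:32] into `old`: bytes selected by the
 * mask keep their old value, all other bytes take the new payload.
 */
static uint32_t ipi_merge(uint32_t old, uint64_t data)
{
	uint32_t mask = ipi_byte_mask(data);

	return (old & mask) | ((uint32_t)(data >> 32) & ~mask);
}

/* The target cpu id sits in bits 16-25 of the request. */
static unsigned int ipi_target_cpu(uint64_t data)
{
	return (unsigned int)((data >> 16) & 0x3ff);
}
```

With an all-zero mask field the old value is never consulted, which is why the code in the diff skips the `kvm_io_bus_read()` in that case.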

+519

arch/loongarch/kvm/intc/pch_pic.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2024 Loongson Technology Corporation Limited 4 + */ 5 + 6 + #include <asm/kvm_eiointc.h> 7 + #include <asm/kvm_pch_pic.h> 8 + #include <asm/kvm_vcpu.h> 9 + #include <linux/count_zeros.h> 10 + 11 + /* update the isr according to irq level and route irq to eiointc */ 12 + static void pch_pic_update_irq(struct loongarch_pch_pic *s, int irq, int level) 13 + { 14 + u64 mask = BIT(irq); 15 + 16 + /* 17 + * set isr and route irq to eiointc and 18 + * the route table is in htmsi_vector[] 19 + */ 20 + if (level) { 21 + if (mask & s->irr & ~s->mask) { 22 + s->isr |= mask; 23 + irq = s->htmsi_vector[irq]; 24 + eiointc_set_irq(s->kvm->arch.eiointc, irq, level); 25 + } 26 + } else { 27 + if (mask & s->isr & ~s->irr) { 28 + s->isr &= ~mask; 29 + irq = s->htmsi_vector[irq]; 30 + eiointc_set_irq(s->kvm->arch.eiointc, irq, level); 31 + } 32 + } 33 + } 34 + 35 + /* update batch irqs, the irq_mask is a bitmap of irqs */ 36 + static void pch_pic_update_batch_irqs(struct loongarch_pch_pic *s, u64 irq_mask, int level) 37 + { 38 + int irq, bits; 39 + 40 + /* find each irq by irqs bitmap and update each irq */ 41 + bits = sizeof(irq_mask) * 8; 42 + irq = find_first_bit((void *)&irq_mask, bits); 43 + while (irq < bits) { 44 + pch_pic_update_irq(s, irq, level); 45 + bitmap_clear((void *)&irq_mask, irq, 1); 46 + irq = find_first_bit((void *)&irq_mask, bits); 47 + } 48 + } 49 + 50 + /* called when a irq is triggered in pch pic */ 51 + void pch_pic_set_irq(struct loongarch_pch_pic *s, int irq, int level) 52 + { 53 + u64 mask = BIT(irq); 54 + 55 + spin_lock(&s->lock); 56 + if (level) 57 + s->irr |= mask; /* set irr */ 58 + else { 59 + /* 60 + * In edge triggered mode, 0 does not mean to clear irq 61 + * The irr register variable is cleared when cpu writes to the 62 + * PCH_PIC_CLEAR_START address area 63 + */ 64 + if (s->edge & mask) { 65 + spin_unlock(&s->lock); 66 + return; 67 + } 68 + s->irr &= ~mask; 69 + } 70 + 
pch_pic_update_irq(s, irq, level); 71 + spin_unlock(&s->lock); 72 + } 73 + 74 + /* msi irq handler */ 75 + void pch_msi_set_irq(struct kvm *kvm, int irq, int level) 76 + { 77 + eiointc_set_irq(kvm->arch.eiointc, irq, level); 78 + } 79 + 80 + /* 81 + * pch pic register is 64-bit, but it is accessed by 32-bit, 82 + * so we use high to get whether low or high 32 bits we want 83 + * to read. 84 + */ 85 + static u32 pch_pic_read_reg(u64 *s, int high) 86 + { 87 + u64 val = *s; 88 + 89 + /* read the high 32 bits when high is 1 */ 90 + return high ? (u32)(val >> 32) : (u32)val; 91 + } 92 + 93 + /* 94 + * pch pic register is 64-bit, but it is accessed by 32-bit, 95 + * so we use high to get whether low or high 32 bits we want 96 + * to write. 97 + */ 98 + static u32 pch_pic_write_reg(u64 *s, int high, u32 v) 99 + { 100 + u64 val = *s, data = v; 101 + 102 + if (high) { 103 + /* 104 + * Clear val high 32 bits 105 + * Write the high 32 bits when the high is 1 106 + */ 107 + *s = (val << 32 >> 32) | (data << 32); 108 + val >>= 32; 109 + } else 110 + /* 111 + * Clear val low 32 bits 112 + * Write the low 32 bits when the high is 0 113 + */ 114 + *s = (val >> 32 << 32) | v; 115 + 116 + return (u32)val; 117 + } 118 + 119 + static int loongarch_pch_pic_read(struct loongarch_pch_pic *s, gpa_t addr, int len, void *val) 120 + { 121 + int offset, index, ret = 0; 122 + u32 data = 0; 123 + u64 int_id = 0; 124 + 125 + offset = addr - s->pch_pic_base; 126 + 127 + spin_lock(&s->lock); 128 + switch (offset) { 129 + case PCH_PIC_INT_ID_START ... PCH_PIC_INT_ID_END: 130 + /* int id version */ 131 + int_id |= (u64)PCH_PIC_INT_ID_VER << 32; 132 + /* irq number */ 133 + int_id |= (u64)31 << (32 + 16); 134 + /* int id value */ 135 + int_id |= PCH_PIC_INT_ID_VAL; 136 + *(u64 *)val = int_id; 137 + break; 138 + case PCH_PIC_MASK_START ... 
PCH_PIC_MASK_END: 139 + offset -= PCH_PIC_MASK_START; 140 + index = offset >> 2; 141 + /* read mask reg */ 142 + data = pch_pic_read_reg(&s->mask, index); 143 + *(u32 *)val = data; 144 + break; 145 + case PCH_PIC_HTMSI_EN_START ... PCH_PIC_HTMSI_EN_END: 146 + offset -= PCH_PIC_HTMSI_EN_START; 147 + index = offset >> 2; 148 + /* read htmsi enable reg */ 149 + data = pch_pic_read_reg(&s->htmsi_en, index); 150 + *(u32 *)val = data; 151 + break; 152 + case PCH_PIC_EDGE_START ... PCH_PIC_EDGE_END: 153 + offset -= PCH_PIC_EDGE_START; 154 + index = offset >> 2; 155 + /* read edge enable reg */ 156 + data = pch_pic_read_reg(&s->edge, index); 157 + *(u32 *)val = data; 158 + break; 159 + case PCH_PIC_AUTO_CTRL0_START ... PCH_PIC_AUTO_CTRL0_END: 160 + case PCH_PIC_AUTO_CTRL1_START ... PCH_PIC_AUTO_CTRL1_END: 161 + /* we only use default mode: fixed interrupt distribution mode */ 162 + *(u32 *)val = 0; 163 + break; 164 + case PCH_PIC_ROUTE_ENTRY_START ... PCH_PIC_ROUTE_ENTRY_END: 165 + /* only route to int0: eiointc */ 166 + *(u8 *)val = 1; 167 + break; 168 + case PCH_PIC_HTMSI_VEC_START ... PCH_PIC_HTMSI_VEC_END: 169 + offset -= PCH_PIC_HTMSI_VEC_START; 170 + /* read htmsi vector */ 171 + data = s->htmsi_vector[offset]; 172 + *(u8 *)val = data; 173 + break; 174 + case PCH_PIC_POLARITY_START ... 
PCH_PIC_POLARITY_END: 175 + /* we only use default value 0: high level triggered */ 176 + *(u32 *)val = 0; 177 + break; 178 + default: 179 + ret = -EINVAL; 180 + } 181 + spin_unlock(&s->lock); 182 + 183 + return ret; 184 + } 185 + 186 + static int kvm_pch_pic_read(struct kvm_vcpu *vcpu, 187 + struct kvm_io_device *dev, 188 + gpa_t addr, int len, void *val) 189 + { 190 + int ret; 191 + struct loongarch_pch_pic *s = vcpu->kvm->arch.pch_pic; 192 + 193 + if (!s) { 194 + kvm_err("%s: pch pic irqchip not valid!\n", __func__); 195 + return -EINVAL; 196 + } 197 + 198 + /* statistics of pch pic reading */ 199 + vcpu->kvm->stat.pch_pic_read_exits++; 200 + ret = loongarch_pch_pic_read(s, addr, len, val); 201 + 202 + return ret; 203 + } 204 + 205 + static int loongarch_pch_pic_write(struct loongarch_pch_pic *s, gpa_t addr, 206 + int len, const void *val) 207 + { 208 + int ret; 209 + u32 old, data, offset, index; 210 + u64 irq; 211 + 212 + ret = 0; 213 + data = *(u32 *)val; 214 + offset = addr - s->pch_pic_base; 215 + 216 + spin_lock(&s->lock); 217 + switch (offset) { 218 + case PCH_PIC_MASK_START ... PCH_PIC_MASK_END: 219 + offset -= PCH_PIC_MASK_START; 220 + /* get whether high or low 32 bits we want to write */ 221 + index = offset >> 2; 222 + old = pch_pic_write_reg(&s->mask, index, data); 223 + /* enable irq when mask value changes to 0 */ 224 + irq = (old & ~data) << (32 * index); 225 + pch_pic_update_batch_irqs(s, irq, 1); 226 + /* disable irq when mask value changes to 1 */ 227 + irq = (~old & data) << (32 * index); 228 + pch_pic_update_batch_irqs(s, irq, 0); 229 + break; 230 + case PCH_PIC_HTMSI_EN_START ... PCH_PIC_HTMSI_EN_END: 231 + offset -= PCH_PIC_HTMSI_EN_START; 232 + index = offset >> 2; 233 + pch_pic_write_reg(&s->htmsi_en, index, data); 234 + break; 235 + case PCH_PIC_EDGE_START ... 
PCH_PIC_EDGE_END: 236 + offset -= PCH_PIC_EDGE_START; 237 + index = offset >> 2; 238 + /* 1: edge triggered, 0: level triggered */ 239 + pch_pic_write_reg(&s->edge, index, data); 240 + break; 241 + case PCH_PIC_CLEAR_START ... PCH_PIC_CLEAR_END: 242 + offset -= PCH_PIC_CLEAR_START; 243 + index = offset >> 2; 244 + /* write 1 to clear edge irq */ 245 + old = pch_pic_read_reg(&s->irr, index); 246 + /* 247 + * get the irq bitmap which is edge triggered and 248 + * already set and to be cleared 249 + */ 250 + irq = old & pch_pic_read_reg(&s->edge, index) & data; 251 + /* write irr to the new state where irqs have been cleared */ 252 + pch_pic_write_reg(&s->irr, index, old & ~irq); 253 + /* update cleared irqs */ 254 + pch_pic_update_batch_irqs(s, irq, 0); 255 + break; 256 + case PCH_PIC_AUTO_CTRL0_START ... PCH_PIC_AUTO_CTRL0_END: 257 + offset -= PCH_PIC_AUTO_CTRL0_START; 258 + index = offset >> 2; 259 + /* we only use default mode: fixed interrupt distribution mode */ 260 + pch_pic_write_reg(&s->auto_ctrl0, index, 0); 261 + break; 262 + case PCH_PIC_AUTO_CTRL1_START ... PCH_PIC_AUTO_CTRL1_END: 263 + offset -= PCH_PIC_AUTO_CTRL1_START; 264 + index = offset >> 2; 265 + /* we only use default mode: fixed interrupt distribution mode */ 266 + pch_pic_write_reg(&s->auto_ctrl1, index, 0); 267 + break; 268 + case PCH_PIC_ROUTE_ENTRY_START ... PCH_PIC_ROUTE_ENTRY_END: 269 + offset -= PCH_PIC_ROUTE_ENTRY_START; 270 + /* only route to int0: eiointc */ 271 + s->route_entry[offset] = 1; 272 + break; 273 + case PCH_PIC_HTMSI_VEC_START ... PCH_PIC_HTMSI_VEC_END: 274 + /* route table to eiointc */ 275 + offset -= PCH_PIC_HTMSI_VEC_START; 276 + s->htmsi_vector[offset] = (u8)data; 277 + break; 278 + case PCH_PIC_POLARITY_START ... 
PCH_PIC_POLARITY_END: 279 + offset -= PCH_PIC_POLARITY_START; 280 + index = offset >> 2; 281 + /* we only use default value 0: high level triggered */ 282 + pch_pic_write_reg(&s->polarity, index, 0); 283 + break; 284 + default: 285 + ret = -EINVAL; 286 + break; 287 + } 288 + spin_unlock(&s->lock); 289 + 290 + return ret; 291 + } 292 + 293 + static int kvm_pch_pic_write(struct kvm_vcpu *vcpu, 294 + struct kvm_io_device *dev, 295 + gpa_t addr, int len, const void *val) 296 + { 297 + int ret; 298 + struct loongarch_pch_pic *s = vcpu->kvm->arch.pch_pic; 299 + 300 + if (!s) { 301 + kvm_err("%s: pch pic irqchip not valid!\n", __func__); 302 + return -EINVAL; 303 + } 304 + 305 + /* statistics of pch pic writing */ 306 + vcpu->kvm->stat.pch_pic_write_exits++; 307 + ret = loongarch_pch_pic_write(s, addr, len, val); 308 + 309 + return ret; 310 + } 311 + 312 + static const struct kvm_io_device_ops kvm_pch_pic_ops = { 313 + .read = kvm_pch_pic_read, 314 + .write = kvm_pch_pic_write, 315 + }; 316 + 317 + static int kvm_pch_pic_init(struct kvm_device *dev, u64 addr) 318 + { 319 + int ret; 320 + struct kvm *kvm = dev->kvm; 321 + struct kvm_io_device *device; 322 + struct loongarch_pch_pic *s = dev->kvm->arch.pch_pic; 323 + 324 + s->pch_pic_base = addr; 325 + device = &s->device; 326 + /* init device by pch pic writing and reading ops */ 327 + kvm_iodevice_init(device, &kvm_pch_pic_ops); 328 + mutex_lock(&kvm->slots_lock); 329 + /* register pch pic device */ 330 + ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, addr, PCH_PIC_SIZE, device); 331 + mutex_unlock(&kvm->slots_lock); 332 + 333 + return (ret < 0) ? 
-EFAULT : 0; 334 + } 335 + 336 + /* used by user space to get or set pch pic registers */ 337 + static int kvm_pch_pic_regs_access(struct kvm_device *dev, 338 + struct kvm_device_attr *attr, 339 + bool is_write) 340 + { 341 + int addr, offset, len = 8, ret = 0; 342 + void __user *data; 343 + void *p = NULL; 344 + struct loongarch_pch_pic *s; 345 + 346 + s = dev->kvm->arch.pch_pic; 347 + addr = attr->attr; 348 + data = (void __user *)attr->addr; 349 + 350 + /* get pointer to pch pic register by addr */ 351 + switch (addr) { 352 + case PCH_PIC_MASK_START: 353 + p = &s->mask; 354 + break; 355 + case PCH_PIC_HTMSI_EN_START: 356 + p = &s->htmsi_en; 357 + break; 358 + case PCH_PIC_EDGE_START: 359 + p = &s->edge; 360 + break; 361 + case PCH_PIC_AUTO_CTRL0_START: 362 + p = &s->auto_ctrl0; 363 + break; 364 + case PCH_PIC_AUTO_CTRL1_START: 365 + p = &s->auto_ctrl1; 366 + break; 367 + case PCH_PIC_ROUTE_ENTRY_START ... PCH_PIC_ROUTE_ENTRY_END: 368 + offset = addr - PCH_PIC_ROUTE_ENTRY_START; 369 + p = &s->route_entry[offset]; 370 + len = 1; 371 + break; 372 + case PCH_PIC_HTMSI_VEC_START ... 
PCH_PIC_HTMSI_VEC_END: 373 + offset = addr - PCH_PIC_HTMSI_VEC_START; 374 + p = &s->htmsi_vector[offset]; 375 + len = 1; 376 + break; 377 + case PCH_PIC_INT_IRR_START: 378 + p = &s->irr; 379 + break; 380 + case PCH_PIC_INT_ISR_START: 381 + p = &s->isr; 382 + break; 383 + case PCH_PIC_POLARITY_START: 384 + p = &s->polarity; 385 + break; 386 + default: 387 + return -EINVAL; 388 + } 389 + 390 + spin_lock(&s->lock); 391 + /* write or read value according to is_write */ 392 + if (is_write) { 393 + if (copy_from_user(p, data, len)) 394 + ret = -EFAULT; 395 + } else { 396 + if (copy_to_user(data, p, len)) 397 + ret = -EFAULT; 398 + } 399 + spin_unlock(&s->lock); 400 + 401 + return ret; 402 + } 403 + 404 + static int kvm_pch_pic_get_attr(struct kvm_device *dev, 405 + struct kvm_device_attr *attr) 406 + { 407 + switch (attr->group) { 408 + case KVM_DEV_LOONGARCH_PCH_PIC_GRP_REGS: 409 + return kvm_pch_pic_regs_access(dev, attr, false); 410 + default: 411 + return -EINVAL; 412 + } 413 + } 414 + 415 + static int kvm_pch_pic_set_attr(struct kvm_device *dev, 416 + struct kvm_device_attr *attr) 417 + { 418 + u64 addr; 419 + void __user *uaddr = (void __user *)(long)attr->addr; 420 + 421 + switch (attr->group) { 422 + case KVM_DEV_LOONGARCH_PCH_PIC_GRP_CTRL: 423 + switch (attr->attr) { 424 + case KVM_DEV_LOONGARCH_PCH_PIC_CTRL_INIT: 425 + if (copy_from_user(&addr, uaddr, sizeof(addr))) 426 + return -EFAULT; 427 + 428 + if (!dev->kvm->arch.pch_pic) { 429 + kvm_err("%s: please create pch_pic irqchip first!\n", __func__); 430 + return -ENODEV; 431 + } 432 + 433 + return kvm_pch_pic_init(dev, addr); 434 + default: 435 + kvm_err("%s: unknown group (%d) attr (%lld)\n", __func__, attr->group, 436 + attr->attr); 437 + return -EINVAL; 438 + } 439 + case KVM_DEV_LOONGARCH_PCH_PIC_GRP_REGS: 440 + return kvm_pch_pic_regs_access(dev, attr, true); 441 + default: 442 + return -EINVAL; 443 + } 444 + } 445 + 446 + static int kvm_setup_default_irq_routing(struct kvm *kvm) 447 + { 448 + int i, ret; 
449 + u32 nr = KVM_IRQCHIP_NUM_PINS; 450 + struct kvm_irq_routing_entry *entries; 451 + 452 + entries = kcalloc(nr, sizeof(*entries), GFP_KERNEL); 453 + if (!entries) 454 + return -ENOMEM; 455 + 456 + for (i = 0; i < nr; i++) { 457 + entries[i].gsi = i; 458 + entries[i].type = KVM_IRQ_ROUTING_IRQCHIP; 459 + entries[i].u.irqchip.irqchip = 0; 460 + entries[i].u.irqchip.pin = i; 461 + } 462 + ret = kvm_set_irq_routing(kvm, entries, nr, 0); 463 + kfree(entries); 464 + 465 + return ret; 466 + } 467 + 468 + static int kvm_pch_pic_create(struct kvm_device *dev, u32 type) 469 + { 470 + int ret; 471 + struct kvm *kvm = dev->kvm; 472 + struct loongarch_pch_pic *s; 473 + 474 + /* pch pic should not has been created */ 475 + if (kvm->arch.pch_pic) 476 + return -EINVAL; 477 + 478 + ret = kvm_setup_default_irq_routing(kvm); 479 + if (ret) 480 + return -ENOMEM; 481 + 482 + s = kzalloc(sizeof(struct loongarch_pch_pic), GFP_KERNEL); 483 + if (!s) 484 + return -ENOMEM; 485 + 486 + spin_lock_init(&s->lock); 487 + s->kvm = kvm; 488 + kvm->arch.pch_pic = s; 489 + 490 + return 0; 491 + } 492 + 493 + static void kvm_pch_pic_destroy(struct kvm_device *dev) 494 + { 495 + struct kvm *kvm; 496 + struct loongarch_pch_pic *s; 497 + 498 + if (!dev || !dev->kvm || !dev->kvm->arch.pch_pic) 499 + return; 500 + 501 + kvm = dev->kvm; 502 + s = kvm->arch.pch_pic; 503 + /* unregister pch pic device and free it's memory */ 504 + kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &s->device); 505 + kfree(s); 506 + } 507 + 508 + static struct kvm_device_ops kvm_pch_pic_dev_ops = { 509 + .name = "kvm-loongarch-pch-pic", 510 + .create = kvm_pch_pic_create, 511 + .destroy = kvm_pch_pic_destroy, 512 + .set_attr = kvm_pch_pic_set_attr, 513 + .get_attr = kvm_pch_pic_get_attr, 514 + }; 515 + 516 + int kvm_loongarch_register_pch_pic_device(void) 517 + { 518 + return kvm_register_device_ops(&kvm_pch_pic_dev_ops, KVM_DEV_TYPE_LOONGARCH_PCHPIC); 519 + }
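
The PCH_PIC_CLEAR case in the write handler above is a write-1-to-clear operation restricted to edge-triggered lines: a bit is cleared only if it is pending, configured as edge triggered, and set in the written value. A minimal standalone sketch of that bit arithmetic follows. It assumes a flat 64-bit register model rather than the kernel's indexed 32-bit accesses; `struct pic_model` and `pic_model_clear()` are hypothetical names for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified register model; not the kernel's struct. */
struct pic_model {
	uint64_t irr;	/* pending interrupt requests */
	uint64_t edge;	/* 1: edge triggered, 0: level triggered */
};

/*
 * Write-1-to-clear: only bits that are pending, edge triggered and set in
 * the written value are cleared. Returns the bitmap that was cleared, which
 * the real handler would pass on to lower the corresponding irqs.
 */
static uint64_t pic_model_clear(struct pic_model *s, uint64_t data)
{
	uint64_t cleared;

	assert(s != NULL);
	cleared = s->irr & s->edge & data;
	s->irr &= ~cleared;

	return cleared;
}
```

Level-triggered bits are left untouched even when the guest writes 1 to them, which matches the `old & edge & data` mask in the hunk.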

+89

arch/loongarch/kvm/irqfd.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2024 Loongson Technology Corporation Limited 4 + */ 5 + 6 + #include <linux/kvm_host.h> 7 + #include <trace/events/kvm.h> 8 + #include <asm/kvm_pch_pic.h> 9 + 10 + static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e, 11 + struct kvm *kvm, int irq_source_id, int level, bool line_status) 12 + { 13 + /* PCH-PIC pin (0 ~ 64) <---> GSI (0 ~ 64) */ 14 + pch_pic_set_irq(kvm->arch.pch_pic, e->irqchip.pin, level); 15 + 16 + return 0; 17 + } 18 + 19 + /* 20 + * kvm_set_msi: inject the MSI corresponding to the 21 + * MSI routing entry 22 + * 23 + * This is the entry point for irqfd MSI injection 24 + * and userspace MSI injection. 25 + */ 26 + int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e, 27 + struct kvm *kvm, int irq_source_id, int level, bool line_status) 28 + { 29 + if (!level) 30 + return -1; 31 + 32 + pch_msi_set_irq(kvm, e->msi.data, level); 33 + 34 + return 0; 35 + } 36 + 37 + /* 38 + * kvm_set_routing_entry: populate a kvm routing entry 39 + * from a user routing entry 40 + * 41 + * @kvm: the VM this entry is applied to 42 + * @e: kvm kernel routing entry handle 43 + * @ue: user api routing entry handle 44 + * return 0 on success, -EINVAL on errors. 
45 + */ 46 + int kvm_set_routing_entry(struct kvm *kvm, 47 + struct kvm_kernel_irq_routing_entry *e, 48 + const struct kvm_irq_routing_entry *ue) 49 + { 50 + switch (ue->type) { 51 + case KVM_IRQ_ROUTING_IRQCHIP: 52 + e->set = kvm_set_pic_irq; 53 + e->irqchip.irqchip = ue->u.irqchip.irqchip; 54 + e->irqchip.pin = ue->u.irqchip.pin; 55 + 56 + if (e->irqchip.pin >= KVM_IRQCHIP_NUM_PINS) 57 + return -EINVAL; 58 + 59 + return 0; 60 + case KVM_IRQ_ROUTING_MSI: 61 + e->set = kvm_set_msi; 62 + e->msi.address_lo = ue->u.msi.address_lo; 63 + e->msi.address_hi = ue->u.msi.address_hi; 64 + e->msi.data = ue->u.msi.data; 65 + return 0; 66 + default: 67 + return -EINVAL; 68 + } 69 + } 70 + 71 + int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, 72 + struct kvm *kvm, int irq_source_id, int level, bool line_status) 73 + { 74 + switch (e->type) { 75 + case KVM_IRQ_ROUTING_IRQCHIP: 76 + pch_pic_set_irq(kvm->arch.pch_pic, e->irqchip.pin, level); 77 + return 0; 78 + case KVM_IRQ_ROUTING_MSI: 79 + pch_msi_set_irq(kvm, e->msi.data, level); 80 + return 0; 81 + default: 82 + return -EWOULDBLOCK; 83 + } 84 + } 85 + 86 + bool kvm_arch_intc_initialized(struct kvm *kvm) 87 + { 88 + return kvm_arch_irqchip_in_kernel(kvm); 89 + }
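
Taken together, the default routing table set up in pch_pic.c and the check in kvm_set_routing_entry() amount to an identity GSI-to-pin map with a bounds check on the pin. The sketch below models only that behaviour in plain C. `MODEL_NUM_PINS` is an assumed stand-in for KVM_IRQCHIP_NUM_PINS (its real value is not shown in this diff), and the `model_*` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for KVM_IRQCHIP_NUM_PINS; the real value is not shown here. */
#define MODEL_NUM_PINS 64u

/* Hypothetical, reduced routing entry: one GSI mapped to one irqchip pin. */
struct model_route {
	uint32_t gsi;
	uint32_t pin;
};

/* Fill the table so that GSI i is routed to pin i, as the default setup does. */
static void model_default_routing(struct model_route *routes, size_t nr)
{
	size_t i;

	assert(routes != NULL);
	for (i = 0; i < nr; i++) {
		routes[i].gsi = (uint32_t)i;
		routes[i].pin = (uint32_t)i;
	}
}

/* Reject pins outside the chip, mirroring the -EINVAL path for IRQCHIP routes. */
static int model_check_pin(uint32_t pin)
{
	return pin >= MODEL_NUM_PINS ? -EINVAL : 0;
}
```

With this check, valid pins are 0 through `MODEL_NUM_PINS - 1`.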

+17 -2

arch/loongarch/kvm/main.c

··· 9 9 #include <asm/cacheflush.h> 10 10 #include <asm/cpufeature.h> 11 11 #include <asm/kvm_csr.h> 12 + #include <asm/kvm_eiointc.h> 13 + #include <asm/kvm_pch_pic.h> 12 14 #include "trace.h" 13 15 14 16 unsigned long vpid_mask; ··· 315 313 316 314 static int kvm_loongarch_env_init(void) 317 315 { 318 - int cpu, order; 316 + int cpu, order, ret; 319 317 void *addr; 320 318 struct kvm_context *context; 321 319 ··· 370 368 371 369 kvm_init_gcsr_flag(); 372 370 373 - return 0; 371 + /* Register LoongArch IPI interrupt controller interface. */ 372 + ret = kvm_loongarch_register_ipi_device(); 373 + if (ret) 374 + return ret; 375 + 376 + /* Register LoongArch EIOINTC interrupt controller interface. */ 377 + ret = kvm_loongarch_register_eiointc_device(); 378 + if (ret) 379 + return ret; 380 + 381 + /* Register LoongArch PCH-PIC interrupt controller interface. */ 382 + ret = kvm_loongarch_register_pch_pic_device(); 383 + 384 + return ret; 374 385 } 375 386 376 387 static void kvm_loongarch_env_exit(void)

+12 -28

arch/loongarch/kvm/mmu.c

··· 552 552 static int kvm_map_page_fast(struct kvm_vcpu *vcpu, unsigned long gpa, bool write) 553 553 { 554 554 int ret = 0; 555 - kvm_pfn_t pfn = 0; 556 555 kvm_pte_t *ptep, changed, new; 557 556 gfn_t gfn = gpa >> PAGE_SHIFT; 558 557 struct kvm *kvm = vcpu->kvm; 559 558 struct kvm_memory_slot *slot; 560 - struct page *page; 561 559 562 560 spin_lock(&kvm->mmu_lock); 563 561 ··· 568 570 569 571 /* Track access to pages marked old */ 570 572 new = kvm_pte_mkyoung(*ptep); 571 - /* call kvm_set_pfn_accessed() after unlock */ 572 - 573 573 if (write && !kvm_pte_dirty(new)) { 574 574 if (!kvm_pte_write(new)) { 575 575 ret = -EFAULT; ··· 591 595 } 592 596 593 597 changed = new ^ (*ptep); 594 - if (changed) { 598 + if (changed) 595 599 kvm_set_pte(ptep, new); 596 - pfn = kvm_pte_pfn(new); 597 - page = kvm_pfn_to_refcounted_page(pfn); 598 - if (page) 599 - get_page(page); 600 - } 600 + 601 601 spin_unlock(&kvm->mmu_lock); 602 602 603 - if (changed) { 604 - if (kvm_pte_young(changed)) 605 - kvm_set_pfn_accessed(pfn); 603 + if (kvm_pte_dirty(changed)) 604 + mark_page_dirty(kvm, gfn); 606 605 607 - if (kvm_pte_dirty(changed)) { 608 - mark_page_dirty(kvm, gfn); 609 - kvm_set_pfn_dirty(pfn); 610 - } 611 - if (page) 612 - put_page(page); 613 - } 614 606 return ret; 615 607 out: 616 608 spin_unlock(&kvm->mmu_lock); ··· 780 796 struct kvm *kvm = vcpu->kvm; 781 797 struct kvm_memory_slot *memslot; 782 798 struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; 799 + struct page *page; 783 800 784 801 /* Try the fast path to handle old / clean pages */ 785 802 srcu_idx = srcu_read_lock(&kvm->srcu); ··· 808 823 mmu_seq = kvm->mmu_invalidate_seq; 809 824 /* 810 825 * Ensure the read of mmu_invalidate_seq isn't reordered with PTE reads in 811 - * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't 826 + * kvm_faultin_pfn() (which calls get_user_pages()), so that we don't 812 827 * risk the page we get a reference to getting unmapped before we have a 813 828 
* chance to grab the mmu_lock without mmu_invalidate_retry() noticing. 814 829 * ··· 820 835 smp_rmb(); 821 836 822 837 /* Slow path - ask KVM core whether we can access this GPA */ 823 - pfn = gfn_to_pfn_prot(kvm, gfn, write, &writeable); 838 + pfn = kvm_faultin_pfn(vcpu, gfn, write, &writeable, &page); 824 839 if (is_error_noslot_pfn(pfn)) { 825 840 err = -EFAULT; 826 841 goto out; ··· 832 847 /* 833 848 * This can happen when mappings are changed asynchronously, but 834 849 * also synchronously if a COW is triggered by 835 - * gfn_to_pfn_prot(). 850 + * kvm_faultin_pfn(). 836 851 */ 837 852 spin_unlock(&kvm->mmu_lock); 838 - kvm_release_pfn_clean(pfn); 853 + kvm_release_page_unused(page); 839 854 if (retry_no > 100) { 840 855 retry_no = 0; 841 856 schedule(); ··· 900 915 else 901 916 ++kvm->stat.pages; 902 917 kvm_set_pte(ptep, new_pte); 918 + 919 + kvm_release_faultin_page(kvm, page, false, writeable); 903 920 spin_unlock(&kvm->mmu_lock); 904 921 905 - if (prot_bits & _PAGE_DIRTY) { 922 + if (prot_bits & _PAGE_DIRTY) 906 923 mark_page_dirty_in_slot(kvm, memslot, gfn); 907 - kvm_set_pfn_dirty(pfn); 908 - } 909 924 910 - kvm_release_pfn_clean(pfn); 911 925 out: 912 926 srcu_read_unlock(&kvm->srcu, srcu_idx); 913 927 return err;
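
The fast path in kvm_map_page_fast() now relies solely on `changed = new ^ (*ptep)` to decide whether the PTE must be rewritten and whether the page became dirty. The following sketch shows that XOR-based detection in isolation. The bit positions and the `model_*` names are assumptions made for illustration, and the huge-page and dirty-logging checks of the real function are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical bit layout; the real LoongArch PTE bits differ. */
#define MODEL_PTE_YOUNG	(1UL << 0)
#define MODEL_PTE_DIRTY	(1UL << 1)
#define MODEL_PTE_WRITE	(1UL << 2)

/*
 * Mark the PTE young, and dirty on a permitted write. *became_dirty is set
 * only when the dirty bit actually flipped, which is the condition under
 * which the caller would mark the page dirty after dropping the lock.
 */
static int model_fast_fault(unsigned long *pte, int write, int *became_dirty)
{
	unsigned long new, changed;

	assert(pte != NULL && became_dirty != NULL);
	*became_dirty = 0;

	new = *pte | MODEL_PTE_YOUNG;
	if (write && !(new & MODEL_PTE_DIRTY)) {
		if (!(new & MODEL_PTE_WRITE))
			return -EFAULT;	/* fall back to the slow path */
		new |= MODEL_PTE_DIRTY;
	}

	changed = new ^ *pte;	/* bits that differ between old and new */
	if (changed)
		*pte = new;
	*became_dirty = !!(changed & MODEL_PTE_DIRTY);

	return 0;
}
```

Because `changed` only has bits set where the old and new values differ, a write to an already dirty page does not report the page as newly dirtied.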

+3

arch/loongarch/kvm/vcpu.c

···
1475 1475           /* Init */
1476 1476           vcpu->arch.last_sched_cpu = -1;
1477 1477
     1478 +         /* Init ipi_state lock */
     1479 +         spin_lock_init(&vcpu->arch.ipi_state.lock);
     1480 +
1478 1481           /*
1479 1482            * Initialize guest register state to valid architectural reset state.
1480 1483            */

+21

arch/loongarch/kvm/vm.c

··· 6 6 #include <linux/kvm_host.h> 7 7 #include <asm/kvm_mmu.h> 8 8 #include <asm/kvm_vcpu.h> 9 + #include <asm/kvm_eiointc.h> 10 + #include <asm/kvm_pch_pic.h> 9 11 10 12 const struct _kvm_stats_desc kvm_vm_stats_desc[] = { 11 13 KVM_GENERIC_VM_STATS(), ··· 78 76 int r; 79 77 80 78 switch (ext) { 79 + case KVM_CAP_IRQCHIP: 81 80 case KVM_CAP_ONE_REG: 82 81 case KVM_CAP_ENABLE_CAP: 83 82 case KVM_CAP_READONLY_MEM: ··· 164 161 struct kvm_device_attr attr; 165 162 166 163 switch (ioctl) { 164 + case KVM_CREATE_IRQCHIP: 165 + return 0; 167 166 case KVM_HAS_DEVICE_ATTR: 168 167 if (copy_from_user(&attr, argp, sizeof(attr))) 169 168 return -EFAULT; ··· 174 169 default: 175 170 return -ENOIOCTLCMD; 176 171 } 172 + } 173 + 174 + int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, bool line_status) 175 + { 176 + if (!kvm_arch_irqchip_in_kernel(kvm)) 177 + return -ENXIO; 178 + 179 + irq_event->status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 180 + irq_event->irq, irq_event->level, line_status); 181 + 182 + return 0; 183 + } 184 + 185 + bool kvm_arch_irqchip_in_kernel(struct kvm *kvm) 186 + { 187 + return (kvm->arch.ipi && kvm->arch.eiointc && kvm->arch.pch_pic); 177 188 }

+8 -18

arch/mips/kvm/mmu.c

··· 484 484 struct kvm *kvm = vcpu->kvm; 485 485 gfn_t gfn = gpa >> PAGE_SHIFT; 486 486 pte_t *ptep; 487 - kvm_pfn_t pfn = 0; /* silence bogus GCC warning */ 488 - bool pfn_valid = false; 489 487 int ret = 0; 490 488 491 489 spin_lock(&kvm->mmu_lock); ··· 496 498 } 497 499 498 500 /* Track access to pages marked old */ 499 - if (!pte_young(*ptep)) { 501 + if (!pte_young(*ptep)) 500 502 set_pte(ptep, pte_mkyoung(*ptep)); 501 - pfn = pte_pfn(*ptep); 502 - pfn_valid = true; 503 - /* call kvm_set_pfn_accessed() after unlock */ 504 - } 503 + 505 504 if (write_fault && !pte_dirty(*ptep)) { 506 505 if (!pte_write(*ptep)) { 507 506 ret = -EFAULT; ··· 507 512 508 513 /* Track dirtying of writeable pages */ 509 514 set_pte(ptep, pte_mkdirty(*ptep)); 510 - pfn = pte_pfn(*ptep); 511 515 mark_page_dirty(kvm, gfn); 512 - kvm_set_pfn_dirty(pfn); 513 516 } 514 517 515 518 if (out_entry) ··· 517 524 518 525 out: 519 526 spin_unlock(&kvm->mmu_lock); 520 - if (pfn_valid) 521 - kvm_set_pfn_accessed(pfn); 522 527 return ret; 523 528 } 524 529 ··· 557 566 bool writeable; 558 567 unsigned long prot_bits; 559 568 unsigned long mmu_seq; 569 + struct page *page; 560 570 561 571 /* Try the fast path to handle old / clean pages */ 562 572 srcu_idx = srcu_read_lock(&kvm->srcu); ··· 579 587 mmu_seq = kvm->mmu_invalidate_seq; 580 588 /* 581 589 * Ensure the read of mmu_invalidate_seq isn't reordered with PTE reads 582 - * in gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't 590 + * in kvm_faultin_pfn() (which calls get_user_pages()), so that we don't 583 591 * risk the page we get a reference to getting unmapped before we have a 584 592 * chance to grab the mmu_lock without mmu_invalidate_retry() noticing. 
585 593 * ··· 591 599 smp_rmb(); 592 600 593 601 /* Slow path - ask KVM core whether we can access this GPA */ 594 - pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writeable); 602 + pfn = kvm_faultin_pfn(vcpu, gfn, write_fault, &writeable, &page); 595 603 if (is_error_noslot_pfn(pfn)) { 596 604 err = -EFAULT; 597 605 goto out; ··· 603 611 /* 604 612 * This can happen when mappings are changed asynchronously, but 605 613 * also synchronously if a COW is triggered by 606 - * gfn_to_pfn_prot(). 614 + * kvm_faultin_pfn(). 607 615 */ 608 616 spin_unlock(&kvm->mmu_lock); 609 - kvm_release_pfn_clean(pfn); 617 + kvm_release_page_unused(page); 610 618 goto retry; 611 619 } 612 620 ··· 620 628 if (write_fault) { 621 629 prot_bits |= __WRITEABLE; 622 630 mark_page_dirty(kvm, gfn); 623 - kvm_set_pfn_dirty(pfn); 624 631 } 625 632 } 626 633 entry = pfn_pte(pfn, __pgprot(prot_bits)); ··· 633 642 if (out_buddy) 634 643 *out_buddy = *ptep_buddy(ptep); 635 644 645 + kvm_release_faultin_page(kvm, page, false, writeable); 636 646 spin_unlock(&kvm->mmu_lock); 637 - kvm_release_pfn_clean(pfn); 638 - kvm_set_pfn_accessed(pfn); 639 647 out: 640 648 srcu_read_unlock(&kvm->srcu, srcu_idx); 641 649 return err;

+2 -2

arch/powerpc/include/asm/kvm_book3s.h

···
203 203   extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
204 204                           unsigned long gpa,
205 205                           struct kvm_memory_slot *memslot,
206     -                         bool writing, bool kvm_ro,
    206 +                         bool writing,
207 207                           pte_t *inserted_pte, unsigned int *levelp);
208 208   extern int kvmppc_init_vm_radix(struct kvm *kvm);
209 209   extern void kvmppc_free_radix(struct kvm *kvm);
···
235 235   extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
236 236   extern int kvmppc_emulate_paired_single(struct kvm_vcpu *vcpu);
237 237   extern kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa,
238     -                         bool writing, bool *writable);
    238 +                         bool writing, bool *writable, struct page **page);
239 239   extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
240 240                           unsigned long *rmap, long pte_index, int realmode);
241 241   extern void kvmppc_update_dirty_map(const struct kvm_memory_slot *memslot,

+4 -3

arch/powerpc/kvm/book3s.c

··· 422 422 EXPORT_SYMBOL_GPL(kvmppc_core_prepare_to_enter); 423 423 424 424 kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing, 425 - bool *writable) 425 + bool *writable, struct page **page) 426 426 { 427 427 ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM; 428 428 gfn_t gfn = gpa >> PAGE_SHIFT; ··· 437 437 kvm_pfn_t pfn; 438 438 439 439 pfn = (kvm_pfn_t)virt_to_phys((void*)shared_page) >> PAGE_SHIFT; 440 - get_page(pfn_to_page(pfn)); 440 + *page = pfn_to_page(pfn); 441 + get_page(*page); 441 442 if (writable) 442 443 *writable = true; 443 444 return pfn; 444 445 } 445 446 446 - return gfn_to_pfn_prot(vcpu->kvm, gfn, writing, writable); 447 + return kvm_faultin_pfn(vcpu, gfn, writing, writable, page); 447 448 } 448 449 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_pfn); 449 450

+4 -3

arch/powerpc/kvm/book3s_32_mmu_host.c

··· 130 130 int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte, 131 131 bool iswrite) 132 132 { 133 + struct page *page; 133 134 kvm_pfn_t hpaddr; 134 135 u64 vpn; 135 136 u64 vsid; ··· 146 145 bool writable; 147 146 148 147 /* Get host physical address for gpa */ 149 - hpaddr = kvmppc_gpa_to_pfn(vcpu, orig_pte->raddr, iswrite, &writable); 148 + hpaddr = kvmppc_gpa_to_pfn(vcpu, orig_pte->raddr, iswrite, &writable, &page); 150 149 if (is_error_noslot_pfn(hpaddr)) { 151 150 printk(KERN_INFO "Couldn't get guest page for gpa %lx!\n", 152 151 orig_pte->raddr); ··· 233 232 234 233 pte = kvmppc_mmu_hpte_cache_next(vcpu); 235 234 if (!pte) { 236 - kvm_release_pfn_clean(hpaddr >> PAGE_SHIFT); 235 + kvm_release_page_unused(page); 237 236 r = -EAGAIN; 238 237 goto out; 239 238 } ··· 251 250 252 251 kvmppc_mmu_hpte_cache_map(vcpu, pte); 253 252 254 - kvm_release_pfn_clean(hpaddr >> PAGE_SHIFT); 253 + kvm_release_page_clean(page); 255 254 out: 256 255 return r; 257 256 }

+6 -6

arch/powerpc/kvm/book3s_64_mmu_host.c

··· 88 88 struct hpte_cache *cpte; 89 89 unsigned long gfn = orig_pte->raddr >> PAGE_SHIFT; 90 90 unsigned long pfn; 91 + struct page *page; 91 92 92 93 /* used to check for invalidations in progress */ 93 94 mmu_seq = kvm->mmu_invalidate_seq; 94 95 smp_rmb(); 95 96 96 97 /* Get host physical address for gpa */ 97 - pfn = kvmppc_gpa_to_pfn(vcpu, orig_pte->raddr, iswrite, &writable); 98 + pfn = kvmppc_gpa_to_pfn(vcpu, orig_pte->raddr, iswrite, &writable, &page); 98 99 if (is_error_noslot_pfn(pfn)) { 99 100 printk(KERN_INFO "Couldn't get guest page for gpa %lx!\n", 100 101 orig_pte->raddr); ··· 122 121 123 122 vpn = hpt_vpn(orig_pte->eaddr, map->host_vsid, MMU_SEGSIZE_256M); 124 123 125 - kvm_set_pfn_accessed(pfn); 126 124 if (!orig_pte->may_write || !writable) 127 125 rflags |= PP_RXRX; 128 - else { 126 + else 129 127 mark_page_dirty(vcpu->kvm, gfn); 130 - kvm_set_pfn_dirty(pfn); 131 - } 132 128 133 129 if (!orig_pte->may_execute) 134 130 rflags |= HPTE_R_N; ··· 200 202 } 201 203 202 204 out_unlock: 205 + /* FIXME: Don't unconditionally pass unused=false. */ 206 + kvm_release_faultin_page(kvm, page, false, 207 + orig_pte->may_write && writable); 203 208 spin_unlock(&kvm->mmu_lock); 204 - kvm_release_pfn_clean(pfn); 205 209 if (cpte) 206 210 kvmppc_mmu_hpte_cache_free(cpte); 207 211

+4 -21

arch/powerpc/kvm/book3s_64_mmu_hv.c

··· 603 603 write_ok = writing; 604 604 hva = gfn_to_hva_memslot(memslot, gfn); 605 605 606 - /* 607 - * Do a fast check first, since __gfn_to_pfn_memslot doesn't 608 - * do it with !atomic && !async, which is how we call it. 609 - * We always ask for write permission since the common case 610 - * is that the page is writable. 611 - */ 612 - if (get_user_page_fast_only(hva, FOLL_WRITE, &page)) { 613 - write_ok = true; 614 - } else { 615 - /* Call KVM generic code to do the slow-path check */ 616 - pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL, 617 - writing, &write_ok, NULL); 618 - if (is_error_noslot_pfn(pfn)) 619 - return -EFAULT; 620 - page = NULL; 621 - if (pfn_valid(pfn)) { 622 - page = pfn_to_page(pfn); 623 - if (PageReserved(page)) 624 - page = NULL; 625 - } 626 - } 606 + pfn = __kvm_faultin_pfn(memslot, gfn, writing ? FOLL_WRITE : 0, 607 + &write_ok, &page); 608 + if (is_error_noslot_pfn(pfn)) 609 + return -EFAULT; 627 610 628 611 /* 629 612 * Read the PTE from the process' radix tree and use that

+7 -28

arch/powerpc/kvm/book3s_64_mmu_radix.c

··· 821 821 int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, 822 822 unsigned long gpa, 823 823 struct kvm_memory_slot *memslot, 824 - bool writing, bool kvm_ro, 824 + bool writing, 825 825 pte_t *inserted_pte, unsigned int *levelp) 826 826 { 827 827 struct kvm *kvm = vcpu->kvm; ··· 829 829 unsigned long mmu_seq; 830 830 unsigned long hva, gfn = gpa >> PAGE_SHIFT; 831 831 bool upgrade_write = false; 832 - bool *upgrade_p = &upgrade_write; 833 832 pte_t pte, *ptep; 834 833 unsigned int shift, level; 835 834 int ret; 836 835 bool large_enable; 836 + kvm_pfn_t pfn; 837 837 838 838 /* used to check for invalidations in progress */ 839 839 mmu_seq = kvm->mmu_invalidate_seq; 840 840 smp_rmb(); 841 841 842 - /* 843 - * Do a fast check first, since __gfn_to_pfn_memslot doesn't 844 - * do it with !atomic && !async, which is how we call it. 845 - * We always ask for write permission since the common case 846 - * is that the page is writable. 847 - */ 848 842 hva = gfn_to_hva_memslot(memslot, gfn); 849 - if (!kvm_ro && get_user_page_fast_only(hva, FOLL_WRITE, &page)) { 850 - upgrade_write = true; 851 - } else { 852 - unsigned long pfn; 853 - 854 - /* Call KVM generic code to do the slow-path check */ 855 - pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL, 856 - writing, upgrade_p, NULL); 857 - if (is_error_noslot_pfn(pfn)) 858 - return -EFAULT; 859 - page = NULL; 860 - if (pfn_valid(pfn)) { 861 - page = pfn_to_page(pfn); 862 - if (PageReserved(page)) 863 - page = NULL; 864 - } 865 - } 843 + pfn = __kvm_faultin_pfn(memslot, gfn, writing ? 
FOLL_WRITE : 0, 844 + &upgrade_write, &page); 845 + if (is_error_noslot_pfn(pfn)) 846 + return -EFAULT; 866 847 867 848 /* 868 849 * Read the PTE from the process' radix tree and use that ··· 931 950 struct kvm_memory_slot *memslot; 932 951 long ret; 933 952 bool writing = !!(dsisr & DSISR_ISSTORE); 934 - bool kvm_ro = false; 935 953 936 954 /* Check for unusual errors */ 937 955 if (dsisr & DSISR_UNSUPP_MMU) { ··· 983 1003 ea, DSISR_ISSTORE | DSISR_PROTFAULT); 984 1004 return RESUME_GUEST; 985 1005 } 986 - kvm_ro = true; 987 1006 } 988 1007 989 1008 /* Failed to set the reference/change bits */ ··· 1000 1021 1001 1022 /* Try to insert a pte */ 1002 1023 ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot, writing, 1003 - kvm_ro, NULL, NULL); 1024 + NULL, NULL); 1004 1025 1005 1026 if (ret == 0 || ret == -EAGAIN) 1006 1027 ret = RESUME_GUEST;

+1 -3

arch/powerpc/kvm/book3s_hv_nested.c

··· 1535 1535 unsigned long n_gpa, gpa, gfn, perm = 0UL; 1536 1536 unsigned int shift, l1_shift, level; 1537 1537 bool writing = !!(dsisr & DSISR_ISSTORE); 1538 - bool kvm_ro = false; 1539 1538 long int ret; 1540 1539 1541 1540 if (!gp->l1_gr_to_hr) { ··· 1614 1615 ea, DSISR_ISSTORE | DSISR_PROTFAULT); 1615 1616 return RESUME_GUEST; 1616 1617 } 1617 - kvm_ro = true; 1618 1618 } 1619 1619 1620 1620 /* 2. Find the host pte for this L1 guest real address */ ··· 1635 1637 if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) { 1636 1638 /* No suitable pte found -> try to insert a mapping */ 1637 1639 ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot, 1638 - writing, kvm_ro, &pte, &level); 1640 + writing, &pte, &level); 1639 1641 if (ret == -EAGAIN) 1640 1642 return RESUME_GUEST; 1641 1643 else if (ret)

+12 -13

arch/powerpc/kvm/book3s_hv_uvmem.c

··· 879 879 { 880 880 881 881 int ret = H_PARAMETER; 882 - struct page *uvmem_page; 882 + struct page *page, *uvmem_page; 883 883 struct kvmppc_uvmem_page_pvt *pvt; 884 - unsigned long pfn; 885 884 unsigned long gfn = gpa >> page_shift; 886 885 int srcu_idx; 887 886 unsigned long uvmem_pfn; ··· 900 901 901 902 retry: 902 903 mutex_unlock(&kvm->arch.uvmem_lock); 903 - pfn = gfn_to_pfn(kvm, gfn); 904 - if (is_error_noslot_pfn(pfn)) 904 + page = gfn_to_page(kvm, gfn); 905 + if (!page) 905 906 goto out; 906 907 907 908 mutex_lock(&kvm->arch.uvmem_lock); ··· 910 911 pvt = uvmem_page->zone_device_data; 911 912 pvt->skip_page_out = true; 912 913 pvt->remove_gfn = false; /* it continues to be a valid GFN */ 913 - kvm_release_pfn_clean(pfn); 914 + kvm_release_page_unused(page); 914 915 goto retry; 915 916 } 916 917 917 - if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0, 918 + if (!uv_page_in(kvm->arch.lpid, page_to_pfn(page) << page_shift, gpa, 0, 918 919 page_shift)) { 919 920 kvmppc_gfn_shared(gfn, kvm); 920 921 ret = H_SUCCESS; 921 922 } 922 - kvm_release_pfn_clean(pfn); 923 + kvm_release_page_clean(page); 923 924 mutex_unlock(&kvm->arch.uvmem_lock); 924 925 out: 925 926 srcu_read_unlock(&kvm->srcu, srcu_idx); ··· 1082 1083 1083 1084 int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn) 1084 1085 { 1085 - unsigned long pfn; 1086 + struct page *page; 1086 1087 int ret = U_SUCCESS; 1087 1088 1088 - pfn = gfn_to_pfn(kvm, gfn); 1089 - if (is_error_noslot_pfn(pfn)) 1089 + page = gfn_to_page(kvm, gfn); 1090 + if (!page) 1090 1091 return -EFAULT; 1091 1092 1092 1093 mutex_lock(&kvm->arch.uvmem_lock); 1093 1094 if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL)) 1094 1095 goto out; 1095 1096 1096 - ret = uv_page_in(kvm->arch.lpid, pfn << PAGE_SHIFT, gfn << PAGE_SHIFT, 1097 - 0, PAGE_SHIFT); 1097 + ret = uv_page_in(kvm->arch.lpid, page_to_pfn(page) << PAGE_SHIFT, 1098 + gfn << PAGE_SHIFT, 0, PAGE_SHIFT); 1098 1099 out: 1099 - kvm_release_pfn_clean(pfn); 1100 + 
kvm_release_page_clean(page); 1100 1101 mutex_unlock(&kvm->arch.uvmem_lock); 1101 1102 return (ret == U_SUCCESS) ? RESUME_GUEST : -EFAULT; 1102 1103 }

+6 -8

arch/powerpc/kvm/book3s_pr.c

··· 639 639 */ 640 640 static void kvmppc_patch_dcbz(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte) 641 641 { 642 - struct page *hpage; 642 + struct kvm_host_map map; 643 643 u64 hpage_offset; 644 644 u32 *page; 645 - int i; 645 + int i, r; 646 646 647 - hpage = gfn_to_page(vcpu->kvm, pte->raddr >> PAGE_SHIFT); 648 - if (is_error_page(hpage)) 647 + r = kvm_vcpu_map(vcpu, pte->raddr >> PAGE_SHIFT, &map); 648 + if (r) 649 649 return; 650 650 651 651 hpage_offset = pte->raddr & ~PAGE_MASK; 652 652 hpage_offset &= ~0xFFFULL; 653 653 hpage_offset /= 4; 654 654 655 - get_page(hpage); 656 - page = kmap_atomic(hpage); 655 + page = map.hva; 657 656 658 657 /* patch dcbz into reserved instruction, so we trap */ 659 658 for (i=hpage_offset; i < hpage_offset + (HW_PAGE_SIZE / 4); i++) 660 659 if ((be32_to_cpu(page[i]) & 0xff0007ff) == INS_DCBZ) 661 660 page[i] &= cpu_to_be32(0xfffffff7); 662 661 663 - kunmap_atomic(page); 664 - put_page(hpage); 662 + kvm_vcpu_unmap(vcpu, &map); 665 663 } 666 664 667 665 static bool kvmppc_visible_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
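
The loop in kvmppc_patch_dcbz() is unchanged by this hunk, but its masking is easy to misread: it matches any dcbz encoding regardless of register operands and then clears a single opcode bit so the instruction traps. A minimal sketch on host-order words follows. It leaves out the big-endian conversions used in the kernel, and `MODEL_INS_DCBZ` is an assumed value for INS_DCBZ, which is not defined in this diff; `model_patch_dcbz()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's INS_DCBZ; the define is not part of this diff. */
#define MODEL_INS_DCBZ 0x7c0007ecu

/*
 * Patch every dcbz in a buffer of host-order instruction words. The mask
 * 0xff0007ff ignores the register operand fields; clearing bit 3 turns the
 * opcode into a reserved one. Returns the number of patched words.
 */
static size_t model_patch_dcbz(uint32_t *words, size_t count)
{
	size_t i, patched = 0;

	assert(words != NULL);
	for (i = 0; i < count; i++) {
		if ((words[i] & 0xff0007ffu) == MODEL_INS_DCBZ) {
			words[i] &= 0xfffffff7u;
			patched++;
		}
	}

	return patched;
}
```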

+1 -1

arch/powerpc/kvm/book3s_xive_native.c

···
654 654           }
655 655
656 656           page = gfn_to_page(kvm, gfn);
657     -         if (is_error_page(page)) {
    657 +         if (!page) {
658 658                   srcu_read_unlock(&kvm->srcu, srcu_idx);
659 659                   pr_err("Couldn't get queue page %llx!\n", kvm_eq.qaddr);
660 660                   return -EINVAL;

+7 -12

arch/powerpc/kvm/e500_mmu_host.c

··· 242 242 return tlbe->mas7_3 & (MAS3_SW|MAS3_UW); 243 243 } 244 244 245 - static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref, 245 + static inline bool kvmppc_e500_ref_setup(struct tlbe_ref *ref, 246 246 struct kvm_book3e_206_tlb_entry *gtlbe, 247 247 kvm_pfn_t pfn, unsigned int wimg) 248 248 { ··· 252 252 /* Use guest supplied MAS2_G and MAS2_E */ 253 253 ref->flags |= (gtlbe->mas2 & MAS2_ATTRIB_MASK) | wimg; 254 254 255 - /* Mark the page accessed */ 256 - kvm_set_pfn_accessed(pfn); 257 - 258 - if (tlbe_is_writable(gtlbe)) 259 - kvm_set_pfn_dirty(pfn); 255 + return tlbe_is_writable(gtlbe); 260 256 } 261 257 262 258 static inline void kvmppc_e500_ref_release(struct tlbe_ref *ref) ··· 322 326 { 323 327 struct kvm_memory_slot *slot; 324 328 unsigned long pfn = 0; /* silence GCC warning */ 329 + struct page *page = NULL; 325 330 unsigned long hva; 326 331 int pfnmap = 0; 327 332 int tsize = BOOK3E_PAGESZ_4K; ··· 334 337 unsigned int wimg = 0; 335 338 pgd_t *pgdir; 336 339 unsigned long flags; 340 + bool writable = false; 337 341 338 342 /* used to check for invalidations in progress */ 339 343 mmu_seq = kvm->mmu_invalidate_seq; ··· 444 446 445 447 if (likely(!pfnmap)) { 446 448 tsize_pages = 1UL << (tsize + 10 - PAGE_SHIFT); 447 - pfn = gfn_to_pfn_memslot(slot, gfn); 449 + pfn = __kvm_faultin_pfn(slot, gfn, FOLL_WRITE, NULL, &page); 448 450 if (is_error_noslot_pfn(pfn)) { 449 451 if (printk_ratelimit()) 450 452 pr_err("%s: real page not found for gfn %lx\n", ··· 488 490 goto out; 489 491 } 490 492 } 491 - kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg); 493 + writable = kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg); 492 494 493 495 kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize, 494 496 ref, gvaddr, stlbe); ··· 497 499 kvmppc_mmu_flush_icache(pfn); 498 500 499 501 out: 502 + kvm_release_faultin_page(kvm, page, !!ret, writable); 500 503 spin_unlock(&kvm->mmu_lock); 501 - 502 - /* Drop refcount on page, so that mmu notifiers can clear it */ 503 - 
kvm_release_pfn_clean(pfn); 504 - 505 504 return ret; 506 505 } 507 506
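
The context line `tsize_pages = 1UL << (tsize + 10 - PAGE_SHIFT)` converts a Book3E TLB size code into a count of host pages. The sketch below works through that arithmetic, assuming the size code means 2^tsize KiB (so a code of 2 is 4 KiB) and taking the page shift as a parameter. `model_tsize_pages()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of host pages covered by a TLB entry whose size code is tsize,
 * assuming the entry spans 2^tsize KiB, i.e. 2^(tsize + 10) bytes.
 */
static unsigned long model_tsize_pages(unsigned int tsize, unsigned int page_shift)
{
	/* The entry must be at least one host page for the shift to be valid. */
	assert(tsize + 10 >= page_shift);

	return 1UL << (tsize + 10 - page_shift);
}
```

With 4 KiB host pages (page shift 12), a 4 KiB entry covers one page and a 1 MiB entry (size code 10) covers 256 pages.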

-3

arch/powerpc/kvm/powerpc.c

···
612 612                           r = 8 | 4 | 2 | 1;
613 613                   }
614 614                   break;
615     -         case KVM_CAP_PPC_RMA:
616     -                 r = 0;
617     -                 break;
618 615           case KVM_CAP_PPC_HWRNG:
619 616                   r = kvmppc_hwrng_present();
620 617                   break;

+10

arch/riscv/include/asm/kvm_host.h

···
286 286           } sta;
287 287   };
288 288
    289 + /*
    290 +  * Returns true if a Performance Monitoring Interrupt (PMI), a.k.a. perf event,
    291 +  * arrived in guest context. For riscv, any event that arrives while a vCPU is
    292 +  * loaded is considered to be "in guest".
    293 +  */
    294 + static inline bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu)
    295 + {
    296 +         return IS_ENABLED(CONFIG_GUEST_PERF_EVENTS) && !!vcpu;
    297 + }
    298 +
289 299   static inline void kvm_arch_sync_events(struct kvm *kvm) {}
290 300
291 301   #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER 12

+245

arch/riscv/include/asm/kvm_nacl.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + /* 3 + * Copyright (c) 2024 Ventana Micro Systems Inc. 4 + */ 5 + 6 + #ifndef __KVM_NACL_H 7 + #define __KVM_NACL_H 8 + 9 + #include <linux/jump_label.h> 10 + #include <linux/percpu.h> 11 + #include <asm/byteorder.h> 12 + #include <asm/csr.h> 13 + #include <asm/sbi.h> 14 + 15 + struct kvm_vcpu_arch; 16 + 17 + DECLARE_STATIC_KEY_FALSE(kvm_riscv_nacl_available); 18 + #define kvm_riscv_nacl_available() \ 19 + static_branch_unlikely(&kvm_riscv_nacl_available) 20 + 21 + DECLARE_STATIC_KEY_FALSE(kvm_riscv_nacl_sync_csr_available); 22 + #define kvm_riscv_nacl_sync_csr_available() \ 23 + static_branch_unlikely(&kvm_riscv_nacl_sync_csr_available) 24 + 25 + DECLARE_STATIC_KEY_FALSE(kvm_riscv_nacl_sync_hfence_available); 26 + #define kvm_riscv_nacl_sync_hfence_available() \ 27 + static_branch_unlikely(&kvm_riscv_nacl_sync_hfence_available) 28 + 29 + DECLARE_STATIC_KEY_FALSE(kvm_riscv_nacl_sync_sret_available); 30 + #define kvm_riscv_nacl_sync_sret_available() \ 31 + static_branch_unlikely(&kvm_riscv_nacl_sync_sret_available) 32 + 33 + DECLARE_STATIC_KEY_FALSE(kvm_riscv_nacl_autoswap_csr_available); 34 + #define kvm_riscv_nacl_autoswap_csr_available() \ 35 + static_branch_unlikely(&kvm_riscv_nacl_autoswap_csr_available) 36 + 37 + struct kvm_riscv_nacl { 38 + void *shmem; 39 + phys_addr_t shmem_phys; 40 + }; 41 + DECLARE_PER_CPU(struct kvm_riscv_nacl, kvm_riscv_nacl); 42 + 43 + void __kvm_riscv_nacl_hfence(void *shmem, 44 + unsigned long control, 45 + unsigned long page_num, 46 + unsigned long page_count); 47 + 48 + void __kvm_riscv_nacl_switch_to(struct kvm_vcpu_arch *vcpu_arch, 49 + unsigned long sbi_ext_id, 50 + unsigned long sbi_func_id); 51 + 52 + int kvm_riscv_nacl_enable(void); 53 + 54 + void kvm_riscv_nacl_disable(void); 55 + 56 + void kvm_riscv_nacl_exit(void); 57 + 58 + int kvm_riscv_nacl_init(void); 59 + 60 + #ifdef CONFIG_32BIT 61 + #define lelong_to_cpu(__x) le32_to_cpu(__x) 62 + #define cpu_to_lelong(__x) 
cpu_to_le32(__x) 63 + #else 64 + #define lelong_to_cpu(__x) le64_to_cpu(__x) 65 + #define cpu_to_lelong(__x) cpu_to_le64(__x) 66 + #endif 67 + 68 + #define nacl_shmem() \ 69 + this_cpu_ptr(&kvm_riscv_nacl)->shmem 70 + 71 + #define nacl_scratch_read_long(__shmem, __offset) \ 72 + ({ \ 73 + unsigned long *__p = (__shmem) + \ 74 + SBI_NACL_SHMEM_SCRATCH_OFFSET + \ 75 + (__offset); \ 76 + lelong_to_cpu(*__p); \ 77 + }) 78 + 79 + #define nacl_scratch_write_long(__shmem, __offset, __val) \ 80 + do { \ 81 + unsigned long *__p = (__shmem) + \ 82 + SBI_NACL_SHMEM_SCRATCH_OFFSET + \ 83 + (__offset); \ 84 + *__p = cpu_to_lelong(__val); \ 85 + } while (0) 86 + 87 + #define nacl_scratch_write_longs(__shmem, __offset, __array, __count) \ 88 + do { \ 89 + unsigned int __i; \ 90 + unsigned long *__p = (__shmem) + \ 91 + SBI_NACL_SHMEM_SCRATCH_OFFSET + \ 92 + (__offset); \ 93 + for (__i = 0; __i < (__count); __i++) \ 94 + __p[__i] = cpu_to_lelong((__array)[__i]); \ 95 + } while (0) 96 + 97 + #define nacl_sync_hfence(__e) \ 98 + sbi_ecall(SBI_EXT_NACL, SBI_EXT_NACL_SYNC_HFENCE, \ 99 + (__e), 0, 0, 0, 0, 0) 100 + 101 + #define nacl_hfence_mkconfig(__type, __order, __vmid, __asid) \ 102 + ({ \ 103 + unsigned long __c = SBI_NACL_SHMEM_HFENCE_CONFIG_PEND; \ 104 + __c |= ((__type) & SBI_NACL_SHMEM_HFENCE_CONFIG_TYPE_MASK) \ 105 + << SBI_NACL_SHMEM_HFENCE_CONFIG_TYPE_SHIFT; \ 106 + __c |= (((__order) - SBI_NACL_SHMEM_HFENCE_ORDER_BASE) & \ 107 + SBI_NACL_SHMEM_HFENCE_CONFIG_ORDER_MASK) \ 108 + << SBI_NACL_SHMEM_HFENCE_CONFIG_ORDER_SHIFT; \ 109 + __c |= ((__vmid) & SBI_NACL_SHMEM_HFENCE_CONFIG_VMID_MASK) \ 110 + << SBI_NACL_SHMEM_HFENCE_CONFIG_VMID_SHIFT; \ 111 + __c |= ((__asid) & SBI_NACL_SHMEM_HFENCE_CONFIG_ASID_MASK); \ 112 + __c; \ 113 + }) 114 + 115 + #define nacl_hfence_mkpnum(__order, __addr) \ 116 + ((__addr) >> (__order)) 117 + 118 + #define nacl_hfence_mkpcount(__order, __size) \ 119 + ((__size) >> (__order)) 120 + 121 + #define nacl_hfence_gvma(__shmem, __gpa, __gpsz, __order) 
\ 122 + __kvm_riscv_nacl_hfence(__shmem, \ 123 + nacl_hfence_mkconfig(SBI_NACL_SHMEM_HFENCE_TYPE_GVMA, \ 124 + __order, 0, 0), \ 125 + nacl_hfence_mkpnum(__order, __gpa), \ 126 + nacl_hfence_mkpcount(__order, __gpsz)) 127 + 128 + #define nacl_hfence_gvma_all(__shmem) \ 129 + __kvm_riscv_nacl_hfence(__shmem, \ 130 + nacl_hfence_mkconfig(SBI_NACL_SHMEM_HFENCE_TYPE_GVMA_ALL, \ 131 + 0, 0, 0), 0, 0) 132 + 133 + #define nacl_hfence_gvma_vmid(__shmem, __vmid, __gpa, __gpsz, __order) \ 134 + __kvm_riscv_nacl_hfence(__shmem, \ 135 + nacl_hfence_mkconfig(SBI_NACL_SHMEM_HFENCE_TYPE_GVMA_VMID, \ 136 + __order, __vmid, 0), \ 137 + nacl_hfence_mkpnum(__order, __gpa), \ 138 + nacl_hfence_mkpcount(__order, __gpsz)) 139 + 140 + #define nacl_hfence_gvma_vmid_all(__shmem, __vmid) \ 141 + __kvm_riscv_nacl_hfence(__shmem, \ 142 + nacl_hfence_mkconfig(SBI_NACL_SHMEM_HFENCE_TYPE_GVMA_VMID_ALL, \ 143 + 0, __vmid, 0), 0, 0) 144 + 145 + #define nacl_hfence_vvma(__shmem, __vmid, __gva, __gvsz, __order) \ 146 + __kvm_riscv_nacl_hfence(__shmem, \ 147 + nacl_hfence_mkconfig(SBI_NACL_SHMEM_HFENCE_TYPE_VVMA, \ 148 + __order, __vmid, 0), \ 149 + nacl_hfence_mkpnum(__order, __gva), \ 150 + nacl_hfence_mkpcount(__order, __gvsz)) 151 + 152 + #define nacl_hfence_vvma_all(__shmem, __vmid) \ 153 + __kvm_riscv_nacl_hfence(__shmem, \ 154 + nacl_hfence_mkconfig(SBI_NACL_SHMEM_HFENCE_TYPE_VVMA_ALL, \ 155 + 0, __vmid, 0), 0, 0) 156 + 157 + #define nacl_hfence_vvma_asid(__shmem, __vmid, __asid, __gva, __gvsz, __order)\ 158 + __kvm_riscv_nacl_hfence(__shmem, \ 159 + nacl_hfence_mkconfig(SBI_NACL_SHMEM_HFENCE_TYPE_VVMA_ASID, \ 160 + __order, __vmid, __asid), \ 161 + nacl_hfence_mkpnum(__order, __gva), \ 162 + nacl_hfence_mkpcount(__order, __gvsz)) 163 + 164 + #define nacl_hfence_vvma_asid_all(__shmem, __vmid, __asid) \ 165 + __kvm_riscv_nacl_hfence(__shmem, \ 166 + nacl_hfence_mkconfig(SBI_NACL_SHMEM_HFENCE_TYPE_VVMA_ASID_ALL, \ 167 + 0, __vmid, __asid), 0, 0) 168 + 169 + #define nacl_csr_read(__shmem, __csr) 
\ 170 + ({ \ 171 + unsigned long *__a = (__shmem) + SBI_NACL_SHMEM_CSR_OFFSET; \ 172 + lelong_to_cpu(__a[SBI_NACL_SHMEM_CSR_INDEX(__csr)]); \ 173 + }) 174 + 175 + #define nacl_csr_write(__shmem, __csr, __val) \ 176 + do { \ 177 + void *__s = (__shmem); \ 178 + unsigned int __i = SBI_NACL_SHMEM_CSR_INDEX(__csr); \ 179 + unsigned long *__a = (__s) + SBI_NACL_SHMEM_CSR_OFFSET; \ 180 + u8 *__b = (__s) + SBI_NACL_SHMEM_DBITMAP_OFFSET; \ 181 + __a[__i] = cpu_to_lelong(__val); \ 182 + __b[__i >> 3] |= 1U << (__i & 0x7); \ 183 + } while (0) 184 + 185 + #define nacl_csr_swap(__shmem, __csr, __val) \ 186 + ({ \ 187 + void *__s = (__shmem); \ 188 + unsigned int __i = SBI_NACL_SHMEM_CSR_INDEX(__csr); \ 189 + unsigned long *__a = (__s) + SBI_NACL_SHMEM_CSR_OFFSET; \ 190 + u8 *__b = (__s) + SBI_NACL_SHMEM_DBITMAP_OFFSET; \ 191 + unsigned long __r = lelong_to_cpu(__a[__i]); \ 192 + __a[__i] = cpu_to_lelong(__val); \ 193 + __b[__i >> 3] |= 1U << (__i & 0x7); \ 194 + __r; \ 195 + }) 196 + 197 + #define nacl_sync_csr(__csr) \ 198 + sbi_ecall(SBI_EXT_NACL, SBI_EXT_NACL_SYNC_CSR, \ 199 + (__csr), 0, 0, 0, 0, 0) 200 + 201 + /* 202 + * Each ncsr_xyz() macro defined below has its own static-branch so every 203 + * use of ncsr_xyz() macro emits a patchable direct jump. This means multiple 204 + * back-to-back ncsr_xyz() macro usage will emit multiple patchable direct 205 + * jumps which is sub-optimal. 206 + * 207 + * Based on the above, it is recommended to avoid multiple back-to-back 208 + * ncsr_xyz() macro usage. 
209 + */ 210 + 211 + #define ncsr_read(__csr) \ 212 + ({ \ 213 + unsigned long __r; \ 214 + if (kvm_riscv_nacl_available()) \ 215 + __r = nacl_csr_read(nacl_shmem(), __csr); \ 216 + else \ 217 + __r = csr_read(__csr); \ 218 + __r; \ 219 + }) 220 + 221 + #define ncsr_write(__csr, __val) \ 222 + do { \ 223 + if (kvm_riscv_nacl_sync_csr_available()) \ 224 + nacl_csr_write(nacl_shmem(), __csr, __val); \ 225 + else \ 226 + csr_write(__csr, __val); \ 227 + } while (0) 228 + 229 + #define ncsr_swap(__csr, __val) \ 230 + ({ \ 231 + unsigned long __r; \ 232 + if (kvm_riscv_nacl_sync_csr_available()) \ 233 + __r = nacl_csr_swap(nacl_shmem(), __csr, __val); \ 234 + else \ 235 + __r = csr_swap(__csr, __val); \ 236 + __r; \ 237 + }) 238 + 239 + #define nsync_csr(__csr) \ 240 + do { \ 241 + if (kvm_riscv_nacl_sync_csr_available()) \ 242 + nacl_sync_csr(__csr); \ 243 + } while (0) 244 + 245 + #endif
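
The nacl_csr_write() and nacl_csr_swap() macros in this header store the new value in the CSR array of the shared memory and set the matching bit in the dirty bitmap. The following is a minimal userspace sketch of that bookkeeping, not the kernel code. It assumes RV64 (the offsets 0x0F80 and 0x1000 are computed by hand from the SBI_NACL_SHMEM_* definitions added to sbi.h in this merge), it skips the little-endian conversion done by cpu_to_lelong(), and every name prefixed with model_ or MODEL_ is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: RV64, so each CSR slot is 8 bytes wide. */
#define MODEL_DBITMAP_OFFSET 0x0F80u  /* SBI_NACL_SHMEM_DBITMAP_OFFSET on RV64 */
#define MODEL_CSR_OFFSET     0x1000u  /* SBI_NACL_SHMEM_CSR_OFFSET on RV64 */
#define MODEL_SHMEM_SIZE     (MODEL_CSR_OFFSET + 8u * 1024u)

/* Same bit manipulation as SBI_NACL_SHMEM_CSR_INDEX(): 12-bit CSR number to 10-bit slot. */
static inline unsigned int model_csr_index(unsigned int csr_num)
{
	return ((csr_num & 0xc00u) >> 2) | (csr_num & 0xffu);
}

/* Read a CSR slot (host byte order; the kernel macro converts from little-endian). */
static inline uint64_t model_csr_read(const uint8_t *shmem, unsigned int csr_num)
{
	uint64_t val;

	memcpy(&val, shmem + MODEL_CSR_OFFSET + 8u * model_csr_index(csr_num),
	       sizeof(val));
	return val;
}

/* Write a CSR slot and mark it dirty, mirroring nacl_csr_write(). */
static inline void model_csr_write(uint8_t *shmem, unsigned int csr_num,
				   uint64_t val)
{
	unsigned int i = model_csr_index(csr_num);

	assert(i < 1024u);
	memcpy(shmem + MODEL_CSR_OFFSET + 8u * i, &val, sizeof(val));
	shmem[MODEL_DBITMAP_OFFSET + (i >> 3)] |= (uint8_t)(1u << (i & 0x7));
}

/* Report whether the dirty bit for a CSR slot is set. */
static inline int model_csr_is_dirty(const uint8_t *shmem, unsigned int csr_num)
{
	unsigned int i = model_csr_index(csr_num);

	return (shmem[MODEL_DBITMAP_OFFSET + (i >> 3)] >> (i & 0x7)) & 1;
}
```

With a zeroed MODEL_SHMEM_SIZE buffer, writing CSR number 0x680 (hgatp) lands in slot 0x180 and sets only that slot's dirty bit. In the real flow the SBI implementation is expected to consume the dirty bits when the CSRs are synchronized.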

+3

arch/riscv/include/asm/perf_event.h

··· 8 8 #ifndef _ASM_RISCV_PERF_EVENT_H 9 9 #define _ASM_RISCV_PERF_EVENT_H 10 10 11 + #ifdef CONFIG_PERF_EVENTS 11 12 #include <linux/perf_event.h> 12 13 #define perf_arch_bpf_user_pt_regs(regs) (struct user_regs_struct *)regs 13 14 ··· 18 17 (regs)->sp = current_stack_pointer; \ 19 18 (regs)->status = SR_PP; \ 20 19 } 20 + #endif 21 + 21 22 #endif /* _ASM_RISCV_PERF_EVENT_H */

+120

arch/riscv/include/asm/sbi.h

··· 34 34 SBI_EXT_PMU = 0x504D55, 35 35 SBI_EXT_DBCN = 0x4442434E, 36 36 SBI_EXT_STA = 0x535441, 37 + SBI_EXT_NACL = 0x4E41434C, 37 38 38 39 /* Experimentals extensions must lie within this range */ 39 40 SBI_EXT_EXPERIMENTAL_START = 0x08000000, ··· 281 280 } __packed; 282 281 283 282 #define SBI_SHMEM_DISABLE -1 283 + 284 + enum sbi_ext_nacl_fid { 285 + SBI_EXT_NACL_PROBE_FEATURE = 0x0, 286 + SBI_EXT_NACL_SET_SHMEM = 0x1, 287 + SBI_EXT_NACL_SYNC_CSR = 0x2, 288 + SBI_EXT_NACL_SYNC_HFENCE = 0x3, 289 + SBI_EXT_NACL_SYNC_SRET = 0x4, 290 + }; 291 + 292 + enum sbi_ext_nacl_feature { 293 + SBI_NACL_FEAT_SYNC_CSR = 0x0, 294 + SBI_NACL_FEAT_SYNC_HFENCE = 0x1, 295 + SBI_NACL_FEAT_SYNC_SRET = 0x2, 296 + SBI_NACL_FEAT_AUTOSWAP_CSR = 0x3, 297 + }; 298 + 299 + #define SBI_NACL_SHMEM_ADDR_SHIFT 12 300 + #define SBI_NACL_SHMEM_SCRATCH_OFFSET 0x0000 301 + #define SBI_NACL_SHMEM_SCRATCH_SIZE 0x1000 302 + #define SBI_NACL_SHMEM_SRET_OFFSET 0x0000 303 + #define SBI_NACL_SHMEM_SRET_SIZE 0x0200 304 + #define SBI_NACL_SHMEM_AUTOSWAP_OFFSET (SBI_NACL_SHMEM_SRET_OFFSET + \ 305 + SBI_NACL_SHMEM_SRET_SIZE) 306 + #define SBI_NACL_SHMEM_AUTOSWAP_SIZE 0x0080 307 + #define SBI_NACL_SHMEM_UNUSED_OFFSET (SBI_NACL_SHMEM_AUTOSWAP_OFFSET + \ 308 + SBI_NACL_SHMEM_AUTOSWAP_SIZE) 309 + #define SBI_NACL_SHMEM_UNUSED_SIZE 0x0580 310 + #define SBI_NACL_SHMEM_HFENCE_OFFSET (SBI_NACL_SHMEM_UNUSED_OFFSET + \ 311 + SBI_NACL_SHMEM_UNUSED_SIZE) 312 + #define SBI_NACL_SHMEM_HFENCE_SIZE 0x0780 313 + #define SBI_NACL_SHMEM_DBITMAP_OFFSET (SBI_NACL_SHMEM_HFENCE_OFFSET + \ 314 + SBI_NACL_SHMEM_HFENCE_SIZE) 315 + #define SBI_NACL_SHMEM_DBITMAP_SIZE 0x0080 316 + #define SBI_NACL_SHMEM_CSR_OFFSET (SBI_NACL_SHMEM_DBITMAP_OFFSET + \ 317 + SBI_NACL_SHMEM_DBITMAP_SIZE) 318 + #define SBI_NACL_SHMEM_CSR_SIZE ((__riscv_xlen / 8) * 1024) 319 + #define SBI_NACL_SHMEM_SIZE (SBI_NACL_SHMEM_CSR_OFFSET + \ 320 + SBI_NACL_SHMEM_CSR_SIZE) 321 + 322 + #define SBI_NACL_SHMEM_CSR_INDEX(__csr_num) \ 323 + ((((__csr_num) & 0xc00) >> 2) | 
((__csr_num) & 0xff)) 324 + 325 + #define SBI_NACL_SHMEM_HFENCE_ENTRY_SZ ((__riscv_xlen / 8) * 4) 326 + #define SBI_NACL_SHMEM_HFENCE_ENTRY_MAX \ 327 + (SBI_NACL_SHMEM_HFENCE_SIZE / \ 328 + SBI_NACL_SHMEM_HFENCE_ENTRY_SZ) 329 + #define SBI_NACL_SHMEM_HFENCE_ENTRY(__num) \ 330 + (SBI_NACL_SHMEM_HFENCE_OFFSET + \ 331 + (__num) * SBI_NACL_SHMEM_HFENCE_ENTRY_SZ) 332 + #define SBI_NACL_SHMEM_HFENCE_ENTRY_CONFIG(__num) \ 333 + SBI_NACL_SHMEM_HFENCE_ENTRY(__num) 334 + #define SBI_NACL_SHMEM_HFENCE_ENTRY_PNUM(__num)\ 335 + (SBI_NACL_SHMEM_HFENCE_ENTRY(__num) + (__riscv_xlen / 8)) 336 + #define SBI_NACL_SHMEM_HFENCE_ENTRY_PCOUNT(__num)\ 337 + (SBI_NACL_SHMEM_HFENCE_ENTRY(__num) + \ 338 + ((__riscv_xlen / 8) * 3)) 339 + 340 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_PEND_BITS 1 341 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_PEND_SHIFT \ 342 + (__riscv_xlen - SBI_NACL_SHMEM_HFENCE_CONFIG_PEND_BITS) 343 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_PEND_MASK \ 344 + ((1UL << SBI_NACL_SHMEM_HFENCE_CONFIG_PEND_BITS) - 1) 345 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_PEND \ 346 + (SBI_NACL_SHMEM_HFENCE_CONFIG_PEND_MASK << \ 347 + SBI_NACL_SHMEM_HFENCE_CONFIG_PEND_SHIFT) 348 + 349 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_RSVD1_BITS 3 350 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_RSVD1_SHIFT \ 351 + (SBI_NACL_SHMEM_HFENCE_CONFIG_PEND_SHIFT - \ 352 + SBI_NACL_SHMEM_HFENCE_CONFIG_RSVD1_BITS) 353 + 354 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_TYPE_BITS 4 355 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_TYPE_SHIFT \ 356 + (SBI_NACL_SHMEM_HFENCE_CONFIG_RSVD1_SHIFT - \ 357 + SBI_NACL_SHMEM_HFENCE_CONFIG_TYPE_BITS) 358 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_TYPE_MASK \ 359 + ((1UL << SBI_NACL_SHMEM_HFENCE_CONFIG_TYPE_BITS) - 1) 360 + 361 + #define SBI_NACL_SHMEM_HFENCE_TYPE_GVMA 0x0 362 + #define SBI_NACL_SHMEM_HFENCE_TYPE_GVMA_ALL 0x1 363 + #define SBI_NACL_SHMEM_HFENCE_TYPE_GVMA_VMID 0x2 364 + #define SBI_NACL_SHMEM_HFENCE_TYPE_GVMA_VMID_ALL 0x3 365 + #define SBI_NACL_SHMEM_HFENCE_TYPE_VVMA 0x4 366 + #define 
SBI_NACL_SHMEM_HFENCE_TYPE_VVMA_ALL 0x5 367 + #define SBI_NACL_SHMEM_HFENCE_TYPE_VVMA_ASID 0x6 368 + #define SBI_NACL_SHMEM_HFENCE_TYPE_VVMA_ASID_ALL 0x7 369 + 370 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_RSVD2_BITS 1 371 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_RSVD2_SHIFT \ 372 + (SBI_NACL_SHMEM_HFENCE_CONFIG_TYPE_SHIFT - \ 373 + SBI_NACL_SHMEM_HFENCE_CONFIG_RSVD2_BITS) 374 + 375 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_ORDER_BITS 7 376 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_ORDER_SHIFT \ 377 + (SBI_NACL_SHMEM_HFENCE_CONFIG_RSVD2_SHIFT - \ 378 + SBI_NACL_SHMEM_HFENCE_CONFIG_ORDER_BITS) 379 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_ORDER_MASK \ 380 + ((1UL << SBI_NACL_SHMEM_HFENCE_CONFIG_ORDER_BITS) - 1) 381 + #define SBI_NACL_SHMEM_HFENCE_ORDER_BASE 12 382 + 383 + #if __riscv_xlen == 32 384 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_ASID_BITS 9 385 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_VMID_BITS 7 386 + #else 387 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_ASID_BITS 16 388 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_VMID_BITS 14 389 + #endif 390 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_VMID_SHIFT \ 391 + SBI_NACL_SHMEM_HFENCE_CONFIG_ASID_BITS 392 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_ASID_MASK \ 393 + ((1UL << SBI_NACL_SHMEM_HFENCE_CONFIG_ASID_BITS) - 1) 394 + #define SBI_NACL_SHMEM_HFENCE_CONFIG_VMID_MASK \ 395 + ((1UL << SBI_NACL_SHMEM_HFENCE_CONFIG_VMID_BITS) - 1) 396 + 397 + #define SBI_NACL_SHMEM_AUTOSWAP_FLAG_HSTATUS BIT(0) 398 + #define SBI_NACL_SHMEM_AUTOSWAP_HSTATUS ((__riscv_xlen / 8) * 1) 399 + 400 + #define SBI_NACL_SHMEM_SRET_X(__i) ((__riscv_xlen / 8) * (__i)) 401 + #define SBI_NACL_SHMEM_SRET_X_LAST 31 284 402 285 403 /* SBI spec version fields */ 286 404 #define SBI_SPEC_VERSION_DEFAULT 0x1
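
The SBI_NACL_SHMEM_HFENCE_CONFIG_* shifts above are defined relative to one another, which makes the final bit positions hard to read off. The sketch below works them out for RV64 only and packs a config word the same way nacl_hfence_mkconfig() does. It is a userspace illustration under that assumption; the model_ and MODEL_ names are hypothetical and the numeric shifts are derived by hand from the definitions in this hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: __riscv_xlen == 64. */
#define MODEL_XLEN         64
#define MODEL_PEND_SHIFT   (MODEL_XLEN - 1)           /* bit 63 */
#define MODEL_TYPE_SHIFT   (MODEL_PEND_SHIFT - 3 - 4) /* 56, below 3 reserved bits */
#define MODEL_ORDER_SHIFT  (MODEL_TYPE_SHIFT - 1 - 7) /* 48, below 1 reserved bit */
#define MODEL_ORDER_BASE   12
#define MODEL_VMID_SHIFT   16                         /* ASID occupies bits 15:0 */

/* HFENCE region of the shared memory on RV64. */
#define MODEL_HFENCE_OFFSET    0x0800u
#define MODEL_HFENCE_SIZE      0x0780u
#define MODEL_HFENCE_ENTRY_SZ  ((MODEL_XLEN / 8) * 4)
#define MODEL_HFENCE_ENTRY_MAX (MODEL_HFENCE_SIZE / MODEL_HFENCE_ENTRY_SZ)

/* Pack pending bit, type, order, VMID and ASID as nacl_hfence_mkconfig() does. */
static inline uint64_t model_hfence_mkconfig(uint64_t type, uint64_t order,
					     uint64_t vmid, uint64_t asid)
{
	uint64_t c = UINT64_C(1) << MODEL_PEND_SHIFT;

	c |= (type & 0xf) << MODEL_TYPE_SHIFT;
	c |= ((order - MODEL_ORDER_BASE) & 0x7f) << MODEL_ORDER_SHIFT;
	c |= (vmid & 0x3fff) << MODEL_VMID_SHIFT;
	c |= asid & 0xffff;
	return c;
}

/* Byte offset of entry 'num', as in SBI_NACL_SHMEM_HFENCE_ENTRY(). */
static inline unsigned int model_hfence_entry_offset(unsigned int num)
{
	return MODEL_HFENCE_OFFSET + num * MODEL_HFENCE_ENTRY_SZ;
}
```

Under these assumptions the HFENCE region holds 60 entries of 32 bytes each, and the last entry ends exactly where the dirty bitmap begins at 0x0F80.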

+10

arch/riscv/kernel/perf_callchain.c

··· 28 28 void perf_callchain_user(struct perf_callchain_entry_ctx *entry, 29 29 struct pt_regs *regs) 30 30 { 31 + if (perf_guest_state()) { 32 + /* TODO: We don't support guest os callchain now */ 33 + return; 34 + } 35 + 31 36 arch_stack_walk_user(fill_callchain, entry, regs); 32 37 } 33 38 34 39 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, 35 40 struct pt_regs *regs) 36 41 { 42 + if (perf_guest_state()) { 43 + /* TODO: We don't support guest os callchain now */ 44 + return; 45 + } 46 + 37 47 walk_stackframe(NULL, regs, fill_callchain, entry); 38 48 }

+1

arch/riscv/kvm/Kconfig

··· 32 32 select KVM_XFER_TO_GUEST_WORK 33 33 select KVM_GENERIC_MMU_NOTIFIER 34 34 select SCHED_INFO 35 + select GUEST_PERF_EVENTS if PERF_EVENTS 35 36 help 36 37 Support hosting virtualized guest machines. 37 38

+15 -12

arch/riscv/kvm/Makefile

··· 9 9 10 10 obj-$(CONFIG_KVM) += kvm.o 11 11 12 + # Ordered alphabetically 13 + kvm-y += aia.o 14 + kvm-y += aia_aplic.o 15 + kvm-y += aia_device.o 16 + kvm-y += aia_imsic.o 12 17 kvm-y += main.o 13 - kvm-y += vm.o 14 - kvm-y += vmid.o 15 - kvm-y += tlb.o 16 18 kvm-y += mmu.o 19 + kvm-y += nacl.o 20 + kvm-y += tlb.o 17 21 kvm-y += vcpu.o 18 22 kvm-y += vcpu_exit.o 19 23 kvm-y += vcpu_fp.o 20 - kvm-y += vcpu_vector.o 21 24 kvm-y += vcpu_insn.o 22 25 kvm-y += vcpu_onereg.o 23 - kvm-y += vcpu_switch.o 26 + kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_pmu.o 24 27 kvm-y += vcpu_sbi.o 25 - kvm-$(CONFIG_RISCV_SBI_V01) += vcpu_sbi_v01.o 26 28 kvm-y += vcpu_sbi_base.o 27 - kvm-y += vcpu_sbi_replace.o 28 29 kvm-y += vcpu_sbi_hsm.o 30 + kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_sbi_pmu.o 31 + kvm-y += vcpu_sbi_replace.o 29 32 kvm-y += vcpu_sbi_sta.o 33 + kvm-$(CONFIG_RISCV_SBI_V01) += vcpu_sbi_v01.o 34 + kvm-y += vcpu_switch.o 30 35 kvm-y += vcpu_timer.o 31 - kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_pmu.o vcpu_sbi_pmu.o 32 - kvm-y += aia.o 33 - kvm-y += aia_device.o 34 - kvm-y += aia_aplic.o 35 - kvm-y += aia_imsic.o 36 + kvm-y += vcpu_vector.o 37 + kvm-y += vm.o 38 + kvm-y += vmid.o

+76 -38

arch/riscv/kvm/aia.c

··· 16 16 #include <linux/percpu.h> 17 17 #include <linux/spinlock.h> 18 18 #include <asm/cpufeature.h> 19 + #include <asm/kvm_nacl.h> 19 20 20 21 struct aia_hgei_control { 21 22 raw_spinlock_t lock; ··· 52 51 return hgei; 53 52 } 54 53 55 - static void aia_set_hvictl(bool ext_irq_pending) 54 + static inline unsigned long aia_hvictl_value(bool ext_irq_pending) 56 55 { 57 56 unsigned long hvictl; 58 57 ··· 63 62 64 63 hvictl = (IRQ_S_EXT << HVICTL_IID_SHIFT) & HVICTL_IID; 65 64 hvictl |= ext_irq_pending; 66 - csr_write(CSR_HVICTL, hvictl); 65 + return hvictl; 67 66 } 68 67 69 68 #ifdef CONFIG_32BIT ··· 89 88 struct kvm_vcpu_aia_csr *csr = &vcpu->arch.aia_context.guest_csr; 90 89 91 90 if (kvm_riscv_aia_available()) 92 - csr->vsieh = csr_read(CSR_VSIEH); 91 + csr->vsieh = ncsr_read(CSR_VSIEH); 93 92 } 94 93 #endif 95 94 ··· 116 115 117 116 hgei = aia_find_hgei(vcpu); 118 117 if (hgei > 0) 119 - return !!(csr_read(CSR_HGEIP) & BIT(hgei)); 118 + return !!(ncsr_read(CSR_HGEIP) & BIT(hgei)); 120 119 121 120 return false; 122 121 } ··· 129 128 return; 130 129 131 130 #ifdef CONFIG_32BIT 132 - csr_write(CSR_HVIPH, vcpu->arch.aia_context.guest_csr.hviph); 131 + ncsr_write(CSR_HVIPH, vcpu->arch.aia_context.guest_csr.hviph); 133 132 #endif 134 - aia_set_hvictl(!!(csr->hvip & BIT(IRQ_VS_EXT))); 133 + ncsr_write(CSR_HVICTL, aia_hvictl_value(!!(csr->hvip & BIT(IRQ_VS_EXT)))); 135 134 } 136 135 137 136 void kvm_riscv_vcpu_aia_load(struct kvm_vcpu *vcpu, int cpu) 138 137 { 139 138 struct kvm_vcpu_aia_csr *csr = &vcpu->arch.aia_context.guest_csr; 139 + void *nsh; 140 140 141 141 if (!kvm_riscv_aia_available()) 142 142 return; 143 143 144 - csr_write(CSR_VSISELECT, csr->vsiselect); 145 - csr_write(CSR_HVIPRIO1, csr->hviprio1); 146 - csr_write(CSR_HVIPRIO2, csr->hviprio2); 144 + if (kvm_riscv_nacl_sync_csr_available()) { 145 + nsh = nacl_shmem(); 146 + nacl_csr_write(nsh, CSR_VSISELECT, csr->vsiselect); 147 + nacl_csr_write(nsh, CSR_HVIPRIO1, csr->hviprio1); 148 + nacl_csr_write(nsh, 
CSR_HVIPRIO2, csr->hviprio2); 147 149 #ifdef CONFIG_32BIT 148 - csr_write(CSR_VSIEH, csr->vsieh); 149 - csr_write(CSR_HVIPH, csr->hviph); 150 - csr_write(CSR_HVIPRIO1H, csr->hviprio1h); 151 - csr_write(CSR_HVIPRIO2H, csr->hviprio2h); 150 + nacl_csr_write(nsh, CSR_VSIEH, csr->vsieh); 151 + nacl_csr_write(nsh, CSR_HVIPH, csr->hviph); 152 + nacl_csr_write(nsh, CSR_HVIPRIO1H, csr->hviprio1h); 153 + nacl_csr_write(nsh, CSR_HVIPRIO2H, csr->hviprio2h); 152 154 #endif 155 + } else { 156 + csr_write(CSR_VSISELECT, csr->vsiselect); 157 + csr_write(CSR_HVIPRIO1, csr->hviprio1); 158 + csr_write(CSR_HVIPRIO2, csr->hviprio2); 159 + #ifdef CONFIG_32BIT 160 + csr_write(CSR_VSIEH, csr->vsieh); 161 + csr_write(CSR_HVIPH, csr->hviph); 162 + csr_write(CSR_HVIPRIO1H, csr->hviprio1h); 163 + csr_write(CSR_HVIPRIO2H, csr->hviprio2h); 164 + #endif 165 + } 153 166 } 154 167 155 168 void kvm_riscv_vcpu_aia_put(struct kvm_vcpu *vcpu) 156 169 { 157 170 struct kvm_vcpu_aia_csr *csr = &vcpu->arch.aia_context.guest_csr; 171 + void *nsh; 158 172 159 173 if (!kvm_riscv_aia_available()) 160 174 return; 161 175 162 - csr->vsiselect = csr_read(CSR_VSISELECT); 163 - csr->hviprio1 = csr_read(CSR_HVIPRIO1); 164 - csr->hviprio2 = csr_read(CSR_HVIPRIO2); 176 + if (kvm_riscv_nacl_available()) { 177 + nsh = nacl_shmem(); 178 + csr->vsiselect = nacl_csr_read(nsh, CSR_VSISELECT); 179 + csr->hviprio1 = nacl_csr_read(nsh, CSR_HVIPRIO1); 180 + csr->hviprio2 = nacl_csr_read(nsh, CSR_HVIPRIO2); 165 181 #ifdef CONFIG_32BIT 166 - csr->vsieh = csr_read(CSR_VSIEH); 167 - csr->hviph = csr_read(CSR_HVIPH); 168 - csr->hviprio1h = csr_read(CSR_HVIPRIO1H); 169 - csr->hviprio2h = csr_read(CSR_HVIPRIO2H); 182 + csr->vsieh = nacl_csr_read(nsh, CSR_VSIEH); 183 + csr->hviph = nacl_csr_read(nsh, CSR_HVIPH); 184 + csr->hviprio1h = nacl_csr_read(nsh, CSR_HVIPRIO1H); 185 + csr->hviprio2h = nacl_csr_read(nsh, CSR_HVIPRIO2H); 170 186 #endif 187 + } else { 188 + csr->vsiselect = csr_read(CSR_VSISELECT); 189 + csr->hviprio1 = 
csr_read(CSR_HVIPRIO1); 190 + csr->hviprio2 = csr_read(CSR_HVIPRIO2); 191 + #ifdef CONFIG_32BIT 192 + csr->vsieh = csr_read(CSR_VSIEH); 193 + csr->hviph = csr_read(CSR_HVIPH); 194 + csr->hviprio1h = csr_read(CSR_HVIPRIO1H); 195 + csr->hviprio2h = csr_read(CSR_HVIPRIO2H); 196 + #endif 197 + } 171 198 } 172 199 173 200 int kvm_riscv_vcpu_aia_get_csr(struct kvm_vcpu *vcpu, ··· 279 250 280 251 switch (bitpos / BITS_PER_LONG) { 281 252 case 0: 282 - hviprio = csr_read(CSR_HVIPRIO1); 253 + hviprio = ncsr_read(CSR_HVIPRIO1); 283 254 break; 284 255 case 1: 285 256 #ifndef CONFIG_32BIT 286 - hviprio = csr_read(CSR_HVIPRIO2); 257 + hviprio = ncsr_read(CSR_HVIPRIO2); 287 258 break; 288 259 #else 289 - hviprio = csr_read(CSR_HVIPRIO1H); 260 + hviprio = ncsr_read(CSR_HVIPRIO1H); 290 261 break; 291 262 case 2: 292 - hviprio = csr_read(CSR_HVIPRIO2); 263 + hviprio = ncsr_read(CSR_HVIPRIO2); 293 264 break; 294 265 case 3: 295 - hviprio = csr_read(CSR_HVIPRIO2H); 266 + hviprio = ncsr_read(CSR_HVIPRIO2H); 296 267 break; 297 268 #endif 298 269 default: ··· 312 283 313 284 switch (bitpos / BITS_PER_LONG) { 314 285 case 0: 315 - hviprio = csr_read(CSR_HVIPRIO1); 286 + hviprio = ncsr_read(CSR_HVIPRIO1); 316 287 break; 317 288 case 1: 318 289 #ifndef CONFIG_32BIT 319 - hviprio = csr_read(CSR_HVIPRIO2); 290 + hviprio = ncsr_read(CSR_HVIPRIO2); 320 291 break; 321 292 #else 322 - hviprio = csr_read(CSR_HVIPRIO1H); 293 + hviprio = ncsr_read(CSR_HVIPRIO1H); 323 294 break; 324 295 case 2: 325 - hviprio = csr_read(CSR_HVIPRIO2); 296 + hviprio = ncsr_read(CSR_HVIPRIO2); 326 297 break; 327 298 case 3: 328 - hviprio = csr_read(CSR_HVIPRIO2H); 299 + hviprio = ncsr_read(CSR_HVIPRIO2H); 329 300 break; 330 301 #endif 331 302 default: ··· 337 308 338 309 switch (bitpos / BITS_PER_LONG) { 339 310 case 0: 340 - csr_write(CSR_HVIPRIO1, hviprio); 311 + ncsr_write(CSR_HVIPRIO1, hviprio); 341 312 break; 342 313 case 1: 343 314 #ifndef CONFIG_32BIT 344 - csr_write(CSR_HVIPRIO2, hviprio); 315 + 
ncsr_write(CSR_HVIPRIO2, hviprio); 345 316 break; 346 317 #else 347 - csr_write(CSR_HVIPRIO1H, hviprio); 318 + ncsr_write(CSR_HVIPRIO1H, hviprio); 348 319 break; 349 320 case 2: 350 - csr_write(CSR_HVIPRIO2, hviprio); 321 + ncsr_write(CSR_HVIPRIO2, hviprio); 351 322 break; 352 323 case 3: 353 - csr_write(CSR_HVIPRIO2H, hviprio); 324 + ncsr_write(CSR_HVIPRIO2H, hviprio); 354 325 break; 355 326 #endif 356 327 default: ··· 406 377 return KVM_INSN_ILLEGAL_TRAP; 407 378 408 379 /* First try to emulate in kernel space */ 409 - isel = csr_read(CSR_VSISELECT) & ISELECT_MASK; 380 + isel = ncsr_read(CSR_VSISELECT) & ISELECT_MASK; 410 381 if (isel >= ISELECT_IPRIO0 && isel <= ISELECT_IPRIO15) 411 382 return aia_rmw_iprio(vcpu, isel, val, new_val, wr_mask); 412 383 else if (isel >= IMSIC_FIRST && isel <= IMSIC_LAST && ··· 528 499 hgctrl->free_bitmap = 0; 529 500 } 530 501 502 + /* Skip SGEI interrupt setup for zero guest external interrupts */ 503 + if (!kvm_riscv_aia_nr_hgei) 504 + goto skip_sgei_interrupt; 505 + 531 506 /* Find INTC irq domain */ 532 507 domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(), 533 508 DOMAIN_BUS_ANY); ··· 555 522 return rc; 556 523 } 557 524 525 + skip_sgei_interrupt: 558 526 return 0; 559 527 } 560 528 561 529 static void aia_hgei_exit(void) 562 530 { 531 + /* Do nothing for zero guest external interrupts */ 532 + if (!kvm_riscv_aia_nr_hgei) 533 + return; 534 + 563 535 /* Free per-CPU SGEI interrupt */ 564 536 free_percpu_irq(hgei_parent_irq, &aia_hgei); 565 537 } ··· 574 536 if (!kvm_riscv_aia_available()) 575 537 return; 576 538 577 - aia_set_hvictl(false); 539 + csr_write(CSR_HVICTL, aia_hvictl_value(false)); 578 540 csr_write(CSR_HVIPRIO1, 0x0); 579 541 csr_write(CSR_HVIPRIO2, 0x0); 580 542 #ifdef CONFIG_32BIT ··· 610 572 csr_clear(CSR_HIE, BIT(IRQ_S_GEXT)); 611 573 disable_percpu_irq(hgei_parent_irq); 612 574 613 - aia_set_hvictl(false); 575 + csr_write(CSR_HVICTL, aia_hvictl_value(false)); 614 576 615 577 
raw_spin_lock_irqsave(&hgctrl->lock, flags); 616 578

+2 -1

arch/riscv/kvm/aia_aplic.c

··· 143 143 if (sm == APLIC_SOURCECFG_SM_LEVEL_HIGH || 144 144 sm == APLIC_SOURCECFG_SM_LEVEL_LOW) { 145 145 if (!pending) 146 - goto skip_write_pending; 146 + goto noskip_write_pending; 147 147 if ((irqd->state & APLIC_IRQ_STATE_INPUT) && 148 148 sm == APLIC_SOURCECFG_SM_LEVEL_LOW) 149 149 goto skip_write_pending; ··· 152 152 goto skip_write_pending; 153 153 } 154 154 155 + noskip_write_pending: 155 156 if (pending) 156 157 irqd->state |= APLIC_IRQ_STATE_PENDING; 157 158 else

+59 -4

arch/riscv/kvm/main.c

··· 10 10 #include <linux/err.h> 11 11 #include <linux/module.h> 12 12 #include <linux/kvm_host.h> 13 - #include <asm/csr.h> 14 13 #include <asm/cpufeature.h> 14 + #include <asm/kvm_nacl.h> 15 15 #include <asm/sbi.h> 16 16 17 17 long kvm_arch_dev_ioctl(struct file *filp, ··· 22 22 23 23 int kvm_arch_enable_virtualization_cpu(void) 24 24 { 25 + int rc; 26 + 27 + rc = kvm_riscv_nacl_enable(); 28 + if (rc) 29 + return rc; 30 + 25 31 csr_write(CSR_HEDELEG, KVM_HEDELEG_DEFAULT); 26 32 csr_write(CSR_HIDELEG, KVM_HIDELEG_DEFAULT); 27 33 ··· 55 49 csr_write(CSR_HVIP, 0); 56 50 csr_write(CSR_HEDELEG, 0); 57 51 csr_write(CSR_HIDELEG, 0); 52 + 53 + kvm_riscv_nacl_disable(); 54 + } 55 + 56 + static void kvm_riscv_teardown(void) 57 + { 58 + kvm_riscv_aia_exit(); 59 + kvm_riscv_nacl_exit(); 60 + kvm_unregister_perf_callbacks(); 58 61 } 59 62 60 63 static int __init riscv_kvm_init(void) 61 64 { 62 65 int rc; 66 + char slist[64]; 63 67 const char *str; 64 68 65 69 if (!riscv_isa_extension_available(NULL, h)) { ··· 87 71 return -ENODEV; 88 72 } 89 73 74 + rc = kvm_riscv_nacl_init(); 75 + if (rc && rc != -ENODEV) 76 + return rc; 77 + 90 78 kvm_riscv_gstage_mode_detect(); 91 79 92 80 kvm_riscv_gstage_vmid_detect(); 93 81 94 82 rc = kvm_riscv_aia_init(); 95 - if (rc && rc != -ENODEV) 83 + if (rc && rc != -ENODEV) { 84 + kvm_riscv_nacl_exit(); 96 85 return rc; 86 + } 97 87 98 88 kvm_info("hypervisor extension available\n"); 89 + 90 + if (kvm_riscv_nacl_available()) { 91 + rc = 0; 92 + slist[0] = '\0'; 93 + if (kvm_riscv_nacl_sync_csr_available()) { 94 + if (rc) 95 + strcat(slist, ", "); 96 + strcat(slist, "sync_csr"); 97 + rc++; 98 + } 99 + if (kvm_riscv_nacl_sync_hfence_available()) { 100 + if (rc) 101 + strcat(slist, ", "); 102 + strcat(slist, "sync_hfence"); 103 + rc++; 104 + } 105 + if (kvm_riscv_nacl_sync_sret_available()) { 106 + if (rc) 107 + strcat(slist, ", "); 108 + strcat(slist, "sync_sret"); 109 + rc++; 110 + } 111 + if (kvm_riscv_nacl_autoswap_csr_available()) { 112 + if 
(rc) 113 + strcat(slist, ", "); 114 + strcat(slist, "autoswap_csr"); 115 + rc++; 116 + } 117 + kvm_info("using SBI nested acceleration with %s\n", 118 + (rc) ? slist : "no features"); 119 + } 99 120 100 121 switch (kvm_riscv_gstage_mode()) { 101 122 case HGATP_MODE_SV32X4: ··· 158 105 kvm_info("AIA available with %d guest external interrupts\n", 159 106 kvm_riscv_aia_nr_hgei); 160 107 108 + kvm_register_perf_callbacks(NULL); 109 + 161 110 rc = kvm_init(sizeof(struct kvm_vcpu), 0, THIS_MODULE); 162 111 if (rc) { 163 - kvm_riscv_aia_exit(); 112 + kvm_riscv_teardown(); 164 113 return rc; 165 114 } 166 115 ··· 172 117 173 118 static void __exit riscv_kvm_exit(void) 174 119 { 175 - kvm_riscv_aia_exit(); 120 + kvm_riscv_teardown(); 176 121 177 122 kvm_exit(); 178 123 }

+6 -7

arch/riscv/kvm/mmu.c

··· 15 15 #include <linux/vmalloc.h> 16 16 #include <linux/kvm_host.h> 17 17 #include <linux/sched/signal.h> 18 - #include <asm/csr.h> 18 + #include <asm/kvm_nacl.h> 19 19 #include <asm/page.h> 20 20 #include <asm/pgtable.h> 21 21 ··· 601 601 bool logging = (memslot->dirty_bitmap && 602 602 !(memslot->flags & KVM_MEM_READONLY)) ? true : false; 603 603 unsigned long vma_pagesize, mmu_seq; 604 + struct page *page; 604 605 605 606 /* We need minimum second+third level pages */ 606 607 ret = kvm_mmu_topup_memory_cache(pcache, gstage_pgd_levels); ··· 632 631 633 632 /* 634 633 * Read mmu_invalidate_seq so that KVM can detect if the results of 635 - * vma_lookup() or gfn_to_pfn_prot() become stale priort to acquiring 634 + * vma_lookup() or __kvm_faultin_pfn() become stale prior to acquiring 636 635 * kvm->mmu_lock. 637 636 * 638 637 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs ··· 648 647 return -EFAULT; 649 648 } 650 649 651 - hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writable); 650 + hfn = kvm_faultin_pfn(vcpu, gfn, is_write, &writable, &page); 652 651 if (hfn == KVM_PFN_ERR_HWPOISON) { 653 652 send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, 654 653 vma_pageshift, current); ··· 670 669 goto out_unlock; 671 670 672 671 if (writable) { 673 - kvm_set_pfn_dirty(hfn); 674 672 mark_page_dirty(kvm, gfn); 675 673 ret = gstage_map_page(kvm, pcache, gpa, hfn << PAGE_SHIFT, 676 674 vma_pagesize, false, true); ··· 682 682 kvm_err("Failed to map in G-stage\n"); 683 683 684 684 out_unlock: 685 + kvm_release_faultin_page(kvm, page, ret && ret != -EEXIST, writable); 685 686 spin_unlock(&kvm->mmu_lock); 686 - kvm_set_pfn_accessed(hfn); 687 - kvm_release_pfn_clean(hfn); 688 687 return ret; 689 688 } 690 689 ··· 731 732 hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) & HGATP_VMID; 732 733 hgatp |= (k->pgd_phys >> PAGE_SHIFT) & HGATP_PPN; 733 734 734 - csr_write(CSR_HGATP, hgatp); 735 + ncsr_write(CSR_HGATP, hgatp); 735 736 736 737 if 
(!kvm_riscv_gstage_vmid_bits()) 737 738 kvm_riscv_local_hfence_gvma_all();

+152

arch/riscv/kvm/nacl.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (c) 2024 Ventana Micro Systems Inc. 4 + */ 5 + 6 + #include <linux/kvm_host.h> 7 + #include <linux/vmalloc.h> 8 + #include <asm/kvm_nacl.h> 9 + 10 + DEFINE_STATIC_KEY_FALSE(kvm_riscv_nacl_available); 11 + DEFINE_STATIC_KEY_FALSE(kvm_riscv_nacl_sync_csr_available); 12 + DEFINE_STATIC_KEY_FALSE(kvm_riscv_nacl_sync_hfence_available); 13 + DEFINE_STATIC_KEY_FALSE(kvm_riscv_nacl_sync_sret_available); 14 + DEFINE_STATIC_KEY_FALSE(kvm_riscv_nacl_autoswap_csr_available); 15 + DEFINE_PER_CPU(struct kvm_riscv_nacl, kvm_riscv_nacl); 16 + 17 + void __kvm_riscv_nacl_hfence(void *shmem, 18 + unsigned long control, 19 + unsigned long page_num, 20 + unsigned long page_count) 21 + { 22 + int i, ent = -1, try_count = 5; 23 + unsigned long *entp; 24 + 25 + again: 26 + for (i = 0; i < SBI_NACL_SHMEM_HFENCE_ENTRY_MAX; i++) { 27 + entp = shmem + SBI_NACL_SHMEM_HFENCE_ENTRY_CONFIG(i); 28 + if (lelong_to_cpu(*entp) & SBI_NACL_SHMEM_HFENCE_CONFIG_PEND) 29 + continue; 30 + 31 + ent = i; 32 + break; 33 + } 34 + 35 + if (ent < 0) { 36 + if (try_count) { 37 + nacl_sync_hfence(-1UL); 38 + goto again; 39 + } else { 40 + pr_warn("KVM: No free entry in NACL shared memory\n"); 41 + return; 42 + } 43 + } 44 + 45 + entp = shmem + SBI_NACL_SHMEM_HFENCE_ENTRY_CONFIG(i); 46 + *entp = cpu_to_lelong(control); 47 + entp = shmem + SBI_NACL_SHMEM_HFENCE_ENTRY_PNUM(i); 48 + *entp = cpu_to_lelong(page_num); 49 + entp = shmem + SBI_NACL_SHMEM_HFENCE_ENTRY_PCOUNT(i); 50 + *entp = cpu_to_lelong(page_count); 51 + } 52 + 53 + int kvm_riscv_nacl_enable(void) 54 + { 55 + int rc; 56 + struct sbiret ret; 57 + struct kvm_riscv_nacl *nacl; 58 + 59 + if (!kvm_riscv_nacl_available()) 60 + return 0; 61 + nacl = this_cpu_ptr(&kvm_riscv_nacl); 62 + 63 + ret = sbi_ecall(SBI_EXT_NACL, SBI_EXT_NACL_SET_SHMEM, 64 + nacl->shmem_phys, 0, 0, 0, 0, 0); 65 + rc = sbi_err_map_linux_errno(ret.error); 66 + if (rc) 67 + return rc; 68 + 69 + return 0; 70 + } 71 + 72 + void 
kvm_riscv_nacl_disable(void) 73 + { 74 + if (!kvm_riscv_nacl_available()) 75 + return; 76 + 77 + sbi_ecall(SBI_EXT_NACL, SBI_EXT_NACL_SET_SHMEM, 78 + SBI_SHMEM_DISABLE, SBI_SHMEM_DISABLE, 0, 0, 0, 0); 79 + } 80 + 81 + void kvm_riscv_nacl_exit(void) 82 + { 83 + int cpu; 84 + struct kvm_riscv_nacl *nacl; 85 + 86 + if (!kvm_riscv_nacl_available()) 87 + return; 88 + 89 + /* Free per-CPU shared memory */ 90 + for_each_possible_cpu(cpu) { 91 + nacl = per_cpu_ptr(&kvm_riscv_nacl, cpu); 92 + if (!nacl->shmem) 93 + continue; 94 + 95 + free_pages((unsigned long)nacl->shmem, 96 + get_order(SBI_NACL_SHMEM_SIZE)); 97 + nacl->shmem = NULL; 98 + nacl->shmem_phys = 0; 99 + } 100 + } 101 + 102 + static long nacl_probe_feature(long feature_id) 103 + { 104 + struct sbiret ret; 105 + 106 + if (!kvm_riscv_nacl_available()) 107 + return 0; 108 + 109 + ret = sbi_ecall(SBI_EXT_NACL, SBI_EXT_NACL_PROBE_FEATURE, 110 + feature_id, 0, 0, 0, 0, 0); 111 + return ret.value; 112 + } 113 + 114 + int kvm_riscv_nacl_init(void) 115 + { 116 + int cpu; 117 + struct page *shmem_page; 118 + struct kvm_riscv_nacl *nacl; 119 + 120 + if (sbi_spec_version < sbi_mk_version(1, 0) || 121 + sbi_probe_extension(SBI_EXT_NACL) <= 0) 122 + return -ENODEV; 123 + 124 + /* Enable NACL support */ 125 + static_branch_enable(&kvm_riscv_nacl_available); 126 + 127 + /* Probe NACL features */ 128 + if (nacl_probe_feature(SBI_NACL_FEAT_SYNC_CSR)) 129 + static_branch_enable(&kvm_riscv_nacl_sync_csr_available); 130 + if (nacl_probe_feature(SBI_NACL_FEAT_SYNC_HFENCE)) 131 + static_branch_enable(&kvm_riscv_nacl_sync_hfence_available); 132 + if (nacl_probe_feature(SBI_NACL_FEAT_SYNC_SRET)) 133 + static_branch_enable(&kvm_riscv_nacl_sync_sret_available); 134 + if (nacl_probe_feature(SBI_NACL_FEAT_AUTOSWAP_CSR)) 135 + static_branch_enable(&kvm_riscv_nacl_autoswap_csr_available); 136 + 137 + /* Allocate per-CPU shared memory */ 138 + for_each_possible_cpu(cpu) { 139 + nacl = per_cpu_ptr(&kvm_riscv_nacl, cpu); 140 + 141 + 
shmem_page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 142 + get_order(SBI_NACL_SHMEM_SIZE)); 143 + if (!shmem_page) { 144 + kvm_riscv_nacl_exit(); 145 + return -ENOMEM; 146 + } 147 + nacl->shmem = page_to_virt(shmem_page); 148 + nacl->shmem_phys = page_to_phys(shmem_page); 149 + } 150 + 151 + return 0; 152 + }
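
Editor's sketch: in the `__kvm_riscv_nacl_hfence()` hunk above, `try_count` is tested but never decremented as shown, so the retry path is not visibly bounded. The user-space model below shows one way a bounded "find a free entry, flush, retry" loop could look. All names here (`ENTRY_MAX`, `CONFIG_PEND`, `claim_entry`, the flush callbacks) are hypothetical stand-ins, not the kernel's API.

```c
#include <assert.h>
#include <stddef.h>

#define ENTRY_MAX   4      /* hypothetical; stands in for SBI_NACL_SHMEM_HFENCE_ENTRY_MAX */
#define CONFIG_PEND 0x1UL  /* hypothetical "entry still pending" bit */

/* Return the index of the first entry whose pending bit is clear, or -1. */
static int find_free_entry(const unsigned long *cfg, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!(cfg[i] & CONFIG_PEND))
			return (int)i;
	return -1;
}

/*
 * Search for a free entry. When none is free, call flush() and search
 * again, at most `retries` times. The budget is decremented on every
 * attempt, so the loop always terminates.
 */
static int claim_entry(unsigned long *cfg, size_t n, int retries,
		       void (*flush)(unsigned long *, size_t))
{
	int ent;

	assert(cfg != NULL && flush != NULL);
	while ((ent = find_free_entry(cfg, n)) < 0) {
		if (retries-- <= 0)
			return -1;
		flush(cfg, n);
	}
	return ent;
}

/* Model of a flush that completes every pending entry. */
static void flush_all(unsigned long *cfg, size_t n)
{
	for (size_t i = 0; i < n; i++)
		cfg[i] &= ~CONFIG_PEND;
}

/* Model of a flush that makes no progress. */
static void flush_none(unsigned long *cfg, size_t n)
{
	(void)cfg;
	(void)n;
}
```

With a flush that never makes progress, `claim_entry()` gives up after the retry budget is spent instead of looping forever.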

+40 -17

arch/riscv/kvm/tlb.c

··· 14 14 #include <asm/csr.h> 15 15 #include <asm/cpufeature.h> 16 16 #include <asm/insn-def.h> 17 + #include <asm/kvm_nacl.h> 17 18 18 19 #define has_svinval() riscv_has_extension_unlikely(RISCV_ISA_EXT_SVINVAL) 19 20 ··· 187 186 188 187 void kvm_riscv_hfence_gvma_vmid_all_process(struct kvm_vcpu *vcpu) 189 188 { 190 - struct kvm_vmid *vmid; 189 + struct kvm_vmid *v = &vcpu->kvm->arch.vmid; 190 + unsigned long vmid = READ_ONCE(v->vmid); 191 191 192 - vmid = &vcpu->kvm->arch.vmid; 193 - kvm_riscv_local_hfence_gvma_vmid_all(READ_ONCE(vmid->vmid)); 192 + if (kvm_riscv_nacl_available()) 193 + nacl_hfence_gvma_vmid_all(nacl_shmem(), vmid); 194 + else 195 + kvm_riscv_local_hfence_gvma_vmid_all(vmid); 194 196 } 195 197 196 198 void kvm_riscv_hfence_vvma_all_process(struct kvm_vcpu *vcpu) 197 199 { 198 - struct kvm_vmid *vmid; 200 + struct kvm_vmid *v = &vcpu->kvm->arch.vmid; 201 + unsigned long vmid = READ_ONCE(v->vmid); 199 202 200 - vmid = &vcpu->kvm->arch.vmid; 201 - kvm_riscv_local_hfence_vvma_all(READ_ONCE(vmid->vmid)); 203 + if (kvm_riscv_nacl_available()) 204 + nacl_hfence_vvma_all(nacl_shmem(), vmid); 205 + else 206 + kvm_riscv_local_hfence_vvma_all(vmid); 202 207 } 203 208 204 209 static bool vcpu_hfence_dequeue(struct kvm_vcpu *vcpu, ··· 258 251 259 252 void kvm_riscv_hfence_process(struct kvm_vcpu *vcpu) 260 253 { 254 + unsigned long vmid; 261 255 struct kvm_riscv_hfence d = { 0 }; 262 256 struct kvm_vmid *v = &vcpu->kvm->arch.vmid; 263 257 ··· 267 259 case KVM_RISCV_HFENCE_UNKNOWN: 268 260 break; 269 261 case KVM_RISCV_HFENCE_GVMA_VMID_GPA: 270 - kvm_riscv_local_hfence_gvma_vmid_gpa( 271 - READ_ONCE(v->vmid), 272 - d.addr, d.size, d.order); 262 + vmid = READ_ONCE(v->vmid); 263 + if (kvm_riscv_nacl_available()) 264 + nacl_hfence_gvma_vmid(nacl_shmem(), vmid, 265 + d.addr, d.size, d.order); 266 + else 267 + kvm_riscv_local_hfence_gvma_vmid_gpa(vmid, d.addr, 268 + d.size, d.order); 273 269 break; 274 270 case KVM_RISCV_HFENCE_VVMA_ASID_GVA: 275 271 
kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_HFENCE_VVMA_ASID_RCVD); 276 - kvm_riscv_local_hfence_vvma_asid_gva( 277 - READ_ONCE(v->vmid), d.asid, 278 - d.addr, d.size, d.order); 272 + vmid = READ_ONCE(v->vmid); 273 + if (kvm_riscv_nacl_available()) 274 + nacl_hfence_vvma_asid(nacl_shmem(), vmid, d.asid, 275 + d.addr, d.size, d.order); 276 + else 277 + kvm_riscv_local_hfence_vvma_asid_gva(vmid, d.asid, d.addr, 278 + d.size, d.order); 279 279 break; 280 280 case KVM_RISCV_HFENCE_VVMA_ASID_ALL: 281 281 kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_HFENCE_VVMA_ASID_RCVD); 282 - kvm_riscv_local_hfence_vvma_asid_all( 283 - READ_ONCE(v->vmid), d.asid); 282 + vmid = READ_ONCE(v->vmid); 283 + if (kvm_riscv_nacl_available()) 284 + nacl_hfence_vvma_asid_all(nacl_shmem(), vmid, d.asid); 285 + else 286 + kvm_riscv_local_hfence_vvma_asid_all(vmid, d.asid); 284 287 break; 285 288 case KVM_RISCV_HFENCE_VVMA_GVA: 286 289 kvm_riscv_vcpu_pmu_incr_fw(vcpu, SBI_PMU_FW_HFENCE_VVMA_RCVD); 287 - kvm_riscv_local_hfence_vvma_gva( 288 - READ_ONCE(v->vmid), 289 - d.addr, d.size, d.order); 290 + vmid = READ_ONCE(v->vmid); 291 + if (kvm_riscv_nacl_available()) 292 + nacl_hfence_vvma(nacl_shmem(), vmid, 293 + d.addr, d.size, d.order); 294 + else 295 + kvm_riscv_local_hfence_vvma_gva(vmid, d.addr, 296 + d.size, d.order); 290 297 break; 291 298 default: 292 299 break;

+148 -43

arch/riscv/kvm/vcpu.c

··· 17 17 #include <linux/sched/signal.h> 18 18 #include <linux/fs.h> 19 19 #include <linux/kvm_host.h> 20 - #include <asm/csr.h> 21 20 #include <asm/cacheflush.h> 21 + #include <asm/kvm_nacl.h> 22 22 #include <asm/kvm_vcpu_vector.h> 23 23 24 24 #define CREATE_TRACE_POINTS ··· 226 226 return (vcpu->arch.guest_context.sstatus & SR_SPP) ? true : false; 227 227 } 228 228 229 + #ifdef CONFIG_GUEST_PERF_EVENTS 230 + unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu) 231 + { 232 + return vcpu->arch.guest_context.sepc; 233 + } 234 + #endif 235 + 229 236 vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf) 230 237 { 231 238 return VM_FAULT_SIGBUS; ··· 368 361 struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; 369 362 370 363 /* Read current HVIP and VSIE CSRs */ 371 - csr->vsie = csr_read(CSR_VSIE); 364 + csr->vsie = ncsr_read(CSR_VSIE); 372 365 373 366 /* Sync-up HVIP.VSSIP bit changes does by Guest */ 374 - hvip = csr_read(CSR_HVIP); 367 + hvip = ncsr_read(CSR_HVIP); 375 368 if ((csr->hvip ^ hvip) & (1UL << IRQ_VS_SOFT)) { 376 369 if (hvip & (1UL << IRQ_VS_SOFT)) { 377 370 if (!test_and_set_bit(IRQ_VS_SOFT, ··· 568 561 569 562 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) 570 563 { 564 + void *nsh; 571 565 struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; 572 566 struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; 573 567 574 - csr_write(CSR_VSSTATUS, csr->vsstatus); 575 - csr_write(CSR_VSIE, csr->vsie); 576 - csr_write(CSR_VSTVEC, csr->vstvec); 577 - csr_write(CSR_VSSCRATCH, csr->vsscratch); 578 - csr_write(CSR_VSEPC, csr->vsepc); 579 - csr_write(CSR_VSCAUSE, csr->vscause); 580 - csr_write(CSR_VSTVAL, csr->vstval); 581 - csr_write(CSR_HEDELEG, cfg->hedeleg); 582 - csr_write(CSR_HVIP, csr->hvip); 583 - csr_write(CSR_VSATP, csr->vsatp); 584 - csr_write(CSR_HENVCFG, cfg->henvcfg); 585 - if (IS_ENABLED(CONFIG_32BIT)) 586 - csr_write(CSR_HENVCFGH, cfg->henvcfg >> 32); 587 - if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { 588 - 
csr_write(CSR_HSTATEEN0, cfg->hstateen0); 568 + if (kvm_riscv_nacl_sync_csr_available()) { 569 + nsh = nacl_shmem(); 570 + nacl_csr_write(nsh, CSR_VSSTATUS, csr->vsstatus); 571 + nacl_csr_write(nsh, CSR_VSIE, csr->vsie); 572 + nacl_csr_write(nsh, CSR_VSTVEC, csr->vstvec); 573 + nacl_csr_write(nsh, CSR_VSSCRATCH, csr->vsscratch); 574 + nacl_csr_write(nsh, CSR_VSEPC, csr->vsepc); 575 + nacl_csr_write(nsh, CSR_VSCAUSE, csr->vscause); 576 + nacl_csr_write(nsh, CSR_VSTVAL, csr->vstval); 577 + nacl_csr_write(nsh, CSR_HEDELEG, cfg->hedeleg); 578 + nacl_csr_write(nsh, CSR_HVIP, csr->hvip); 579 + nacl_csr_write(nsh, CSR_VSATP, csr->vsatp); 580 + nacl_csr_write(nsh, CSR_HENVCFG, cfg->henvcfg); 589 581 if (IS_ENABLED(CONFIG_32BIT)) 590 - csr_write(CSR_HSTATEEN0H, cfg->hstateen0 >> 32); 582 + nacl_csr_write(nsh, CSR_HENVCFGH, cfg->henvcfg >> 32); 583 + if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { 584 + nacl_csr_write(nsh, CSR_HSTATEEN0, cfg->hstateen0); 585 + if (IS_ENABLED(CONFIG_32BIT)) 586 + nacl_csr_write(nsh, CSR_HSTATEEN0H, cfg->hstateen0 >> 32); 587 + } 588 + } else { 589 + csr_write(CSR_VSSTATUS, csr->vsstatus); 590 + csr_write(CSR_VSIE, csr->vsie); 591 + csr_write(CSR_VSTVEC, csr->vstvec); 592 + csr_write(CSR_VSSCRATCH, csr->vsscratch); 593 + csr_write(CSR_VSEPC, csr->vsepc); 594 + csr_write(CSR_VSCAUSE, csr->vscause); 595 + csr_write(CSR_VSTVAL, csr->vstval); 596 + csr_write(CSR_HEDELEG, cfg->hedeleg); 597 + csr_write(CSR_HVIP, csr->hvip); 598 + csr_write(CSR_VSATP, csr->vsatp); 599 + csr_write(CSR_HENVCFG, cfg->henvcfg); 600 + if (IS_ENABLED(CONFIG_32BIT)) 601 + csr_write(CSR_HENVCFGH, cfg->henvcfg >> 32); 602 + if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN)) { 603 + csr_write(CSR_HSTATEEN0, cfg->hstateen0); 604 + if (IS_ENABLED(CONFIG_32BIT)) 605 + csr_write(CSR_HSTATEEN0H, cfg->hstateen0 >> 32); 606 + } 591 607 } 592 608 593 609 kvm_riscv_gstage_update_hgatp(vcpu); ··· 633 603 634 604 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) 635 605 
{ 606 + void *nsh; 636 607 struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; 637 608 638 609 vcpu->cpu = -1; ··· 649 618 vcpu->arch.isa); 650 619 kvm_riscv_vcpu_host_vector_restore(&vcpu->arch.host_context); 651 620 652 - csr->vsstatus = csr_read(CSR_VSSTATUS); 653 - csr->vsie = csr_read(CSR_VSIE); 654 - csr->vstvec = csr_read(CSR_VSTVEC); 655 - csr->vsscratch = csr_read(CSR_VSSCRATCH); 656 - csr->vsepc = csr_read(CSR_VSEPC); 657 - csr->vscause = csr_read(CSR_VSCAUSE); 658 - csr->vstval = csr_read(CSR_VSTVAL); 659 - csr->hvip = csr_read(CSR_HVIP); 660 - csr->vsatp = csr_read(CSR_VSATP); 621 + if (kvm_riscv_nacl_available()) { 622 + nsh = nacl_shmem(); 623 + csr->vsstatus = nacl_csr_read(nsh, CSR_VSSTATUS); 624 + csr->vsie = nacl_csr_read(nsh, CSR_VSIE); 625 + csr->vstvec = nacl_csr_read(nsh, CSR_VSTVEC); 626 + csr->vsscratch = nacl_csr_read(nsh, CSR_VSSCRATCH); 627 + csr->vsepc = nacl_csr_read(nsh, CSR_VSEPC); 628 + csr->vscause = nacl_csr_read(nsh, CSR_VSCAUSE); 629 + csr->vstval = nacl_csr_read(nsh, CSR_VSTVAL); 630 + csr->hvip = nacl_csr_read(nsh, CSR_HVIP); 631 + csr->vsatp = nacl_csr_read(nsh, CSR_VSATP); 632 + } else { 633 + csr->vsstatus = csr_read(CSR_VSSTATUS); 634 + csr->vsie = csr_read(CSR_VSIE); 635 + csr->vstvec = csr_read(CSR_VSTVEC); 636 + csr->vsscratch = csr_read(CSR_VSSCRATCH); 637 + csr->vsepc = csr_read(CSR_VSEPC); 638 + csr->vscause = csr_read(CSR_VSCAUSE); 639 + csr->vstval = csr_read(CSR_VSTVAL); 640 + csr->hvip = csr_read(CSR_HVIP); 641 + csr->vsatp = csr_read(CSR_VSATP); 642 + } 661 643 } 662 644 663 645 static void kvm_riscv_check_vcpu_requests(struct kvm_vcpu *vcpu) ··· 725 681 { 726 682 struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; 727 683 728 - csr_write(CSR_HVIP, csr->hvip); 684 + ncsr_write(CSR_HVIP, csr->hvip); 729 685 kvm_riscv_vcpu_aia_update_hvip(vcpu); 730 686 } 731 687 ··· 735 691 struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; 736 692 struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; 737 693 694 + vcpu->arch.host_scounteren 
= csr_swap(CSR_SCOUNTEREN, csr->scounteren); 738 695 vcpu->arch.host_senvcfg = csr_swap(CSR_SENVCFG, csr->senvcfg); 739 696 if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN) && 740 697 (cfg->hstateen0 & SMSTATEEN0_SSTATEEN0)) ··· 749 704 struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr; 750 705 struct kvm_vcpu_config *cfg = &vcpu->arch.cfg; 751 706 707 + csr->scounteren = csr_swap(CSR_SCOUNTEREN, vcpu->arch.host_scounteren); 752 708 csr->senvcfg = csr_swap(CSR_SENVCFG, vcpu->arch.host_senvcfg); 753 709 if (riscv_has_extension_unlikely(RISCV_ISA_EXT_SMSTATEEN) && 754 710 (cfg->hstateen0 & SMSTATEEN0_SSTATEEN0)) ··· 764 718 * This must be noinstr as instrumentation may make use of RCU, and this is not 765 719 * safe during the EQS. 766 720 */ 767 - static void noinstr kvm_riscv_vcpu_enter_exit(struct kvm_vcpu *vcpu) 721 + static void noinstr kvm_riscv_vcpu_enter_exit(struct kvm_vcpu *vcpu, 722 + struct kvm_cpu_trap *trap) 768 723 { 724 + void *nsh; 725 + struct kvm_cpu_context *gcntx = &vcpu->arch.guest_context; 726 + struct kvm_cpu_context *hcntx = &vcpu->arch.host_context; 727 + 728 + /* 729 + * We save trap CSRs (such as SEPC, SCAUSE, STVAL, HTVAL, and 730 + * HTINST) here because we do local_irq_enable() after this 731 + * function in kvm_arch_vcpu_ioctl_run() which can result in 732 + * an interrupt immediately after local_irq_enable() and can 733 + * potentially change trap CSRs. 
734 + */ 735 + 769 736 kvm_riscv_vcpu_swap_in_guest_state(vcpu); 770 737 guest_state_enter_irqoff(); 771 - __kvm_riscv_switch_to(&vcpu->arch); 738 + 739 + if (kvm_riscv_nacl_sync_sret_available()) { 740 + nsh = nacl_shmem(); 741 + 742 + if (kvm_riscv_nacl_autoswap_csr_available()) { 743 + hcntx->hstatus = 744 + nacl_csr_read(nsh, CSR_HSTATUS); 745 + nacl_scratch_write_long(nsh, 746 + SBI_NACL_SHMEM_AUTOSWAP_OFFSET + 747 + SBI_NACL_SHMEM_AUTOSWAP_HSTATUS, 748 + gcntx->hstatus); 749 + nacl_scratch_write_long(nsh, 750 + SBI_NACL_SHMEM_AUTOSWAP_OFFSET, 751 + SBI_NACL_SHMEM_AUTOSWAP_FLAG_HSTATUS); 752 + } else if (kvm_riscv_nacl_sync_csr_available()) { 753 + hcntx->hstatus = nacl_csr_swap(nsh, 754 + CSR_HSTATUS, gcntx->hstatus); 755 + } else { 756 + hcntx->hstatus = csr_swap(CSR_HSTATUS, gcntx->hstatus); 757 + } 758 + 759 + nacl_scratch_write_longs(nsh, 760 + SBI_NACL_SHMEM_SRET_OFFSET + 761 + SBI_NACL_SHMEM_SRET_X(1), 762 + &gcntx->ra, 763 + SBI_NACL_SHMEM_SRET_X_LAST); 764 + 765 + __kvm_riscv_nacl_switch_to(&vcpu->arch, SBI_EXT_NACL, 766 + SBI_EXT_NACL_SYNC_SRET); 767 + 768 + if (kvm_riscv_nacl_autoswap_csr_available()) { 769 + nacl_scratch_write_long(nsh, 770 + SBI_NACL_SHMEM_AUTOSWAP_OFFSET, 771 + 0); 772 + gcntx->hstatus = nacl_scratch_read_long(nsh, 773 + SBI_NACL_SHMEM_AUTOSWAP_OFFSET + 774 + SBI_NACL_SHMEM_AUTOSWAP_HSTATUS); 775 + } else { 776 + gcntx->hstatus = csr_swap(CSR_HSTATUS, hcntx->hstatus); 777 + } 778 + 779 + trap->htval = nacl_csr_read(nsh, CSR_HTVAL); 780 + trap->htinst = nacl_csr_read(nsh, CSR_HTINST); 781 + } else { 782 + hcntx->hstatus = csr_swap(CSR_HSTATUS, gcntx->hstatus); 783 + 784 + __kvm_riscv_switch_to(&vcpu->arch); 785 + 786 + gcntx->hstatus = csr_swap(CSR_HSTATUS, hcntx->hstatus); 787 + 788 + trap->htval = csr_read(CSR_HTVAL); 789 + trap->htinst = csr_read(CSR_HTINST); 790 + } 791 + 792 + trap->sepc = gcntx->sepc; 793 + trap->scause = csr_read(CSR_SCAUSE); 794 + trap->stval = csr_read(CSR_STVAL); 795 + 772 796 vcpu->arch.last_exit_cpu = 
vcpu->cpu; 773 797 guest_state_exit_irqoff(); 774 798 kvm_riscv_vcpu_swap_in_host_state(vcpu); ··· 955 839 956 840 guest_timing_enter_irqoff(); 957 841 958 - kvm_riscv_vcpu_enter_exit(vcpu); 842 + kvm_riscv_vcpu_enter_exit(vcpu, &trap); 959 843 960 844 vcpu->mode = OUTSIDE_GUEST_MODE; 961 845 vcpu->stat.exits++; 962 - 963 - /* 964 - * Save SCAUSE, STVAL, HTVAL, and HTINST because we might 965 - * get an interrupt between __kvm_riscv_switch_to() and 966 - * local_irq_enable() which can potentially change CSRs. 967 - */ 968 - trap.sepc = vcpu->arch.guest_context.sepc; 969 - trap.scause = csr_read(CSR_SCAUSE); 970 - trap.stval = csr_read(CSR_STVAL); 971 - trap.htval = csr_read(CSR_HTVAL); 972 - trap.htinst = csr_read(CSR_HTINST); 973 846 974 847 /* Syncup interrupts state with HW */ 975 848 kvm_riscv_vcpu_sync_interrupts(vcpu);
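
Editor's sketch: the hunks above replace several `csr_read()`/`csr_write()` calls with `ncsr_read()`/`ncsr_write()`. The definitions of those wrappers are not part of this diff, so the model below is an assumption about their intent: route the access to the shared-memory copy when the SYNC_CSR feature is available, and to the real CSR otherwise. `struct sim_cpu`, `sim_ncsr_write()` and `sim_ncsr_read()` are hypothetical names for a user-space simulation.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CSRS 8 /* hypothetical, tiny CSR space for the sketch */

struct sim_cpu {
	bool sync_csr_available;           /* models kvm_riscv_nacl_sync_csr_available() */
	unsigned long hw_csr[NUM_CSRS];    /* models the real CSR file */
	unsigned long shmem_csr[NUM_CSRS]; /* models CSR copies in NACL shared memory */
};

/* Write to shared memory when SYNC_CSR is available, else to the CSR. */
static void sim_ncsr_write(struct sim_cpu *c, unsigned int csr,
			   unsigned long val)
{
	assert(csr < NUM_CSRS);
	if (c->sync_csr_available)
		c->shmem_csr[csr] = val;
	else
		c->hw_csr[csr] = val;
}

/* Read from the same location that sim_ncsr_write() would have used. */
static unsigned long sim_ncsr_read(const struct sim_cpu *c, unsigned int csr)
{
	assert(csr < NUM_CSRS);
	return c->sync_csr_available ? c->shmem_csr[csr] : c->hw_csr[csr];
}
```

The point of the pattern is that callers such as the timer and interrupt-sync code do not need their own feature checks; the wrapper hides the choice.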

+7 -4

arch/riscv/kvm/vcpu_sbi.c

··· 486 486 struct kvm_vcpu_sbi_context *scontext = &vcpu->arch.sbi_context; 487 487 const struct kvm_riscv_sbi_extension_entry *entry; 488 488 const struct kvm_vcpu_sbi_extension *ext; 489 - int i; 489 + int idx, i; 490 490 491 491 for (i = 0; i < ARRAY_SIZE(sbi_ext); i++) { 492 492 entry = &sbi_ext[i]; 493 493 ext = entry->ext_ptr; 494 + idx = entry->ext_idx; 495 + 496 + if (idx < 0 || idx >= ARRAY_SIZE(scontext->ext_status)) 497 + continue; 494 498 495 499 if (ext->probe && !ext->probe(vcpu)) { 496 - scontext->ext_status[entry->ext_idx] = 497 - KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE; 500 + scontext->ext_status[idx] = KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE; 498 501 continue; 499 502 } 500 503 501 - scontext->ext_status[entry->ext_idx] = ext->default_disabled ? 504 + scontext->ext_status[idx] = ext->default_disabled ? 502 505 KVM_RISCV_SBI_EXT_STATUS_DISABLED : 503 506 KVM_RISCV_SBI_EXT_STATUS_ENABLED; 504 507 }
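
Editor's sketch: the hunk above adds a range check on `ext_idx` before it is used to index `ext_status[]`. A minimal user-space model of that guard follows; the `struct entry` layout and the `EXT_*` values are hypothetical and only illustrate the skip-on-out-of-range behaviour.

```c
#include <assert.h>
#include <stddef.h>

enum ext_status { EXT_UNINITIALIZED = 0, EXT_UNAVAILABLE, EXT_ENABLED }; /* hypothetical */

struct entry {
	int ext_idx;  /* index into the status array; may be out of range */
	int probe_ok; /* nonzero when the extension probe succeeded */
};

/*
 * Fill st[] from the entry table, skipping any entry whose index does
 * not fit the array. Returns the number of skipped entries.
 */
static size_t init_status(const struct entry *e, size_t n,
			  enum ext_status *st, size_t st_len)
{
	size_t skipped = 0;

	assert(e != NULL && st != NULL);
	for (size_t i = 0; i < n; i++) {
		int idx = e[i].ext_idx;

		if (idx < 0 || (size_t)idx >= st_len) {
			skipped++;
			continue;
		}
		st[idx] = e[i].probe_ok ? EXT_ENABLED : EXT_UNAVAILABLE;
	}
	return skipped;
}
```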

+87 -50

arch/riscv/kvm/vcpu_switch.S

··· 11 11 #include <asm/asm-offsets.h> 12 12 #include <asm/csr.h> 13 13 14 - .text 15 - .altmacro 16 - .option norelax 17 - 18 - SYM_FUNC_START(__kvm_riscv_switch_to) 14 + .macro SAVE_HOST_GPRS 19 15 /* Save Host GPRs (except A0 and T0-T6) */ 20 16 REG_S ra, (KVM_ARCH_HOST_RA)(a0) 21 17 REG_S sp, (KVM_ARCH_HOST_SP)(a0) ··· 36 40 REG_S s9, (KVM_ARCH_HOST_S9)(a0) 37 41 REG_S s10, (KVM_ARCH_HOST_S10)(a0) 38 42 REG_S s11, (KVM_ARCH_HOST_S11)(a0) 43 + .endm 39 44 45 + .macro SAVE_HOST_AND_RESTORE_GUEST_CSRS __resume_addr 40 46 /* Load Guest CSR values */ 41 47 REG_L t0, (KVM_ARCH_GUEST_SSTATUS)(a0) 42 - REG_L t1, (KVM_ARCH_GUEST_HSTATUS)(a0) 43 - REG_L t2, (KVM_ARCH_GUEST_SCOUNTEREN)(a0) 44 - la t4, .Lkvm_switch_return 45 - REG_L t5, (KVM_ARCH_GUEST_SEPC)(a0) 48 + la t1, \__resume_addr 49 + REG_L t2, (KVM_ARCH_GUEST_SEPC)(a0) 46 50 47 51 /* Save Host and Restore Guest SSTATUS */ 48 52 csrrw t0, CSR_SSTATUS, t0 49 53 50 - /* Save Host and Restore Guest HSTATUS */ 51 - csrrw t1, CSR_HSTATUS, t1 52 - 53 - /* Save Host and Restore Guest SCOUNTEREN */ 54 - csrrw t2, CSR_SCOUNTEREN, t2 55 - 56 54 /* Save Host STVEC and change it to return path */ 57 - csrrw t4, CSR_STVEC, t4 55 + csrrw t1, CSR_STVEC, t1 56 + 57 + /* Restore Guest SEPC */ 58 + csrw CSR_SEPC, t2 58 59 59 60 /* Save Host SSCRATCH and change it to struct kvm_vcpu_arch pointer */ 60 61 csrrw t3, CSR_SSCRATCH, a0 61 62 62 - /* Restore Guest SEPC */ 63 - csrw CSR_SEPC, t5 64 - 65 63 /* Store Host CSR values */ 66 64 REG_S t0, (KVM_ARCH_HOST_SSTATUS)(a0) 67 - REG_S t1, (KVM_ARCH_HOST_HSTATUS)(a0) 68 - REG_S t2, (KVM_ARCH_HOST_SCOUNTEREN)(a0) 65 + REG_S t1, (KVM_ARCH_HOST_STVEC)(a0) 69 66 REG_S t3, (KVM_ARCH_HOST_SSCRATCH)(a0) 70 - REG_S t4, (KVM_ARCH_HOST_STVEC)(a0) 67 + .endm 71 68 69 + .macro RESTORE_GUEST_GPRS 72 70 /* Restore Guest GPRs (except A0) */ 73 71 REG_L ra, (KVM_ARCH_GUEST_RA)(a0) 74 72 REG_L sp, (KVM_ARCH_GUEST_SP)(a0) ··· 97 107 98 108 /* Restore Guest A0 */ 99 109 REG_L a0, (KVM_ARCH_GUEST_A0)(a0) 
110 + .endm 100 111 101 - /* Resume Guest */ 102 - sret 103 - 104 - /* Back to Host */ 105 - .align 2 106 - .Lkvm_switch_return: 112 + .macro SAVE_GUEST_GPRS 107 113 /* Swap Guest A0 with SSCRATCH */ 108 114 csrrw a0, CSR_SSCRATCH, a0 109 115 ··· 134 148 REG_S t4, (KVM_ARCH_GUEST_T4)(a0) 135 149 REG_S t5, (KVM_ARCH_GUEST_T5)(a0) 136 150 REG_S t6, (KVM_ARCH_GUEST_T6)(a0) 151 + .endm 137 152 153 + .macro SAVE_GUEST_AND_RESTORE_HOST_CSRS 138 154 /* Load Host CSR values */ 139 - REG_L t1, (KVM_ARCH_HOST_STVEC)(a0) 140 - REG_L t2, (KVM_ARCH_HOST_SSCRATCH)(a0) 141 - REG_L t3, (KVM_ARCH_HOST_SCOUNTEREN)(a0) 142 - REG_L t4, (KVM_ARCH_HOST_HSTATUS)(a0) 143 - REG_L t5, (KVM_ARCH_HOST_SSTATUS)(a0) 144 - 145 - /* Save Guest SEPC */ 146 - csrr t0, CSR_SEPC 155 + REG_L t0, (KVM_ARCH_HOST_STVEC)(a0) 156 + REG_L t1, (KVM_ARCH_HOST_SSCRATCH)(a0) 157 + REG_L t2, (KVM_ARCH_HOST_SSTATUS)(a0) 147 158 148 159 /* Save Guest A0 and Restore Host SSCRATCH */ 149 - csrrw t2, CSR_SSCRATCH, t2 160 + csrrw t1, CSR_SSCRATCH, t1 161 + 162 + /* Save Guest SEPC */ 163 + csrr t3, CSR_SEPC 150 164 151 165 /* Restore Host STVEC */ 152 - csrw CSR_STVEC, t1 153 - 154 - /* Save Guest and Restore Host SCOUNTEREN */ 155 - csrrw t3, CSR_SCOUNTEREN, t3 156 - 157 - /* Save Guest and Restore Host HSTATUS */ 158 - csrrw t4, CSR_HSTATUS, t4 166 + csrw CSR_STVEC, t0 159 167 160 168 /* Save Guest and Restore Host SSTATUS */ 161 - csrrw t5, CSR_SSTATUS, t5 169 + csrrw t2, CSR_SSTATUS, t2 162 170 163 171 /* Store Guest CSR values */ 164 - REG_S t0, (KVM_ARCH_GUEST_SEPC)(a0) 165 - REG_S t2, (KVM_ARCH_GUEST_A0)(a0) 166 - REG_S t3, (KVM_ARCH_GUEST_SCOUNTEREN)(a0) 167 - REG_S t4, (KVM_ARCH_GUEST_HSTATUS)(a0) 168 - REG_S t5, (KVM_ARCH_GUEST_SSTATUS)(a0) 172 + REG_S t1, (KVM_ARCH_GUEST_A0)(a0) 173 + REG_S t2, (KVM_ARCH_GUEST_SSTATUS)(a0) 174 + REG_S t3, (KVM_ARCH_GUEST_SEPC)(a0) 175 + .endm 169 176 177 + .macro RESTORE_HOST_GPRS 170 178 /* Restore Host GPRs (except A0 and T0-T6) */ 171 179 REG_L ra, (KVM_ARCH_HOST_RA)(a0) 
172 180 REG_L sp, (KVM_ARCH_HOST_SP)(a0) ··· 185 205 REG_L s9, (KVM_ARCH_HOST_S9)(a0) 186 206 REG_L s10, (KVM_ARCH_HOST_S10)(a0) 187 207 REG_L s11, (KVM_ARCH_HOST_S11)(a0) 208 + .endm 209 + 210 + .text 211 + .altmacro 212 + .option norelax 213 + 214 + /* 215 + * Parameters: 216 + * A0 <= Pointer to struct kvm_vcpu_arch 217 + */ 218 + SYM_FUNC_START(__kvm_riscv_switch_to) 219 + SAVE_HOST_GPRS 220 + 221 + SAVE_HOST_AND_RESTORE_GUEST_CSRS .Lkvm_switch_return 222 + 223 + RESTORE_GUEST_GPRS 224 + 225 + /* Resume Guest using SRET */ 226 + sret 227 + 228 + /* Back to Host */ 229 + .align 2 230 + .Lkvm_switch_return: 231 + SAVE_GUEST_GPRS 232 + 233 + SAVE_GUEST_AND_RESTORE_HOST_CSRS 234 + 235 + RESTORE_HOST_GPRS 188 236 189 237 /* Return to C code */ 190 238 ret 191 239 SYM_FUNC_END(__kvm_riscv_switch_to) 240 + 241 + /* 242 + * Parameters: 243 + * A0 <= Pointer to struct kvm_vcpu_arch 244 + * A1 <= SBI extension ID 245 + * A2 <= SBI function ID 246 + */ 247 + SYM_FUNC_START(__kvm_riscv_nacl_switch_to) 248 + SAVE_HOST_GPRS 249 + 250 + SAVE_HOST_AND_RESTORE_GUEST_CSRS .Lkvm_nacl_switch_return 251 + 252 + /* Resume Guest using SBI nested acceleration */ 253 + add a6, a2, zero 254 + add a7, a1, zero 255 + ecall 256 + 257 + /* Back to Host */ 258 + .align 2 259 + .Lkvm_nacl_switch_return: 260 + SAVE_GUEST_GPRS 261 + 262 + SAVE_GUEST_AND_RESTORE_HOST_CSRS 263 + 264 + RESTORE_HOST_GPRS 265 + 266 + /* Return to C code */ 267 + ret 268 + SYM_FUNC_END(__kvm_riscv_nacl_switch_to) 192 269 193 270 SYM_CODE_START(__kvm_riscv_unpriv_trap) 194 271 /*

+14 -14

arch/riscv/kvm/vcpu_timer.c

··· 11 11 #include <linux/kvm_host.h> 12 12 #include <linux/uaccess.h> 13 13 #include <clocksource/timer-riscv.h> 14 - #include <asm/csr.h> 15 14 #include <asm/delay.h> 15 + #include <asm/kvm_nacl.h> 16 16 #include <asm/kvm_vcpu_timer.h> 17 17 18 18 static u64 kvm_riscv_current_cycles(struct kvm_guest_timer *gt) ··· 72 72 static int kvm_riscv_vcpu_update_vstimecmp(struct kvm_vcpu *vcpu, u64 ncycles) 73 73 { 74 74 #if defined(CONFIG_32BIT) 75 - csr_write(CSR_VSTIMECMP, ncycles & 0xFFFFFFFF); 76 - csr_write(CSR_VSTIMECMPH, ncycles >> 32); 75 + ncsr_write(CSR_VSTIMECMP, ncycles & 0xFFFFFFFF); 76 + ncsr_write(CSR_VSTIMECMPH, ncycles >> 32); 77 77 #else 78 - csr_write(CSR_VSTIMECMP, ncycles); 78 + ncsr_write(CSR_VSTIMECMP, ncycles); 79 79 #endif 80 - return 0; 80 + return 0; 81 81 } 82 82 83 83 static int kvm_riscv_vcpu_update_hrtimer(struct kvm_vcpu *vcpu, u64 ncycles) ··· 289 289 struct kvm_guest_timer *gt = &vcpu->kvm->arch.timer; 290 290 291 291 #if defined(CONFIG_32BIT) 292 - csr_write(CSR_HTIMEDELTA, (u32)(gt->time_delta)); 293 - csr_write(CSR_HTIMEDELTAH, (u32)(gt->time_delta >> 32)); 292 + ncsr_write(CSR_HTIMEDELTA, (u32)(gt->time_delta)); 293 + ncsr_write(CSR_HTIMEDELTAH, (u32)(gt->time_delta >> 32)); 294 294 #else 295 - csr_write(CSR_HTIMEDELTA, gt->time_delta); 295 + ncsr_write(CSR_HTIMEDELTA, gt->time_delta); 296 296 #endif 297 297 } 298 298 ··· 306 306 return; 307 307 308 308 #if defined(CONFIG_32BIT) 309 - csr_write(CSR_VSTIMECMP, (u32)t->next_cycles); 310 - csr_write(CSR_VSTIMECMPH, (u32)(t->next_cycles >> 32)); 309 + ncsr_write(CSR_VSTIMECMP, (u32)t->next_cycles); 310 + ncsr_write(CSR_VSTIMECMPH, (u32)(t->next_cycles >> 32)); 311 311 #else 312 - csr_write(CSR_VSTIMECMP, t->next_cycles); 312 + ncsr_write(CSR_VSTIMECMP, t->next_cycles); 313 313 #endif 314 314 315 315 /* timer should be enabled for the remaining operations */ ··· 327 327 return; 328 328 329 329 #if defined(CONFIG_32BIT) 330 - t->next_cycles = csr_read(CSR_VSTIMECMP); 331 - t->next_cycles |= 
(u64)csr_read(CSR_VSTIMECMPH) << 32; 330 + t->next_cycles = ncsr_read(CSR_VSTIMECMP); 331 + t->next_cycles |= (u64)ncsr_read(CSR_VSTIMECMPH) << 32; 332 332 #else 333 - t->next_cycles = csr_read(CSR_VSTIMECMP); 333 + t->next_cycles = ncsr_read(CSR_VSTIMECMP); 334 334 #endif 335 335 } 336 336
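
Editor's sketch: under `CONFIG_32BIT` the timer hunks above write a 64-bit value as a low half and a high half (`VSTIMECMP`/`VSTIMECMPH`, `HTIMEDELTA`/`HTIMEDELTAH`) and read it back by shifting the high half up by 32. The helpers below show the same arithmetic in plain C; `split_u64()` and `join_u64()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 64-bit value into its low and high 32-bit halves. */
static void split_u64(uint64_t v, uint32_t *lo, uint32_t *hi)
{
	assert(lo != NULL && hi != NULL);
	*lo = (uint32_t)v;         /* same as v & 0xFFFFFFFF */
	*hi = (uint32_t)(v >> 32);
}

/* Reassemble the value; the high half must be widened before shifting. */
static uint64_t join_u64(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

The cast to a 64-bit type before the shift matters: shifting a 32-bit value left by 32 is undefined in C, which is why the read-back path in the diff casts to `u64` first.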

+1

arch/s390/include/asm/kvm_host.h

···
356 356 #define ECD_MEF 0x08000000
357 357 #define ECD_ETOKENF 0x02000000
358 358 #define ECD_ECC 0x00200000
359 + #define ECD_HMAC 0x00004000
359 360 __u32 ecd; /* 0x01c8 */
360 361 __u8 reserved1cc[18]; /* 0x01cc */
361 362 __u64 pp; /* 0x01de */

+2 -1

arch/s390/include/uapi/asm/kvm.h

···
469 469 __u8 kdsa[16]; /* with MSA9 */
470 470 __u8 sortl[32]; /* with STFLE.150 */
471 471 __u8 dfltcc[32]; /* with STFLE.151 */
472 - __u8 reserved[1728];
472 + __u8 pfcr[16]; /* with STFLE.201 */
473 + __u8 reserved[1712];
473 474 };
474 475
475 476 #define KVM_S390_VM_CPU_PROCESSOR_UV_FEAT_GUEST 6
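
Editor's sketch: the hunk above carves the new 16-byte `pfcr` field out of the reserved tail (1728 - 16 = 1712), so the overall structure size, which is part of the userspace ABI, should stay the same. The two structs below model only the tail of the real structure; `struct tail_before` and `struct tail_after` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tail of the structure before the change (model only). */
struct tail_before {
	uint8_t dfltcc[32];
	uint8_t reserved[1728];
};

/* Tail after the change: pfcr takes 16 bytes from the reserved area. */
struct tail_after {
	uint8_t dfltcc[32];
	uint8_t pfcr[16];
	uint8_t reserved[1712];
};

/* The size must not change, or existing userspace would break. */
static_assert(sizeof(struct tail_before) == sizeof(struct tail_after),
	      "adding pfcr must not change the structure size");

/* The new field starts exactly where the old reserved area started. */
static_assert(offsetof(struct tail_after, pfcr) ==
	      offsetof(struct tail_before, reserved),
	      "pfcr must occupy the start of the old reserved area");
```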

+41 -2

arch/s390/kvm/kvm-s390.c

··· 348 348 return CC_TRANSFORM(cc) == 0; 349 349 } 350 350 351 + static __always_inline void pfcr_query(u8 (*query)[16]) 352 + { 353 + asm volatile( 354 + " lghi 0,0\n" 355 + " .insn rsy,0xeb0000000016,0,0,%[query]\n" 356 + : [query] "=QS" (*query) 357 + : 358 + : "cc", "0"); 359 + } 360 + 351 361 static __always_inline void __sortl_query(u8 (*query)[32]) 352 362 { 353 363 asm volatile( ··· 438 428 439 429 if (test_facility(151)) /* DFLTCC */ 440 430 __dfltcc_query(&kvm_s390_available_subfunc.dfltcc); 431 + 432 + if (test_facility(201)) /* PFCR */ 433 + pfcr_query(&kvm_s390_available_subfunc.pfcr); 441 434 442 435 if (MACHINE_HAS_ESOP) 443 436 allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ESOP); ··· 811 798 if (test_facility(192)) { 812 799 set_kvm_facility(kvm->arch.model.fac_mask, 192); 813 800 set_kvm_facility(kvm->arch.model.fac_list, 192); 801 + } 802 + if (test_facility(198)) { 803 + set_kvm_facility(kvm->arch.model.fac_mask, 198); 804 + set_kvm_facility(kvm->arch.model.fac_list, 198); 805 + } 806 + if (test_facility(199)) { 807 + set_kvm_facility(kvm->arch.model.fac_mask, 199); 808 + set_kvm_facility(kvm->arch.model.fac_list, 199); 814 809 } 815 810 r = 0; 816 811 } else ··· 1564 1543 ((unsigned long *) &kvm->arch.model.subfuncs.dfltcc)[1], 1565 1544 ((unsigned long *) &kvm->arch.model.subfuncs.dfltcc)[2], 1566 1545 ((unsigned long *) &kvm->arch.model.subfuncs.dfltcc)[3]); 1546 + VM_EVENT(kvm, 3, "GET: guest PFCR subfunc 0x%16.16lx.%16.16lx", 1547 + ((unsigned long *) &kvm_s390_available_subfunc.pfcr)[0], 1548 + ((unsigned long *) &kvm_s390_available_subfunc.pfcr)[1]); 1567 1549 1568 1550 return 0; 1569 1551 } ··· 1781 1757 ((unsigned long *) &kvm->arch.model.subfuncs.dfltcc)[1], 1782 1758 ((unsigned long *) &kvm->arch.model.subfuncs.dfltcc)[2], 1783 1759 ((unsigned long *) &kvm->arch.model.subfuncs.dfltcc)[3]); 1760 + VM_EVENT(kvm, 3, "GET: guest PFCR subfunc 0x%16.16lx.%16.16lx", 1761 + ((unsigned long *) &kvm_s390_available_subfunc.pfcr)[0], 1762 + ((unsigned long 
*) &kvm_s390_available_subfunc.pfcr)[1]); 1784 1763 1785 1764 return 0; 1786 1765 } ··· 1852 1825 ((unsigned long *) &kvm_s390_available_subfunc.dfltcc)[1], 1853 1826 ((unsigned long *) &kvm_s390_available_subfunc.dfltcc)[2], 1854 1827 ((unsigned long *) &kvm_s390_available_subfunc.dfltcc)[3]); 1828 + VM_EVENT(kvm, 3, "GET: host PFCR subfunc 0x%16.16lx.%16.16lx", 1829 + ((unsigned long *) &kvm_s390_available_subfunc.pfcr)[0], 1830 + ((unsigned long *) &kvm_s390_available_subfunc.pfcr)[1]); 1855 1831 1856 1832 return 0; 1857 1833 } ··· 3799 3769 3800 3770 } 3801 3771 3772 + static bool kvm_has_pckmo_hmac(struct kvm *kvm) 3773 + { 3774 + /* At least one HMAC subfunction must be present */ 3775 + return kvm_has_pckmo_subfunc(kvm, 118) || 3776 + kvm_has_pckmo_subfunc(kvm, 122); 3777 + } 3778 + 3802 3779 static void kvm_s390_vcpu_crypto_setup(struct kvm_vcpu *vcpu) 3803 3780 { 3804 3781 /* ··· 3818 3781 vcpu->arch.sie_block->crycbd = vcpu->kvm->arch.crypto.crycbd; 3819 3782 vcpu->arch.sie_block->ecb3 &= ~(ECB3_AES | ECB3_DEA); 3820 3783 vcpu->arch.sie_block->eca &= ~ECA_APIE; 3821 - vcpu->arch.sie_block->ecd &= ~ECD_ECC; 3784 + vcpu->arch.sie_block->ecd &= ~(ECD_ECC | ECD_HMAC); 3822 3785 3823 3786 if (vcpu->kvm->arch.crypto.apie) 3824 3787 vcpu->arch.sie_block->eca |= ECA_APIE; ··· 3826 3789 /* Set up protected key support */ 3827 3790 if (vcpu->kvm->arch.crypto.aes_kw) { 3828 3791 vcpu->arch.sie_block->ecb3 |= ECB3_AES; 3829 - /* ecc is also wrapped with AES key */ 3792 + /* ecc/hmac is also wrapped with AES key */ 3830 3793 if (kvm_has_pckmo_ecc(vcpu->kvm)) 3831 3794 vcpu->arch.sie_block->ecd |= ECD_ECC; 3795 + if (kvm_has_pckmo_hmac(vcpu->kvm)) 3796 + vcpu->arch.sie_block->ecd |= ECD_HMAC; 3832 3797 } 3833 3798 3834 3799 if (vcpu->kvm->arch.crypto.dea_kw)
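
Editor's sketch: `kvm_s390_vcpu_crypto_setup()` in the hunk above first clears both `ECD_ECC` and `ECD_HMAC`, then sets each one again only when AES key wrapping is enabled and the matching PCKMO subfunctions exist. The function below models just that bit logic. The flag values are copied from the diff; `setup_ecd()` and its boolean parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ECD_ECC  0x00200000u /* value as shown in the diff */
#define ECD_HMAC 0x00004000u /* value as shown in the diff */

static_assert((ECD_ECC & ECD_HMAC) == 0, "flags must not overlap");

/*
 * Clear both protected-key flags, then re-enable each one only when
 * AES key wrapping is on and the corresponding capability is present.
 */
static uint32_t setup_ecd(uint32_t ecd, bool aes_kw, bool has_ecc,
			  bool has_hmac)
{
	ecd &= ~(ECD_ECC | ECD_HMAC);
	if (aes_kw) {
		if (has_ecc)
			ecd |= ECD_ECC;
		if (has_hmac)
			ecd |= ECD_HMAC;
	}
	return ecd;
}
```

Clearing before setting means a stale flag from an earlier configuration cannot survive a later change of the wrapping-key settings.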

+4 -3

arch/s390/kvm/vsie.c

··· 335 335 /* we may only allow it if enabled for guest 2 */ 336 336 ecb3_flags = scb_o->ecb3 & vcpu->arch.sie_block->ecb3 & 337 337 (ECB3_AES | ECB3_DEA); 338 - ecd_flags = scb_o->ecd & vcpu->arch.sie_block->ecd & ECD_ECC; 338 + ecd_flags = scb_o->ecd & vcpu->arch.sie_block->ecd & 339 + (ECD_ECC | ECD_HMAC); 339 340 if (!ecb3_flags && !ecd_flags) 340 341 goto end; 341 342 ··· 662 661 struct page *page; 663 662 664 663 page = gfn_to_page(kvm, gpa_to_gfn(gpa)); 665 - if (is_error_page(page)) 664 + if (!page) 666 665 return -EINVAL; 667 666 *hpa = (hpa_t)page_to_phys(page) + (gpa & ~PAGE_MASK); 668 667 return 0; ··· 671 670 /* Unpins a page previously pinned via pin_guest_page, marking it as dirty. */ 672 671 static void unpin_guest_page(struct kvm *kvm, gpa_t gpa, hpa_t hpa) 673 672 { 674 - kvm_release_pfn_dirty(hpa >> PAGE_SHIFT); 673 + kvm_release_page_dirty(pfn_to_page(hpa >> PAGE_SHIFT)); 675 674 /* mark the page always as dirty for migration */ 676 675 mark_page_dirty(kvm, gpa_to_gfn(gpa)); 677 676 }
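
Editor's sketch: `pin_guest_page()` above builds the host physical address from the page's physical base plus the offset of the guest address inside its page, and `unpin_guest_page()` recovers the page frame number by shifting. The arithmetic is modelled below with 4 KiB pages; the `SK_` constants and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12                     /* 4 KiB pages */
#define SK_PAGE_SIZE  (1ULL << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))  /* keeps the page-base bits */

static_assert(SK_PAGE_SIZE == 4096, "sketch assumes 4 KiB pages");

/* Host physical address = page base + offset of gpa within its page. */
static uint64_t to_hpa(uint64_t page_phys, uint64_t gpa)
{
	return page_phys + (gpa & ~SK_PAGE_MASK);
}

/* Page frame number of a host physical address. */
static uint64_t hpa_to_pfn(uint64_t hpa)
{
	return hpa >> SK_PAGE_SHIFT;
}
```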

+2

arch/s390/tools/gen_facilities.c

···
109 109 15, /* AP Facilities Test */
110 110 156, /* etoken facility */
111 111 165, /* nnpa facility */
112 + 170, /* ineffective-nonconstrained-transaction facility */
112 113 193, /* bear enhancement facility */
113 114 194, /* rdp enhancement facility */
114 115 196, /* processor activity instrumentation facility */
115 116 197, /* processor activity instrumentation extension 1 */
117 + 201, /* concurrent-functions facility */
116 118 -1 /* END */
117 119 }
118 120 },

+3

arch/x86/include/asm/cpufeatures.h

···
317 317 #define X86_FEATURE_ZEN1 (11*32+31) /* CPU based on Zen1 microarchitecture */
318 318
319 319 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
320 + #define X86_FEATURE_SHA512 (12*32+ 0) /* SHA512 instructions */
321 + #define X86_FEATURE_SM3 (12*32+ 1) /* SM3 instructions */
322 + #define X86_FEATURE_SM4 (12*32+ 2) /* SM4 instructions */
320 323 #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* "avx_vnni" AVX VNNI instructions */
321 324 #define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* "avx512_bf16" AVX512 BFLOAT16 instructions */
322 325 #define X86_FEATURE_CMPCCXADD (12*32+ 7) /* CMPccXADD instructions */
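
Editor's sketch: each `X86_FEATURE_*` value above is written as `word*32 + bit`, so the new entries sit in word 12 at bits 0, 1 and 2. The helpers below decode such a value back into its word, bit and bit mask. The `SK_` names and helper functions are hypothetical; only the `word*32 + bit` encoding is taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

#define SK_FEATURE(word, bit) ((word) * 32 + (bit))

#define SK_FEATURE_SHA512 SK_FEATURE(12, 0)
#define SK_FEATURE_SM3    SK_FEATURE(12, 1)
#define SK_FEATURE_SM4    SK_FEATURE(12, 2)

static_assert(SK_FEATURE_SHA512 == 384, "word 12 starts at 12 * 32");

/* Index of the 32-bit capability word that holds the feature. */
static unsigned int feature_word(unsigned int f)
{
	return f / 32;
}

/* Bit position of the feature within its word. */
static unsigned int feature_bit(unsigned int f)
{
	return f % 32;
}

/* Mask to test or set the feature inside its word. */
static uint32_t feature_mask(unsigned int f)
{
	return UINT32_C(1) << feature_bit(f);
}
```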

+1

arch/x86/include/asm/kvm-x86-ops.h

···
34 34 KVM_X86_OP(get_segment_base)
35 35 KVM_X86_OP(get_segment)
36 36 KVM_X86_OP(get_cpl)
37 + KVM_X86_OP(get_cpl_no_cache)
37 38 KVM_X86_OP(set_segment)
38 39 KVM_X86_OP(get_cs_db_l_bits)
39 40 KVM_X86_OP(is_valid_cr0)

+8 -5

arch/x86/include/asm/kvm_host.h

··· 26 26 #include <linux/irqbypass.h> 27 27 #include <linux/hyperv.h> 28 28 #include <linux/kfifo.h> 29 + #include <linux/sched/vhost_task.h> 29 30 30 31 #include <asm/apic.h> 31 32 #include <asm/pvclock-abi.h> ··· 1307 1306 bool pre_fault_allowed; 1308 1307 struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; 1309 1308 struct list_head active_mmu_pages; 1310 - struct list_head zapped_obsolete_pages; 1311 1309 /* 1312 1310 * A list of kvm_mmu_page structs that, if zapped, could possibly be 1313 1311 * replaced by an NX huge page. A shadow page is on this list if its ··· 1443 1443 bool sgx_provisioning_allowed; 1444 1444 1445 1445 struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter; 1446 - struct task_struct *nx_huge_page_recovery_thread; 1446 + struct vhost_task *nx_huge_page_recovery_thread; 1447 + u64 nx_huge_page_last; 1447 1448 1448 1449 #ifdef CONFIG_X86_64 1449 1450 /* The number of TDP MMU pages across all roots. */ ··· 1657 1656 void (*get_segment)(struct kvm_vcpu *vcpu, 1658 1657 struct kvm_segment *var, int seg); 1659 1658 int (*get_cpl)(struct kvm_vcpu *vcpu); 1659 + int (*get_cpl_no_cache)(struct kvm_vcpu *vcpu); 1660 1660 void (*set_segment)(struct kvm_vcpu *vcpu, 1661 1661 struct kvm_segment *var, int seg); 1662 1662 void (*get_cs_db_l_bits)(struct kvm_vcpu *vcpu, int *db, int *l); ··· 1957 1955 const struct kvm_memory_slot *memslot, 1958 1956 u64 start, u64 end, 1959 1957 int target_level); 1960 - void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, 1961 - const struct kvm_memory_slot *memslot); 1958 + void kvm_mmu_recover_huge_pages(struct kvm *kvm, 1959 + const struct kvm_memory_slot *memslot); 1962 1960 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm, 1963 1961 const struct kvm_memory_slot *memslot); 1964 1962 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen); ··· 2361 2359 KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT | \ 2362 2360 KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \ 2363 2361 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS | \ 2364 - 
KVM_X86_QUIRK_SLOT_ZAP_ALL) 2362 + KVM_X86_QUIRK_SLOT_ZAP_ALL | \ 2363 + KVM_X86_QUIRK_STUFF_FEATURE_MSRS) 2365 2364 2366 2365 /* 2367 2366 * KVM previously used a u32 field in kvm_run to indicate the hypercall was

+1

arch/x86/include/uapi/asm/kvm.h

··· 440 440 #define KVM_X86_QUIRK_FIX_HYPERCALL_INSN (1 << 5) 441 441 #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS (1 << 6) 442 442 #define KVM_X86_QUIRK_SLOT_ZAP_ALL (1 << 7) 443 + #define KVM_X86_QUIRK_STUFF_FEATURE_MSRS (1 << 8) 443 444 444 445 #define KVM_STATE_NESTED_FORMAT_VMX 0 445 446 #define KVM_STATE_NESTED_FORMAT_SVM 1
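
Note on the hunk above: each quirk is a single bit, so the new KVM_X86_QUIRK_STUFF_FEATURE_MSRS takes the next free position (bit 8) and is OR-ed into the mask of quirks in the header hunk earlier. A minimal sketch of how such a bit might be tested follows. The helper name quirk_enabled() and the disabled_quirks parameter are hypothetical, and the assumption that a quirk counts as enabled until its bit is set in a disabled-quirks mask is not confirmed by this part of the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the hunk above. */
#define KVM_X86_QUIRK_SLOT_ZAP_ALL        (1 << 7)
#define KVM_X86_QUIRK_STUFF_FEATURE_MSRS  (1 << 8)

/*
 * Hypothetical helper (not from the diff): a quirk is treated as enabled
 * unless its bit is present in the mask of disabled quirks.
 */
static bool quirk_enabled(uint64_t disabled_quirks, uint64_t quirk)
{
    /* Precondition: exactly one bit is set in 'quirk'. */
    assert(quirk != 0 && (quirk & (quirk - 1)) == 0);

    return !(disabled_quirks & quirk);
}
```

Disabling one quirk leaves the others untouched because the bits do not overlap.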

+4 -2

arch/x86/kvm/Kconfig

··· 18 18 if VIRTUALIZATION 19 19 20 20 config KVM_X86 21 - def_tristate KVM if KVM_INTEL || KVM_AMD 22 - depends on X86_LOCAL_APIC 21 + def_tristate KVM if (KVM_INTEL != n || KVM_AMD != n) 23 22 select KVM_COMMON 24 23 select KVM_GENERIC_MMU_NOTIFIER 24 + select KVM_ELIDE_TLB_FLUSH_IF_YOUNG 25 25 select HAVE_KVM_IRQCHIP 26 26 select HAVE_KVM_PFNCACHE 27 27 select HAVE_KVM_DIRTY_RING_TSO ··· 29 29 select HAVE_KVM_IRQ_BYPASS 30 30 select HAVE_KVM_IRQ_ROUTING 31 31 select HAVE_KVM_READONLY_MEM 32 + select VHOST_TASK 32 33 select KVM_ASYNC_PF 33 34 select USER_RETURN_NOTIFIER 34 35 select KVM_MMIO ··· 50 49 51 50 config KVM 52 51 tristate "Kernel-based Virtual Machine (KVM) support" 52 + depends on X86_LOCAL_APIC 53 53 help 54 54 Support hosting fully virtualized guest machines using hardware 55 55 virtualization extensions. You will need a fairly recent

+14 -8

arch/x86/kvm/cpuid.c

··· 690 690 kvm_cpu_cap_set(X86_FEATURE_TSC_ADJUST); 691 691 kvm_cpu_cap_set(X86_FEATURE_ARCH_CAPABILITIES); 692 692 693 - if (boot_cpu_has(X86_FEATURE_IBPB) && boot_cpu_has(X86_FEATURE_IBRS)) 693 + if (boot_cpu_has(X86_FEATURE_AMD_IBPB_RET) && 694 + boot_cpu_has(X86_FEATURE_AMD_IBPB) && 695 + boot_cpu_has(X86_FEATURE_AMD_IBRS)) 694 696 kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL); 695 697 if (boot_cpu_has(X86_FEATURE_STIBP)) 696 698 kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP); ··· 700 698 kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD); 701 699 702 700 kvm_cpu_cap_mask(CPUID_7_1_EAX, 703 - F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) | 704 - F(FZRM) | F(FSRS) | F(FSRC) | 705 - F(AMX_FP16) | F(AVX_IFMA) | F(LAM) 701 + F(SHA512) | F(SM3) | F(SM4) | F(AVX_VNNI) | F(AVX512_BF16) | 702 + F(CMPCCXADD) | F(FZRM) | F(FSRS) | F(FSRC) | F(AMX_FP16) | 703 + F(AVX_IFMA) | F(LAM) 706 704 ); 707 705 708 706 kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX, 709 - F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) | 710 - F(AMX_COMPLEX) | F(AVX10) 707 + F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(AMX_COMPLEX) | 708 + F(AVX_VNNI_INT16) | F(PREFETCHITI) | F(AVX10) 711 709 ); 712 710 713 711 kvm_cpu_cap_init_kvm_defined(CPUID_7_2_EDX, ··· 757 755 F(CLZERO) | F(XSAVEERPTR) | 758 756 F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | F(VIRT_SSBD) | 759 757 F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON) | 760 - F(AMD_PSFD) 758 + F(AMD_PSFD) | F(AMD_IBPB_RET) 761 759 ); 762 760 763 761 /* ··· 765 763 * arch/x86/kernel/cpu/bugs.c is kind enough to 766 764 * record that in cpufeatures so use them. 
767 765 */ 768 - if (boot_cpu_has(X86_FEATURE_IBPB)) 766 + if (boot_cpu_has(X86_FEATURE_IBPB)) { 769 767 kvm_cpu_cap_set(X86_FEATURE_AMD_IBPB); 768 + if (boot_cpu_has(X86_FEATURE_SPEC_CTRL) && 769 + !boot_cpu_has_bug(X86_BUG_EIBRS_PBRSB)) 770 + kvm_cpu_cap_set(X86_FEATURE_AMD_IBPB_RET); 771 + } 770 772 if (boot_cpu_has(X86_FEATURE_IBRS)) 771 773 kvm_cpu_cap_set(X86_FEATURE_AMD_IBRS); 772 774 if (boot_cpu_has(X86_FEATURE_STIBP))

-1

arch/x86/kvm/cpuid.h

··· 2 2 #ifndef ARCH_X86_KVM_CPUID_H 3 3 #define ARCH_X86_KVM_CPUID_H 4 4 5 - #include "x86.h" 6 5 #include "reverse_cpuid.h" 7 6 #include <asm/cpu.h> 8 7 #include <asm/processor.h>

+9 -6

arch/x86/kvm/emulate.c

··· 651 651 } 652 652 653 653 static inline bool emul_is_noncanonical_address(u64 la, 654 - struct x86_emulate_ctxt *ctxt) 654 + struct x86_emulate_ctxt *ctxt, 655 + unsigned int flags) 655 656 { 656 - return !__is_canonical_address(la, ctxt_virt_addr_bits(ctxt)); 657 + return !ctxt->ops->is_canonical_addr(ctxt, la, flags); 657 658 } 658 659 659 660 /* ··· 1734 1733 if (ret != X86EMUL_CONTINUE) 1735 1734 return ret; 1736 1735 if (emul_is_noncanonical_address(get_desc_base(&seg_desc) | 1737 - ((u64)base3 << 32), ctxt)) 1736 + ((u64)base3 << 32), ctxt, 1737 + X86EMUL_F_DT_LOAD)) 1738 1738 return emulate_gp(ctxt, err_code); 1739 1739 } 1740 1740 ··· 2518 2516 ss_sel = cs_sel + 8; 2519 2517 cs.d = 0; 2520 2518 cs.l = 1; 2521 - if (emul_is_noncanonical_address(rcx, ctxt) || 2522 - emul_is_noncanonical_address(rdx, ctxt)) 2519 + if (emul_is_noncanonical_address(rcx, ctxt, 0) || 2520 + emul_is_noncanonical_address(rdx, ctxt, 0)) 2523 2521 return emulate_gp(ctxt, 0); 2524 2522 break; 2525 2523 } ··· 3496 3494 if (rc != X86EMUL_CONTINUE) 3497 3495 return rc; 3498 3496 if (ctxt->mode == X86EMUL_MODE_PROT64 && 3499 - emul_is_noncanonical_address(desc_ptr.address, ctxt)) 3497 + emul_is_noncanonical_address(desc_ptr.address, ctxt, 3498 + X86EMUL_F_DT_LOAD)) 3500 3499 return emulate_gp(ctxt, 0); 3501 3500 if (lgdt) 3502 3501 ctxt->ops->set_gdt(ctxt, &desc_ptr);

+17

arch/x86/kvm/kvm_cache_regs.h

··· 44 44 #endif 45 45 46 46 /* 47 + * Using the register cache from interrupt context is generally not allowed, as 48 + * caching a register and marking it available/dirty can't be done atomically, 49 + * i.e. accesses from interrupt context may clobber state or read stale data if 50 + * the vCPU task is in the process of updating the cache. The exception is if 51 + * KVM is handling a PMI IRQ/NMI VM-Exit, as that bound code sequence doesn't 52 + * touch the cache, it runs after the cache is reset (post VM-Exit), and PMIs 53 + * need to access several registers that are cacheable. 54 + */ 55 + #define kvm_assert_register_caching_allowed(vcpu) \ 56 + lockdep_assert_once(in_task() || kvm_arch_pmi_in_guest(vcpu)) 57 + 58 + /* 47 59 * avail dirty 48 60 * 0 0 register in VMCS/VMCB 49 61 * 0 1 *INVALID* ··· 65 53 static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu, 66 54 enum kvm_reg reg) 67 55 { 56 + kvm_assert_register_caching_allowed(vcpu); 68 57 return test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail); 69 58 } 70 59 71 60 static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu, 72 61 enum kvm_reg reg) 73 62 { 63 + kvm_assert_register_caching_allowed(vcpu); 74 64 return test_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty); 75 65 } 76 66 77 67 static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu, 78 68 enum kvm_reg reg) 79 69 { 70 + kvm_assert_register_caching_allowed(vcpu); 80 71 __set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail); 81 72 } 82 73 83 74 static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu, 84 75 enum kvm_reg reg) 85 76 { 77 + kvm_assert_register_caching_allowed(vcpu); 86 78 __set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail); 87 79 __set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty); 88 80 } ··· 100 84 static __always_inline bool kvm_register_test_and_mark_available(struct kvm_vcpu *vcpu, 101 85 enum kvm_reg reg) 102 86 { 87 + kvm_assert_register_caching_allowed(vcpu); 103 88 return 
arch___test_and_set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail); 104 89 } 105 90
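
Note on the hunk above: the helpers maintain two bitmaps, regs_avail and regs_dirty, and the comment table marks "avail = 0, dirty = 1" as invalid. The userspace model below is only a sketch of that invariant. The struct and function names are hypothetical, the bitmaps are assumed to be 32 bits wide, and none of the lockdep assertion logic is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two per-vCPU bitmaps. */
struct reg_cache {
    uint32_t avail;  /* register value has been read into the cache */
    uint32_t dirty;  /* cached value was modified and must be written back */
};

static void mark_available(struct reg_cache *c, unsigned int reg)
{
    assert(reg < 32);
    c->avail |= 1u << reg;
}

/* Marking a register dirty also marks it available, as in the diff. */
static void mark_dirty(struct reg_cache *c, unsigned int reg)
{
    assert(reg < 32);
    c->avail |= 1u << reg;
    c->dirty |= 1u << reg;
}

static bool is_available(const struct reg_cache *c, unsigned int reg)
{
    return c->avail & (1u << reg);
}

static bool is_dirty(const struct reg_cache *c, unsigned int reg)
{
    return c->dirty & (1u << reg);
}

/* The "dirty but not available" combination must never occur. */
static bool state_is_valid(const struct reg_cache *c)
{
    return (c->dirty & ~c->avail) == 0;
}
```

Because the two bitmaps are updated by separate non-atomic stores, an interrupt arriving between them could observe an inconsistent pair, which is the situation the new kvm_assert_register_caching_allowed() check is meant to flag.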

+5

arch/x86/kvm/kvm_emulate.h

··· 94 94 #define X86EMUL_F_FETCH BIT(1) 95 95 #define X86EMUL_F_IMPLICIT BIT(2) 96 96 #define X86EMUL_F_INVLPG BIT(3) 97 + #define X86EMUL_F_MSR BIT(4) 98 + #define X86EMUL_F_DT_LOAD BIT(5) 97 99 98 100 struct x86_emulate_ops { 99 101 void (*vm_bugged)(struct x86_emulate_ctxt *ctxt); ··· 237 235 238 236 gva_t (*get_untagged_addr)(struct x86_emulate_ctxt *ctxt, gva_t addr, 239 237 unsigned int flags); 238 + 239 + bool (*is_canonical_addr)(struct x86_emulate_ctxt *ctxt, gva_t addr, 240 + unsigned int flags); 240 241 }; 241 242 242 243 /* Type, address-of, and value of an instruction's operand. */
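
Note on the hunk above: the new is_canonical_addr() callback receives the access flags, so the implementation can choose the virtual-address width per access type (for example for X86EMUL_F_MSR or X86EMUL_F_DT_LOAD). How the width is chosen is not shown in this part of the diff. The sketch below only illustrates the canonical check itself for a given width; is_canonical() is a hypothetical name, and the widths 48 and 57 are the usual 4-level and 5-level paging values on x86-64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: an address is canonical for 'bits' implemented
 * virtual-address bits when bits [63:bits-1] are all zeros or all ones.
 */
static bool is_canonical(uint64_t addr, unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);

    uint64_t upper = addr >> (bits - 1);
    uint64_t all_ones = UINT64_MAX >> (bits - 1);

    return upper == 0 || upper == all_ones;
}
```

The same address can pass or fail depending on the width: 0x0000800000000000 is non-canonical with 48 bits but canonical with 57, which is why the width selection matters to callers such as the descriptor-table loads in emulate.c.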

+35 -16

arch/x86/kvm/lapic.c

··· 382 382 DIRTY 383 383 }; 384 384 385 - void kvm_recalculate_apic_map(struct kvm *kvm) 385 + static void kvm_recalculate_apic_map(struct kvm *kvm) 386 386 { 387 387 struct kvm_apic_map *new, *old = NULL; 388 388 struct kvm_vcpu *vcpu; ··· 2577 2577 return (tpr & 0xf0) >> 4; 2578 2578 } 2579 2579 2580 - void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) 2580 + static void __kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value) 2581 2581 { 2582 2582 u64 old_value = vcpu->arch.apic_base; 2583 2583 struct kvm_lapic *apic = vcpu->arch.apic; ··· 2625 2625 } 2626 2626 } 2627 2627 2628 + int kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value, bool host_initiated) 2629 + { 2630 + enum lapic_mode old_mode = kvm_get_apic_mode(vcpu); 2631 + enum lapic_mode new_mode = kvm_apic_mode(value); 2632 + 2633 + if (vcpu->arch.apic_base == value) 2634 + return 0; 2635 + 2636 + u64 reserved_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu) | 0x2ff | 2637 + (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) ? 0 : X2APIC_ENABLE); 2638 + 2639 + if ((value & reserved_bits) != 0 || new_mode == LAPIC_MODE_INVALID) 2640 + return 1; 2641 + if (!host_initiated) { 2642 + if (old_mode == LAPIC_MODE_X2APIC && new_mode == LAPIC_MODE_XAPIC) 2643 + return 1; 2644 + if (old_mode == LAPIC_MODE_DISABLED && new_mode == LAPIC_MODE_X2APIC) 2645 + return 1; 2646 + } 2647 + 2648 + __kvm_apic_set_base(vcpu, value); 2649 + kvm_recalculate_apic_map(vcpu->kvm); 2650 + return 0; 2651 + } 2652 + 2628 2653 void kvm_apic_update_apicv(struct kvm_vcpu *vcpu) 2629 2654 { 2630 2655 struct kvm_lapic *apic = vcpu->arch.apic; ··· 2679 2654 2680 2655 int kvm_alloc_apic_access_page(struct kvm *kvm) 2681 2656 { 2682 - struct page *page; 2683 2657 void __user *hva; 2684 2658 int ret = 0; 2685 2659 ··· 2694 2670 goto out; 2695 2671 } 2696 2672 2697 - page = gfn_to_page(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); 2698 - if (is_error_page(page)) { 2699 - ret = -EFAULT; 2700 - goto out; 2701 - } 2702 - 2703 - /* 2704 - * Do not pin the 
page in memory, so that memory hot-unplug 2705 - * is able to migrate it. 2706 - */ 2707 - put_page(page); 2708 2673 kvm->arch.apic_access_memslot_enabled = true; 2709 2674 out: 2710 2675 mutex_unlock(&kvm->slots_lock); ··· 2748 2735 msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; 2749 2736 if (kvm_vcpu_is_reset_bsp(vcpu)) 2750 2737 msr_val |= MSR_IA32_APICBASE_BSP; 2751 - kvm_lapic_set_base(vcpu, msr_val); 2738 + 2739 + /* 2740 + * Use the inner helper to avoid an extra recalcuation of the 2741 + * optimized APIC map if some other task has dirtied the map. 2742 + * The recalculation needed for this vCPU will be done after 2743 + * all APIC state has been initialized (see below). 2744 + */ 2745 + __kvm_apic_set_base(vcpu, msr_val); 2752 2746 } 2753 2747 2754 2748 if (!apic) ··· 3096 3076 3097 3077 kvm_x86_call(apicv_pre_state_restore)(vcpu); 3098 3078 3099 - kvm_lapic_set_base(vcpu, vcpu->arch.apic_base); 3100 3079 /* set SPIV separately to get count of SW disabled APICs right */ 3101 3080 apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV))); 3102 3081

+6 -5

arch/x86/kvm/lapic.h

··· 95 95 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); 96 96 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8); 97 97 void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu); 98 - void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); 99 - void kvm_recalculate_apic_map(struct kvm *kvm); 100 98 void kvm_apic_set_version(struct kvm_vcpu *vcpu); 101 99 void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu); 102 100 bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source, ··· 115 117 struct kvm_lapic_irq *irq, int *r, struct dest_map *dest_map); 116 118 void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high); 117 119 118 - u64 kvm_get_apic_base(struct kvm_vcpu *vcpu); 119 - int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info); 120 + int kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value, bool host_initiated); 120 121 int kvm_apic_get_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s); 121 122 int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s); 122 - enum lapic_mode kvm_get_apic_mode(struct kvm_vcpu *vcpu); 123 123 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu); 124 124 125 125 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu); ··· 265 269 static inline enum lapic_mode kvm_apic_mode(u64 apic_base) 266 270 { 267 271 return apic_base & (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE); 272 + } 273 + 274 + static inline enum lapic_mode kvm_get_apic_mode(struct kvm_vcpu *vcpu) 275 + { 276 + return kvm_apic_mode(vcpu->arch.apic_base); 268 277 } 269 278 270 279 static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
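
Note on the lapic.c and lapic.h hunks above: kvm_apic_mode() derives the APIC mode from two bits of the APIC base value, and kvm_apic_set_base() rejects two guest-initiated mode transitions. The sketch below models only that part. The bit positions (enable at bit 11, x2APIC enable at bit 10) and the enum values are assumptions based on the architectural IA32_APIC_BASE layout, the function names are hypothetical, and the reserved-bits check from the diff is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the APIC base MSR. */
#define APICBASE_ENABLE    0x800  /* bit 11 */
#define APICBASE_X2APIC    0x400  /* bit 10 */

enum lapic_mode {
    LAPIC_MODE_DISABLED = 0,
    LAPIC_MODE_INVALID  = APICBASE_X2APIC,
    LAPIC_MODE_XAPIC    = APICBASE_ENABLE,
    LAPIC_MODE_X2APIC   = APICBASE_ENABLE | APICBASE_X2APIC,
};

/* Mirrors the masking done by kvm_apic_mode() in the hunk above. */
static enum lapic_mode apic_mode(uint64_t apic_base)
{
    return (enum lapic_mode)(apic_base & (APICBASE_ENABLE | APICBASE_X2APIC));
}

/*
 * Models the guest-initiated (host_initiated == false) checks only:
 * x2APIC -> xAPIC and disabled -> x2APIC are refused, as is any
 * transition into the invalid encoding.
 */
static bool guest_transition_allowed(enum lapic_mode old_mode,
                                     enum lapic_mode new_mode)
{
    /* The current mode is never invalid, since such values are rejected. */
    assert(old_mode != LAPIC_MODE_INVALID);

    if (new_mode == LAPIC_MODE_INVALID)
        return false;
    if (old_mode == LAPIC_MODE_X2APIC && new_mode == LAPIC_MODE_XAPIC)
        return false;
    if (old_mode == LAPIC_MODE_DISABLED && new_mode == LAPIC_MODE_X2APIC)
        return false;
    return true;
}
```

In the diff, host-initiated writes skip the two transition checks but still go through the reserved-bits and invalid-mode checks.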

+1

arch/x86/kvm/mmu.h

··· 4 4 5 5 #include <linux/kvm_host.h> 6 6 #include "kvm_cache_regs.h" 7 + #include "x86.h" 7 8 #include "cpuid.h" 8 9 9 10 extern bool __read_mostly enable_mmio_caching;

+151 -293

arch/x86/kvm/mmu/mmu.c

··· 179 179 180 180 static struct kmem_cache *pte_list_desc_cache; 181 181 struct kmem_cache *mmu_page_header_cache; 182 - static struct percpu_counter kvm_total_used_mmu_pages; 183 182 184 183 static void mmu_spte_set(u64 *sptep, u64 spte); 185 184 ··· 484 485 __set_spte(sptep, new_spte); 485 486 } 486 487 487 - /* 488 - * Update the SPTE (excluding the PFN), but do not track changes in its 489 - * accessed/dirty status. 488 + /* Rules for using mmu_spte_update: 489 + * Update the state bits, it means the mapped pfn is not changed. 490 + * 491 + * Returns true if the TLB needs to be flushed 490 492 */ 491 - static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte) 493 + static bool mmu_spte_update(u64 *sptep, u64 new_spte) 492 494 { 493 495 u64 old_spte = *sptep; 494 496 ··· 498 498 499 499 if (!is_shadow_present_pte(old_spte)) { 500 500 mmu_spte_set(sptep, new_spte); 501 - return old_spte; 501 + return false; 502 502 } 503 503 504 504 if (!spte_has_volatile_bits(old_spte)) ··· 506 506 else 507 507 old_spte = __update_clear_spte_slow(sptep, new_spte); 508 508 509 - WARN_ON_ONCE(spte_to_pfn(old_spte) != spte_to_pfn(new_spte)); 509 + WARN_ON_ONCE(!is_shadow_present_pte(old_spte) || 510 + spte_to_pfn(old_spte) != spte_to_pfn(new_spte)); 510 511 511 - return old_spte; 512 - } 513 - 514 - /* Rules for using mmu_spte_update: 515 - * Update the state bits, it means the mapped pfn is not changed. 516 - * 517 - * Whenever an MMU-writable SPTE is overwritten with a read-only SPTE, remote 518 - * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only 519 - * spte, even though the writable spte might be cached on a CPU's TLB. 
520 - * 521 - * Returns true if the TLB needs to be flushed 522 - */ 523 - static bool mmu_spte_update(u64 *sptep, u64 new_spte) 524 - { 525 - bool flush = false; 526 - u64 old_spte = mmu_spte_update_no_track(sptep, new_spte); 527 - 528 - if (!is_shadow_present_pte(old_spte)) 529 - return false; 530 - 531 - /* 532 - * For the spte updated out of mmu-lock is safe, since 533 - * we always atomically update it, see the comments in 534 - * spte_has_volatile_bits(). 535 - */ 536 - if (is_mmu_writable_spte(old_spte) && 537 - !is_writable_pte(new_spte)) 538 - flush = true; 539 - 540 - /* 541 - * Flush TLB when accessed/dirty states are changed in the page tables, 542 - * to guarantee consistency between TLB and page tables. 543 - */ 544 - 545 - if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) { 546 - flush = true; 547 - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); 548 - } 549 - 550 - if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) { 551 - flush = true; 552 - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); 553 - } 554 - 555 - return flush; 512 + return leaf_spte_change_needs_tlb_flush(old_spte, new_spte); 556 513 } 557 514 558 515 /* ··· 520 563 */ 521 564 static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep) 522 565 { 523 - kvm_pfn_t pfn; 524 566 u64 old_spte = *sptep; 525 567 int level = sptep_to_sp(sptep)->role.level; 526 - struct page *page; 527 568 528 569 if (!is_shadow_present_pte(old_spte) || 529 570 !spte_has_volatile_bits(old_spte)) ··· 533 578 return old_spte; 534 579 535 580 kvm_update_page_stats(kvm, level, -1); 536 - 537 - pfn = spte_to_pfn(old_spte); 538 - 539 - /* 540 - * KVM doesn't hold a reference to any pages mapped into the guest, and 541 - * instead uses the mmu_notifier to ensure that KVM unmaps any pages 542 - * before they are reclaimed. Sanity check that, if the pfn is backed 543 - * by a refcounted page, the refcount is elevated. 
544 - */ 545 - page = kvm_pfn_to_refcounted_page(pfn); 546 - WARN_ON_ONCE(page && !page_count(page)); 547 - 548 - if (is_accessed_spte(old_spte)) 549 - kvm_set_pfn_accessed(pfn); 550 - 551 - if (is_dirty_spte(old_spte)) 552 - kvm_set_pfn_dirty(pfn); 553 - 554 581 return old_spte; 555 582 } 556 583 ··· 1187 1250 return mmu_spte_update(sptep, spte); 1188 1251 } 1189 1252 1190 - static bool spte_wrprot_for_clear_dirty(u64 *sptep) 1191 - { 1192 - bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT, 1193 - (unsigned long *)sptep); 1194 - if (was_writable && !spte_ad_enabled(*sptep)) 1195 - kvm_set_pfn_dirty(spte_to_pfn(*sptep)); 1196 - 1197 - return was_writable; 1198 - } 1199 - 1200 1253 /* 1201 1254 * Gets the GFN ready for another round of dirty logging by clearing the 1202 1255 * - D bit on ad-enabled SPTEs, and ··· 1202 1275 1203 1276 for_each_rmap_spte(rmap_head, &iter, sptep) 1204 1277 if (spte_ad_need_write_protect(*sptep)) 1205 - flush |= spte_wrprot_for_clear_dirty(sptep); 1278 + flush |= test_and_clear_bit(PT_WRITABLE_SHIFT, 1279 + (unsigned long *)sptep); 1206 1280 else 1207 1281 flush |= spte_clear_dirty(sptep); 1208 1282 ··· 1568 1640 (unsigned long *)sptep); 1569 1641 } else { 1570 1642 /* 1571 - * Capture the dirty status of the page, so that 1572 - * it doesn't get lost when the SPTE is marked 1573 - * for access tracking. 1643 + * WARN if mmu_spte_update() signals the need 1644 + * for a TLB flush, as Access tracking a SPTE 1645 + * should never trigger an _immediate_ flush. 1574 1646 */ 1575 - if (is_writable_pte(spte)) 1576 - kvm_set_pfn_dirty(spte_to_pfn(spte)); 1577 - 1578 1647 spte = mark_spte_for_access_track(spte); 1579 - mmu_spte_update_no_track(sptep, spte); 1648 + WARN_ON_ONCE(mmu_spte_update(sptep, spte)); 1580 1649 } 1581 1650 young = true; 1582 1651 } ··· 1621 1696 #endif 1622 1697 } 1623 1698 1624 - /* 1625 - * This value is the sum of all of the kvm instances's 1626 - * kvm->arch.n_used_mmu_pages values. 
We need a global, 1627 - * aggregate version in order to make the slab shrinker 1628 - * faster 1629 - */ 1630 - static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr) 1631 - { 1632 - kvm->arch.n_used_mmu_pages += nr; 1633 - percpu_counter_add(&kvm_total_used_mmu_pages, nr); 1634 - } 1635 - 1636 1699 static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) 1637 1700 { 1638 - kvm_mod_used_mmu_pages(kvm, +1); 1701 + kvm->arch.n_used_mmu_pages++; 1639 1702 kvm_account_pgtable_pages((void *)sp->spt, +1); 1640 1703 } 1641 1704 1642 1705 static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) 1643 1706 { 1644 - kvm_mod_used_mmu_pages(kvm, -1); 1707 + kvm->arch.n_used_mmu_pages--; 1645 1708 kvm_account_pgtable_pages((void *)sp->spt, -1); 1646 1709 } 1647 1710 ··· 2715 2802 * be write-protected. 2716 2803 */ 2717 2804 int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, 2718 - gfn_t gfn, bool can_unsync, bool prefetch) 2805 + gfn_t gfn, bool synchronizing, bool prefetch) 2719 2806 { 2720 2807 struct kvm_mmu_page *sp; 2721 2808 bool locked = false; ··· 2730 2817 2731 2818 /* 2732 2819 * The page is not write-tracked, mark existing shadow pages unsync 2733 - * unless KVM is synchronizing an unsync SP (can_unsync = false). In 2734 - * that case, KVM must complete emulation of the guest TLB flush before 2735 - * allowing shadow pages to become unsync (writable by the guest). 2820 + * unless KVM is synchronizing an unsync SP. In that case, KVM must 2821 + * complete emulation of the guest TLB flush before allowing shadow 2822 + * pages to become unsync (writable by the guest). 
2736 2823 */ 2737 2824 for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) { 2738 - if (!can_unsync) 2825 + if (synchronizing) 2739 2826 return -EPERM; 2740 2827 2741 2828 if (sp->unsync) ··· 2839 2926 } 2840 2927 2841 2928 if (is_shadow_present_pte(*sptep)) { 2929 + if (prefetch) 2930 + return RET_PF_SPURIOUS; 2931 + 2842 2932 /* 2843 2933 * If we overwrite a PTE page pointer with a 2MB PMD, unlink 2844 2934 * the parent of the now unreachable PTE. ··· 2861 2945 } 2862 2946 2863 2947 wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch, 2864 - true, host_writable, &spte); 2948 + false, host_writable, &spte); 2865 2949 2866 2950 if (*sptep == spte) { 2867 2951 ret = RET_PF_SPURIOUS; ··· 2887 2971 return ret; 2888 2972 } 2889 2973 2890 - static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu, 2891 - struct kvm_mmu_page *sp, 2892 - u64 *start, u64 *end) 2974 + static bool kvm_mmu_prefetch_sptes(struct kvm_vcpu *vcpu, gfn_t gfn, u64 *sptep, 2975 + int nr_pages, unsigned int access) 2893 2976 { 2894 2977 struct page *pages[PTE_PREFETCH_NUM]; 2895 2978 struct kvm_memory_slot *slot; 2896 - unsigned int access = sp->role.access; 2897 - int i, ret; 2898 - gfn_t gfn; 2979 + int i; 2899 2980 2900 - gfn = kvm_mmu_page_get_gfn(sp, spte_index(start)); 2981 + if (WARN_ON_ONCE(nr_pages > PTE_PREFETCH_NUM)) 2982 + return false; 2983 + 2901 2984 slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK); 2902 2985 if (!slot) 2903 - return -1; 2986 + return false; 2904 2987 2905 - ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start); 2906 - if (ret <= 0) 2907 - return -1; 2988 + nr_pages = kvm_prefetch_pages(slot, gfn, pages, nr_pages); 2989 + if (nr_pages <= 0) 2990 + return false; 2908 2991 2909 - for (i = 0; i < ret; i++, gfn++, start++) { 2910 - mmu_set_spte(vcpu, slot, start, access, gfn, 2992 + for (i = 0; i < nr_pages; i++, gfn++, sptep++) { 2993 + mmu_set_spte(vcpu, slot, sptep, access, gfn, 2911 2994 page_to_pfn(pages[i]), NULL); 2912 
- put_page(pages[i]); 2995 + 2996 + /* 2997 + * KVM always prefetches writable pages from the primary MMU, 2998 + * and KVM can make its SPTE writable in the fast page handler, 2999 + * without notifying the primary MMU. Mark pages/folios dirty 3000 + * now to ensure file data is written back if it ends up being 3001 + * written by the guest. Because KVM's prefetching GUPs 3002 + * writable PTEs, the probability of unnecessary writeback is 3003 + * extremely low. 3004 + */ 3005 + kvm_release_page_dirty(pages[i]); 2913 3006 } 2914 3007 2915 - return 0; 3008 + return true; 3009 + } 3010 + 3011 + static bool direct_pte_prefetch_many(struct kvm_vcpu *vcpu, 3012 + struct kvm_mmu_page *sp, 3013 + u64 *start, u64 *end) 3014 + { 3015 + gfn_t gfn = kvm_mmu_page_get_gfn(sp, spte_index(start)); 3016 + unsigned int access = sp->role.access; 3017 + 3018 + return kvm_mmu_prefetch_sptes(vcpu, gfn, start, end - start, access); 2916 3019 } 2917 3020 2918 3021 static void __direct_pte_prefetch(struct kvm_vcpu *vcpu, ··· 2949 3014 if (is_shadow_present_pte(*spte) || spte == sptep) { 2950 3015 if (!start) 2951 3016 continue; 2952 - if (direct_pte_prefetch_many(vcpu, sp, start, spte) < 0) 3017 + if (!direct_pte_prefetch_many(vcpu, sp, start, spte)) 2953 3018 return; 3019 + 2954 3020 start = NULL; 2955 3021 } else if (!start) 2956 3022 start = spte; ··· 3101 3165 } 3102 3166 3103 3167 int kvm_mmu_max_mapping_level(struct kvm *kvm, 3104 - const struct kvm_memory_slot *slot, gfn_t gfn, 3105 - int max_level) 3168 + const struct kvm_memory_slot *slot, gfn_t gfn) 3106 3169 { 3107 3170 bool is_private = kvm_slot_can_be_private(slot) && 3108 3171 kvm_mem_is_private(kvm, gfn); 3109 3172 3110 - return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private); 3173 + return __kvm_mmu_max_mapping_level(kvm, slot, gfn, PG_LEVEL_NUM, is_private); 3111 3174 } 3112 3175 3113 3176 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) ··· 3257 3322 fault->slot = 
NULL; 3258 3323 fault->pfn = KVM_PFN_NOSLOT; 3259 3324 fault->map_writable = false; 3260 - fault->hva = KVM_HVA_ERR_BAD; 3261 3325 3262 3326 /* 3263 3327 * If MMIO caching is disabled, emulate immediately without ··· 3326 3392 * by setting the Writable bit, which can be done out of mmu_lock. 3327 3393 */ 3328 3394 if (!fault->present) 3329 - return !kvm_ad_enabled(); 3395 + return !kvm_ad_enabled; 3330 3396 3331 3397 /* 3332 3398 * Note, instruction fetches and writes are mutually exclusive, ignore ··· 3353 3419 * harm. This also avoids the TLB flush needed after setting dirty bit 3354 3420 * so non-PML cases won't be impacted. 3355 3421 * 3356 - * Compare with set_spte where instead shadow_dirty_mask is set. 3422 + * Compare with make_spte() where instead shadow_dirty_mask is set. 3357 3423 */ 3358 3424 if (!try_cmpxchg64(sptep, &old_spte, new_spte)) 3359 3425 return false; ··· 3461 3527 * uses A/D bits for non-nested MMUs. Thus, if A/D bits are 3462 3528 * enabled, the SPTE can't be an access-tracked SPTE. 
3463 3529 */ 3464 - if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte)) 3465 - new_spte = restore_acc_track_spte(new_spte); 3530 + if (unlikely(!kvm_ad_enabled) && is_access_track_spte(spte)) 3531 + new_spte = restore_acc_track_spte(new_spte) | 3532 + shadow_accessed_mask; 3466 3533 3467 3534 /* 3468 3535 * To keep things simple, only SPTEs that are MMU-writable can ··· 4311 4376 return max_level; 4312 4377 } 4313 4378 4314 - static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu, 4315 - struct kvm_page_fault *fault) 4379 + static void kvm_mmu_finish_page_fault(struct kvm_vcpu *vcpu, 4380 + struct kvm_page_fault *fault, int r) 4381 + { 4382 + kvm_release_faultin_page(vcpu->kvm, fault->refcounted_page, 4383 + r == RET_PF_RETRY, fault->map_writable); 4384 + } 4385 + 4386 + static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu, 4387 + struct kvm_page_fault *fault) 4316 4388 { 4317 4389 int max_order, r; 4318 4390 ··· 4329 4387 } 4330 4388 4331 4389 r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn, 4332 - &max_order); 4390 + &fault->refcounted_page, &max_order); 4333 4391 if (r) { 4334 4392 kvm_mmu_prepare_memory_fault_exit(vcpu, fault); 4335 4393 return r; ··· 4342 4400 return RET_PF_CONTINUE; 4343 4401 } 4344 4402 4345 - static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) 4403 + static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu, 4404 + struct kvm_page_fault *fault) 4346 4405 { 4347 - bool async; 4406 + unsigned int foll = fault->write ? 
FOLL_WRITE : 0; 4348 4407 4349 4408 if (fault->is_private) 4350 - return kvm_faultin_pfn_private(vcpu, fault); 4409 + return kvm_mmu_faultin_pfn_private(vcpu, fault); 4351 4410 4352 - async = false; 4353 - fault->pfn = __gfn_to_pfn_memslot(fault->slot, fault->gfn, false, false, 4354 - &async, fault->write, 4355 - &fault->map_writable, &fault->hva); 4356 - if (!async) 4357 - return RET_PF_CONTINUE; /* *pfn has correct page already */ 4411 + foll |= FOLL_NOWAIT; 4412 + fault->pfn = __kvm_faultin_pfn(fault->slot, fault->gfn, foll, 4413 + &fault->map_writable, &fault->refcounted_page); 4414 + 4415 + /* 4416 + * If resolving the page failed because I/O is needed to fault-in the 4417 + * page, then either set up an asynchronous #PF to do the I/O, or if 4418 + * doing an async #PF isn't possible, retry with I/O allowed. All 4419 + * other failures are terminal, i.e. retrying won't help. 4420 + */ 4421 + if (fault->pfn != KVM_PFN_ERR_NEEDS_IO) 4422 + return RET_PF_CONTINUE; 4358 4423 4359 4424 if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) { 4360 4425 trace_kvm_try_async_get_page(fault->addr, fault->gfn); ··· 4379 4430 * to wait for IO. Note, gup always bails if it is unable to quickly 4380 4431 * get a page and a fatal signal, i.e. SIGKILL, is pending. 
4381 4432 */ 4382 - fault->pfn = __gfn_to_pfn_memslot(fault->slot, fault->gfn, false, true, 4383 - NULL, fault->write, 4384 - &fault->map_writable, &fault->hva); 4433 + foll |= FOLL_INTERRUPTIBLE; 4434 + foll &= ~FOLL_NOWAIT; 4435 + fault->pfn = __kvm_faultin_pfn(fault->slot, fault->gfn, foll, 4436 + &fault->map_writable, &fault->refcounted_page); 4437 + 4385 4438 return RET_PF_CONTINUE; 4386 4439 } 4387 4440 4388 - static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, 4389 - unsigned int access) 4441 + static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu, 4442 + struct kvm_page_fault *fault, unsigned int access) 4390 4443 { 4391 4444 struct kvm_memory_slot *slot = fault->slot; 4392 4445 int ret; ··· 4471 4520 if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) 4472 4521 return RET_PF_RETRY; 4473 4522 4474 - ret = __kvm_faultin_pfn(vcpu, fault); 4523 + ret = __kvm_mmu_faultin_pfn(vcpu, fault); 4475 4524 if (ret != RET_PF_CONTINUE) 4476 4525 return ret; 4477 4526 ··· 4489 4538 * mmu_lock is acquired. 
4490 4539 */ 4491 4540 if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) { 4492 - kvm_release_pfn_clean(fault->pfn); 4541 + kvm_mmu_finish_page_fault(vcpu, fault, RET_PF_RETRY); 4493 4542 return RET_PF_RETRY; 4494 4543 } 4495 4544 ··· 4548 4597 if (r) 4549 4598 return r; 4550 4599 4551 - r = kvm_faultin_pfn(vcpu, fault, ACC_ALL); 4600 + r = kvm_mmu_faultin_pfn(vcpu, fault, ACC_ALL); 4552 4601 if (r != RET_PF_CONTINUE) 4553 4602 return r; 4554 4603 ··· 4565 4614 r = direct_map(vcpu, fault); 4566 4615 4567 4616 out_unlock: 4617 + kvm_mmu_finish_page_fault(vcpu, fault, r); 4568 4618 write_unlock(&vcpu->kvm->mmu_lock); 4569 - kvm_release_pfn_clean(fault->pfn); 4570 4619 return r; 4571 4620 } 4572 4621 ··· 4639 4688 if (r) 4640 4689 return r; 4641 4690 4642 - r = kvm_faultin_pfn(vcpu, fault, ACC_ALL); 4691 + r = kvm_mmu_faultin_pfn(vcpu, fault, ACC_ALL); 4643 4692 if (r != RET_PF_CONTINUE) 4644 4693 return r; 4645 4694 ··· 4652 4701 r = kvm_tdp_mmu_map(vcpu, fault); 4653 4702 4654 4703 out_unlock: 4704 + kvm_mmu_finish_page_fault(vcpu, fault, r); 4655 4705 read_unlock(&vcpu->kvm->mmu_lock); 4656 - kvm_release_pfn_clean(fault->pfn); 4657 4706 return r; 4658 4707 } 4659 4708 #endif ··· 5439 5488 role.efer_nx = true; 5440 5489 role.smm = cpu_role.base.smm; 5441 5490 role.guest_mode = cpu_role.base.guest_mode; 5442 - role.ad_disabled = !kvm_ad_enabled(); 5491 + role.ad_disabled = !kvm_ad_enabled; 5443 5492 role.level = kvm_mmu_get_tdp_level(vcpu); 5444 5493 role.direct = true; 5445 5494 role.has_4_byte_gpte = false; ··· 6179 6228 /* It's actually a GPA for vcpu->arch.guest_mmu. */ 6180 6229 if (mmu != &vcpu->arch.guest_mmu) { 6181 6230 /* INVLPG on a non-canonical address is a NOP according to the SDM. 
*/ 6182 - if (is_noncanonical_address(addr, vcpu)) 6231 + if (is_noncanonical_invlpg_address(addr, vcpu)) 6183 6232 return; 6184 6233 6185 6234 kvm_x86_call(flush_tlb_gva)(vcpu, addr); ··· 6367 6416 { 6368 6417 struct kvm_mmu_page *sp, *node; 6369 6418 int nr_zapped, batch = 0; 6419 + LIST_HEAD(invalid_list); 6370 6420 bool unstable; 6421 + 6422 + lockdep_assert_held(&kvm->slots_lock); 6371 6423 6372 6424 restart: 6373 6425 list_for_each_entry_safe_reverse(sp, node, ··· 6403 6449 } 6404 6450 6405 6451 unstable = __kvm_mmu_prepare_zap_page(kvm, sp, 6406 - &kvm->arch.zapped_obsolete_pages, &nr_zapped); 6452 + &invalid_list, &nr_zapped); 6407 6453 batch += nr_zapped; 6408 6454 6409 6455 if (unstable) ··· 6419 6465 * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are 6420 6466 * running with an obsolete MMU. 6421 6467 */ 6422 - kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages); 6468 + kvm_mmu_commit_zap_page(kvm, &invalid_list); 6423 6469 } 6424 6470 6425 6471 /* ··· 6482 6528 kvm_tdp_mmu_zap_invalidated_roots(kvm); 6483 6529 } 6484 6530 6485 - static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm) 6486 - { 6487 - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages)); 6488 - } 6489 - 6490 6531 void kvm_mmu_init_vm(struct kvm *kvm) 6491 6532 { 6492 6533 kvm->arch.shadow_mmio_value = shadow_mmio_value; 6493 6534 INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); 6494 - INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages); 6495 6535 INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages); 6496 6536 spin_lock_init(&kvm->arch.mmu_unsync_pages_lock); 6497 6537 ··· 6719 6771 continue; 6720 6772 } 6721 6773 6722 - spte = make_huge_page_split_spte(kvm, huge_spte, sp->role, index); 6774 + spte = make_small_spte(kvm, huge_spte, sp->role, index); 6723 6775 mmu_spte_set(sptep, spte); 6724 6776 __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access); 6725 6777 } ··· 6902 6954 * mapping if the indirect sp has level = 1. 
6903 6955 */ 6904 6956 if (sp->role.direct && 6905 - sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn, 6906 - PG_LEVEL_NUM)) { 6957 + sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn)) { 6907 6958 kvm_zap_one_rmap_spte(kvm, rmap_head, sptep); 6908 6959 6909 6960 if (kvm_available_flush_remote_tlbs_range()) ··· 6930 6983 kvm_flush_remote_tlbs_memslot(kvm, slot); 6931 6984 } 6932 6985 6933 - void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, 6934 - const struct kvm_memory_slot *slot) 6986 + void kvm_mmu_recover_huge_pages(struct kvm *kvm, 6987 + const struct kvm_memory_slot *slot) 6935 6988 { 6936 6989 if (kvm_memslots_have_rmaps(kvm)) { 6937 6990 write_lock(&kvm->mmu_lock); ··· 6941 6994 6942 6995 if (tdp_mmu_enabled) { 6943 6996 read_lock(&kvm->mmu_lock); 6944 - kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot); 6997 + kvm_tdp_mmu_recover_huge_pages(kvm, slot); 6945 6998 read_unlock(&kvm->mmu_lock); 6946 6999 } 6947 7000 } ··· 7096 7149 } 7097 7150 } 7098 7151 7099 - static unsigned long mmu_shrink_scan(struct shrinker *shrink, 7100 - struct shrink_control *sc) 7101 - { 7102 - struct kvm *kvm; 7103 - int nr_to_scan = sc->nr_to_scan; 7104 - unsigned long freed = 0; 7105 - 7106 - mutex_lock(&kvm_lock); 7107 - 7108 - list_for_each_entry(kvm, &vm_list, vm_list) { 7109 - int idx; 7110 - 7111 - /* 7112 - * Never scan more than sc->nr_to_scan VM instances. 7113 - * Will not hit this condition practically since we do not try 7114 - * to shrink more than one VM and it is very unlikely to see 7115 - * !n_used_mmu_pages so many times. 7116 - */ 7117 - if (!nr_to_scan--) 7118 - break; 7119 - /* 7120 - * n_used_mmu_pages is accessed without holding kvm->mmu_lock 7121 - * here. We may skip a VM instance errorneosly, but we do not 7122 - * want to shrink a VM that only started to populate its MMU 7123 - * anyway. 
7124 - */ 7125 - if (!kvm->arch.n_used_mmu_pages && 7126 - !kvm_has_zapped_obsolete_pages(kvm)) 7127 - continue; 7128 - 7129 - idx = srcu_read_lock(&kvm->srcu); 7130 - write_lock(&kvm->mmu_lock); 7131 - 7132 - if (kvm_has_zapped_obsolete_pages(kvm)) { 7133 - kvm_mmu_commit_zap_page(kvm, 7134 - &kvm->arch.zapped_obsolete_pages); 7135 - goto unlock; 7136 - } 7137 - 7138 - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan); 7139 - 7140 - unlock: 7141 - write_unlock(&kvm->mmu_lock); 7142 - srcu_read_unlock(&kvm->srcu, idx); 7143 - 7144 - /* 7145 - * unfair on small ones 7146 - * per-vm shrinkers cry out 7147 - * sadness comes quickly 7148 - */ 7149 - list_move_tail(&kvm->vm_list, &vm_list); 7150 - break; 7151 - } 7152 - 7153 - mutex_unlock(&kvm_lock); 7154 - return freed; 7155 - } 7156 - 7157 - static unsigned long mmu_shrink_count(struct shrinker *shrink, 7158 - struct shrink_control *sc) 7159 - { 7160 - return percpu_counter_read_positive(&kvm_total_used_mmu_pages); 7161 - } 7162 - 7163 - static struct shrinker *mmu_shrinker; 7164 - 7165 7152 static void mmu_destroy_caches(void) 7166 7153 { 7167 7154 kmem_cache_destroy(pte_list_desc_cache); ··· 7162 7281 kvm_mmu_zap_all_fast(kvm); 7163 7282 mutex_unlock(&kvm->slots_lock); 7164 7283 7165 - wake_up_process(kvm->arch.nx_huge_page_recovery_thread); 7284 + vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread); 7166 7285 } 7167 7286 mutex_unlock(&kvm_lock); 7168 7287 } ··· 7222 7341 if (!mmu_page_header_cache) 7223 7342 goto out; 7224 7343 7225 - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL)) 7226 - goto out; 7227 - 7228 - mmu_shrinker = shrinker_alloc(0, "x86-mmu"); 7229 - if (!mmu_shrinker) 7230 - goto out_shrinker; 7231 - 7232 - mmu_shrinker->count_objects = mmu_shrink_count; 7233 - mmu_shrinker->scan_objects = mmu_shrink_scan; 7234 - mmu_shrinker->seeks = DEFAULT_SEEKS * 10; 7235 - 7236 - shrinker_register(mmu_shrinker); 7237 - 7238 7344 return 0; 7239 7345 7240 - out_shrinker: 7241 - 
percpu_counter_destroy(&kvm_total_used_mmu_pages); 7242 7346 out: 7243 7347 mmu_destroy_caches(); 7244 7348 return ret; ··· 7240 7374 void kvm_mmu_vendor_module_exit(void) 7241 7375 { 7242 7376 mmu_destroy_caches(); 7243 - percpu_counter_destroy(&kvm_total_used_mmu_pages); 7244 - shrinker_free(mmu_shrinker); 7245 7377 } 7246 7378 7247 7379 /* ··· 7291 7427 mutex_lock(&kvm_lock); 7292 7428 7293 7429 list_for_each_entry(kvm, &vm_list, vm_list) 7294 - wake_up_process(kvm->arch.nx_huge_page_recovery_thread); 7430 + vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread); 7295 7431 7296 7432 mutex_unlock(&kvm_lock); 7297 7433 } ··· 7394 7530 srcu_read_unlock(&kvm->srcu, rcu_idx); 7395 7531 } 7396 7532 7397 - static long get_nx_huge_page_recovery_timeout(u64 start_time) 7533 + static void kvm_nx_huge_page_recovery_worker_kill(void *data) 7398 7534 { 7399 - bool enabled; 7400 - uint period; 7401 - 7402 - enabled = calc_nx_huge_pages_recovery_period(&period); 7403 - 7404 - return enabled ? start_time + msecs_to_jiffies(period) - get_jiffies_64() 7405 - : MAX_SCHEDULE_TIMEOUT; 7406 7535 } 7407 7536 7408 - static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data) 7537 + static bool kvm_nx_huge_page_recovery_worker(void *data) 7409 7538 { 7410 - u64 start_time; 7539 + struct kvm *kvm = data; 7540 + bool enabled; 7541 + uint period; 7411 7542 long remaining_time; 7412 7543 7413 - while (true) { 7414 - start_time = get_jiffies_64(); 7415 - remaining_time = get_nx_huge_page_recovery_timeout(start_time); 7544 + enabled = calc_nx_huge_pages_recovery_period(&period); 7545 + if (!enabled) 7546 + return false; 7416 7547 7417 - set_current_state(TASK_INTERRUPTIBLE); 7418 - while (!kthread_should_stop() && remaining_time > 0) { 7419 - schedule_timeout(remaining_time); 7420 - remaining_time = get_nx_huge_page_recovery_timeout(start_time); 7421 - set_current_state(TASK_INTERRUPTIBLE); 7422 - } 7423 - 7424 - set_current_state(TASK_RUNNING); 7425 - 7426 - if 
(kthread_should_stop()) 7427 - return 0; 7428 - 7429 - kvm_recover_nx_huge_pages(kvm); 7548 + remaining_time = kvm->arch.nx_huge_page_last + msecs_to_jiffies(period) 7549 + - get_jiffies_64(); 7550 + if (remaining_time > 0) { 7551 + schedule_timeout(remaining_time); 7552 + /* check for signals and come back */ 7553 + return true; 7430 7554 } 7555 + 7556 + __set_current_state(TASK_RUNNING); 7557 + kvm_recover_nx_huge_pages(kvm); 7558 + kvm->arch.nx_huge_page_last = get_jiffies_64(); 7559 + return true; 7431 7560 } 7432 7561 7433 7562 int kvm_mmu_post_init_vm(struct kvm *kvm) 7434 7563 { 7435 - int err; 7436 - 7437 7564 if (nx_hugepage_mitigation_hard_disabled) 7438 7565 return 0; 7439 7566 7440 - err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0, 7441 - "kvm-nx-lpage-recovery", 7442 - &kvm->arch.nx_huge_page_recovery_thread); 7443 - if (!err) 7444 - kthread_unpark(kvm->arch.nx_huge_page_recovery_thread); 7567 + kvm->arch.nx_huge_page_last = get_jiffies_64(); 7568 + kvm->arch.nx_huge_page_recovery_thread = vhost_task_create( 7569 + kvm_nx_huge_page_recovery_worker, kvm_nx_huge_page_recovery_worker_kill, 7570 + kvm, "kvm-nx-lpage-recovery"); 7445 7571 7446 - return err; 7572 + if (!kvm->arch.nx_huge_page_recovery_thread) 7573 + return -ENOMEM; 7574 + 7575 + vhost_task_start(kvm->arch.nx_huge_page_recovery_thread); 7576 + return 0; 7447 7577 } 7448 7578 7449 7579 void kvm_mmu_pre_destroy_vm(struct kvm *kvm) 7450 7580 { 7451 7581 if (kvm->arch.nx_huge_page_recovery_thread) 7452 - kthread_stop(kvm->arch.nx_huge_page_recovery_thread); 7582 + vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread); 7453 7583 } 7454 7584 7455 7585 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
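
Note on the reworked recovery worker above: the kthread loop is replaced by a vhost task whose callback runs one step at a time. It either sleeps for the time left until the next pass, or runs `kvm_recover_nx_huge_pages()` and records the time in `kvm->arch.nx_huge_page_last`. The timing rule can be sketched in user-space C as follows. This is a minimal sketch, not the kernel code: `struct nx_recovery_state`, `nx_recovery_step()`, and the plain tick counter are hypothetical stand-ins for the per-VM field, the worker body, and jiffies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-VM state; only the field used here. */
struct nx_recovery_state {
	uint64_t last_run;	/* tick count recorded after the last recovery pass */
};

/*
 * One worker step. Returns the number of ticks still to sleep (> 0) if the
 * period has not elapsed yet, or 0 if a recovery pass is due. In the latter
 * case the timestamp is refreshed, mirroring the update of
 * nx_huge_page_last after kvm_recover_nx_huge_pages() in the diff.
 */
static int64_t nx_recovery_step(struct nx_recovery_state *st,
				uint64_t now, uint64_t period)
{
	int64_t remaining;

	assert(period > 0);	/* a disabled period is handled before this point */

	remaining = (int64_t)(st->last_run + period - now);
	if (remaining > 0)
		return remaining;	/* caller sleeps, then calls again */

	/* The recovery pass itself would run here. */
	st->last_run = now;
	return 0;
}
```

Keeping the timestamp in per-VM state rather than in a local variable is what lets the callback return after every sleep and be invoked again, instead of looping internally as the old kthread did.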

+4 -6

arch/x86/kvm/mmu/mmu_internal.h

··· 164 164 } 165 165 166 166 int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot, 167 - gfn_t gfn, bool can_unsync, bool prefetch); 167 + gfn_t gfn, bool synchronizing, bool prefetch); 168 168 169 169 void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn); 170 170 void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn); ··· 235 235 /* The memslot containing gfn. May be NULL. */ 236 236 struct kvm_memory_slot *slot; 237 237 238 - /* Outputs of kvm_faultin_pfn. */ 238 + /* Outputs of kvm_mmu_faultin_pfn(). */ 239 239 unsigned long mmu_seq; 240 240 kvm_pfn_t pfn; 241 - hva_t hva; 241 + struct page *refcounted_page; 242 242 bool map_writable; 243 243 244 244 /* ··· 313 313 .is_private = err & PFERR_PRIVATE_ACCESS, 314 314 315 315 .pfn = KVM_PFN_ERR_FAULT, 316 - .hva = KVM_HVA_ERR_BAD, 317 316 }; 318 317 int r; 319 318 ··· 346 347 } 347 348 348 349 int kvm_mmu_max_mapping_level(struct kvm *kvm, 349 - const struct kvm_memory_slot *slot, gfn_t gfn, 350 - int max_level); 350 + const struct kvm_memory_slot *slot, gfn_t gfn); 351 351 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); 352 352 void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level); 353 353

+12 -19

arch/x86/kvm/mmu/paging_tmpl.h

··· 533 533 FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, 534 534 u64 *spte, pt_element_t gpte) 535 535 { 536 - struct kvm_memory_slot *slot; 537 536 unsigned pte_access; 538 537 gfn_t gfn; 539 - kvm_pfn_t pfn; 540 538 541 539 if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte)) 542 540 return false; ··· 543 545 pte_access = sp->role.access & FNAME(gpte_access)(gpte); 544 546 FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte); 545 547 546 - slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, pte_access & ACC_WRITE_MASK); 547 - if (!slot) 548 - return false; 549 - 550 - pfn = gfn_to_pfn_memslot_atomic(slot, gfn); 551 - if (is_error_pfn(pfn)) 552 - return false; 553 - 554 - mmu_set_spte(vcpu, slot, spte, pte_access, gfn, pfn, NULL); 555 - kvm_release_pfn_clean(pfn); 556 - return true; 548 + return kvm_mmu_prefetch_sptes(vcpu, gfn, spte, 1, pte_access); 557 549 } 558 550 559 551 static bool FNAME(gpte_changed)(struct kvm_vcpu *vcpu, ··· 801 813 if (r) 802 814 return r; 803 815 804 - r = kvm_faultin_pfn(vcpu, fault, walker.pte_access); 816 + r = kvm_mmu_faultin_pfn(vcpu, fault, walker.pte_access); 805 817 if (r != RET_PF_CONTINUE) 806 818 return r; 807 819 ··· 836 848 r = FNAME(fetch)(vcpu, fault, &walker); 837 849 838 850 out_unlock: 851 + kvm_mmu_finish_page_fault(vcpu, fault, r); 839 852 write_unlock(&vcpu->kvm->mmu_lock); 840 - kvm_release_pfn_clean(fault->pfn); 841 853 return r; 842 854 } 843 855 ··· 880 892 881 893 /* 882 894 * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()) is 883 - * safe because: 884 - * - The spte has a reference to the struct page, so the pfn for a given gfn 885 - * can't change unless all sptes pointing to it are nuked first. 895 + * safe because SPTEs are protected by mmu_notifiers and memslot generations, so 896 + * the pfn for a given gfn can't change unless all SPTEs pointing to the gfn are 897 + * nuked first. 
886 898 * 887 899 * Returns 888 900 * < 0: failed to sync spte ··· 951 963 host_writable = spte & shadow_host_writable_mask; 952 964 slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); 953 965 make_spte(vcpu, sp, slot, pte_access, gfn, 954 - spte_to_pfn(spte), spte, true, false, 966 + spte_to_pfn(spte), spte, true, true, 955 967 host_writable, &spte); 956 968 969 + /* 970 + * There is no need to mark the pfn dirty, as the new protections must 971 + * be a subset of the old protections, i.e. synchronizing a SPTE cannot 972 + * change the SPTE from read-only to writable. 973 + */ 957 974 return mmu_spte_update(sptep, spte); 958 975 } 959 976

+65 -37

arch/x86/kvm/mmu/spte.c

··· 24 24 module_param_named(mmio_caching, enable_mmio_caching, bool, 0444); 25 25 EXPORT_SYMBOL_GPL(enable_mmio_caching); 26 26 27 + bool __read_mostly kvm_ad_enabled; 28 + 27 29 u64 __read_mostly shadow_host_writable_mask; 28 30 u64 __read_mostly shadow_mmu_writable_mask; 29 31 u64 __read_mostly shadow_nx_mask; ··· 135 133 */ 136 134 bool spte_has_volatile_bits(u64 spte) 137 135 { 138 - /* 139 - * Always atomically update spte if it can be updated 140 - * out of mmu-lock, it can ensure dirty bit is not lost, 141 - * also, it can help us to get a stable is_writable_pte() 142 - * to ensure tlb flush is not missed. 143 - */ 144 136 if (!is_writable_pte(spte) && is_mmu_writable_spte(spte)) 145 137 return true; 146 138 ··· 153 157 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, 154 158 const struct kvm_memory_slot *slot, 155 159 unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, 156 - u64 old_spte, bool prefetch, bool can_unsync, 160 + u64 old_spte, bool prefetch, bool synchronizing, 157 161 bool host_writable, u64 *new_spte) 158 162 { 159 163 int level = sp->role.level; ··· 174 178 spte |= SPTE_TDP_AD_WRPROT_ONLY; 175 179 176 180 spte |= shadow_present_mask; 177 - if (!prefetch) 178 - spte |= spte_shadow_accessed_mask(spte); 181 + if (!prefetch || synchronizing) 182 + spte |= shadow_accessed_mask; 179 183 180 184 /* 181 185 * For simplicity, enforce the NX huge page mitigation even if not ··· 219 223 spte |= (u64)pfn << PAGE_SHIFT; 220 224 221 225 if (pte_access & ACC_WRITE_MASK) { 222 - spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask; 223 - 224 - /* 225 - * Optimization: for pte sync, if spte was writable the hash 226 - * lookup is unnecessary (and expensive). Write protection 227 - * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots. 228 - * Same reasoning can be applied to dirty page accounting. 
229 - */ 230 - if (is_writable_pte(old_spte)) 231 - goto out; 232 - 233 226 /* 234 227 * Unsync shadow pages that are reachable by the new, writable 235 228 * SPTE. Write-protect the SPTE if the page can't be unsync'd, 236 229 * e.g. it's write-tracked (upper-level SPs) or has one or more 237 230 * shadow pages and unsync'ing pages is not allowed. 231 + * 232 + * When overwriting an existing leaf SPTE, and the old SPTE was 233 + * writable, skip trying to unsync shadow pages as any relevant 234 + * shadow pages must already be unsync, i.e. the hash lookup is 235 + * unnecessary (and expensive). Note, this relies on KVM not 236 + * changing PFNs without first zapping the old SPTE, which is 237 + * guaranteed by both the shadow MMU and the TDP MMU. 238 238 */ 239 - if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) { 239 + if ((!is_last_spte(old_spte, level) || !is_writable_pte(old_spte)) && 240 + mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, synchronizing, prefetch)) 240 241 wrprot = true; 241 - pte_access &= ~ACC_WRITE_MASK; 242 - spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask); 243 - } 242 + else 243 + spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask | 244 + shadow_dirty_mask; 244 245 } 245 246 246 - if (pte_access & ACC_WRITE_MASK) 247 - spte |= spte_shadow_dirty_mask(spte); 248 - 249 - out: 250 - if (prefetch) 247 + if (prefetch && !synchronizing) 251 248 spte = mark_spte_for_access_track(spte); 252 249 253 250 WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level), 254 251 "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level, 255 252 get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level)); 256 253 254 + /* 255 + * Mark the memslot dirty *after* modifying it for access tracking. 256 + * Unlike folios, memslots can be safely marked dirty out of mmu_lock, 257 + * i.e. in the fast page fault handler. 
258 + */ 257 259 if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) { 258 260 /* Enforced by kvm_mmu_hugepage_adjust. */ 259 261 WARN_ON_ONCE(level > PG_LEVEL_4K); ··· 262 268 return wrprot; 263 269 } 264 270 265 - static u64 make_spte_executable(u64 spte) 271 + static u64 modify_spte_protections(u64 spte, u64 set, u64 clear) 266 272 { 267 273 bool is_access_track = is_access_track_spte(spte); 268 274 269 275 if (is_access_track) 270 276 spte = restore_acc_track_spte(spte); 271 277 272 - spte &= ~shadow_nx_mask; 273 - spte |= shadow_x_mask; 278 + KVM_MMU_WARN_ON(set & clear); 279 + spte = (spte | set) & ~clear; 274 280 275 281 if (is_access_track) 276 282 spte = mark_spte_for_access_track(spte); 277 283 278 284 return spte; 285 + } 286 + 287 + static u64 make_spte_executable(u64 spte) 288 + { 289 + return modify_spte_protections(spte, shadow_x_mask, shadow_nx_mask); 290 + } 291 + 292 + static u64 make_spte_nonexecutable(u64 spte) 293 + { 294 + return modify_spte_protections(spte, shadow_nx_mask, shadow_x_mask); 279 295 } 280 296 281 297 /* ··· 295 291 * This is used during huge page splitting to build the SPTEs that make up the 296 292 * new page table. 297 293 */ 298 - u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, 299 - union kvm_mmu_page_role role, int index) 294 + u64 make_small_spte(struct kvm *kvm, u64 huge_spte, 295 + union kvm_mmu_page_role role, int index) 300 296 { 301 297 u64 child_spte = huge_spte; 302 298 ··· 324 320 return child_spte; 325 321 } 326 322 323 + u64 make_huge_spte(struct kvm *kvm, u64 small_spte, int level) 324 + { 325 + u64 huge_spte; 326 + 327 + KVM_BUG_ON(!is_shadow_present_pte(small_spte) || level == PG_LEVEL_4K, kvm); 328 + 329 + huge_spte = small_spte | PT_PAGE_SIZE_MASK; 330 + 331 + /* 332 + * huge_spte already has the address of the sub-page being collapsed 333 + * from small_spte, so just clear the lower address bits to create the 334 + * huge page address. 
335 + */ 336 + huge_spte &= KVM_HPAGE_MASK(level) | ~PAGE_MASK; 337 + 338 + if (is_nx_huge_page_enabled(kvm)) 339 + huge_spte = make_spte_nonexecutable(huge_spte); 340 + 341 + return huge_spte; 342 + } 327 343 328 344 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled) 329 345 { ··· 376 352 377 353 spte |= (spte & SHADOW_ACC_TRACK_SAVED_BITS_MASK) << 378 354 SHADOW_ACC_TRACK_SAVED_BITS_SHIFT; 379 - spte &= ~shadow_acc_track_mask; 355 + spte &= ~(shadow_acc_track_mask | shadow_accessed_mask); 380 356 381 357 return spte; 382 358 } ··· 446 422 447 423 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only) 448 424 { 425 + kvm_ad_enabled = has_ad_bits; 426 + 449 427 shadow_user_mask = VMX_EPT_READABLE_MASK; 450 - shadow_accessed_mask = has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull; 451 - shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull; 428 + shadow_accessed_mask = VMX_EPT_ACCESS_BIT; 429 + shadow_dirty_mask = VMX_EPT_DIRTY_BIT; 452 430 shadow_nx_mask = 0ull; 453 431 shadow_x_mask = VMX_EPT_EXECUTABLE_MASK; 454 432 /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */ ··· 480 454 { 481 455 u8 low_phys_bits; 482 456 u64 mask; 457 + 458 + kvm_ad_enabled = true; 483 459 484 460 /* 485 461 * If the CPU has 46 or less physical address bits, then set an
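
Note on the two bit manipulations introduced in this file: `modify_spte_protections()` applies `(spte | set) & ~clear`, and `make_huge_spte()` builds the huge page address by keeping the low flag bits and clearing the address bits below the huge page boundary. The sketch below illustrates both in user-space C. All `sketch_`/`SKETCH_` names are hypothetical, and the constants assume 4 KiB base pages with 9 address bits per paging level. Access-tracking handling and the NX adjustment from the real helpers are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_MASK	(~((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1))

/* Mask of a huge page at the given level: 1 = 4K, 2 = 2M, 3 = 1G (assumed). */
static uint64_t sketch_hpage_mask(int level)
{
	int shift = SKETCH_PAGE_SHIFT + 9 * (level - 1);

	return ~((UINT64_C(1) << shift) - 1);
}

/* Set some protection bits and clear others; the two sets must not overlap. */
static uint64_t sketch_modify_protections(uint64_t spte, uint64_t set,
					  uint64_t clear)
{
	assert((set & clear) == 0);
	return (spte | set) & ~clear;
}

/*
 * Keep the low 12 flag bits and every bit at or above the huge page
 * boundary; clear the address bits in between.
 */
static uint64_t sketch_huge_address(uint64_t small_spte, int level)
{
	return small_spte & (sketch_hpage_mask(level) | ~SKETCH_PAGE_MASK);
}
```

For example, a 4K entry with value `0x40153867` collapsed to level 2 keeps flags `0x867` and becomes `0x40000867`, since bits 12 through 20 are cleared.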

+41 -37

arch/x86/kvm/mmu/spte.h

··· 167 167 #define SHADOW_NONPRESENT_VALUE 0ULL 168 168 #endif 169 169 170 + 171 + /* 172 + * True if A/D bits are supported in hardware and are enabled by KVM. When 173 + * enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can disable 174 + * A/D bits in EPTP12, SP and SPTE variants are needed to handle the scenario 175 + * where KVM is using A/D bits for L1, but not L2. 176 + */ 177 + extern bool __read_mostly kvm_ad_enabled; 178 + 170 179 extern u64 __read_mostly shadow_host_writable_mask; 171 180 extern u64 __read_mostly shadow_mmu_writable_mask; 172 181 extern u64 __read_mostly shadow_nx_mask; ··· 294 285 (spte & VMX_EPT_RWX_MASK) != VMX_EPT_MISCONFIG_WX_VALUE; 295 286 } 296 287 297 - /* 298 - * Returns true if A/D bits are supported in hardware and are enabled by KVM. 299 - * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can 300 - * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the 301 - * scenario where KVM is using A/D bits for L1, but not L2. 302 - */ 303 - static inline bool kvm_ad_enabled(void) 304 - { 305 - return !!shadow_accessed_mask; 306 - } 307 - 308 288 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp) 309 289 { 310 290 return sp->role.ad_disabled; ··· 314 316 * TDP and do the A/D type check unconditionally. 315 317 */ 316 318 return (spte & SPTE_TDP_AD_MASK) != SPTE_TDP_AD_ENABLED; 317 - } 318 - 319 - static inline u64 spte_shadow_accessed_mask(u64 spte) 320 - { 321 - KVM_MMU_WARN_ON(!is_shadow_present_pte(spte)); 322 - return spte_ad_enabled(spte) ? shadow_accessed_mask : 0; 323 - } 324 - 325 - static inline u64 spte_shadow_dirty_mask(u64 spte) 326 - { 327 - KVM_MMU_WARN_ON(!is_shadow_present_pte(spte)); 328 - return spte_ad_enabled(spte) ? 
shadow_dirty_mask : 0; 329 319 } 330 320 331 321 static inline bool is_access_track_spte(u64 spte) ··· 343 357 344 358 static inline bool is_accessed_spte(u64 spte) 345 359 { 346 - u64 accessed_mask = spte_shadow_accessed_mask(spte); 347 - 348 - return accessed_mask ? spte & accessed_mask 349 - : !is_access_track_spte(spte); 350 - } 351 - 352 - static inline bool is_dirty_spte(u64 spte) 353 - { 354 - u64 dirty_mask = spte_shadow_dirty_mask(spte); 355 - 356 - return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK; 360 + return spte & shadow_accessed_mask; 357 361 } 358 362 359 363 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte, ··· 461 485 return spte & shadow_mmu_writable_mask; 462 486 } 463 487 488 + /* 489 + * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for 490 + * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only, 491 + * as KVM allows stale Writable TLB entries to exist. When dirty logging, KVM 492 + * flushes TLBs based on whether or not dirty bitmap/ring entries were reaped, 493 + * not whether or not SPTEs were modified, i.e. only the write-tracking case 494 + * needs to flush at the time the SPTEs is modified, before dropping mmu_lock. 495 + * 496 + * Don't flush if the Accessed bit is cleared, as access tracking tolerates 497 + * false negatives, e.g. KVM x86 omits TLB flushes even when aging SPTEs for a 498 + * mmu_notifier.clear_flush_young() event. 499 + * 500 + * Lastly, don't flush if the Dirty bit is cleared, as KVM unconditionally 501 + * flushes when enabling dirty logging (see kvm_mmu_slot_apply_flags()), and 502 + * when clearing dirty logs, KVM flushes based on whether or not dirty entries 503 + * were reaped from the bitmap/ring, not whether or not dirty SPTEs were found. 504 + * 505 + * Note, this logic only applies to shadow-present leaf SPTEs. 
The caller is 506 + * responsible for checking that the old SPTE is shadow-present, and is also 507 + * responsible for determining whether or not a TLB flush is required when 508 + * modifying a shadow-present non-leaf SPTE. 509 + */ 510 + static inline bool leaf_spte_change_needs_tlb_flush(u64 old_spte, u64 new_spte) 511 + { 512 + return is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte); 513 + } 514 + 464 515 static inline u64 get_mmio_spte_generation(u64 spte) 465 516 { 466 517 u64 gen; ··· 502 499 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, 503 500 const struct kvm_memory_slot *slot, 504 501 unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, 505 - u64 old_spte, bool prefetch, bool can_unsync, 502 + u64 old_spte, bool prefetch, bool synchronizing, 506 503 bool host_writable, u64 *new_spte); 507 - u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte, 508 - union kvm_mmu_page_role role, int index); 504 + u64 make_small_spte(struct kvm *kvm, u64 huge_spte, 505 + union kvm_mmu_page_role role, int index); 506 + u64 make_huge_spte(struct kvm *kvm, u64 small_spte, int level); 509 507 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled); 510 508 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access); 511 509 u64 mark_spte_for_access_track(u64 spte);
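
Note on `leaf_spte_change_needs_tlb_flush()`: per the comment above, only the MMU-writable flag going from set to clear forces a flush. Clearing the Accessed or Dirty bit alone does not. A minimal user-space sketch of that predicate follows. The bit positions and `sketch_`/`SKETCH_` names are hypothetical and chosen only for illustration; the kernel uses run-time masks such as `shadow_mmu_writable_mask`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define SKETCH_ACCESSED		(UINT64_C(1) << 5)
#define SKETCH_DIRTY		(UINT64_C(1) << 6)
#define SKETCH_MMU_WRITABLE	(UINT64_C(1) << 58)

/* Flush only when a leaf SPTE loses its MMU-writable flag. */
static bool sketch_needs_tlb_flush(uint64_t old_spte, uint64_t new_spte)
{
	return (old_spte & SKETCH_MMU_WRITABLE) &&
	       !(new_spte & SKETCH_MMU_WRITABLE);
}
```

As in the kernel helper, the caller is assumed to have already checked that the old SPTE is a shadow-present leaf.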

+130 -150

arch/x86/kvm/mmu/tdp_mmu.c

··· 511 511 if (is_leaf != was_leaf) 512 512 kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1); 513 513 514 - if (was_leaf && is_dirty_spte(old_spte) && 515 - (!is_present || !is_dirty_spte(new_spte) || pfn_changed)) 516 - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); 517 - 518 514 /* 519 515 * Recursively handle child PTs if the change removed a subtree from 520 516 * the paging structure. Note the WARN on the PFN changing without the ··· 520 524 if (was_present && !was_leaf && 521 525 (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) 522 526 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared); 523 - 524 - if (was_leaf && is_accessed_spte(old_spte) && 525 - (!is_present || !is_accessed_spte(new_spte) || pfn_changed)) 526 - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); 527 527 } 528 528 529 529 static inline int __must_check __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, ··· 583 591 return 0; 584 592 } 585 593 586 - static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm, 587 - struct tdp_iter *iter) 588 - { 589 - int ret; 590 - 591 - lockdep_assert_held_read(&kvm->mmu_lock); 592 - 593 - /* 594 - * Freeze the SPTE by setting it to a special, non-present value. This 595 - * will stop other threads from immediately installing a present entry 596 - * in its place before the TLBs are flushed. 597 - * 598 - * Delay processing of the zapped SPTE until after TLBs are flushed and 599 - * the FROZEN_SPTE is replaced (see below). 600 - */ 601 - ret = __tdp_mmu_set_spte_atomic(iter, FROZEN_SPTE); 602 - if (ret) 603 - return ret; 604 - 605 - kvm_flush_remote_tlbs_gfn(kvm, iter->gfn, iter->level); 606 - 607 - /* 608 - * No other thread can overwrite the frozen SPTE as they must either 609 - * wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not 610 - * overwrite the special frozen SPTE value. Use the raw write helper to 611 - * avoid an unnecessary check on volatile bits. 
612 - */ 613 - __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE); 614 - 615 - /* 616 - * Process the zapped SPTE after flushing TLBs, and after replacing 617 - * FROZEN_SPTE with 0. This minimizes the amount of time vCPUs are 618 - * blocked by the FROZEN_SPTE and reduces contention on the child 619 - * SPTEs. 620 - */ 621 - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte, 622 - SHADOW_NONPRESENT_VALUE, iter->level, true); 623 - 624 - return 0; 625 - } 626 - 627 - 628 594 /* 629 595 * tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping 630 596 * @kvm: KVM instance ··· 638 688 #define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \ 639 689 for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end) 640 690 691 + static inline bool __must_check tdp_mmu_iter_need_resched(struct kvm *kvm, 692 + struct tdp_iter *iter) 693 + { 694 + if (!need_resched() && !rwlock_needbreak(&kvm->mmu_lock)) 695 + return false; 696 + 697 + /* Ensure forward progress has been made before yielding. */ 698 + return iter->next_last_level_gfn != iter->yielded_gfn; 699 + } 700 + 641 701 /* 642 702 * Yield if the MMU lock is contended or this thread needs to return control 643 703 * to the scheduler. ··· 666 706 struct tdp_iter *iter, 667 707 bool flush, bool shared) 668 708 { 669 - WARN_ON_ONCE(iter->yielded); 709 + KVM_MMU_WARN_ON(iter->yielded); 670 710 671 - /* Ensure forward progress has been made before yielding. 
*/ 672 - if (iter->next_last_level_gfn == iter->yielded_gfn) 711 + if (!tdp_mmu_iter_need_resched(kvm, iter)) 673 712 return false; 674 713 675 - if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) { 676 - if (flush) 677 - kvm_flush_remote_tlbs(kvm); 714 + if (flush) 715 + kvm_flush_remote_tlbs(kvm); 678 716 679 - rcu_read_unlock(); 717 + rcu_read_unlock(); 680 718 681 - if (shared) 682 - cond_resched_rwlock_read(&kvm->mmu_lock); 683 - else 684 - cond_resched_rwlock_write(&kvm->mmu_lock); 719 + if (shared) 720 + cond_resched_rwlock_read(&kvm->mmu_lock); 721 + else 722 + cond_resched_rwlock_write(&kvm->mmu_lock); 685 723 686 - rcu_read_lock(); 724 + rcu_read_lock(); 687 725 688 - WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn); 726 + WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn); 689 727 690 - iter->yielded = true; 691 - } 692 - 693 - return iter->yielded; 728 + iter->yielded = true; 729 + return true; 694 730 } 695 731 696 732 static inline gfn_t tdp_mmu_max_gfn_exclusive(void) ··· 982 1026 if (WARN_ON_ONCE(sp->role.level != fault->goal_level)) 983 1027 return RET_PF_RETRY; 984 1028 1029 + if (fault->prefetch && is_shadow_present_pte(iter->old_spte)) 1030 + return RET_PF_SPURIOUS; 1031 + 985 1032 if (unlikely(!fault->slot)) 986 1033 new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); 987 1034 else 988 1035 wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn, 989 - fault->pfn, iter->old_spte, fault->prefetch, true, 990 - fault->map_writable, &new_spte); 1036 + fault->pfn, iter->old_spte, fault->prefetch, 1037 + false, fault->map_writable, &new_spte); 991 1038 992 1039 if (new_spte == iter->old_spte) 993 1040 ret = RET_PF_SPURIOUS; 994 1041 else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte)) 995 1042 return RET_PF_RETRY; 996 1043 else if (is_shadow_present_pte(iter->old_spte) && 997 - !is_last_spte(iter->old_spte, iter->level)) 1044 + (!is_last_spte(iter->old_spte, iter->level) || 1045 + 
WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte, new_spte)))) 998 1046 kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level); 999 1047 1000 1048 /* ··· 1038 1078 static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter, 1039 1079 struct kvm_mmu_page *sp, bool shared) 1040 1080 { 1041 - u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled()); 1081 + u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled); 1042 1082 int ret = 0; 1043 1083 1044 1084 if (shared) { ··· 1155 1195 return flush; 1156 1196 } 1157 1197 1158 - typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter, 1159 - struct kvm_gfn_range *range); 1160 - 1161 - static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm, 1162 - struct kvm_gfn_range *range, 1163 - tdp_handler_t handler) 1164 - { 1165 - struct kvm_mmu_page *root; 1166 - struct tdp_iter iter; 1167 - bool ret = false; 1168 - 1169 - /* 1170 - * Don't support rescheduling, none of the MMU notifiers that funnel 1171 - * into this helper allow blocking; it'd be dead, wasteful code. 1172 - */ 1173 - for_each_tdp_mmu_root(kvm, root, range->slot->as_id) { 1174 - rcu_read_lock(); 1175 - 1176 - tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) 1177 - ret |= handler(kvm, &iter, range); 1178 - 1179 - rcu_read_unlock(); 1180 - } 1181 - 1182 - return ret; 1183 - } 1184 - 1185 1198 /* 1186 1199 * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero 1187 1200 * if any of the GFNs in the range have been accessed. ··· 1163 1230 * from the clear_young() or clear_flush_young() notifier, which uses the 1164 1231 * return value to determine if the page has been accessed. 1165 1232 */ 1166 - static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter, 1167 - struct kvm_gfn_range *range) 1233 + static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter) 1168 1234 { 1169 1235 u64 new_spte; 1170 - 1171 - /* If we have a non-accessed entry we don't need to change the pte. 
*/ 1172 - if (!is_accessed_spte(iter->old_spte)) 1173 - return false; 1174 1236 1175 1237 if (spte_ad_enabled(iter->old_spte)) { 1176 1238 iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep, ··· 1174 1246 iter->level); 1175 1247 new_spte = iter->old_spte & ~shadow_accessed_mask; 1176 1248 } else { 1177 - /* 1178 - * Capture the dirty status of the page, so that it doesn't get 1179 - * lost when the SPTE is marked for access tracking. 1180 - */ 1181 - if (is_writable_pte(iter->old_spte)) 1182 - kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte)); 1183 - 1184 1249 new_spte = mark_spte_for_access_track(iter->old_spte); 1185 1250 iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep, 1186 1251 iter->old_spte, new_spte, ··· 1182 1261 1183 1262 trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level, 1184 1263 iter->old_spte, new_spte); 1185 - return true; 1264 + } 1265 + 1266 + static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, 1267 + struct kvm_gfn_range *range, 1268 + bool test_only) 1269 + { 1270 + struct kvm_mmu_page *root; 1271 + struct tdp_iter iter; 1272 + bool ret = false; 1273 + 1274 + /* 1275 + * Don't support rescheduling, none of the MMU notifiers that funnel 1276 + * into this helper allow blocking; it'd be dead, wasteful code. Note, 1277 + * this helper must NOT be used to unmap GFNs, as it processes only 1278 + * valid roots! 
1279 + */ 1280 + for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) { 1281 + guard(rcu)(); 1282 + 1283 + tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) { 1284 + if (!is_accessed_spte(iter.old_spte)) 1285 + continue; 1286 + 1287 + if (test_only) 1288 + return true; 1289 + 1290 + ret = true; 1291 + kvm_tdp_mmu_age_spte(&iter); 1292 + } 1293 + } 1294 + 1295 + return ret; 1186 1296 } 1187 1297 1188 1298 bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) 1189 1299 { 1190 - return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range); 1191 - } 1192 - 1193 - static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter, 1194 - struct kvm_gfn_range *range) 1195 - { 1196 - return is_accessed_spte(iter->old_spte); 1300 + return __kvm_tdp_mmu_age_gfn_range(kvm, range, false); 1197 1301 } 1198 1302 1199 1303 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) 1200 1304 { 1201 - return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn); 1305 + return __kvm_tdp_mmu_age_gfn_range(kvm, range, true); 1202 1306 } 1203 1307 1204 1308 /* ··· 1314 1368 * not been linked in yet and thus is not reachable from any other CPU. 1315 1369 */ 1316 1370 for (i = 0; i < SPTE_ENT_PER_PAGE; i++) 1317 - sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, sp->role, i); 1371 + sp->spt[i] = make_small_spte(kvm, huge_spte, sp->role, i); 1318 1372 1319 1373 /* 1320 1374 * Replace the huge spte with a pointer to the populated lower level ··· 1447 1501 * from level, so it is valid to key off any shadow page to determine if 1448 1502 * write protection is needed for an entire tree. 
1449 1503 */ 1450 - return kvm_mmu_page_ad_need_write_protect(sp) || !kvm_ad_enabled(); 1504 + return kvm_mmu_page_ad_need_write_protect(sp) || !kvm_ad_enabled; 1451 1505 } 1452 1506 1453 - static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, 1454 - gfn_t start, gfn_t end) 1507 + static void clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root, 1508 + gfn_t start, gfn_t end) 1455 1509 { 1456 1510 const u64 dbit = tdp_mmu_need_write_protect(root) ? PT_WRITABLE_MASK : 1457 1511 shadow_dirty_mask; 1458 1512 struct tdp_iter iter; 1459 - bool spte_set = false; 1460 1513 1461 1514 rcu_read_lock(); 1462 1515 ··· 1476 1531 1477 1532 if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit)) 1478 1533 goto retry; 1479 - 1480 - spte_set = true; 1481 1534 } 1482 1535 1483 1536 rcu_read_unlock(); 1484 - return spte_set; 1485 1537 } 1486 1538 1487 1539 /* 1488 1540 * Clear the dirty status (D-bit or W-bit) of all the SPTEs mapping GFNs in the 1489 - * memslot. Returns true if an SPTE has been changed and the TLBs need to be 1490 - * flushed. 1541 + * memslot. 
1491 1542 */ 1492 - bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, 1543 + void kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, 1493 1544 const struct kvm_memory_slot *slot) 1494 1545 { 1495 1546 struct kvm_mmu_page *root; 1496 - bool spte_set = false; 1497 1547 1498 1548 lockdep_assert_held_read(&kvm->mmu_lock); 1499 1549 for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) 1500 - spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn, 1501 - slot->base_gfn + slot->npages); 1502 - 1503 - return spte_set; 1550 + clear_dirty_gfn_range(kvm, root, slot->base_gfn, 1551 + slot->base_gfn + slot->npages); 1504 1552 } 1505 1553 1506 1554 static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, ··· 1531 1593 trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level, 1532 1594 iter.old_spte, 1533 1595 iter.old_spte & ~dbit); 1534 - kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte)); 1535 1596 } 1536 1597 1537 1598 rcu_read_unlock(); ··· 1552 1615 clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot); 1553 1616 } 1554 1617 1555 - static void zap_collapsible_spte_range(struct kvm *kvm, 1556 - struct kvm_mmu_page *root, 1557 - const struct kvm_memory_slot *slot) 1618 + static int tdp_mmu_make_huge_spte(struct kvm *kvm, 1619 + struct tdp_iter *parent, 1620 + u64 *huge_spte) 1621 + { 1622 + struct kvm_mmu_page *root = spte_to_child_sp(parent->old_spte); 1623 + gfn_t start = parent->gfn; 1624 + gfn_t end = start + KVM_PAGES_PER_HPAGE(parent->level); 1625 + struct tdp_iter iter; 1626 + 1627 + tdp_root_for_each_leaf_pte(iter, root, start, end) { 1628 + /* 1629 + * Use the parent iterator when checking for forward progress so 1630 + * that KVM doesn't get stuck continuously trying to yield (i.e. 1631 + * returning -EAGAIN here and then failing the forward progress 1632 + * check in the caller ad nauseam). 
1633 + */ 1634 + if (tdp_mmu_iter_need_resched(kvm, parent)) 1635 + return -EAGAIN; 1636 + 1637 + *huge_spte = make_huge_spte(kvm, iter.old_spte, parent->level); 1638 + return 0; 1639 + } 1640 + 1641 + return -ENOENT; 1642 + } 1643 + 1644 + static void recover_huge_pages_range(struct kvm *kvm, 1645 + struct kvm_mmu_page *root, 1646 + const struct kvm_memory_slot *slot) 1558 1647 { 1559 1648 gfn_t start = slot->base_gfn; 1560 1649 gfn_t end = start + slot->npages; 1561 1650 struct tdp_iter iter; 1562 1651 int max_mapping_level; 1652 + bool flush = false; 1653 + u64 huge_spte; 1654 + int r; 1655 + 1656 + if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot))) 1657 + return; 1563 1658 1564 1659 rcu_read_lock(); 1565 1660 1566 1661 for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) { 1567 1662 retry: 1568 - if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) 1663 + if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) { 1664 + flush = false; 1569 1665 continue; 1666 + } 1570 1667 1571 1668 if (iter.level > KVM_MAX_HUGEPAGE_LEVEL || 1572 1669 !is_shadow_present_pte(iter.old_spte)) ··· 1624 1653 if (iter.gfn < start || iter.gfn >= end) 1625 1654 continue; 1626 1655 1627 - max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, 1628 - iter.gfn, PG_LEVEL_NUM); 1656 + max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn); 1629 1657 if (max_mapping_level < iter.level) 1630 1658 continue; 1631 1659 1632 - /* Note, a successful atomic zap also does a remote TLB flush. 
*/ 1633 - if (tdp_mmu_zap_spte_atomic(kvm, &iter)) 1660 + r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte); 1661 + if (r == -EAGAIN) 1634 1662 goto retry; 1663 + else if (r) 1664 + continue; 1665 + 1666 + if (tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte)) 1667 + goto retry; 1668 + 1669 + flush = true; 1635 1670 } 1671 + 1672 + if (flush) 1673 + kvm_flush_remote_tlbs_memslot(kvm, slot); 1636 1674 1637 1675 rcu_read_unlock(); 1638 1676 } 1639 1677 1640 1678 /* 1641 - * Zap non-leaf SPTEs (and free their associated page tables) which could 1642 - * be replaced by huge pages, for GFNs within the slot. 1679 + * Recover huge page mappings within the slot by replacing non-leaf SPTEs with 1680 + * huge SPTEs where possible. 1643 1681 */ 1644 - void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, 1645 - const struct kvm_memory_slot *slot) 1682 + void kvm_tdp_mmu_recover_huge_pages(struct kvm *kvm, 1683 + const struct kvm_memory_slot *slot) 1646 1684 { 1647 1685 struct kvm_mmu_page *root; 1648 1686 1649 1687 lockdep_assert_held_read(&kvm->mmu_lock); 1650 1688 for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) 1651 - zap_collapsible_spte_range(kvm, root, slot); 1689 + recover_huge_pages_range(kvm, root, slot); 1652 1690 } 1653 1691 1654 1692 /*
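
The aging rework in the hunks above folds the old age_gfn_range() and test_age_gfn() handlers into a single loop gated by test_only. The control flow can be sketched in plain userspace C as below. This is only an illustration: demo_age_range() and DEMO_ACCESSED_BIT are made-up names, a flat array stands in for the TDP page-table walk, and the real accessed mask is chosen at runtime rather than fixed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessed bit; KVM's real mask (shadow_accessed_mask) is set up at runtime. */
#define DEMO_ACCESSED_BIT (UINT64_C(1) << 5)

/*
 * Model of the merged helper: skip entries that were not accessed, return
 * on the first hit when only testing, otherwise clear the bit and keep going.
 */
static bool demo_age_range(uint64_t *sptes, size_t n, bool test_only)
{
	bool young = false;

	for (size_t i = 0; i < n; i++) {
		if (!(sptes[i] & DEMO_ACCESSED_BIT))
			continue;

		/* test_only never modifies an entry, so it can stop at the first hit. */
		if (test_only)
			return true;

		young = true;
		sptes[i] &= ~DEMO_ACCESSED_BIT;
	}

	return young;
}
```

Sharing one loop keeps the "is this entry accessed" check in a single place, which is presumably why the per-entry helper in the diff no longer needs a return value.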

+3 -3

arch/x86/kvm/mmu/tdp_mmu.h

···
 
 bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 			     const struct kvm_memory_slot *slot, int min_level);
-bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
+void kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
 				  const struct kvm_memory_slot *slot);
 void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				       struct kvm_memory_slot *slot,
 				       gfn_t gfn, unsigned long mask,
 				       bool wrprot);
-void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
-				       const struct kvm_memory_slot *slot);
+void kvm_tdp_mmu_recover_huge_pages(struct kvm *kvm,
+				    const struct kvm_memory_slot *slot);
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,

+1

arch/x86/kvm/mtrr.c

···
 #include <asm/mtrr.h>
 
 #include "cpuid.h"
+#include "x86.h"
 
 static u64 *find_mtrr(struct kvm_vcpu *vcpu, unsigned int msr)
 {

+1

arch/x86/kvm/reverse_cpuid.h

···
 #define X86_FEATURE_AVX_VNNI_INT8       KVM_X86_FEATURE(CPUID_7_1_EDX, 4)
 #define X86_FEATURE_AVX_NE_CONVERT      KVM_X86_FEATURE(CPUID_7_1_EDX, 5)
 #define X86_FEATURE_AMX_COMPLEX         KVM_X86_FEATURE(CPUID_7_1_EDX, 8)
+#define X86_FEATURE_AVX_VNNI_INT16      KVM_X86_FEATURE(CPUID_7_1_EDX, 10)
 #define X86_FEATURE_PREFETCHITI         KVM_X86_FEATURE(CPUID_7_1_EDX, 14)
 #define X86_FEATURE_AVX10               KVM_X86_FEATURE(CPUID_7_1_EDX, 19)
 

+2 -2

arch/x86/kvm/svm/nested.c

···
 	nested_svm_vmexit(svm);
 
 out:
-	kvm_vcpu_unmap(vcpu, &map, true);
+	kvm_vcpu_unmap(vcpu, &map);
 
 	return ret;
 }
···
 				       vmcb12->control.exit_int_info_err,
 				       KVM_ISA_SVM);
 
-	kvm_vcpu_unmap(vcpu, &map, true);
+	kvm_vcpu_unmap(vcpu, &map);
 
 	nested_svm_transition_tlb_flush(vcpu);
 

+7 -5

arch/x86/kvm/svm/sev.c

···
 
 	sev_es_sync_to_ghcb(svm);
 
-	kvm_vcpu_unmap(&svm->vcpu, &svm->sev_es.ghcb_map, true);
+	kvm_vcpu_unmap(&svm->vcpu, &svm->sev_es.ghcb_map);
 	svm->sev_es.ghcb = NULL;
 }
 
···
 	if (VALID_PAGE(svm->sev_es.snp_vmsa_gpa)) {
 		gfn_t gfn = gpa_to_gfn(svm->sev_es.snp_vmsa_gpa);
 		struct kvm_memory_slot *slot;
+		struct page *page;
 		kvm_pfn_t pfn;
 
 		slot = gfn_to_memslot(vcpu->kvm, gfn);
···
 		 * The new VMSA will be private memory guest memory, so
 		 * retrieve the PFN from the gmem backend.
 		 */
-		if (kvm_gmem_get_pfn(vcpu->kvm, slot, gfn, &pfn, NULL))
+		if (kvm_gmem_get_pfn(vcpu->kvm, slot, gfn, &pfn, &page, NULL))
 			return -EINVAL;
 
 		/*
···
 		 * changes then care should be taken to ensure
 		 * svm->sev_es.vmsa is pinned through some other means.
 		 */
-		kvm_release_pfn_clean(pfn);
+		kvm_release_page_clean(page);
 	}
 
 	/*
···
 	struct kvm_memory_slot *slot;
 	struct kvm *kvm = vcpu->kvm;
 	int order, rmp_level, ret;
+	struct page *page;
 	bool assigned;
 	kvm_pfn_t pfn;
 	gfn_t gfn;
···
 		return;
 	}
 
-	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &page, &order);
 	if (ret) {
 		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
 				    gpa);
···
 out:
 	trace_kvm_rmp_fault(vcpu, gpa, pfn, error_code, rmp_level, ret);
 out_no_trace:
-	put_page(pfn_to_page(pfn));
+	kvm_release_page_unused(page);
 }
 
 static bool is_pfn_range_shared(kvm_pfn_t start, kvm_pfn_t end)

+8 -5

arch/x86/kvm/svm/svm.c

···
 	svm_vcpu_init_msrpm(vcpu, svm->msrpm);
 
 	svm_init_osvw(vcpu);
-	vcpu->arch.microcode_version = 0x01000065;
+
+	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_STUFF_FEATURE_MSRS))
+		vcpu->arch.microcode_version = 0x01000065;
 	svm->tsc_ratio_msr = kvm_caps.default_tsc_scaling_ratio;
 
 	svm->nmi_masked = false;
···
 		svm_copy_vmloadsave_state(vmcb12, svm->vmcb);
 	}
 
-	kvm_vcpu_unmap(vcpu, &map, true);
+	kvm_vcpu_unmap(vcpu, &map);
 
 	return ret;
 }
···
 	svm_copy_vmrun_state(map_save.hva + 0x400,
 			     &svm->vmcb01.ptr->save);
 
-	kvm_vcpu_unmap(vcpu, &map_save, true);
+	kvm_vcpu_unmap(vcpu, &map_save);
 	return 0;
 }
 
···
 	svm->nested.nested_run_pending = 1;
 
 unmap_save:
-	kvm_vcpu_unmap(vcpu, &map_save, true);
+	kvm_vcpu_unmap(vcpu, &map_save);
 unmap_map:
-	kvm_vcpu_unmap(vcpu, &map, true);
+	kvm_vcpu_unmap(vcpu, &map);
 	return ret;
 }
 
···
 	.get_segment = svm_get_segment,
 	.set_segment = svm_set_segment,
 	.get_cpl = svm_get_cpl,
+	.get_cpl_no_cache = svm_get_cpl,
 	.get_cs_db_l_bits = svm_get_cs_db_l_bits,
 	.is_valid_cr0 = svm_is_valid_cr0,
 	.set_cr0 = svm_set_cr0,

+1

arch/x86/kvm/vmx/hyperv.c

···
 #include <linux/errno.h>
 #include <linux/smp.h>
 
+#include "x86.h"
 #include "../cpuid.h"
 #include "hyperv.h"
 #include "nested.h"

+1

arch/x86/kvm/vmx/main.c

···
 	.get_segment = vmx_get_segment,
 	.set_segment = vmx_set_segment,
 	.get_cpl = vmx_get_cpl,
+	.get_cpl_no_cache = vmx_get_cpl_no_cache,
 	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
 	.is_valid_cr0 = vmx_is_valid_cr0,
 	.set_cr0 = vmx_set_cr0,

+42 -35

arch/x86/kvm/vmx/nested.c

··· 7 7 #include <asm/debugreg.h> 8 8 #include <asm/mmu_context.h> 9 9 10 + #include "x86.h" 10 11 #include "cpuid.h" 11 12 #include "hyperv.h" 12 13 #include "mmu.h" ··· 17 16 #include "sgx.h" 18 17 #include "trace.h" 19 18 #include "vmx.h" 20 - #include "x86.h" 21 19 #include "smm.h" 22 20 23 21 static bool __read_mostly enable_shadow_vmcs = 1; ··· 231 231 struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu); 232 232 struct vcpu_vmx *vmx = to_vmx(vcpu); 233 233 234 - if (nested_vmx_is_evmptr12_valid(vmx)) { 235 - kvm_vcpu_unmap(vcpu, &vmx->nested.hv_evmcs_map, true); 236 - vmx->nested.hv_evmcs = NULL; 237 - } 238 - 234 + kvm_vcpu_unmap(vcpu, &vmx->nested.hv_evmcs_map); 235 + vmx->nested.hv_evmcs = NULL; 239 236 vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID; 240 237 241 238 if (hv_vcpu) { ··· 314 317 vcpu->arch.regs_dirty = 0; 315 318 } 316 319 320 + static void nested_put_vmcs12_pages(struct kvm_vcpu *vcpu) 321 + { 322 + struct vcpu_vmx *vmx = to_vmx(vcpu); 323 + 324 + kvm_vcpu_unmap(vcpu, &vmx->nested.apic_access_page_map); 325 + kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map); 326 + kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map); 327 + vmx->nested.pi_desc = NULL; 328 + } 329 + 317 330 /* 318 331 * Free whatever needs to be freed from vmx->nested when L1 goes down, or 319 332 * just stops using VMX. ··· 356 349 vmx->nested.cached_vmcs12 = NULL; 357 350 kfree(vmx->nested.cached_shadow_vmcs12); 358 351 vmx->nested.cached_shadow_vmcs12 = NULL; 359 - /* 360 - * Unpin physical memory we referred to in the vmcs02. The APIC access 361 - * page's backing page (yeah, confusing) shouldn't actually be accessed, 362 - * and if it is written, the contents are irrelevant. 
363 - */ 364 - kvm_vcpu_unmap(vcpu, &vmx->nested.apic_access_page_map, false); 365 - kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map, true); 366 - kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map, true); 367 - vmx->nested.pi_desc = NULL; 352 + 353 + nested_put_vmcs12_pages(vcpu); 368 354 369 355 kvm_mmu_free_roots(vcpu->kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL); 370 356 ··· 624 624 int msr; 625 625 unsigned long *msr_bitmap_l1; 626 626 unsigned long *msr_bitmap_l0 = vmx->nested.vmcs02.msr_bitmap; 627 - struct kvm_host_map *map = &vmx->nested.msr_bitmap_map; 627 + struct kvm_host_map map; 628 628 629 629 /* Nothing to do if the MSR bitmap is not in use. */ 630 630 if (!cpu_has_vmx_msr_bitmap() || ··· 647 647 return true; 648 648 } 649 649 650 - if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), map)) 650 + if (kvm_vcpu_map_readonly(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), &map)) 651 651 return false; 652 652 653 - msr_bitmap_l1 = (unsigned long *)map->hva; 653 + msr_bitmap_l1 = (unsigned long *)map.hva; 654 654 655 655 /* 656 656 * To keep the control flow simple, pay eight 8-byte writes (sixteen ··· 714 714 nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, 715 715 MSR_IA32_FLUSH_CMD, MSR_TYPE_W); 716 716 717 - kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false); 717 + kvm_vcpu_unmap(vcpu, &map); 718 718 719 719 vmx->nested.force_msr_bitmap_recalc = false; 720 720 ··· 3010 3010 return 0; 3011 3011 } 3012 3012 3013 + static bool is_l1_noncanonical_address_on_vmexit(u64 la, struct vmcs12 *vmcs12) 3014 + { 3015 + /* 3016 + * Check that the given linear address is canonical after a VM exit 3017 + * from L2, based on HOST_CR4.LA57 value that will be loaded for L1. 3018 + */ 3019 + u8 l1_address_bits_on_exit = (vmcs12->host_cr4 & X86_CR4_LA57) ? 
57 : 48; 3020 + 3021 + return !__is_canonical_address(la, l1_address_bits_on_exit); 3022 + } 3023 + 3013 3024 static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu, 3014 3025 struct vmcs12 *vmcs12) 3015 3026 { ··· 3031 3020 CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3))) 3032 3021 return -EINVAL; 3033 3022 3034 - if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) || 3035 - CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_eip, vcpu))) 3023 + if (CC(is_noncanonical_msr_address(vmcs12->host_ia32_sysenter_esp, vcpu)) || 3024 + CC(is_noncanonical_msr_address(vmcs12->host_ia32_sysenter_eip, vcpu))) 3036 3025 return -EINVAL; 3037 3026 3038 3027 if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) && ··· 3066 3055 CC(vmcs12->host_ss_selector == 0 && !ia32e)) 3067 3056 return -EINVAL; 3068 3057 3069 - if (CC(is_noncanonical_address(vmcs12->host_fs_base, vcpu)) || 3070 - CC(is_noncanonical_address(vmcs12->host_gs_base, vcpu)) || 3071 - CC(is_noncanonical_address(vmcs12->host_gdtr_base, vcpu)) || 3072 - CC(is_noncanonical_address(vmcs12->host_idtr_base, vcpu)) || 3073 - CC(is_noncanonical_address(vmcs12->host_tr_base, vcpu)) || 3074 - CC(is_noncanonical_address(vmcs12->host_rip, vcpu))) 3058 + if (CC(is_noncanonical_base_address(vmcs12->host_fs_base, vcpu)) || 3059 + CC(is_noncanonical_base_address(vmcs12->host_gs_base, vcpu)) || 3060 + CC(is_noncanonical_base_address(vmcs12->host_gdtr_base, vcpu)) || 3061 + CC(is_noncanonical_base_address(vmcs12->host_idtr_base, vcpu)) || 3062 + CC(is_noncanonical_base_address(vmcs12->host_tr_base, vcpu)) || 3063 + CC(is_l1_noncanonical_address_on_vmexit(vmcs12->host_rip, vmcs12))) 3075 3064 return -EINVAL; 3076 3065 3077 3066 /* ··· 3189 3178 } 3190 3179 3191 3180 if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS) && 3192 - (CC(is_noncanonical_address(vmcs12->guest_bndcfgs & PAGE_MASK, vcpu)) || 3181 + (CC(is_noncanonical_msr_address(vmcs12->guest_bndcfgs & PAGE_MASK, vcpu)) || 3193 3182 
CC((vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD)))) 3194 3183 return -EINVAL; 3195 3184 ··· 5038 5027 vmx_update_cpu_dirty_logging(vcpu); 5039 5028 } 5040 5029 5041 - /* Unpin physical memory we referred to in vmcs02 */ 5042 - kvm_vcpu_unmap(vcpu, &vmx->nested.apic_access_page_map, false); 5043 - kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map, true); 5044 - kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map, true); 5045 - vmx->nested.pi_desc = NULL; 5030 + nested_put_vmcs12_pages(vcpu); 5046 5031 5047 5032 if (vmx->nested.reload_vmcs01_apic_access_page) { 5048 5033 vmx->nested.reload_vmcs01_apic_access_page = false; ··· 5174 5167 * non-canonical form. This is the only check on the memory 5175 5168 * destination for long mode! 5176 5169 */ 5177 - exn = is_noncanonical_address(*ret, vcpu); 5170 + exn = is_noncanonical_address(*ret, vcpu, 0); 5178 5171 } else { 5179 5172 /* 5180 5173 * When not in long mode, the virtual/linear address is ··· 5985 5978 * invalidation. 5986 5979 */ 5987 5980 if (!operand.vpid || 5988 - is_noncanonical_address(operand.gla, vcpu)) 5981 + is_noncanonical_invlpg_address(operand.gla, vcpu)) 5989 5982 return nested_vmx_fail(vcpu, 5990 5983 VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); 5991 5984 vpid_sync_vcpu_addr(vpid02, operand.gla);
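
The new is_l1_noncanonical_address_on_vmexit() helper in this file picks the address width from the HOST_CR4.LA57 value in vmcs12 rather than from the vCPU's current state. The arithmetic behind that check can be sketched in userspace C as follows. All demo_* names are invented for illustration, and the sign-extension relies on arithmetic right shift of signed values, which mainstream compilers provide but ISO C does not guarantee.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend bit (vaddr_bits - 1) of the address into all higher bits. */
static uint64_t demo_canonical_address(uint64_t la, unsigned int vaddr_bits)
{
	unsigned int shift = 64 - vaddr_bits;

	/* Assumes arithmetic right shift for signed operands. */
	return (uint64_t)((int64_t)(la << shift) >> shift);
}

/* An address is canonical when sign-extension leaves it unchanged. */
static bool demo_is_canonical(uint64_t la, unsigned int vaddr_bits)
{
	return demo_canonical_address(la, vaddr_bits) == la;
}

/* Stand-in for X86_CR4_LA57 (CR4 bit 12). */
#define DEMO_CR4_LA57 (UINT64_C(1) << 12)

/* 57-bit linear addresses if L1's host CR4 enables LA57, else 48-bit. */
static bool demo_l1_noncanonical_on_vmexit(uint64_t la, uint64_t host_cr4)
{
	unsigned int bits = (host_cr4 & DEMO_CR4_LA57) ? 57 : 48;

	return !demo_is_canonical(la, bits);
}
```

With these definitions, 0x0000800000000000 is rejected under 4-level paging (bit 47 set, upper bits clear) but accepted when the LA57 bit is set in host_cr4, which is the distinction the HOST_RIP check in the diff depends on.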

+1 -1

arch/x86/kvm/vmx/pmu_intel.c

···
 		}
 		break;
 	case MSR_IA32_DS_AREA:
-		if (is_noncanonical_address(data, vcpu))
+		if (is_noncanonical_msr_address(data, vcpu))
 			return 1;
 
 		pmu->ds_area = data;

+2 -3

arch/x86/kvm/vmx/sgx.c

···
 
 #include <asm/sgx.h>
 
-#include "cpuid.h"
+#include "x86.h"
 #include "kvm_cache_regs.h"
 #include "nested.h"
 #include "sgx.h"
 #include "vmx.h"
-#include "x86.h"
 
 bool __read_mostly enable_sgx = 1;
 module_param_named(sgx, enable_sgx, bool, 0444);
···
 		fault = true;
 	} else if (likely(is_64_bit_mode(vcpu))) {
 		*gva = vmx_get_untagged_addr(vcpu, *gva, 0);
-		fault = is_noncanonical_address(*gva, vcpu);
+		fault = is_noncanonical_address(*gva, vcpu, 0);
 	} else {
 		*gva &= 0xffffffff;
 		fault = (s.unusable) ||

+64 -61

arch/x86/kvm/vmx/vmx.c

··· 483 483 ext, vpid, gva); 484 484 } 485 485 486 - noinline void invept_error(unsigned long ext, u64 eptp, gpa_t gpa) 486 + noinline void invept_error(unsigned long ext, u64 eptp) 487 487 { 488 - vmx_insn_failed("invept failed: ext=0x%lx eptp=%llx gpa=0x%llx\n", 489 - ext, eptp, gpa); 488 + vmx_insn_failed("invept failed: ext=0x%lx eptp=%llx\n", ext, eptp); 490 489 } 491 490 492 491 static DEFINE_PER_CPU(struct vmcs *, vmxarea); ··· 2284 2285 (!msr_info->host_initiated && 2285 2286 !guest_cpuid_has(vcpu, X86_FEATURE_MPX))) 2286 2287 return 1; 2287 - if (is_noncanonical_address(data & PAGE_MASK, vcpu) || 2288 + if (is_noncanonical_msr_address(data & PAGE_MASK, vcpu) || 2288 2289 (data & MSR_IA32_BNDCFGS_RSVD)) 2289 2290 return 1; 2290 2291 ··· 2449 2450 index = msr_info->index - MSR_IA32_RTIT_ADDR0_A; 2450 2451 if (index >= 2 * vmx->pt_desc.num_address_ranges) 2451 2452 return 1; 2452 - if (is_noncanonical_address(data, vcpu)) 2453 + if (is_noncanonical_msr_address(data, vcpu)) 2453 2454 return 1; 2454 2455 if (index % 2) 2455 2456 vmx->pt_desc.guest.addr_b[index / 2] = data; ··· 2457 2458 vmx->pt_desc.guest.addr_a[index / 2] = data; 2458 2459 break; 2459 2460 case MSR_IA32_PERF_CAPABILITIES: 2460 - if (data && !vcpu_to_pmu(vcpu)->version) 2461 - return 1; 2462 2461 if (data & PMU_CAP_LBR_FMT) { 2463 2462 if ((data & PMU_CAP_LBR_FMT) != 2464 2463 (kvm_caps.supported_perf_cap & PMU_CAP_LBR_FMT)) ··· 2546 2549 static bool cpu_has_sgx(void) 2547 2550 { 2548 2551 return cpuid_eax(0) >= 0x12 && (cpuid_eax(0x12) & BIT(0)); 2549 - } 2550 - 2551 - /* 2552 - * Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they 2553 - * can't be used due to errata where VM Exit may incorrectly clear 2554 - * IA32_PERF_GLOBAL_CTRL[34:32]. Work around the errata by using the 2555 - * MSR load mechanism to switch IA32_PERF_GLOBAL_CTRL. 
2556 - */ 2557 - static bool cpu_has_perf_global_ctrl_bug(void) 2558 - { 2559 - switch (boot_cpu_data.x86_vfm) { 2560 - case INTEL_NEHALEM_EP: /* AAK155 */ 2561 - case INTEL_NEHALEM: /* AAP115 */ 2562 - case INTEL_WESTMERE: /* AAT100 */ 2563 - case INTEL_WESTMERE_EP: /* BC86,AAY89,BD102 */ 2564 - case INTEL_NEHALEM_EX: /* BA97 */ 2565 - return true; 2566 - default: 2567 - break; 2568 - } 2569 - 2570 - return false; 2571 2552 } 2572 2553 2573 2554 static int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt, u32 msr, u32 *result) ··· 2705 2730 2706 2731 _vmentry_control &= ~n_ctrl; 2707 2732 _vmexit_control &= ~x_ctrl; 2733 + } 2734 + 2735 + /* 2736 + * Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they 2737 + * can't be used due to an errata where VM Exit may incorrectly clear 2738 + * IA32_PERF_GLOBAL_CTRL[34:32]. Workaround the errata by using the 2739 + * MSR load mechanism to switch IA32_PERF_GLOBAL_CTRL. 2740 + */ 2741 + switch (boot_cpu_data.x86_vfm) { 2742 + case INTEL_NEHALEM_EP: /* AAK155 */ 2743 + case INTEL_NEHALEM: /* AAP115 */ 2744 + case INTEL_WESTMERE: /* AAT100 */ 2745 + case INTEL_WESTMERE_EP: /* BC86,AAY89,BD102 */ 2746 + case INTEL_NEHALEM_EX: /* BA97 */ 2747 + _vmentry_control &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL; 2748 + _vmexit_control &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL; 2749 + pr_warn_once("VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL " 2750 + "does not work properly. 
Using workaround\n"); 2751 + break; 2752 + default: 2753 + break; 2708 2754 } 2709 2755 2710 2756 rdmsrl(MSR_IA32_VMX_BASIC, basic_msr); ··· 3566 3570 return vmx_read_guest_seg_base(to_vmx(vcpu), seg); 3567 3571 } 3568 3572 3569 - int vmx_get_cpl(struct kvm_vcpu *vcpu) 3573 + static int __vmx_get_cpl(struct kvm_vcpu *vcpu, bool no_cache) 3570 3574 { 3571 3575 struct vcpu_vmx *vmx = to_vmx(vcpu); 3576 + int ar; 3572 3577 3573 3578 if (unlikely(vmx->rmode.vm86_active)) 3574 3579 return 0; 3575 - else { 3576 - int ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS); 3577 - return VMX_AR_DPL(ar); 3578 - } 3580 + 3581 + if (no_cache) 3582 + ar = vmcs_read32(GUEST_SS_AR_BYTES); 3583 + else 3584 + ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS); 3585 + return VMX_AR_DPL(ar); 3586 + } 3587 + 3588 + int vmx_get_cpl(struct kvm_vcpu *vcpu) 3589 + { 3590 + return __vmx_get_cpl(vcpu, false); 3591 + } 3592 + 3593 + int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu) 3594 + { 3595 + return __vmx_get_cpl(vcpu, true); 3579 3596 } 3580 3597 3581 3598 static u32 vmx_segment_access_rights(struct kvm_segment *var) ··· 4431 4422 VM_ENTRY_LOAD_IA32_EFER | 4432 4423 VM_ENTRY_IA32E_MODE); 4433 4424 4434 - if (cpu_has_perf_global_ctrl_bug()) 4435 - vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL; 4436 - 4437 4425 return vmentry_ctrl; 4438 4426 } 4439 4427 ··· 4448 4442 if (vmx_pt_mode_is_system()) 4449 4443 vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP | 4450 4444 VM_EXIT_CLEAR_IA32_RTIT_CTL); 4451 - 4452 - if (cpu_has_perf_global_ctrl_bug()) 4453 - vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL; 4454 - 4455 4445 /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */ 4456 4446 return vmexit_ctrl & 4457 4447 ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER); ··· 4563 4561 * Update the nested MSR settings so that a nested VMM can/can't set 4564 4562 * controls for features that are/aren't exposed to the guest. 
4565 4563 */ 4566 - if (nested) { 4564 + if (nested && 4565 + kvm_check_has_quirk(vmx->vcpu.kvm, KVM_X86_QUIRK_STUFF_FEATURE_MSRS)) { 4567 4566 /* 4568 4567 * All features that can be added or removed to VMX MSRs must 4569 4568 * be supported in the first place for nested virtualization. ··· 4854 4851 4855 4852 init_vmcs(vmx); 4856 4853 4857 - if (nested) 4854 + if (nested && 4855 + kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_STUFF_FEATURE_MSRS)) 4858 4856 memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs)); 4859 4857 4860 4858 vcpu_setup_sgx_lepubkeyhash(vcpu); ··· 4868 4864 vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID; 4869 4865 #endif 4870 4866 4871 - vcpu->arch.microcode_version = 0x100000000ULL; 4867 + if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_STUFF_FEATURE_MSRS)) 4868 + vcpu->arch.microcode_version = 0x100000000ULL; 4872 4869 vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED; 4873 4870 4874 4871 /* ··· 6797 6792 struct kvm *kvm = vcpu->kvm; 6798 6793 struct kvm_memslots *slots = kvm_memslots(kvm); 6799 6794 struct kvm_memory_slot *slot; 6795 + struct page *refcounted_page; 6800 6796 unsigned long mmu_seq; 6801 6797 kvm_pfn_t pfn; 6798 + bool writable; 6802 6799 6803 6800 /* Defer reload until vmcs01 is the current VMCS. */ 6804 6801 if (is_guest_mode(vcpu)) { ··· 6836 6829 * controls the APIC-access page memslot, and only deletes the memslot 6837 6830 * if APICv is permanently inhibited, i.e. the memslot won't reappear. 
6838 6831 */ 6839 - pfn = gfn_to_pfn_memslot(slot, gfn); 6832 + pfn = __kvm_faultin_pfn(slot, gfn, FOLL_WRITE, &writable, &refcounted_page); 6840 6833 if (is_error_noslot_pfn(pfn)) 6841 6834 return; 6842 6835 6843 6836 read_lock(&vcpu->kvm->mmu_lock); 6844 - if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) { 6837 + if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) 6845 6838 kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu); 6846 - read_unlock(&vcpu->kvm->mmu_lock); 6847 - goto out; 6848 - } 6839 + else 6840 + vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn)); 6849 6841 6850 - vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn)); 6851 - read_unlock(&vcpu->kvm->mmu_lock); 6842 + /* 6843 + * Do not pin the APIC access page in memory so that it can be freely 6844 + * migrated, the MMU notifier will call us again if it is migrated or 6845 + * swapped out. KVM backs the memslot with anonymous memory, the pfn 6846 + * should always point at a refcounted page (if the pfn is valid). 6847 + */ 6848 + if (!WARN_ON_ONCE(!refcounted_page)) 6849 + kvm_release_page_clean(refcounted_page); 6852 6850 6853 6851 /* 6854 6852 * No need for a manual TLB flush at this point, KVM has already done a 6855 6853 * flush if there were SPTEs pointing at the previous page. 6856 6854 */ 6857 - out: 6858 - /* 6859 - * Do not pin apic access page in memory, the MMU notifier 6860 - * will call us again if it is migrated or swapped out. 6861 - */ 6862 - kvm_release_pfn_clean(pfn); 6855 + read_unlock(&vcpu->kvm->mmu_lock); 6863 6856 } 6864 6857 6865 6858 void vmx_hwapic_isr_update(int max_isr) ··· 8406 8399 8407 8400 if (setup_vmcs_config(&vmcs_config, &vmx_capability) < 0) 8408 8401 return -EIO; 8409 - 8410 - if (cpu_has_perf_global_ctrl_bug()) 8411 - pr_warn_once("VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL " 8412 - "does not work properly. Using workaround\n"); 8413 8402 8414 8403 if (boot_cpu_has(X86_FEATURE_NX)) 8415 8404 kvm_enable_efer_bits(EFER_NX);

+1 -2

arch/x86/kvm/vmx/vmx.h

···
 		struct kvm_host_map virtual_apic_map;
 		struct kvm_host_map pi_desc_map;
 
-		struct kvm_host_map msr_bitmap_map;
-
 		struct pi_desc *pi_desc;
 		bool pi_pending;
 		u16 posted_intr_nv;
···
 void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
 			unsigned long fs_base, unsigned long gs_base);
 int vmx_get_cpl(struct kvm_vcpu *vcpu);
+int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu);
 bool vmx_emulation_required(struct kvm_vcpu *vcpu);
 unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
 void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);

+8 -8

arch/x86/kvm/vmx/vmx_ops.h

···
 void vmclear_error(struct vmcs *vmcs, u64 phys_addr);
 void vmptrld_error(struct vmcs *vmcs, u64 phys_addr);
 void invvpid_error(unsigned long ext, u16 vpid, gva_t gva);
-void invept_error(unsigned long ext, u64 eptp, gpa_t gpa);
+void invept_error(unsigned long ext, u64 eptp);
 
 #ifndef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 /*
···
 	vmx_asm2(invvpid, "r"(ext), "m"(operand), ext, vpid, gva);
 }
 
-static inline void __invept(unsigned long ext, u64 eptp, gpa_t gpa)
+static inline void __invept(unsigned long ext, u64 eptp)
 {
 	struct {
-		u64 eptp, gpa;
-	} operand = {eptp, gpa};
-
-	vmx_asm2(invept, "r"(ext), "m"(operand), ext, eptp, gpa);
+		u64 eptp;
+		u64 reserved_0;
+	} operand = { eptp, 0 };
+	vmx_asm2(invept, "r"(ext), "m"(operand), ext, eptp);
 }
 
 static inline void vpid_sync_vcpu_single(int vpid)
···
 
 static inline void ept_sync_global(void)
 {
-	__invept(VMX_EPT_EXTENT_GLOBAL, 0, 0);
+	__invept(VMX_EPT_EXTENT_GLOBAL, 0);
 }
 
 static inline void ept_sync_context(u64 eptp)
 {
 	if (cpu_has_vmx_invept_context())
-		__invept(VMX_EPT_EXTENT_CONTEXT, eptp, 0);
+		__invept(VMX_EPT_EXTENT_CONTEXT, eptp);
 	else
 		ept_sync_global();
 }
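
The __invept() change above drops the gpa argument and names the second quadword of the operand reserved_0, always written as zero. A standalone sketch of that 128-bit layout is given below. The struct and function names are hypothetical, and the sketch only checks the layout; it does not execute INVEPT.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the 128-bit descriptor built in __invept(). */
struct demo_invept_desc {
	uint64_t eptp;		/* bits 63:0 */
	uint64_t reserved_0;	/* bits 127:64, written as zero */
};

_Static_assert(sizeof(struct demo_invept_desc) == 16,
	       "descriptor must be exactly 128 bits");
_Static_assert(offsetof(struct demo_invept_desc, reserved_0) == 8,
	       "reserved field must be the upper quadword");

/* Build a descriptor the same way the diff does: { eptp, 0 }. */
static struct demo_invept_desc demo_make_invept_desc(uint64_t eptp)
{
	struct demo_invept_desc d = { eptp, 0 };

	return d;
}
```

Naming the field instead of passing a caller-supplied value makes it harder for a non-zero value to end up in the reserved half by accident.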

+64 -77

arch/x86/kvm/x86.c

··· 451 451 MSR_IA32_UCODE_REV, 452 452 MSR_IA32_ARCH_CAPABILITIES, 453 453 MSR_IA32_PERF_CAPABILITIES, 454 + MSR_PLATFORM_INFO, 454 455 }; 455 456 456 457 static u32 msr_based_features[ARRAY_SIZE(msr_based_features_all_except_vmx) + ··· 666 665 667 666 if (msrs->registered) 668 667 kvm_on_user_return(&msrs->urn); 669 - } 670 - 671 - u64 kvm_get_apic_base(struct kvm_vcpu *vcpu) 672 - { 673 - return vcpu->arch.apic_base; 674 - } 675 - 676 - enum lapic_mode kvm_get_apic_mode(struct kvm_vcpu *vcpu) 677 - { 678 - return kvm_apic_mode(kvm_get_apic_base(vcpu)); 679 - } 680 - EXPORT_SYMBOL_GPL(kvm_get_apic_mode); 681 - 682 - int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info) 683 - { 684 - enum lapic_mode old_mode = kvm_get_apic_mode(vcpu); 685 - enum lapic_mode new_mode = kvm_apic_mode(msr_info->data); 686 - u64 reserved_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu) | 0x2ff | 687 - (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) ? 0 : X2APIC_ENABLE); 688 - 689 - if ((msr_info->data & reserved_bits) != 0 || new_mode == LAPIC_MODE_INVALID) 690 - return 1; 691 - if (!msr_info->host_initiated) { 692 - if (old_mode == LAPIC_MODE_X2APIC && new_mode == LAPIC_MODE_XAPIC) 693 - return 1; 694 - if (old_mode == LAPIC_MODE_DISABLED && new_mode == LAPIC_MODE_X2APIC) 695 - return 1; 696 - } 697 - 698 - kvm_lapic_set_base(vcpu, msr_info->data); 699 - kvm_recalculate_apic_map(vcpu->kvm); 700 - return 0; 701 668 } 702 669 703 670 /* ··· 1675 1706 case MSR_IA32_PERF_CAPABILITIES: 1676 1707 *data = kvm_caps.supported_perf_cap; 1677 1708 break; 1709 + case MSR_PLATFORM_INFO: 1710 + *data = MSR_PLATFORM_INFO_CPUID_FAULT; 1711 + break; 1678 1712 case MSR_IA32_UCODE_REV: 1679 1713 rdmsrl_safe(index, data); 1680 1714 break; ··· 1826 1854 case MSR_KERNEL_GS_BASE: 1827 1855 case MSR_CSTAR: 1828 1856 case MSR_LSTAR: 1829 - if (is_noncanonical_address(data, vcpu)) 1857 + if (is_noncanonical_msr_address(data, vcpu)) 1830 1858 return 1; 1831 1859 break; 1832 1860 case 
MSR_IA32_SYSENTER_EIP: ··· 1843 1871 * value, and that something deterministic happens if the guest 1844 1872 * invokes 64-bit SYSENTER. 1845 1873 */ 1846 - data = __canonical_address(data, vcpu_virt_addr_bits(vcpu)); 1874 + data = __canonical_address(data, max_host_virt_addr_bits()); 1847 1875 break; 1848 1876 case MSR_TSC_AUX: 1849 1877 if (!kvm_is_supported_user_return_msr(MSR_TSC_AUX)) ··· 2116 2144 static inline bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu) 2117 2145 { 2118 2146 xfer_to_guest_mode_prepare(); 2119 - return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) || 2120 - xfer_to_guest_mode_work_pending(); 2147 + 2148 + return READ_ONCE(vcpu->mode) == EXITING_GUEST_MODE || 2149 + kvm_request_pending(vcpu) || xfer_to_guest_mode_work_pending(); 2121 2150 } 2122 2151 2123 2152 /* ··· 3766 3793 vcpu->arch.microcode_version = data; 3767 3794 break; 3768 3795 case MSR_IA32_ARCH_CAPABILITIES: 3769 - if (!msr_info->host_initiated) 3770 - return 1; 3796 + if (!msr_info->host_initiated || 3797 + !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES)) 3798 + return KVM_MSR_RET_UNSUPPORTED; 3771 3799 vcpu->arch.arch_capabilities = data; 3772 3800 break; 3773 3801 case MSR_IA32_PERF_CAPABILITIES: 3774 - if (!msr_info->host_initiated) 3775 - return 1; 3802 + if (!msr_info->host_initiated || 3803 + !guest_cpuid_has(vcpu, X86_FEATURE_PDCM)) 3804 + return KVM_MSR_RET_UNSUPPORTED; 3805 + 3776 3806 if (data & ~kvm_caps.supported_perf_cap) 3777 3807 return 1; 3778 3808 ··· 3866 3890 case MSR_MTRRdefType: 3867 3891 return kvm_mtrr_set_msr(vcpu, msr, data); 3868 3892 case MSR_IA32_APICBASE: 3869 - return kvm_set_apic_base(vcpu, msr_info); 3893 + return kvm_apic_set_base(vcpu, data, msr_info->host_initiated); 3870 3894 case APIC_BASE_MSR ... 
APIC_BASE_MSR + 0xff: 3871 3895 return kvm_x2apic_msr_write(vcpu, msr, data); 3872 3896 case MSR_IA32_TSC_DEADLINE: ··· 4087 4111 vcpu->arch.osvw.status = data; 4088 4112 break; 4089 4113 case MSR_PLATFORM_INFO: 4090 - if (!msr_info->host_initiated || 4091 - (!(data & MSR_PLATFORM_INFO_CPUID_FAULT) && 4092 - cpuid_fault_enabled(vcpu))) 4114 + if (!msr_info->host_initiated) 4093 4115 return 1; 4094 4116 vcpu->arch.msr_platform_info = data; 4095 4117 break; ··· 4226 4252 msr_info->data = vcpu->arch.microcode_version; 4227 4253 break; 4228 4254 case MSR_IA32_ARCH_CAPABILITIES: 4229 - if (!msr_info->host_initiated && 4230 - !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES)) 4231 - return 1; 4255 + if (!guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES)) 4256 + return KVM_MSR_RET_UNSUPPORTED; 4232 4257 msr_info->data = vcpu->arch.arch_capabilities; 4233 4258 break; 4234 4259 case MSR_IA32_PERF_CAPABILITIES: 4235 - if (!msr_info->host_initiated && 4236 - !guest_cpuid_has(vcpu, X86_FEATURE_PDCM)) 4237 - return 1; 4260 + if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM)) 4261 + return KVM_MSR_RET_UNSUPPORTED; 4238 4262 msr_info->data = vcpu->arch.perf_capabilities; 4239 4263 break; 4240 4264 case MSR_IA32_POWER_CTL: ··· 4286 4314 msr_info->data = 1 << 24; 4287 4315 break; 4288 4316 case MSR_IA32_APICBASE: 4289 - msr_info->data = kvm_get_apic_base(vcpu); 4317 + msr_info->data = vcpu->arch.apic_base; 4290 4318 break; 4291 4319 case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff: 4292 4320 return kvm_x2apic_msr_read(vcpu, msr_info->index, &msr_info->data); ··· 5066 5094 int idx; 5067 5095 5068 5096 if (vcpu->preempted) { 5069 - vcpu->arch.preempted_in_kernel = kvm_arch_vcpu_in_kernel(vcpu); 5097 + /* 5098 + * Assume protected guests are in-kernel. Inefficient yielding 5099 + * due to false positives is preferable to never yielding due 5100 + * to false negatives. 
5101 + */ 5102 + vcpu->arch.preempted_in_kernel = vcpu->arch.guest_state_protected || 5103 + !kvm_x86_call(get_cpl_no_cache)(vcpu); 5070 5104 5071 5105 /* 5072 5106 * Take the srcu lock as memslots will be accessed to check the gfn ··· 8590 8612 addr, flags); 8591 8613 } 8592 8614 8615 + static bool emulator_is_canonical_addr(struct x86_emulate_ctxt *ctxt, 8616 + gva_t addr, unsigned int flags) 8617 + { 8618 + return !is_noncanonical_address(addr, emul_to_vcpu(ctxt), flags); 8619 + } 8620 + 8593 8621 static const struct x86_emulate_ops emulate_ops = { 8594 8622 .vm_bugged = emulator_vm_bugged, 8595 8623 .read_gpr = emulator_read_gpr, ··· 8642 8658 .triple_fault = emulator_triple_fault, 8643 8659 .set_xcr = emulator_set_xcr, 8644 8660 .get_untagged_addr = emulator_get_untagged_addr, 8661 + .is_canonical_addr = emulator_is_canonical_addr, 8645 8662 }; 8646 8663 8647 8664 static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) ··· 10144 10159 10145 10160 kvm_run->if_flag = kvm_x86_call(get_if_flag)(vcpu); 10146 10161 kvm_run->cr8 = kvm_get_cr8(vcpu); 10147 - kvm_run->apic_base = kvm_get_apic_base(vcpu); 10162 + kvm_run->apic_base = vcpu->arch.apic_base; 10148 10163 10149 10164 kvm_run->ready_for_interrupt_injection = 10150 10165 pic_in_kernel(vcpu->kvm) || ··· 10561 10576 * deleted if any vCPU has xAPIC virtualization and x2APIC enabled, but 10562 10577 * and hardware doesn't support x2APIC virtualization. E.g. some AMD 10563 10578 * CPUs support AVIC but not x2APIC. KVM still allows enabling AVIC in 10564 - * this case so that KVM can the AVIC doorbell to inject interrupts to 10565 - * running vCPUs, but KVM must not create SPTEs for the APIC base as 10579 + * this case so that KVM can use the AVIC doorbell to inject interrupts 10580 + * to running vCPUs, but KVM must not create SPTEs for the APIC base as 10566 10581 * the vCPU would incorrectly be able to access the vAPIC page via MMIO 10567 10582 * despite being in x2APIC mode. 
For simplicity, inhibiting the APIC 10568 10583 * access page is sticky. ··· 10591 10606 if (!!old != !!new) { 10592 10607 /* 10593 10608 * Kick all vCPUs before setting apicv_inhibit_reasons to avoid 10594 - * false positives in the sanity check WARN in svm_vcpu_run(). 10609 + * false positives in the sanity check WARN in vcpu_enter_guest(). 10595 10610 * This task will wait for all vCPUs to ack the kick IRQ before 10596 10611 * updating apicv_inhibit_reasons, and all other vCPUs will 10597 10612 * block on acquiring apicv_update_lock so that vCPUs can't 10598 - * redo svm_vcpu_run() without seeing the new inhibit state. 10613 + * redo vcpu_enter_guest() without seeing the new inhibit state. 10599 10614 * 10600 10615 * Note, holding apicv_update_lock and taking it in the read 10601 10616 * side (handling the request) also prevents other vCPUs from ··· 11696 11711 sregs->cr4 = kvm_read_cr4(vcpu); 11697 11712 sregs->cr8 = kvm_get_cr8(vcpu); 11698 11713 sregs->efer = vcpu->arch.efer; 11699 - sregs->apic_base = kvm_get_apic_base(vcpu); 11714 + sregs->apic_base = vcpu->arch.apic_base; 11700 11715 } 11701 11716 11702 11717 static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) ··· 11873 11888 static int __set_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs, 11874 11889 int *mmu_reset_needed, bool update_pdptrs) 11875 11890 { 11876 - struct msr_data apic_base_msr; 11877 11891 int idx; 11878 11892 struct desc_ptr dt; 11879 11893 11880 11894 if (!kvm_is_valid_sregs(vcpu, sregs)) 11881 11895 return -EINVAL; 11882 11896 11883 - apic_base_msr.data = sregs->apic_base; 11884 - apic_base_msr.host_initiated = true; 11885 - if (kvm_set_apic_base(vcpu, &apic_base_msr)) 11897 + if (kvm_apic_set_base(vcpu, sregs->apic_base, true)) 11886 11898 return -EINVAL; 11887 11899 11888 11900 if (vcpu->arch.guest_state_protected) ··· 12281 12299 12282 12300 kvm_async_pf_hash_reset(vcpu); 12283 12301 12284 - vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap; 
12302 + if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_STUFF_FEATURE_MSRS)) { 12303 + vcpu->arch.arch_capabilities = kvm_get_arch_capabilities(); 12304 + vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT; 12305 + vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap; 12306 + } 12285 12307 kvm_pmu_init(vcpu); 12286 12308 12287 12309 vcpu->arch.pending_external_vector = -1; ··· 12299 12313 if (r) 12300 12314 goto free_guest_fpu; 12301 12315 12302 - vcpu->arch.arch_capabilities = kvm_get_arch_capabilities(); 12303 - vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT; 12304 12316 kvm_xen_init_vcpu(vcpu); 12305 12317 vcpu_load(vcpu); 12306 12318 kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz); ··· 13088 13104 13089 13105 if (!log_dirty_pages) { 13090 13106 /* 13091 - * Dirty logging tracks sptes in 4k granularity, meaning that 13092 - * large sptes have to be split. If live migration succeeds, 13093 - * the guest in the source machine will be destroyed and large 13094 - * sptes will be created in the destination. However, if the 13095 - * guest continues to run in the source machine (for example if 13096 - * live migration fails), small sptes will remain around and 13097 - * cause bad performance. 13107 + * Recover huge page mappings in the slot now that dirty logging 13108 + * is disabled, i.e. now that KVM does not have to track guest 13109 + * writes at 4KiB granularity. 13098 13110 * 13099 - * Scan sptes if dirty logging has been stopped, dropping those 13100 - * which can be collapsed into a single large-page spte. Later 13101 - * page faults will create the large-page sptes. 13111 + * Dirty logging might be disabled by userspace if an ongoing VM 13112 + * live migration is cancelled and the VM must continue running 13113 + * on the source. 
13102 13114 */ 13103 - kvm_mmu_zap_collapsible_sptes(kvm, new); 13115 + kvm_mmu_recover_huge_pages(kvm, new); 13104 13116 } else { 13105 13117 /* 13106 13118 * Initially-all-set does not require write protecting any page, ··· 13187 13207 13188 13208 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu) 13189 13209 { 13210 + WARN_ON_ONCE(!kvm_arch_pmi_in_guest(vcpu)); 13211 + 13190 13212 if (vcpu->arch.guest_state_protected) 13191 13213 return true; 13192 13214 ··· 13197 13215 13198 13216 unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu) 13199 13217 { 13218 + WARN_ON_ONCE(!kvm_arch_pmi_in_guest(vcpu)); 13219 + 13220 + if (vcpu->arch.guest_state_protected) 13221 + return 0; 13222 + 13200 13223 return kvm_rip_read(vcpu); 13201 13224 } 13202 13225 ··· 13717 13730 * invalidation. 13718 13731 */ 13719 13732 if ((!pcid_enabled && (operand.pcid != 0)) || 13720 - is_noncanonical_address(operand.gla, vcpu)) { 13733 + is_noncanonical_invlpg_address(operand.gla, vcpu)) { 13721 13734 kvm_inject_gp(vcpu, 0); 13722 13735 return 1; 13723 13736 }

+46 -2

arch/x86/kvm/x86.h

··· 8 8 #include <asm/pvclock.h> 9 9 #include "kvm_cache_regs.h" 10 10 #include "kvm_emulate.h" 11 + #include "cpuid.h" 11 12 12 13 struct kvm_caps { 13 14 /* control of guest tsc rate supported? */ ··· 234 233 return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48; 235 234 } 236 235 237 - static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu) 236 + static inline u8 max_host_virt_addr_bits(void) 238 237 { 239 - return !__is_canonical_address(la, vcpu_virt_addr_bits(vcpu)); 238 + return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48; 239 + } 240 + 241 + /* 242 + * x86 MSRs which contain linear addresses, x86 hidden segment bases, and 243 + * IDT/GDT bases have static canonicality checks, the size of which depends 244 + * only on the CPU's support for 5-level paging, rather than on the state of 245 + * CR4.LA57. This applies to both WRMSR and to other instructions that set 246 + * their values, e.g. SGDT. 247 + * 248 + * KVM passes through most of these MSRS and also doesn't intercept the 249 + * instructions that set the hidden segment bases. 250 + * 251 + * Because of this, to be consistent with hardware, even if the guest doesn't 252 + * have LA57 enabled in its CPUID, perform canonicality checks based on *host* 253 + * support for 5 level paging. 254 + * 255 + * Finally, instructions which are related to MMU invalidation of a given 256 + * linear address, also have a similar static canonical check on address. 257 + * This allows for example to invalidate 5-level addresses of a guest from a 258 + * host which uses 4-level paging. 
259 + */ 260 + static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu, 261 + unsigned int flags) 262 + { 263 + if (flags & (X86EMUL_F_INVLPG | X86EMUL_F_MSR | X86EMUL_F_DT_LOAD)) 264 + return !__is_canonical_address(la, max_host_virt_addr_bits()); 265 + else 266 + return !__is_canonical_address(la, vcpu_virt_addr_bits(vcpu)); 267 + } 268 + 269 + static inline bool is_noncanonical_msr_address(u64 la, struct kvm_vcpu *vcpu) 270 + { 271 + return is_noncanonical_address(la, vcpu, X86EMUL_F_MSR); 272 + } 273 + 274 + static inline bool is_noncanonical_base_address(u64 la, struct kvm_vcpu *vcpu) 275 + { 276 + return is_noncanonical_address(la, vcpu, X86EMUL_F_DT_LOAD); 277 + } 278 + 279 + static inline bool is_noncanonical_invlpg_address(u64 la, struct kvm_vcpu *vcpu) 280 + { 281 + return is_noncanonical_address(la, vcpu, X86EMUL_F_INVLPG); 240 282 } 241 283 242 284 static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
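
The canonicality rule described in the comment above can be sketched outside the kernel. This is a minimal illustration under stated assumptions, not KVM's implementation: the `SKETCH_F_*` flag values, the `is_canonical()` helper, and the two booleans standing in for the guest's CR4.LA57 state and the host's LA57 support are all hypothetical names introduced for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch; not the kernel's X86EMUL_F_* encodings. */
#define SKETCH_F_INVLPG  (1u << 0)
#define SKETCH_F_MSR     (1u << 1)
#define SKETCH_F_DT_LOAD (1u << 2)

/*
 * An address is canonical for a given width when bits 63..(bits - 1) are
 * all zeros or all ones, i.e. it is a sign extension of bit (bits - 1).
 */
static bool is_canonical(uint64_t la, unsigned int bits)
{
	uint64_t upper = la >> (bits - 1);

	return upper == 0 || upper == (UINT64_MAX >> (bits - 1));
}

/*
 * MSR, descriptor-table and INVLPG-style checks use the width the host CPU
 * supports; every other check uses the width selected by the guest's CR4.LA57.
 */
static bool is_noncanonical(uint64_t la, bool guest_cr4_la57,
			    bool host_has_la57, unsigned int flags)
{
	unsigned int bits;

	if (flags & (SKETCH_F_INVLPG | SKETCH_F_MSR | SKETCH_F_DT_LOAD))
		bits = host_has_la57 ? 57 : 48;
	else
		bits = guest_cr4_la57 ? 57 : 48;

	return !is_canonical(la, bits);
}
```

With a 4-level guest on a host that supports 5-level paging, an address with only bit 47 set is rejected for an ordinary memory access but accepted for an MSR write, which matches the behaviour the comment attributes to hardware.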

+45

drivers/firmware/psci/psci.c

··· 78 78 79 79 static u32 psci_cpu_suspend_feature; 80 80 static bool psci_system_reset2_supported; 81 + static bool psci_system_off2_hibernate_supported; 81 82 82 83 static inline bool psci_has_ext_power_state(void) 83 84 { ··· 334 333 invoke_psci_fn(PSCI_0_2_FN_SYSTEM_OFF, 0, 0, 0); 335 334 } 336 335 336 + #ifdef CONFIG_HIBERNATION 337 + static int psci_sys_hibernate(struct sys_off_data *data) 338 + { 339 + /* 340 + * If no hibernate type is specified SYSTEM_OFF2 defaults to selecting 341 + * HIBERNATE_OFF. 342 + * 343 + * There are hypervisors in the wild that do not align with the spec and 344 + * reject calls that explicitly provide a hibernate type. For 345 + * compatibility with these nonstandard implementations, pass 0 as the 346 + * type. 347 + */ 348 + if (system_entering_hibernation()) 349 + invoke_psci_fn(PSCI_FN_NATIVE(1_3, SYSTEM_OFF2), 0, 0, 0); 350 + return NOTIFY_DONE; 351 + } 352 + 353 + static int __init psci_hibernate_init(void) 354 + { 355 + if (psci_system_off2_hibernate_supported) { 356 + /* Higher priority than EFI shutdown, but only for hibernate */ 357 + register_sys_off_handler(SYS_OFF_MODE_POWER_OFF, 358 + SYS_OFF_PRIO_FIRMWARE + 2, 359 + psci_sys_hibernate, NULL); 360 + } 361 + return 0; 362 + } 363 + subsys_initcall(psci_hibernate_init); 364 + #endif 365 + 337 366 static int psci_features(u32 psci_func_id) 338 367 { 339 368 return invoke_psci_fn(PSCI_1_0_FN_PSCI_FEATURES, ··· 395 364 PSCI_ID_NATIVE(1_1, SYSTEM_RESET2), 396 365 PSCI_ID(1_1, MEM_PROTECT), 397 366 PSCI_ID_NATIVE(1_1, MEM_PROTECT_CHECK_RANGE), 367 + PSCI_ID_NATIVE(1_3, SYSTEM_OFF2), 398 368 }; 399 369 400 370 static int psci_debugfs_read(struct seq_file *s, void *data) ··· 557 525 psci_system_reset2_supported = true; 558 526 } 559 527 528 + static void __init psci_init_system_off2(void) 529 + { 530 + int ret; 531 + 532 + ret = psci_features(PSCI_FN_NATIVE(1_3, SYSTEM_OFF2)); 533 + if (ret < 0) 534 + return; 535 + 536 + if (ret & PSCI_1_3_OFF_TYPE_HIBERNATE_OFF) 537 + 
psci_system_off2_hibernate_supported = true; 538 + } 539 + 560 540 static void __init psci_init_system_suspend(void) 561 541 { 562 542 int ret; ··· 699 655 psci_init_cpu_suspend(); 700 656 psci_init_system_suspend(); 701 657 psci_init_system_reset2(); 658 + psci_init_system_off2(); 702 659 kvm_init_hyp_services(); 703 660 } 704 661
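
The feature probe in `psci_init_system_off2()` reduces to interpreting one signed return value. A minimal userspace sketch of that decision follows; `off2_hibernate_supported()` is a hypothetical helper name, and the bit value is copied from the `PSCI_1_3_OFF_TYPE_HIBERNATE_OFF` definition added elsewhere in this diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* BIT(0), as defined for PSCI_1_3_OFF_TYPE_HIBERNATE_OFF in this diff. */
#define SKETCH_OFF_TYPE_HIBERNATE_OFF (1u << 0)

/*
 * Hypothetical helper: decide whether hibernate via SYSTEM_OFF2 is usable,
 * given the value returned by PSCI_FEATURES for SYSTEM_OFF2.
 */
static bool off2_hibernate_supported(int32_t features_ret)
{
	/* A negative return is a PSCI error code, e.g. "not supported". */
	if (features_ret < 0)
		return false;

	/* Otherwise the return value is a bitmap of supported off types. */
	return ((uint32_t)features_ret & SKETCH_OFF_TYPE_HIBERNATE_OFF) != 0;
}
```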

+84 -18

drivers/irqchip/irq-loongson-eiointc.c

··· 14 14 #include <linux/irqdomain.h> 15 15 #include <linux/irqchip/chained_irq.h> 16 16 #include <linux/kernel.h> 17 + #include <linux/kvm_para.h> 17 18 #include <linux/syscore_ops.h> 18 19 #include <asm/numa.h> 19 20 ··· 27 26 #define EIOINTC_REG_ISR 0x1800 28 27 #define EIOINTC_REG_ROUTE 0x1c00 29 28 29 + #define EXTIOI_VIRT_FEATURES 0x40000000 30 + #define EXTIOI_HAS_VIRT_EXTENSION BIT(0) 31 + #define EXTIOI_HAS_ENABLE_OPTION BIT(1) 32 + #define EXTIOI_HAS_INT_ENCODE BIT(2) 33 + #define EXTIOI_HAS_CPU_ENCODE BIT(3) 34 + #define EXTIOI_VIRT_CONFIG 0x40000004 35 + #define EXTIOI_ENABLE BIT(1) 36 + #define EXTIOI_ENABLE_INT_ENCODE BIT(2) 37 + #define EXTIOI_ENABLE_CPU_ENCODE BIT(3) 38 + 30 39 #define VEC_REG_COUNT 4 31 40 #define VEC_COUNT_PER_REG 64 32 41 #define VEC_COUNT (VEC_REG_COUNT * VEC_COUNT_PER_REG) 33 42 #define VEC_REG_IDX(irq_id) ((irq_id) / VEC_COUNT_PER_REG) 34 43 #define VEC_REG_BIT(irq_id) ((irq_id) % VEC_COUNT_PER_REG) 35 44 #define EIOINTC_ALL_ENABLE 0xffffffff 45 + #define EIOINTC_ALL_ENABLE_VEC_MASK(vector) (EIOINTC_ALL_ENABLE & ~BIT(vector & 0x1f)) 46 + #define EIOINTC_REG_ENABLE_VEC(vector) (EIOINTC_REG_ENABLE + ((vector >> 5) << 2)) 47 + #define EIOINTC_USE_CPU_ENCODE BIT(0) 36 48 37 49 #define MAX_EIO_NODES (NR_CPUS / CORES_PER_EIO_NODE) 50 + 51 + /* 52 + * Routing registers are 32bit, and there is 8-bit route setting for every 53 + * interrupt vector. So one Route register contains four vectors routing 54 + * information. 
55 + */ 56 + #define EIOINTC_REG_ROUTE_VEC(vector) (EIOINTC_REG_ROUTE + (vector & ~0x03)) 57 + #define EIOINTC_REG_ROUTE_VEC_SHIFT(vector) ((vector & 0x03) << 3) 58 + #define EIOINTC_REG_ROUTE_VEC_MASK(vector) (0xff << EIOINTC_REG_ROUTE_VEC_SHIFT(vector)) 38 59 39 60 static int nr_pics; 40 61 ··· 67 44 cpumask_t cpuspan_map; 68 45 struct fwnode_handle *domain_handle; 69 46 struct irq_domain *eiointc_domain; 47 + int flags; 70 48 }; 71 49 72 50 static struct eiointc_priv *eiointc_priv[MAX_IO_PICS]; ··· 83 59 84 60 static int cpu_to_eio_node(int cpu) 85 61 { 86 - return cpu_logical_map(cpu) / CORES_PER_EIO_NODE; 62 + if (!kvm_para_has_feature(KVM_FEATURE_VIRT_EXTIOI)) 63 + return cpu_logical_map(cpu) / CORES_PER_EIO_NODE; 64 + else 65 + return cpu_logical_map(cpu) / CORES_PER_VEIO_NODE; 87 66 } 88 67 89 68 #ifdef CONFIG_SMP ··· 116 89 } 117 90 } 118 91 92 + static void veiointc_set_irq_route(unsigned int vector, unsigned int cpu) 93 + { 94 + unsigned long reg = EIOINTC_REG_ROUTE_VEC(vector); 95 + unsigned int data; 96 + 97 + data = iocsr_read32(reg); 98 + data &= ~EIOINTC_REG_ROUTE_VEC_MASK(vector); 99 + data |= cpu_logical_map(cpu) << EIOINTC_REG_ROUTE_VEC_SHIFT(vector); 100 + iocsr_write32(data, reg); 101 + } 102 + 119 103 static DEFINE_RAW_SPINLOCK(affinity_lock); 120 104 121 105 static int eiointc_set_irq_affinity(struct irq_data *d, const struct cpumask *affinity, bool force) ··· 145 107 } 146 108 147 109 vector = d->hwirq; 148 - regaddr = EIOINTC_REG_ENABLE + ((vector >> 5) << 2); 110 + regaddr = EIOINTC_REG_ENABLE_VEC(vector); 149 111 150 - /* Mask target vector */ 151 - csr_any_send(regaddr, EIOINTC_ALL_ENABLE & (~BIT(vector & 0x1F)), 152 - 0x0, priv->node * CORES_PER_EIO_NODE); 112 + if (priv->flags & EIOINTC_USE_CPU_ENCODE) { 113 + iocsr_write32(EIOINTC_ALL_ENABLE_VEC_MASK(vector), regaddr); 114 + veiointc_set_irq_route(vector, cpu); 115 + iocsr_write32(EIOINTC_ALL_ENABLE, regaddr); 116 + } else { 117 + /* Mask target vector */ 118 + csr_any_send(regaddr, 
EIOINTC_ALL_ENABLE_VEC_MASK(vector), 119 + 0x0, priv->node * CORES_PER_EIO_NODE); 153 120 154 - /* Set route for target vector */ 155 - eiointc_set_irq_route(vector, cpu, priv->node, &priv->node_map); 121 + /* Set route for target vector */ 122 + eiointc_set_irq_route(vector, cpu, priv->node, &priv->node_map); 156 123 157 - /* Unmask target vector */ 158 - csr_any_send(regaddr, EIOINTC_ALL_ENABLE, 159 - 0x0, priv->node * CORES_PER_EIO_NODE); 124 + /* Unmask target vector */ 125 + csr_any_send(regaddr, EIOINTC_ALL_ENABLE, 126 + 0x0, priv->node * CORES_PER_EIO_NODE); 127 + } 160 128 161 129 irq_data_update_effective_affinity(d, cpumask_of(cpu)); 162 130 ··· 186 142 187 143 static int eiointc_router_init(unsigned int cpu) 188 144 { 189 - int i, bit; 190 - uint32_t data; 191 - uint32_t node = cpu_to_eio_node(cpu); 192 - int index = eiointc_index(node); 145 + int i, bit, cores, index, node; 146 + unsigned int data; 147 + 148 + node = cpu_to_eio_node(cpu); 149 + index = eiointc_index(node); 193 150 194 151 if (index < 0) { 195 152 pr_err("Error: invalid nodemap!\n"); 196 - return -1; 153 + return -EINVAL; 197 154 } 198 155 199 - if ((cpu_logical_map(cpu) % CORES_PER_EIO_NODE) == 0) { 156 + if (!(eiointc_priv[index]->flags & EIOINTC_USE_CPU_ENCODE)) 157 + cores = CORES_PER_EIO_NODE; 158 + else 159 + cores = CORES_PER_VEIO_NODE; 160 + 161 + if ((cpu_logical_map(cpu) % cores) == 0) { 200 162 eiointc_enable(); 201 163 202 164 for (i = 0; i < eiointc_priv[0]->vec_count / 32; i++) { ··· 218 168 219 169 for (i = 0; i < eiointc_priv[0]->vec_count / 4; i++) { 220 170 /* Route to Node-0 Core-0 */ 221 - if (index == 0) 171 + if (eiointc_priv[index]->flags & EIOINTC_USE_CPU_ENCODE) 172 + bit = cpu_logical_map(0); 173 + else if (index == 0) 222 174 bit = BIT(cpu_logical_map(0)); 223 175 else 224 176 bit = (eiointc_priv[index]->node << 4) | 1; ··· 427 375 static int __init eiointc_init(struct eiointc_priv *priv, int parent_irq, 428 376 u64 node_map) 429 377 { 430 - int i; 378 + int i, 
val; 431 379 432 380 node_map = node_map ? node_map : -1ULL; 433 381 for_each_possible_cpu(i) { ··· 445 393 if (!priv->eiointc_domain) { 446 394 pr_err("loongson-extioi: cannot add IRQ domain\n"); 447 395 return -ENOMEM; 396 + } 397 + 398 + if (kvm_para_has_feature(KVM_FEATURE_VIRT_EXTIOI)) { 399 + val = iocsr_read32(EXTIOI_VIRT_FEATURES); 400 + /* 401 + * With EXTIOI_ENABLE_CPU_ENCODE set 402 + * interrupts can route to 256 vCPUs. 403 + */ 404 + if (val & EXTIOI_HAS_CPU_ENCODE) { 405 + val = iocsr_read32(EXTIOI_VIRT_CONFIG); 406 + val |= EXTIOI_ENABLE_CPU_ENCODE; 407 + iocsr_write32(val, EXTIOI_VIRT_CONFIG); 408 + priv->flags = EIOINTC_USE_CPU_ENCODE; 409 + } 448 410 } 449 411 450 412 eiointc_priv[nr_pics++] = priv;
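
The routing arithmetic behind `EIOINTC_REG_ROUTE_VEC()`, `EIOINTC_REG_ROUTE_VEC_SHIFT()` and `veiointc_set_irq_route()` can be checked in isolation. The sketch below is an illustration only: the function names are hypothetical, the register is modelled as a plain `uint32_t` instead of an IOCSR access, and unsigned constants are used so that the shift by 24 stays well defined in portable C.

```c
#include <stdint.h>

#define SKETCH_EIOINTC_REG_ROUTE 0x1c00u	/* EIOINTC_REG_ROUTE in the diff */

/* Each 32-bit route register holds four 8-bit entries, one per vector. */
static uint32_t route_reg(unsigned int vector)
{
	return SKETCH_EIOINTC_REG_ROUTE + (vector & ~0x03u);
}

/* Bit offset of this vector's 8-bit entry inside its route register. */
static unsigned int route_shift(unsigned int vector)
{
	return (vector & 0x03u) << 3;
}

/* Read-modify-write of one entry: clear the old byte, then store the CPU id. */
static uint32_t route_update(uint32_t old, unsigned int vector, uint8_t cpu)
{
	unsigned int shift = route_shift(vector);

	old &= ~(UINT32_C(0xff) << shift);
	old |= (uint32_t)cpu << shift;
	return old;
}
```

For vector 37, the entry lives in the register at offset 0x1c24, in bits 15:8. Because the CPU-encode mode stores a CPU number instead of a four-bit bitmap, a single byte can name any of 256 vCPUs.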

+3

include/kvm/arm_arch_timer.h

···
 void kvm_timer_cpu_up(void);
 void kvm_timer_cpu_down(void);
 
+/* CNTKCTL_EL1 valid bits as of DDI0487J.a */
+#define CNTKCTL_VALID_BITS	(BIT(17) | GENMASK_ULL(9, 0))
+
 static inline bool has_cntpoff(void)
 {
 	return (has_vhe() && cpus_have_final_cap(ARM64_HAS_ECV_CNTPOFF));
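
For reference, the mask above evaluates to `0x203ff`. The sketch below re-creates it with local stand-ins for `BIT()` and `GENMASK_ULL()`. Both stand-ins and the `cntkctl_sanitise()` helper are hypothetical, and masking off reserved bits is only one plausible use of the constant.

```c
#include <stdint.h>

/* Local stand-ins for the kernel's BIT() and GENMASK_ULL() helpers. */
#define SKETCH_BIT(n)		(UINT64_C(1) << (n))
#define SKETCH_GENMASK_ULL(h, l) \
	((~UINT64_C(0) >> (63 - (h))) & (~UINT64_C(0) << (l)))

#define SKETCH_CNTKCTL_VALID_BITS (SKETCH_BIT(17) | SKETCH_GENMASK_ULL(9, 0))

/* Hypothetical helper: keep only the bits the mask declares valid. */
static uint64_t cntkctl_sanitise(uint64_t val)
{
	return val & SKETCH_CNTKCTL_VALID_BITS;
}
```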

+16 -2

include/kvm/arm_pmu.h

··· 47 47 #define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS) 48 48 u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx); 49 49 void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val); 50 - u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu); 50 + u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu); 51 + u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu); 51 52 u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1); 52 53 void kvm_pmu_vcpu_init(struct kvm_vcpu *vcpu); 53 54 void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu); ··· 97 96 u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm); 98 97 99 98 u64 kvm_vcpu_read_pmcr(struct kvm_vcpu *vcpu); 99 + bool kvm_pmu_counter_is_hyp(struct kvm_vcpu *vcpu, unsigned int idx); 100 + void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu); 100 101 #else 101 102 struct kvm_pmu { 102 103 }; ··· 116 113 } 117 114 static inline void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, 118 115 u64 select_idx, u64 val) {} 119 - static inline u64 kvm_pmu_valid_counter_mask(struct kvm_vcpu *vcpu) 116 + static inline u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu) 117 + { 118 + return 0; 119 + } 120 + static inline u64 kvm_pmu_accessible_counter_mask(struct kvm_vcpu *vcpu) 120 121 { 121 122 return 0; 122 123 } ··· 193 186 { 194 187 return 0; 195 188 } 189 + 190 + static inline bool kvm_pmu_counter_is_hyp(struct kvm_vcpu *vcpu, unsigned int idx) 191 + { 192 + return false; 193 + } 194 + 195 + static inline void kvm_pmu_nested_transition(struct kvm_vcpu *vcpu) {} 196 196 197 197 #endif 198 198

+3 -1

include/kvm/arm_psci.h

···
 #define KVM_ARM_PSCI_0_2	PSCI_VERSION(0, 2)
 #define KVM_ARM_PSCI_1_0	PSCI_VERSION(1, 0)
 #define KVM_ARM_PSCI_1_1	PSCI_VERSION(1, 1)
+#define KVM_ARM_PSCI_1_2	PSCI_VERSION(1, 2)
+#define KVM_ARM_PSCI_1_3	PSCI_VERSION(1, 3)
 
-#define KVM_ARM_PSCI_LATEST	KVM_ARM_PSCI_1_1
+#define KVM_ARM_PSCI_LATEST	KVM_ARM_PSCI_1_3
 
 static inline int kvm_psci_version(struct kvm_vcpu *vcpu)
 {
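
The version constants above pack a major and a minor number into one integer. Because the major number sits in the upper half, versions compare correctly as plain integers. The sketch below assumes the usual PSCI encoding (major in bits 31:16, minor in bits 15:0, consistent with `PSCI_VERSION_MAJOR_SHIFT` being 16); `psci_version()` is a hypothetical stand-in for the `PSCI_VERSION()` macro.

```c
#include <stdint.h>

#define SKETCH_PSCI_MAJOR_SHIFT 16
#define SKETCH_PSCI_MINOR_MASK  ((UINT32_C(1) << SKETCH_PSCI_MAJOR_SHIFT) - 1)

/* Hypothetical stand-in for PSCI_VERSION(major, minor). */
static uint32_t psci_version(uint32_t major, uint32_t minor)
{
	return (major << SKETCH_PSCI_MAJOR_SHIFT) |
	       (minor & SKETCH_PSCI_MINOR_MASK);
}
```

Under this encoding, PSCI 1.3 is `0x10003`, which is numerically greater than 1.1 (`0x10001`).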

+83 -44

include/linux/kvm_host.h

··· 97 97 #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) 98 98 #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) 99 99 #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) 100 + #define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) 100 101 101 102 /* 102 103 * error pfns indicate that the gfn is in slot but faild to ··· 152 151 static inline bool kvm_is_error_gpa(gpa_t gpa) 153 152 { 154 153 return gpa == INVALID_GPA; 155 - } 156 - 157 - #define KVM_ERR_PTR_BAD_PAGE (ERR_PTR(-ENOENT)) 158 - 159 - static inline bool is_error_page(struct page *page) 160 - { 161 - return IS_ERR(page); 162 154 } 163 155 164 156 #define KVM_REQUEST_MASK GENMASK(7,0) ··· 213 219 KVM_PIO_BUS, 214 220 KVM_VIRTIO_CCW_NOTIFY_BUS, 215 221 KVM_FAST_MMIO_BUS, 222 + KVM_IOCSR_BUS, 216 223 KVM_NR_BUSES 217 224 }; 218 225 ··· 274 279 READING_SHADOW_PAGE_TABLES, 275 280 }; 276 281 277 - #define KVM_UNMAPPED_PAGE ((void *) 0x500 + POISON_POINTER_DELTA) 278 - 279 282 struct kvm_host_map { 280 283 /* 281 284 * Only valid if the 'pfn' is managed by the host kernel (i.e. There is 282 285 * a 'struct page' for it. When using mem= kernel parameter some memory 283 286 * can be used as guest memory but they are not managed by host 284 287 * kernel). 285 - * If 'pfn' is not managed by the host kernel, this field is 286 - * initialized to KVM_UNMAPPED_PAGE. 
287 288 */ 289 + struct page *pinned_page; 288 290 struct page *page; 289 291 void *hva; 290 292 kvm_pfn_t pfn; 291 293 kvm_pfn_t gfn; 294 + bool writable; 292 295 }; 293 296 294 297 /* ··· 335 342 #ifndef __KVM_HAVE_ARCH_WQP 336 343 struct rcuwait wait; 337 344 #endif 338 - struct pid __rcu *pid; 345 + struct pid *pid; 346 + rwlock_t pid_lock; 339 347 int sigset_active; 340 348 sigset_t sigset; 341 349 unsigned int halt_poll_ns; ··· 1170 1176 kvm_memslot_iter_is_valid(iter, end); \ 1171 1177 kvm_memslot_iter_next(iter)) 1172 1178 1179 + struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); 1180 + struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu); 1181 + struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn); 1182 + 1173 1183 /* 1174 1184 * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations: 1175 1185 * - create a new memory slot ··· 1212 1214 void kvm_arch_flush_shadow_memslot(struct kvm *kvm, 1213 1215 struct kvm_memory_slot *slot); 1214 1216 1215 - int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn, 1216 - struct page **pages, int nr_pages); 1217 + int kvm_prefetch_pages(struct kvm_memory_slot *slot, gfn_t gfn, 1218 + struct page **pages, int nr_pages); 1217 1219 1218 - struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); 1220 + struct page *__gfn_to_page(struct kvm *kvm, gfn_t gfn, bool write); 1221 + static inline struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn) 1222 + { 1223 + return __gfn_to_page(kvm, gfn, true); 1224 + } 1225 + 1219 1226 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn); 1220 1227 unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable); 1221 1228 unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn); 1222 1229 unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn, 1223 1230 bool *writable); 1231 + 1232 + static inline void kvm_release_page_unused(struct page *page) 1233 + { 1234 + if 
(!page) 1235 + return; 1236 + 1237 + put_page(page); 1238 + } 1239 + 1224 1240 void kvm_release_page_clean(struct page *page); 1225 1241 void kvm_release_page_dirty(struct page *page); 1226 1242 1227 - kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); 1228 - kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, 1229 - bool *writable); 1230 - kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn); 1231 - kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn); 1232 - kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, 1233 - bool atomic, bool interruptible, bool *async, 1234 - bool write_fault, bool *writable, hva_t *hva); 1243 + static inline void kvm_release_faultin_page(struct kvm *kvm, struct page *page, 1244 + bool unused, bool dirty) 1245 + { 1246 + lockdep_assert_once(lockdep_is_held(&kvm->mmu_lock) || unused); 1235 1247 1236 - void kvm_release_pfn_clean(kvm_pfn_t pfn); 1237 - void kvm_release_pfn_dirty(kvm_pfn_t pfn); 1238 - void kvm_set_pfn_dirty(kvm_pfn_t pfn); 1239 - void kvm_set_pfn_accessed(kvm_pfn_t pfn); 1248 + if (!page) 1249 + return; 1240 1250 1241 - void kvm_release_pfn(kvm_pfn_t pfn, bool dirty); 1251 + /* 1252 + * If the page that KVM got from the *primary MMU* is writable, and KVM 1253 + * installed or reused a SPTE, mark the page/folio dirty. Note, this 1254 + * may mark a folio dirty even if KVM created a read-only SPTE, e.g. if 1255 + * the GFN is write-protected. Folios can't be safely marked dirty 1256 + * outside of mmu_lock as doing so could race with writeback on the 1257 + * folio. As a result, KVM can't mark folios dirty in the fast page 1258 + * fault handler, and so KVM must (somewhat) speculatively mark the 1259 + * folio dirty if KVM could locklessly make the SPTE writable. 
1260 + */ 1261 + if (unused) 1262 + kvm_release_page_unused(page); 1263 + else if (dirty) 1264 + kvm_release_page_dirty(page); 1265 + else 1266 + kvm_release_page_clean(page); 1267 + } 1268 + 1269 + kvm_pfn_t __kvm_faultin_pfn(const struct kvm_memory_slot *slot, gfn_t gfn, 1270 + unsigned int foll, bool *writable, 1271 + struct page **refcounted_page); 1272 + 1273 + static inline kvm_pfn_t kvm_faultin_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, 1274 + bool write, bool *writable, 1275 + struct page **refcounted_page) 1276 + { 1277 + return __kvm_faultin_pfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, 1278 + write ? FOLL_WRITE : 0, writable, refcounted_page); 1279 + } 1280 + 1242 1281 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, 1243 1282 int len); 1244 1283 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len); ··· 1339 1304 }) 1340 1305 1341 1306 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len); 1342 - struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); 1343 1307 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn); 1344 1308 bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn); 1345 1309 unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn); 1346 1310 void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn); 1347 1311 void mark_page_dirty(struct kvm *kvm, gfn_t gfn); 1348 1312 1349 - struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu); 1350 - struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn); 1351 - int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map); 1352 - void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty); 1313 + int __kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map, 1314 + bool writable); 1315 + void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map); 1316 + 1317 + static inline int 
kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, 1318 + struct kvm_host_map *map) 1319 + { 1320 + return __kvm_vcpu_map(vcpu, gpa, map, true); 1321 + } 1322 + 1323 + static inline int kvm_vcpu_map_readonly(struct kvm_vcpu *vcpu, gpa_t gpa, 1324 + struct kvm_host_map *map) 1325 + { 1326 + return __kvm_vcpu_map(vcpu, gpa, map, false); 1327 + } 1328 + 1353 1329 unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn); 1354 1330 unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable); 1355 1331 int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, int offset, ··· 1731 1685 void kvm_arch_sync_events(struct kvm *kvm); 1732 1686 1733 1687 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu); 1734 - 1735 - struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn); 1736 - bool kvm_is_zone_device_page(struct page *page); 1737 1688 1738 1689 struct kvm_irq_ack_notifier { 1739 1690 struct hlist_node link; ··· 2425 2382 } 2426 2383 #endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */ 2427 2384 2428 - typedef int (*kvm_vm_thread_fn_t)(struct kvm *kvm, uintptr_t data); 2429 - 2430 - int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn, 2431 - uintptr_t data, const char *name, 2432 - struct task_struct **thread_ptr); 2433 - 2434 2385 #ifdef CONFIG_KVM_XFER_TO_GUEST_WORK 2435 2386 static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu) 2436 2387 { ··· 2498 2461 2499 2462 #ifdef CONFIG_KVM_PRIVATE_MEM 2500 2463 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, 2501 - gfn_t gfn, kvm_pfn_t *pfn, int *max_order); 2464 + gfn_t gfn, kvm_pfn_t *pfn, struct page **page, 2465 + int *max_order); 2502 2466 #else 2503 2467 static inline int kvm_gmem_get_pfn(struct kvm *kvm, 2504 2468 struct kvm_memory_slot *slot, gfn_t gfn, 2505 - kvm_pfn_t *pfn, int *max_order) 2469 + kvm_pfn_t *pfn, struct page **page, 2470 + int *max_order) 2506 2471 { 2507 2472 KVM_BUG_ON(1, kvm); 2508 2473 return 
-EIO;

+35

include/trace/events/kvm.h

···
236 236         __entry->len, __entry->gpa, __entry->val)
237 237 );
238 238
239 + #define KVM_TRACE_IOCSR_READ_UNSATISFIED 0
240 + #define KVM_TRACE_IOCSR_READ 1
241 + #define KVM_TRACE_IOCSR_WRITE 2
242 +
243 + #define kvm_trace_symbol_iocsr \
244 +     { KVM_TRACE_IOCSR_READ_UNSATISFIED, "unsatisfied-read" }, \
245 +     { KVM_TRACE_IOCSR_READ, "read" }, \
246 +     { KVM_TRACE_IOCSR_WRITE, "write" }
247 +
248 + TRACE_EVENT(kvm_iocsr,
249 +     TP_PROTO(int type, int len, u64 gpa, void *val),
250 +     TP_ARGS(type, len, gpa, val),
251 +
252 +     TP_STRUCT__entry(
253 +         __field( u32, type )
254 +         __field( u32, len )
255 +         __field( u64, gpa )
256 +         __field( u64, val )
257 +     ),
258 +
259 +     TP_fast_assign(
260 +         __entry->type = type;
261 +         __entry->len = len;
262 +         __entry->gpa = gpa;
263 +         __entry->val = 0;
264 +         if (val)
265 +             memcpy(&__entry->val, val,
266 +                    min_t(u32, sizeof(__entry->val), len));
267 +     ),
268 +
269 +     TP_printk("iocsr %s len %u gpa 0x%llx val 0x%llx",
270 +         __print_symbolic(__entry->type, kvm_trace_symbol_iocsr),
271 +         __entry->len, __entry->gpa, __entry->val)
272 + );
273 +
239 274 #define kvm_fpu_load_symbol \
240 275     {0, "unload"}, \
241 276     {1, "load"}
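
Review note on the kvm_iocsr hunk above: the TP_fast_assign() body zeroes the 64-bit val field and then copies at most sizeof(val) bytes from the caller's buffer, so a short access leaves the unused bytes as zero and an oversized len cannot overrun the trace entry. A minimal userspace sketch of that capture step, assuming a hypothetical helper name (iocsr_trace_val is not part of the patch):

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the TP_fast_assign() body: returns what
 * __entry->val would hold for a given buffer and access length.
 */
static uint64_t iocsr_trace_val(const void *val, uint32_t len)
{
	uint64_t out = 0;			/* __entry->val = 0 */

	if (val) {
		/* Equivalent of min_t(u32, sizeof(__entry->val), len). */
		size_t n = len < sizeof(out) ? len : sizeof(out);

		memcpy(&out, val, n);
	}
	return out;
}
```

A NULL buffer (which the event accepts) yields 0, and a 16-byte buffer is truncated to its first 8 bytes.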

+8

include/uapi/linux/kvm.h

···
1158 1158 #define KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_ARM_PV_TIME
1159 1159     KVM_DEV_TYPE_RISCV_AIA,
1160 1160 #define KVM_DEV_TYPE_RISCV_AIA KVM_DEV_TYPE_RISCV_AIA
1161 +     KVM_DEV_TYPE_LOONGARCH_IPI,
1162 + #define KVM_DEV_TYPE_LOONGARCH_IPI KVM_DEV_TYPE_LOONGARCH_IPI
1163 +     KVM_DEV_TYPE_LOONGARCH_EIOINTC,
1164 + #define KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_EIOINTC
1165 +     KVM_DEV_TYPE_LOONGARCH_PCHPIC,
1166 + #define KVM_DEV_TYPE_LOONGARCH_PCHPIC KVM_DEV_TYPE_LOONGARCH_PCHPIC
1167 +
1161 1168     KVM_DEV_TYPE_MAX,
1169 +
1162 1170 };
1163 1171
1164 1172 struct kvm_vfio_spapr_tce {

+5

include/uapi/linux/psci.h

···
59 59 #define PSCI_1_1_FN_SYSTEM_RESET2 PSCI_0_2_FN(18)
60 60 #define PSCI_1_1_FN_MEM_PROTECT PSCI_0_2_FN(19)
61 61 #define PSCI_1_1_FN_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN(20)
62 + #define PSCI_1_3_FN_SYSTEM_OFF2 PSCI_0_2_FN(21)
62 63
63 64 #define PSCI_1_0_FN64_CPU_DEFAULT_SUSPEND PSCI_0_2_FN64(12)
64 65 #define PSCI_1_0_FN64_NODE_HW_STATE PSCI_0_2_FN64(13)
···
69 68
70 69 #define PSCI_1_1_FN64_SYSTEM_RESET2 PSCI_0_2_FN64(18)
71 70 #define PSCI_1_1_FN64_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN64(20)
71 + #define PSCI_1_3_FN64_SYSTEM_OFF2 PSCI_0_2_FN64(21)
72 72
73 73 /* PSCI v0.2 power state encoding for CPU_SUSPEND function */
74 74 #define PSCI_0_2_POWER_STATE_ID_MASK 0xffff
···
101 99 /* PSCI v1.1 reset type encoding for SYSTEM_RESET2 */
102 100 #define PSCI_1_1_RESET_TYPE_SYSTEM_WARM_RESET 0
103 101 #define PSCI_1_1_RESET_TYPE_VENDOR_START 0x80000000U
102 +
103 + /* PSCI v1.3 hibernate type for SYSTEM_OFF2 */
104 + #define PSCI_1_3_OFF_TYPE_HIBERNATE_OFF BIT(0)
104 105
105 106 /* PSCI version decoding (independent of PSCI version) */
106 107 #define PSCI_VERSION_MAJOR_SHIFT 16
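
Review note on the psci.h hunk above: the two new SYSTEM_OFF2 IDs are function number 21 in the SMC32 and SMC64 ranges. The sketch here works out the resulting values. The base constants are an assumption taken from the unchanged part of the same header (they do not appear in this hunk), and both helpers are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the unchanged part of include/uapi/linux/psci.h. */
#define PSCI_0_2_FN_BASE	0x84000000U
#define PSCI_0_2_64BIT		0x40000000U
#define PSCI_0_2_FN64_BASE	(PSCI_0_2_FN_BASE + PSCI_0_2_64BIT)
#define PSCI_0_2_FN(n)		(PSCI_0_2_FN_BASE + (n))
#define PSCI_0_2_FN64(n)	(PSCI_0_2_FN64_BASE + (n))

/* Added by this hunk. */
#define PSCI_1_3_FN_SYSTEM_OFF2		PSCI_0_2_FN(21)
#define PSCI_1_3_FN64_SYSTEM_OFF2	PSCI_0_2_FN64(21)
#define PSCI_1_3_OFF_TYPE_HIBERNATE_OFF	(1U << 0)	/* BIT(0) */

static_assert(PSCI_1_3_FN_SYSTEM_OFF2 == 0x84000015U, "SMC32 function ID");
static_assert(PSCI_1_3_FN64_SYSTEM_OFF2 == 0xC4000015U, "SMC64 function ID");

/* Hypothetical helper: does this function ID name SYSTEM_OFF2? */
static bool psci_fn_is_system_off2(uint32_t fn)
{
	return fn == PSCI_1_3_FN_SYSTEM_OFF2 ||
	       fn == PSCI_1_3_FN64_SYSTEM_OFF2;
}

/*
 * Hypothetical helper: psci_test.c in this series treats a type of zero
 * as an alias for HIBERNATE_OFF, so both values are accepted here.
 */
static bool psci_off2_type_is_hibernate(uint64_t type)
{
	return type == 0 || type == PSCI_1_3_OFF_TYPE_HIBERNATE_OFF;
}
```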

+4 -1

kernel/power/hibernate.c

···
685 685         }
686 686         fallthrough;
687 687     case HIBERNATION_SHUTDOWN:
688 -         if (kernel_can_power_off())
688 +         if (kernel_can_power_off()) {
689 +             entering_platform_hibernation = true;
689 690             kernel_power_off();
691 +             entering_platform_hibernation = false;
692 +         }
690 693         break;
691 694     }
692 695     kernel_halt();

+42

tools/arch/arm64/include/asm/brk-imm.h

···
1 + /* SPDX-License-Identifier: GPL-2.0-only */
2 + /*
3 +  * Copyright (C) 2012 ARM Ltd.
4 +  */
5 +
6 + #ifndef __ASM_BRK_IMM_H
7 + #define __ASM_BRK_IMM_H
8 +
9 + /*
10 +  * #imm16 values used for BRK instruction generation
11 +  * 0x004: for installing kprobes
12 +  * 0x005: for installing uprobes
13 +  * 0x006: for kprobe software single-step
14 +  * 0x007: for kretprobe return
15 +  * Allowed values for kgdb are 0x400 - 0x7ff
16 +  * 0x100: for triggering a fault on purpose (reserved)
17 +  * 0x400: for dynamic BRK instruction
18 +  * 0x401: for compile time BRK instruction
19 +  * 0x800: kernel-mode BUG() and WARN() traps
20 +  * 0x9xx: tag-based KASAN trap (allowed values 0x900 - 0x9ff)
21 +  * 0x55xx: Undefined Behavior Sanitizer traps ('U' << 8)
22 +  * 0x8xxx: Control-Flow Integrity traps
23 +  */
24 + #define KPROBES_BRK_IMM 0x004
25 + #define UPROBES_BRK_IMM 0x005
26 + #define KPROBES_BRK_SS_IMM 0x006
27 + #define KRETPROBES_BRK_IMM 0x007
28 + #define FAULT_BRK_IMM 0x100
29 + #define KGDB_DYN_DBG_BRK_IMM 0x400
30 + #define KGDB_COMPILED_DBG_BRK_IMM 0x401
31 + #define BUG_BRK_IMM 0x800
32 + #define KASAN_BRK_IMM 0x900
33 + #define KASAN_BRK_MASK 0x0ff
34 + #define UBSAN_BRK_IMM 0x5500
35 + #define UBSAN_BRK_MASK 0x00ff
36 +
37 + #define CFI_BRK_IMM_TARGET GENMASK(4, 0)
38 + #define CFI_BRK_IMM_TYPE GENMASK(9, 5)
39 + #define CFI_BRK_IMM_BASE 0x8000
40 + #define CFI_BRK_IMM_MASK (CFI_BRK_IMM_TARGET | CFI_BRK_IMM_TYPE)
41 +
42 + #endif
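
Review note on the CFI_BRK_IMM_* definitions above: a CFI trap immediate is the 0x8000 base with two 5-bit fields packed into bits [9:0], and esr_is_cfi_brk() in the esr.h hunk of this series recognises it by masking those ten bits off and comparing against the base. A minimal sketch with the GENMASK() values written out by hand. The helper names are hypothetical, and reading the two fields as register numbers is an assumption based on the macro names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CFI_BRK_IMM_TARGET	0x001FU	/* GENMASK(4, 0) */
#define CFI_BRK_IMM_TYPE	0x03E0U	/* GENMASK(9, 5) */
#define CFI_BRK_IMM_BASE	0x8000U
#define CFI_BRK_IMM_MASK	(CFI_BRK_IMM_TARGET | CFI_BRK_IMM_TYPE)

static_assert(CFI_BRK_IMM_MASK == 0x03FFU, "fields cover bits [9:0]");

/* Hypothetical helper: same comparison as esr_is_cfi_brk(), minus the EC check. */
static bool brk_imm_is_cfi(uint32_t imm)
{
	return (imm & ~CFI_BRK_IMM_MASK) == CFI_BRK_IMM_BASE;
}

/* Hypothetical helpers: extract the two 5-bit fields from the immediate. */
static unsigned int brk_imm_cfi_target(uint32_t imm)
{
	return imm & CFI_BRK_IMM_TARGET;
}

static unsigned int brk_imm_cfi_type(uint32_t imm)
{
	return (imm & CFI_BRK_IMM_TYPE) >> 5;
}
```

With these definitions a UBSAN immediate such as 0x5500 is rejected, and so is 0x8400, because bit 10 lies outside both fields.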

+455

tools/arch/arm64/include/asm/esr.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + /* 3 + * Copyright (C) 2013 - ARM Ltd 4 + * Author: Marc Zyngier <marc.zyngier@arm.com> 5 + */ 6 + 7 + #ifndef __ASM_ESR_H 8 + #define __ASM_ESR_H 9 + 10 + #include <asm/sysreg.h> 11 + 12 + #define ESR_ELx_EC_UNKNOWN UL(0x00) 13 + #define ESR_ELx_EC_WFx UL(0x01) 14 + /* Unallocated EC: 0x02 */ 15 + #define ESR_ELx_EC_CP15_32 UL(0x03) 16 + #define ESR_ELx_EC_CP15_64 UL(0x04) 17 + #define ESR_ELx_EC_CP14_MR UL(0x05) 18 + #define ESR_ELx_EC_CP14_LS UL(0x06) 19 + #define ESR_ELx_EC_FP_ASIMD UL(0x07) 20 + #define ESR_ELx_EC_CP10_ID UL(0x08) /* EL2 only */ 21 + #define ESR_ELx_EC_PAC UL(0x09) /* EL2 and above */ 22 + /* Unallocated EC: 0x0A - 0x0B */ 23 + #define ESR_ELx_EC_CP14_64 UL(0x0C) 24 + #define ESR_ELx_EC_BTI UL(0x0D) 25 + #define ESR_ELx_EC_ILL UL(0x0E) 26 + /* Unallocated EC: 0x0F - 0x10 */ 27 + #define ESR_ELx_EC_SVC32 UL(0x11) 28 + #define ESR_ELx_EC_HVC32 UL(0x12) /* EL2 only */ 29 + #define ESR_ELx_EC_SMC32 UL(0x13) /* EL2 and above */ 30 + /* Unallocated EC: 0x14 */ 31 + #define ESR_ELx_EC_SVC64 UL(0x15) 32 + #define ESR_ELx_EC_HVC64 UL(0x16) /* EL2 and above */ 33 + #define ESR_ELx_EC_SMC64 UL(0x17) /* EL2 and above */ 34 + #define ESR_ELx_EC_SYS64 UL(0x18) 35 + #define ESR_ELx_EC_SVE UL(0x19) 36 + #define ESR_ELx_EC_ERET UL(0x1a) /* EL2 only */ 37 + /* Unallocated EC: 0x1B */ 38 + #define ESR_ELx_EC_FPAC UL(0x1C) /* EL1 and above */ 39 + #define ESR_ELx_EC_SME UL(0x1D) 40 + /* Unallocated EC: 0x1E */ 41 + #define ESR_ELx_EC_IMP_DEF UL(0x1f) /* EL3 only */ 42 + #define ESR_ELx_EC_IABT_LOW UL(0x20) 43 + #define ESR_ELx_EC_IABT_CUR UL(0x21) 44 + #define ESR_ELx_EC_PC_ALIGN UL(0x22) 45 + /* Unallocated EC: 0x23 */ 46 + #define ESR_ELx_EC_DABT_LOW UL(0x24) 47 + #define ESR_ELx_EC_DABT_CUR UL(0x25) 48 + #define ESR_ELx_EC_SP_ALIGN UL(0x26) 49 + #define ESR_ELx_EC_MOPS UL(0x27) 50 + #define ESR_ELx_EC_FP_EXC32 UL(0x28) 51 + /* Unallocated EC: 0x29 - 0x2B */ 52 + #define ESR_ELx_EC_FP_EXC64 UL(0x2C) 53 + 
/* Unallocated EC: 0x2D - 0x2E */ 54 + #define ESR_ELx_EC_SERROR UL(0x2F) 55 + #define ESR_ELx_EC_BREAKPT_LOW UL(0x30) 56 + #define ESR_ELx_EC_BREAKPT_CUR UL(0x31) 57 + #define ESR_ELx_EC_SOFTSTP_LOW UL(0x32) 58 + #define ESR_ELx_EC_SOFTSTP_CUR UL(0x33) 59 + #define ESR_ELx_EC_WATCHPT_LOW UL(0x34) 60 + #define ESR_ELx_EC_WATCHPT_CUR UL(0x35) 61 + /* Unallocated EC: 0x36 - 0x37 */ 62 + #define ESR_ELx_EC_BKPT32 UL(0x38) 63 + /* Unallocated EC: 0x39 */ 64 + #define ESR_ELx_EC_VECTOR32 UL(0x3A) /* EL2 only */ 65 + /* Unallocated EC: 0x3B */ 66 + #define ESR_ELx_EC_BRK64 UL(0x3C) 67 + /* Unallocated EC: 0x3D - 0x3F */ 68 + #define ESR_ELx_EC_MAX UL(0x3F) 69 + 70 + #define ESR_ELx_EC_SHIFT (26) 71 + #define ESR_ELx_EC_WIDTH (6) 72 + #define ESR_ELx_EC_MASK (UL(0x3F) << ESR_ELx_EC_SHIFT) 73 + #define ESR_ELx_EC(esr) (((esr) & ESR_ELx_EC_MASK) >> ESR_ELx_EC_SHIFT) 74 + 75 + #define ESR_ELx_IL_SHIFT (25) 76 + #define ESR_ELx_IL (UL(1) << ESR_ELx_IL_SHIFT) 77 + #define ESR_ELx_ISS_MASK (GENMASK(24, 0)) 78 + #define ESR_ELx_ISS(esr) ((esr) & ESR_ELx_ISS_MASK) 79 + #define ESR_ELx_ISS2_SHIFT (32) 80 + #define ESR_ELx_ISS2_MASK (GENMASK_ULL(55, 32)) 81 + #define ESR_ELx_ISS2(esr) (((esr) & ESR_ELx_ISS2_MASK) >> ESR_ELx_ISS2_SHIFT) 82 + 83 + /* ISS field definitions shared by different classes */ 84 + #define ESR_ELx_WNR_SHIFT (6) 85 + #define ESR_ELx_WNR (UL(1) << ESR_ELx_WNR_SHIFT) 86 + 87 + /* Asynchronous Error Type */ 88 + #define ESR_ELx_IDS_SHIFT (24) 89 + #define ESR_ELx_IDS (UL(1) << ESR_ELx_IDS_SHIFT) 90 + #define ESR_ELx_AET_SHIFT (10) 91 + #define ESR_ELx_AET (UL(0x7) << ESR_ELx_AET_SHIFT) 92 + 93 + #define ESR_ELx_AET_UC (UL(0) << ESR_ELx_AET_SHIFT) 94 + #define ESR_ELx_AET_UEU (UL(1) << ESR_ELx_AET_SHIFT) 95 + #define ESR_ELx_AET_UEO (UL(2) << ESR_ELx_AET_SHIFT) 96 + #define ESR_ELx_AET_UER (UL(3) << ESR_ELx_AET_SHIFT) 97 + #define ESR_ELx_AET_CE (UL(6) << ESR_ELx_AET_SHIFT) 98 + 99 + /* Shared ISS field definitions for Data/Instruction aborts */ 100 + #define 
ESR_ELx_SET_SHIFT (11) 101 + #define ESR_ELx_SET_MASK (UL(3) << ESR_ELx_SET_SHIFT) 102 + #define ESR_ELx_FnV_SHIFT (10) 103 + #define ESR_ELx_FnV (UL(1) << ESR_ELx_FnV_SHIFT) 104 + #define ESR_ELx_EA_SHIFT (9) 105 + #define ESR_ELx_EA (UL(1) << ESR_ELx_EA_SHIFT) 106 + #define ESR_ELx_S1PTW_SHIFT (7) 107 + #define ESR_ELx_S1PTW (UL(1) << ESR_ELx_S1PTW_SHIFT) 108 + 109 + /* Shared ISS fault status code(IFSC/DFSC) for Data/Instruction aborts */ 110 + #define ESR_ELx_FSC (0x3F) 111 + #define ESR_ELx_FSC_TYPE (0x3C) 112 + #define ESR_ELx_FSC_LEVEL (0x03) 113 + #define ESR_ELx_FSC_EXTABT (0x10) 114 + #define ESR_ELx_FSC_MTE (0x11) 115 + #define ESR_ELx_FSC_SERROR (0x11) 116 + #define ESR_ELx_FSC_ACCESS (0x08) 117 + #define ESR_ELx_FSC_FAULT (0x04) 118 + #define ESR_ELx_FSC_PERM (0x0C) 119 + #define ESR_ELx_FSC_SEA_TTW(n) (0x14 + (n)) 120 + #define ESR_ELx_FSC_SECC (0x18) 121 + #define ESR_ELx_FSC_SECC_TTW(n) (0x1c + (n)) 122 + 123 + /* Status codes for individual page table levels */ 124 + #define ESR_ELx_FSC_ACCESS_L(n) (ESR_ELx_FSC_ACCESS + (n)) 125 + #define ESR_ELx_FSC_PERM_L(n) (ESR_ELx_FSC_PERM + (n)) 126 + 127 + #define ESR_ELx_FSC_FAULT_nL (0x2C) 128 + #define ESR_ELx_FSC_FAULT_L(n) (((n) < 0 ? 
ESR_ELx_FSC_FAULT_nL : \ 129 + ESR_ELx_FSC_FAULT) + (n)) 130 + 131 + /* ISS field definitions for Data Aborts */ 132 + #define ESR_ELx_ISV_SHIFT (24) 133 + #define ESR_ELx_ISV (UL(1) << ESR_ELx_ISV_SHIFT) 134 + #define ESR_ELx_SAS_SHIFT (22) 135 + #define ESR_ELx_SAS (UL(3) << ESR_ELx_SAS_SHIFT) 136 + #define ESR_ELx_SSE_SHIFT (21) 137 + #define ESR_ELx_SSE (UL(1) << ESR_ELx_SSE_SHIFT) 138 + #define ESR_ELx_SRT_SHIFT (16) 139 + #define ESR_ELx_SRT_MASK (UL(0x1F) << ESR_ELx_SRT_SHIFT) 140 + #define ESR_ELx_SF_SHIFT (15) 141 + #define ESR_ELx_SF (UL(1) << ESR_ELx_SF_SHIFT) 142 + #define ESR_ELx_AR_SHIFT (14) 143 + #define ESR_ELx_AR (UL(1) << ESR_ELx_AR_SHIFT) 144 + #define ESR_ELx_CM_SHIFT (8) 145 + #define ESR_ELx_CM (UL(1) << ESR_ELx_CM_SHIFT) 146 + 147 + /* ISS2 field definitions for Data Aborts */ 148 + #define ESR_ELx_TnD_SHIFT (10) 149 + #define ESR_ELx_TnD (UL(1) << ESR_ELx_TnD_SHIFT) 150 + #define ESR_ELx_TagAccess_SHIFT (9) 151 + #define ESR_ELx_TagAccess (UL(1) << ESR_ELx_TagAccess_SHIFT) 152 + #define ESR_ELx_GCS_SHIFT (8) 153 + #define ESR_ELx_GCS (UL(1) << ESR_ELx_GCS_SHIFT) 154 + #define ESR_ELx_Overlay_SHIFT (6) 155 + #define ESR_ELx_Overlay (UL(1) << ESR_ELx_Overlay_SHIFT) 156 + #define ESR_ELx_DirtyBit_SHIFT (5) 157 + #define ESR_ELx_DirtyBit (UL(1) << ESR_ELx_DirtyBit_SHIFT) 158 + #define ESR_ELx_Xs_SHIFT (0) 159 + #define ESR_ELx_Xs_MASK (GENMASK_ULL(4, 0)) 160 + 161 + /* ISS field definitions for exceptions taken in to Hyp */ 162 + #define ESR_ELx_FSC_ADDRSZ (0x00) 163 + #define ESR_ELx_FSC_ADDRSZ_L(n) (ESR_ELx_FSC_ADDRSZ + (n)) 164 + #define ESR_ELx_CV (UL(1) << 24) 165 + #define ESR_ELx_COND_SHIFT (20) 166 + #define ESR_ELx_COND_MASK (UL(0xF) << ESR_ELx_COND_SHIFT) 167 + #define ESR_ELx_WFx_ISS_RN (UL(0x1F) << 5) 168 + #define ESR_ELx_WFx_ISS_RV (UL(1) << 2) 169 + #define ESR_ELx_WFx_ISS_TI (UL(3) << 0) 170 + #define ESR_ELx_WFx_ISS_WFxT (UL(2) << 0) 171 + #define ESR_ELx_WFx_ISS_WFI (UL(0) << 0) 172 + #define ESR_ELx_WFx_ISS_WFE (UL(1) << 0) 
173 + #define ESR_ELx_xVC_IMM_MASK ((UL(1) << 16) - 1) 174 + 175 + #define DISR_EL1_IDS (UL(1) << 24) 176 + /* 177 + * DISR_EL1 and ESR_ELx share the bottom 13 bits, but the RES0 bits may mean 178 + * different things in the future... 179 + */ 180 + #define DISR_EL1_ESR_MASK (ESR_ELx_AET | ESR_ELx_EA | ESR_ELx_FSC) 181 + 182 + /* ESR value templates for specific events */ 183 + #define ESR_ELx_WFx_MASK (ESR_ELx_EC_MASK | \ 184 + (ESR_ELx_WFx_ISS_TI & ~ESR_ELx_WFx_ISS_WFxT)) 185 + #define ESR_ELx_WFx_WFI_VAL ((ESR_ELx_EC_WFx << ESR_ELx_EC_SHIFT) | \ 186 + ESR_ELx_WFx_ISS_WFI) 187 + 188 + /* BRK instruction trap from AArch64 state */ 189 + #define ESR_ELx_BRK64_ISS_COMMENT_MASK 0xffff 190 + 191 + /* ISS field definitions for System instruction traps */ 192 + #define ESR_ELx_SYS64_ISS_RES0_SHIFT 22 193 + #define ESR_ELx_SYS64_ISS_RES0_MASK (UL(0x7) << ESR_ELx_SYS64_ISS_RES0_SHIFT) 194 + #define ESR_ELx_SYS64_ISS_DIR_MASK 0x1 195 + #define ESR_ELx_SYS64_ISS_DIR_READ 0x1 196 + #define ESR_ELx_SYS64_ISS_DIR_WRITE 0x0 197 + 198 + #define ESR_ELx_SYS64_ISS_RT_SHIFT 5 199 + #define ESR_ELx_SYS64_ISS_RT_MASK (UL(0x1f) << ESR_ELx_SYS64_ISS_RT_SHIFT) 200 + #define ESR_ELx_SYS64_ISS_CRM_SHIFT 1 201 + #define ESR_ELx_SYS64_ISS_CRM_MASK (UL(0xf) << ESR_ELx_SYS64_ISS_CRM_SHIFT) 202 + #define ESR_ELx_SYS64_ISS_CRN_SHIFT 10 203 + #define ESR_ELx_SYS64_ISS_CRN_MASK (UL(0xf) << ESR_ELx_SYS64_ISS_CRN_SHIFT) 204 + #define ESR_ELx_SYS64_ISS_OP1_SHIFT 14 205 + #define ESR_ELx_SYS64_ISS_OP1_MASK (UL(0x7) << ESR_ELx_SYS64_ISS_OP1_SHIFT) 206 + #define ESR_ELx_SYS64_ISS_OP2_SHIFT 17 207 + #define ESR_ELx_SYS64_ISS_OP2_MASK (UL(0x7) << ESR_ELx_SYS64_ISS_OP2_SHIFT) 208 + #define ESR_ELx_SYS64_ISS_OP0_SHIFT 20 209 + #define ESR_ELx_SYS64_ISS_OP0_MASK (UL(0x3) << ESR_ELx_SYS64_ISS_OP0_SHIFT) 210 + #define ESR_ELx_SYS64_ISS_SYS_MASK (ESR_ELx_SYS64_ISS_OP0_MASK | \ 211 + ESR_ELx_SYS64_ISS_OP1_MASK | \ 212 + ESR_ELx_SYS64_ISS_OP2_MASK | \ 213 + ESR_ELx_SYS64_ISS_CRN_MASK | \ 214 + 
ESR_ELx_SYS64_ISS_CRM_MASK) 215 + #define ESR_ELx_SYS64_ISS_SYS_VAL(op0, op1, op2, crn, crm) \ 216 + (((op0) << ESR_ELx_SYS64_ISS_OP0_SHIFT) | \ 217 + ((op1) << ESR_ELx_SYS64_ISS_OP1_SHIFT) | \ 218 + ((op2) << ESR_ELx_SYS64_ISS_OP2_SHIFT) | \ 219 + ((crn) << ESR_ELx_SYS64_ISS_CRN_SHIFT) | \ 220 + ((crm) << ESR_ELx_SYS64_ISS_CRM_SHIFT)) 221 + 222 + #define ESR_ELx_SYS64_ISS_SYS_OP_MASK (ESR_ELx_SYS64_ISS_SYS_MASK | \ 223 + ESR_ELx_SYS64_ISS_DIR_MASK) 224 + #define ESR_ELx_SYS64_ISS_RT(esr) \ 225 + (((esr) & ESR_ELx_SYS64_ISS_RT_MASK) >> ESR_ELx_SYS64_ISS_RT_SHIFT) 226 + /* 227 + * User space cache operations have the following sysreg encoding 228 + * in System instructions. 229 + * op0=1, op1=3, op2=1, crn=7, crm={ 5, 10, 11, 12, 13, 14 }, WRITE (L=0) 230 + */ 231 + #define ESR_ELx_SYS64_ISS_CRM_DC_CIVAC 14 232 + #define ESR_ELx_SYS64_ISS_CRM_DC_CVADP 13 233 + #define ESR_ELx_SYS64_ISS_CRM_DC_CVAP 12 234 + #define ESR_ELx_SYS64_ISS_CRM_DC_CVAU 11 235 + #define ESR_ELx_SYS64_ISS_CRM_DC_CVAC 10 236 + #define ESR_ELx_SYS64_ISS_CRM_IC_IVAU 5 237 + 238 + #define ESR_ELx_SYS64_ISS_EL0_CACHE_OP_MASK (ESR_ELx_SYS64_ISS_OP0_MASK | \ 239 + ESR_ELx_SYS64_ISS_OP1_MASK | \ 240 + ESR_ELx_SYS64_ISS_OP2_MASK | \ 241 + ESR_ELx_SYS64_ISS_CRN_MASK | \ 242 + ESR_ELx_SYS64_ISS_DIR_MASK) 243 + #define ESR_ELx_SYS64_ISS_EL0_CACHE_OP_VAL \ 244 + (ESR_ELx_SYS64_ISS_SYS_VAL(1, 3, 1, 7, 0) | \ 245 + ESR_ELx_SYS64_ISS_DIR_WRITE) 246 + /* 247 + * User space MRS operations which are supported for emulation 248 + * have the following sysreg encoding in System instructions. 
249 + * op0 = 3, op1= 0, crn = 0, {crm = 0, 4-7}, READ (L = 1) 250 + */ 251 + #define ESR_ELx_SYS64_ISS_SYS_MRS_OP_MASK (ESR_ELx_SYS64_ISS_OP0_MASK | \ 252 + ESR_ELx_SYS64_ISS_OP1_MASK | \ 253 + ESR_ELx_SYS64_ISS_CRN_MASK | \ 254 + ESR_ELx_SYS64_ISS_DIR_MASK) 255 + #define ESR_ELx_SYS64_ISS_SYS_MRS_OP_VAL \ 256 + (ESR_ELx_SYS64_ISS_SYS_VAL(3, 0, 0, 0, 0) | \ 257 + ESR_ELx_SYS64_ISS_DIR_READ) 258 + 259 + #define ESR_ELx_SYS64_ISS_SYS_CTR ESR_ELx_SYS64_ISS_SYS_VAL(3, 3, 1, 0, 0) 260 + #define ESR_ELx_SYS64_ISS_SYS_CTR_READ (ESR_ELx_SYS64_ISS_SYS_CTR | \ 261 + ESR_ELx_SYS64_ISS_DIR_READ) 262 + 263 + #define ESR_ELx_SYS64_ISS_SYS_CNTVCT (ESR_ELx_SYS64_ISS_SYS_VAL(3, 3, 2, 14, 0) | \ 264 + ESR_ELx_SYS64_ISS_DIR_READ) 265 + 266 + #define ESR_ELx_SYS64_ISS_SYS_CNTVCTSS (ESR_ELx_SYS64_ISS_SYS_VAL(3, 3, 6, 14, 0) | \ 267 + ESR_ELx_SYS64_ISS_DIR_READ) 268 + 269 + #define ESR_ELx_SYS64_ISS_SYS_CNTFRQ (ESR_ELx_SYS64_ISS_SYS_VAL(3, 3, 0, 14, 0) | \ 270 + ESR_ELx_SYS64_ISS_DIR_READ) 271 + 272 + #define esr_sys64_to_sysreg(e) \ 273 + sys_reg((((e) & ESR_ELx_SYS64_ISS_OP0_MASK) >> \ 274 + ESR_ELx_SYS64_ISS_OP0_SHIFT), \ 275 + (((e) & ESR_ELx_SYS64_ISS_OP1_MASK) >> \ 276 + ESR_ELx_SYS64_ISS_OP1_SHIFT), \ 277 + (((e) & ESR_ELx_SYS64_ISS_CRN_MASK) >> \ 278 + ESR_ELx_SYS64_ISS_CRN_SHIFT), \ 279 + (((e) & ESR_ELx_SYS64_ISS_CRM_MASK) >> \ 280 + ESR_ELx_SYS64_ISS_CRM_SHIFT), \ 281 + (((e) & ESR_ELx_SYS64_ISS_OP2_MASK) >> \ 282 + ESR_ELx_SYS64_ISS_OP2_SHIFT)) 283 + 284 + #define esr_cp15_to_sysreg(e) \ 285 + sys_reg(3, \ 286 + (((e) & ESR_ELx_SYS64_ISS_OP1_MASK) >> \ 287 + ESR_ELx_SYS64_ISS_OP1_SHIFT), \ 288 + (((e) & ESR_ELx_SYS64_ISS_CRN_MASK) >> \ 289 + ESR_ELx_SYS64_ISS_CRN_SHIFT), \ 290 + (((e) & ESR_ELx_SYS64_ISS_CRM_MASK) >> \ 291 + ESR_ELx_SYS64_ISS_CRM_SHIFT), \ 292 + (((e) & ESR_ELx_SYS64_ISS_OP2_MASK) >> \ 293 + ESR_ELx_SYS64_ISS_OP2_SHIFT)) 294 + 295 + /* ISS field definitions for ERET/ERETAA/ERETAB trapping */ 296 + #define ESR_ELx_ERET_ISS_ERET 0x2 297 + #define 
ESR_ELx_ERET_ISS_ERETA 0x1 298 + 299 + /* 300 + * ISS field definitions for floating-point exception traps 301 + * (FP_EXC_32/FP_EXC_64). 302 + * 303 + * (The FPEXC_* constants are used instead for common bits.) 304 + */ 305 + 306 + #define ESR_ELx_FP_EXC_TFV (UL(1) << 23) 307 + 308 + /* 309 + * ISS field definitions for CP15 accesses 310 + */ 311 + #define ESR_ELx_CP15_32_ISS_DIR_MASK 0x1 312 + #define ESR_ELx_CP15_32_ISS_DIR_READ 0x1 313 + #define ESR_ELx_CP15_32_ISS_DIR_WRITE 0x0 314 + 315 + #define ESR_ELx_CP15_32_ISS_RT_SHIFT 5 316 + #define ESR_ELx_CP15_32_ISS_RT_MASK (UL(0x1f) << ESR_ELx_CP15_32_ISS_RT_SHIFT) 317 + #define ESR_ELx_CP15_32_ISS_CRM_SHIFT 1 318 + #define ESR_ELx_CP15_32_ISS_CRM_MASK (UL(0xf) << ESR_ELx_CP15_32_ISS_CRM_SHIFT) 319 + #define ESR_ELx_CP15_32_ISS_CRN_SHIFT 10 320 + #define ESR_ELx_CP15_32_ISS_CRN_MASK (UL(0xf) << ESR_ELx_CP15_32_ISS_CRN_SHIFT) 321 + #define ESR_ELx_CP15_32_ISS_OP1_SHIFT 14 322 + #define ESR_ELx_CP15_32_ISS_OP1_MASK (UL(0x7) << ESR_ELx_CP15_32_ISS_OP1_SHIFT) 323 + #define ESR_ELx_CP15_32_ISS_OP2_SHIFT 17 324 + #define ESR_ELx_CP15_32_ISS_OP2_MASK (UL(0x7) << ESR_ELx_CP15_32_ISS_OP2_SHIFT) 325 + 326 + #define ESR_ELx_CP15_32_ISS_SYS_MASK (ESR_ELx_CP15_32_ISS_OP1_MASK | \ 327 + ESR_ELx_CP15_32_ISS_OP2_MASK | \ 328 + ESR_ELx_CP15_32_ISS_CRN_MASK | \ 329 + ESR_ELx_CP15_32_ISS_CRM_MASK | \ 330 + ESR_ELx_CP15_32_ISS_DIR_MASK) 331 + #define ESR_ELx_CP15_32_ISS_SYS_VAL(op1, op2, crn, crm) \ 332 + (((op1) << ESR_ELx_CP15_32_ISS_OP1_SHIFT) | \ 333 + ((op2) << ESR_ELx_CP15_32_ISS_OP2_SHIFT) | \ 334 + ((crn) << ESR_ELx_CP15_32_ISS_CRN_SHIFT) | \ 335 + ((crm) << ESR_ELx_CP15_32_ISS_CRM_SHIFT)) 336 + 337 + #define ESR_ELx_CP15_64_ISS_DIR_MASK 0x1 338 + #define ESR_ELx_CP15_64_ISS_DIR_READ 0x1 339 + #define ESR_ELx_CP15_64_ISS_DIR_WRITE 0x0 340 + 341 + #define ESR_ELx_CP15_64_ISS_RT_SHIFT 5 342 + #define ESR_ELx_CP15_64_ISS_RT_MASK (UL(0x1f) << ESR_ELx_CP15_64_ISS_RT_SHIFT) 343 + 344 + #define ESR_ELx_CP15_64_ISS_RT2_SHIFT 10 345 + 
#define ESR_ELx_CP15_64_ISS_RT2_MASK (UL(0x1f) << ESR_ELx_CP15_64_ISS_RT2_SHIFT) 346 + 347 + #define ESR_ELx_CP15_64_ISS_OP1_SHIFT 16 348 + #define ESR_ELx_CP15_64_ISS_OP1_MASK (UL(0xf) << ESR_ELx_CP15_64_ISS_OP1_SHIFT) 349 + #define ESR_ELx_CP15_64_ISS_CRM_SHIFT 1 350 + #define ESR_ELx_CP15_64_ISS_CRM_MASK (UL(0xf) << ESR_ELx_CP15_64_ISS_CRM_SHIFT) 351 + 352 + #define ESR_ELx_CP15_64_ISS_SYS_VAL(op1, crm) \ 353 + (((op1) << ESR_ELx_CP15_64_ISS_OP1_SHIFT) | \ 354 + ((crm) << ESR_ELx_CP15_64_ISS_CRM_SHIFT)) 355 + 356 + #define ESR_ELx_CP15_64_ISS_SYS_MASK (ESR_ELx_CP15_64_ISS_OP1_MASK | \ 357 + ESR_ELx_CP15_64_ISS_CRM_MASK | \ 358 + ESR_ELx_CP15_64_ISS_DIR_MASK) 359 + 360 + #define ESR_ELx_CP15_64_ISS_SYS_CNTVCT (ESR_ELx_CP15_64_ISS_SYS_VAL(1, 14) | \ 361 + ESR_ELx_CP15_64_ISS_DIR_READ) 362 + 363 + #define ESR_ELx_CP15_64_ISS_SYS_CNTVCTSS (ESR_ELx_CP15_64_ISS_SYS_VAL(9, 14) | \ 364 + ESR_ELx_CP15_64_ISS_DIR_READ) 365 + 366 + #define ESR_ELx_CP15_32_ISS_SYS_CNTFRQ (ESR_ELx_CP15_32_ISS_SYS_VAL(0, 0, 14, 0) |\ 367 + ESR_ELx_CP15_32_ISS_DIR_READ) 368 + 369 + /* 370 + * ISS values for SME traps 371 + */ 372 + 373 + #define ESR_ELx_SME_ISS_SME_DISABLED 0 374 + #define ESR_ELx_SME_ISS_ILL 1 375 + #define ESR_ELx_SME_ISS_SM_DISABLED 2 376 + #define ESR_ELx_SME_ISS_ZA_DISABLED 3 377 + #define ESR_ELx_SME_ISS_ZT_DISABLED 4 378 + 379 + /* ISS field definitions for MOPS exceptions */ 380 + #define ESR_ELx_MOPS_ISS_MEM_INST (UL(1) << 24) 381 + #define ESR_ELx_MOPS_ISS_FROM_EPILOGUE (UL(1) << 18) 382 + #define ESR_ELx_MOPS_ISS_WRONG_OPTION (UL(1) << 17) 383 + #define ESR_ELx_MOPS_ISS_OPTION_A (UL(1) << 16) 384 + #define ESR_ELx_MOPS_ISS_DESTREG(esr) (((esr) & (UL(0x1f) << 10)) >> 10) 385 + #define ESR_ELx_MOPS_ISS_SRCREG(esr) (((esr) & (UL(0x1f) << 5)) >> 5) 386 + #define ESR_ELx_MOPS_ISS_SIZEREG(esr) (((esr) & (UL(0x1f) << 0)) >> 0) 387 + 388 + #ifndef __ASSEMBLY__ 389 + #include <asm/types.h> 390 + 391 + static inline unsigned long esr_brk_comment(unsigned long esr) 392 + { 393 
+ return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK; 394 + } 395 + 396 + static inline bool esr_is_data_abort(unsigned long esr) 397 + { 398 + const unsigned long ec = ESR_ELx_EC(esr); 399 + 400 + return ec == ESR_ELx_EC_DABT_LOW || ec == ESR_ELx_EC_DABT_CUR; 401 + } 402 + 403 + static inline bool esr_is_cfi_brk(unsigned long esr) 404 + { 405 + return ESR_ELx_EC(esr) == ESR_ELx_EC_BRK64 && 406 + (esr_brk_comment(esr) & ~CFI_BRK_IMM_MASK) == CFI_BRK_IMM_BASE; 407 + } 408 + 409 + static inline bool esr_fsc_is_translation_fault(unsigned long esr) 410 + { 411 + esr = esr & ESR_ELx_FSC; 412 + 413 + return (esr == ESR_ELx_FSC_FAULT_L(3)) || 414 + (esr == ESR_ELx_FSC_FAULT_L(2)) || 415 + (esr == ESR_ELx_FSC_FAULT_L(1)) || 416 + (esr == ESR_ELx_FSC_FAULT_L(0)) || 417 + (esr == ESR_ELx_FSC_FAULT_L(-1)); 418 + } 419 + 420 + static inline bool esr_fsc_is_permission_fault(unsigned long esr) 421 + { 422 + esr = esr & ESR_ELx_FSC; 423 + 424 + return (esr == ESR_ELx_FSC_PERM_L(3)) || 425 + (esr == ESR_ELx_FSC_PERM_L(2)) || 426 + (esr == ESR_ELx_FSC_PERM_L(1)) || 427 + (esr == ESR_ELx_FSC_PERM_L(0)); 428 + } 429 + 430 + static inline bool esr_fsc_is_access_flag_fault(unsigned long esr) 431 + { 432 + esr = esr & ESR_ELx_FSC; 433 + 434 + return (esr == ESR_ELx_FSC_ACCESS_L(3)) || 435 + (esr == ESR_ELx_FSC_ACCESS_L(2)) || 436 + (esr == ESR_ELx_FSC_ACCESS_L(1)) || 437 + (esr == ESR_ELx_FSC_ACCESS_L(0)); 438 + } 439 + 440 + /* Indicate whether ESR.EC==0x1A is for an ERETAx instruction */ 441 + static inline bool esr_iss_is_eretax(unsigned long esr) 442 + { 443 + return esr & ESR_ELx_ERET_ISS_ERET; 444 + } 445 + 446 + /* Indicate which key is used for ERETAx (false: A-Key, true: B-Key) */ 447 + static inline bool esr_iss_is_eretab(unsigned long esr) 448 + { 449 + return esr & ESR_ELx_ERET_ISS_ERETA; 450 + } 451 + 452 + const char *esr_get_class_string(unsigned long esr); 453 + #endif /* __ASSEMBLY */ 454 + 455 + #endif /* __ASM_ESR_H */
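
Review note on the esr.h hunk above: the exception class sits in bits [31:26] of the syndrome, and the fault status code in the low six bits carries the lookup level in its bottom two bits. The sketch copies a handful of the definitions (with UL() expanded by hand) and repeats the two checks that expect_sea_handler() in mmio_abort.c makes. The helper name esr_is_sea_dabt_cur is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Copied from the hunk above, with UL(x) written as a plain UL suffix. */
#define ESR_ELx_EC_SHIFT	26
#define ESR_ELx_EC_MASK		(0x3FUL << ESR_ELx_EC_SHIFT)
#define ESR_ELx_EC(esr)		(((esr) & ESR_ELx_EC_MASK) >> ESR_ELx_EC_SHIFT)
#define ESR_ELx_EC_DABT_CUR	0x25UL

#define ESR_ELx_FSC_TYPE	0x3C
#define ESR_ELx_FSC_EXTABT	0x10
#define ESR_ELx_FSC_FAULT	0x04
#define ESR_ELx_FSC_FAULT_nL	0x2C
#define ESR_ELx_FSC_FAULT_L(n)	(((n) < 0 ? ESR_ELx_FSC_FAULT_nL : \
					    ESR_ELx_FSC_FAULT) + (n))

/* Translation fault codes: levels 0..3 follow 0x04, level -1 sits at 0x2B. */
static_assert(ESR_ELx_FSC_FAULT_L(0) == 0x04, "level 0");
static_assert(ESR_ELx_FSC_FAULT_L(3) == 0x07, "level 3");
static_assert(ESR_ELx_FSC_FAULT_L(-1) == 0x2B, "level -1");

/*
 * Hypothetical helper: data abort from the current EL whose fault status
 * type is "external abort", i.e. the condition the selftest asserts on.
 */
static bool esr_is_sea_dabt_cur(unsigned long esr)
{
	return ESR_ELx_EC(esr) == ESR_ELx_EC_DABT_CUR &&
	       (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_EXTABT;
}
```

Because ESR_ELx_FSC_TYPE clears the two level bits, the comparison accepts any code in the 0x10 to 0x13 range, not only 0x10 itself.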

+2 -1

tools/arch/s390/include/uapi/asm/kvm.h

···
469 469     __u8 kdsa[16]; /* with MSA9 */
470 470     __u8 sortl[32]; /* with STFLE.150 */
471 471     __u8 dfltcc[32]; /* with STFLE.151 */
472 -     __u8 reserved[1728];
472 +     __u8 pfcr[16]; /* with STFLE.201 */
473 +     __u8 reserved[1712];
473 474 };
474 475
475 476 #define KVM_S390_VM_CPU_PROCESSOR_UV_FEAT_GUEST 6
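
Review note on the s390 hunk above: the 16-byte pfcr field is carved out of the front of reserved (1728 - 16 = 1712), so the overall structure size and the offsets of every earlier member stay the same for existing userspace. A minimal sketch of that invariant using two cut-down, hypothetical structs that keep only the tail members visible in this hunk:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down layouts: only the tail shown in the hunk. */
struct subfunc_tail_old {
	uint8_t dfltcc[32];
	uint8_t reserved[1728];
};

struct subfunc_tail_new {
	uint8_t dfltcc[32];
	uint8_t pfcr[16];
	uint8_t reserved[1712];
};

/* The new field takes over the first bytes of the old reserved area. */
static_assert(sizeof(struct subfunc_tail_old) ==
	      sizeof(struct subfunc_tail_new), "size unchanged");
static_assert(offsetof(struct subfunc_tail_new, pfcr) ==
	      offsetof(struct subfunc_tail_old, reserved), "pfcr reuses reserved");
```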

+4 -1

tools/testing/selftests/kvm/Makefile

···
55 55 LIBKVM_s390x += lib/s390x/diag318_test_handler.c
56 56 LIBKVM_s390x += lib/s390x/processor.c
57 57 LIBKVM_s390x += lib/s390x/ucall.c
58 + LIBKVM_s390x += lib/s390x/facility.c
58 59
59 60 LIBKVM_riscv += lib/riscv/handlers.S
60 61 LIBKVM_riscv += lib/riscv/processor.c
···
68 67 TEST_GEN_PROGS_x86_64 = x86_64/cpuid_test
69 68 TEST_GEN_PROGS_x86_64 += x86_64/cr4_cpuid_sync_test
70 69 TEST_GEN_PROGS_x86_64 += x86_64/dirty_log_page_splitting_test
71 - TEST_GEN_PROGS_x86_64 += x86_64/get_msr_index_features
70 + TEST_GEN_PROGS_x86_64 += x86_64/feature_msrs_test
72 71 TEST_GEN_PROGS_x86_64 += x86_64/exit_on_emulation_failure_test
73 72 TEST_GEN_PROGS_x86_64 += x86_64/fix_hypercall_test
74 73 TEST_GEN_PROGS_x86_64 += x86_64/hwcr_msr_test
···
157 156 TEST_GEN_PROGS_aarch64 += aarch64/arch_timer_edge_cases
158 157 TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions
159 158 TEST_GEN_PROGS_aarch64 += aarch64/hypercalls
159 + TEST_GEN_PROGS_aarch64 += aarch64/mmio_abort
160 160 TEST_GEN_PROGS_aarch64 += aarch64/page_fault_test
161 161 TEST_GEN_PROGS_aarch64 += aarch64/psci_test
162 162 TEST_GEN_PROGS_aarch64 += aarch64/set_id_regs
···
191 189 TEST_GEN_PROGS_s390x += s390x/tprot
192 190 TEST_GEN_PROGS_s390x += s390x/cmma_test
193 191 TEST_GEN_PROGS_s390x += s390x/debug_test
192 + TEST_GEN_PROGS_s390x += s390x/cpumodel_subfuncs_test
194 193 TEST_GEN_PROGS_s390x += s390x/shared_zeropage_test
195 194 TEST_GEN_PROGS_s390x += s390x/ucontrol_test
196 195 TEST_GEN_PROGS_s390x += demand_paging_test

+5 -5

tools/testing/selftests/kvm/aarch64/debug-exceptions.c

···
433 433     vcpu_init_descriptor_tables(vcpu);
434 434
435 435     vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
436 -                 ESR_EC_BRK_INS, guest_sw_bp_handler);
436 +                 ESR_ELx_EC_BRK64, guest_sw_bp_handler);
437 437     vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
438 -                 ESR_EC_HW_BP_CURRENT, guest_hw_bp_handler);
438 +                 ESR_ELx_EC_BREAKPT_CUR, guest_hw_bp_handler);
439 439     vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
440 -                 ESR_EC_WP_CURRENT, guest_wp_handler);
440 +                 ESR_ELx_EC_WATCHPT_CUR, guest_wp_handler);
441 441     vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
442 -                 ESR_EC_SSTEP_CURRENT, guest_ss_handler);
442 +                 ESR_ELx_EC_SOFTSTP_CUR, guest_ss_handler);
443 443     vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
444 -                 ESR_EC_SVC64, guest_svc_handler);
444 +                 ESR_ELx_EC_SVC64, guest_svc_handler);
445 445
446 446     /* Specify bpn/wpn/ctx_bpn to be tested */
447 447     vcpu_args_set(vcpu, 3, bpn, wpn, ctx_bpn);

+159

tools/testing/selftests/kvm/aarch64/mmio_abort.c

··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * mmio_abort - Tests for userspace MMIO abort injection 4 + * 5 + * Copyright (c) 2024 Google LLC 6 + */ 7 + #include "processor.h" 8 + #include "test_util.h" 9 + 10 + #define MMIO_ADDR 0x8000000ULL 11 + 12 + static u64 expected_abort_pc; 13 + 14 + static void expect_sea_handler(struct ex_regs *regs) 15 + { 16 + u64 esr = read_sysreg(esr_el1); 17 + 18 + GUEST_ASSERT_EQ(regs->pc, expected_abort_pc); 19 + GUEST_ASSERT_EQ(ESR_ELx_EC(esr), ESR_ELx_EC_DABT_CUR); 20 + GUEST_ASSERT_EQ(esr & ESR_ELx_FSC_TYPE, ESR_ELx_FSC_EXTABT); 21 + 22 + GUEST_DONE(); 23 + } 24 + 25 + static void unexpected_dabt_handler(struct ex_regs *regs) 26 + { 27 + GUEST_FAIL("Unexpected data abort at PC: %lx\n", regs->pc); 28 + } 29 + 30 + static struct kvm_vm *vm_create_with_dabt_handler(struct kvm_vcpu **vcpu, void *guest_code, 31 + handler_fn dabt_handler) 32 + { 33 + struct kvm_vm *vm = vm_create_with_one_vcpu(vcpu, guest_code); 34 + 35 + vm_init_descriptor_tables(vm); 36 + vcpu_init_descriptor_tables(*vcpu); 37 + vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, ESR_ELx_EC_DABT_CUR, dabt_handler); 38 + 39 + virt_map(vm, MMIO_ADDR, MMIO_ADDR, 1); 40 + 41 + return vm; 42 + } 43 + 44 + static void vcpu_inject_extabt(struct kvm_vcpu *vcpu) 45 + { 46 + struct kvm_vcpu_events events = {}; 47 + 48 + events.exception.ext_dabt_pending = true; 49 + vcpu_events_set(vcpu, &events); 50 + } 51 + 52 + static void vcpu_run_expect_done(struct kvm_vcpu *vcpu) 53 + { 54 + struct ucall uc; 55 + 56 + vcpu_run(vcpu); 57 + switch (get_ucall(vcpu, &uc)) { 58 + case UCALL_ABORT: 59 + REPORT_GUEST_ASSERT(uc); 60 + break; 61 + case UCALL_DONE: 62 + break; 63 + default: 64 + TEST_FAIL("Unexpected ucall: %lu", uc.cmd); 65 + } 66 + } 67 + 68 + extern char test_mmio_abort_insn; 69 + 70 + static void test_mmio_abort_guest(void) 71 + { 72 + WRITE_ONCE(expected_abort_pc, (u64)&test_mmio_abort_insn); 73 + 74 + asm volatile("test_mmio_abort_insn:\n\t" 75 + "ldr x0, 
[%0]\n\t" 76 + : : "r" (MMIO_ADDR) : "x0", "memory"); 77 + 78 + GUEST_FAIL("MMIO instruction should not retire"); 79 + } 80 + 81 + /* 82 + * Test that KVM doesn't complete MMIO emulation when userspace has made an 83 + * external abort pending for the instruction. 84 + */ 85 + static void test_mmio_abort(void) 86 + { 87 + struct kvm_vcpu *vcpu; 88 + struct kvm_vm *vm = vm_create_with_dabt_handler(&vcpu, test_mmio_abort_guest, 89 + expect_sea_handler); 90 + struct kvm_run *run = vcpu->run; 91 + 92 + vcpu_run(vcpu); 93 + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_MMIO); 94 + TEST_ASSERT_EQ(run->mmio.phys_addr, MMIO_ADDR); 95 + TEST_ASSERT_EQ(run->mmio.len, sizeof(unsigned long)); 96 + TEST_ASSERT(!run->mmio.is_write, "Expected MMIO read"); 97 + 98 + vcpu_inject_extabt(vcpu); 99 + vcpu_run_expect_done(vcpu); 100 + kvm_vm_free(vm); 101 + } 102 + 103 + extern char test_mmio_nisv_insn; 104 + 105 + static void test_mmio_nisv_guest(void) 106 + { 107 + WRITE_ONCE(expected_abort_pc, (u64)&test_mmio_nisv_insn); 108 + 109 + asm volatile("test_mmio_nisv_insn:\n\t" 110 + "ldr x0, [%0], #8\n\t" 111 + : : "r" (MMIO_ADDR) : "x0", "memory"); 112 + 113 + GUEST_FAIL("MMIO instruction should not retire"); 114 + } 115 + 116 + /* 117 + * Test that the KVM_RUN ioctl fails for ESR_EL2.ISV=0 MMIO aborts if userspace 118 + * hasn't enabled KVM_CAP_ARM_NISV_TO_USER. 119 + */ 120 + static void test_mmio_nisv(void) 121 + { 122 + struct kvm_vcpu *vcpu; 123 + struct kvm_vm *vm = vm_create_with_dabt_handler(&vcpu, test_mmio_nisv_guest, 124 + unexpected_dabt_handler); 125 + 126 + TEST_ASSERT(_vcpu_run(vcpu), "Expected nonzero return code from KVM_RUN"); 127 + TEST_ASSERT_EQ(errno, ENOSYS); 128 + 129 + kvm_vm_free(vm); 130 + } 131 + 132 + /* 133 + * Test that ESR_EL2.ISV=0 MMIO aborts reach userspace and that an injected SEA 134 + * reaches the guest. 
135 + */ 136 + static void test_mmio_nisv_abort(void) 137 + { 138 + struct kvm_vcpu *vcpu; 139 + struct kvm_vm *vm = vm_create_with_dabt_handler(&vcpu, test_mmio_nisv_guest, 140 + expect_sea_handler); 141 + struct kvm_run *run = vcpu->run; 142 + 143 + vm_enable_cap(vm, KVM_CAP_ARM_NISV_TO_USER, 1); 144 + 145 + vcpu_run(vcpu); 146 + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_ARM_NISV); 147 + TEST_ASSERT_EQ(run->arm_nisv.fault_ipa, MMIO_ADDR); 148 + 149 + vcpu_inject_extabt(vcpu); 150 + vcpu_run_expect_done(vcpu); 151 + kvm_vm_free(vm); 152 + } 153 + 154 + int main(void) 155 + { 156 + test_mmio_abort(); 157 + test_mmio_nisv(); 158 + test_mmio_nisv_abort(); 159 + }

+1 -1

tools/testing/selftests/kvm/aarch64/no-vgic-v3.c

···
150 150     vcpu_init_descriptor_tables(vcpu);
151 151
152 152     vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
153 -                 ESR_EC_UNKNOWN, guest_undef_handler);
153 +                 ESR_ELx_EC_UNKNOWN, guest_undef_handler);
154 154
155 155     test_run_vcpu(vcpu);
156 156

+2 -2

tools/testing/selftests/kvm/aarch64/page_fault_test.c

···
544 544     vcpu_init_descriptor_tables(vcpu);
545 545
546 546     vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
547 -                 ESR_EC_DABT, no_dabt_handler);
547 +                 ESR_ELx_EC_DABT_CUR, no_dabt_handler);
548 548     vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT,
549 -                 ESR_EC_IABT, no_iabt_handler);
549 +                 ESR_ELx_EC_IABT_CUR, no_iabt_handler);
550 550 }
551 551
552 552 static void setup_gva_maps(struct kvm_vm *vm)

+92

tools/testing/selftests/kvm/aarch64/psci_test.c

··· 54 54 return res.a0; 55 55 } 56 56 57 + static uint64_t psci_system_off2(uint64_t type, uint64_t cookie) 58 + { 59 + struct arm_smccc_res res; 60 + 61 + smccc_hvc(PSCI_1_3_FN64_SYSTEM_OFF2, type, cookie, 0, 0, 0, 0, 0, &res); 62 + 63 + return res.a0; 64 + } 65 + 57 66 static uint64_t psci_features(uint32_t func_id) 58 67 { 59 68 struct arm_smccc_res res; ··· 197 188 kvm_vm_free(vm); 198 189 } 199 190 191 + static void guest_test_system_off2(void) 192 + { 193 + uint64_t ret; 194 + 195 + /* assert that SYSTEM_OFF2 is discoverable */ 196 + GUEST_ASSERT(psci_features(PSCI_1_3_FN_SYSTEM_OFF2) & 197 + PSCI_1_3_OFF_TYPE_HIBERNATE_OFF); 198 + GUEST_ASSERT(psci_features(PSCI_1_3_FN64_SYSTEM_OFF2) & 199 + PSCI_1_3_OFF_TYPE_HIBERNATE_OFF); 200 + 201 + /* With non-zero 'cookie' field, it should fail */ 202 + ret = psci_system_off2(PSCI_1_3_OFF_TYPE_HIBERNATE_OFF, 1); 203 + GUEST_ASSERT(ret == PSCI_RET_INVALID_PARAMS); 204 + 205 + /* 206 + * This would normally never return, so KVM sets the return value 207 + * to PSCI_RET_INTERNAL_FAILURE. The test case *does* return, so 208 + * that it can test both values for HIBERNATE_OFF. 209 + */ 210 + ret = psci_system_off2(PSCI_1_3_OFF_TYPE_HIBERNATE_OFF, 0); 211 + GUEST_ASSERT(ret == PSCI_RET_INTERNAL_FAILURE); 212 + 213 + /* 214 + * Revision F.b of the PSCI v1.3 specification documents zero as an 215 + * alias for HIBERNATE_OFF, since that's the value used in earlier 216 + * revisions of the spec and some implementations in the field. 
217 + */ 218 + ret = psci_system_off2(0, 1); 219 + GUEST_ASSERT(ret == PSCI_RET_INVALID_PARAMS); 220 + 221 + ret = psci_system_off2(0, 0); 222 + GUEST_ASSERT(ret == PSCI_RET_INTERNAL_FAILURE); 223 + 224 + GUEST_DONE(); 225 + } 226 + 227 + static void host_test_system_off2(void) 228 + { 229 + struct kvm_vcpu *source, *target; 230 + struct kvm_mp_state mps; 231 + uint64_t psci_version = 0; 232 + int nr_shutdowns = 0; 233 + struct kvm_run *run; 234 + struct ucall uc; 235 + 236 + setup_vm(guest_test_system_off2, &source, &target); 237 + 238 + vcpu_get_reg(target, KVM_REG_ARM_PSCI_VERSION, &psci_version); 239 + 240 + TEST_ASSERT(psci_version >= PSCI_VERSION(1, 3), 241 + "Unexpected PSCI version %lu.%lu", 242 + PSCI_VERSION_MAJOR(psci_version), 243 + PSCI_VERSION_MINOR(psci_version)); 244 + 245 + vcpu_power_off(target); 246 + run = source->run; 247 + 248 + enter_guest(source); 249 + while (run->exit_reason == KVM_EXIT_SYSTEM_EVENT) { 250 + TEST_ASSERT(run->system_event.type == KVM_SYSTEM_EVENT_SHUTDOWN, 251 + "Unhandled system event: %u (expected: %u)", 252 + run->system_event.type, KVM_SYSTEM_EVENT_SHUTDOWN); 253 + TEST_ASSERT(run->system_event.ndata >= 1, 254 + "Unexpected amount of system event data: %u (expected, >= 1)", 255 + run->system_event.ndata); 256 + TEST_ASSERT(run->system_event.data[0] & KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2, 257 + "PSCI_OFF2 flag not set. 
Flags %llu (expected %llu)", 258 + run->system_event.data[0], KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2); 259 + 260 + nr_shutdowns++; 261 + 262 + /* Restart the vCPU */ 263 + mps.mp_state = KVM_MP_STATE_RUNNABLE; 264 + vcpu_mp_state_set(source, &mps); 265 + 266 + enter_guest(source); 267 + } 268 + 269 + TEST_ASSERT(get_ucall(source, &uc) == UCALL_DONE, "Guest did not exit cleanly"); 270 + TEST_ASSERT(nr_shutdowns == 2, "Two shutdown events were expected, but saw %d", nr_shutdowns); 271 + } 272 + 200 273 int main(void) 201 274 { 202 275 TEST_REQUIRE(kvm_has_cap(KVM_CAP_ARM_SYSTEM_SUSPEND)); 203 276 204 277 host_test_cpu_on(); 205 278 host_test_system_suspend(); 279 + host_test_system_off2(); 206 280 return 0; 207 281 }
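Note on the version check in host_test_system_off2(): the comparison against PSCI_VERSION(1, 3) works because the major number sits above the minor number in one integer, so a plain >= orders versions correctly. The sketch below illustrates that packing. The DEMO_* names and the 16-bit split are assumptions written for this note and are not taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration: major in bits [31:16], minor in [15:0]. */
#define DEMO_PSCI_MAJOR_SHIFT	16
#define DEMO_PSCI_MINOR_MASK	((1U << DEMO_PSCI_MAJOR_SHIFT) - 1)

/* Pack a (major, minor) pair into a single comparable integer. */
static inline uint32_t demo_psci_version(uint32_t major, uint32_t minor)
{
	return (major << DEMO_PSCI_MAJOR_SHIFT) | (minor & DEMO_PSCI_MINOR_MASK);
}

/* Unpack the major part. */
static inline uint32_t demo_psci_major(uint32_t ver)
{
	return ver >> DEMO_PSCI_MAJOR_SHIFT;
}

/* Unpack the minor part. */
static inline uint32_t demo_psci_minor(uint32_t ver)
{
	return ver & DEMO_PSCI_MINOR_MASK;
}

/* Sanity checks for the packing and the ordering it gives. */
static inline void demo_psci_self_check(void)
{
	uint32_t v13 = demo_psci_version(1, 3);

	assert(v13 == 0x10003);
	assert(demo_psci_major(v13) == 1);
	assert(demo_psci_minor(v13) == 3);
	/* A plain integer comparison orders versions correctly. */
	assert(demo_psci_version(1, 2) < v13);
	assert(demo_psci_version(2, 0) > v13);
}
```

With this layout, a version check such as "at least 1.3" reduces to a single unsigned comparison.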

+98 -1

tools/testing/selftests/kvm/aarch64/set_id_regs.c

··· 443 443 } 444 444 } 445 445 446 + #define MPAM_IDREG_TEST 6 447 + static void test_user_set_mpam_reg(struct kvm_vcpu *vcpu) 448 + { 449 + uint64_t masks[KVM_ARM_FEATURE_ID_RANGE_SIZE]; 450 + struct reg_mask_range range = { 451 + .addr = (__u64)masks, 452 + }; 453 + uint64_t val; 454 + int idx, err; 455 + 456 + /* 457 + * If ID_AA64PFR0.MPAM is _not_ officially modifiable and is zero, 458 + * check that if it can be set to 1, (i.e. it is supported by the 459 + * hardware), that it can't be set to other values. 460 + */ 461 + 462 + /* Get writable masks for feature ID registers */ 463 + memset(range.reserved, 0, sizeof(range.reserved)); 464 + vm_ioctl(vcpu->vm, KVM_ARM_GET_REG_WRITABLE_MASKS, &range); 465 + 466 + /* Writeable? Nothing to test! */ 467 + idx = encoding_to_range_idx(SYS_ID_AA64PFR0_EL1); 468 + if ((masks[idx] & ID_AA64PFR0_EL1_MPAM_MASK) == ID_AA64PFR0_EL1_MPAM_MASK) { 469 + ksft_test_result_skip("ID_AA64PFR0_EL1.MPAM is officially writable, nothing to test\n"); 470 + return; 471 + } 472 + 473 + /* Get the id register value */ 474 + vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1), &val); 475 + 476 + /* Try to set MPAM=0. This should always be possible. 
*/ 477 + val &= ~ID_AA64PFR0_EL1_MPAM_MASK; 478 + val |= FIELD_PREP(ID_AA64PFR0_EL1_MPAM_MASK, 0); 479 + err = __vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1), val); 480 + if (err) 481 + ksft_test_result_fail("ID_AA64PFR0_EL1.MPAM=0 was not accepted\n"); 482 + else 483 + ksft_test_result_pass("ID_AA64PFR0_EL1.MPAM=0 worked\n"); 484 + 485 + /* Try to set MPAM=1 */ 486 + val &= ~ID_AA64PFR0_EL1_MPAM_MASK; 487 + val |= FIELD_PREP(ID_AA64PFR0_EL1_MPAM_MASK, 1); 488 + err = __vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1), val); 489 + if (err) 490 + ksft_test_result_skip("ID_AA64PFR0_EL1.MPAM is not writable, nothing to test\n"); 491 + else 492 + ksft_test_result_pass("ID_AA64PFR0_EL1.MPAM=1 was writable\n"); 493 + 494 + /* Try to set MPAM=2 */ 495 + val &= ~ID_AA64PFR0_EL1_MPAM_MASK; 496 + val |= FIELD_PREP(ID_AA64PFR0_EL1_MPAM_MASK, 2); 497 + err = __vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1), val); 498 + if (err) 499 + ksft_test_result_pass("ID_AA64PFR0_EL1.MPAM not arbitrarily modifiable\n"); 500 + else 501 + ksft_test_result_fail("ID_AA64PFR0_EL1.MPAM value should not be ignored\n"); 502 + 503 + /* And again for ID_AA64PFR1_EL1.MPAM_frac */ 504 + idx = encoding_to_range_idx(SYS_ID_AA64PFR1_EL1); 505 + if ((masks[idx] & ID_AA64PFR1_EL1_MPAM_frac_MASK) == ID_AA64PFR1_EL1_MPAM_frac_MASK) { 506 + ksft_test_result_skip("ID_AA64PFR1_EL1.MPAM_frac is officially writable, nothing to test\n"); 507 + return; 508 + } 509 + 510 + /* Get the id register value */ 511 + vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR1_EL1), &val); 512 + 513 + /* Try to set MPAM_frac=0. This should always be possible. 
*/ 514 + val &= ~ID_AA64PFR1_EL1_MPAM_frac_MASK; 515 + val |= FIELD_PREP(ID_AA64PFR1_EL1_MPAM_frac_MASK, 0); 516 + err = __vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR1_EL1), val); 517 + if (err) 518 + ksft_test_result_fail("ID_AA64PFR0_EL1.MPAM_frac=0 was not accepted\n"); 519 + else 520 + ksft_test_result_pass("ID_AA64PFR0_EL1.MPAM_frac=0 worked\n"); 521 + 522 + /* Try to set MPAM_frac=1 */ 523 + val &= ~ID_AA64PFR1_EL1_MPAM_frac_MASK; 524 + val |= FIELD_PREP(ID_AA64PFR1_EL1_MPAM_frac_MASK, 1); 525 + err = __vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR1_EL1), val); 526 + if (err) 527 + ksft_test_result_skip("ID_AA64PFR1_EL1.MPAM_frac is not writable, nothing to test\n"); 528 + else 529 + ksft_test_result_pass("ID_AA64PFR0_EL1.MPAM_frac=1 was writable\n"); 530 + 531 + /* Try to set MPAM_frac=2 */ 532 + val &= ~ID_AA64PFR1_EL1_MPAM_frac_MASK; 533 + val |= FIELD_PREP(ID_AA64PFR1_EL1_MPAM_frac_MASK, 2); 534 + err = __vcpu_set_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR1_EL1), val); 535 + if (err) 536 + ksft_test_result_pass("ID_AA64PFR1_EL1.MPAM_frac not arbitrarily modifiable\n"); 537 + else 538 + ksft_test_result_fail("ID_AA64PFR1_EL1.MPAM_frac value should not be ignored\n"); 539 + } 540 + 446 541 static void test_guest_reg_read(struct kvm_vcpu *vcpu) 447 542 { 448 543 bool done = false; ··· 676 581 ARRAY_SIZE(ftr_id_aa64isar2_el1) + ARRAY_SIZE(ftr_id_aa64pfr0_el1) + 677 582 ARRAY_SIZE(ftr_id_aa64pfr1_el1) + ARRAY_SIZE(ftr_id_aa64mmfr0_el1) + 678 583 ARRAY_SIZE(ftr_id_aa64mmfr1_el1) + ARRAY_SIZE(ftr_id_aa64mmfr2_el1) + 679 - ARRAY_SIZE(ftr_id_aa64zfr0_el1) - ARRAY_SIZE(test_regs) + 2; 584 + ARRAY_SIZE(ftr_id_aa64zfr0_el1) - ARRAY_SIZE(test_regs) + 2 + 585 + MPAM_IDREG_TEST; 680 586 681 587 ksft_set_plan(test_cnt); 682 588 683 589 test_vm_ftr_id_regs(vcpu, aarch64_only); 684 590 test_vcpu_ftr_id_regs(vcpu); 591 + test_user_set_mpam_reg(vcpu); 685 592 686 593 test_guest_reg_read(vcpu); 687 594
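Note on the field updates in test_user_set_mpam_reg(): each attempt clears the 4-bit field and then ORs in the new value. The sketch below shows the same clear-then-set pattern for the MPAM field at bits [43:40], the position given in the cpu-feature-registers.rst hunk. The DEMO_* names are hypothetical stand-ins written for this note; they are not the kernel's FIELD_PREP or ID_AA64PFR0_EL1_MPAM_MASK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the MPAM field at bits [43:40]. */
#define DEMO_MPAM_SHIFT	40
#define DEMO_MPAM_MASK	(0xfULL << DEMO_MPAM_SHIFT)

/* Clear the field, then insert the new value; other bits are untouched. */
static inline uint64_t demo_set_mpam(uint64_t reg, uint64_t field)
{
	reg &= ~DEMO_MPAM_MASK;
	reg |= (field << DEMO_MPAM_SHIFT) & DEMO_MPAM_MASK;
	return reg;
}

/* Read the field back out of the register value. */
static inline uint64_t demo_get_mpam(uint64_t reg)
{
	return (reg & DEMO_MPAM_MASK) >> DEMO_MPAM_SHIFT;
}

/* Sanity checks for the clear-then-set pattern. */
static inline void demo_mpam_self_check(void)
{
	uint64_t all_ones = ~0ULL;

	assert(demo_set_mpam(0, 2) == 0x0000020000000000ULL);
	assert(demo_get_mpam(demo_set_mpam(all_ones, 0)) == 0);
	/* Bits outside the field survive the update. */
	assert((demo_set_mpam(all_ones, 0) | DEMO_MPAM_MASK) == all_ones);
	assert(demo_get_mpam(demo_set_mpam(all_ones, 1)) == 1);
}
```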

+6 -6

tools/testing/selftests/kvm/aarch64/vpmu_counter_access.c

···
300 300 	uint64_t esr, ec;
301 301
302 302 	esr = read_sysreg(esr_el1);
303 - 	ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
303 + 	ec = ESR_ELx_EC(esr);
304 304
305 305 	__GUEST_ASSERT(expected_ec == ec,
306 306 		       "PC: 0x%lx; ESR: 0x%lx; EC: 0x%lx; EC expected: 0x%lx",
···
338 338 	 * Reading/writing the event count/type registers should cause
339 339 	 * an UNDEFINED exception.
340 340 	 */
341 - 	TEST_EXCEPTION(ESR_EC_UNKNOWN, acc->read_cntr(pmc_idx));
342 - 	TEST_EXCEPTION(ESR_EC_UNKNOWN, acc->write_cntr(pmc_idx, 0));
343 - 	TEST_EXCEPTION(ESR_EC_UNKNOWN, acc->read_typer(pmc_idx));
344 - 	TEST_EXCEPTION(ESR_EC_UNKNOWN, acc->write_typer(pmc_idx, 0));
341 + 	TEST_EXCEPTION(ESR_ELx_EC_UNKNOWN, acc->read_cntr(pmc_idx));
342 + 	TEST_EXCEPTION(ESR_ELx_EC_UNKNOWN, acc->write_cntr(pmc_idx, 0));
343 + 	TEST_EXCEPTION(ESR_ELx_EC_UNKNOWN, acc->read_typer(pmc_idx));
344 + 	TEST_EXCEPTION(ESR_ELx_EC_UNKNOWN, acc->write_typer(pmc_idx, 0));
345 345 	/*
346 346 	 * The bit corresponding to the (unimplemented) counter in
347 347 	 * {PMCNTEN,PMINTEN,PMOVS}{SET,CLR} registers should be RAZ.
···
425 425
426 426 	vpmu_vm.vm = vm_create(1);
427 427 	vm_init_descriptor_tables(vpmu_vm.vm);
428 - 	for (ec = 0; ec < ESR_EC_NUM; ec++) {
428 + 	for (ec = 0; ec < ESR_ELx_EC_MAX + 1; ec++) {
429 429 		vm_install_sync_handler(vpmu_vm.vm, VECTOR_SYNC_CURRENT, ec,
430 430 					guest_sync_handler);
431 431 	}

-1

tools/testing/selftests/kvm/hardware_disable_test.c

···
20 20 #define SLEEPING_THREAD_NUM (1 << 4)
21 21 #define FORK_NUM (1ULL << 9)
22 22 #define DELAY_US_MAX 2000
23 - #define GUEST_CODE_PIO_PORT 4
24 23
25 24 sem_t *sem;
26 25

+2 -13

tools/testing/selftests/kvm/include/aarch64/processor.h

···
12 12
13 13 #include <linux/stringify.h>
14 14 #include <linux/types.h>
15 + #include <asm/brk-imm.h>
16 + #include <asm/esr.h>
15 17 #include <asm/sysreg.h>
16 18
17 19
···
101 99 	(v) == VECTOR_SYNC_CURRENT || \
102 100 	(v) == VECTOR_SYNC_LOWER_64 || \
103 101 	(v) == VECTOR_SYNC_LOWER_32)
104 -
105 - #define ESR_EC_NUM 64
106 - #define ESR_EC_SHIFT 26
107 - #define ESR_EC_MASK (ESR_EC_NUM - 1)
108 -
109 - #define ESR_EC_UNKNOWN 0x0
110 - #define ESR_EC_SVC64 0x15
111 - #define ESR_EC_IABT 0x21
112 - #define ESR_EC_DABT 0x25
113 - #define ESR_EC_HW_BP_CURRENT 0x31
114 - #define ESR_EC_SSTEP_CURRENT 0x33
115 - #define ESR_EC_WP_CURRENT 0x35
116 - #define ESR_EC_BRK_INS 0x3c
117 102
118 103 /* Access flag */
119 104 #define PTE_AF (1ULL << 10)
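Note on the removed ESR_EC_* macros: they describe a 6-bit exception class field starting at bit 26 of the syndrome register. The sketch below restates that extraction as a standalone function, using only the shift (26) and width (64 values) visible in the removed lines. The DEMO_* names are hypothetical and are not the definitions in <asm/esr.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the removed selftest macros; names are local. */
#define DEMO_ESR_EC_NUM		64
#define DEMO_ESR_EC_SHIFT	26
#define DEMO_ESR_EC_MASK	(DEMO_ESR_EC_NUM - 1)

/* Extract the exception class from a raw syndrome value. */
static inline uint64_t demo_esr_ec(uint64_t esr)
{
	return (esr >> DEMO_ESR_EC_SHIFT) & DEMO_ESR_EC_MASK;
}

/* Sanity checks against values listed in the removed macros. */
static inline void demo_esr_ec_self_check(void)
{
	/* The top six bits of 0x96000045 are 0b100101, i.e. 0x25 (DABT). */
	assert(demo_esr_ec(0x96000045ULL) == 0x25);
	assert(demo_esr_ec(0) == 0x0);
	/* The result always fits the handler table of 64 entries. */
	assert(demo_esr_ec(~0ULL) == DEMO_ESR_EC_NUM - 1);
}
```

The last check is the reason a handler array sized "maximum EC + 1" is enough for every possible class value.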

+50

tools/testing/selftests/kvm/include/s390x/facility.h

···
1 + /* SPDX-License-Identifier: GPL-2.0-only */
2 + /*
3 +  * Copyright IBM Corp. 2024
4 +  *
5 +  * Authors:
6 +  *  Hariharan Mari <hari55@linux.ibm.com>
7 +  *
8 +  * Get the facility bits with the STFLE instruction
9 +  */
10 +
11 + #ifndef SELFTEST_KVM_FACILITY_H
12 + #define SELFTEST_KVM_FACILITY_H
13 +
14 + #include <linux/bitops.h>
15 +
16 + /* alt_stfle_fac_list[16] + stfle_fac_list[16] */
17 + #define NB_STFL_DOUBLEWORDS 32
18 +
19 + extern uint64_t stfl_doublewords[NB_STFL_DOUBLEWORDS];
20 + extern bool stfle_flag;
21 +
22 + static inline bool test_bit_inv(unsigned long nr, const unsigned long *ptr)
23 + {
24 + 	return test_bit(nr ^ (BITS_PER_LONG - 1), ptr);
25 + }
26 +
27 + static inline void stfle(uint64_t *fac, unsigned int nb_doublewords)
28 + {
29 + 	register unsigned long r0 asm("0") = nb_doublewords - 1;
30 +
31 + 	asm volatile(" .insn s,0xb2b00000,0(%1)\n"
32 + 		     : "+d" (r0)
33 + 		     : "a" (fac)
34 + 		     : "memory", "cc");
35 + }
36 +
37 + static inline void setup_facilities(void)
38 + {
39 + 	stfle(stfl_doublewords, NB_STFL_DOUBLEWORDS);
40 + 	stfle_flag = true;
41 + }
42 +
43 + static inline bool test_facility(int nr)
44 + {
45 + 	if (!stfle_flag)
46 + 		setup_facilities();
47 + 	return test_bit_inv(nr, stfl_doublewords);
48 + }
49 +
50 + #endif
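Note on test_bit_inv(): XOR-ing the bit number with BITS_PER_LONG - 1 flips only the position inside a word, so facility bit 0 is the most significant bit of the first doubleword. The sketch below is a portable restatement of that numbering, assuming 64-bit words. demo_test_facility() is a hypothetical helper for illustration; it does not execute STFLE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * MSB-first lookup: bit 0 is the top bit of fac[0], bit 63 its lowest
 * bit, bit 64 the top bit of fac[1], and so on.
 */
static inline bool demo_test_facility(const uint64_t *fac, unsigned int nr)
{
	return (fac[nr / 64] >> (63 - (nr % 64))) & 1;
}

/* Same lookup written the way the header does it: LSB-first on nr ^ 63. */
static inline bool demo_test_facility_xor(const uint64_t *fac, unsigned int nr)
{
	unsigned int inv = nr ^ 63;

	return (fac[inv / 64] >> (inv % 64)) & 1;
}

/* Check that both formulations agree on a small hand-built list. */
static inline void demo_facility_self_check(void)
{
	const uint64_t fac[2] = { 0x8000000000000000ULL, 0x1ULL };
	unsigned int nr;

	assert(demo_test_facility(fac, 0));
	assert(!demo_test_facility(fac, 1));
	assert(!demo_test_facility(fac, 64));
	assert(demo_test_facility(fac, 127));
	for (nr = 0; nr < 128; nr++)
		assert(demo_test_facility(fac, nr) == demo_test_facility_xor(fac, nr));
}
```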

+6

tools/testing/selftests/kvm/include/s390x/processor.h

···
32 32 	barrier();
33 33 }
34 34
35 + /* Get the instruction length */
36 + static inline int insn_length(unsigned char code)
37 + {
38 + 	return ((((int)code + 64) >> 7) + 1) << 1;
39 + }
40 +
35 41 #endif
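Note on insn_length(): the arithmetic maps the top two bits of the first opcode byte to a length of 2, 4 or 6 bytes. The sketch below places the same expression next to an explicit table-style version and checks that they agree for all 256 byte values. Both demo_* functions are illustrative names, not part of the selftest headers.

```c
#include <assert.h>

/* Same expression as the helper added in the diff. */
static inline int demo_insn_length(unsigned char code)
{
	return ((((int)code + 64) >> 7) + 1) << 1;
}

/* Explicit form: top two bits 00 -> 2, 01 or 10 -> 4, 11 -> 6 bytes. */
static inline int demo_insn_length_ref(unsigned char code)
{
	switch (code >> 6) {
	case 0:
		return 2;
	case 3:
		return 6;
	default:
		return 4;
	}
}

/* Exhaustively compare both forms over every possible first byte. */
static inline void demo_insn_length_self_check(void)
{
	int code;

	for (code = 0; code < 256; code++)
		assert(demo_insn_length((unsigned char)code) ==
		       demo_insn_length_ref((unsigned char)code));
}
```

For example, a first byte of 0xB2 falls in the 4-byte group, which is why rewinding the PSW by the computed length re-executes the whole intercepted instruction.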

+5

tools/testing/selftests/kvm/include/x86_64/processor.h

···
1049 1049 	vcpu_ioctl(vcpu, KVM_GET_CPUID2, vcpu->cpuid);
1050 1050 }
1051 1051
1052 + static inline void vcpu_get_cpuid(struct kvm_vcpu *vcpu)
1053 + {
1054 + 	vcpu_ioctl(vcpu, KVM_GET_CPUID2, vcpu->cpuid);
1055 + }
1056 +
1052 1057 void vcpu_set_cpuid_property(struct kvm_vcpu *vcpu,
1053 1058 			     struct kvm_x86_cpu_property property,
1054 1059 			     uint32_t value);

+3 -3

tools/testing/selftests/kvm/lib/aarch64/processor.c

···
450 450 }
451 451
452 452 struct handlers {
453 - 	handler_fn exception_handlers[VECTOR_NUM][ESR_EC_NUM];
453 + 	handler_fn exception_handlers[VECTOR_NUM][ESR_ELx_EC_MAX + 1];
454 454 };
455 455
456 456 void vcpu_init_descriptor_tables(struct kvm_vcpu *vcpu)
···
469 469 	switch (vector) {
470 470 	case VECTOR_SYNC_CURRENT:
471 471 	case VECTOR_SYNC_LOWER_64:
472 - 		ec = (read_sysreg(esr_el1) >> ESR_EC_SHIFT) & ESR_EC_MASK;
472 + 		ec = ESR_ELx_EC(read_sysreg(esr_el1));
473 473 		valid_ec = true;
474 474 		break;
475 475 	case VECTOR_IRQ_CURRENT:
···
508 508
509 509 	assert(VECTOR_IS_SYNC(vector));
510 510 	assert(vector < VECTOR_NUM);
511 - 	assert(ec < ESR_EC_NUM);
511 + 	assert(ec <= ESR_ELx_EC_MAX);
512 512 	handlers->exception_handlers[vector][ec] = handler;
513 513 }
514 514

+6 -4

tools/testing/selftests/kvm/lib/kvm_util.c

···
720 720 	rb_erase(&region->hva_node, &vm->regions.hva_tree);
721 721 	hash_del(&region->slot_node);
722 722
723 - 	region->region.memory_size = 0;
724 - 	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
725 -
726 723 	sparsebit_free(&region->unused_phy_pages);
727 724 	sparsebit_free(&region->protected_phy_pages);
728 725 	ret = munmap(region->mmap_start, region->mmap_size);
···
1194 1197  */
1195 1198 void vm_mem_region_delete(struct kvm_vm *vm, uint32_t slot)
1196 1199 {
1197 - 	__vm_mem_region_delete(vm, memslot2region(vm, slot));
1200 + 	struct userspace_mem_region *region = memslot2region(vm, slot);
1201 +
1202 + 	region->region.memory_size = 0;
1203 + 	vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
1204 +
1205 + 	__vm_mem_region_delete(vm, region);
1198 1206 }
1199 1207
1200 1208 void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,

+14

tools/testing/selftests/kvm/lib/s390x/facility.c

···
1 + // SPDX-License-Identifier: GPL-2.0-only
2 + /*
3 +  * Copyright IBM Corp. 2024
4 +  *
5 +  * Authors:
6 +  *  Hariharan Mari <hari55@linux.ibm.com>
7 +  *
8 +  * Contains the definition for the global variables to have the test facility feature.
9 +  */
10 +
11 + #include "facility.h"
12 +
13 + uint64_t stfl_doublewords[NB_STFL_DOUBLEWORDS];
14 + bool stfle_flag;

+24

tools/testing/selftests/kvm/lib/x86_64/processor.c

··· 506 506 507 507 sregs.cr0 = X86_CR0_PE | X86_CR0_NE | X86_CR0_PG; 508 508 sregs.cr4 |= X86_CR4_PAE | X86_CR4_OSFXSR; 509 + if (kvm_cpu_has(X86_FEATURE_XSAVE)) 510 + sregs.cr4 |= X86_CR4_OSXSAVE; 509 511 sregs.efer |= (EFER_LME | EFER_LMA | EFER_NX); 510 512 511 513 kvm_seg_set_unusable(&sregs.ldt); ··· 519 517 520 518 sregs.cr3 = vm->pgd; 521 519 vcpu_sregs_set(vcpu, &sregs); 520 + } 521 + 522 + static void vcpu_init_xcrs(struct kvm_vm *vm, struct kvm_vcpu *vcpu) 523 + { 524 + struct kvm_xcrs xcrs = { 525 + .nr_xcrs = 1, 526 + .xcrs[0].xcr = 0, 527 + .xcrs[0].value = kvm_cpu_supported_xcr0(), 528 + }; 529 + 530 + if (!kvm_cpu_has(X86_FEATURE_XSAVE)) 531 + return; 532 + 533 + vcpu_xcrs_set(vcpu, &xcrs); 522 534 } 523 535 524 536 static void set_idt_entry(struct kvm_vm *vm, int vector, unsigned long addr, ··· 691 675 vcpu = __vm_vcpu_add(vm, vcpu_id); 692 676 vcpu_init_cpuid(vcpu, kvm_get_supported_cpuid()); 693 677 vcpu_init_sregs(vm, vcpu); 678 + vcpu_init_xcrs(vm, vcpu); 694 679 695 680 /* Setup guest general purpose registers */ 696 681 vcpu_regs_get(vcpu, &regs); ··· 703 686 mp_state.mp_state = 0; 704 687 vcpu_mp_state_set(vcpu, &mp_state); 705 688 689 + /* 690 + * Refresh CPUID after setting SREGS and XCR0, so that KVM's "runtime" 691 + * updates to guest CPUID, e.g. for OSXSAVE and XSAVE state size, are 692 + * reflected into selftests' vCPU CPUID cache, i.e. so that the cache 693 + * is consistent with vCPU state. 694 + */ 695 + vcpu_get_cpuid(vcpu); 706 696 return vcpu; 707 697 } 708 698

+301

tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c

··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * Copyright IBM Corp. 2024 4 + * 5 + * Authors: 6 + * Hariharan Mari <hari55@linux.ibm.com> 7 + * 8 + * The tests compare the result of the KVM ioctl for obtaining CPU subfunction data with those 9 + * from an ASM block performing the same CPU subfunction. Currently KVM doesn't mask instruction 10 + * query data reported via the CPU Model, allowing us to directly compare it with the data 11 + * acquired through executing the queries in the test. 12 + */ 13 + 14 + #include <stdio.h> 15 + #include <stdlib.h> 16 + #include <string.h> 17 + #include <sys/ioctl.h> 18 + #include "facility.h" 19 + 20 + #include "kvm_util.h" 21 + 22 + #define PLO_FUNCTION_MAX 256 23 + 24 + /* Query available CPU subfunctions */ 25 + struct kvm_s390_vm_cpu_subfunc cpu_subfunc; 26 + 27 + static void get_cpu_machine_subfuntions(struct kvm_vm *vm, 28 + struct kvm_s390_vm_cpu_subfunc *cpu_subfunc) 29 + { 30 + int r; 31 + 32 + r = __kvm_device_attr_get(vm->fd, KVM_S390_VM_CPU_MODEL, 33 + KVM_S390_VM_CPU_MACHINE_SUBFUNC, cpu_subfunc); 34 + 35 + TEST_ASSERT(!r, "Get cpu subfunctions failed r=%d errno=%d", r, errno); 36 + } 37 + 38 + static inline int plo_test_bit(unsigned char nr) 39 + { 40 + unsigned long function = nr | 0x100; 41 + int cc; 42 + 43 + asm volatile(" lgr 0,%[function]\n" 44 + /* Parameter registers are ignored for "test bit" */ 45 + " plo 0,0,0,0(0)\n" 46 + " ipm %0\n" 47 + " srl %0,28\n" 48 + : "=d" (cc) 49 + : [function] "d" (function) 50 + : "cc", "0"); 51 + return cc == 0; 52 + } 53 + 54 + /* Testing Perform Locked Operation (PLO) CPU subfunction's ASM block */ 55 + static void test_plo_asm_block(u8 (*query)[32]) 56 + { 57 + for (int i = 0; i < PLO_FUNCTION_MAX; ++i) { 58 + if (plo_test_bit(i)) 59 + (*query)[i >> 3] |= 0x80 >> (i & 7); 60 + } 61 + } 62 + 63 + /* Testing Crypto Compute Message Authentication Code (KMAC) CPU subfunction's ASM block */ 64 + static void test_kmac_asm_block(u8 (*query)[16]) 65 + { 66 + asm 
volatile(" la %%r1,%[query]\n" 67 + " xgr %%r0,%%r0\n" 68 + " .insn rre,0xb91e0000,0,2\n" 69 + : [query] "=R" (*query) 70 + : 71 + : "cc", "r0", "r1"); 72 + } 73 + 74 + /* Testing Crypto Cipher Message with Chaining (KMC) CPU subfunction's ASM block */ 75 + static void test_kmc_asm_block(u8 (*query)[16]) 76 + { 77 + asm volatile(" la %%r1,%[query]\n" 78 + " xgr %%r0,%%r0\n" 79 + " .insn rre,0xb92f0000,2,4\n" 80 + : [query] "=R" (*query) 81 + : 82 + : "cc", "r0", "r1"); 83 + } 84 + 85 + /* Testing Crypto Cipher Message (KM) CPU subfunction's ASM block */ 86 + static void test_km_asm_block(u8 (*query)[16]) 87 + { 88 + asm volatile(" la %%r1,%[query]\n" 89 + " xgr %%r0,%%r0\n" 90 + " .insn rre,0xb92e0000,2,4\n" 91 + : [query] "=R" (*query) 92 + : 93 + : "cc", "r0", "r1"); 94 + } 95 + 96 + /* Testing Crypto Compute Intermediate Message Digest (KIMD) CPU subfunction's ASM block */ 97 + static void test_kimd_asm_block(u8 (*query)[16]) 98 + { 99 + asm volatile(" la %%r1,%[query]\n" 100 + " xgr %%r0,%%r0\n" 101 + " .insn rre,0xb93e0000,0,2\n" 102 + : [query] "=R" (*query) 103 + : 104 + : "cc", "r0", "r1"); 105 + } 106 + 107 + /* Testing Crypto Compute Last Message Digest (KLMD) CPU subfunction's ASM block */ 108 + static void test_klmd_asm_block(u8 (*query)[16]) 109 + { 110 + asm volatile(" la %%r1,%[query]\n" 111 + " xgr %%r0,%%r0\n" 112 + " .insn rre,0xb93f0000,0,2\n" 113 + : [query] "=R" (*query) 114 + : 115 + : "cc", "r0", "r1"); 116 + } 117 + 118 + /* Testing Crypto Cipher Message with Counter (KMCTR) CPU subfunction's ASM block */ 119 + static void test_kmctr_asm_block(u8 (*query)[16]) 120 + { 121 + asm volatile(" la %%r1,%[query]\n" 122 + " xgr %%r0,%%r0\n" 123 + " .insn rrf,0xb92d0000,2,4,6,0\n" 124 + : [query] "=R" (*query) 125 + : 126 + : "cc", "r0", "r1"); 127 + } 128 + 129 + /* Testing Crypto Cipher Message with Cipher Feedback (KMF) CPU subfunction's ASM block */ 130 + static void test_kmf_asm_block(u8 (*query)[16]) 131 + { 132 + asm volatile(" la 
%%r1,%[query]\n" 133 + " xgr %%r0,%%r0\n" 134 + " .insn rre,0xb92a0000,2,4\n" 135 + : [query] "=R" (*query) 136 + : 137 + : "cc", "r0", "r1"); 138 + } 139 + 140 + /* Testing Crypto Cipher Message with Output Feedback (KMO) CPU subfunction's ASM block */ 141 + static void test_kmo_asm_block(u8 (*query)[16]) 142 + { 143 + asm volatile(" la %%r1,%[query]\n" 144 + " xgr %%r0,%%r0\n" 145 + " .insn rre,0xb92b0000,2,4\n" 146 + : [query] "=R" (*query) 147 + : 148 + : "cc", "r0", "r1"); 149 + } 150 + 151 + /* Testing Crypto Perform Cryptographic Computation (PCC) CPU subfunction's ASM block */ 152 + static void test_pcc_asm_block(u8 (*query)[16]) 153 + { 154 + asm volatile(" la %%r1,%[query]\n" 155 + " xgr %%r0,%%r0\n" 156 + " .insn rre,0xb92c0000,0,0\n" 157 + : [query] "=R" (*query) 158 + : 159 + : "cc", "r0", "r1"); 160 + } 161 + 162 + /* Testing Crypto Perform Random Number Operation (PRNO) CPU subfunction's ASM block */ 163 + static void test_prno_asm_block(u8 (*query)[16]) 164 + { 165 + asm volatile(" la %%r1,%[query]\n" 166 + " xgr %%r0,%%r0\n" 167 + " .insn rre,0xb93c0000,2,4\n" 168 + : [query] "=R" (*query) 169 + : 170 + : "cc", "r0", "r1"); 171 + } 172 + 173 + /* Testing Crypto Cipher Message with Authentication (KMA) CPU subfunction's ASM block */ 174 + static void test_kma_asm_block(u8 (*query)[16]) 175 + { 176 + asm volatile(" la %%r1,%[query]\n" 177 + " xgr %%r0,%%r0\n" 178 + " .insn rrf,0xb9290000,2,4,6,0\n" 179 + : [query] "=R" (*query) 180 + : 181 + : "cc", "r0", "r1"); 182 + } 183 + 184 + /* Testing Crypto Compute Digital Signature Authentication (KDSA) CPU subfunction's ASM block */ 185 + static void test_kdsa_asm_block(u8 (*query)[16]) 186 + { 187 + asm volatile(" la %%r1,%[query]\n" 188 + " xgr %%r0,%%r0\n" 189 + " .insn rre,0xb93a0000,0,2\n" 190 + : [query] "=R" (*query) 191 + : 192 + : "cc", "r0", "r1"); 193 + } 194 + 195 + /* Testing Sort Lists (SORTL) CPU subfunction's ASM block */ 196 + static void test_sortl_asm_block(u8 (*query)[32]) 197 + { 198 + 
asm volatile(" lghi 0,0\n" 199 + " la 1,%[query]\n" 200 + " .insn rre,0xb9380000,2,4\n" 201 + : [query] "=R" (*query) 202 + : 203 + : "cc", "0", "1"); 204 + } 205 + 206 + /* Testing Deflate Conversion Call (DFLTCC) CPU subfunction's ASM block */ 207 + static void test_dfltcc_asm_block(u8 (*query)[32]) 208 + { 209 + asm volatile(" lghi 0,0\n" 210 + " la 1,%[query]\n" 211 + " .insn rrf,0xb9390000,2,4,6,0\n" 212 + : [query] "=R" (*query) 213 + : 214 + : "cc", "0", "1"); 215 + } 216 + 217 + /* 218 + * Testing Perform Function with Concurrent Results (PFCR) 219 + * CPU subfunctions's ASM block 220 + */ 221 + static void test_pfcr_asm_block(u8 (*query)[16]) 222 + { 223 + asm volatile(" lghi 0,0\n" 224 + " .insn rsy,0xeb0000000016,0,0,%[query]\n" 225 + : [query] "=QS" (*query) 226 + : 227 + : "cc", "0"); 228 + } 229 + 230 + typedef void (*testfunc_t)(u8 (*array)[]); 231 + 232 + struct testdef { 233 + const char *subfunc_name; 234 + u8 *subfunc_array; 235 + size_t array_size; 236 + testfunc_t test; 237 + int facility_bit; 238 + } testlist[] = { 239 + /* 240 + * PLO was introduced in the very first 64-bit machine generation. 241 + * Hence it is assumed PLO is always installed in Z Arch. 
242 + */ 243 + { "PLO", cpu_subfunc.plo, sizeof(cpu_subfunc.plo), test_plo_asm_block, 1 }, 244 + /* MSA - Facility bit 17 */ 245 + { "KMAC", cpu_subfunc.kmac, sizeof(cpu_subfunc.kmac), test_kmac_asm_block, 17 }, 246 + { "KMC", cpu_subfunc.kmc, sizeof(cpu_subfunc.kmc), test_kmc_asm_block, 17 }, 247 + { "KM", cpu_subfunc.km, sizeof(cpu_subfunc.km), test_km_asm_block, 17 }, 248 + { "KIMD", cpu_subfunc.kimd, sizeof(cpu_subfunc.kimd), test_kimd_asm_block, 17 }, 249 + { "KLMD", cpu_subfunc.klmd, sizeof(cpu_subfunc.klmd), test_klmd_asm_block, 17 }, 250 + /* MSA - Facility bit 77 */ 251 + { "KMCTR", cpu_subfunc.kmctr, sizeof(cpu_subfunc.kmctr), test_kmctr_asm_block, 77 }, 252 + { "KMF", cpu_subfunc.kmf, sizeof(cpu_subfunc.kmf), test_kmf_asm_block, 77 }, 253 + { "KMO", cpu_subfunc.kmo, sizeof(cpu_subfunc.kmo), test_kmo_asm_block, 77 }, 254 + { "PCC", cpu_subfunc.pcc, sizeof(cpu_subfunc.pcc), test_pcc_asm_block, 77 }, 255 + /* MSA5 - Facility bit 57 */ 256 + { "PPNO", cpu_subfunc.ppno, sizeof(cpu_subfunc.ppno), test_prno_asm_block, 57 }, 257 + /* MSA8 - Facility bit 146 */ 258 + { "KMA", cpu_subfunc.kma, sizeof(cpu_subfunc.kma), test_kma_asm_block, 146 }, 259 + /* MSA9 - Facility bit 155 */ 260 + { "KDSA", cpu_subfunc.kdsa, sizeof(cpu_subfunc.kdsa), test_kdsa_asm_block, 155 }, 261 + /* SORTL - Facility bit 150 */ 262 + { "SORTL", cpu_subfunc.sortl, sizeof(cpu_subfunc.sortl), test_sortl_asm_block, 150 }, 263 + /* DFLTCC - Facility bit 151 */ 264 + { "DFLTCC", cpu_subfunc.dfltcc, sizeof(cpu_subfunc.dfltcc), test_dfltcc_asm_block, 151 }, 265 + /* Concurrent-function facility - Facility bit 201 */ 266 + { "PFCR", cpu_subfunc.pfcr, sizeof(cpu_subfunc.pfcr), test_pfcr_asm_block, 201 }, 267 + }; 268 + 269 + int main(int argc, char *argv[]) 270 + { 271 + struct kvm_vm *vm; 272 + int idx; 273 + 274 + ksft_print_header(); 275 + 276 + vm = vm_create(1); 277 + 278 + memset(&cpu_subfunc, 0, sizeof(cpu_subfunc)); 279 + get_cpu_machine_subfuntions(vm, &cpu_subfunc); 280 + 281 + 
 ksft_set_plan(ARRAY_SIZE(testlist)); 282 + for (idx = 0; idx < ARRAY_SIZE(testlist); idx++) { 283 + if (test_facility(testlist[idx].facility_bit)) { 284 + u8 *array = malloc(testlist[idx].array_size); 285 + 286 + testlist[idx].test((u8 (*)[testlist[idx].array_size])array); 287 + 288 + TEST_ASSERT_EQ(memcmp(testlist[idx].subfunc_array, 289 + array, testlist[idx].array_size), 0); 290 + 291 + ksft_test_result_pass("%s\n", testlist[idx].subfunc_name); 292 + free(array); 293 + } else { 294 + ksft_test_result_skip("%s feature is not available\n", 295 + testlist[idx].subfunc_name); 296 + } 297 + } 298 + 299 + kvm_vm_free(vm); 300 + ksft_finished(); 301 + }
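Note on test_plo_asm_block(): the 256 PLO function codes are packed into a 32-byte query array, MSB first, so function i lands in byte i >> 3 under mask 0x80 >> (i & 7). The sketch below isolates that packing so it can be checked without running PLO. The demo_* helpers are hypothetical names written for this note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_QUERY_BYTES 32	/* 256 function codes, one bit each */

/* Mark function code i as available, MSB-first within each byte. */
static inline void demo_set_query_bit(uint8_t *query, unsigned int i)
{
	query[i >> 3] |= 0x80 >> (i & 7);
}

/* Test whether function code i is marked available. */
static inline bool demo_get_query_bit(const uint8_t *query, unsigned int i)
{
	return query[i >> 3] & (0x80 >> (i & 7));
}

/* Check byte and bit placement for a few function codes. */
static inline void demo_query_self_check(void)
{
	uint8_t query[DEMO_QUERY_BYTES];

	memset(query, 0, sizeof(query));
	demo_set_query_bit(query, 0);
	demo_set_query_bit(query, 9);
	demo_set_query_bit(query, 255);

	assert(query[0] == 0x80);	/* code 0: top bit of byte 0 */
	assert(query[1] == 0x40);	/* code 9: second bit of byte 1 */
	assert(query[31] == 0x01);	/* code 255: lowest bit of byte 31 */
	assert(demo_get_query_bit(query, 9));
	assert(!demo_get_query_bit(query, 8));
}
```

This is the same MSB-first convention the facility list uses, which is why the locally built array can be compared bytewise with memcmp() against the data returned by the KVM ioctl.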

+314 -8

tools/testing/selftests/kvm/s390x/ucontrol_test.c

··· 16 16 #include <linux/capability.h> 17 17 #include <linux/sizes.h> 18 18 19 + #define PGM_SEGMENT_TRANSLATION 0x10 20 + 19 21 #define VM_MEM_SIZE (4 * SZ_1M) 22 + #define VM_MEM_EXT_SIZE (2 * SZ_1M) 23 + #define VM_MEM_MAX_M ((VM_MEM_SIZE + VM_MEM_EXT_SIZE) / SZ_1M) 20 24 21 25 /* so directly declare capget to check caps without libcap */ 22 26 int capget(cap_user_header_t header, cap_user_data_t data); ··· 62 58 " j 0b\n" 63 59 ); 64 60 61 + /* Test program manipulating memory */ 62 + extern char test_mem_asm[]; 63 + asm("test_mem_asm:\n" 64 + "xgr %r0, %r0\n" 65 + 66 + "0:\n" 67 + " ahi %r0,1\n" 68 + " st %r1,0(%r5,%r6)\n" 69 + 70 + " xgr %r1,%r1\n" 71 + " l %r1,0(%r5,%r6)\n" 72 + " ahi %r0,1\n" 73 + " diag 0,0,0x44\n" 74 + 75 + " j 0b\n" 76 + ); 77 + 78 + /* Test program manipulating storage keys */ 79 + extern char test_skey_asm[]; 80 + asm("test_skey_asm:\n" 81 + "xgr %r0, %r0\n" 82 + 83 + "0:\n" 84 + " ahi %r0,1\n" 85 + " st %r1,0(%r5,%r6)\n" 86 + 87 + " iske %r1,%r6\n" 88 + " ahi %r0,1\n" 89 + " diag 0,0,0x44\n" 90 + 91 + " sske %r1,%r6\n" 92 + " xgr %r1,%r1\n" 93 + " iske %r1,%r6\n" 94 + " ahi %r0,1\n" 95 + " diag 0,0,0x44\n" 96 + 97 + " rrbe %r1,%r6\n" 98 + " iske %r1,%r6\n" 99 + " ahi %r0,1\n" 100 + " diag 0,0,0x44\n" 101 + 102 + " j 0b\n" 103 + ); 104 + 65 105 FIXTURE(uc_kvm) 66 106 { 67 107 struct kvm_s390_sie_block *sie_block; ··· 115 67 uintptr_t base_hva; 116 68 uintptr_t code_hva; 117 69 int kvm_run_size; 70 + vm_paddr_t pgd; 118 71 void *vm_mem; 119 72 int vcpu_fd; 120 73 int kvm_fd; ··· 165 116 self->base_gpa = 0; 166 117 self->code_gpa = self->base_gpa + (3 * SZ_1M); 167 118 168 - self->vm_mem = aligned_alloc(SZ_1M, VM_MEM_SIZE); 119 + self->vm_mem = aligned_alloc(SZ_1M, VM_MEM_MAX_M * SZ_1M); 169 120 ASSERT_NE(NULL, self->vm_mem) TH_LOG("malloc failed %u", errno); 170 121 self->base_hva = (uintptr_t)self->vm_mem; 171 122 self->code_hva = self->base_hva - self->base_gpa + self->code_gpa; ··· 271 222 close(kvm_fd); 272 223 } 273 224 274 - /* 
verify SIEIC exit 225 + /* calculate host virtual addr from guest physical addr */ 226 + static void *gpa2hva(FIXTURE_DATA(uc_kvm) *self, u64 gpa) 227 + { 228 + return (void *)(self->base_hva - self->base_gpa + gpa); 229 + } 230 + 231 + /* map / make additional memory available */ 232 + static int uc_map_ext(FIXTURE_DATA(uc_kvm) *self, u64 vcpu_addr, u64 length) 233 + { 234 + struct kvm_s390_ucas_mapping map = { 235 + .user_addr = (u64)gpa2hva(self, vcpu_addr), 236 + .vcpu_addr = vcpu_addr, 237 + .length = length, 238 + }; 239 + pr_info("ucas map %p %p 0x%llx", 240 + (void *)map.user_addr, (void *)map.vcpu_addr, map.length); 241 + return ioctl(self->vcpu_fd, KVM_S390_UCAS_MAP, &map); 242 + } 243 + 244 + /* unmap previously mapped memory */ 245 + static int uc_unmap_ext(FIXTURE_DATA(uc_kvm) *self, u64 vcpu_addr, u64 length) 246 + { 247 + struct kvm_s390_ucas_mapping map = { 248 + .user_addr = (u64)gpa2hva(self, vcpu_addr), 249 + .vcpu_addr = vcpu_addr, 250 + .length = length, 251 + }; 252 + pr_info("ucas unmap %p %p 0x%llx", 253 + (void *)map.user_addr, (void *)map.vcpu_addr, map.length); 254 + return ioctl(self->vcpu_fd, KVM_S390_UCAS_UNMAP, &map); 255 + } 256 + 257 + /* handle ucontrol exit by mapping the accessed segment */ 258 + static void uc_handle_exit_ucontrol(FIXTURE_DATA(uc_kvm) *self) 259 + { 260 + struct kvm_run *run = self->run; 261 + u64 seg_addr; 262 + int rc; 263 + 264 + TEST_ASSERT_EQ(KVM_EXIT_S390_UCONTROL, run->exit_reason); 265 + switch (run->s390_ucontrol.pgm_code) { 266 + case PGM_SEGMENT_TRANSLATION: 267 + seg_addr = run->s390_ucontrol.trans_exc_code & ~(SZ_1M - 1); 268 + pr_info("ucontrol pic segment translation 0x%llx, mapping segment 0x%lx\n", 269 + run->s390_ucontrol.trans_exc_code, seg_addr); 270 + /* map / make additional memory available */ 271 + rc = uc_map_ext(self, seg_addr, SZ_1M); 272 + TEST_ASSERT_EQ(0, rc); 273 + break; 274 + default: 275 + TEST_FAIL("UNEXPECTED PGM CODE %d", run->s390_ucontrol.pgm_code); 276 + } 277 + } 278 + 
279 + /* 280 + * Handle the SIEIC exit 275 281 * * fail on codes not expected in the test cases 282 + * Returns if interception is handled / execution can be continued 276 283 */ 277 - static bool uc_handle_sieic(FIXTURE_DATA(uc_kvm) * self) 284 + static void uc_skey_enable(FIXTURE_DATA(uc_kvm) *self) 285 + { 286 + struct kvm_s390_sie_block *sie_block = self->sie_block; 287 + 288 + /* disable KSS */ 289 + sie_block->cpuflags &= ~CPUSTAT_KSS; 290 + /* disable skey inst interception */ 291 + sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE); 292 + } 293 + 294 + /* 295 + * Handle the instruction intercept 296 + * Returns if interception is handled / execution can be continued 297 + */ 298 + static bool uc_handle_insn_ic(FIXTURE_DATA(uc_kvm) *self) 299 + { 300 + struct kvm_s390_sie_block *sie_block = self->sie_block; 301 + int ilen = insn_length(sie_block->ipa >> 8); 302 + struct kvm_run *run = self->run; 303 + 304 + switch (run->s390_sieic.ipa) { 305 + case 0xB229: /* ISKE */ 306 + case 0xB22b: /* SSKE */ 307 + case 0xB22a: /* RRBE */ 308 + uc_skey_enable(self); 309 + 310 + /* rewind to reexecute intercepted instruction */ 311 + run->psw_addr = run->psw_addr - ilen; 312 + pr_info("rewind guest addr to 0x%.16llx\n", run->psw_addr); 313 + return true; 314 + default: 315 + return false; 316 + } 317 + } 318 + 319 + /* 320 + * Handle the SIEIC exit 321 + * * fail on codes not expected in the test cases 322 + * Returns if interception is handled / execution can be continued 323 + */ 324 + static bool uc_handle_sieic(FIXTURE_DATA(uc_kvm) *self) 278 325 { 279 326 struct kvm_s390_sie_block *sie_block = self->sie_block; 280 327 struct kvm_run *run = self->run; 281 328 282 329 /* check SIE interception code */ 283 - pr_info("sieic: 0x%.2x 0x%.4x 0x%.4x\n", 330 + pr_info("sieic: 0x%.2x 0x%.4x 0x%.8x\n", 284 331 run->s390_sieic.icptcode, 285 332 run->s390_sieic.ipa, 286 333 run->s390_sieic.ipb); ··· 384 239 case ICPT_INST: 385 240 /* end execution in caller on intercepted 
instruction */ 386 241 pr_info("sie instruction interception\n"); 387 - return false; 242 + return uc_handle_insn_ic(self); 243 + case ICPT_KSS: 244 + uc_skey_enable(self); 245 + return true; 388 246 case ICPT_OPEREXC: 389 247 /* operation exception */ 390 248 TEST_FAIL("sie exception on %.4x%.8x", sie_block->ipa, sie_block->ipb); ··· 398 250 } 399 251 400 252 /* verify VM state on exit */ 401 - static bool uc_handle_exit(FIXTURE_DATA(uc_kvm) * self) 253 + static bool uc_handle_exit(FIXTURE_DATA(uc_kvm) *self) 402 254 { 403 255 struct kvm_run *run = self->run; 404 256 405 257 switch (run->exit_reason) { 258 + case KVM_EXIT_S390_UCONTROL: 259 + /** check program interruption code 260 + * handle page fault --> ucas map 261 + */ 262 + uc_handle_exit_ucontrol(self); 263 + break; 406 264 case KVM_EXIT_S390_SIEIC: 407 265 return uc_handle_sieic(self); 408 266 default: ··· 418 264 } 419 265 420 266 /* run the VM until interrupted */ 421 - static int uc_run_once(FIXTURE_DATA(uc_kvm) * self) 267 + static int uc_run_once(FIXTURE_DATA(uc_kvm) *self) 422 268 { 423 269 int rc; 424 270 ··· 429 275 return rc; 430 276 } 431 277 432 - static void uc_assert_diag44(FIXTURE_DATA(uc_kvm) * self) 278 + static void uc_assert_diag44(FIXTURE_DATA(uc_kvm) *self) 433 279 { 434 280 struct kvm_s390_sie_block *sie_block = self->sie_block; 435 281 ··· 438 284 TEST_ASSERT_EQ(ICPT_INST, sie_block->icptcode); 439 285 TEST_ASSERT_EQ(0x8300, sie_block->ipa); 440 286 TEST_ASSERT_EQ(0x440000, sie_block->ipb); 287 + } 288 + 289 + TEST_F(uc_kvm, uc_no_user_region) 290 + { 291 + struct kvm_userspace_memory_region region = { 292 + .slot = 1, 293 + .guest_phys_addr = self->code_gpa, 294 + .memory_size = VM_MEM_EXT_SIZE, 295 + .userspace_addr = (uintptr_t)self->code_hva, 296 + }; 297 + struct kvm_userspace_memory_region2 region2 = { 298 + .slot = 1, 299 + .guest_phys_addr = self->code_gpa, 300 + .memory_size = VM_MEM_EXT_SIZE, 301 + .userspace_addr = (uintptr_t)self->code_hva, 302 + }; 303 + 304 + 
ASSERT_EQ(-1, ioctl(self->vm_fd, KVM_SET_USER_MEMORY_REGION, &region)); 305 + ASSERT_EQ(EINVAL, errno); 306 + 307 + ASSERT_EQ(-1, ioctl(self->vm_fd, KVM_SET_USER_MEMORY_REGION2, &region2)); 308 + ASSERT_EQ(EINVAL, errno); 309 + } 310 + 311 + TEST_F(uc_kvm, uc_map_unmap) 312 + { 313 + struct kvm_sync_regs *sync_regs = &self->run->s.regs; 314 + struct kvm_run *run = self->run; 315 + const u64 disp = 1; 316 + int rc; 317 + 318 + /* copy test_mem_asm to code_hva / code_gpa */ 319 + TH_LOG("copy code %p to vm mapped memory %p / %p", 320 + &test_mem_asm, (void *)self->code_hva, (void *)self->code_gpa); 321 + memcpy((void *)self->code_hva, &test_mem_asm, PAGE_SIZE); 322 + 323 + /* DAT disabled + 64 bit mode */ 324 + run->psw_mask = 0x0000000180000000ULL; 325 + run->psw_addr = self->code_gpa; 326 + 327 + /* set register content for test_mem_asm to access not mapped memory*/ 328 + sync_regs->gprs[1] = 0x55; 329 + sync_regs->gprs[5] = self->base_gpa; 330 + sync_regs->gprs[6] = VM_MEM_SIZE + disp; 331 + run->kvm_dirty_regs |= KVM_SYNC_GPRS; 332 + 333 + /* run and expect to fail with ucontrol pic segment translation */ 334 + ASSERT_EQ(0, uc_run_once(self)); 335 + ASSERT_EQ(1, sync_regs->gprs[0]); 336 + ASSERT_EQ(KVM_EXIT_S390_UCONTROL, run->exit_reason); 337 + 338 + ASSERT_EQ(PGM_SEGMENT_TRANSLATION, run->s390_ucontrol.pgm_code); 339 + ASSERT_EQ(self->base_gpa + VM_MEM_SIZE, run->s390_ucontrol.trans_exc_code); 340 + 341 + /* fail to map memory with not segment aligned address */ 342 + rc = uc_map_ext(self, self->base_gpa + VM_MEM_SIZE + disp, VM_MEM_EXT_SIZE); 343 + ASSERT_GT(0, rc) 344 + TH_LOG("ucas map for non segment address should fail but didn't; " 345 + "result %d not expected, %s", rc, strerror(errno)); 346 + 347 + /* map / make additional memory available */ 348 + rc = uc_map_ext(self, self->base_gpa + VM_MEM_SIZE, VM_MEM_EXT_SIZE); 349 + ASSERT_EQ(0, rc) 350 + TH_LOG("ucas map result %d not expected, %s", rc, strerror(errno)); 351 + ASSERT_EQ(0, uc_run_once(self)); 
352 + ASSERT_EQ(false, uc_handle_exit(self)); 353 + uc_assert_diag44(self); 354 + 355 + /* assert registers and memory are in expected state */ 356 + ASSERT_EQ(2, sync_regs->gprs[0]); 357 + ASSERT_EQ(0x55, sync_regs->gprs[1]); 358 + ASSERT_EQ(0x55, *(u32 *)gpa2hva(self, self->base_gpa + VM_MEM_SIZE + disp)); 359 + 360 + /* unmap and run loop again */ 361 + rc = uc_unmap_ext(self, self->base_gpa + VM_MEM_SIZE, VM_MEM_EXT_SIZE); 362 + ASSERT_EQ(0, rc) 363 + TH_LOG("ucas unmap result %d not expected, %s", rc, strerror(errno)); 364 + ASSERT_EQ(0, uc_run_once(self)); 365 + ASSERT_EQ(3, sync_regs->gprs[0]); 366 + ASSERT_EQ(KVM_EXIT_S390_UCONTROL, run->exit_reason); 367 + ASSERT_EQ(PGM_SEGMENT_TRANSLATION, run->s390_ucontrol.pgm_code); 368 + /* handle ucontrol exit and remap memory after previous map and unmap */ 369 + ASSERT_EQ(true, uc_handle_exit(self)); 441 370 } 442 371 443 372 TEST_F(uc_kvm, uc_gprs) ··· 564 327 ASSERT_EQ(0, ioctl(self->vcpu_fd, KVM_GET_REGS, &regs)); 565 328 ASSERT_EQ(1, regs.gprs[0]); 566 329 ASSERT_EQ(1, sync_regs->gprs[0]); 330 + } 331 + 332 + TEST_F(uc_kvm, uc_skey) 333 + { 334 + struct kvm_s390_sie_block *sie_block = self->sie_block; 335 + struct kvm_sync_regs *sync_regs = &self->run->s.regs; 336 + u64 test_vaddr = VM_MEM_SIZE - (SZ_1M / 2); 337 + struct kvm_run *run = self->run; 338 + const u8 skeyvalue = 0x34; 339 + 340 + /* copy test_skey_asm to code_hva / code_gpa */ 341 + TH_LOG("copy code %p to vm mapped memory %p / %p", 342 + &test_skey_asm, (void *)self->code_hva, (void *)self->code_gpa); 343 + memcpy((void *)self->code_hva, &test_skey_asm, PAGE_SIZE); 344 + 345 + /* set register content for test_skey_asm to access not mapped memory */ 346 + sync_regs->gprs[1] = skeyvalue; 347 + sync_regs->gprs[5] = self->base_gpa; 348 + sync_regs->gprs[6] = test_vaddr; 349 + run->kvm_dirty_regs |= KVM_SYNC_GPRS; 350 + 351 + /* DAT disabled + 64 bit mode */ 352 + run->psw_mask = 0x0000000180000000ULL; 353 + run->psw_addr = self->code_gpa; 354 + 355 + 
ASSERT_EQ(0, uc_run_once(self)); 356 + ASSERT_EQ(true, uc_handle_exit(self)); 357 + ASSERT_EQ(1, sync_regs->gprs[0]); 358 + 359 + /* ISKE */ 360 + ASSERT_EQ(0, uc_run_once(self)); 361 + 362 + /* 363 + * Bail out and skip the test after uc_skey_enable was executed but iske 364 + * is still intercepted. Instructions are not handled by the kernel. 365 + * Thus there is no need to test this here. 366 + */ 367 + TEST_ASSERT_EQ(0, sie_block->cpuflags & CPUSTAT_KSS); 368 + TEST_ASSERT_EQ(0, sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)); 369 + TEST_ASSERT_EQ(KVM_EXIT_S390_SIEIC, self->run->exit_reason); 370 + TEST_ASSERT_EQ(ICPT_INST, sie_block->icptcode); 371 + TEST_REQUIRE(sie_block->ipa != 0xb229); 372 + 373 + /* ISKE contd. */ 374 + ASSERT_EQ(false, uc_handle_exit(self)); 375 + ASSERT_EQ(2, sync_regs->gprs[0]); 376 + /* assert initial skey (ACC = 0, R & C = 1) */ 377 + ASSERT_EQ(0x06, sync_regs->gprs[1]); 378 + uc_assert_diag44(self); 379 + 380 + /* SSKE + ISKE */ 381 + sync_regs->gprs[1] = skeyvalue; 382 + run->kvm_dirty_regs |= KVM_SYNC_GPRS; 383 + ASSERT_EQ(0, uc_run_once(self)); 384 + ASSERT_EQ(false, uc_handle_exit(self)); 385 + ASSERT_EQ(3, sync_regs->gprs[0]); 386 + ASSERT_EQ(skeyvalue, sync_regs->gprs[1]); 387 + uc_assert_diag44(self); 388 + 389 + /* RRBE + ISKE */ 390 + sync_regs->gprs[1] = skeyvalue; 391 + run->kvm_dirty_regs |= KVM_SYNC_GPRS; 392 + ASSERT_EQ(0, uc_run_once(self)); 393 + ASSERT_EQ(false, uc_handle_exit(self)); 394 + ASSERT_EQ(4, sync_regs->gprs[0]); 395 + /* assert R reset but rest of skey unchanged */ 396 + ASSERT_EQ(skeyvalue & 0xfa, sync_regs->gprs[1]); 397 + ASSERT_EQ(0, sync_regs->gprs[1] & 0x04); 398 + uc_assert_diag44(self); 567 399 } 568 400 569 401 TEST_HARNESS_MAIN
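
The storage-key assertions in uc_skey above (0x06 for the initial key, `skeyvalue & 0xfa` after RRBE) follow from the s390 storage key bit layout. The following is a minimal standalone sketch of that arithmetic, not part of the selftest: the `SKEY_*` names and the helper `skey_expected_after_rrbe()` are hypothetical and introduced only for illustration.

```c
#include <stdint.h>

/* s390 storage key bit layout (names are illustrative, not from the selftest). */
#define SKEY_ACC	0xf0	/* access-control bits */
#define SKEY_F		0x08	/* fetch-protection bit */
#define SKEY_R		0x04	/* reference bit */
#define SKEY_C		0x02	/* change bit */

/*
 * Hypothetical helper: the key value the test expects ISKE to return after
 * RRBE has run. It applies the same 0xfa mask as the test, which clears the
 * reference bit (and the unused low bit) and leaves everything else alone.
 */
static uint8_t skey_expected_after_rrbe(uint8_t skey)
{
	return skey & 0xfa;
}
```

With the test's `skeyvalue` of 0x34, the helper yields a key whose reference bit is clear while the access-control bits are preserved, which is what the two RRBE assertions check.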

+4 -19

tools/testing/selftests/kvm/x86_64/amx_test.c

··· 86 86 87 87 static void check_xtile_info(void) 88 88 { 89 + GUEST_ASSERT((xgetbv(0) & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE); 90 + 89 91 GUEST_ASSERT(this_cpu_has_p(X86_PROPERTY_XSTATE_MAX_SIZE_XCR0)); 90 92 GUEST_ASSERT(this_cpu_property(X86_PROPERTY_XSTATE_MAX_SIZE_XCR0) <= XSAVE_SIZE); 91 93 ··· 124 122 } 125 123 } 126 124 127 - static void init_regs(void) 128 - { 129 - uint64_t cr4, xcr0; 130 - 131 - GUEST_ASSERT(this_cpu_has(X86_FEATURE_XSAVE)); 132 - 133 - /* turn on CR4.OSXSAVE */ 134 - cr4 = get_cr4(); 135 - cr4 |= X86_CR4_OSXSAVE; 136 - set_cr4(cr4); 137 - GUEST_ASSERT(this_cpu_has(X86_FEATURE_OSXSAVE)); 138 - 139 - xcr0 = xgetbv(0); 140 - xcr0 |= XFEATURE_MASK_XTILE; 141 - xsetbv(0x0, xcr0); 142 - GUEST_ASSERT((xgetbv(0) & XFEATURE_MASK_XTILE) == XFEATURE_MASK_XTILE); 143 - } 144 - 145 125 static void __attribute__((__flatten__)) guest_code(struct tile_config *amx_cfg, 146 126 struct tile_data *tiledata, 147 127 struct xstate *xstate) 148 128 { 149 - init_regs(); 129 + GUEST_ASSERT(this_cpu_has(X86_FEATURE_XSAVE) && 130 + this_cpu_has(X86_FEATURE_OSXSAVE)); 150 131 check_xtile_info(); 151 132 GUEST_SYNC(1); 152 133

+42 -25

tools/testing/selftests/kvm/x86_64/cpuid_test.c

··· 12 12 #include "kvm_util.h" 13 13 #include "processor.h" 14 14 15 - /* CPUIDs known to differ */ 16 - struct { 17 - u32 function; 18 - u32 index; 19 - } mangled_cpuids[] = { 20 - /* 21 - * These entries depend on the vCPU's XCR0 register and IA32_XSS MSR, 22 - * which are not controlled for by this test. 23 - */ 24 - {.function = 0xd, .index = 0}, 25 - {.function = 0xd, .index = 1}, 15 + struct cpuid_mask { 16 + union { 17 + struct { 18 + u32 eax; 19 + u32 ebx; 20 + u32 ecx; 21 + u32 edx; 22 + }; 23 + u32 regs[4]; 24 + }; 26 25 }; 27 26 28 27 static void test_guest_cpuids(struct kvm_cpuid2 *guest_cpuid) ··· 55 56 GUEST_DONE(); 56 57 } 57 58 58 - static bool is_cpuid_mangled(const struct kvm_cpuid_entry2 *entrie) 59 + static struct cpuid_mask get_const_cpuid_mask(const struct kvm_cpuid_entry2 *entry) 59 60 { 60 - int i; 61 + struct cpuid_mask mask; 61 62 62 - for (i = 0; i < ARRAY_SIZE(mangled_cpuids); i++) { 63 - if (mangled_cpuids[i].function == entrie->function && 64 - mangled_cpuids[i].index == entrie->index) 65 - return true; 63 + memset(&mask, 0xff, sizeof(mask)); 64 + 65 + switch (entry->function) { 66 + case 0x1: 67 + mask.regs[X86_FEATURE_OSXSAVE.reg] &= ~BIT(X86_FEATURE_OSXSAVE.bit); 68 + break; 69 + case 0x7: 70 + mask.regs[X86_FEATURE_OSPKE.reg] &= ~BIT(X86_FEATURE_OSPKE.bit); 71 + break; 72 + case 0xd: 73 + /* 74 + * CPUID.0xD.{0,1}.EBX enumerate XSAVE size based on the current 75 + * XCR0 and IA32_XSS MSR values. 76 + */ 77 + if (entry->index < 2) 78 + mask.ebx = 0; 79 + break; 66 80 } 67 - 68 - return false; 81 + return mask; 69 82 } 70 83 71 84 static void compare_cpuids(const struct kvm_cpuid2 *cpuid1, ··· 90 79 "CPUID nent mismatch: %d vs. 
%d", cpuid1->nent, cpuid2->nent); 91 80 92 81 for (i = 0; i < cpuid1->nent; i++) { 82 + struct cpuid_mask mask; 83 + 93 84 e1 = &cpuid1->entries[i]; 94 85 e2 = &cpuid2->entries[i]; 95 86 ··· 101 88 i, e1->function, e1->index, e1->flags, 102 89 e2->function, e2->index, e2->flags); 103 90 104 - if (is_cpuid_mangled(e1)) 105 - continue; 91 + /* Mask off dynamic bits, e.g. OSXSAVE, when comparing entries. */ 92 + mask = get_const_cpuid_mask(e1); 106 93 107 - TEST_ASSERT(e1->eax == e2->eax && e1->ebx == e2->ebx && 108 - e1->ecx == e2->ecx && e1->edx == e2->edx, 94 + TEST_ASSERT((e1->eax & mask.eax) == (e2->eax & mask.eax) && 95 + (e1->ebx & mask.ebx) == (e2->ebx & mask.ebx) && 96 + (e1->ecx & mask.ecx) == (e2->ecx & mask.ecx) && 97 + (e1->edx & mask.edx) == (e2->edx & mask.edx), 109 98 "CPUID 0x%x.%x differ: 0x%x:0x%x:0x%x:0x%x vs 0x%x:0x%x:0x%x:0x%x", 110 99 e1->function, e1->index, 111 - e1->eax, e1->ebx, e1->ecx, e1->edx, 112 - e2->eax, e2->ebx, e2->ecx, e2->edx); 100 + e1->eax & mask.eax, e1->ebx & mask.ebx, 101 + e1->ecx & mask.ecx, e1->edx & mask.edx, 102 + e2->eax & mask.eax, e2->ebx & mask.ebx, 103 + e2->ecx & mask.ecx, e2->edx & mask.edx); 113 104 } 114 105 } 115 106
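
The masked comparison introduced in compare_cpuids() can be tried outside the selftest framework. The following is a minimal sketch under stated assumptions, not the selftest code: `struct cpuid_regs`, `const_mask_for_leaf()`, and `cpuid_regs_equal_masked()` are hypothetical names, and the OSXSAVE position (CPUID.1:ECX bit 27) is hard-coded here, whereas the selftest derives it from `X86_FEATURE_OSXSAVE`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the four CPUID output registers. */
struct cpuid_regs {
	uint32_t regs[4];	/* eax, ebx, ecx, edx */
};

enum { REG_EAX, REG_EBX, REG_ECX, REG_EDX };

/*
 * Start from "compare every bit", then clear the bits that may legitimately
 * differ between two reads of the same leaf (only OSXSAVE in this sketch).
 */
static struct cpuid_regs const_mask_for_leaf(uint32_t function)
{
	struct cpuid_regs mask;

	memset(&mask, 0xff, sizeof(mask));
	if (function == 0x1)
		mask.regs[REG_ECX] &= ~(UINT32_C(1) << 27);	/* OSXSAVE */
	return mask;
}

/* Two entries match if they agree on every bit the mask keeps. */
static bool cpuid_regs_equal_masked(const struct cpuid_regs *a,
				    const struct cpuid_regs *b,
				    const struct cpuid_regs *mask)
{
	for (int i = 0; i < 4; i++) {
		if ((a->regs[i] & mask->regs[i]) != (b->regs[i] & mask->regs[i]))
			return false;
	}
	return true;
}
```

Compared with skipping whole leaves, as the removed is_cpuid_mangled() did, a per-bit mask keeps the remaining bits of leaves 0x1, 0x7, and 0xd under test.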

+34 -19

tools/testing/selftests/kvm/x86_64/cr4_cpuid_sync_test.c

@@ -19,30 +19,42 @@
 #include "kvm_util.h"
 #include "processor.h"
 
-static inline bool cr4_cpuid_is_sync(void)
-{
-	uint64_t cr4 = get_cr4();
-
-	return (this_cpu_has(X86_FEATURE_OSXSAVE) == !!(cr4 & X86_CR4_OSXSAVE));
-}
+#define MAGIC_HYPERCALL_PORT 0x80
 
 static void guest_code(void)
 {
-	uint64_t cr4;
+	u32 regs[4] = {
+		[KVM_CPUID_EAX] = X86_FEATURE_OSXSAVE.function,
+		[KVM_CPUID_ECX] = X86_FEATURE_OSXSAVE.index,
+	};
 
-	/* turn on CR4.OSXSAVE */
-	cr4 = get_cr4();
-	cr4 |= X86_CR4_OSXSAVE;
-	set_cr4(cr4);
+	/* CR4.OSXSAVE should be enabled by default (for selftests vCPUs). */
+	GUEST_ASSERT(get_cr4() & X86_CR4_OSXSAVE);
 
 	/* verify CR4.OSXSAVE == CPUID.OSXSAVE */
-	GUEST_ASSERT(cr4_cpuid_is_sync());
+	GUEST_ASSERT(this_cpu_has(X86_FEATURE_OSXSAVE));
 
-	/* notify hypervisor to change CR4 */
-	GUEST_SYNC(0);
+	/*
+	 * Notify hypervisor to clear CR4.OSXSAVE, do CPUID and save output,
+	 * and then restore CR4. Do this all in assembly to ensure no AVX
+	 * instructions are executed while OSXSAVE=0.
+	 */
+	asm volatile (
+		"out %%al, $" __stringify(MAGIC_HYPERCALL_PORT) "\n\t"
+		"cpuid\n\t"
+		"mov %%rdi, %%cr4\n\t"
+		: "+a" (regs[KVM_CPUID_EAX]),
+		  "=b" (regs[KVM_CPUID_EBX]),
+		  "+c" (regs[KVM_CPUID_ECX]),
+		  "=d" (regs[KVM_CPUID_EDX])
+		: "D" (get_cr4())
+	);
 
-	/* check again */
-	GUEST_ASSERT(cr4_cpuid_is_sync());
+	/* Verify KVM cleared OSXSAVE in CPUID when it was cleared in CR4. */
+	GUEST_ASSERT(!(regs[X86_FEATURE_OSXSAVE.reg] & BIT(X86_FEATURE_OSXSAVE.bit)));
+
+	/* Verify restoring CR4 also restored OSXSAVE in CPUID. */
+	GUEST_ASSERT(this_cpu_has(X86_FEATURE_OSXSAVE));
 
 	GUEST_DONE();
 }
@@ -62,13 +74,16 @@
 		vcpu_run(vcpu);
 		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
 
-		switch (get_ucall(vcpu, &uc)) {
-		case UCALL_SYNC:
+		if (vcpu->run->io.port == MAGIC_HYPERCALL_PORT &&
+		    vcpu->run->io.direction == KVM_EXIT_IO_OUT) {
 			/* emulate hypervisor clearing CR4.OSXSAVE */
 			vcpu_sregs_get(vcpu, &sregs);
 			sregs.cr4 &= ~X86_CR4_OSXSAVE;
 			vcpu_sregs_set(vcpu, &sregs);
-			break;
+			continue;
+		}
+
+		switch (get_ucall(vcpu, &uc)) {
 		case UCALL_ABORT:
 			REPORT_GUEST_ASSERT(uc);
 			break;

+1 -1

tools/testing/selftests/kvm/x86_64/debug_regs.c

@@ -166,7 +166,7 @@
 	/* Test single step */
 	target_rip = CAST_TO_RIP(ss_start);
 	target_dr6 = 0xffff4ff0ULL;
-	for (i = 0; i < (sizeof(ss_size) / sizeof(ss_size[0])); i++) {
+	for (i = 0; i < ARRAY_SIZE(ss_size); i++) {
 		target_rip += ss_size[i];
 		memset(&debug, 0, sizeof(debug));
 		debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP |
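
The change to debug_regs.c swaps an open-coded element count for `ARRAY_SIZE()`. As a minimal sketch of what the macro computes, assuming a simplified definition (the kernel's own version additionally rejects pointer arguments at build time) and a hypothetical `insn_len[]` table standing in for `ss_size[]`:

```c
#include <stddef.h>

/* Simplified form of the macro: element count of a true array. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical table of instruction lengths, playing the role of ss_size[]. */
static const unsigned char insn_len[] = { 1, 2, 3, 5 };

/* Walk the table the same way the single-step loop does. */
static unsigned long sum_insn_len(void)
{
	unsigned long total = 0;

	for (size_t i = 0; i < ARRAY_SIZE(insn_len); i++)
		total += insn_len[i];
	return total;
}
```

The loop bound stays correct if entries are added to or removed from the table, which is the same property the open-coded `sizeof` division had, only easier to read.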

+113

tools/testing/selftests/kvm/x86_64/feature_msrs_test.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2020, Red Hat, Inc. 4 + */ 5 + #include <fcntl.h> 6 + #include <stdio.h> 7 + #include <stdlib.h> 8 + #include <string.h> 9 + #include <sys/ioctl.h> 10 + 11 + #include "test_util.h" 12 + #include "kvm_util.h" 13 + #include "processor.h" 14 + 15 + static bool is_kvm_controlled_msr(uint32_t msr) 16 + { 17 + return msr == MSR_IA32_VMX_CR0_FIXED1 || msr == MSR_IA32_VMX_CR4_FIXED1; 18 + } 19 + 20 + /* 21 + * For VMX MSRs with a "true" variant, KVM requires userspace to set the "true" 22 + * MSR, and doesn't allow setting the hidden version. 23 + */ 24 + static bool is_hidden_vmx_msr(uint32_t msr) 25 + { 26 + switch (msr) { 27 + case MSR_IA32_VMX_PINBASED_CTLS: 28 + case MSR_IA32_VMX_PROCBASED_CTLS: 29 + case MSR_IA32_VMX_EXIT_CTLS: 30 + case MSR_IA32_VMX_ENTRY_CTLS: 31 + return true; 32 + default: 33 + return false; 34 + } 35 + } 36 + 37 + static bool is_quirked_msr(uint32_t msr) 38 + { 39 + return msr != MSR_AMD64_DE_CFG; 40 + } 41 + 42 + static void test_feature_msr(uint32_t msr) 43 + { 44 + const uint64_t supported_mask = kvm_get_feature_msr(msr); 45 + uint64_t reset_value = is_quirked_msr(msr) ? supported_mask : 0; 46 + struct kvm_vcpu *vcpu; 47 + struct kvm_vm *vm; 48 + 49 + /* 50 + * Don't bother testing KVM-controlled MSRs beyond verifying that the 51 + * MSR can be read from userspace. Any value is effectively legal, as 52 + * KVM is bound by x86 architecture, not by ABI. 53 + */ 54 + if (is_kvm_controlled_msr(msr)) 55 + return; 56 + 57 + /* 58 + * More goofy behavior. KVM reports the host CPU's actual revision ID, 59 + * but initializes the vCPU's revision ID to an arbitrary value. 60 + */ 61 + if (msr == MSR_IA32_UCODE_REV) 62 + reset_value = host_cpu_is_intel ? 0x100000000ULL : 0x01000065; 63 + 64 + /* 65 + * For quirked MSRs, KVM's ABI is to initialize the vCPU's value to the 66 + * full set of features supported by KVM. 
For non-quirked MSRs, and 67 + * when the quirk is disabled, KVM must zero-initialize the MSR and let 68 + * userspace do the configuration. 69 + */ 70 + vm = vm_create_with_one_vcpu(&vcpu, NULL); 71 + TEST_ASSERT(vcpu_get_msr(vcpu, msr) == reset_value, 72 + "Wanted 0x%lx for %squirked MSR 0x%x, got 0x%lx", 73 + reset_value, is_quirked_msr(msr) ? "" : "non-", msr, 74 + vcpu_get_msr(vcpu, msr)); 75 + if (!is_hidden_vmx_msr(msr)) 76 + vcpu_set_msr(vcpu, msr, supported_mask); 77 + kvm_vm_free(vm); 78 + 79 + if (is_hidden_vmx_msr(msr)) 80 + return; 81 + 82 + if (!kvm_has_cap(KVM_CAP_DISABLE_QUIRKS2) || 83 + !(kvm_check_cap(KVM_CAP_DISABLE_QUIRKS2) & KVM_X86_QUIRK_STUFF_FEATURE_MSRS)) 84 + return; 85 + 86 + vm = vm_create(1); 87 + vm_enable_cap(vm, KVM_CAP_DISABLE_QUIRKS2, KVM_X86_QUIRK_STUFF_FEATURE_MSRS); 88 + 89 + vcpu = vm_vcpu_add(vm, 0, NULL); 90 + TEST_ASSERT(!vcpu_get_msr(vcpu, msr), 91 + "Quirk disabled, wanted '0' for MSR 0x%x, got 0x%lx", 92 + msr, vcpu_get_msr(vcpu, msr)); 93 + kvm_vm_free(vm); 94 + } 95 + 96 + int main(int argc, char *argv[]) 97 + { 98 + const struct kvm_msr_list *feature_list; 99 + int i; 100 + 101 + /* 102 + * Skip the entire test if MSR_FEATURES isn't supported, other tests 103 + * will cover the "regular" list of MSRs, the coverage here is purely 104 + * opportunistic and not interesting on its own. 105 + */ 106 + TEST_REQUIRE(kvm_has_cap(KVM_CAP_GET_MSR_FEATURES)); 107 + 108 + (void)kvm_get_msr_index_list(); 109 + 110 + feature_list = kvm_get_feature_msr_index_list(); 111 + for (i = 0; i < feature_list->nmsrs; i++) 112 + test_feature_msr(feature_list->indices[i]); 113 + }

-35

tools/testing/selftests/kvm/x86_64/get_msr_index_features.c

··· 1 - // SPDX-License-Identifier: GPL-2.0 2 - /* 3 - * Test that KVM_GET_MSR_INDEX_LIST and 4 - * KVM_GET_MSR_FEATURE_INDEX_LIST work as intended 5 - * 6 - * Copyright (C) 2020, Red Hat, Inc. 7 - */ 8 - #include <fcntl.h> 9 - #include <stdio.h> 10 - #include <stdlib.h> 11 - #include <string.h> 12 - #include <sys/ioctl.h> 13 - 14 - #include "test_util.h" 15 - #include "kvm_util.h" 16 - #include "processor.h" 17 - 18 - int main(int argc, char *argv[]) 19 - { 20 - const struct kvm_msr_list *feature_list; 21 - int i; 22 - 23 - /* 24 - * Skip the entire test if MSR_FEATURES isn't supported, other tests 25 - * will cover the "regular" list of MSRs, the coverage here is purely 26 - * opportunistic and not interesting on its own. 27 - */ 28 - TEST_REQUIRE(kvm_has_cap(KVM_CAP_GET_MSR_FEATURES)); 29 - 30 - (void)kvm_get_msr_index_list(); 31 - 32 - feature_list = kvm_get_feature_msr_index_list(); 33 - for (i = 0; i < feature_list->nmsrs; i++) 34 - kvm_get_feature_msr(feature_list->indices[i]); 35 - }

-2

tools/testing/selftests/kvm/x86_64/platform_info_test.c

@@ -72,8 +72,6 @@
 	}
 
 done:
-	vcpu_set_msr(vcpu, MSR_PLATFORM_INFO, msr_platform_info);
-
 	kvm_vm_free(vm);
 
 	return 0;

+5 -14

tools/testing/selftests/kvm/x86_64/sev_smoke_test.c

··· 41 41 /* Stash state passed via VMSA before any compiled code runs. */ 42 42 extern void guest_code_xsave(void); 43 43 asm("guest_code_xsave:\n" 44 - "mov $-1, %eax\n" 45 - "mov $-1, %edx\n" 44 + "mov $" __stringify(XFEATURE_MASK_X87_AVX) ", %eax\n" 45 + "xor %edx, %edx\n" 46 46 "xsave (%rdi)\n" 47 47 "jmp guest_sev_es_code"); 48 48 ··· 70 70 71 71 double x87val = M_PI; 72 72 struct kvm_xsave __attribute__((aligned(64))) xsave = { 0 }; 73 - struct kvm_sregs sregs; 74 - struct kvm_xcrs xcrs = { 75 - .nr_xcrs = 1, 76 - .xcrs[0].xcr = 0, 77 - .xcrs[0].value = XFEATURE_MASK_X87_AVX, 78 - }; 79 73 80 74 vm = vm_sev_create_with_one_vcpu(KVM_X86_SEV_ES_VM, guest_code_xsave, &vcpu); 81 75 gva = vm_vaddr_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR, ··· 78 84 79 85 vcpu_args_set(vcpu, 1, gva); 80 86 81 - vcpu_sregs_get(vcpu, &sregs); 82 - sregs.cr4 |= X86_CR4_OSFXSR | X86_CR4_OSXSAVE; 83 - vcpu_sregs_set(vcpu, &sregs); 84 - 85 - vcpu_xcrs_set(vcpu, &xcrs); 86 87 asm("fninit\n" 87 88 "vpcmpeqb %%ymm4, %%ymm4, %%ymm4\n" 88 89 "fldl %3\n" ··· 181 192 182 193 int main(int argc, char *argv[]) 183 194 { 195 + const u64 xf_mask = XFEATURE_MASK_X87_AVX; 196 + 184 197 TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV)); 185 198 186 199 test_sev(guest_sev_code, SEV_POLICY_NO_DBG); ··· 195 204 test_sev_es_shutdown(); 196 205 197 206 if (kvm_has_cap(KVM_CAP_XCRS) && 198 - (xgetbv(0) & XFEATURE_MASK_X87_AVX) == XFEATURE_MASK_X87_AVX) { 207 + (xgetbv(0) & kvm_cpu_supported_xcr0() & xf_mask) == xf_mask) { 199 208 test_sync_vmsa(0); 200 209 test_sync_vmsa(SEV_POLICY_NO_DBG); 201 210 }

-5

tools/testing/selftests/kvm/x86_64/state_test.c

@@ -145,11 +145,6 @@
 
 	memset(buffer, 0xcc, sizeof(buffer));
 
-	set_cr4(get_cr4() | X86_CR4_OSXSAVE);
-	GUEST_ASSERT(this_cpu_has(X86_FEATURE_OSXSAVE));
-
-	xsetbv(0, xgetbv(0) | supported_xcr0);
-
 	/*
 	 * Modify state for all supported xfeatures to take them out of
 	 * their "init" state, i.e. to make them show up in XSTATE_BV.

+23

tools/testing/selftests/kvm/x86_64/vmx_pmu_caps_test.c

··· 207 207 TEST_ASSERT(!r, "Writing LBR_TOS should fail after disabling vPMU"); 208 208 } 209 209 210 + KVM_ONE_VCPU_TEST(vmx_pmu_caps, perf_capabilities_unsupported, guest_code) 211 + { 212 + uint64_t val; 213 + int i, r; 214 + 215 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, host_cap.capabilities); 216 + val = vcpu_get_msr(vcpu, MSR_IA32_PERF_CAPABILITIES); 217 + TEST_ASSERT_EQ(val, host_cap.capabilities); 218 + 219 + vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_PDCM); 220 + 221 + val = vcpu_get_msr(vcpu, MSR_IA32_PERF_CAPABILITIES); 222 + TEST_ASSERT_EQ(val, 0); 223 + 224 + vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, 0); 225 + 226 + for (i = 0; i < 64; i++) { 227 + r = _vcpu_set_msr(vcpu, MSR_IA32_PERF_CAPABILITIES, BIT_ULL(i)); 228 + TEST_ASSERT(!r, "Setting PERF_CAPABILITIES bit %d (= 0x%llx) should fail without PDCM", 229 + i, BIT_ULL(i)); 230 + } 231 + } 232 + 210 233 int main(int argc, char *argv[]) 211 234 { 212 235 TEST_REQUIRE(kvm_is_pmu_enabled());

+8 -3

tools/testing/selftests/kvm/x86_64/xcr0_cpuid_test.c

··· 48 48 49 49 static void guest_code(void) 50 50 { 51 - uint64_t xcr0_reset; 51 + uint64_t initial_xcr0; 52 52 uint64_t supported_xcr0; 53 53 int i, vector; 54 54 55 55 set_cr4(get_cr4() | X86_CR4_OSXSAVE); 56 56 57 - xcr0_reset = xgetbv(0); 57 + initial_xcr0 = xgetbv(0); 58 58 supported_xcr0 = this_cpu_supported_xcr0(); 59 59 60 - GUEST_ASSERT(xcr0_reset == XFEATURE_MASK_FP); 60 + GUEST_ASSERT(initial_xcr0 == supported_xcr0); 61 61 62 62 /* Check AVX */ 63 63 ASSERT_XFEATURE_DEPENDENCIES(supported_xcr0, ··· 78 78 /* Check AMX */ 79 79 ASSERT_ALL_OR_NONE_XFEATURE(supported_xcr0, 80 80 XFEATURE_MASK_XTILE); 81 + 82 + vector = xsetbv_safe(0, XFEATURE_MASK_FP); 83 + __GUEST_ASSERT(!vector, 84 + "Expected success on XSETBV(FP), got vector '0x%x'", 85 + vector); 81 86 82 87 vector = xsetbv_safe(0, supported_xcr0); 83 88 __GUEST_ASSERT(!vector,

+4

virt/kvm/Kconfig

@@ -100,6 +100,10 @@
        select MMU_NOTIFIER
        bool
 
+config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
+       depends on KVM_GENERIC_MMU_NOTIFIER
+       bool
+
 config KVM_GENERIC_MEMORY_ATTRIBUTES
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool

+19 -9

virt/kvm/guest_memfd.c

··· 302 302 return get_file_active(&slot->gmem.file); 303 303 } 304 304 305 + static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn) 306 + { 307 + return gfn - slot->base_gfn + slot->gmem.pgoff; 308 + } 309 + 305 310 static struct file_operations kvm_gmem_fops = { 306 311 .open = generic_file_open, 307 312 .release = kvm_gmem_release, ··· 556 551 } 557 552 558 553 /* Returns a locked folio on success. */ 559 - static struct folio * 560 - __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot, 561 - gfn_t gfn, kvm_pfn_t *pfn, bool *is_prepared, 562 - int *max_order) 554 + static struct folio *__kvm_gmem_get_pfn(struct file *file, 555 + struct kvm_memory_slot *slot, 556 + pgoff_t index, kvm_pfn_t *pfn, 557 + bool *is_prepared, int *max_order) 563 558 { 564 - pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff; 565 559 struct kvm_gmem *gmem = file->private_data; 566 560 struct folio *folio; 567 561 ··· 594 590 } 595 591 596 592 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, 597 - gfn_t gfn, kvm_pfn_t *pfn, int *max_order) 593 + gfn_t gfn, kvm_pfn_t *pfn, struct page **page, 594 + int *max_order) 598 595 { 596 + pgoff_t index = kvm_gmem_get_index(slot, gfn); 599 597 struct file *file = kvm_gmem_get_file(slot); 600 598 struct folio *folio; 601 599 bool is_prepared = false; ··· 606 600 if (!file) 607 601 return -EFAULT; 608 602 609 - folio = __kvm_gmem_get_pfn(file, slot, gfn, pfn, &is_prepared, max_order); 603 + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order); 610 604 if (IS_ERR(folio)) { 611 605 r = PTR_ERR(folio); 612 606 goto out; ··· 616 610 r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio); 617 611 618 612 folio_unlock(folio); 619 - if (r < 0) 613 + 614 + if (!r) 615 + *page = folio_file_page(folio, index); 616 + else 620 617 folio_put(folio); 621 618 622 619 out: ··· 657 648 for (i = 0; i < npages; i += (1 << max_order)) { 658 649 struct folio *folio; 659 650 gfn_t gfn = start_gfn 
+ i; 651 + pgoff_t index = kvm_gmem_get_index(slot, gfn); 660 652 bool is_prepared = false; 661 653 kvm_pfn_t pfn; 662 654 ··· 666 656 break; 667 657 } 668 658 669 - folio = __kvm_gmem_get_pfn(file, slot, gfn, &pfn, &is_prepared, &max_order); 659 + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order); 670 660 if (IS_ERR(folio)) { 671 661 ret = PTR_ERR(folio); 672 662 break;
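
The new kvm_gmem_get_index() helper factors out the gfn-to-file-offset arithmetic that was previously open-coded in __kvm_gmem_get_pfn(). A minimal userspace sketch of that calculation follows, assuming simplified stand-in types: `struct memslot_sketch` and `gmem_index()` are hypothetical names, and the typedefs only approximate the kernel's `gfn_t` and `pgoff_t`.

```c
#include <stdint.h>

/* Approximations of the kernel typedefs, for illustration only. */
typedef uint64_t gfn_t;
typedef unsigned long pgoff_t;

/* Hypothetical, pared-down view of the memslot fields the helper reads. */
struct memslot_sketch {
	gfn_t base_gfn;		/* first guest frame number covered by the slot */
	pgoff_t gmem_pgoff;	/* page offset of the slot within the guest_memfd file */
};

/* Page index into the guest_memfd file that backs the given gfn. */
static pgoff_t gmem_index(const struct memslot_sketch *slot, gfn_t gfn)
{
	return gfn - slot->base_gfn + slot->gmem_pgoff;
}
```

Computing the index once in the caller lets the same value feed both the folio lookup and the later folio_file_page() call, rather than deriving it in two places.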

+430 -668

virt/kvm/kvm_main.c

··· 95 95 EXPORT_SYMBOL_GPL(halt_poll_ns_shrink); 96 96 97 97 /* 98 + * Allow direct access (from KVM or the CPU) without MMU notifier protection 99 + * to unpinned pages. 100 + */ 101 + static bool allow_unsafe_mappings; 102 + module_param(allow_unsafe_mappings, bool, 0444); 103 + 104 + /* 98 105 * Ordering of locks: 99 106 * 100 107 * kvm->lock --> kvm->slots_lock --> kvm->irq_lock ··· 158 151 159 152 __weak void kvm_arch_guest_memory_reclaimed(struct kvm *kvm) 160 153 { 161 - } 162 - 163 - bool kvm_is_zone_device_page(struct page *page) 164 - { 165 - /* 166 - * The metadata used by is_zone_device_page() to determine whether or 167 - * not a page is ZONE_DEVICE is guaranteed to be valid if and only if 168 - * the device has been pinned, e.g. by get_user_pages(). WARN if the 169 - * page_count() is zero to help detect bad usage of this helper. 170 - */ 171 - if (WARN_ON_ONCE(!page_count(page))) 172 - return false; 173 - 174 - return is_zone_device_page(page); 175 - } 176 - 177 - /* 178 - * Returns a 'struct page' if the pfn is "valid" and backed by a refcounted 179 - * page, NULL otherwise. Note, the list of refcounted PG_reserved page types 180 - * is likely incomplete, it has been compiled purely through people wanting to 181 - * back guest with a certain type of memory and encountering issues. 182 - */ 183 - struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn) 184 - { 185 - struct page *page; 186 - 187 - if (!pfn_valid(pfn)) 188 - return NULL; 189 - 190 - page = pfn_to_page(pfn); 191 - if (!PageReserved(page)) 192 - return page; 193 - 194 - /* The ZERO_PAGE(s) is marked PG_reserved, but is refcounted. */ 195 - if (is_zero_pfn(pfn)) 196 - return page; 197 - 198 - /* 199 - * ZONE_DEVICE pages currently set PG_reserved, but from a refcounting 200 - * perspective they are "normal" pages, albeit with slightly different 201 - * usage rules. 
202 - */ 203 - if (kvm_is_zone_device_page(page)) 204 - return page; 205 - 206 - return NULL; 207 154 } 208 155 209 156 /* ··· 447 486 vcpu->kvm = kvm; 448 487 vcpu->vcpu_id = id; 449 488 vcpu->pid = NULL; 489 + rwlock_init(&vcpu->pid_lock); 450 490 #ifndef __KVM_HAVE_ARCH_WQP 451 491 rcuwait_init(&vcpu->wait); 452 492 #endif ··· 475 513 * the vcpu->pid pointer, and at destruction time all file descriptors 476 514 * are already gone. 477 515 */ 478 - put_pid(rcu_dereference_protected(vcpu->pid, 1)); 516 + put_pid(vcpu->pid); 479 517 480 518 free_page((unsigned long)vcpu->run); 481 519 kmem_cache_free(kvm_vcpu_cache, vcpu); ··· 631 669 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn, 632 670 unsigned long start, 633 671 unsigned long end, 634 - gfn_handler_t handler) 672 + gfn_handler_t handler, 673 + bool flush_on_ret) 635 674 { 636 675 struct kvm *kvm = mmu_notifier_to_kvm(mn); 637 676 const struct kvm_mmu_notifier_range range = { ··· 640 677 .end = end, 641 678 .handler = handler, 642 679 .on_lock = (void *)kvm_null_fn, 643 - .flush_on_ret = true, 680 + .flush_on_ret = flush_on_ret, 644 681 .may_block = false, 645 682 }; 646 683 ··· 652 689 unsigned long end, 653 690 gfn_handler_t handler) 654 691 { 655 - struct kvm *kvm = mmu_notifier_to_kvm(mn); 656 - const struct kvm_mmu_notifier_range range = { 657 - .start = start, 658 - .end = end, 659 - .handler = handler, 660 - .on_lock = (void *)kvm_null_fn, 661 - .flush_on_ret = false, 662 - .may_block = false, 663 - }; 664 - 665 - return __kvm_handle_hva_range(kvm, &range).ret; 692 + return kvm_handle_hva_range(mn, start, end, handler, false); 666 693 } 667 694 668 695 void kvm_mmu_invalidate_begin(struct kvm *kvm) ··· 817 864 { 818 865 trace_kvm_age_hva(start, end); 819 866 820 - return kvm_handle_hva_range(mn, start, end, kvm_age_gfn); 867 + return kvm_handle_hva_range(mn, start, end, kvm_age_gfn, 868 + !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG)); 821 869 } 822 870 823 871 static int 
kvm_mmu_notifier_clear_young(struct mmu_notifier *mn, ··· 2700 2746 return gfn_to_hva_memslot_prot(slot, gfn, writable); 2701 2747 } 2702 2748 2703 - static inline int check_user_page_hwpoison(unsigned long addr) 2704 - { 2705 - int rc, flags = FOLL_HWPOISON | FOLL_WRITE; 2706 - 2707 - rc = get_user_pages(addr, 1, flags, NULL); 2708 - return rc == -EHWPOISON; 2709 - } 2710 - 2711 - /* 2712 - * The fast path to get the writable pfn which will be stored in @pfn, 2713 - * true indicates success, otherwise false is returned. It's also the 2714 - * only part that runs if we can in atomic context. 2715 - */ 2716 - static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, 2717 - bool *writable, kvm_pfn_t *pfn) 2718 - { 2719 - struct page *page[1]; 2720 - 2721 - /* 2722 - * Fast pin a writable pfn only if it is a write fault request 2723 - * or the caller allows to map a writable pfn for a read fault 2724 - * request. 2725 - */ 2726 - if (!(write_fault || writable)) 2727 - return false; 2728 - 2729 - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { 2730 - *pfn = page_to_pfn(page[0]); 2731 - 2732 - if (writable) 2733 - *writable = true; 2734 - return true; 2735 - } 2736 - 2737 - return false; 2738 - } 2739 - 2740 - /* 2741 - * The slow path to get the pfn of the specified host virtual address, 2742 - * 1 indicates success, -errno is returned if error is detected. 2743 - */ 2744 - static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, 2745 - bool interruptible, bool *writable, kvm_pfn_t *pfn) 2746 - { 2747 - /* 2748 - * When a VCPU accesses a page that is not mapped into the secondary 2749 - * MMU, we lookup the page using GUP to map it, so the guest VCPU can 2750 - * make progress. We always want to honor NUMA hinting faults in that 2751 - * case, because GUP usage corresponds to memory accesses from the VCPU. 
2752 - * Otherwise, we'd not trigger NUMA hinting faults once a page is 2753 - * mapped into the secondary MMU and gets accessed by a VCPU. 2754 - * 2755 - * Note that get_user_page_fast_only() and FOLL_WRITE for now 2756 - * implicitly honor NUMA hinting faults and don't need this flag. 2757 - */ 2758 - unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT; 2759 - struct page *page; 2760 - int npages; 2761 - 2762 - might_sleep(); 2763 - 2764 - if (writable) 2765 - *writable = write_fault; 2766 - 2767 - if (write_fault) 2768 - flags |= FOLL_WRITE; 2769 - if (async) 2770 - flags |= FOLL_NOWAIT; 2771 - if (interruptible) 2772 - flags |= FOLL_INTERRUPTIBLE; 2773 - 2774 - npages = get_user_pages_unlocked(addr, 1, &page, flags); 2775 - if (npages != 1) 2776 - return npages; 2777 - 2778 - /* map read fault as writable if possible */ 2779 - if (unlikely(!write_fault) && writable) { 2780 - struct page *wpage; 2781 - 2782 - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) { 2783 - *writable = true; 2784 - put_page(page); 2785 - page = wpage; 2786 - } 2787 - } 2788 - *pfn = page_to_pfn(page); 2789 - return npages; 2790 - } 2791 - 2792 - static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault) 2793 - { 2794 - if (unlikely(!(vma->vm_flags & VM_READ))) 2795 - return false; 2796 - 2797 - if (write_fault && (unlikely(!(vma->vm_flags & VM_WRITE)))) 2798 - return false; 2799 - 2800 - return true; 2801 - } 2802 - 2803 - static int kvm_try_get_pfn(kvm_pfn_t pfn) 2804 - { 2805 - struct page *page = kvm_pfn_to_refcounted_page(pfn); 2806 - 2807 - if (!page) 2808 - return 1; 2809 - 2810 - return get_page_unless_zero(page); 2811 - } 2812 - 2813 - static int hva_to_pfn_remapped(struct vm_area_struct *vma, 2814 - unsigned long addr, bool write_fault, 2815 - bool *writable, kvm_pfn_t *p_pfn) 2816 - { 2817 - struct follow_pfnmap_args args = { .vma = vma, .address = addr }; 2818 - kvm_pfn_t pfn; 2819 - int r; 2820 - 2821 - r = follow_pfnmap_start(&args); 2822 - if (r) 
{ 2823 - /* 2824 - * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does 2825 - * not call the fault handler, so do it here. 2826 - */ 2827 - bool unlocked = false; 2828 - r = fixup_user_fault(current->mm, addr, 2829 - (write_fault ? FAULT_FLAG_WRITE : 0), 2830 - &unlocked); 2831 - if (unlocked) 2832 - return -EAGAIN; 2833 - if (r) 2834 - return r; 2835 - 2836 - r = follow_pfnmap_start(&args); 2837 - if (r) 2838 - return r; 2839 - } 2840 - 2841 - if (write_fault && !args.writable) { 2842 - pfn = KVM_PFN_ERR_RO_FAULT; 2843 - goto out; 2844 - } 2845 - 2846 - if (writable) 2847 - *writable = args.writable; 2848 - pfn = args.pfn; 2849 - 2850 - /* 2851 - * Get a reference here because callers of *hva_to_pfn* and 2852 - * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the 2853 - * returned pfn. This is only needed if the VMA has VM_MIXEDMAP 2854 - * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will 2855 - * simply do nothing for reserved pfns. 2856 - * 2857 - * Whoever called remap_pfn_range is also going to call e.g. 2858 - * unmap_mapping_range before the underlying pages are freed, 2859 - * causing a call to our MMU notifier. 2860 - * 2861 - * Certain IO or PFNMAP mappings can be backed with valid 2862 - * struct pages, but be allocated without refcounting e.g., 2863 - * tail pages of non-compound higher order allocations, which 2864 - * would then underflow the refcount when the caller does the 2865 - * required put_page. Don't allow those pages here. 2866 - */ 2867 - if (!kvm_try_get_pfn(pfn)) 2868 - r = -EFAULT; 2869 - out: 2870 - follow_pfnmap_end(&args); 2871 - *p_pfn = pfn; 2872 - 2873 - return r; 2874 - } 2875 - 2876 - /* 2877 - * Pin guest page in memory and return its pfn. 
2878 - * @addr: host virtual address which maps memory to the guest 2879 - * @atomic: whether this function is forbidden from sleeping 2880 - * @interruptible: whether the process can be interrupted by non-fatal signals 2881 - * @async: whether this function need to wait IO complete if the 2882 - * host page is not in the memory 2883 - * @write_fault: whether we should get a writable host page 2884 - * @writable: whether it allows to map a writable host page for !@write_fault 2885 - * 2886 - * The function will map a writable host page for these two cases: 2887 - * 1): @write_fault = true 2888 - * 2): @write_fault = false && @writable, @writable will tell the caller 2889 - * whether the mapping is writable. 2890 - */ 2891 - kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, 2892 - bool *async, bool write_fault, bool *writable) 2893 - { 2894 - struct vm_area_struct *vma; 2895 - kvm_pfn_t pfn; 2896 - int npages, r; 2897 - 2898 - /* we can do it either atomically or asynchronously, not both */ 2899 - BUG_ON(atomic && async); 2900 - 2901 - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn)) 2902 - return pfn; 2903 - 2904 - if (atomic) 2905 - return KVM_PFN_ERR_FAULT; 2906 - 2907 - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible, 2908 - writable, &pfn); 2909 - if (npages == 1) 2910 - return pfn; 2911 - if (npages == -EINTR) 2912 - return KVM_PFN_ERR_SIGPENDING; 2913 - 2914 - mmap_read_lock(current->mm); 2915 - if (npages == -EHWPOISON || 2916 - (!async && check_user_page_hwpoison(addr))) { 2917 - pfn = KVM_PFN_ERR_HWPOISON; 2918 - goto exit; 2919 - } 2920 - 2921 - retry: 2922 - vma = vma_lookup(current->mm, addr); 2923 - 2924 - if (vma == NULL) 2925 - pfn = KVM_PFN_ERR_FAULT; 2926 - else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) { 2927 - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn); 2928 - if (r == -EAGAIN) 2929 - goto retry; 2930 - if (r < 0) 2931 - pfn = KVM_PFN_ERR_FAULT; 2932 - } else { 2933 - if (async 
&& vma_is_valid(vma, write_fault)) 2934 - *async = true; 2935 - pfn = KVM_PFN_ERR_FAULT; 2936 - } 2937 - exit: 2938 - mmap_read_unlock(current->mm); 2939 - return pfn; 2940 - } 2941 - 2942 - kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, 2943 - bool atomic, bool interruptible, bool *async, 2944 - bool write_fault, bool *writable, hva_t *hva) 2945 - { 2946 - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault); 2947 - 2948 - if (hva) 2949 - *hva = addr; 2950 - 2951 - if (kvm_is_error_hva(addr)) { 2952 - if (writable) 2953 - *writable = false; 2954 - 2955 - return addr == KVM_HVA_ERR_RO_BAD ? KVM_PFN_ERR_RO_FAULT : 2956 - KVM_PFN_NOSLOT; 2957 - } 2958 - 2959 - /* Do not map writable pfn in the readonly memslot. */ 2960 - if (writable && memslot_is_readonly(slot)) { 2961 - *writable = false; 2962 - writable = NULL; 2963 - } 2964 - 2965 - return hva_to_pfn(addr, atomic, interruptible, async, write_fault, 2966 - writable); 2967 - } 2968 - EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot); 2969 - 2970 - kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, 2971 - bool *writable) 2972 - { 2973 - return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false, 2974 - NULL, write_fault, writable, NULL); 2975 - } 2976 - EXPORT_SYMBOL_GPL(gfn_to_pfn_prot); 2977 - 2978 - kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn) 2979 - { 2980 - return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true, 2981 - NULL, NULL); 2982 - } 2983 - EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot); 2984 - 2985 - kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn) 2986 - { 2987 - return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true, 2988 - NULL, NULL); 2989 - } 2990 - EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic); 2991 - 2992 - kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn) 2993 - { 2994 - return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn); 2995 - } 2996 - 
EXPORT_SYMBOL_GPL(gfn_to_pfn); 2997 - 2998 - int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn, 2999 - struct page **pages, int nr_pages) 3000 - { 3001 - unsigned long addr; 3002 - gfn_t entry = 0; 3003 - 3004 - addr = gfn_to_hva_many(slot, gfn, &entry); 3005 - if (kvm_is_error_hva(addr)) 3006 - return -1; 3007 - 3008 - if (entry < nr_pages) 3009 - return 0; 3010 - 3011 - return get_user_pages_fast_only(addr, nr_pages, FOLL_WRITE, pages); 3012 - } 3013 - EXPORT_SYMBOL_GPL(gfn_to_page_many_atomic); 3014 - 3015 - /* 3016 - * Do not use this helper unless you are absolutely certain the gfn _must_ be 3017 - * backed by 'struct page'. A valid example is if the backing memslot is 3018 - * controlled by KVM. Note, if the returned page is valid, it's refcount has 3019 - * been elevated by gfn_to_pfn(). 3020 - */ 3021 - struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn) 3022 - { 3023 - struct page *page; 3024 - kvm_pfn_t pfn; 3025 - 3026 - pfn = gfn_to_pfn(kvm, gfn); 3027 - 3028 - if (is_error_noslot_pfn(pfn)) 3029 - return KVM_ERR_PTR_BAD_PAGE; 3030 - 3031 - page = kvm_pfn_to_refcounted_page(pfn); 3032 - if (!page) 3033 - return KVM_ERR_PTR_BAD_PAGE; 3034 - 3035 - return page; 3036 - } 3037 - EXPORT_SYMBOL_GPL(gfn_to_page); 3038 - 3039 - void kvm_release_pfn(kvm_pfn_t pfn, bool dirty) 3040 - { 3041 - if (dirty) 3042 - kvm_release_pfn_dirty(pfn); 3043 - else 3044 - kvm_release_pfn_clean(pfn); 3045 - } 3046 - 3047 - int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map) 3048 - { 3049 - kvm_pfn_t pfn; 3050 - void *hva = NULL; 3051 - struct page *page = KVM_UNMAPPED_PAGE; 3052 - 3053 - if (!map) 3054 - return -EINVAL; 3055 - 3056 - pfn = gfn_to_pfn(vcpu->kvm, gfn); 3057 - if (is_error_noslot_pfn(pfn)) 3058 - return -EINVAL; 3059 - 3060 - if (pfn_valid(pfn)) { 3061 - page = pfn_to_page(pfn); 3062 - hva = kmap(page); 3063 - #ifdef CONFIG_HAS_IOMEM 3064 - } else { 3065 - hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB); 3066 - 
#endif 3067 - } 3068 - 3069 - if (!hva) 3070 - return -EFAULT; 3071 - 3072 - map->page = page; 3073 - map->hva = hva; 3074 - map->pfn = pfn; 3075 - map->gfn = gfn; 3076 - 3077 - return 0; 3078 - } 3079 - EXPORT_SYMBOL_GPL(kvm_vcpu_map); 3080 - 3081 - void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty) 3082 - { 3083 - if (!map) 3084 - return; 3085 - 3086 - if (!map->hva) 3087 - return; 3088 - 3089 - if (map->page != KVM_UNMAPPED_PAGE) 3090 - kunmap(map->page); 3091 - #ifdef CONFIG_HAS_IOMEM 3092 - else 3093 - memunmap(map->hva); 3094 - #endif 3095 - 3096 - if (dirty) 3097 - kvm_vcpu_mark_page_dirty(vcpu, map->gfn); 3098 - 3099 - kvm_release_pfn(map->pfn, dirty); 3100 - 3101 - map->hva = NULL; 3102 - map->page = NULL; 3103 - } 3104 - EXPORT_SYMBOL_GPL(kvm_vcpu_unmap); 3105 - 3106 2749 static bool kvm_is_ad_tracked_page(struct page *page) 3107 2750 { 3108 2751 /* ··· 2723 3172 2724 3173 void kvm_release_page_clean(struct page *page) 2725 3174 { 2726 - WARN_ON(is_error_page(page)); 3175 + if (!page) 3176 + return; 2727 3177 2728 3178 kvm_set_page_accessed(page); 2729 3179 put_page(page); 2730 3180 } 2731 3181 EXPORT_SYMBOL_GPL(kvm_release_page_clean); 2732 3182 2733 - void kvm_release_pfn_clean(kvm_pfn_t pfn) 2734 - { 2735 - struct page *page; 2736 - 2737 - if (is_error_noslot_pfn(pfn)) 2738 - return; 2739 - 2740 - page = kvm_pfn_to_refcounted_page(pfn); 2741 - if (!page) 2742 - return; 2743 - 2744 - kvm_release_page_clean(page); 2745 - } 2746 - EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); 2747 - 2748 3183 void kvm_release_page_dirty(struct page *page) 2749 3184 { 2750 - WARN_ON(is_error_page(page)); 3185 + if (!page) 3186 + return; 2751 3187 2752 3188 kvm_set_page_dirty(page); 2753 3189 kvm_release_page_clean(page); 2754 3190 } 2755 3191 EXPORT_SYMBOL_GPL(kvm_release_page_dirty); 2756 3192 2757 - void kvm_release_pfn_dirty(kvm_pfn_t pfn) 3193 + static kvm_pfn_t kvm_resolve_pfn(struct kvm_follow_pfn *kfp, struct page *page, 3194 + struct 
follow_pfnmap_args *map, bool writable) 2758 3195 { 2759 - struct page *page; 3196 + kvm_pfn_t pfn; 2760 3197 2761 - if (is_error_noslot_pfn(pfn)) 2762 - return; 3198 + WARN_ON_ONCE(!!page == !!map); 2763 3199 2764 - page = kvm_pfn_to_refcounted_page(pfn); 2765 - if (!page) 2766 - return; 3200 + if (kfp->map_writable) 3201 + *kfp->map_writable = writable; 2767 3202 2768 - kvm_release_page_dirty(page); 3203 + if (map) 3204 + pfn = map->pfn; 3205 + else 3206 + pfn = page_to_pfn(page); 3207 + 3208 + *kfp->refcounted_page = page; 3209 + 3210 + return pfn; 2769 3211 } 2770 - EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty); 2771 3212 2772 3213 /* 2773 - * Note, checking for an error/noslot pfn is the caller's responsibility when 2774 - * directly marking a page dirty/accessed. Unlike the "release" helpers, the 2775 - * "set" helpers are not to be used when the pfn might point at garbage. 3214 + * The fast path to get the writable pfn which will be stored in @pfn, 3215 + * true indicates success, otherwise false is returned. 2776 3216 */ 2777 - void kvm_set_pfn_dirty(kvm_pfn_t pfn) 3217 + static bool hva_to_pfn_fast(struct kvm_follow_pfn *kfp, kvm_pfn_t *pfn) 2778 3218 { 2779 - if (WARN_ON(is_error_noslot_pfn(pfn))) 3219 + struct page *page; 3220 + bool r; 3221 + 3222 + /* 3223 + * Try the fast-only path when the caller wants to pin/get the page for 3224 + * writing. If the caller only wants to read the page, KVM must go 3225 + * down the full, slow path in order to avoid racing an operation that 3226 + * breaks Copy-on-Write (CoW), e.g. so that KVM doesn't end up pointing 3227 + * at the old, read-only page while mm/ points at a new, writable page. 
3228 + */ 3229 + if (!((kfp->flags & FOLL_WRITE) || kfp->map_writable)) 3230 + return false; 3231 + 3232 + if (kfp->pin) 3233 + r = pin_user_pages_fast(kfp->hva, 1, FOLL_WRITE, &page) == 1; 3234 + else 3235 + r = get_user_page_fast_only(kfp->hva, FOLL_WRITE, &page); 3236 + 3237 + if (r) { 3238 + *pfn = kvm_resolve_pfn(kfp, page, NULL, true); 3239 + return true; 3240 + } 3241 + 3242 + return false; 3243 + } 3244 + 3245 + /* 3246 + * The slow path to get the pfn of the specified host virtual address, 3247 + * 1 indicates success, -errno is returned if error is detected. 3248 + */ 3249 + static int hva_to_pfn_slow(struct kvm_follow_pfn *kfp, kvm_pfn_t *pfn) 3250 + { 3251 + /* 3252 + * When a VCPU accesses a page that is not mapped into the secondary 3253 + * MMU, we lookup the page using GUP to map it, so the guest VCPU can 3254 + * make progress. We always want to honor NUMA hinting faults in that 3255 + * case, because GUP usage corresponds to memory accesses from the VCPU. 3256 + * Otherwise, we'd not trigger NUMA hinting faults once a page is 3257 + * mapped into the secondary MMU and gets accessed by a VCPU. 3258 + * 3259 + * Note that get_user_page_fast_only() and FOLL_WRITE for now 3260 + * implicitly honor NUMA hinting faults and don't need this flag. 3261 + */ 3262 + unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT | kfp->flags; 3263 + struct page *page, *wpage; 3264 + int npages; 3265 + 3266 + if (kfp->pin) 3267 + npages = pin_user_pages_unlocked(kfp->hva, 1, &page, flags); 3268 + else 3269 + npages = get_user_pages_unlocked(kfp->hva, 1, &page, flags); 3270 + if (npages != 1) 3271 + return npages; 3272 + 3273 + /* 3274 + * Pinning is mutually exclusive with opportunistically mapping a read 3275 + * fault as writable, as KVM should never pin pages when mapping memory 3276 + * into the guest (pinning is only for direct accesses from KVM). 
3277 + */ 3278 + if (WARN_ON_ONCE(kfp->map_writable && kfp->pin)) 3279 + goto out; 3280 + 3281 + /* map read fault as writable if possible */ 3282 + if (!(flags & FOLL_WRITE) && kfp->map_writable && 3283 + get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) { 3284 + put_page(page); 3285 + page = wpage; 3286 + flags |= FOLL_WRITE; 3287 + } 3288 + 3289 + out: 3290 + *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE); 3291 + return npages; 3292 + } 3293 + 3294 + static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault) 3295 + { 3296 + if (unlikely(!(vma->vm_flags & VM_READ))) 3297 + return false; 3298 + 3299 + if (write_fault && (unlikely(!(vma->vm_flags & VM_WRITE)))) 3300 + return false; 3301 + 3302 + return true; 3303 + } 3304 + 3305 + static int hva_to_pfn_remapped(struct vm_area_struct *vma, 3306 + struct kvm_follow_pfn *kfp, kvm_pfn_t *p_pfn) 3307 + { 3308 + struct follow_pfnmap_args args = { .vma = vma, .address = kfp->hva }; 3309 + bool write_fault = kfp->flags & FOLL_WRITE; 3310 + int r; 3311 + 3312 + /* 3313 + * Remapped memory cannot be pinned in any meaningful sense. Bail if 3314 + * the caller wants to pin the page, i.e. access the page outside of 3315 + * MMU notifier protection, and unsafe umappings are disallowed. 3316 + */ 3317 + if (kfp->pin && !allow_unsafe_mappings) 3318 + return -EINVAL; 3319 + 3320 + r = follow_pfnmap_start(&args); 3321 + if (r) { 3322 + /* 3323 + * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does 3324 + * not call the fault handler, so do it here. 3325 + */ 3326 + bool unlocked = false; 3327 + r = fixup_user_fault(current->mm, kfp->hva, 3328 + (write_fault ? 
FAULT_FLAG_WRITE : 0), 3329 + &unlocked); 3330 + if (unlocked) 3331 + return -EAGAIN; 3332 + if (r) 3333 + return r; 3334 + 3335 + r = follow_pfnmap_start(&args); 3336 + if (r) 3337 + return r; 3338 + } 3339 + 3340 + if (write_fault && !args.writable) { 3341 + *p_pfn = KVM_PFN_ERR_RO_FAULT; 3342 + goto out; 3343 + } 3344 + 3345 + *p_pfn = kvm_resolve_pfn(kfp, NULL, &args, args.writable); 3346 + out: 3347 + follow_pfnmap_end(&args); 3348 + return r; 3349 + } 3350 + 3351 + kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *kfp) 3352 + { 3353 + struct vm_area_struct *vma; 3354 + kvm_pfn_t pfn; 3355 + int npages, r; 3356 + 3357 + might_sleep(); 3358 + 3359 + if (WARN_ON_ONCE(!kfp->refcounted_page)) 3360 + return KVM_PFN_ERR_FAULT; 3361 + 3362 + if (hva_to_pfn_fast(kfp, &pfn)) 3363 + return pfn; 3364 + 3365 + npages = hva_to_pfn_slow(kfp, &pfn); 3366 + if (npages == 1) 3367 + return pfn; 3368 + if (npages == -EINTR || npages == -EAGAIN) 3369 + return KVM_PFN_ERR_SIGPENDING; 3370 + if (npages == -EHWPOISON) 3371 + return KVM_PFN_ERR_HWPOISON; 3372 + 3373 + mmap_read_lock(current->mm); 3374 + retry: 3375 + vma = vma_lookup(current->mm, kfp->hva); 3376 + 3377 + if (vma == NULL) 3378 + pfn = KVM_PFN_ERR_FAULT; 3379 + else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) { 3380 + r = hva_to_pfn_remapped(vma, kfp, &pfn); 3381 + if (r == -EAGAIN) 3382 + goto retry; 3383 + if (r < 0) 3384 + pfn = KVM_PFN_ERR_FAULT; 3385 + } else { 3386 + if ((kfp->flags & FOLL_NOWAIT) && 3387 + vma_is_valid(vma, kfp->flags & FOLL_WRITE)) 3388 + pfn = KVM_PFN_ERR_NEEDS_IO; 3389 + else 3390 + pfn = KVM_PFN_ERR_FAULT; 3391 + } 3392 + mmap_read_unlock(current->mm); 3393 + return pfn; 3394 + } 3395 + 3396 + static kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp) 3397 + { 3398 + kfp->hva = __gfn_to_hva_many(kfp->slot, kfp->gfn, NULL, 3399 + kfp->flags & FOLL_WRITE); 3400 + 3401 + if (kfp->hva == KVM_HVA_ERR_RO_BAD) 3402 + return KVM_PFN_ERR_RO_FAULT; 3403 + 3404 + if (kvm_is_error_hva(kfp->hva)) 3405 + return 
KVM_PFN_NOSLOT; 3406 + 3407 + if (memslot_is_readonly(kfp->slot) && kfp->map_writable) { 3408 + *kfp->map_writable = false; 3409 + kfp->map_writable = NULL; 3410 + } 3411 + 3412 + return hva_to_pfn(kfp); 3413 + } 3414 + 3415 + kvm_pfn_t __kvm_faultin_pfn(const struct kvm_memory_slot *slot, gfn_t gfn, 3416 + unsigned int foll, bool *writable, 3417 + struct page **refcounted_page) 3418 + { 3419 + struct kvm_follow_pfn kfp = { 3420 + .slot = slot, 3421 + .gfn = gfn, 3422 + .flags = foll, 3423 + .map_writable = writable, 3424 + .refcounted_page = refcounted_page, 3425 + }; 3426 + 3427 + if (WARN_ON_ONCE(!writable || !refcounted_page)) 3428 + return KVM_PFN_ERR_FAULT; 3429 + 3430 + *writable = false; 3431 + *refcounted_page = NULL; 3432 + 3433 + return kvm_follow_pfn(&kfp); 3434 + } 3435 + EXPORT_SYMBOL_GPL(__kvm_faultin_pfn); 3436 + 3437 + int kvm_prefetch_pages(struct kvm_memory_slot *slot, gfn_t gfn, 3438 + struct page **pages, int nr_pages) 3439 + { 3440 + unsigned long addr; 3441 + gfn_t entry = 0; 3442 + 3443 + addr = gfn_to_hva_many(slot, gfn, &entry); 3444 + if (kvm_is_error_hva(addr)) 3445 + return -1; 3446 + 3447 + if (entry < nr_pages) 3448 + return 0; 3449 + 3450 + return get_user_pages_fast_only(addr, nr_pages, FOLL_WRITE, pages); 3451 + } 3452 + EXPORT_SYMBOL_GPL(kvm_prefetch_pages); 3453 + 3454 + /* 3455 + * Don't use this API unless you are absolutely, positively certain that KVM 3456 + * needs to get a struct page, e.g. to pin the page for firmware DMA. 3457 + * 3458 + * FIXME: Users of this API likely need to FOLL_PIN the page, not just elevate 3459 + * its refcount. 3460 + */ 3461 + struct page *__gfn_to_page(struct kvm *kvm, gfn_t gfn, bool write) 3462 + { 3463 + struct page *refcounted_page = NULL; 3464 + struct kvm_follow_pfn kfp = { 3465 + .slot = gfn_to_memslot(kvm, gfn), 3466 + .gfn = gfn, 3467 + .flags = write ? 
FOLL_WRITE : 0, 3468 + .refcounted_page = &refcounted_page, 3469 + }; 3470 + 3471 + (void)kvm_follow_pfn(&kfp); 3472 + return refcounted_page; 3473 + } 3474 + EXPORT_SYMBOL_GPL(__gfn_to_page); 3475 + 3476 + int __kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map, 3477 + bool writable) 3478 + { 3479 + struct kvm_follow_pfn kfp = { 3480 + .slot = gfn_to_memslot(vcpu->kvm, gfn), 3481 + .gfn = gfn, 3482 + .flags = writable ? FOLL_WRITE : 0, 3483 + .refcounted_page = &map->pinned_page, 3484 + .pin = true, 3485 + }; 3486 + 3487 + map->pinned_page = NULL; 3488 + map->page = NULL; 3489 + map->hva = NULL; 3490 + map->gfn = gfn; 3491 + map->writable = writable; 3492 + 3493 + map->pfn = kvm_follow_pfn(&kfp); 3494 + if (is_error_noslot_pfn(map->pfn)) 3495 + return -EINVAL; 3496 + 3497 + if (pfn_valid(map->pfn)) { 3498 + map->page = pfn_to_page(map->pfn); 3499 + map->hva = kmap(map->page); 3500 + #ifdef CONFIG_HAS_IOMEM 3501 + } else { 3502 + map->hva = memremap(pfn_to_hpa(map->pfn), PAGE_SIZE, MEMREMAP_WB); 3503 + #endif 3504 + } 3505 + 3506 + return map->hva ? 
0 : -EFAULT; 3507 + } 3508 + EXPORT_SYMBOL_GPL(__kvm_vcpu_map); 3509 + 3510 + void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map) 3511 + { 3512 + if (!map->hva) 2780 3513 return; 2781 3514 2782 - if (pfn_valid(pfn)) 2783 - kvm_set_page_dirty(pfn_to_page(pfn)); 2784 - } 2785 - EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty); 3515 + if (map->page) 3516 + kunmap(map->page); 3517 + #ifdef CONFIG_HAS_IOMEM 3518 + else 3519 + memunmap(map->hva); 3520 + #endif 2786 3521 2787 - void kvm_set_pfn_accessed(kvm_pfn_t pfn) 2788 - { 2789 - if (WARN_ON(is_error_noslot_pfn(pfn))) 2790 - return; 3522 + if (map->writable) 3523 + kvm_vcpu_mark_page_dirty(vcpu, map->gfn); 2791 3524 2792 - if (pfn_valid(pfn)) 2793 - kvm_set_page_accessed(pfn_to_page(pfn)); 3525 + if (map->pinned_page) { 3526 + if (map->writable) 3527 + kvm_set_page_dirty(map->pinned_page); 3528 + kvm_set_page_accessed(map->pinned_page); 3529 + unpin_user_page(map->pinned_page); 3530 + } 3531 + 3532 + map->hva = NULL; 3533 + map->page = NULL; 3534 + map->pinned_page = NULL; 2794 3535 } 2795 - EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed); 3536 + EXPORT_SYMBOL_GPL(kvm_vcpu_unmap); 2796 3537 2797 3538 static int next_segment(unsigned long len, int offset) 2798 3539 { ··· 3763 3920 3764 3921 int kvm_vcpu_yield_to(struct kvm_vcpu *target) 3765 3922 { 3766 - struct pid *pid; 3767 3923 struct task_struct *task = NULL; 3768 - int ret = 0; 3924 + int ret; 3769 3925 3770 - rcu_read_lock(); 3771 - pid = rcu_dereference(target->pid); 3772 - if (pid) 3773 - task = get_pid_task(pid, PIDTYPE_PID); 3774 - rcu_read_unlock(); 3926 + if (!read_trylock(&target->pid_lock)) 3927 + return 0; 3928 + 3929 + if (target->pid) 3930 + task = get_pid_task(target->pid, PIDTYPE_PID); 3931 + 3932 + read_unlock(&target->pid_lock); 3933 + 3775 3934 if (!task) 3776 - return ret; 3935 + return 0; 3777 3936 ret = yield_to(task, 1); 3778 3937 put_task_struct(task); 3779 3938 ··· 3864 4019 3865 4020 void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool 
yield_to_kernel_mode) 3866 4021 { 4022 + int nr_vcpus, start, i, idx, yielded; 3867 4023 struct kvm *kvm = me->kvm; 3868 4024 struct kvm_vcpu *vcpu; 3869 - int last_boosted_vcpu; 3870 - unsigned long i; 3871 - int yielded = 0; 3872 4025 int try = 3; 3873 - int pass; 3874 4026 3875 - last_boosted_vcpu = READ_ONCE(kvm->last_boosted_vcpu); 4027 + nr_vcpus = atomic_read(&kvm->online_vcpus); 4028 + if (nr_vcpus < 2) 4029 + return; 4030 + 4031 + /* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */ 4032 + smp_rmb(); 4033 + 3876 4034 kvm_vcpu_set_in_spin_loop(me, true); 4035 + 3877 4036 /* 3878 - * We boost the priority of a VCPU that is runnable but not 3879 - * currently running, because it got preempted by something 3880 - * else and called schedule in __vcpu_run. Hopefully that 3881 - * VCPU is holding the lock that we need and will release it. 3882 - * We approximate round-robin by starting at the last boosted VCPU. 4037 + * The current vCPU ("me") is spinning in kernel mode, i.e. is likely 4038 + * waiting for a resource to become available. Attempt to yield to a 4039 + * vCPU that is runnable, but not currently running, e.g. because the 4040 + * vCPU was preempted by a higher priority task. With luck, the vCPU 4041 + * that was preempted is holding a lock or some other resource that the 4042 + * current vCPU is waiting to acquire, and yielding to the other vCPU 4043 + * will allow it to make forward progress and release the lock (or kick 4044 + * the spinning vCPU, etc). 4045 + * 4046 + * Since KVM has no insight into what exactly the guest is doing, 4047 + * approximate a round-robin selection by iterating over all vCPUs, 4048 + * starting at the last boosted vCPU. I.e. if N=kvm->last_boosted_vcpu, 4049 + * iterate over vCPU[N+1]..vCPU[N-1], wrapping as needed. 4050 + * 4051 + * Note, this is inherently racy, e.g. if multiple vCPUs are spinning, 4052 + * they may all try to yield to the same vCPU(s). 
But as above, this 4053 + * is all best effort due to KVM's lack of visibility into the guest. 3883 4054 */ 3884 - for (pass = 0; pass < 2 && !yielded && try; pass++) { 3885 - kvm_for_each_vcpu(i, vcpu, kvm) { 3886 - if (!pass && i <= last_boosted_vcpu) { 3887 - i = last_boosted_vcpu; 3888 - continue; 3889 - } else if (pass && i > last_boosted_vcpu) 3890 - break; 3891 - if (!READ_ONCE(vcpu->ready)) 3892 - continue; 3893 - if (vcpu == me) 3894 - continue; 3895 - if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu)) 3896 - continue; 4055 + start = READ_ONCE(kvm->last_boosted_vcpu) + 1; 4056 + for (i = 0; i < nr_vcpus; i++) { 4057 + idx = (start + i) % nr_vcpus; 4058 + if (idx == me->vcpu_idx) 4059 + continue; 3897 4060 3898 - /* 3899 - * Treat the target vCPU as being in-kernel if it has a 3900 - * pending interrupt, as the vCPU trying to yield may 3901 - * be spinning waiting on IPI delivery, i.e. the target 3902 - * vCPU is in-kernel for the purposes of directed yield. 3903 - */ 3904 - if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode && 3905 - !kvm_arch_dy_has_pending_interrupt(vcpu) && 3906 - !kvm_arch_vcpu_preempted_in_kernel(vcpu)) 3907 - continue; 3908 - if (!kvm_vcpu_eligible_for_directed_yield(vcpu)) 3909 - continue; 4061 + vcpu = xa_load(&kvm->vcpu_array, idx); 4062 + if (!READ_ONCE(vcpu->ready)) 4063 + continue; 4064 + if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu)) 4065 + continue; 3910 4066 3911 - yielded = kvm_vcpu_yield_to(vcpu); 3912 - if (yielded > 0) { 3913 - WRITE_ONCE(kvm->last_boosted_vcpu, i); 3914 - break; 3915 - } else if (yielded < 0) { 3916 - try--; 3917 - if (!try) 3918 - break; 3919 - } 4067 + /* 4068 + * Treat the target vCPU as being in-kernel if it has a pending 4069 + * interrupt, as the vCPU trying to yield may be spinning 4070 + * waiting on IPI delivery, i.e. the target vCPU is in-kernel 4071 + * for the purposes of directed yield. 
4072 + */ 4073 + if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode && 4074 + !kvm_arch_dy_has_pending_interrupt(vcpu) && 4075 + !kvm_arch_vcpu_preempted_in_kernel(vcpu)) 4076 + continue; 4077 + 4078 + if (!kvm_vcpu_eligible_for_directed_yield(vcpu)) 4079 + continue; 4080 + 4081 + yielded = kvm_vcpu_yield_to(vcpu); 4082 + if (yielded > 0) { 4083 + WRITE_ONCE(kvm->last_boosted_vcpu, i); 4084 + break; 4085 + } else if (yielded < 0 && !--try) { 4086 + break; 3920 4087 } 3921 4088 } 3922 4089 kvm_vcpu_set_in_spin_loop(me, false); ··· 4025 4168 { 4026 4169 struct kvm_vcpu *vcpu = data; 4027 4170 4028 - rcu_read_lock(); 4029 - *val = pid_nr(rcu_dereference(vcpu->pid)); 4030 - rcu_read_unlock(); 4171 + read_lock(&vcpu->pid_lock); 4172 + *val = pid_nr(vcpu->pid); 4173 + read_unlock(&vcpu->pid_lock); 4031 4174 return 0; 4032 4175 } 4033 4176 ··· 4313 4456 r = -EINVAL; 4314 4457 if (arg) 4315 4458 goto out; 4316 - oldpid = rcu_access_pointer(vcpu->pid); 4459 + 4460 + /* 4461 + * Note, vcpu->pid is primarily protected by vcpu->mutex. The 4462 + * dedicated r/w lock allows other tasks, e.g. other vCPUs, to 4463 + * read vcpu->pid while this vCPU is in KVM_RUN, e.g. to yield 4464 + * directly to this vCPU 4465 + */ 4466 + oldpid = vcpu->pid; 4317 4467 if (unlikely(oldpid != task_pid(current))) { 4318 4468 /* The thread running this VCPU changed. 
*/ 4319 4469 struct pid *newpid; ··· 4330 4466 break; 4331 4467 4332 4468 newpid = get_task_pid(current, PIDTYPE_PID); 4333 - rcu_assign_pointer(vcpu->pid, newpid); 4334 - if (oldpid) 4335 - synchronize_rcu(); 4469 + write_lock(&vcpu->pid_lock); 4470 + vcpu->pid = newpid; 4471 + write_unlock(&vcpu->pid_lock); 4472 + 4336 4473 put_pid(oldpid); 4337 4474 } 4338 4475 vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit__unsafe); ··· 6426 6561 kvm_irqfd_exit(); 6427 6562 } 6428 6563 EXPORT_SYMBOL_GPL(kvm_exit); 6429 - 6430 - struct kvm_vm_worker_thread_context { 6431 - struct kvm *kvm; 6432 - struct task_struct *parent; 6433 - struct completion init_done; 6434 - kvm_vm_thread_fn_t thread_fn; 6435 - uintptr_t data; 6436 - int err; 6437 - }; 6438 - 6439 - static int kvm_vm_worker_thread(void *context) 6440 - { 6441 - /* 6442 - * The init_context is allocated on the stack of the parent thread, so 6443 - * we have to locally copy anything that is needed beyond initialization 6444 - */ 6445 - struct kvm_vm_worker_thread_context *init_context = context; 6446 - struct task_struct *parent; 6447 - struct kvm *kvm = init_context->kvm; 6448 - kvm_vm_thread_fn_t thread_fn = init_context->thread_fn; 6449 - uintptr_t data = init_context->data; 6450 - int err; 6451 - 6452 - err = kthread_park(current); 6453 - /* kthread_park(current) is never supposed to return an error */ 6454 - WARN_ON(err != 0); 6455 - if (err) 6456 - goto init_complete; 6457 - 6458 - err = cgroup_attach_task_all(init_context->parent, current); 6459 - if (err) { 6460 - kvm_err("%s: cgroup_attach_task_all failed with err %d\n", 6461 - __func__, err); 6462 - goto init_complete; 6463 - } 6464 - 6465 - set_user_nice(current, task_nice(init_context->parent)); 6466 - 6467 - init_complete: 6468 - init_context->err = err; 6469 - complete(&init_context->init_done); 6470 - init_context = NULL; 6471 - 6472 - if (err) 6473 - goto out; 6474 - 6475 - /* Wait to be woken up by the spawner before proceeding. 
*/ 6476 - kthread_parkme(); 6477 - 6478 - if (!kthread_should_stop()) 6479 - err = thread_fn(kvm, data); 6480 - 6481 - out: 6482 - /* 6483 - * Move kthread back to its original cgroup to prevent it lingering in 6484 - * the cgroup of the VM process, after the latter finishes its 6485 - * execution. 6486 - * 6487 - * kthread_stop() waits on the 'exited' completion condition which is 6488 - * set in exit_mm(), via mm_release(), in do_exit(). However, the 6489 - * kthread is removed from the cgroup in the cgroup_exit() which is 6490 - * called after the exit_mm(). This causes the kthread_stop() to return 6491 - * before the kthread actually quits the cgroup. 6492 - */ 6493 - rcu_read_lock(); 6494 - parent = rcu_dereference(current->real_parent); 6495 - get_task_struct(parent); 6496 - rcu_read_unlock(); 6497 - cgroup_attach_task_all(parent, current); 6498 - put_task_struct(parent); 6499 - 6500 - return err; 6501 - } 6502 - 6503 - int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn, 6504 - uintptr_t data, const char *name, 6505 - struct task_struct **thread_ptr) 6506 - { 6507 - struct kvm_vm_worker_thread_context init_context = {}; 6508 - struct task_struct *thread; 6509 - 6510 - *thread_ptr = NULL; 6511 - init_context.kvm = kvm; 6512 - init_context.parent = current; 6513 - init_context.thread_fn = thread_fn; 6514 - init_context.data = data; 6515 - init_completion(&init_context.init_done); 6516 - 6517 - thread = kthread_run(kvm_vm_worker_thread, &init_context, 6518 - "%s-%d", name, task_pid_nr(current)); 6519 - if (IS_ERR(thread)) 6520 - return PTR_ERR(thread); 6521 - 6522 - /* kthread_run is never supposed to return NULL */ 6523 - WARN_ON(thread == NULL); 6524 - 6525 - wait_for_completion(&init_context.init_done); 6526 - 6527 - if (!init_context.err) 6528 - *thread_ptr = thread; 6529 - 6530 - return init_context.err; 6531 - }
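
Reviewer note on the reworked kvm_vcpu_on_spin() loop above: the new comment describes visiting vCPU[N+1]..vCPU[N-1] with wraparound, where N is kvm->last_boosted_vcpu. The visiting order can be modeled in plain userspace C as a rough sketch. The helper name `spin_candidates` and its parameters are hypothetical and exist only for illustration. The sketch covers just the index arithmetic; the readiness, blocking, and eligibility checks from the hunk are omitted.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel function): fill order[] with the vCPU
 * indices a spinning vCPU would consider, starting just after last_boosted,
 * wrapping around, and skipping the spinning vCPU itself ("me").
 * Returns the number of candidates written to order[].
 */
static size_t spin_candidates(int nr_vcpus, int last_boosted, int me,
			      int *order)
{
	size_t n = 0;
	int start, i, idx;

	assert(nr_vcpus > 0);

	start = last_boosted + 1;
	for (i = 0; i < nr_vcpus; i++) {
		/* Same wraparound arithmetic as the single for-loop in the hunk. */
		idx = (start + i) % nr_vcpus;
		if (idx == me)
			continue;
		order[n++] = idx;
	}
	return n;
}
```

With four vCPUs, last_boosted = 2 and me = 0, the candidates are visited in the order 3, 1, 2. Note that the position in the vCPU array is `idx`, not the loop counter `i`; the two coincide only when `start` is a multiple of nr_vcpus.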

+34 -2

virt/kvm/kvm_mm.h

··· 20 20 #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock) 21 21 #endif /* KVM_HAVE_MMU_RWLOCK */ 22 22 23 - kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, 24 - bool *async, bool write_fault, bool *writable); 23 + 24 + struct kvm_follow_pfn { 25 + const struct kvm_memory_slot *slot; 26 + const gfn_t gfn; 27 + 28 + unsigned long hva; 29 + 30 + /* FOLL_* flags modifying lookup behavior, e.g. FOLL_WRITE. */ 31 + unsigned int flags; 32 + 33 + /* 34 + * Pin the page (effectively FOLL_PIN, which is an mm/ internal flag). 35 + * The page *must* be pinned if KVM will write to the page via a kernel 36 + * mapping, e.g. via kmap(), mremap(), etc. 37 + */ 38 + bool pin; 39 + 40 + /* 41 + * If non-NULL, try to get a writable mapping even for a read fault. 42 + * Set to true if a writable mapping was obtained. 43 + */ 44 + bool *map_writable; 45 + 46 + /* 47 + * Optional output. Set to a valid "struct page" if the returned pfn 48 + * is for a refcounted or pinned struct page, NULL if the returned pfn 49 + * has no struct page or if the struct page is not being refcounted 50 + * (e.g. tail pages of non-compound higher order allocations from 51 + * IO/PFNMAP mappings). 52 + */ 53 + struct page **refcounted_page; 54 + }; 55 + 56 + kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *kfp); 25 57 26 58 #ifdef CONFIG_HAVE_KVM_PFNCACHE 27 59 void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
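
Reviewer note on struct kvm_follow_pfn: the struct replaces the long list of bool parameters that hva_to_pfn() used to take, and its two pointer members act as outputs. The contract that kvm_resolve_pfn() enforces in the kvm_main.c hunk (exactly one of page or pfnmap result, optional writable output, refcounted page set to NULL for pfnmap results) can be modeled in userspace roughly as below. All `*_model` names are stand-ins invented for this sketch and are not the kernel definitions. The kernel uses WARN_ON_ONCE() where this model uses assert().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types for illustration only; NOT the kernel definitions. */
typedef uint64_t pfn_model_t;
struct page_model { pfn_model_t pfn; };
struct pfnmap_model { pfn_model_t pfn; };

struct follow_pfn_model {
	/* Optional output: if non-NULL, receives whether the mapping is writable. */
	bool *map_writable;
	/* Output: the refcounted page, or NULL when the pfn has none. */
	struct page_model **refcounted_page;
};

static pfn_model_t resolve_pfn_model(struct follow_pfn_model *kfp,
				     struct page_model *page,
				     struct pfnmap_model *map, bool writable)
{
	/* Exactly one source (page or pfnmap result) may be supplied. */
	assert(!!page != !!map);

	if (kfp->map_writable)
		*kfp->map_writable = writable;

	/* For a pfnmap result, page is NULL, so the caller sees no refcounted page. */
	*kfp->refcounted_page = page;

	return map ? map->pfn : page->pfn;
}
```

A caller that receives a NULL refcounted page knows there is no reference to drop later, which is why the release helpers in this series take a struct page and tolerate NULL.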

+14 -6

virt/kvm/pfncache.c

··· 159 159 kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT; 160 160 void *new_khva = NULL; 161 161 unsigned long mmu_seq; 162 + struct page *page; 163 + 164 + struct kvm_follow_pfn kfp = { 165 + .slot = gpc->memslot, 166 + .gfn = gpa_to_gfn(gpc->gpa), 167 + .flags = FOLL_WRITE, 168 + .hva = gpc->uhva, 169 + .refcounted_page = &page, 170 + }; 162 171 163 172 lockdep_assert_held(&gpc->refresh_lock); 164 173 ··· 201 192 if (new_khva != old_khva) 202 193 gpc_unmap(new_pfn, new_khva); 203 194 204 - kvm_release_pfn_clean(new_pfn); 195 + kvm_release_page_unused(page); 205 196 206 197 cond_resched(); 207 198 } 208 199 209 - /* We always request a writeable mapping */ 210 - new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL); 200 + new_pfn = hva_to_pfn(&kfp); 211 201 if (is_error_noslot_pfn(new_pfn)) 212 202 goto out_error; 213 203 ··· 221 213 new_khva = gpc_map(new_pfn); 222 214 223 215 if (!new_khva) { 224 - kvm_release_pfn_clean(new_pfn); 216 + kvm_release_page_unused(page); 225 217 goto out_error; 226 218 } 227 219 ··· 239 231 gpc->khva = new_khva + offset_in_page(gpc->uhva); 240 232 241 233 /* 242 - * Put the reference to the _new_ pfn. The pfn is now tracked by the 234 + * Put the reference to the _new_ page. The page is now tracked by the 243 235 * cache and can be safely migrated, swapped, etc... as the cache will 244 236 * invalidate any mappings in response to relevant mmu_notifier events. 245 237 */ 246 - kvm_release_pfn_clean(new_pfn); 238 + kvm_release_page_clean(page); 247 239 248 240 return 0; 249 241
