Merge tag 'kvmarm-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

+2 -2

Documentation/admin-guide/kernel-parameters.txt

··· 3247 3247
 			for the host. To force nVHE on VHE hardware, add
 			"arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the
 			command-line.
-			"nested" is experimental and should be used with
-			extreme caution.
+			"nested" and "protected" are experimental and should be
+			used with extreme caution.

 	kvm-arm.vgic_v3_group0_trap=
 			[KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0

+11

Documentation/trace/index.rst

··· 91 91
    user_events
    uprobetracer

+Remote Tracing
+--------------
+
+This section covers the framework to read compatible ring-buffers, written by
+entities outside of the kernel (most likely firmware or hypervisor)
+
+.. toctree::
+   :maxdepth: 1
+
+   remotes
+
 Additional Resources
 --------------------

+66

Documentation/trace/remotes.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =============== 4 + Tracing Remotes 5 + =============== 6 + 7 + :Author: Vincent Donnefort <vdonnefort@google.com> 8 + 9 + Overview 10 + ======== 11 + Firmware and hypervisors are black boxes to the kernel. Having a way to see what 12 + they are doing can be useful to debug both. This is where remote tracing buffers 13 + come in. A remote tracing buffer is a ring buffer executed by the firmware or 14 + hypervisor into memory that is memory mapped to the host kernel. This is similar 15 + to how user space memory maps the kernel ring buffer but in this case the kernel 16 + is acting like user space and the firmware or hypervisor is the "kernel" side. 17 + With a trace remote ring buffer, the firmware and hypervisor can record events 18 + for which the host kernel can see and expose to user space. 19 + 20 + Register a remote 21 + ================= 22 + A remote must provide a set of callbacks `struct trace_remote_callbacks` whom 23 + description can be found below. Those callbacks allows Tracefs to enable and 24 + disable tracing and events, to load and unload a tracing buffer (a set of 25 + ring-buffers) and to swap a reader page with the head page, which enables 26 + consuming reading. 27 + 28 + .. kernel-doc:: include/linux/trace_remote.h 29 + 30 + Once registered, an instance will appear for this remote in the Tracefs 31 + directory **remotes/**. Buffers can then be read using the usual Tracefs files 32 + **trace_pipe** and **trace**. 33 + 34 + Declare a remote event 35 + ====================== 36 + Macros are provided to ease the declaration of remote events, in a similar 37 + fashion to in-kernel events. A declaration must provide an ID, a description of 38 + the event arguments and how to print the event: 39 + 40 + .. 
code-block:: c 41 + 42 + REMOTE_EVENT(foo, EVENT_FOO_ID, 43 + RE_STRUCT( 44 + re_field(u64, bar) 45 + ), 46 + RE_PRINTK("bar=%lld", __entry->bar) 47 + ); 48 + 49 + Then those events must be declared in a C file with the following: 50 + 51 + .. code-block:: c 52 + 53 + #define REMOTE_EVENT_INCLUDE_FILE foo_events.h 54 + #include <trace/define_remote_events.h> 55 + 56 + This will provide a `struct remote_event remote_event_foo` that can be given to 57 + `trace_remote_register`. 58 + 59 + Registered events appear in the remote directory under **events/**. 60 + 61 + Simple ring-buffer 62 + ================== 63 + A simple implementation for a ring-buffer writer can be found in 64 + kernel/trace/simple_ring_buffer.c. 65 + 66 + .. kernel-doc:: include/linux/simple_ring_buffer.h

+4 -2

Documentation/virt/kvm/api.rst

··· 907 907
   - KVM_ARM_IRQ_TYPE_CPU:
 	       out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ
   - KVM_ARM_IRQ_TYPE_SPI:
-	       in-kernel GIC: SPI, irq_id between 32 and 1019 (incl.)
+	       in-kernel GICv2/GICv3: SPI, irq_id between 32 and 1019 (incl.)
 	       (the vcpu_index field is ignored)
+	       in-kernel GICv5: SPI, irq_id between 0 and 65535 (incl.)
   - KVM_ARM_IRQ_TYPE_PPI:
-	       in-kernel GIC: PPI, irq_id between 16 and 31 (incl.)
+	       in-kernel GICv2/GICv3: PPI, irq_id between 16 and 31 (incl.)
+	       in-kernel GICv5: PPI, irq_id between 0 and 127 (incl.)

 (The irq_id field thus corresponds nicely to the IRQ ID in the ARM GIC specs)
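Reviewer note (illustrative sketch, not part of the merge): the irq_id ranges in the api.rst hunk above can be summarised as a small range check. The enum and function names below are hypothetical and exist only for this sketch; KVM performs its own validation, which may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; these are not kernel identifiers. */
enum gic_model { GIC_V2_V3, GIC_V5 };
enum irq_kind { IRQ_KIND_SPI, IRQ_KIND_PPI };

/* Returns true if irq_id lies in the range documented for the in-kernel GIC. */
static bool irq_id_in_documented_range(enum gic_model model,
				       enum irq_kind kind, uint32_t irq_id)
{
	if (model == GIC_V5) {
		/* GICv5: SPI 0..65535, PPI 0..127 (inclusive). */
		return kind == IRQ_KIND_SPI ? irq_id <= 65535 : irq_id <= 127;
	}

	/* GICv2/GICv3: SPI 32..1019, PPI 16..31 (inclusive). */
	if (kind == IRQ_KIND_SPI)
		return irq_id >= 32 && irq_id <= 1019;
	return irq_id >= 16 && irq_id <= 31;
}

/* Example usage: spot-check both GIC generations. */
void irq_range_example(void)
{
	assert(irq_id_in_documented_range(GIC_V2_V3, IRQ_KIND_PPI, 16));
	assert(!irq_id_in_documented_range(GIC_V2_V3, IRQ_KIND_PPI, 32));
	assert(irq_id_in_documented_range(GIC_V5, IRQ_KIND_PPI, 127));
}
```

The point of the sketch is that GICv5 numbers SPIs and PPIs from zero in separate spaces, whereas GICv2/GICv3 carve both out of one shared ID space.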

+1

Documentation/virt/kvm/arm/index.rst

··· 10 10
    fw-pseudo-registers
    hyp-abi
    hypercalls
+   pkvm
    pvtime
    ptp_kvm
    vcpu-features

+106

Documentation/virt/kvm/arm/pkvm.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ==================== 4 + Protected KVM (pKVM) 5 + ==================== 6 + 7 + **NOTE**: pKVM is currently an experimental, development feature and 8 + subject to breaking changes as new isolation features are implemented. 9 + Please reach out to the developers at kvmarm@lists.linux.dev if you have 10 + any questions. 11 + 12 + Overview 13 + ======== 14 + 15 + Booting a host kernel with '``kvm-arm.mode=protected``' enables 16 + "Protected KVM" (pKVM). During boot, pKVM installs a stage-2 identity 17 + map page-table for the host and uses it to isolate the hypervisor 18 + running at EL2 from the rest of the host running at EL1/0. 19 + 20 + pKVM permits creation of protected virtual machines (pVMs) by passing 21 + the ``KVM_VM_TYPE_ARM_PROTECTED`` machine type identifier to the 22 + ``KVM_CREATE_VM`` ioctl(). The hypervisor isolates pVMs from the host by 23 + unmapping pages from the stage-2 identity map as they are accessed by a 24 + pVM. Hypercalls are provided for a pVM to share specific regions of its 25 + IPA space back with the host, allowing for communication with the VMM. 26 + A Linux guest must be configured with ``CONFIG_ARM_PKVM_GUEST=y`` in 27 + order to issue these hypercalls. 28 + 29 + See hypercalls.rst for more details. 30 + 31 + Isolation mechanisms 32 + ==================== 33 + 34 + pKVM relies on a number of mechanisms to isolate PVMs from the host: 35 + 36 + CPU memory isolation 37 + -------------------- 38 + 39 + Status: Isolation of anonymous memory and metadata pages. 40 + 41 + Metadata pages (e.g. page-table pages and '``struct kvm_vcpu``' pages) 42 + are donated from the host to the hypervisor during pVM creation and 43 + are consequently unmapped from the stage-2 identity map until the pVM is 44 + destroyed. 45 + 46 + Similarly to regular KVM, pages are lazily mapped into the guest in 47 + response to stage-2 page faults handled by the host. 
However, when 48 + running a pVM, these pages are first pinned and then unmapped from the 49 + stage-2 identity map as part of the donation procedure. This gives rise 50 + to some user-visible differences when compared to non-protected VMs, 51 + largely due to the lack of MMU notifiers: 52 + 53 + * Memslots cannot be moved or deleted once the pVM has started running. 54 + * Read-only memslots and dirty logging are not supported. 55 + * With the exception of swap, file-backed pages cannot be mapped into a 56 + pVM. 57 + * Donated pages are accounted against ``RLIMIT_MLOCK`` and so the VMM 58 + must have a sufficient resource limit or be granted ``CAP_IPC_LOCK``. 59 + The lack of a runtime reclaim mechanism means that memory locked for 60 + a pVM will remain locked until the pVM is destroyed. 61 + * Changes to the VMM address space (e.g. a ``MAP_FIXED`` mmap() over a 62 + mapping associated with a memslot) are not reflected in the guest and 63 + may lead to loss of coherency. 64 + * Accessing pVM memory that has not been shared back will result in the 65 + delivery of a SIGSEGV. 66 + * If a system call accesses pVM memory that has not been shared back 67 + then it will either return ``-EFAULT`` or forcefully reclaim the 68 + memory pages. Reclaimed memory is zeroed by the hypervisor and a 69 + subsequent attempt to access it in the pVM will return ``-EFAULT`` 70 + from the ``VCPU_RUN`` ioctl(). 71 + 72 + CPU state isolation 73 + ------------------- 74 + 75 + Status: **Unimplemented.** 76 + 77 + DMA isolation using an IOMMU 78 + ---------------------------- 79 + 80 + Status: **Unimplemented.** 81 + 82 + Proxying of Trustzone services 83 + ------------------------------ 84 + 85 + Status: FF-A and PSCI calls from the host are proxied by the pKVM 86 + hypervisor. 87 + 88 + The FF-A proxy ensures that the host cannot share pVM or hypervisor 89 + memory with Trustzone as part of a "confused deputy" attack. 
90 + 91 + The PSCI proxy ensures that CPUs always have the stage-2 identity map 92 + installed when they are executing in the host. 93 + 94 + Protected VM firmware (pvmfw) 95 + ----------------------------- 96 + 97 + Status: **Unimplemented.** 98 + 99 + Resources 100 + ========= 101 + 102 + Quentin Perret's KVM Forum 2022 talk entitled "Protected KVM on arm64: A 103 + technical deep dive" remains a good resource for learning more about 104 + pKVM, despite some of the details having changed in the meantime: 105 + 106 + https://www.youtube.com/watch?v=9npebeVFbFw

+50

Documentation/virt/kvm/devices/arm-vgic-v5.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ==================================================== 4 + ARM Virtual Generic Interrupt Controller v5 (VGICv5) 5 + ==================================================== 6 + 7 + 8 + Device types supported: 9 + - KVM_DEV_TYPE_ARM_VGIC_V5 ARM Generic Interrupt Controller v5.0 10 + 11 + Only one VGIC instance may be instantiated through this API. The created VGIC 12 + will act as the VM interrupt controller, requiring emulated user-space devices 13 + to inject interrupts to the VGIC instead of directly to CPUs. 14 + 15 + Creating a guest GICv5 device requires a host GICv5 host. The current VGICv5 16 + device only supports PPI interrupts. These can either be injected from emulated 17 + in-kernel devices (such as the Arch Timer, or PMU), or via the KVM_IRQ_LINE 18 + ioctl. 19 + 20 + Groups: 21 + KVM_DEV_ARM_VGIC_GRP_CTRL 22 + Attributes: 23 + 24 + KVM_DEV_ARM_VGIC_CTRL_INIT 25 + request the initialization of the VGIC, no additional parameter in 26 + kvm_device_attr.addr. Must be called after all VCPUs have been created. 27 + 28 + KVM_DEV_ARM_VGIC_USERPSPACE_PPIs 29 + request the mask of userspace-drivable PPIs. Only a subset of the PPIs can 30 + be directly driven from userspace with GICv5, and the returned mask 31 + informs userspace of which it is allowed to drive via KVM_IRQ_LINE. 32 + 33 + Userspace must allocate and point to __u64[2] of data in 34 + kvm_device_attr.addr. When this call returns, the provided memory will be 35 + populated with the userspace PPI mask. The lower __u64 contains the mask 36 + for the lower 64 PPIS, with the remaining 64 being in the second __u64. 37 + 38 + This is a read-only attribute, and cannot be set. Attempts to set it are 39 + rejected. 
40 + 41 + Errors: 42 + 43 + ======= ======================================================== 44 + -ENXIO VGIC not properly configured as required prior to calling 45 + this attribute 46 + -ENODEV no online VCPU 47 + -ENOMEM memory shortage when allocating vgic internal data 48 + -EFAULT Invalid guest ram access 49 + -EBUSY One or more VCPUS are running 50 + ======= ========================================================
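Reviewer note (illustrative sketch, not part of the merge): the arm-vgic-v5.rst text above describes the userspace PPI mask as a `__u64[2]`, with the lower word covering the lower 64 PPIs. A caller might test a single PPI as sketched below. The function name is hypothetical, and the sketch assumes that bit n of each word corresponds to the n-th PPI of that word, which the documentation implies but does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * mask[0] covers PPIs 0..63 and mask[1] covers PPIs 64..127, following the
 * layout described in the documentation. Bit order within a word is assumed.
 */
static bool ppi_is_userspace_drivable(const uint64_t mask[2], unsigned int ppi)
{
	if (ppi >= 128)
		return false;
	return (mask[ppi / 64] >> (ppi % 64)) & 1;
}

/* Example usage with a made-up mask: PPI 3 and PPI 64 are drivable. */
void ppi_mask_example(void)
{
	const uint64_t mask[2] = { UINT64_C(1) << 3, UINT64_C(1) << 0 };

	assert(ppi_is_userspace_drivable(mask, 3));
	assert(ppi_is_userspace_drivable(mask, 64));
	assert(!ppi_is_userspace_drivable(mask, 4));
}
```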

+1

Documentation/virt/kvm/devices/index.rst

··· 10 10
    arm-vgic-its
    arm-vgic
    arm-vgic-v3
+   arm-vgic-v5
    mpic
    s390_flic
    vcpu

+3 -2

Documentation/virt/kvm/devices/vcpu.rst

··· 37 37
 A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt
 number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt
 type must be same for each vcpu. As a PPI, the interrupt number is the same for
-all vcpus, while as an SPI it must be a separate number per vcpu.
+all vcpus, while as an SPI it must be a separate number per vcpu. For
+GICv5-based guests, the architected PPI (23) must be used.

 1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT
 ---------------------------------------
··· 51 50
 	 -EEXIST  Interrupt number already used
 	 -ENODEV  PMUv3 not supported or GIC not initialized
 	 -ENXIO   PMUv3 not supported, missing VCPU feature or interrupt
-		  number not set
+		  number not set (non-GICv5 guests, only)
 	 -EBUSY   PMUv3 already initialized
 	 =======  ======================================================

+2 -2

arch/arm64/include/asm/el2_setup.h

··· 50 50
 	 * effectively VHE-only or not.
 	 */
 	msr_hcr_el2 x0			// Setup HCR_EL2 as nVHE
-	isb
 	mov	x1, #1			// Write something to FAR_EL1
 	msr	far_el1, x1
 	isb
··· 63 64
 .LnE2H0_\@:
 	orr	x0, x0, #HCR_E2H
 	msr_hcr_el2 x0
-	isb
 .LnVHE_\@:
 .endm

··· 246 248
 			 ICH_HFGWTR_EL2_ICC_CR0_EL1 | \
 			 ICH_HFGWTR_EL2_ICC_APR_EL1)
 	msr_s	SYS_ICH_HFGWTR_EL2, x0		// Disable reg write traps
+	mov	x0, #(ICH_VCTLR_EL2_En)
+	msr_s	SYS_ICH_VCTLR_EL2, x0		// Enable vHPPI selection
 .Lskip_gicv5_\@:
 .endm

+32 -12

arch/arm64/include/asm/kvm_asm.h

··· 51 51 #include <linux/mm.h> 52 52 53 53 enum __kvm_host_smccc_func { 54 - /* Hypercalls available only prior to pKVM finalisation */ 54 + /* Hypercalls that are unavailable once pKVM has finalised. */ 55 55 /* __KVM_HOST_SMCCC_FUNC___kvm_hyp_init */ 56 56 __KVM_HOST_SMCCC_FUNC___pkvm_init = __KVM_HOST_SMCCC_FUNC___kvm_hyp_init + 1, 57 57 __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping, ··· 60 60 __KVM_HOST_SMCCC_FUNC___vgic_v3_init_lrs, 61 61 __KVM_HOST_SMCCC_FUNC___vgic_v3_get_gic_config, 62 62 __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize, 63 + __KVM_HOST_SMCCC_FUNC_MIN_PKVM = __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize, 63 64 64 - /* Hypercalls available after pKVM finalisation */ 65 - __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp, 66 - __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp, 67 - __KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest, 68 - __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest, 69 - __KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest, 70 - __KVM_HOST_SMCCC_FUNC___pkvm_host_wrprotect_guest, 71 - __KVM_HOST_SMCCC_FUNC___pkvm_host_test_clear_young_guest, 72 - __KVM_HOST_SMCCC_FUNC___pkvm_host_mkyoung_guest, 65 + /* Hypercalls that are always available and common to [nh]VHE/pKVM. */ 73 66 __KVM_HOST_SMCCC_FUNC___kvm_adjust_pc, 74 67 __KVM_HOST_SMCCC_FUNC___kvm_vcpu_run, 75 68 __KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context, ··· 74 81 __KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff, 75 82 __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs, 76 83 __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs, 84 + __KVM_HOST_SMCCC_FUNC___vgic_v5_save_apr, 85 + __KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr, 86 + __KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM = __KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr, 87 + 88 + /* Hypercalls that are available only when pKVM has finalised. 
*/ 89 + __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp, 90 + __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp, 91 + __KVM_HOST_SMCCC_FUNC___pkvm_host_donate_guest, 92 + __KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest, 93 + __KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest, 94 + __KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest, 95 + __KVM_HOST_SMCCC_FUNC___pkvm_host_wrprotect_guest, 96 + __KVM_HOST_SMCCC_FUNC___pkvm_host_test_clear_young_guest, 97 + __KVM_HOST_SMCCC_FUNC___pkvm_host_mkyoung_guest, 77 98 __KVM_HOST_SMCCC_FUNC___pkvm_reserve_vm, 78 99 __KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm, 79 100 __KVM_HOST_SMCCC_FUNC___pkvm_init_vm, 80 101 __KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu, 81 - __KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm, 102 + __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_in_poison_fault, 103 + __KVM_HOST_SMCCC_FUNC___pkvm_force_reclaim_guest_page, 104 + __KVM_HOST_SMCCC_FUNC___pkvm_reclaim_dying_guest_page, 105 + __KVM_HOST_SMCCC_FUNC___pkvm_start_teardown_vm, 106 + __KVM_HOST_SMCCC_FUNC___pkvm_finalize_teardown_vm, 82 107 __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_load, 83 108 __KVM_HOST_SMCCC_FUNC___pkvm_vcpu_put, 84 109 __KVM_HOST_SMCCC_FUNC___pkvm_tlb_flush_vmid, 110 + __KVM_HOST_SMCCC_FUNC___tracing_load, 111 + __KVM_HOST_SMCCC_FUNC___tracing_unload, 112 + __KVM_HOST_SMCCC_FUNC___tracing_enable, 113 + __KVM_HOST_SMCCC_FUNC___tracing_swap_reader, 114 + __KVM_HOST_SMCCC_FUNC___tracing_update_clock, 115 + __KVM_HOST_SMCCC_FUNC___tracing_reset, 116 + __KVM_HOST_SMCCC_FUNC___tracing_enable_event, 117 + __KVM_HOST_SMCCC_FUNC___tracing_write_event, 85 118 }; 86 119 87 120 #define DECLARE_KVM_VHE_SYM(sym) extern char sym[] ··· 310 291 asmlinkage void kvm_unexpected_el2_exception(void); 311 292 struct kvm_cpu_context; 312 293 void handle_trap(struct kvm_cpu_context *host_ctxt); 313 - asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on); 294 + asmlinkage void __noreturn __kvm_host_psci_cpu_on_entry(void); 295 + asmlinkage void __noreturn 
__kvm_host_psci_cpu_resume_entry(void); 314 296 void __noreturn __pkvm_init_finalise(void); 315 297 void kvm_nvhe_prepare_backtrace(unsigned long fp, unsigned long pc); 316 298 void kvm_patch_vector_branch(struct alt_instr *alt,
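Reviewer note (illustrative sketch, not part of the merge): the reordered enum in kvm_asm.h above groups hypercalls into three contiguous ranges delimited by `__KVM_HOST_SMCCC_FUNC_MIN_PKVM` and `__KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM`. One plausible way such markers could be used is a pair of bounds checks, sketched below with a reduced, hypothetical enum. The actual dispatch code is not shown in this excerpt; in particular, whether the boundary values themselves are accepted is an assumption inferred from the marker names.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, hypothetical stand-in for enum __kvm_host_smccc_func. */
enum sketch_hcall {
	SKETCH_HCALL_INIT,		/* setup-only call */
	SKETCH_HCALL_PROT_FINALIZE,	/* last setup-only call */
	SKETCH_HCALL_MIN_PKVM = SKETCH_HCALL_PROT_FINALIZE,
	SKETCH_HCALL_VCPU_RUN,		/* common to every mode */
	SKETCH_HCALL_MAX_NO_PKVM = SKETCH_HCALL_VCPU_RUN,
	SKETCH_HCALL_HOST_SHARE_HYP,	/* pKVM only */
	SKETCH_HCALL_COUNT,
};

/* Assumed policy: one lower bound with pKVM, one upper bound without it. */
static bool sketch_hcall_allowed(unsigned int id, bool pkvm_finalised)
{
	if (id >= SKETCH_HCALL_COUNT)
		return false;
	if (pkvm_finalised)
		return id >= SKETCH_HCALL_MIN_PKVM;
	return id <= SKETCH_HCALL_MAX_NO_PKVM;
}

/* Example usage: the common range is reachable in both modes. */
void sketch_hcall_example(void)
{
	assert(sketch_hcall_allowed(SKETCH_HCALL_VCPU_RUN, true));
	assert(sketch_hcall_allowed(SKETCH_HCALL_VCPU_RUN, false));
	assert(!sketch_hcall_allowed(SKETCH_HCALL_INIT, true));
	assert(!sketch_hcall_allowed(SKETCH_HCALL_HOST_SHARE_HYP, false));
}
```

Keeping each class contiguous is what lets a check like this stay a simple comparison instead of a per-call lookup.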

+16

arch/arm64/include/asm/kvm_define_hypevents.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + 3 + #define REMOTE_EVENT_INCLUDE_FILE arch/arm64/include/asm/kvm_hypevents.h 4 + 5 + #define REMOTE_EVENT_SECTION "_hyp_events" 6 + 7 + #define HE_STRUCT(__args) __args 8 + #define HE_PRINTK(__args...) __args 9 + #define he_field re_field 10 + 11 + #define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ 12 + REMOTE_EVENT(__name, 0, RE_STRUCT(__struct), RE_PRINTK(__printk)) 13 + 14 + #define HYP_EVENT_MULTI_READ 15 + #include <trace/define_remote_events.h> 16 + #undef HYP_EVENT_MULTI_READ

+49 -1

arch/arm64/include/asm/kvm_host.h

··· 217 217 */ 218 218 bool nested_stage2_enabled; 219 219 220 + #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS 221 + struct dentry *shadow_pt_debugfs_dentry; 222 + #endif 223 + 220 224 /* 221 225 * true when this MMU needs to be unmapped before being used for a new 222 226 * purpose. ··· 251 247 unsigned long vendor_hyp_bmap_2; /* Function numbers 64-127 */ 252 248 }; 253 249 254 - typedef unsigned int pkvm_handle_t; 250 + typedef u16 pkvm_handle_t; 255 251 256 252 struct kvm_protected_vm { 257 253 pkvm_handle_t handle; ··· 259 255 struct kvm_hyp_memcache stage2_teardown_mc; 260 256 bool is_protected; 261 257 bool is_created; 258 + 259 + /* 260 + * True when the guest is being torn down. When in this state, the 261 + * guest's vCPUs can't be loaded anymore, but its pages can be 262 + * reclaimed by the host. 263 + */ 264 + bool is_dying; 262 265 }; 263 266 264 267 struct kvm_mpidr_data { ··· 298 287 HDFGRTR2_GROUP, 299 288 HDFGWTR2_GROUP = HDFGRTR2_GROUP, 300 289 HFGITR2_GROUP, 290 + ICH_HFGRTR_GROUP, 291 + ICH_HFGWTR_GROUP = ICH_HFGRTR_GROUP, 292 + ICH_HFGITR_GROUP, 301 293 302 294 /* Must be last */ 303 295 __NR_FGT_GROUP_IDS__ ··· 419 405 * the associated pKVM instance in the hypervisor. 420 406 */ 421 407 struct kvm_protected_vm pkvm; 408 + 409 + #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS 410 + /* Nested virtualization info */ 411 + struct dentry *debugfs_nv_dentry; 412 + #endif 422 413 }; 423 414 424 415 struct kvm_vcpu_fault_info { ··· 639 620 VNCR(ICH_HCR_EL2), 640 621 VNCR(ICH_VMCR_EL2), 641 622 623 + VNCR(ICH_HFGRTR_EL2), 624 + VNCR(ICH_HFGWTR_EL2), 625 + VNCR(ICH_HFGITR_EL2), 626 + 642 627 NR_SYS_REGS /* Nothing after this line! 
*/ 643 628 }; 644 629 ··· 698 675 extern struct fgt_masks hfgitr2_masks; 699 676 extern struct fgt_masks hdfgrtr2_masks; 700 677 extern struct fgt_masks hdfgwtr2_masks; 678 + extern struct fgt_masks ich_hfgrtr_masks; 679 + extern struct fgt_masks ich_hfgwtr_masks; 680 + extern struct fgt_masks ich_hfgitr_masks; 701 681 702 682 extern struct fgt_masks kvm_nvhe_sym(hfgrtr_masks); 703 683 extern struct fgt_masks kvm_nvhe_sym(hfgwtr_masks); ··· 713 687 extern struct fgt_masks kvm_nvhe_sym(hfgitr2_masks); 714 688 extern struct fgt_masks kvm_nvhe_sym(hdfgrtr2_masks); 715 689 extern struct fgt_masks kvm_nvhe_sym(hdfgwtr2_masks); 690 + extern struct fgt_masks kvm_nvhe_sym(ich_hfgrtr_masks); 691 + extern struct fgt_masks kvm_nvhe_sym(ich_hfgwtr_masks); 692 + extern struct fgt_masks kvm_nvhe_sym(ich_hfgitr_masks); 716 693 717 694 struct kvm_cpu_context { 718 695 struct user_pt_regs regs; /* sp = sp_el0 */ ··· 797 768 struct kvm_guest_debug_arch regs; 798 769 /* Statistical profiling extension */ 799 770 u64 pmscr_el1; 771 + u64 pmblimitr_el1; 800 772 /* Self-hosted trace */ 801 773 u64 trfcr_el1; 774 + u64 trblimitr_el1; 802 775 /* Values of trap registers for the host before guest entry. 
*/ 803 776 u64 mdcr_el2; 804 777 u64 brbcr_el1; ··· 818 787 819 788 /* Last vgic_irq part of the AP list recorded in an LR */ 820 789 struct vgic_irq *last_lr_irq; 790 + 791 + /* PPI state tracking for GICv5-based guests */ 792 + struct { 793 + DECLARE_BITMAP(pendr, VGIC_V5_NR_PRIVATE_IRQS); 794 + 795 + /* The saved state of the regs when leaving the guest */ 796 + DECLARE_BITMAP(activer_exit, VGIC_V5_NR_PRIVATE_IRQS); 797 + } vgic_v5_ppi_state; 821 798 }; 822 799 823 800 struct kvm_host_psci_config { ··· 962 923 963 924 /* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */ 964 925 struct vncr_tlb *vncr_tlb; 926 + 927 + /* Hyp-readable copy of kvm_vcpu::pid */ 928 + pid_t pid; 965 929 }; 966 930 967 931 /* ··· 1701 1659 case HDFGRTR2_EL2: 1702 1660 case HDFGWTR2_EL2: 1703 1661 return HDFGRTR2_GROUP; 1662 + case ICH_HFGRTR_EL2: 1663 + case ICH_HFGWTR_EL2: 1664 + return ICH_HFGRTR_GROUP; 1665 + case ICH_HFGITR_EL2: 1666 + return ICH_HFGITR_GROUP; 1704 1667 default: 1705 1668 BUILD_BUG_ON(1); 1706 1669 } ··· 1720 1673 case HDFGWTR_EL2: \ 1721 1674 case HFGWTR2_EL2: \ 1722 1675 case HDFGWTR2_EL2: \ 1676 + case ICH_HFGWTR_EL2: \ 1723 1677 p = &(vcpu)->arch.fgt[id].w; \ 1724 1678 break; \ 1725 1679 default: \

+12 -2

arch/arm64/include/asm/kvm_hyp.h

··· 87 87 void __vgic_v3_restore_vmcr_aprs(struct vgic_v3_cpu_if *cpu_if); 88 88 int __vgic_v3_perform_cpuif_access(struct kvm_vcpu *vcpu); 89 89 90 + /* GICv5 */ 91 + void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if); 92 + void __vgic_v5_restore_vmcr_apr(struct vgic_v5_cpu_if *cpu_if); 93 + /* No hypercalls for the following */ 94 + void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if); 95 + void __vgic_v5_restore_ppi_state(struct vgic_v5_cpu_if *cpu_if); 96 + void __vgic_v5_save_state(struct vgic_v5_cpu_if *cpu_if); 97 + void __vgic_v5_restore_state(struct vgic_v5_cpu_if *cpu_if); 98 + 90 99 #ifdef __KVM_NVHE_HYPERVISOR__ 91 100 void __timer_enable_traps(struct kvm_vcpu *vcpu); 92 101 void __timer_disable_traps(struct kvm_vcpu *vcpu); ··· 138 129 #ifdef __KVM_NVHE_HYPERVISOR__ 139 130 void __pkvm_init_switch_pgd(phys_addr_t pgd, unsigned long sp, 140 131 void (*fn)(void)); 141 - int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus, 142 - unsigned long *per_cpu_base, u32 hyp_va_bits); 132 + int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long *per_cpu_base, u32 hyp_va_bits); 143 133 void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt); 144 134 #endif 145 135 146 136 extern u64 kvm_nvhe_sym(id_aa64pfr0_el1_sys_val); 147 137 extern u64 kvm_nvhe_sym(id_aa64pfr1_el1_sys_val); 138 + extern u64 kvm_nvhe_sym(id_aa64pfr2_el1_sys_val); 148 139 extern u64 kvm_nvhe_sym(id_aa64isar0_el1_sys_val); 149 140 extern u64 kvm_nvhe_sym(id_aa64isar1_el1_sys_val); 150 141 extern u64 kvm_nvhe_sym(id_aa64isar2_el1_sys_val); ··· 156 147 extern unsigned long kvm_nvhe_sym(__icache_flags); 157 148 extern unsigned int kvm_nvhe_sym(kvm_arm_vmid_bits); 158 149 extern unsigned int kvm_nvhe_sym(kvm_host_sve_max_vl); 150 + extern unsigned long kvm_nvhe_sym(hyp_nr_cpus); 159 151 160 152 #endif /* __ARM64_KVM_HYP_H__ */

+60

arch/arm64/include/asm/kvm_hypevents.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + 3 + #if !defined(__ARM64_KVM_HYPEVENTS_H_) || defined(HYP_EVENT_MULTI_READ) 4 + #define __ARM64_KVM_HYPEVENTS_H_ 5 + 6 + #ifdef __KVM_NVHE_HYPERVISOR__ 7 + #include <nvhe/trace.h> 8 + #endif 9 + 10 + #ifndef __HYP_ENTER_EXIT_REASON 11 + #define __HYP_ENTER_EXIT_REASON 12 + enum hyp_enter_exit_reason { 13 + HYP_REASON_SMC, 14 + HYP_REASON_HVC, 15 + HYP_REASON_PSCI, 16 + HYP_REASON_HOST_ABORT, 17 + HYP_REASON_GUEST_EXIT, 18 + HYP_REASON_ERET_HOST, 19 + HYP_REASON_ERET_GUEST, 20 + HYP_REASON_UNKNOWN /* Must be last */ 21 + }; 22 + #endif 23 + 24 + HYP_EVENT(hyp_enter, 25 + HE_PROTO(struct kvm_cpu_context *host_ctxt, u8 reason), 26 + HE_STRUCT( 27 + he_field(u8, reason) 28 + he_field(pid_t, vcpu) 29 + ), 30 + HE_ASSIGN( 31 + __entry->reason = reason; 32 + __entry->vcpu = __tracing_get_vcpu_pid(host_ctxt); 33 + ), 34 + HE_PRINTK("reason=%s vcpu=%d", __hyp_enter_exit_reason_str(__entry->reason), __entry->vcpu) 35 + ); 36 + 37 + HYP_EVENT(hyp_exit, 38 + HE_PROTO(struct kvm_cpu_context *host_ctxt, u8 reason), 39 + HE_STRUCT( 40 + he_field(u8, reason) 41 + he_field(pid_t, vcpu) 42 + ), 43 + HE_ASSIGN( 44 + __entry->reason = reason; 45 + __entry->vcpu = __tracing_get_vcpu_pid(host_ctxt); 46 + ), 47 + HE_PRINTK("reason=%s vcpu=%d", __hyp_enter_exit_reason_str(__entry->reason), __entry->vcpu) 48 + ); 49 + 50 + HYP_EVENT(selftest, 51 + HE_PROTO(u64 id), 52 + HE_STRUCT( 53 + he_field(u64, id) 54 + ), 55 + HE_ASSIGN( 56 + __entry->id = id; 57 + ), 58 + RE_PRINTK("id=%llu", __entry->id) 59 + ); 60 + #endif
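Reviewer note (illustrative sketch, not part of the merge): the `hyp_enter` and `hyp_exit` events above print their reason through `__hyp_enter_exit_reason_str()`, whose definition is not included in this excerpt. A lookup of that kind could look like the sketch below. The function name and every string are made up for illustration and should not be read as the strings the kernel prints.

```c
#include <assert.h>
#include <string.h>

/* Copied from the hunk above. */
enum hyp_enter_exit_reason {
	HYP_REASON_SMC,
	HYP_REASON_HVC,
	HYP_REASON_PSCI,
	HYP_REASON_HOST_ABORT,
	HYP_REASON_GUEST_EXIT,
	HYP_REASON_ERET_HOST,
	HYP_REASON_ERET_GUEST,
	HYP_REASON_UNKNOWN	/* Must be last */
};

/* Hypothetical helper; the strings are placeholders chosen for this sketch. */
static const char *sketch_reason_str(unsigned char reason)
{
	static const char * const names[] = {
		[HYP_REASON_SMC]	= "smc",
		[HYP_REASON_HVC]	= "hvc",
		[HYP_REASON_PSCI]	= "psci",
		[HYP_REASON_HOST_ABORT]	= "host_abort",
		[HYP_REASON_GUEST_EXIT]	= "guest_exit",
		[HYP_REASON_ERET_HOST]	= "eret_host",
		[HYP_REASON_ERET_GUEST]	= "eret_guest",
	};

	/* Anything at or beyond the sentinel is reported as unknown. */
	if (reason >= HYP_REASON_UNKNOWN)
		return "unknown";
	return names[reason];
}

/* Example usage: out-of-range values fall back to the sentinel string. */
void sketch_reason_example(void)
{
	assert(strcmp(sketch_reason_str(HYP_REASON_PSCI), "psci") == 0);
	assert(strcmp(sketch_reason_str(200), "unknown") == 0);
}
```

Keeping `HYP_REASON_UNKNOWN` last, as the enum comment requires, is what makes the single bounds check sufficient.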

+26

arch/arm64/include/asm/kvm_hyptrace.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + #ifndef __ARM64_KVM_HYPTRACE_H_ 3 + #define __ARM64_KVM_HYPTRACE_H_ 4 + 5 + #include <linux/ring_buffer.h> 6 + 7 + struct hyp_trace_desc { 8 + unsigned long bpages_backing_start; 9 + size_t bpages_backing_size; 10 + struct trace_buffer_desc trace_buffer_desc; 11 + 12 + }; 13 + 14 + struct hyp_event_id { 15 + unsigned short id; 16 + atomic_t enabled; 17 + }; 18 + 19 + extern struct remote_event __hyp_events_start[]; 20 + extern struct remote_event __hyp_events_end[]; 21 + 22 + /* hyp_event section used by the hypervisor */ 23 + extern struct hyp_event_id __hyp_event_ids_start[]; 24 + extern struct hyp_event_id __hyp_event_ids_end[]; 25 + 26 + #endif

+4

arch/arm64/include/asm/kvm_mmu.h

··· 393 393

 #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
 void kvm_s2_ptdump_create_debugfs(struct kvm *kvm);
+void kvm_nested_s2_ptdump_create_debugfs(struct kvm_s2_mmu *mmu);
+void kvm_nested_s2_ptdump_remove_debugfs(struct kvm_s2_mmu *mmu);
 #else
 static inline void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) {}
+static inline void kvm_nested_s2_ptdump_create_debugfs(struct kvm_s2_mmu *mmu) {}
+static inline void kvm_nested_s2_ptdump_remove_debugfs(struct kvm_s2_mmu *mmu) {}
 #endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */

 #endif /* __ASSEMBLER__ */

+33 -12

arch/arm64/include/asm/kvm_pgtable.h

··· 99 99 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \ 100 100 KVM_PTE_LEAF_ATTR_HI_S2_XN) 101 101 102 - #define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2) 103 - #define KVM_MAX_OWNER_ID 1 102 + /* pKVM invalid pte encodings */ 103 + #define KVM_INVALID_PTE_TYPE_MASK GENMASK(63, 60) 104 + #define KVM_INVALID_PTE_ANNOT_MASK ~(KVM_PTE_VALID | \ 105 + KVM_INVALID_PTE_TYPE_MASK) 104 106 105 - /* 106 - * Used to indicate a pte for which a 'break-before-make' sequence is in 107 - * progress. 108 - */ 109 - #define KVM_INVALID_PTE_LOCKED BIT(10) 107 + enum kvm_invalid_pte_type { 108 + /* 109 + * Used to indicate a pte for which a 'break-before-make' 110 + * sequence is in progress. 111 + */ 112 + KVM_INVALID_PTE_TYPE_LOCKED = 1, 113 + 114 + /* 115 + * pKVM has unmapped the page from the host due to a change of 116 + * ownership. 117 + */ 118 + KVM_HOST_INVALID_PTE_TYPE_DONATION, 119 + 120 + /* 121 + * The page has been forcefully reclaimed from the guest by the 122 + * host. 123 + */ 124 + KVM_GUEST_INVALID_PTE_TYPE_POISONED, 125 + }; 110 126 111 127 static inline bool kvm_pte_valid(kvm_pte_t pte) 112 128 { ··· 674 658 void *mc, enum kvm_pgtable_walk_flags flags); 675 659 676 660 /** 677 - * kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space to 678 - * track ownership. 661 + * kvm_pgtable_stage2_annotate() - Unmap and annotate pages in the IPA space 662 + * to track ownership (and more). 679 663 * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*(). 680 664 * @addr: Base intermediate physical address to annotate. 681 665 * @size: Size of the annotated range. 682 666 * @mc: Cache of pre-allocated and zeroed memory from which to allocate 683 667 * page-table pages. 684 - * @owner_id: Unique identifier for the owner of the page. 668 + * @type: The type of the annotation, determining its meaning and format. 669 + * @annotation: A 59-bit value that will be stored in the page tables. 670 + * @annotation[0] and @annotation[63:60] must be 0. 
671 + * @annotation[59:1] is stored in the page tables, along 672 + * with @type. 685 673 * 686 674 * By default, all page-tables are owned by identifier 0. This function can be 687 675 * used to mark portions of the IPA space as owned by other entities. When a ··· 694 674 * 695 675 * Return: 0 on success, negative error code on failure. 696 676 */ 697 - int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size, 698 - void *mc, u8 owner_id); 677 + int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size, 678 + void *mc, enum kvm_invalid_pte_type type, 679 + kvm_pte_t annotation); 699 680 700 681 /** 701 682 * kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 page-table.

+1 -3

arch/arm64/include/asm/kvm_pkvm.h

··· 17 17

 #define HYP_MEMBLOCK_REGIONS 128

-int pkvm_init_host_vm(struct kvm *kvm);
+int pkvm_init_host_vm(struct kvm *kvm, unsigned long type);
 int pkvm_create_hyp_vm(struct kvm *kvm);
 bool pkvm_hyp_vm_is_created(struct kvm *kvm);
 void pkvm_destroy_hyp_vm(struct kvm *kvm);
··· 40 40
 	case KVM_CAP_MAX_VCPU_ID:
 	case KVM_CAP_MSI_DEVID:
 	case KVM_CAP_ARM_VM_IPA_SIZE:
-	case KVM_CAP_ARM_PMU_V3:
-	case KVM_CAP_ARM_SVE:
 	case KVM_CAP_ARM_PTRAUTH_ADDRESS:
 	case KVM_CAP_ARM_PTRAUTH_GENERIC:
 		return true;

+9 -4

arch/arm64/include/asm/sysreg.h

··· 1052 1052
 #define GICV5_OP_GIC_CDPRI		sys_insn(1, 0, 12, 1, 2)
 #define GICV5_OP_GIC_CDRCFG		sys_insn(1, 0, 12, 1, 5)
 #define GICV5_OP_GICR_CDIA		sys_insn(1, 0, 12, 3, 0)
+#define GICV5_OP_GICR_CDNMIA		sys_insn(1, 0, 12, 3, 1)

 /* Definitions for GIC CDAFF */
 #define GICV5_GIC_CDAFF_IAFFID_MASK	GENMASK_ULL(47, 32)
··· 1099 1098
 #define GICV5_GIC_CDIA_TYPE_MASK	GENMASK_ULL(31, 29)
 #define GICV5_GIC_CDIA_ID_MASK		GENMASK_ULL(23, 0)

+/* Definitions for GICR CDNMIA */
+#define GICV5_GICR_CDNMIA_VALID_MASK	BIT_ULL(32)
+#define GICV5_GICR_CDNMIA_VALID(r)	FIELD_GET(GICV5_GICR_CDNMIA_VALID_MASK, r)
+#define GICV5_GICR_CDNMIA_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GICR_CDNMIA_ID_MASK	GENMASK_ULL(23, 0)
+
 #define gicr_insn(insn)			read_sysreg_s(GICV5_OP_GICR_##insn)
 #define gic_insn(v, insn)		write_sysreg_s(v, GICV5_OP_GIC_##insn)

··· 1121 1114
 	.macro	msr_hcr_el2, reg
 #if IS_ENABLED(CONFIG_AMPERE_ERRATUM_AC04_CPU_23)
 	dsb	nsh
-	msr	hcr_el2, \reg
-	isb
-#else
-	msr	hcr_el2, \reg
 #endif
+	msr	hcr_el2, \reg
+	isb			// Required by AMPERE_ERRATUM_AC04_CPU_23
 	.endm
 #else
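Reviewer note (illustrative sketch, not part of the merge): the new `GICV5_GICR_CDNMIA_*` masks in sysreg.h describe a 64-bit value with a valid flag at bit 32, a type field in bits [31:29], and an ID in bits [23:0]. The sketch below decodes such a value in plain C. The struct, the `SKETCH_*` macros, and the function name are hypothetical and stand in for the kernel's `FIELD_GET()`; the meaning of the individual type values is not covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CDNMIA_VALID		(UINT64_C(1) << 32)	/* BIT_ULL(32) */
#define SKETCH_CDNMIA_TYPE_SHIFT	29
#define SKETCH_CDNMIA_TYPE_MASK		(UINT64_C(0x7) << 29)	/* GENMASK_ULL(31, 29) */
#define SKETCH_CDNMIA_ID_MASK		UINT64_C(0xffffff)	/* GENMASK_ULL(23, 0) */

/* Hypothetical container for the decoded fields. */
struct sketch_cdnmia {
	bool valid;
	unsigned int type;
	uint32_t id;
};

static struct sketch_cdnmia sketch_decode_cdnmia(uint64_t r)
{
	struct sketch_cdnmia out = {
		.valid = (r & SKETCH_CDNMIA_VALID) != 0,
		.type = (unsigned int)((r & SKETCH_CDNMIA_TYPE_MASK) >>
				       SKETCH_CDNMIA_TYPE_SHIFT),
		.id = (uint32_t)(r & SKETCH_CDNMIA_ID_MASK),
	};

	return out;
}

/* Example usage with a made-up register value. */
void sketch_cdnmia_example(void)
{
	struct sketch_cdnmia d = sketch_decode_cdnmia(
		(UINT64_C(1) << 32) | (UINT64_C(3) << 29) | 27);

	assert(d.valid && d.type == 3 && d.id == 27);
}
```

The layout mirrors the existing `GICV5_GIC_CDIA_*` definitions a few lines earlier in the same hunk, so the same decode shape applies to both.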

+9

arch/arm64/include/asm/virt.h

··· 94 94
 	       static_branch_likely(&kvm_protected_mode_initialized);
 }

+#ifdef CONFIG_KVM
+bool pkvm_force_reclaim_guest_page(phys_addr_t phys);
+#else
+static inline bool pkvm_force_reclaim_guest_page(phys_addr_t phys)
+{
+	return false;
+}
+#endif
+
 /* Reports the availability of HYP mode */
 static inline bool is_hyp_mode_available(void)
 {

+3

arch/arm64/include/asm/vncr_mapping.h

··· 108 108
 #define VNCR_MPAMVPM5_EL2       0x968
 #define VNCR_MPAMVPM6_EL2       0x970
 #define VNCR_MPAMVPM7_EL2       0x978
+#define VNCR_ICH_HFGITR_EL2     0xB10
+#define VNCR_ICH_HFGRTR_EL2     0xB18
+#define VNCR_ICH_HFGWTR_EL2     0xB20

 #endif /* __ARM64_VNCR_MAPPING_H__ */

+1

arch/arm64/include/uapi/asm/kvm.h

···
 #define KVM_DEV_ARM_ITS_RESTORE_TABLES 2
 #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
 #define KVM_DEV_ARM_ITS_CTRL_RESET 4
+#define KVM_DEV_ARM_VGIC_USERSPACE_PPIS 5
 
 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL 0

+1

arch/arm64/kernel/cpufeature.c

···
 
 static const struct arm64_ftr_bits ftr_id_aa64pfr2[] = {
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_FPMR_SHIFT, 4, 0),
+	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_GCIE_SHIFT, 4, ID_AA64PFR2_EL1_GCIE_NI),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_MTEFAR_SHIFT, 4, ID_AA64PFR2_EL1_MTEFAR_NI),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_MTESTOREONLY_SHIFT, 4, ID_AA64PFR2_EL1_MTESTOREONLY_NI),
 	ARM64_FTR_END,

-1

arch/arm64/kernel/hyp-stub.S

···
 	// Engage the VHE magic!
 	mov_q x0, HCR_HOST_VHE_FLAGS
 	msr_hcr_el2 x0
-	isb
 
 	// Use the EL1 allocated stack, per-cpu offset
 	mrs x0, sp_el1

+4

arch/arm64/kernel/image-vars.h

···
 KVM_NVHE_ALIAS(__hyp_data_end);
 KVM_NVHE_ALIAS(__hyp_rodata_start);
 KVM_NVHE_ALIAS(__hyp_rodata_end);
+#ifdef CONFIG_NVHE_EL2_TRACING
+KVM_NVHE_ALIAS(__hyp_event_ids_start);
+KVM_NVHE_ALIAS(__hyp_event_ids_end);
+#endif
 
 /* pKVM static key */
 KVM_NVHE_ALIAS(kvm_protected_mode_initialized);

+18

arch/arm64/kernel/vmlinux.lds.S

···
 	*(__kvm_ex_table) \
 	__stop___kvm_ex_table = .;
 
+#ifdef CONFIG_NVHE_EL2_TRACING
+#define HYPERVISOR_EVENT_IDS \
+	. = ALIGN(PAGE_SIZE); \
+	__hyp_event_ids_start = .; \
+	*(HYP_SECTION_NAME(.event_ids)) \
+	__hyp_event_ids_end = .;
+#else
+#define HYPERVISOR_EVENT_IDS
+#endif
+
 #define HYPERVISOR_RODATA_SECTIONS \
 	HYP_SECTION_NAME(.rodata) : { \
 		. = ALIGN(PAGE_SIZE); \
 		__hyp_rodata_start = .; \
 		*(HYP_SECTION_NAME(.data..ro_after_init)) \
 		*(HYP_SECTION_NAME(.rodata)) \
+		HYPERVISOR_EVENT_IDS \
 		. = ALIGN(PAGE_SIZE); \
 		__hyp_rodata_end = .; \
 	}
···
 
 	HYPERVISOR_DATA_SECTION
 
+#ifdef CONFIG_NVHE_EL2_TRACING
+	.data.hyp_events : {
+		__hyp_events_start = .;
+		*(SORT(_hyp_events.*))
+		__hyp_events_end = .;
+	}
+#endif
 	/*
 	 * Data written with the MMU off but read with the MMU on requires
 	 * cache lines to be invalidated, discarding up to a Cache Writeback

+45 -23

arch/arm64/kvm/Kconfig

···
 
 	  If unsure, say N.
 
-config NVHE_EL2_DEBUG
-	bool "Debug mode for non-VHE EL2 object"
-	depends on KVM
-	help
-	  Say Y here to enable the debug mode for the non-VHE KVM EL2 object.
-	  Failure reports will BUG() in the hypervisor. This is intended for
-	  local EL2 hypervisor development.
-
-	  If unsure, say N.
-
-config PROTECTED_NVHE_STACKTRACE
-	bool "Protected KVM hypervisor stacktraces"
-	depends on NVHE_EL2_DEBUG
-	default n
-	help
-	  Say Y here to enable pKVM hypervisor stacktraces on hyp_panic()
-
-	  If using protected nVHE mode, but cannot afford the associated
-	  memory cost (less than 0.75 page per CPU) of pKVM stacktraces,
-	  say N.
-
-	  If unsure, or not using protected nVHE (pKVM), say N.
+if KVM
 
 config PTDUMP_STAGE2_DEBUGFS
 	bool "Present the stage-2 pagetables to debugfs"
-	depends on KVM
 	depends on DEBUG_KERNEL
 	depends on DEBUG_FS
 	depends on ARCH_HAS_PTDUMP
···
 
 	  If in doubt, say N.
 
+config NVHE_EL2_DEBUG
+	bool "Debug mode for non-VHE EL2 object"
+	default n
+	help
+	  Say Y here to enable the debug mode for the non-VHE KVM EL2 object.
+	  Failure reports will BUG() in the hypervisor. This is intended for
+	  local EL2 hypervisor development.
+
+	  If unsure, say N.
+
+if NVHE_EL2_DEBUG
+
+config NVHE_EL2_TRACING
+	bool
+	depends on TRACING && FTRACE
+	select TRACE_REMOTE
+	default y
+
+config PKVM_DISABLE_STAGE2_ON_PANIC
+	bool "Disable the host stage-2 on panic"
+	default n
+	help
+	  Relax the host stage-2 on hypervisor panic to allow the kernel to
+	  unwind and symbolize the hypervisor stacktrace. This however tampers
+	  the system security. This is intended for local EL2 hypervisor
+	  development.
+
+	  If unsure, say N.
+
+config PKVM_STACKTRACE
+	bool "Protected KVM hypervisor stacktraces"
+	depends on PKVM_DISABLE_STAGE2_ON_PANIC
+	default y
+	help
+	  Say Y here to enable pKVM hypervisor stacktraces on hyp_panic()
+
+	  If using protected nVHE mode, but cannot afford the associated
+	  memory cost (less than 0.75 page per CPU) of pKVM stacktraces,
+	  say N.
+
+	  If unsure, or not using protected nVHE (pKVM), say N.
+
+endif # NVHE_EL2_DEBUG
+endif # KVM
 endif # VIRTUALIZATION

+2

arch/arm64/kvm/Makefile

···
 kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
 kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o
 
+kvm-$(CONFIG_NVHE_EL2_TRACING) += hyp_trace.o
+
 always-y := hyp_constants.h hyp-constants.s
 
 define rule_gen_hyp_constants

+75 -27

arch/arm64/kvm/arch_timer.c

··· 56 56 .get_input_level = kvm_arch_timer_get_input_level, 57 57 }; 58 58 59 + static struct irq_ops arch_timer_irq_ops_vgic_v5 = { 60 + .get_input_level = kvm_arch_timer_get_input_level, 61 + .queue_irq_unlock = vgic_v5_ppi_queue_irq_unlock, 62 + .set_direct_injection = vgic_v5_set_ppi_dvi, 63 + }; 64 + 59 65 static int nr_timers(struct kvm_vcpu *vcpu) 60 66 { 61 67 if (!vcpu_has_nv(vcpu)) ··· 453 447 if (userspace_irqchip(vcpu->kvm)) 454 448 return; 455 449 450 + /* Skip injecting on GICv5 for directly injected (DVI'd) timers */ 451 + if (vgic_is_v5(vcpu->kvm)) { 452 + struct timer_map map; 453 + 454 + get_timer_map(vcpu, &map); 455 + 456 + if (map.direct_ptimer == timer_ctx || 457 + map.direct_vtimer == timer_ctx) 458 + return; 459 + } 460 + 456 461 kvm_vgic_inject_irq(vcpu->kvm, vcpu, 457 462 timer_irq(timer_ctx), 458 463 timer_ctx->irq.level, ··· 691 674 phys_active = kvm_vgic_map_is_active(vcpu, timer_irq(ctx)); 692 675 693 676 phys_active |= ctx->irq.level; 677 + phys_active |= vgic_is_v5(vcpu->kvm); 694 678 695 679 set_timer_irq_phys_active(ctx, phys_active); 696 680 } ··· 758 740 759 741 ret = kvm_vgic_map_phys_irq(vcpu, 760 742 map->direct_vtimer->host_timer_irq, 761 - timer_irq(map->direct_vtimer), 762 - &arch_timer_irq_ops); 743 + timer_irq(map->direct_vtimer)); 763 744 WARN_ON_ONCE(ret); 764 745 ret = kvm_vgic_map_phys_irq(vcpu, 765 746 map->direct_ptimer->host_timer_irq, 766 - timer_irq(map->direct_ptimer), 767 - &arch_timer_irq_ops); 747 + timer_irq(map->direct_ptimer)); 768 748 WARN_ON_ONCE(ret); 769 749 } 770 750 } ··· 880 864 get_timer_map(vcpu, &map); 881 865 882 866 if (static_branch_likely(&has_gic_active_state)) { 883 - if (vcpu_has_nv(vcpu)) 867 + /* We don't do NV on GICv5, yet */ 868 + if (vcpu_has_nv(vcpu) && !vgic_is_v5(vcpu->kvm)) 884 869 kvm_timer_vcpu_load_nested_switch(vcpu, &map); 885 870 886 871 kvm_timer_vcpu_load_gic(map.direct_vtimer); ··· 951 934 952 935 if (kvm_vcpu_is_blocking(vcpu)) 953 936 kvm_timer_blocking(vcpu); 937 + 
938 + if (vgic_is_v5(vcpu->kvm)) { 939 + set_timer_irq_phys_active(map.direct_vtimer, false); 940 + if (map.direct_ptimer) 941 + set_timer_irq_phys_active(map.direct_ptimer, false); 942 + } 954 943 } 955 944 956 945 void kvm_timer_sync_nested(struct kvm_vcpu *vcpu) ··· 1120 1097 HRTIMER_MODE_ABS_HARD); 1121 1098 } 1122 1099 1100 + /* 1101 + * This is always called during kvm_arch_init_vm, but will also be 1102 + * called from kvm_vgic_create if we have a vGICv5. 1103 + */ 1123 1104 void kvm_timer_init_vm(struct kvm *kvm) 1124 1105 { 1106 + /* 1107 + * Set up the default PPIs - note that we adjust them based on 1108 + * the model of the GIC as GICv5 uses a different way to 1109 + * describing interrupts. 1110 + */ 1125 1111 for (int i = 0; i < NR_KVM_TIMERS; i++) 1126 - kvm->arch.timer_data.ppi[i] = default_ppi[i]; 1112 + kvm->arch.timer_data.ppi[i] = get_vgic_ppi(kvm, default_ppi[i]); 1127 1113 } 1128 1114 1129 1115 void kvm_timer_cpu_up(void) ··· 1301 1269 1302 1270 static void timer_irq_eoi(struct irq_data *d) 1303 1271 { 1304 - if (!irqd_is_forwarded_to_vcpu(d)) 1272 + /* 1273 + * On a GICv5 host, we still need to call EOI on the parent for 1274 + * PPIs. The host driver already handles irqs which are forwarded to 1275 + * vcpus, and skips the GIC CDDI while still doing the GIC CDEOI. This 1276 + * is required to emulate the EOIMode=1 on GICv5 hardware. Failure to 1277 + * call EOI unsurprisingly results in *BAD* lock-ups. 
1278 + */ 1279 + if (!irqd_is_forwarded_to_vcpu(d) || 1280 + kvm_vgic_global_state.type == VGIC_V5) 1305 1281 irq_chip_eoi_parent(d); 1306 1282 } 1307 1283 ··· 1373 1333 host_vtimer_irq = info->virtual_irq; 1374 1334 kvm_irq_fixup_flags(host_vtimer_irq, &host_vtimer_irq_flags); 1375 1335 1376 - if (kvm_vgic_global_state.no_hw_deactivation) { 1336 + if (kvm_vgic_global_state.no_hw_deactivation || 1337 + kvm_vgic_global_state.type == VGIC_V5) { 1377 1338 struct fwnode_handle *fwnode; 1378 1339 struct irq_data *data; 1379 1340 ··· 1392 1351 return -ENOMEM; 1393 1352 } 1394 1353 1395 - arch_timer_irq_ops.flags |= VGIC_IRQ_SW_RESAMPLE; 1354 + if (kvm_vgic_global_state.no_hw_deactivation) 1355 + arch_timer_irq_ops.flags |= VGIC_IRQ_SW_RESAMPLE; 1396 1356 WARN_ON(irq_domain_push_irq(domain, host_vtimer_irq, 1397 1357 (void *)TIMER_VTIMER)); 1398 1358 } ··· 1543 1501 if (kvm_vgic_set_owner(vcpu, irq, ctx)) 1544 1502 break; 1545 1503 1504 + /* With GICv5, the default PPI is what you get -- nothing else */ 1505 + if (vgic_is_v5(vcpu->kvm) && irq != get_vgic_ppi(vcpu->kvm, default_ppi[i])) 1506 + break; 1507 + 1546 1508 /* 1547 - * We know by construction that we only have PPIs, so 1548 - * all values are less than 32. 1509 + * We know by construction that we only have PPIs, so all values 1510 + * are less than 32 for non-GICv5 VGICs. On GICv5, they are 1511 + * architecturally defined to be under 32 too. However, we mask 1512 + * off most of the bits as we might be presented with a GICv5 1513 + * style PPI where the type is encoded in the top-bits. 
1549 1514 */ 1550 - ppis |= BIT(irq); 1515 + ppis |= BIT(irq & 0x1f); 1551 1516 } 1552 1517 1553 1518 valid = hweight32(ppis) == nr_timers(vcpu); ··· 1592 1543 { 1593 1544 struct arch_timer_cpu *timer = vcpu_timer(vcpu); 1594 1545 struct timer_map map; 1546 + struct irq_ops *ops; 1595 1547 int ret; 1596 1548 1597 1549 if (timer->enabled) ··· 1613 1563 1614 1564 get_timer_map(vcpu, &map); 1615 1565 1566 + ops = vgic_is_v5(vcpu->kvm) ? &arch_timer_irq_ops_vgic_v5 : 1567 + &arch_timer_irq_ops; 1568 + 1569 + for (int i = 0; i < nr_timers(vcpu); i++) 1570 + kvm_vgic_set_irq_ops(vcpu, timer_irq(vcpu_get_timer(vcpu, i)), ops); 1571 + 1616 1572 ret = kvm_vgic_map_phys_irq(vcpu, 1617 1573 map.direct_vtimer->host_timer_irq, 1618 - timer_irq(map.direct_vtimer), 1619 - &arch_timer_irq_ops); 1574 + timer_irq(map.direct_vtimer)); 1620 1575 if (ret) 1621 1576 return ret; 1622 1577 1623 - if (map.direct_ptimer) { 1578 + if (map.direct_ptimer) 1624 1579 ret = kvm_vgic_map_phys_irq(vcpu, 1625 1580 map.direct_ptimer->host_timer_irq, 1626 - timer_irq(map.direct_ptimer), 1627 - &arch_timer_irq_ops); 1628 - } 1629 - 1581 + timer_irq(map.direct_ptimer)); 1630 1582 if (ret) 1631 1583 return ret; 1632 1584 ··· 1655 1603 if (get_user(irq, uaddr)) 1656 1604 return -EFAULT; 1657 1605 1658 - if (!(irq_is_ppi(irq))) 1606 + if (!(irq_is_ppi(vcpu->kvm, irq))) 1659 1607 return -EINVAL; 1660 1608 1661 - mutex_lock(&vcpu->kvm->arch.config_lock); 1609 + guard(mutex)(&vcpu->kvm->arch.config_lock); 1662 1610 1663 1611 if (test_bit(KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE, 1664 1612 &vcpu->kvm->arch.flags)) { 1665 - ret = -EBUSY; 1666 - goto out; 1613 + return -EBUSY; 1667 1614 } 1668 1615 1669 1616 switch (attr->attr) { ··· 1679 1628 idx = TIMER_HPTIMER; 1680 1629 break; 1681 1630 default: 1682 - ret = -ENXIO; 1683 - goto out; 1631 + return -ENXIO; 1684 1632 } 1685 1633 1686 1634 /* ··· 1689 1639 */ 1690 1640 vcpu->kvm->arch.timer_data.ppi[idx] = irq; 1691 1641 1692 - out: 1693 - 
mutex_unlock(&vcpu->kvm->arch.config_lock); 1694 1642 return ret; 1695 1643 } 1696 1644
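
Note: the PPI validity check in this diff builds a 32-bit bitmap with `ppis |= BIT(irq & 0x1f)` and compares its population count with the number of timers, so two timers sharing a PPI are rejected. A minimal userspace sketch of that idea follows. `popcount32()` stands in for the kernel's hweight32(), `timer_ppis_distinct()` is a hypothetical name, and the GICv5-style IntID used in the usage below (type in bits 31:29) is an assumption based on the TYPE masks shown elsewhere in this merge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable population count, standing in for the kernel's hweight32(). */
static unsigned int popcount32(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)	/* clear the lowest set bit each round */
		n++;
	return n;
}

/*
 * Return true if every timer uses a different PPI. Only the low five
 * bits select the bitmap position, so any type encoding in the upper
 * bits of the interrupt ID is ignored.
 */
static bool timer_ppis_distinct(const uint32_t *irqs, unsigned int nr)
{
	uint32_t ppis = 0;

	assert(nr <= 32);	/* more entries than bits can never be distinct */
	for (unsigned int i = 0; i < nr; i++)
		ppis |= UINT32_C(1) << (irqs[i] & 0x1f);

	return popcount32(ppis) == nr;
}
```

For example, `{ 27, 30, 26, 28 }` passes, `{ 27, 27 }` fails, and `{ (1u << 29) | 27, (1u << 29) | 30 }` passes because the upper bits are masked off before the bitmap is updated.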

+62 -7

arch/arm64/kvm/arm.c

··· 24 24 25 25 #define CREATE_TRACE_POINTS 26 26 #include "trace_arm.h" 27 + #include "hyp_trace.h" 27 28 28 29 #include <linux/uaccess.h> 29 30 #include <asm/ptrace.h> ··· 36 35 #include <asm/kvm_arm.h> 37 36 #include <asm/kvm_asm.h> 38 37 #include <asm/kvm_emulate.h> 38 + #include <asm/kvm_hyp.h> 39 39 #include <asm/kvm_mmu.h> 40 40 #include <asm/kvm_nested.h> 41 41 #include <asm/kvm_pkvm.h> ··· 47 45 #include <kvm/arm_hypercalls.h> 48 46 #include <kvm/arm_pmu.h> 49 47 #include <kvm/arm_psci.h> 48 + #include <kvm/arm_vgic.h> 49 + 50 + #include <linux/irqchip/arm-gic-v5.h> 50 51 51 52 #include "sys_regs.h" 52 53 ··· 208 203 { 209 204 int ret; 210 205 206 + if (type & ~KVM_VM_TYPE_ARM_MASK) 207 + return -EINVAL; 208 + 211 209 mutex_init(&kvm->arch.config_lock); 212 210 213 211 #ifdef CONFIG_LOCKDEP ··· 242 234 * If any failures occur after this is successful, make sure to 243 235 * call __pkvm_unreserve_vm to unreserve the VM in hyp. 244 236 */ 245 - ret = pkvm_init_host_vm(kvm); 237 + ret = pkvm_init_host_vm(kvm, type); 246 238 if (ret) 247 - goto err_free_cpumask; 239 + goto err_uninit_mmu; 240 + } else if (type & KVM_VM_TYPE_ARM_PROTECTED) { 241 + ret = -EINVAL; 242 + goto err_uninit_mmu; 248 243 } 249 244 250 245 kvm_vgic_early_init(kvm); ··· 263 252 264 253 return 0; 265 254 255 + err_uninit_mmu: 256 + kvm_uninit_stage2_mmu(kvm); 266 257 err_free_cpumask: 267 258 free_cpumask_var(kvm->arch.supported_cpus); 268 259 err_unshare_kvm: ··· 314 301 if (is_protected_kvm_enabled()) 315 302 pkvm_destroy_hyp_vm(kvm); 316 303 304 + kvm_uninit_stage2_mmu(kvm); 317 305 kvm_destroy_mpidr_data(kvm); 318 306 319 307 kfree(kvm->arch.sysreg_masks); ··· 627 613 if (unlikely(kvm_wfi_trap_policy != KVM_WFX_NOTRAP_SINGLE_TASK)) 628 614 return kvm_wfi_trap_policy == KVM_WFX_NOTRAP; 629 615 616 + if (vgic_is_v5(vcpu->kvm)) 617 + return single_task_running(); 618 + 630 619 return single_task_running() && 631 620 vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 && 632 621 
(atomic_read(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count) || ··· 722 705 723 706 if (!cpumask_test_cpu(cpu, vcpu->kvm->arch.supported_cpus)) 724 707 vcpu_set_on_unsupported_cpu(vcpu); 708 + 709 + vcpu->arch.pid = pid_nr(vcpu->pid); 725 710 } 726 711 727 712 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) ··· 952 933 if (ret) 953 934 return ret; 954 935 } 936 + 937 + ret = vgic_v5_finalize_ppi_state(kvm); 938 + if (ret) 939 + return ret; 955 940 956 941 if (is_protected_kvm_enabled()) { 957 942 ret = pkvm_create_hyp_vm(kvm); ··· 1462 1439 int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level, 1463 1440 bool line_status) 1464 1441 { 1465 - u32 irq = irq_level->irq; 1466 1442 unsigned int irq_type, vcpu_id, irq_num; 1467 1443 struct kvm_vcpu *vcpu = NULL; 1468 1444 bool level = irq_level->level; 1445 + u32 irq = irq_level->irq; 1446 + unsigned long *mask; 1469 1447 1470 1448 irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK; 1471 1449 vcpu_id = (irq >> KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK; ··· 1496 1472 if (!vcpu) 1497 1473 return -EINVAL; 1498 1474 1499 - if (irq_num < VGIC_NR_SGIS || irq_num >= VGIC_NR_PRIVATE_IRQS) 1475 + if (vgic_is_v5(kvm)) { 1476 + if (irq_num >= VGIC_V5_NR_PRIVATE_IRQS) 1477 + return -EINVAL; 1478 + 1479 + /* 1480 + * Only allow PPIs that are explicitly exposed to 1481 + * usespace to be driven via KVM_IRQ_LINE 1482 + */ 1483 + mask = kvm->arch.vgic.gicv5_vm.userspace_ppis; 1484 + if (!test_bit(irq_num, mask)) 1485 + return -EINVAL; 1486 + 1487 + /* Build a GICv5-style IntID here */ 1488 + irq_num = vgic_v5_make_ppi(irq_num); 1489 + } else if (irq_num < VGIC_NR_SGIS || 1490 + irq_num >= VGIC_NR_PRIVATE_IRQS) { 1500 1491 return -EINVAL; 1492 + } 1501 1493 1502 1494 return kvm_vgic_inject_irq(kvm, vcpu, irq_num, level, NULL); 1503 1495 case KVM_ARM_IRQ_TYPE_SPI: 1504 1496 if (!irqchip_in_kernel(kvm)) 1505 1497 return -ENXIO; 1506 1498 1507 - if (irq_num < VGIC_NR_PRIVATE_IRQS) 1508 - return 
-EINVAL; 1499 + if (vgic_is_v5(kvm)) { 1500 + /* Build a GICv5-style IntID here */ 1501 + irq_num = vgic_v5_make_spi(irq_num); 1502 + } else { 1503 + if (irq_num < VGIC_NR_PRIVATE_IRQS) 1504 + return -EINVAL; 1505 + } 1509 1506 1510 1507 return kvm_vgic_inject_irq(kvm, NULL, irq_num, level, NULL); 1511 1508 } ··· 2459 2414 2460 2415 kvm_register_perf_callbacks(); 2461 2416 2417 + err = kvm_hyp_trace_init(); 2418 + if (err) 2419 + kvm_err("Failed to initialize Hyp tracing\n"); 2420 + 2462 2421 out: 2463 2422 if (err) 2464 2423 hyp_cpu_pm_exit(); ··· 2514 2465 preempt_disable(); 2515 2466 cpu_hyp_init_context(); 2516 2467 ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size, 2517 - num_possible_cpus(), kern_hyp_va(per_cpu_base), 2468 + kern_hyp_va(per_cpu_base), 2518 2469 hyp_va_bits); 2519 2470 cpu_hyp_init_features(); 2520 2471 ··· 2556 2507 { 2557 2508 kvm_nvhe_sym(id_aa64pfr0_el1_sys_val) = get_hyp_id_aa64pfr0_el1(); 2558 2509 kvm_nvhe_sym(id_aa64pfr1_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1); 2510 + kvm_nvhe_sym(id_aa64pfr2_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64PFR2_EL1); 2559 2511 kvm_nvhe_sym(id_aa64isar0_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR0_EL1); 2560 2512 kvm_nvhe_sym(id_aa64isar1_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR1_EL1); 2561 2513 kvm_nvhe_sym(id_aa64isar2_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR2_EL1); ··· 2579 2529 kvm_nvhe_sym(hfgitr2_masks) = hfgitr2_masks; 2580 2530 kvm_nvhe_sym(hdfgrtr2_masks)= hdfgrtr2_masks; 2581 2531 kvm_nvhe_sym(hdfgwtr2_masks)= hdfgwtr2_masks; 2532 + kvm_nvhe_sym(ich_hfgrtr_masks) = ich_hfgrtr_masks; 2533 + kvm_nvhe_sym(ich_hfgwtr_masks) = ich_hfgwtr_masks; 2534 + kvm_nvhe_sym(ich_hfgitr_masks) = ich_hfgitr_masks; 2582 2535 2583 2536 /* 2584 2537 * Flush entire BSS since part of its data containing init symbols is read ··· 2726 2673 memcpy(page_addr, CHOOSE_NVHE_SYM(__per_cpu_start), nvhe_percpu_size()); 2727 2674 
kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[cpu] = (unsigned long)page_addr; 2728 2675 } 2676 + 2677 + kvm_nvhe_sym(hyp_nr_cpus) = num_possible_cpus(); 2729 2678 2730 2679 /* 2731 2680 * Map the Hyp-code called directly from the host
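
Note: for GICv5 guests, the KVM_IRQ_LINE path in this diff rejects a PPI unless its number is below VGIC_V5_NR_PRIVATE_IRQS and its bit is set in the per-VM `userspace_ppis` bitmap. The gate can be sketched in userspace roughly as below. `NR_PRIVATE_IRQS` is a hypothetical stand-in (the real value of VGIC_V5_NR_PRIVATE_IRQS is not shown in this diff), and `bitmap_set_bit()`, `bitmap_test_bit()` and `check_userspace_ppi()` are illustrative names rather than kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for VGIC_V5_NR_PRIVATE_IRQS. */
#define NR_PRIVATE_IRQS	128
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS	((NR_PRIVATE_IRQS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Mark one PPI as drivable from userspace. */
static void bitmap_set_bit(unsigned long *map, unsigned int nr)
{
	assert(nr < NR_PRIVATE_IRQS);
	map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static bool bitmap_test_bit(const unsigned long *map, unsigned int nr)
{
	return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/* Return 0 if userspace may drive this PPI, -EINVAL otherwise. */
static int check_userspace_ppi(const unsigned long *mask, unsigned int irq_num)
{
	/* Range check first, so the bitmap is never indexed out of bounds. */
	if (irq_num >= NR_PRIVATE_IRQS)
		return -EINVAL;

	if (!bitmap_test_bit(mask, irq_num))
		return -EINVAL;

	return 0;
}
```

The order of the two checks matters: testing the bitmap before the range check would read past the end of the array for an out-of-range interrupt number.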

+118 -9

arch/arm64/kvm/config.c

··· 225 225 #define FEAT_MTPMU ID_AA64DFR0_EL1, MTPMU, IMP 226 226 #define FEAT_HCX ID_AA64MMFR1_EL1, HCX, IMP 227 227 #define FEAT_S2PIE ID_AA64MMFR3_EL1, S2PIE, IMP 228 + #define FEAT_GCIE ID_AA64PFR2_EL1, GCIE, IMP 228 229 229 230 static bool not_feat_aa64el3(struct kvm *kvm) 230 231 { ··· 1278 1277 static const DECLARE_FEAT_MAP(vtcr_el2_desc, VTCR_EL2, 1279 1278 vtcr_el2_feat_map, FEAT_AA64EL2); 1280 1279 1280 + static const struct reg_bits_to_feat_map ich_hfgrtr_feat_map[] = { 1281 + NEEDS_FEAT(ICH_HFGRTR_EL2_ICC_APR_EL1 | 1282 + ICH_HFGRTR_EL2_ICC_IDRn_EL1 | 1283 + ICH_HFGRTR_EL2_ICC_CR0_EL1 | 1284 + ICH_HFGRTR_EL2_ICC_HPPIR_EL1 | 1285 + ICH_HFGRTR_EL2_ICC_PCR_EL1 | 1286 + ICH_HFGRTR_EL2_ICC_ICSR_EL1 | 1287 + ICH_HFGRTR_EL2_ICC_IAFFIDR_EL1 | 1288 + ICH_HFGRTR_EL2_ICC_PPI_HMRn_EL1 | 1289 + ICH_HFGRTR_EL2_ICC_PPI_ENABLERn_EL1 | 1290 + ICH_HFGRTR_EL2_ICC_PPI_PENDRn_EL1 | 1291 + ICH_HFGRTR_EL2_ICC_PPI_PRIORITYRn_EL1 | 1292 + ICH_HFGRTR_EL2_ICC_PPI_ACTIVERn_EL1, 1293 + FEAT_GCIE), 1294 + }; 1295 + 1296 + static const DECLARE_FEAT_MAP_FGT(ich_hfgrtr_desc, ich_hfgrtr_masks, 1297 + ich_hfgrtr_feat_map, FEAT_GCIE); 1298 + 1299 + static const struct reg_bits_to_feat_map ich_hfgwtr_feat_map[] = { 1300 + NEEDS_FEAT(ICH_HFGWTR_EL2_ICC_APR_EL1 | 1301 + ICH_HFGWTR_EL2_ICC_CR0_EL1 | 1302 + ICH_HFGWTR_EL2_ICC_PCR_EL1 | 1303 + ICH_HFGWTR_EL2_ICC_ICSR_EL1 | 1304 + ICH_HFGWTR_EL2_ICC_PPI_ENABLERn_EL1 | 1305 + ICH_HFGWTR_EL2_ICC_PPI_PENDRn_EL1 | 1306 + ICH_HFGWTR_EL2_ICC_PPI_PRIORITYRn_EL1 | 1307 + ICH_HFGWTR_EL2_ICC_PPI_ACTIVERn_EL1, 1308 + FEAT_GCIE), 1309 + }; 1310 + 1311 + static const DECLARE_FEAT_MAP_FGT(ich_hfgwtr_desc, ich_hfgwtr_masks, 1312 + ich_hfgwtr_feat_map, FEAT_GCIE); 1313 + 1314 + static const struct reg_bits_to_feat_map ich_hfgitr_feat_map[] = { 1315 + NEEDS_FEAT(ICH_HFGITR_EL2_GICCDEN | 1316 + ICH_HFGITR_EL2_GICCDDIS | 1317 + ICH_HFGITR_EL2_GICCDPRI | 1318 + ICH_HFGITR_EL2_GICCDAFF | 1319 + ICH_HFGITR_EL2_GICCDPEND | 1320 + ICH_HFGITR_EL2_GICCDRCFG | 1321 + 
ICH_HFGITR_EL2_GICCDHM | 1322 + ICH_HFGITR_EL2_GICCDEOI | 1323 + ICH_HFGITR_EL2_GICCDDI | 1324 + ICH_HFGITR_EL2_GICRCDIA | 1325 + ICH_HFGITR_EL2_GICRCDNMIA, 1326 + FEAT_GCIE), 1327 + }; 1328 + 1329 + static const DECLARE_FEAT_MAP_FGT(ich_hfgitr_desc, ich_hfgitr_masks, 1330 + ich_hfgitr_feat_map, FEAT_GCIE); 1331 + 1281 1332 static void __init check_feat_map(const struct reg_bits_to_feat_map *map, 1282 1333 int map_size, u64 resx, const char *str) 1283 1334 { ··· 1381 1328 check_reg_desc(&sctlr_el2_desc); 1382 1329 check_reg_desc(&mdcr_el2_desc); 1383 1330 check_reg_desc(&vtcr_el2_desc); 1331 + check_reg_desc(&ich_hfgrtr_desc); 1332 + check_reg_desc(&ich_hfgwtr_desc); 1333 + check_reg_desc(&ich_hfgitr_desc); 1384 1334 } 1385 1335 1386 1336 static bool idreg_feat_match(struct kvm *kvm, const struct reg_bits_to_feat_map *map) ··· 1516 1460 val |= compute_fgu_bits(kvm, &hdfgrtr2_desc); 1517 1461 val |= compute_fgu_bits(kvm, &hdfgwtr2_desc); 1518 1462 break; 1463 + case ICH_HFGRTR_GROUP: 1464 + val |= compute_fgu_bits(kvm, &ich_hfgrtr_desc); 1465 + val |= compute_fgu_bits(kvm, &ich_hfgwtr_desc); 1466 + break; 1467 + case ICH_HFGITR_GROUP: 1468 + val |= compute_fgu_bits(kvm, &ich_hfgitr_desc); 1469 + break; 1519 1470 default: 1520 1471 BUG(); 1521 1472 } ··· 1594 1531 case VTCR_EL2: 1595 1532 resx = compute_reg_resx_bits(kvm, &vtcr_el2_desc, 0, 0); 1596 1533 break; 1534 + case ICH_HFGRTR_EL2: 1535 + resx = compute_reg_resx_bits(kvm, &ich_hfgrtr_desc, 0, 0); 1536 + break; 1537 + case ICH_HFGWTR_EL2: 1538 + resx = compute_reg_resx_bits(kvm, &ich_hfgwtr_desc, 0, 0); 1539 + break; 1540 + case ICH_HFGITR_EL2: 1541 + resx = compute_reg_resx_bits(kvm, &ich_hfgitr_desc, 0, 0); 1542 + break; 1597 1543 default: 1598 1544 WARN_ON_ONCE(1); 1599 1545 resx = (typeof(resx)){}; ··· 1637 1565 return &hdfgrtr2_masks; 1638 1566 case HDFGWTR2_EL2: 1639 1567 return &hdfgwtr2_masks; 1568 + case ICH_HFGRTR_EL2: 1569 + return &ich_hfgrtr_masks; 1570 + case ICH_HFGWTR_EL2: 1571 + return 
&ich_hfgwtr_masks; 1572 + case ICH_HFGITR_EL2: 1573 + return &ich_hfgitr_masks; 1640 1574 default: 1641 1575 BUILD_BUG_ON(1); 1642 1576 } ··· 1663 1585 clear |= ~nested & m->nmask; 1664 1586 } 1665 1587 1666 - val |= set; 1667 - val &= ~clear; 1588 + val |= set | m->res1; 1589 + val &= ~(clear | m->res0); 1668 1590 *vcpu_fgt(vcpu, reg) = val; 1669 1591 } 1670 1592 ··· 1684 1606 *vcpu_fgt(vcpu, HDFGWTR_EL2) |= HDFGWTR_EL2_MDSCR_EL1; 1685 1607 } 1686 1608 1609 + static void __compute_ich_hfgrtr(struct kvm_vcpu *vcpu) 1610 + { 1611 + __compute_fgt(vcpu, ICH_HFGRTR_EL2); 1612 + 1613 + /* 1614 + * ICC_IAFFIDR_EL1 *always* needs to be trapped when running a guest. 1615 + * 1616 + * We also trap accesses to ICC_IDR0_EL1 to allow us to completely hide 1617 + * FEAT_GCIE_LEGACY from the guest, and to (potentially) present fewer 1618 + * ID bits than the host supports. 1619 + */ 1620 + *vcpu_fgt(vcpu, ICH_HFGRTR_EL2) &= ~(ICH_HFGRTR_EL2_ICC_IAFFIDR_EL1 | 1621 + ICH_HFGRTR_EL2_ICC_IDRn_EL1); 1622 + } 1623 + 1624 + static void __compute_ich_hfgwtr(struct kvm_vcpu *vcpu) 1625 + { 1626 + __compute_fgt(vcpu, ICH_HFGWTR_EL2); 1627 + 1628 + /* 1629 + * We present a different subset of PPIs the guest from what 1630 + * exist in real hardware. We only trap writes, not reads. 
1631 + */ 1632 + *vcpu_fgt(vcpu, ICH_HFGWTR_EL2) &= ~(ICH_HFGWTR_EL2_ICC_PPI_ENABLERn_EL1); 1633 + } 1634 + 1687 1635 void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) 1688 1636 { 1689 1637 if (!cpus_have_final_cap(ARM64_HAS_FGT)) ··· 1722 1618 __compute_hdfgwtr(vcpu); 1723 1619 __compute_fgt(vcpu, HAFGRTR_EL2); 1724 1620 1725 - if (!cpus_have_final_cap(ARM64_HAS_FGT2)) 1726 - return; 1621 + if (cpus_have_final_cap(ARM64_HAS_FGT2)) { 1622 + __compute_fgt(vcpu, HFGRTR2_EL2); 1623 + __compute_fgt(vcpu, HFGWTR2_EL2); 1624 + __compute_fgt(vcpu, HFGITR2_EL2); 1625 + __compute_fgt(vcpu, HDFGRTR2_EL2); 1626 + __compute_fgt(vcpu, HDFGWTR2_EL2); 1627 + } 1727 1628 1728 - __compute_fgt(vcpu, HFGRTR2_EL2); 1729 - __compute_fgt(vcpu, HFGWTR2_EL2); 1730 - __compute_fgt(vcpu, HFGITR2_EL2); 1731 - __compute_fgt(vcpu, HDFGRTR2_EL2); 1732 - __compute_fgt(vcpu, HDFGWTR2_EL2); 1629 + if (cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) { 1630 + __compute_ich_hfgrtr(vcpu); 1631 + __compute_ich_hfgwtr(vcpu); 1632 + __compute_fgt(vcpu, ICH_HFGITR_EL2); 1633 + } 1733 1634 }
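
Note: the __compute_fgt() change in this diff replaces `val |= set; val &= ~clear;` with a version that also folds in the RES1 and RES0 masks. A minimal sketch of that bit manipulation, assuming a simplified mask structure (`struct fgt_masks_sketch` and `apply_fgt()` are hypothetical names, and only the `res0`/`res1` fields referenced in the hunk are modelled):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the mask structure; only res0/res1 are modelled. */
struct fgt_masks_sketch {
	uint64_t res0;
	uint64_t res1;
};

/*
 * Apply the requested set/clear bits, then force reserved bits to their
 * architected values: RES1 bits end up set, RES0 bits end up clear.
 */
static uint64_t apply_fgt(uint64_t val, uint64_t set, uint64_t clear,
			  const struct fgt_masks_sketch *m)
{
	assert((m->res0 & m->res1) == 0);	/* a bit cannot be both RES0 and RES1 */

	val |= set | m->res1;
	val &= ~(clear | m->res0);
	return val;
}
```

Because the clear step runs last, a bit that appears in both `set` and `clear` (or in `set` and `res0`) ends up cleared, matching the order of the two statements in the hunk.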

+68

arch/arm64/kvm/emulate-nested.c

··· 2053 2053 SR_FGT(SYS_AMEVCNTR0_EL0(2), HAFGRTR, AMEVCNTR02_EL0, 1), 2054 2054 SR_FGT(SYS_AMEVCNTR0_EL0(1), HAFGRTR, AMEVCNTR01_EL0, 1), 2055 2055 SR_FGT(SYS_AMEVCNTR0_EL0(0), HAFGRTR, AMEVCNTR00_EL0, 1), 2056 + 2057 + /* 2058 + * ICH_HFGRTR_EL2 & ICH_HFGWTR_EL2 2059 + */ 2060 + SR_FGT(SYS_ICC_APR_EL1, ICH_HFGRTR, ICC_APR_EL1, 0), 2061 + SR_FGT(SYS_ICC_IDR0_EL1, ICH_HFGRTR, ICC_IDRn_EL1, 0), 2062 + SR_FGT(SYS_ICC_CR0_EL1, ICH_HFGRTR, ICC_CR0_EL1, 0), 2063 + SR_FGT(SYS_ICC_HPPIR_EL1, ICH_HFGRTR, ICC_HPPIR_EL1, 0), 2064 + SR_FGT(SYS_ICC_PCR_EL1, ICH_HFGRTR, ICC_PCR_EL1, 0), 2065 + SR_FGT(SYS_ICC_ICSR_EL1, ICH_HFGRTR, ICC_ICSR_EL1, 0), 2066 + SR_FGT(SYS_ICC_IAFFIDR_EL1, ICH_HFGRTR, ICC_IAFFIDR_EL1, 0), 2067 + SR_FGT(SYS_ICC_PPI_HMR0_EL1, ICH_HFGRTR, ICC_PPI_HMRn_EL1, 0), 2068 + SR_FGT(SYS_ICC_PPI_HMR1_EL1, ICH_HFGRTR, ICC_PPI_HMRn_EL1, 0), 2069 + SR_FGT(SYS_ICC_PPI_ENABLER0_EL1, ICH_HFGRTR, ICC_PPI_ENABLERn_EL1, 0), 2070 + SR_FGT(SYS_ICC_PPI_ENABLER1_EL1, ICH_HFGRTR, ICC_PPI_ENABLERn_EL1, 0), 2071 + SR_FGT(SYS_ICC_PPI_CPENDR0_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), 2072 + SR_FGT(SYS_ICC_PPI_CPENDR1_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), 2073 + SR_FGT(SYS_ICC_PPI_SPENDR0_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), 2074 + SR_FGT(SYS_ICC_PPI_SPENDR1_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), 2075 + SR_FGT(SYS_ICC_PPI_PRIORITYR0_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2076 + SR_FGT(SYS_ICC_PPI_PRIORITYR1_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2077 + SR_FGT(SYS_ICC_PPI_PRIORITYR2_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2078 + SR_FGT(SYS_ICC_PPI_PRIORITYR3_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2079 + SR_FGT(SYS_ICC_PPI_PRIORITYR4_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2080 + SR_FGT(SYS_ICC_PPI_PRIORITYR5_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2081 + SR_FGT(SYS_ICC_PPI_PRIORITYR6_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2082 + SR_FGT(SYS_ICC_PPI_PRIORITYR7_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2083 + 
SR_FGT(SYS_ICC_PPI_PRIORITYR8_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2084 + SR_FGT(SYS_ICC_PPI_PRIORITYR9_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2085 + SR_FGT(SYS_ICC_PPI_PRIORITYR10_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2086 + SR_FGT(SYS_ICC_PPI_PRIORITYR11_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2087 + SR_FGT(SYS_ICC_PPI_PRIORITYR12_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2088 + SR_FGT(SYS_ICC_PPI_PRIORITYR13_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2089 + SR_FGT(SYS_ICC_PPI_PRIORITYR14_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2090 + SR_FGT(SYS_ICC_PPI_PRIORITYR15_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), 2091 + SR_FGT(SYS_ICC_PPI_CACTIVER0_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), 2092 + SR_FGT(SYS_ICC_PPI_CACTIVER1_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), 2093 + SR_FGT(SYS_ICC_PPI_SACTIVER0_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), 2094 + SR_FGT(SYS_ICC_PPI_SACTIVER1_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), 2095 + 2096 + /* 2097 + * ICH_HFGITR_EL2 2098 + */ 2099 + SR_FGT(GICV5_OP_GIC_CDEN, ICH_HFGITR, GICCDEN, 0), 2100 + SR_FGT(GICV5_OP_GIC_CDDIS, ICH_HFGITR, GICCDDIS, 0), 2101 + SR_FGT(GICV5_OP_GIC_CDPRI, ICH_HFGITR, GICCDPRI, 0), 2102 + SR_FGT(GICV5_OP_GIC_CDAFF, ICH_HFGITR, GICCDAFF, 0), 2103 + SR_FGT(GICV5_OP_GIC_CDPEND, ICH_HFGITR, GICCDPEND, 0), 2104 + SR_FGT(GICV5_OP_GIC_CDRCFG, ICH_HFGITR, GICCDRCFG, 0), 2105 + SR_FGT(GICV5_OP_GIC_CDHM, ICH_HFGITR, GICCDHM, 0), 2106 + SR_FGT(GICV5_OP_GIC_CDEOI, ICH_HFGITR, GICCDEOI, 0), 2107 + SR_FGT(GICV5_OP_GIC_CDDI, ICH_HFGITR, GICCDDI, 0), 2108 + SR_FGT(GICV5_OP_GICR_CDIA, ICH_HFGITR, GICRCDIA, 0), 2109 + SR_FGT(GICV5_OP_GICR_CDNMIA, ICH_HFGITR, GICRCDNMIA, 0), 2056 2110 }; 2057 2111 2058 2112 /* ··· 2181 2127 FGT_MASKS(hfgitr2_masks, HFGITR2_EL2); 2182 2128 FGT_MASKS(hdfgrtr2_masks, HDFGRTR2_EL2); 2183 2129 FGT_MASKS(hdfgwtr2_masks, HDFGWTR2_EL2); 2130 + FGT_MASKS(ich_hfgrtr_masks, ICH_HFGRTR_EL2); 2131 + FGT_MASKS(ich_hfgwtr_masks, ICH_HFGWTR_EL2); 2132 + 
FGT_MASKS(ich_hfgitr_masks, ICH_HFGITR_EL2); 2184 2133 2185 2134 static __init bool aggregate_fgt(union trap_config tc) 2186 2135 { ··· 2217 2160 break; 2218 2161 case HFGITR2_GROUP: 2219 2162 rmasks = &hfgitr2_masks; 2163 + wmasks = NULL; 2164 + break; 2165 + case ICH_HFGRTR_GROUP: 2166 + rmasks = &ich_hfgrtr_masks; 2167 + wmasks = &ich_hfgwtr_masks; 2168 + break; 2169 + case ICH_HFGITR_GROUP: 2170 + rmasks = &ich_hfgitr_masks; 2220 2171 wmasks = NULL; 2221 2172 break; 2222 2173 } ··· 2297 2232 &hfgitr2_masks, 2298 2233 &hdfgrtr2_masks, 2299 2234 &hdfgwtr2_masks, 2235 + &ich_hfgrtr_masks, 2236 + &ich_hfgwtr_masks, 2237 + &ich_hfgitr_masks, 2300 2238 }; 2301 2239 int err = 0; 2302 2240

+1 -1

arch/arm64/kvm/handle_exit.c

···
 
 	/* All hyp bugs, including warnings, are treated as fatal. */
 	if (!is_protected_kvm_enabled() ||
-	    IS_ENABLED(CONFIG_NVHE_EL2_DEBUG)) {
+	    IS_ENABLED(CONFIG_PKVM_DISABLE_STAGE2_ON_PANIC)) {
 		struct bug_entry *bug = find_bug(elr_in_kimg);
 
 		if (bug)

+27

arch/arm64/kvm/hyp/include/hyp/switch.h

···
 	__activate_fgt(hctxt, vcpu, HDFGWTR2_EL2);
 }
 
+static inline void __activate_traps_ich_hfgxtr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *hctxt = host_data_ptr(host_ctxt);
+
+	if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF))
+		return;
+
+	__activate_fgt(hctxt, vcpu, ICH_HFGRTR_EL2);
+	__activate_fgt(hctxt, vcpu, ICH_HFGWTR_EL2);
+	__activate_fgt(hctxt, vcpu, ICH_HFGITR_EL2);
+}
+
 #define __deactivate_fgt(htcxt, vcpu, reg) \
 	do { \
 		write_sysreg_s(ctxt_sys_reg(hctxt, reg), \
···
 	__deactivate_fgt(hctxt, vcpu, HFGITR2_EL2);
 	__deactivate_fgt(hctxt, vcpu, HDFGRTR2_EL2);
 	__deactivate_fgt(hctxt, vcpu, HDFGWTR2_EL2);
+}
+
+static inline void __deactivate_traps_ich_hfgxtr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *hctxt = host_data_ptr(host_ctxt);
+
+	if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF))
+		return;
+
+	__deactivate_fgt(hctxt, vcpu, ICH_HFGRTR_EL2);
+	__deactivate_fgt(hctxt, vcpu, ICH_HFGWTR_EL2);
+	__deactivate_fgt(hctxt, vcpu, ICH_HFGITR_EL2);
+
 }
 
 static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu)
···
 	}
 
 	__activate_traps_hfgxtr(vcpu);
+	__activate_traps_ich_hfgxtr(vcpu);
 	__activate_traps_mpam(vcpu);
 }
 
···
 	write_sysreg_s(ctxt_sys_reg(hctxt, HCRX_EL2), SYS_HCRX_EL2);
 
 	__deactivate_traps_hfgxtr(vcpu);
+	__deactivate_traps_ich_hfgxtr(vcpu);
 	__deactivate_traps_mpam();
 }
 

+23

arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h

···
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__
+#define __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__
+
+#include <asm/kvm_hypevents.h>
+
+#include <linux/arm-smccc.h>
+
+#define hyp_smccc_1_1_smc(...) \
+	do { \
+		trace_hyp_exit(NULL, HYP_REASON_SMC); \
+		arm_smccc_1_1_smc(__VA_ARGS__); \
+		trace_hyp_enter(NULL, HYP_REASON_SMC); \
+	} while (0)
+
+#define hyp_smccc_1_2_smc(...) \
+	do { \
+		trace_hyp_exit(NULL, HYP_REASON_SMC); \
+		arm_smccc_1_2_smc(__VA_ARGS__); \
+		trace_hyp_enter(NULL, HYP_REASON_SMC); \
+	} while (0)
+
+#endif /* __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__ */
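
Note: the two macros in this new header bracket an SMC call with a "hyp exit" event before and a "hyp enter" event after, so the time spent outside the hypervisor shows up in the trace. The wrapping pattern (variadic macro, `do { } while (0)` body) can be sketched in userspace as follows. `trace_event()`, `fake_smc()` and `traced_smc()` are hypothetical stand-ins for the tracing hooks and the real SMC helper; the in-memory `trace_log` string exists only to make the ordering observable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "trace buffer": one character per event. */
static char trace_log[64];

static void trace_event(char c)
{
	size_t n = strlen(trace_log);

	assert(n + 1 < sizeof(trace_log));	/* keep room for the terminator */
	trace_log[n] = c;
	trace_log[n + 1] = '\0';
}

/* Stand-in for the wrapped call; records itself so ordering is visible. */
static int fake_smc(int fn, int *res)
{
	trace_event('S');
	*res = fn + 1;
	return 0;
}

/*
 * Same shape as the macros above: emit an "exit" event, forward all
 * arguments unchanged, then emit an "enter" event. The do/while (0)
 * wrapper makes the macro behave as a single statement.
 */
#define traced_smc(...)				\
	do {					\
		trace_event('X');		\
		fake_smc(__VA_ARGS__);		\
		trace_event('E');		\
	} while (0)
```

After one call such as `traced_smc(41, &r);`, the log holds the three events in exit, call, enter order. Using a macro rather than a function keeps the argument list opaque, which is presumably why the patch wraps the variadic arm_smccc_1_1_smc() and arm_smccc_1_2_smc() helpers this way.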

+16

arch/arm64/kvm/hyp/include/nvhe/clock.h

···
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARM64_KVM_HYP_NVHE_CLOCK_H
+#define __ARM64_KVM_HYP_NVHE_CLOCK_H
+#include <linux/types.h>
+
+#include <asm/kvm_hyp.h>
+
+#ifdef CONFIG_NVHE_EL2_TRACING
+void trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc);
+u64 trace_clock(void);
+#else
+static inline void
+trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) { }
+static inline u64 trace_clock(void) { return 0; }
+#endif
+#endif

+14

arch/arm64/kvm/hyp/include/nvhe/define_events.h

···
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#undef HYP_EVENT
+#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \
+	struct hyp_event_id hyp_event_id_##__name \
+	__section(".hyp.event_ids."#__name) = { \
+		.enabled = ATOMIC_INIT(0), \
+	}
+
+#define HYP_EVENT_MULTI_READ
+#include <asm/kvm_hypevents.h>
+#undef HYP_EVENT_MULTI_READ
+
+#undef HYP_EVENT
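This header redefines `HYP_EVENT` and then re-reads the event list, so the same list yields one `hyp_event_id` object per event. The sketch below shows the general "expand one list several times" technique in a single file. It is an illustration under assumptions: the kernel re-includes a header and places objects in a linker section, whereas this sketch uses a list macro and a plain array, and all `SKETCH_*` and `sketch_*` names are hypothetical.

```c
/* Hypothetical event list; each entry is expanded by whatever macro is passed in. */
#define SKETCH_EVENT_LIST(EVENT) \
	EVENT(hyp_enter) \
	EVENT(hyp_exit) \
	EVENT(selftest)

/* First expansion: one numeric ID per event. */
#define SKETCH_AS_ENUM(name) SKETCH_ID_##name,
enum sketch_event_id {
	SKETCH_EVENT_LIST(SKETCH_AS_ENUM)
	SKETCH_NR_EVENTS
};

/* Second expansion: one state object per event, disabled by default. */
struct sketch_event {
	const char *name;
	int enabled;
};

#define SKETCH_AS_STATE(name) { #name, 0 },
static struct sketch_event sketch_events[] = {
	SKETCH_EVENT_LIST(SKETCH_AS_STATE)
};

/* Reject out-of-range IDs before touching the table. */
static int sketch_enable_event(unsigned short id, int enable)
{
	if (id >= SKETCH_NR_EVENTS)
		return -1;

	sketch_events[id].enabled = enable;
	return 0;
}
```

Adding an event then means adding a single entry to the list; the ID and the state object follow from it.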

+9 -3

arch/arm64/kvm/hyp/include/nvhe/mem_protect.h

··· 27 27 enum pkvm_component_id { 28 28 PKVM_ID_HOST, 29 29 PKVM_ID_HYP, 30 - PKVM_ID_FFA, 30 + PKVM_ID_GUEST, 31 31 }; 32 - 33 - extern unsigned long hyp_nr_cpus; 34 32 35 33 int __pkvm_prot_finalize(void); 36 34 int __pkvm_host_share_hyp(u64 pfn); 35 + int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn); 36 + int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn); 37 37 int __pkvm_host_unshare_hyp(u64 pfn); 38 38 int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages); 39 39 int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); 40 40 int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages); 41 41 int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages); 42 + int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu); 43 + int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu); 44 + int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys); 45 + int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm); 42 46 int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, 43 47 enum kvm_pgtable_prot prot); 44 48 int __pkvm_host_unshare_guest(u64 gfn, u64 nr_pages, struct pkvm_hyp_vm *hyp_vm); ··· 74 70 75 71 #ifdef CONFIG_NVHE_EL2_DEBUG 76 72 void pkvm_ownership_selftest(void *base); 73 + struct pkvm_hyp_vcpu *init_selftest_vm(void *virt); 74 + void teardown_selftest_vm(void); 77 75 #else 78 76 static inline void pkvm_ownership_selftest(void *base) { } 79 77 #endif

+9 -3

arch/arm64/kvm/hyp/include/nvhe/memory.h

··· 30 30 * struct hyp_page. 31 31 */ 32 32 PKVM_NOPAGE = BIT(0) | BIT(1), 33 + 34 + /* 35 + * 'Meta-states' which aren't encoded directly in the PTE's SW bits (or 36 + * the hyp_vmemmap entry for the host) 37 + */ 38 + PKVM_POISON = BIT(2), 33 39 }; 34 - #define PKVM_PAGE_STATE_MASK (BIT(0) | BIT(1)) 40 + #define PKVM_PAGE_STATE_VMEMMAP_MASK (BIT(0) | BIT(1)) 35 41 36 42 #define PKVM_PAGE_STATE_PROT_MASK (KVM_PGTABLE_PROT_SW0 | KVM_PGTABLE_PROT_SW1) 37 43 static inline enum kvm_pgtable_prot pkvm_mkstate(enum kvm_pgtable_prot prot, ··· 114 108 115 109 static inline enum pkvm_page_state get_hyp_state(struct hyp_page *p) 116 110 { 117 - return p->__hyp_state_comp ^ PKVM_PAGE_STATE_MASK; 111 + return p->__hyp_state_comp ^ PKVM_PAGE_STATE_VMEMMAP_MASK; 118 112 } 119 113 120 114 static inline void set_hyp_state(struct hyp_page *p, enum pkvm_page_state state) 121 115 { 122 - p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_MASK; 116 + p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_VMEMMAP_MASK; 123 117 } 124 118 125 119 /*
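`get_hyp_state()` and `set_hyp_state()` store the state XORed with the two-bit mask, so a zero-filled `struct hyp_page` decodes to `PKVM_NOPAGE` without any explicit initialisation. A minimal sketch of that encoding follows; the struct layout and every `sketch_*`/`SKETCH_*` name are hypothetical, and only the XOR-with-mask step is taken from the diff above.

```c
#include <stdint.h>

#define SKETCH_BIT(n)		(1u << (n))
#define SKETCH_NOPAGE		(SKETCH_BIT(0) | SKETCH_BIT(1))
#define SKETCH_STATE_MASK	(SKETCH_BIT(0) | SKETCH_BIT(1))

/* Hypothetical stand-in for the state field of struct hyp_page. */
struct sketch_page {
	uint8_t state_comp;	/* state XOR mask, never the raw state */
};

static unsigned int sketch_get_state(const struct sketch_page *p)
{
	return p->state_comp ^ SKETCH_STATE_MASK;
}

static void sketch_set_state(struct sketch_page *p, unsigned int state)
{
	p->state_comp = (uint8_t)(state ^ SKETCH_STATE_MASK);
}
```

A flag outside the mask, such as bit 2 used by `PKVM_POISON`, passes through the XOR unchanged, which is consistent with the mask being renamed to say it covers only the vmemmap-encoded bits.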

+6 -1

arch/arm64/kvm/hyp/include/nvhe/pkvm.h

··· 73 73 unsigned long pgd_hva); 74 74 int __pkvm_init_vcpu(pkvm_handle_t handle, struct kvm_vcpu *host_vcpu, 75 75 unsigned long vcpu_hva); 76 - int __pkvm_teardown_vm(pkvm_handle_t handle); 77 76 77 + int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn); 78 + int __pkvm_start_teardown_vm(pkvm_handle_t handle); 79 + int __pkvm_finalize_teardown_vm(pkvm_handle_t handle); 80 + 81 + struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle); 78 82 struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle, 79 83 unsigned int vcpu_idx); 80 84 void pkvm_put_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu); ··· 88 84 struct pkvm_hyp_vm *get_np_pkvm_hyp_vm(pkvm_handle_t handle); 89 85 void put_pkvm_hyp_vm(struct pkvm_hyp_vm *hyp_vm); 90 86 87 + bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code); 91 88 bool kvm_handle_pvm_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code); 92 89 bool kvm_handle_pvm_restricted(struct kvm_vcpu *vcpu, u64 *exit_code); 93 90 void kvm_init_pvm_id_regs(struct kvm_vcpu *vcpu);

+70

arch/arm64/kvm/hyp/include/nvhe/trace.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + #ifndef __ARM64_KVM_HYP_NVHE_TRACE_H 3 + #define __ARM64_KVM_HYP_NVHE_TRACE_H 4 + 5 + #include <linux/trace_remote_event.h> 6 + 7 + #include <asm/kvm_hyptrace.h> 8 + 9 + static inline pid_t __tracing_get_vcpu_pid(struct kvm_cpu_context *host_ctxt) 10 + { 11 + struct kvm_vcpu *vcpu; 12 + 13 + if (!host_ctxt) 14 + host_ctxt = host_data_ptr(host_ctxt); 15 + 16 + vcpu = host_ctxt->__hyp_running_vcpu; 17 + 18 + return vcpu ? vcpu->arch.pid : 0; 19 + } 20 + 21 + #define HE_PROTO(__args...) __args 22 + #define HE_ASSIGN(__args...) __args 23 + #define HE_STRUCT RE_STRUCT 24 + #define he_field re_field 25 + 26 + #ifdef CONFIG_NVHE_EL2_TRACING 27 + 28 + #define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ 29 + REMOTE_EVENT_FORMAT(__name, __struct); \ 30 + extern struct hyp_event_id hyp_event_id_##__name; \ 31 + static __always_inline void trace_##__name(__proto) \ 32 + { \ 33 + struct remote_event_format_##__name *__entry; \ 34 + size_t length = sizeof(*__entry); \ 35 + \ 36 + if (!atomic_read(&hyp_event_id_##__name.enabled)) \ 37 + return; \ 38 + __entry = tracing_reserve_entry(length); \ 39 + if (!__entry) \ 40 + return; \ 41 + __entry->hdr.id = hyp_event_id_##__name.id; \ 42 + __assign \ 43 + tracing_commit_entry(); \ 44 + } 45 + 46 + void *tracing_reserve_entry(unsigned long length); 47 + void tracing_commit_entry(void); 48 + 49 + int __tracing_load(unsigned long desc_va, size_t desc_size); 50 + void __tracing_unload(void); 51 + int __tracing_enable(bool enable); 52 + int __tracing_swap_reader(unsigned int cpu); 53 + void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc); 54 + int __tracing_reset(unsigned int cpu); 55 + int __tracing_enable_event(unsigned short id, bool enable); 56 + #else 57 + static inline void *tracing_reserve_entry(unsigned long length) { return NULL; } 58 + static inline void tracing_commit_entry(void) { } 59 + #define HYP_EVENT(__name, __proto, __struct, 
__assign, __printk) \ 60 + static inline void trace_##__name(__proto) {} 61 + 62 + static inline int __tracing_load(unsigned long desc_va, size_t desc_size) { return -ENODEV; } 63 + static inline void __tracing_unload(void) { } 64 + static inline int __tracing_enable(bool enable) { return -ENODEV; } 65 + static inline int __tracing_swap_reader(unsigned int cpu) { return -ENODEV; } 66 + static inline void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) { } 67 + static inline int __tracing_reset(unsigned int cpu) { return -ENODEV; } 68 + static inline int __tracing_enable_event(unsigned short id, bool enable) { return -ENODEV; } 69 + #endif 70 + #endif

+2

arch/arm64/kvm/hyp/include/nvhe/trap_handler.h

··· 16 16 __always_unused int ___check_reg_ ## reg; \ 17 17 type name = (type)cpu_reg(ctxt, (reg)) 18 18 19 + void inject_host_exception(u64 esr); 20 + 19 21 #endif /* __ARM64_KVM_NVHE_TRAP_HANDLER_H__ */

+6 -2

arch/arm64/kvm/hyp/nvhe/Makefile

··· 17 17 hostprogs := gen-hyprel 18 18 HOST_EXTRACFLAGS += -I$(objtree)/include 19 19 20 - lib-objs := clear_page.o copy_page.o memcpy.o memset.o 20 + lib-objs := clear_page.o copy_page.o memcpy.o memset.o tishift.o 21 21 lib-objs := $(addprefix ../../../lib/, $(lib-objs)) 22 22 23 23 CFLAGS_switch.nvhe.o += -Wno-override-init ··· 26 26 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o page_alloc.o \ 27 27 cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o 28 28 hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \ 29 - ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o 29 + ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o ../vgic-v5-sr.o 30 30 hyp-obj-y += ../../../kernel/smccc-call.o 31 31 hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o 32 + hyp-obj-$(CONFIG_NVHE_EL2_TRACING) += clock.o trace.o events.o 32 33 hyp-obj-y += $(lib-objs) 34 + 35 + # Path to simple_ring_buffer.c 36 + CFLAGS_trace.nvhe.o += -I$(srctree)/kernel/trace/ 33 37 34 38 ## 35 39 ## Build rules for compiling nVHE hyp code

+65

arch/arm64/kvm/hyp/nvhe/clock.c

···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2025 Google LLC
+ * Author: Vincent Donnefort <vdonnefort@google.com>
+ */
+
+#include <nvhe/clock.h>
+
+#include <asm/arch_timer.h>
+#include <asm/div64.h>
+
+static struct clock_data {
+	struct {
+		u32 mult;
+		u32 shift;
+		u64 epoch_ns;
+		u64 epoch_cyc;
+		u64 cyc_overflow64;
+	} data[2];
+	u64 cur;
+} trace_clock_data;
+
+static u64 __clock_mult_uint128(u64 cyc, u32 mult, u32 shift)
+{
+	__uint128_t ns = (__uint128_t)cyc * mult;
+
+	ns >>= shift;
+
+	return (u64)ns;
+}
+
+/* Does not guarantee no reader on the modified bank. */
+void trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc)
+{
+	struct clock_data *clock = &trace_clock_data;
+	u64 bank = clock->cur ^ 1;
+
+	clock->data[bank].mult = mult;
+	clock->data[bank].shift = shift;
+	clock->data[bank].epoch_ns = epoch_ns;
+	clock->data[bank].epoch_cyc = epoch_cyc;
+	clock->data[bank].cyc_overflow64 = ULONG_MAX / mult;
+
+	smp_store_release(&clock->cur, bank);
+}
+
+/* Use untrusted host data */
+u64 trace_clock(void)
+{
+	struct clock_data *clock = &trace_clock_data;
+	u64 bank = smp_load_acquire(&clock->cur);
+	u64 cyc, ns;
+
+	cyc = __arch_counter_get_cntvct() - clock->data[bank].epoch_cyc;
+
+	if (likely(cyc < clock->data[bank].cyc_overflow64)) {
+		ns = cyc * clock->data[bank].mult;
+		ns >>= clock->data[bank].shift;
+	} else {
+		ns = __clock_mult_uint128(cyc, clock->data[bank].mult,
+					  clock->data[bank].shift);
+	}
+
+	return (u64)ns + clock->data[bank].epoch_ns;
+}
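`trace_clock()` converts counter cycles to nanoseconds as `((cyc * mult) >> shift) + epoch_ns`, and only falls back to a 128-bit multiply once `cyc` reaches the precomputed threshold `cyc_overflow64`. The sketch below reproduces that arithmetic in user space. It is an illustration under assumptions: it leaves out the two-bank publication with acquire/release ordering, takes the counter value as a parameter instead of reading the hardware counter, relies on the GCC/Clang `unsigned __int128` extension, and requires a non-zero `mult`. All `sketch_*` names are hypothetical.

```c
#include <stdint.h>

struct sketch_clock {
	uint32_t mult;
	uint32_t shift;
	uint64_t epoch_ns;
	uint64_t epoch_cyc;
	uint64_t cyc_overflow64;	/* first cyc value that may overflow 64 bits */
};

/* mult must be non-zero: it is used as a divisor here. */
static void sketch_clock_init(struct sketch_clock *c, uint32_t mult,
			      uint32_t shift, uint64_t epoch_ns,
			      uint64_t epoch_cyc)
{
	c->mult = mult;
	c->shift = shift;
	c->epoch_ns = epoch_ns;
	c->epoch_cyc = epoch_cyc;
	c->cyc_overflow64 = UINT64_MAX / mult;
}

static uint64_t sketch_cyc_to_ns(const struct sketch_clock *c, uint64_t now_cyc)
{
	uint64_t cyc = now_cyc - c->epoch_cyc;
	uint64_t ns;

	if (cyc < c->cyc_overflow64) {
		/* Fast path: cyc * mult is known to fit in 64 bits. */
		ns = (cyc * c->mult) >> c->shift;
	} else {
		/* Slow path: widen to 128 bits before shifting back down. */
		ns = (uint64_t)(((unsigned __int128)cyc * c->mult) >> c->shift);
	}

	return ns + c->epoch_ns;
}
```

With `mult = 10` and `shift = 1`, each cycle counts as 5 ns, so 200 cycles past an epoch of 1000 ns give 2000 ns. With `mult = 1 << 31` and `shift = 31`, a delta of `1 << 40` cycles goes through the 128-bit path, since the 64-bit product would wrap.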

+92 -24

arch/arm64/kvm/hyp/nvhe/debug-sr.c

··· 14 14 #include <asm/kvm_hyp.h> 15 15 #include <asm/kvm_mmu.h> 16 16 17 - static void __debug_save_spe(u64 *pmscr_el1) 17 + static void __debug_save_spe(void) 18 18 { 19 - u64 reg; 19 + u64 *pmscr_el1, *pmblimitr_el1; 20 20 21 - /* Clear pmscr in case of early return */ 22 - *pmscr_el1 = 0; 21 + pmscr_el1 = host_data_ptr(host_debug_state.pmscr_el1); 22 + pmblimitr_el1 = host_data_ptr(host_debug_state.pmblimitr_el1); 23 23 24 24 /* 25 25 * At this point, we know that this CPU implements 26 26 * SPE and is available to the host. 27 27 * Check if the host is actually using it ? 28 28 */ 29 - reg = read_sysreg_s(SYS_PMBLIMITR_EL1); 30 - if (!(reg & BIT(PMBLIMITR_EL1_E_SHIFT))) 29 + *pmblimitr_el1 = read_sysreg_s(SYS_PMBLIMITR_EL1); 30 + if (!(*pmblimitr_el1 & BIT(PMBLIMITR_EL1_E_SHIFT))) 31 31 return; 32 32 33 33 /* Yes; save the control register and disable data generation */ ··· 37 37 38 38 /* Now drain all buffered data to memory */ 39 39 psb_csync(); 40 + dsb(nsh); 41 + 42 + /* And disable the profiling buffer */ 43 + write_sysreg_s(0, SYS_PMBLIMITR_EL1); 44 + isb(); 40 45 } 41 46 42 - static void __debug_restore_spe(u64 pmscr_el1) 47 + static void __debug_restore_spe(void) 43 48 { 44 - if (!pmscr_el1) 49 + u64 pmblimitr_el1 = *host_data_ptr(host_debug_state.pmblimitr_el1); 50 + 51 + if (!(pmblimitr_el1 & BIT(PMBLIMITR_EL1_E_SHIFT))) 45 52 return; 46 53 47 54 /* The host page table is installed, but not yet synchronised */ 48 55 isb(); 49 56 57 + /* Re-enable the profiling buffer. 
*/ 58 + write_sysreg_s(pmblimitr_el1, SYS_PMBLIMITR_EL1); 59 + isb(); 60 + 50 61 /* Re-enable data generation */ 51 - write_sysreg_el1(pmscr_el1, SYS_PMSCR); 62 + write_sysreg_el1(*host_data_ptr(host_debug_state.pmscr_el1), SYS_PMSCR); 52 63 } 53 64 54 65 static void __trace_do_switch(u64 *saved_trfcr, u64 new_trfcr) ··· 68 57 write_sysreg_el1(new_trfcr, SYS_TRFCR); 69 58 } 70 59 71 - static bool __trace_needs_drain(void) 60 + static void __trace_drain_and_disable(void) 72 61 { 73 - if (is_protected_kvm_enabled() && host_data_test_flag(HAS_TRBE)) 74 - return read_sysreg_s(SYS_TRBLIMITR_EL1) & TRBLIMITR_EL1_E; 62 + u64 *trblimitr_el1 = host_data_ptr(host_debug_state.trblimitr_el1); 63 + bool needs_drain = is_protected_kvm_enabled() ? 64 + host_data_test_flag(HAS_TRBE) : 65 + host_data_test_flag(TRBE_ENABLED); 75 66 76 - return host_data_test_flag(TRBE_ENABLED); 67 + if (!needs_drain) { 68 + *trblimitr_el1 = 0; 69 + return; 70 + } 71 + 72 + *trblimitr_el1 = read_sysreg_s(SYS_TRBLIMITR_EL1); 73 + if (*trblimitr_el1 & TRBLIMITR_EL1_E) { 74 + /* 75 + * The host has enabled the Trace Buffer Unit so we have 76 + * to beat the CPU with a stick until it stops accessing 77 + * memory. 78 + */ 79 + 80 + /* First, ensure that our prior write to TRFCR has stuck. */ 81 + isb(); 82 + 83 + /* Now synchronise with the trace and drain the buffer. */ 84 + tsb_csync(); 85 + dsb(nsh); 86 + 87 + /* 88 + * With no more trace being generated, we can disable the 89 + * Trace Buffer Unit. 90 + */ 91 + write_sysreg_s(0, SYS_TRBLIMITR_EL1); 92 + if (cpus_have_final_cap(ARM64_WORKAROUND_2064142)) { 93 + /* 94 + * Some CPUs are so good, we have to drain 'em 95 + * twice. 96 + */ 97 + tsb_csync(); 98 + dsb(nsh); 99 + } 100 + 101 + /* 102 + * Ensure that the Trace Buffer Unit is disabled before 103 + * we start mucking with the stage-2 and trap 104 + * configuration. 
105 + */ 106 + isb(); 107 + } 77 108 } 78 109 79 110 static bool __trace_needs_switch(void) ··· 132 79 133 80 __trace_do_switch(host_data_ptr(host_debug_state.trfcr_el1), 134 81 *host_data_ptr(trfcr_while_in_guest)); 135 - 136 - if (__trace_needs_drain()) { 137 - isb(); 138 - tsb_csync(); 139 - } 82 + __trace_drain_and_disable(); 140 83 } 141 84 142 85 static void __trace_switch_to_host(void) 143 86 { 87 + u64 trblimitr_el1 = *host_data_ptr(host_debug_state.trblimitr_el1); 88 + 89 + if (trblimitr_el1 & TRBLIMITR_EL1_E) { 90 + /* Re-enable the Trace Buffer Unit for the host. */ 91 + write_sysreg_s(trblimitr_el1, SYS_TRBLIMITR_EL1); 92 + isb(); 93 + if (cpus_have_final_cap(ARM64_WORKAROUND_2038923)) { 94 + /* 95 + * Make sure the unit is re-enabled before we 96 + * poke TRFCR. 97 + */ 98 + isb(); 99 + } 100 + } 101 + 144 102 __trace_do_switch(host_data_ptr(trfcr_while_in_guest), 145 103 *host_data_ptr(host_debug_state.trfcr_el1)); 146 104 } 147 105 148 - static void __debug_save_brbe(u64 *brbcr_el1) 106 + static void __debug_save_brbe(void) 149 107 { 108 + u64 *brbcr_el1 = host_data_ptr(host_debug_state.brbcr_el1); 109 + 150 110 *brbcr_el1 = 0; 151 111 152 112 /* Check if the BRBE is enabled */ ··· 175 109 write_sysreg_el1(0, SYS_BRBCR); 176 110 } 177 111 178 - static void __debug_restore_brbe(u64 brbcr_el1) 112 + static void __debug_restore_brbe(void) 179 113 { 114 + u64 brbcr_el1 = *host_data_ptr(host_debug_state.brbcr_el1); 115 + 180 116 if (!brbcr_el1) 181 117 return; 182 118 ··· 190 122 { 191 123 /* Disable and flush SPE data generation */ 192 124 if (host_data_test_flag(HAS_SPE)) 193 - __debug_save_spe(host_data_ptr(host_debug_state.pmscr_el1)); 125 + __debug_save_spe(); 194 126 195 127 /* Disable BRBE branch records */ 196 128 if (host_data_test_flag(HAS_BRBE)) 197 - __debug_save_brbe(host_data_ptr(host_debug_state.brbcr_el1)); 129 + __debug_save_brbe(); 198 130 199 131 if (__trace_needs_switch()) 200 132 __trace_switch_to_guest(); ··· 208 140 void 
__debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu) 209 141 { 210 142 if (host_data_test_flag(HAS_SPE)) 211 - __debug_restore_spe(*host_data_ptr(host_debug_state.pmscr_el1)); 143 + __debug_restore_spe(); 212 144 if (host_data_test_flag(HAS_BRBE)) 213 - __debug_restore_brbe(*host_data_ptr(host_debug_state.brbcr_el1)); 145 + __debug_restore_brbe(); 214 146 if (__trace_needs_switch()) 215 147 __trace_switch_to_host(); 216 148 }

+25

arch/arm64/kvm/hyp/nvhe/events.c

···
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 Google LLC
+ * Author: Vincent Donnefort <vdonnefort@google.com>
+ */
+
+#include <nvhe/mm.h>
+#include <nvhe/trace.h>
+
+#include <nvhe/define_events.h>
+
+int __tracing_enable_event(unsigned short id, bool enable)
+{
+	struct hyp_event_id *event_id = &__hyp_event_ids_start[id];
+	atomic_t *enabled;
+
+	if (event_id >= __hyp_event_ids_end)
+		return -EINVAL;
+
+	enabled = hyp_fixmap_map(__hyp_pa(&event_id->enabled));
+	atomic_set(enabled, enable);
+	hyp_fixmap_unmap();
+
+	return 0;
+}

+14 -14

arch/arm64/kvm/hyp/nvhe/ffa.c

··· 26 26 * the duration and are therefore serialised. 27 27 */ 28 28 29 - #include <linux/arm-smccc.h> 30 29 #include <linux/arm_ffa.h> 31 30 #include <asm/kvm_pkvm.h> 32 31 32 + #include <nvhe/arm-smccc.h> 33 33 #include <nvhe/ffa.h> 34 34 #include <nvhe/mem_protect.h> 35 35 #include <nvhe/memory.h> ··· 147 147 { 148 148 struct arm_smccc_1_2_regs res; 149 149 150 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 150 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 151 151 .a0 = FFA_FN64_RXTX_MAP, 152 152 .a1 = hyp_virt_to_phys(hyp_buffers.tx), 153 153 .a2 = hyp_virt_to_phys(hyp_buffers.rx), ··· 161 161 { 162 162 struct arm_smccc_1_2_regs res; 163 163 164 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 164 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 165 165 .a0 = FFA_RXTX_UNMAP, 166 166 .a1 = HOST_FFA_ID, 167 167 }, &res); ··· 172 172 static void ffa_mem_frag_tx(struct arm_smccc_1_2_regs *res, u32 handle_lo, 173 173 u32 handle_hi, u32 fraglen, u32 endpoint_id) 174 174 { 175 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 175 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 176 176 .a0 = FFA_MEM_FRAG_TX, 177 177 .a1 = handle_lo, 178 178 .a2 = handle_hi, ··· 184 184 static void ffa_mem_frag_rx(struct arm_smccc_1_2_regs *res, u32 handle_lo, 185 185 u32 handle_hi, u32 fragoff) 186 186 { 187 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 187 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 188 188 .a0 = FFA_MEM_FRAG_RX, 189 189 .a1 = handle_lo, 190 190 .a2 = handle_hi, ··· 196 196 static void ffa_mem_xfer(struct arm_smccc_1_2_regs *res, u64 func_id, u32 len, 197 197 u32 fraglen) 198 198 { 199 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 199 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 200 200 .a0 = func_id, 201 201 .a1 = len, 202 202 .a2 = fraglen, ··· 206 206 static void ffa_mem_reclaim(struct arm_smccc_1_2_regs *res, u32 handle_lo, 207 207 u32 handle_hi, u32 flags) 208 208 { 209 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 
209 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 210 210 .a0 = FFA_MEM_RECLAIM, 211 211 .a1 = handle_lo, 212 212 .a2 = handle_hi, ··· 216 216 217 217 static void ffa_retrieve_req(struct arm_smccc_1_2_regs *res, u32 len) 218 218 { 219 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 219 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 220 220 .a0 = FFA_FN64_MEM_RETRIEVE_REQ, 221 221 .a1 = len, 222 222 .a2 = len, ··· 225 225 226 226 static void ffa_rx_release(struct arm_smccc_1_2_regs *res) 227 227 { 228 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 228 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 229 229 .a0 = FFA_RX_RELEASE, 230 230 }, res); 231 231 } ··· 728 728 size_t min_rxtx_sz; 729 729 struct arm_smccc_1_2_regs res; 730 730 731 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ 731 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ 732 732 .a0 = FFA_ID_GET, 733 733 }, &res); 734 734 if (res.a0 != FFA_SUCCESS) ··· 737 737 if (res.a2 != HOST_FFA_ID) 738 738 return -EINVAL; 739 739 740 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ 740 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ 741 741 .a0 = FFA_FEATURES, 742 742 .a1 = FFA_FN64_RXTX_MAP, 743 743 }, &res); ··· 788 788 * first if TEE supports it. 
789 789 */ 790 790 if (FFA_MINOR_VERSION(ffa_req_version) < FFA_MINOR_VERSION(hyp_ffa_version)) { 791 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 791 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 792 792 .a0 = FFA_VERSION, 793 793 .a1 = ffa_req_version, 794 794 }, res); ··· 824 824 goto out_unlock; 825 825 } 826 826 827 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 827 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 828 828 .a0 = FFA_PARTITION_INFO_GET, 829 829 .a1 = uuid0, 830 830 .a2 = uuid1, ··· 939 939 if (kvm_host_psci_config.smccc_version < ARM_SMCCC_VERSION_1_2) 940 940 return 0; 941 941 942 - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 942 + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { 943 943 .a0 = FFA_VERSION, 944 944 .a1 = FFA_VERSION_1_2, 945 945 }, &res);

+1 -12

arch/arm64/kvm/hyp/nvhe/host.S

··· 120 120 121 121 mov x29, x0 122 122 123 - #ifdef CONFIG_NVHE_EL2_DEBUG 123 + #ifdef PKVM_DISABLE_STAGE2_ON_PANIC 124 124 /* Ensure host stage-2 is disabled */ 125 125 mrs x0, hcr_el2 126 126 bic x0, x0, #HCR_VM 127 127 msr_hcr_el2 x0 128 - isb 129 128 tlbi vmalls12e1 130 129 dsb nsh 131 130 #endif ··· 290 291 291 292 ret 292 293 SYM_CODE_END(__kvm_hyp_host_forward_smc) 293 - 294 - /* 295 - * kvm_host_psci_cpu_entry is called through br instruction, which requires 296 - * bti j instruction as compilers (gcc and llvm) doesn't insert bti j for external 297 - * functions, but bti c instead. 298 - */ 299 - SYM_CODE_START(kvm_host_psci_cpu_entry) 300 - bti j 301 - b __kvm_host_psci_cpu_entry 302 - SYM_CODE_END(kvm_host_psci_cpu_entry)

+15 -26

arch/arm64/kvm/hyp/nvhe/hyp-init.S

··· 173 173 * x0: struct kvm_nvhe_init_params PA 174 174 */ 175 175 SYM_CODE_START(kvm_hyp_cpu_entry) 176 - mov x1, #1 // is_cpu_on = true 176 + ldr x29, =__kvm_host_psci_cpu_on_entry 177 177 b __kvm_hyp_init_cpu 178 - SYM_CODE_END(kvm_hyp_cpu_entry) 179 178 180 179 /* 181 180 * PSCI CPU_SUSPEND / SYSTEM_SUSPEND entry point ··· 182 183 * x0: struct kvm_nvhe_init_params PA 183 184 */ 184 185 SYM_CODE_START(kvm_hyp_cpu_resume) 185 - mov x1, #0 // is_cpu_on = false 186 - b __kvm_hyp_init_cpu 187 - SYM_CODE_END(kvm_hyp_cpu_resume) 186 + ldr x29, =__kvm_host_psci_cpu_resume_entry 188 187 189 - /* 190 - * Common code for CPU entry points. Initializes EL2 state and 191 - * installs the hypervisor before handing over to a C handler. 192 - * 193 - * x0: struct kvm_nvhe_init_params PA 194 - * x1: bool is_cpu_on 195 - */ 196 - SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu) 188 + SYM_INNER_LABEL(__kvm_hyp_init_cpu, SYM_L_LOCAL) 197 189 mov x28, x0 // Stash arguments 198 - mov x29, x1 199 190 200 191 /* Check that the core was booted in EL2. */ 201 192 mrs x0, CurrentEL 202 193 cmp x0, #CurrentEL_EL2 203 - b.eq 2f 194 + b.ne 1f 204 195 205 - /* The core booted in EL1. KVM cannot be initialized on it. */ 206 - 1: wfe 207 - wfi 208 - b 1b 209 - 210 - 2: msr SPsel, #1 // We want to use SP_EL{1,2} 196 + msr SPsel, #1 // We want to use SP_EL2 211 197 212 198 init_el2_hcr 0 213 199 ··· 202 218 mov x0, x28 203 219 bl ___kvm_hyp_init // Clobbers x0..x2 204 220 205 - /* Leave idmap. */ 206 - mov x0, x29 207 - ldr x1, =kvm_host_psci_cpu_entry 208 - br x1 209 - SYM_CODE_END(__kvm_hyp_init_cpu) 221 + /* Leave idmap -- using BLR is OK, LR is restored from host context */ 222 + blr x29 223 + 224 + // The core booted in EL1, or the C code unexpectedly returned. 225 + // Either way, KVM cannot be initialized on it. 226 + 1: wfe 227 + wfi 228 + b 1b 229 + SYM_CODE_END(kvm_hyp_cpu_resume) 230 + SYM_CODE_END(kvm_hyp_cpu_entry) 210 231 211 232 SYM_CODE_START(__kvm_handle_stub_hvc) 212 233 /*

+218 -76

arch/arm64/kvm/hyp/nvhe/hyp-main.c

··· 12 12 #include <asm/kvm_emulate.h> 13 13 #include <asm/kvm_host.h> 14 14 #include <asm/kvm_hyp.h> 15 + #include <asm/kvm_hypevents.h> 15 16 #include <asm/kvm_mmu.h> 16 17 17 18 #include <nvhe/ffa.h> 18 19 #include <nvhe/mem_protect.h> 19 20 #include <nvhe/mm.h> 20 21 #include <nvhe/pkvm.h> 22 + #include <nvhe/trace.h> 21 23 #include <nvhe/trap_handler.h> 22 24 23 25 DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params); ··· 138 136 hyp_vcpu->vcpu.arch.vsesr_el2 = host_vcpu->arch.vsesr_el2; 139 137 140 138 hyp_vcpu->vcpu.arch.vgic_cpu.vgic_v3 = host_vcpu->arch.vgic_cpu.vgic_v3; 139 + 140 + hyp_vcpu->vcpu.arch.pid = host_vcpu->arch.pid; 141 141 } 142 142 143 143 static void sync_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu) ··· 173 169 DECLARE_REG(u64, hcr_el2, host_ctxt, 3); 174 170 struct pkvm_hyp_vcpu *hyp_vcpu; 175 171 176 - if (!is_protected_kvm_enabled()) 177 - return; 178 - 179 172 hyp_vcpu = pkvm_load_hyp_vcpu(handle, vcpu_idx); 180 173 if (!hyp_vcpu) 181 174 return; ··· 189 188 190 189 static void handle___pkvm_vcpu_put(struct kvm_cpu_context *host_ctxt) 191 190 { 192 - struct pkvm_hyp_vcpu *hyp_vcpu; 191 + struct pkvm_hyp_vcpu *hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); 193 192 194 - if (!is_protected_kvm_enabled()) 195 - return; 196 - 197 - hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); 198 193 if (hyp_vcpu) 199 194 pkvm_put_hyp_vcpu(hyp_vcpu); 200 195 } ··· 245 248 &host_vcpu->arch.pkvm_memcache); 246 249 } 247 250 251 + static void handle___pkvm_host_donate_guest(struct kvm_cpu_context *host_ctxt) 252 + { 253 + DECLARE_REG(u64, pfn, host_ctxt, 1); 254 + DECLARE_REG(u64, gfn, host_ctxt, 2); 255 + struct pkvm_hyp_vcpu *hyp_vcpu; 256 + int ret = -EINVAL; 257 + 258 + hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); 259 + if (!hyp_vcpu || !pkvm_hyp_vcpu_is_protected(hyp_vcpu)) 260 + goto out; 261 + 262 + ret = pkvm_refill_memcache(hyp_vcpu); 263 + if (ret) 264 + goto out; 265 + 266 + ret = __pkvm_host_donate_guest(pfn, gfn, hyp_vcpu); 267 + out: 268 + cpu_reg(host_ctxt, 1) = 
ret; 269 + } 270 + 248 271 static void handle___pkvm_host_share_guest(struct kvm_cpu_context *host_ctxt) 249 272 { 250 273 DECLARE_REG(u64, pfn, host_ctxt, 1); ··· 273 256 DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4); 274 257 struct pkvm_hyp_vcpu *hyp_vcpu; 275 258 int ret = -EINVAL; 276 - 277 - if (!is_protected_kvm_enabled()) 278 - goto out; 279 259 280 260 hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); 281 261 if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu)) ··· 295 281 struct pkvm_hyp_vm *hyp_vm; 296 282 int ret = -EINVAL; 297 283 298 - if (!is_protected_kvm_enabled()) 299 - goto out; 300 - 301 284 hyp_vm = get_np_pkvm_hyp_vm(handle); 302 285 if (!hyp_vm) 303 286 goto out; ··· 312 301 struct pkvm_hyp_vcpu *hyp_vcpu; 313 302 int ret = -EINVAL; 314 303 315 - if (!is_protected_kvm_enabled()) 316 - goto out; 317 - 318 304 hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); 319 305 if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu)) 320 306 goto out; ··· 328 320 DECLARE_REG(u64, nr_pages, host_ctxt, 3); 329 321 struct pkvm_hyp_vm *hyp_vm; 330 322 int ret = -EINVAL; 331 - 332 - if (!is_protected_kvm_enabled()) 333 - goto out; 334 323 335 324 hyp_vm = get_np_pkvm_hyp_vm(handle); 336 325 if (!hyp_vm) ··· 348 343 struct pkvm_hyp_vm *hyp_vm; 349 344 int ret = -EINVAL; 350 345 351 - if (!is_protected_kvm_enabled()) 352 - goto out; 353 - 354 346 hyp_vm = get_np_pkvm_hyp_vm(handle); 355 347 if (!hyp_vm) 356 348 goto out; ··· 363 361 DECLARE_REG(u64, gfn, host_ctxt, 1); 364 362 struct pkvm_hyp_vcpu *hyp_vcpu; 365 363 int ret = -EINVAL; 366 - 367 - if (!is_protected_kvm_enabled()) 368 - goto out; 369 364 370 365 hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); 371 366 if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu)) ··· 423 424 static void handle___pkvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt) 424 425 { 425 426 DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); 426 - struct pkvm_hyp_vm *hyp_vm; 427 + struct pkvm_hyp_vm *hyp_vm = get_np_pkvm_hyp_vm(handle); 427 428 428 - if 
(!is_protected_kvm_enabled()) 429 - return; 430 - 431 - hyp_vm = get_np_pkvm_hyp_vm(handle); 432 429 if (!hyp_vm) 433 430 return; 434 431 ··· 481 486 { 482 487 DECLARE_REG(phys_addr_t, phys, host_ctxt, 1); 483 488 DECLARE_REG(unsigned long, size, host_ctxt, 2); 484 - DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3); 485 - DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4); 486 - DECLARE_REG(u32, hyp_va_bits, host_ctxt, 5); 489 + DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 3); 490 + DECLARE_REG(u32, hyp_va_bits, host_ctxt, 4); 487 491 488 492 /* 489 493 * __pkvm_init() will return only if an error occurred, otherwise it 490 494 * will tail-call in __pkvm_init_finalise() which will have to deal 491 495 * with the host context directly. 492 496 */ 493 - cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base, 494 - hyp_va_bits); 497 + cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, per_cpu_base, hyp_va_bits); 495 498 } 496 499 497 500 static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt) ··· 575 582 cpu_reg(host_ctxt, 1) = __pkvm_init_vcpu(handle, host_vcpu, vcpu_hva); 576 583 } 577 584 578 - static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt) 585 + static void handle___pkvm_vcpu_in_poison_fault(struct kvm_cpu_context *host_ctxt) 586 + { 587 + int ret; 588 + struct pkvm_hyp_vcpu *hyp_vcpu = pkvm_get_loaded_hyp_vcpu(); 589 + 590 + ret = hyp_vcpu ? 
__pkvm_vcpu_in_poison_fault(hyp_vcpu) : -EINVAL; 591 + cpu_reg(host_ctxt, 1) = ret; 592 + } 593 + 594 + static void handle___pkvm_force_reclaim_guest_page(struct kvm_cpu_context *host_ctxt) 595 + { 596 + DECLARE_REG(phys_addr_t, phys, host_ctxt, 1); 597 + 598 + cpu_reg(host_ctxt, 1) = __pkvm_host_force_reclaim_page_guest(phys); 599 + } 600 + 601 + static void handle___pkvm_reclaim_dying_guest_page(struct kvm_cpu_context *host_ctxt) 602 + { 603 + DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); 604 + DECLARE_REG(u64, gfn, host_ctxt, 2); 605 + 606 + cpu_reg(host_ctxt, 1) = __pkvm_reclaim_dying_guest_page(handle, gfn); 607 + } 608 + 609 + static void handle___pkvm_start_teardown_vm(struct kvm_cpu_context *host_ctxt) 579 610 { 580 611 DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); 581 612 582 - cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle); 613 + cpu_reg(host_ctxt, 1) = __pkvm_start_teardown_vm(handle); 614 + } 615 + 616 + static void handle___pkvm_finalize_teardown_vm(struct kvm_cpu_context *host_ctxt) 617 + { 618 + DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1); 619 + 620 + cpu_reg(host_ctxt, 1) = __pkvm_finalize_teardown_vm(handle); 621 + } 622 + 623 + static void handle___tracing_load(struct kvm_cpu_context *host_ctxt) 624 + { 625 + DECLARE_REG(unsigned long, desc_hva, host_ctxt, 1); 626 + DECLARE_REG(size_t, desc_size, host_ctxt, 2); 627 + 628 + cpu_reg(host_ctxt, 1) = __tracing_load(desc_hva, desc_size); 629 + } 630 + 631 + static void handle___tracing_unload(struct kvm_cpu_context *host_ctxt) 632 + { 633 + __tracing_unload(); 634 + } 635 + 636 + static void handle___tracing_enable(struct kvm_cpu_context *host_ctxt) 637 + { 638 + DECLARE_REG(bool, enable, host_ctxt, 1); 639 + 640 + cpu_reg(host_ctxt, 1) = __tracing_enable(enable); 641 + } 642 + 643 + static void handle___tracing_swap_reader(struct kvm_cpu_context *host_ctxt) 644 + { 645 + DECLARE_REG(unsigned int, cpu, host_ctxt, 1); 646 + 647 + cpu_reg(host_ctxt, 1) = __tracing_swap_reader(cpu); 648 + 
} 649 + 650 + static void handle___tracing_update_clock(struct kvm_cpu_context *host_ctxt) 651 + { 652 + DECLARE_REG(u32, mult, host_ctxt, 1); 653 + DECLARE_REG(u32, shift, host_ctxt, 2); 654 + DECLARE_REG(u64, epoch_ns, host_ctxt, 3); 655 + DECLARE_REG(u64, epoch_cyc, host_ctxt, 4); 656 + 657 + __tracing_update_clock(mult, shift, epoch_ns, epoch_cyc); 658 + } 659 + 660 + static void handle___tracing_reset(struct kvm_cpu_context *host_ctxt) 661 + { 662 + DECLARE_REG(unsigned int, cpu, host_ctxt, 1); 663 + 664 + cpu_reg(host_ctxt, 1) = __tracing_reset(cpu); 665 + } 666 + 667 + static void handle___tracing_enable_event(struct kvm_cpu_context *host_ctxt) 668 + { 669 + DECLARE_REG(unsigned short, id, host_ctxt, 1); 670 + DECLARE_REG(bool, enable, host_ctxt, 2); 671 + 672 + cpu_reg(host_ctxt, 1) = __tracing_enable_event(id, enable); 673 + } 674 + 675 + static void handle___tracing_write_event(struct kvm_cpu_context *host_ctxt) 676 + { 677 + DECLARE_REG(u64, id, host_ctxt, 1); 678 + 679 + trace_selftest(id); 680 + } 681 + 682 + static void handle___vgic_v5_save_apr(struct kvm_cpu_context *host_ctxt) 683 + { 684 + DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1); 685 + 686 + __vgic_v5_save_apr(kern_hyp_va(cpu_if)); 687 + } 688 + 689 + static void handle___vgic_v5_restore_vmcr_apr(struct kvm_cpu_context *host_ctxt) 690 + { 691 + DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1); 692 + 693 + __vgic_v5_restore_vmcr_apr(kern_hyp_va(cpu_if)); 583 694 } 584 695 585 696 typedef void (*hcall_t)(struct kvm_cpu_context *); ··· 700 603 HANDLE_FUNC(__vgic_v3_get_gic_config), 701 604 HANDLE_FUNC(__pkvm_prot_finalize), 702 605 703 - HANDLE_FUNC(__pkvm_host_share_hyp), 704 - HANDLE_FUNC(__pkvm_host_unshare_hyp), 705 - HANDLE_FUNC(__pkvm_host_share_guest), 706 - HANDLE_FUNC(__pkvm_host_unshare_guest), 707 - HANDLE_FUNC(__pkvm_host_relax_perms_guest), 708 - HANDLE_FUNC(__pkvm_host_wrprotect_guest), 709 - HANDLE_FUNC(__pkvm_host_test_clear_young_guest), 710 - 
HANDLE_FUNC(__pkvm_host_mkyoung_guest), 711 606 HANDLE_FUNC(__kvm_adjust_pc), 712 607 HANDLE_FUNC(__kvm_vcpu_run), 713 608 HANDLE_FUNC(__kvm_flush_vm_context), ··· 711 622 HANDLE_FUNC(__kvm_timer_set_cntvoff), 712 623 HANDLE_FUNC(__vgic_v3_save_aprs), 713 624 HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs), 625 + HANDLE_FUNC(__vgic_v5_save_apr), 626 + HANDLE_FUNC(__vgic_v5_restore_vmcr_apr), 627 + 628 + HANDLE_FUNC(__pkvm_host_share_hyp), 629 + HANDLE_FUNC(__pkvm_host_unshare_hyp), 630 + HANDLE_FUNC(__pkvm_host_donate_guest), 631 + HANDLE_FUNC(__pkvm_host_share_guest), 632 + HANDLE_FUNC(__pkvm_host_unshare_guest), 633 + HANDLE_FUNC(__pkvm_host_relax_perms_guest), 634 + HANDLE_FUNC(__pkvm_host_wrprotect_guest), 635 + HANDLE_FUNC(__pkvm_host_test_clear_young_guest), 636 + HANDLE_FUNC(__pkvm_host_mkyoung_guest), 714 637 HANDLE_FUNC(__pkvm_reserve_vm), 715 638 HANDLE_FUNC(__pkvm_unreserve_vm), 716 639 HANDLE_FUNC(__pkvm_init_vm), 717 640 HANDLE_FUNC(__pkvm_init_vcpu), 718 - HANDLE_FUNC(__pkvm_teardown_vm), 641 + HANDLE_FUNC(__pkvm_vcpu_in_poison_fault), 642 + HANDLE_FUNC(__pkvm_force_reclaim_guest_page), 643 + HANDLE_FUNC(__pkvm_reclaim_dying_guest_page), 644 + HANDLE_FUNC(__pkvm_start_teardown_vm), 645 + HANDLE_FUNC(__pkvm_finalize_teardown_vm), 719 646 HANDLE_FUNC(__pkvm_vcpu_load), 720 647 HANDLE_FUNC(__pkvm_vcpu_put), 721 648 HANDLE_FUNC(__pkvm_tlb_flush_vmid), 649 + HANDLE_FUNC(__tracing_load), 650 + HANDLE_FUNC(__tracing_unload), 651 + HANDLE_FUNC(__tracing_enable), 652 + HANDLE_FUNC(__tracing_swap_reader), 653 + HANDLE_FUNC(__tracing_update_clock), 654 + HANDLE_FUNC(__tracing_reset), 655 + HANDLE_FUNC(__tracing_enable_event), 656 + HANDLE_FUNC(__tracing_write_event), 722 657 }; 723 658 724 659 static void handle_host_hcall(struct kvm_cpu_context *host_ctxt) 725 660 { 726 661 DECLARE_REG(unsigned long, id, host_ctxt, 0); 727 - unsigned long hcall_min = 0; 662 + unsigned long hcall_min = 0, hcall_max = -1; 728 663 hcall_t hfn; 729 664 730 665 /* ··· 760 647 * basis. 
This is all fine, however, since __pkvm_prot_finalize 761 648 * returns -EPERM after the first call for a given CPU. 762 649 */ 763 - if (static_branch_unlikely(&kvm_protected_mode_initialized)) 764 - hcall_min = __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize; 650 + if (static_branch_unlikely(&kvm_protected_mode_initialized)) { 651 + hcall_min = __KVM_HOST_SMCCC_FUNC_MIN_PKVM; 652 + } else { 653 + hcall_max = __KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM; 654 + } 765 655 766 656 id &= ~ARM_SMCCC_CALL_HINTS; 767 657 id -= KVM_HOST_SMCCC_ID(0); 768 658 769 - if (unlikely(id < hcall_min || id >= ARRAY_SIZE(host_hcall))) 659 + if (unlikely(id < hcall_min || id > hcall_max || 660 + id >= ARRAY_SIZE(host_hcall))) { 770 661 goto inval; 662 + } 771 663 772 664 hfn = host_hcall[id]; 773 665 if (unlikely(!hfn)) ··· 788 670 789 671 static void default_host_smc_handler(struct kvm_cpu_context *host_ctxt) 790 672 { 673 + trace_hyp_exit(host_ctxt, HYP_REASON_SMC); 791 674 __kvm_hyp_host_forward_smc(host_ctxt); 675 + trace_hyp_enter(host_ctxt, HYP_REASON_SMC); 792 676 } 793 677 794 678 static void handle_host_smc(struct kvm_cpu_context *host_ctxt) 795 679 { 796 680 DECLARE_REG(u64, func_id, host_ctxt, 0); 681 + u64 esr = read_sysreg_el2(SYS_ESR); 797 682 bool handled; 683 + 684 + if (esr & ESR_ELx_xVC_IMM_MASK) { 685 + cpu_reg(host_ctxt, 0) = SMCCC_RET_NOT_SUPPORTED; 686 + goto exit_skip_instr; 687 + } 798 688 799 689 func_id &= ~ARM_SMCCC_CALL_HINTS; 800 690 ··· 812 686 if (!handled) 813 687 default_host_smc_handler(host_ctxt); 814 688 689 + exit_skip_instr: 815 690 /* SMC was trapped, move ELR past the current PC. */ 816 691 kvm_skip_host_instr(); 817 692 } 818 693 819 - /* 820 - * Inject an Undefined Instruction exception into the host. 821 - * 822 - * This is open-coded to allow control over PSTATE construction without 823 - * complicating the generic exception entry helpers. 
824 - */ 825 - static void inject_undef64(void) 694 + void inject_host_exception(u64 esr) 826 695 { 827 - u64 spsr_mask, vbar, sctlr, old_spsr, new_spsr, esr, offset; 696 + u64 sctlr, spsr_el1, spsr_el2, exc_offset = except_type_sync; 697 + const u64 spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | 698 + PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT; 828 699 829 - spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT; 700 + spsr_el1 = spsr_el2 = read_sysreg_el2(SYS_SPSR); 701 + switch (spsr_el1 & (PSR_MODE_MASK | PSR_MODE32_BIT)) { 702 + case PSR_MODE_EL0t: 703 + exc_offset += LOWER_EL_AArch64_VECTOR; 704 + break; 705 + case PSR_MODE_EL0t | PSR_MODE32_BIT: 706 + exc_offset += LOWER_EL_AArch32_VECTOR; 707 + break; 708 + default: 709 + exc_offset += CURRENT_EL_SP_ELx_VECTOR; 710 + } 830 711 831 - vbar = read_sysreg_el1(SYS_VBAR); 712 + spsr_el2 &= spsr_mask; 713 + spsr_el2 |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT | 714 + PSR_MODE_EL1h; 715 + 832 716 sctlr = read_sysreg_el1(SYS_SCTLR); 833 - old_spsr = read_sysreg_el2(SYS_SPSR); 834 - 835 - new_spsr = old_spsr & spsr_mask; 836 - new_spsr |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT; 837 - new_spsr |= PSR_MODE_EL1h; 838 - 839 717 if (!(sctlr & SCTLR_EL1_SPAN)) 840 - new_spsr |= PSR_PAN_BIT; 718 + spsr_el2 |= PSR_PAN_BIT; 841 719 842 720 if (sctlr & SCTLR_ELx_DSSBS) 843 - new_spsr |= PSR_SSBS_BIT; 721 + spsr_el2 |= PSR_SSBS_BIT; 844 722 845 723 if (system_supports_mte()) 846 - new_spsr |= PSR_TCO_BIT; 724 + spsr_el2 |= PSR_TCO_BIT; 847 725 848 - esr = (ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT) | ESR_ELx_IL; 849 - offset = CURRENT_EL_SP_ELx_VECTOR + except_type_sync; 726 + if (esr_fsc_is_translation_fault(esr)) 727 + write_sysreg_el1(read_sysreg_el2(SYS_FAR), SYS_FAR); 850 728 851 729 write_sysreg_el1(esr, SYS_ESR); 852 730 write_sysreg_el1(read_sysreg_el2(SYS_ELR), SYS_ELR); 853 - write_sysreg_el1(old_spsr, SYS_SPSR); 854 - write_sysreg_el2(vbar + offset, SYS_ELR); 855 - 
write_sysreg_el2(new_spsr, SYS_SPSR); 731 + write_sysreg_el1(spsr_el1, SYS_SPSR); 732 + write_sysreg_el2(read_sysreg_el1(SYS_VBAR) + exc_offset, SYS_ELR); 733 + write_sysreg_el2(spsr_el2, SYS_SPSR); 734 + } 735 + 736 + static void inject_host_undef64(void) 737 + { 738 + inject_host_exception((ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT) | 739 + ESR_ELx_IL); 856 740 } 857 741 858 742 static bool handle_host_mte(u64 esr) ··· 885 749 return false; 886 750 } 887 751 888 - inject_undef64(); 752 + inject_host_undef64(); 889 753 return true; 890 754 } 891 755 ··· 893 757 { 894 758 u64 esr = read_sysreg_el2(SYS_ESR); 895 759 760 + 896 761 switch (ESR_ELx_EC(esr)) { 897 762 case ESR_ELx_EC_HVC64: 763 + trace_hyp_enter(host_ctxt, HYP_REASON_HVC); 898 764 handle_host_hcall(host_ctxt); 899 765 break; 900 766 case ESR_ELx_EC_SMC64: 767 + trace_hyp_enter(host_ctxt, HYP_REASON_SMC); 901 768 handle_host_smc(host_ctxt); 902 769 break; 903 770 case ESR_ELx_EC_IABT_LOW: 904 771 case ESR_ELx_EC_DABT_LOW: 772 + trace_hyp_enter(host_ctxt, HYP_REASON_HOST_ABORT); 905 773 handle_host_mem_abort(host_ctxt); 906 774 break; 907 775 case ESR_ELx_EC_SYS64: ··· 915 775 default: 916 776 BUG(); 917 777 } 778 + 779 + trace_hyp_exit(host_ctxt, HYP_REASON_ERET_HOST); 918 780 }
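Note on the `handle_host_hcall()` change above: the handler now bounds the hypercall ID from both sides. Before protected mode is initialised, IDs above `__KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM` are rejected; once it is initialised, IDs below `__KVM_HOST_SMCCC_FUNC_MIN_PKVM` are rejected. A minimal user-space sketch of that window check follows. The numeric values of `NR_HCALLS`, `FUNC_MIN_PKVM` and `FUNC_MAX_NO_PKVM`, and the helper name `hcall_id_allowed()`, are made up for illustration and are not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical numbering, chosen only to make the window visible. */
#define NR_HCALLS        8UL /* size of the dispatch table */
#define FUNC_MIN_PKVM    2UL /* lowest ID accepted once protected mode is up */
#define FUNC_MAX_NO_PKVM 5UL /* highest ID accepted before protected mode */

static_assert(FUNC_MIN_PKVM <= FUNC_MAX_NO_PKVM && FUNC_MAX_NO_PKVM < NR_HCALLS,
	      "the two windows are expected to overlap inside the table");

/* Returns true when the dispatcher would look the ID up in the table. */
static bool hcall_id_allowed(unsigned long id, bool protected_mode)
{
	/* -1 wraps to ULONG_MAX, so the upper bound is a no-op by default. */
	unsigned long hcall_min = 0, hcall_max = -1UL;

	if (protected_mode)
		hcall_min = FUNC_MIN_PKVM;
	else
		hcall_max = FUNC_MAX_NO_PKVM;

	return !(id < hcall_min || id > hcall_max || id >= NR_HCALLS);
}
```

With these made-up bounds, ID 0 is accepted only before protected mode is up and ID 7 only afterwards, while IDs 2 to 5 are accepted in both states.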

+6

arch/arm64/kvm/hyp/nvhe/hyp.lds.S

···
 	HYP_SECTION(.text)
 	HYP_SECTION(.data..ro_after_init)
 	HYP_SECTION(.rodata)
+#ifdef CONFIG_NVHE_EL2_TRACING
+	. = ALIGN(PAGE_SIZE);
+	BEGIN_HYP_SECTION(.event_ids)
+		*(SORT(.hyp.event_ids.*))
+	END_HYP_SECTION
+#endif

 	/*
 	 * .hyp..data..percpu needs to be page aligned to maintain the same

+530 -57

arch/arm64/kvm/hyp/nvhe/mem_protect.c

··· 18 18 #include <nvhe/memory.h> 19 19 #include <nvhe/mem_protect.h> 20 20 #include <nvhe/mm.h> 21 + #include <nvhe/trap_handler.h> 21 22 22 23 #define KVM_HOST_S2_FLAGS (KVM_PGTABLE_S2_AS_S1 | KVM_PGTABLE_S2_IDMAP) 23 24 ··· 462 461 static inline int __host_stage2_idmap(u64 start, u64 end, 463 462 enum kvm_pgtable_prot prot) 464 463 { 464 + /* 465 + * We don't make permission changes to the host idmap after 466 + * initialisation, so we can squash -EAGAIN to save callers 467 + * having to treat it like success in the case that they try to 468 + * map something that is already mapped. 469 + */ 465 470 return kvm_pgtable_stage2_map(&host_mmu.pgt, start, end - start, start, 466 - prot, &host_s2_pool, 0); 471 + prot, &host_s2_pool, 472 + KVM_PGTABLE_WALK_IGNORE_EAGAIN); 467 473 } 468 474 469 475 /* ··· 512 504 return ret; 513 505 514 506 if (kvm_pte_valid(pte)) 515 - return -EAGAIN; 507 + return -EEXIST; 516 508 517 509 if (pte) { 518 510 WARN_ON(addr_is_memory(addr) && ··· 549 541 set_host_state(page, state); 550 542 } 551 543 552 - int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) 544 + #define KVM_HOST_DONATION_PTE_OWNER_MASK GENMASK(3, 1) 545 + #define KVM_HOST_DONATION_PTE_EXTRA_MASK GENMASK(59, 4) 546 + static int host_stage2_set_owner_metadata_locked(phys_addr_t addr, u64 size, 547 + u8 owner_id, u64 meta) 553 548 { 549 + kvm_pte_t annotation; 554 550 int ret; 551 + 552 + if (owner_id == PKVM_ID_HOST) 553 + return -EINVAL; 555 554 556 555 if (!range_is_memory(addr, addr + size)) 557 556 return -EPERM; 558 557 559 - ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt, 560 - addr, size, &host_s2_pool, owner_id); 561 - if (ret) 562 - return ret; 558 + if (!FIELD_FIT(KVM_HOST_DONATION_PTE_OWNER_MASK, owner_id)) 559 + return -EINVAL; 563 560 564 - /* Don't forget to update the vmemmap tracking for the host */ 565 - if (owner_id == PKVM_ID_HOST) 566 - __host_update_page_state(addr, size, PKVM_PAGE_OWNED); 567 - else 561 + if 
(!FIELD_FIT(KVM_HOST_DONATION_PTE_EXTRA_MASK, meta)) 562 + return -EINVAL; 563 + 564 + annotation = FIELD_PREP(KVM_HOST_DONATION_PTE_OWNER_MASK, owner_id) | 565 + FIELD_PREP(KVM_HOST_DONATION_PTE_EXTRA_MASK, meta); 566 + ret = host_stage2_try(kvm_pgtable_stage2_annotate, &host_mmu.pgt, 567 + addr, size, &host_s2_pool, 568 + KVM_HOST_INVALID_PTE_TYPE_DONATION, annotation); 569 + if (!ret) 568 570 __host_update_page_state(addr, size, PKVM_NOPAGE); 569 571 572 + return ret; 573 + } 574 + 575 + int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) 576 + { 577 + int ret = -EINVAL; 578 + 579 + switch (owner_id) { 580 + case PKVM_ID_HOST: 581 + if (!range_is_memory(addr, addr + size)) 582 + return -EPERM; 583 + 584 + ret = host_stage2_idmap_locked(addr, size, PKVM_HOST_MEM_PROT); 585 + if (!ret) 586 + __host_update_page_state(addr, size, PKVM_PAGE_OWNED); 587 + break; 588 + case PKVM_ID_HYP: 589 + ret = host_stage2_set_owner_metadata_locked(addr, size, 590 + owner_id, 0); 591 + break; 592 + } 593 + 594 + return ret; 595 + } 596 + 597 + #define KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK GENMASK(15, 0) 598 + /* We need 40 bits for the GFN to cover a 52-bit IPA with 4k pages and LPA2 */ 599 + #define KVM_HOST_PTE_OWNER_GUEST_GFN_MASK GENMASK(55, 16) 600 + static u64 host_stage2_encode_gfn_meta(struct pkvm_hyp_vm *vm, u64 gfn) 601 + { 602 + pkvm_handle_t handle = vm->kvm.arch.pkvm.handle; 603 + 604 + BUILD_BUG_ON((pkvm_handle_t)-1 > KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK); 605 + WARN_ON(!FIELD_FIT(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn)); 606 + 607 + return FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK, handle) | 608 + FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn); 609 + } 610 + 611 + static int host_stage2_decode_gfn_meta(kvm_pte_t pte, struct pkvm_hyp_vm **vm, 612 + u64 *gfn) 613 + { 614 + pkvm_handle_t handle; 615 + u64 meta; 616 + 617 + if (WARN_ON(kvm_pte_valid(pte))) 618 + return -EINVAL; 619 + 620 + if (FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) != 
621 + KVM_HOST_INVALID_PTE_TYPE_DONATION) { 622 + return -EINVAL; 623 + } 624 + 625 + if (FIELD_GET(KVM_HOST_DONATION_PTE_OWNER_MASK, pte) != PKVM_ID_GUEST) 626 + return -EPERM; 627 + 628 + meta = FIELD_GET(KVM_HOST_DONATION_PTE_EXTRA_MASK, pte); 629 + handle = FIELD_GET(KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK, meta); 630 + *vm = get_vm_by_handle(handle); 631 + if (!*vm) { 632 + /* We probably raced with teardown; try again */ 633 + return -EAGAIN; 634 + } 635 + 636 + *gfn = FIELD_GET(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, meta); 570 637 return 0; 571 638 } 572 639 ··· 688 605 return ret; 689 606 } 690 607 608 + static void host_inject_mem_abort(struct kvm_cpu_context *host_ctxt) 609 + { 610 + u64 ec, esr, spsr; 611 + 612 + esr = read_sysreg_el2(SYS_ESR); 613 + spsr = read_sysreg_el2(SYS_SPSR); 614 + 615 + /* Repaint the ESR to report a same-level fault if taken from EL1 */ 616 + if ((spsr & PSR_MODE_MASK) != PSR_MODE_EL0t) { 617 + ec = ESR_ELx_EC(esr); 618 + if (ec == ESR_ELx_EC_DABT_LOW) 619 + ec = ESR_ELx_EC_DABT_CUR; 620 + else if (ec == ESR_ELx_EC_IABT_LOW) 621 + ec = ESR_ELx_EC_IABT_CUR; 622 + else 623 + WARN_ON(1); 624 + esr &= ~ESR_ELx_EC_MASK; 625 + esr |= ec << ESR_ELx_EC_SHIFT; 626 + } 627 + 628 + /* 629 + * Since S1PTW should only ever be set for stage-2 faults, we're pretty 630 + * much guaranteed that it won't be set in ESR_EL1 by the hardware. So, 631 + * let's use that bit to allow the host abort handler to differentiate 632 + * this abort from normal userspace faults. 633 + * 634 + * Note: although S1PTW is RES0 at EL1, it is guaranteed by the 635 + * architecture to be backed by flops, so it should be safe to use. 
636 + */ 637 + esr |= ESR_ELx_S1PTW; 638 + inject_host_exception(esr); 639 + } 640 + 691 641 void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt) 692 642 { 693 643 struct kvm_vcpu_fault_info fault; 694 644 u64 esr, addr; 695 - int ret = 0; 696 645 697 646 esr = read_sysreg_el2(SYS_ESR); 698 647 if (!__get_fault_info(esr, &fault)) { ··· 743 628 BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS)); 744 629 addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12; 745 630 746 - ret = host_stage2_idmap(addr); 747 - BUG_ON(ret && ret != -EAGAIN); 631 + switch (host_stage2_idmap(addr)) { 632 + case -EPERM: 633 + host_inject_mem_abort(host_ctxt); 634 + fallthrough; 635 + case -EEXIST: 636 + case 0: 637 + break; 638 + default: 639 + BUG(); 640 + } 748 641 } 749 642 750 643 struct check_walk_data { ··· 830 707 return 0; 831 708 } 832 709 710 + static bool guest_pte_is_poisoned(kvm_pte_t pte) 711 + { 712 + if (kvm_pte_valid(pte)) 713 + return false; 714 + 715 + return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) == 716 + KVM_GUEST_INVALID_PTE_TYPE_POISONED; 717 + } 718 + 833 719 static enum pkvm_page_state guest_get_page_state(kvm_pte_t pte, u64 addr) 834 720 { 721 + if (guest_pte_is_poisoned(pte)) 722 + return PKVM_POISON; 723 + 835 724 if (!kvm_pte_valid(pte)) 836 725 return PKVM_NOPAGE; 837 726 ··· 860 725 861 726 hyp_assert_lock_held(&vm->lock); 862 727 return check_page_state_range(&vm->pgt, addr, size, &d); 728 + } 729 + 730 + static int get_valid_guest_pte(struct pkvm_hyp_vm *vm, u64 ipa, kvm_pte_t *ptep, u64 *physp) 731 + { 732 + kvm_pte_t pte; 733 + u64 phys; 734 + s8 level; 735 + int ret; 736 + 737 + ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level); 738 + if (ret) 739 + return ret; 740 + if (guest_pte_is_poisoned(pte)) 741 + return -EHWPOISON; 742 + if (!kvm_pte_valid(pte)) 743 + return -ENOENT; 744 + if (level != KVM_PGTABLE_LAST_LEVEL) 745 + return -E2BIG; 746 + 747 + phys = kvm_pte_to_phys(pte); 748 + ret = check_range_allowed_memory(phys, phys + PAGE_SIZE); 
749 + if (WARN_ON(ret)) 750 + return ret; 751 + 752 + *ptep = pte; 753 + *physp = phys; 754 + 755 + return 0; 756 + } 757 + 758 + int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu) 759 + { 760 + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(hyp_vcpu); 761 + kvm_pte_t pte; 762 + s8 level; 763 + u64 ipa; 764 + int ret; 765 + 766 + switch (kvm_vcpu_trap_get_class(&hyp_vcpu->vcpu)) { 767 + case ESR_ELx_EC_DABT_LOW: 768 + case ESR_ELx_EC_IABT_LOW: 769 + if (kvm_vcpu_trap_is_translation_fault(&hyp_vcpu->vcpu)) 770 + break; 771 + fallthrough; 772 + default: 773 + return -EINVAL; 774 + } 775 + 776 + /* 777 + * The host has the faulting IPA when it calls us from the guest 778 + * fault handler but we retrieve it ourselves from the FAR so as 779 + * to avoid exposing an "oracle" that could reveal data access 780 + * patterns of the guest after initial donation of its pages. 781 + */ 782 + ipa = kvm_vcpu_get_fault_ipa(&hyp_vcpu->vcpu); 783 + ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(&hyp_vcpu->vcpu)); 784 + 785 + guest_lock_component(vm); 786 + ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level); 787 + if (ret) 788 + goto unlock; 789 + 790 + if (level != KVM_PGTABLE_LAST_LEVEL) { 791 + ret = -EINVAL; 792 + goto unlock; 793 + } 794 + 795 + ret = guest_pte_is_poisoned(pte); 796 + unlock: 797 + guest_unlock_component(vm); 798 + return ret; 863 799 } 864 800 865 801 int __pkvm_host_share_hyp(u64 pfn) ··· 954 748 955 749 unlock: 956 750 hyp_unlock_component(); 751 + host_unlock_component(); 752 + 753 + return ret; 754 + } 755 + 756 + int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn) 757 + { 758 + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); 759 + u64 phys, ipa = hyp_pfn_to_phys(gfn); 760 + kvm_pte_t pte; 761 + int ret; 762 + 763 + host_lock_component(); 764 + guest_lock_component(vm); 765 + 766 + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); 767 + if (ret) 768 + goto unlock; 769 + 770 + ret = -EPERM; 771 + if 
(pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte)) != PKVM_PAGE_OWNED) 772 + goto unlock; 773 + if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE)) 774 + goto unlock; 775 + 776 + ret = 0; 777 + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, 778 + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED), 779 + &vcpu->vcpu.arch.pkvm_memcache, 0)); 780 + WARN_ON(__host_set_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED)); 781 + unlock: 782 + guest_unlock_component(vm); 783 + host_unlock_component(); 784 + 785 + return ret; 786 + } 787 + 788 + int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn) 789 + { 790 + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); 791 + u64 meta, phys, ipa = hyp_pfn_to_phys(gfn); 792 + kvm_pte_t pte; 793 + int ret; 794 + 795 + host_lock_component(); 796 + guest_lock_component(vm); 797 + 798 + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); 799 + if (ret) 800 + goto unlock; 801 + 802 + ret = -EPERM; 803 + if (pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte)) != PKVM_PAGE_SHARED_OWNED) 804 + goto unlock; 805 + if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED)) 806 + goto unlock; 807 + 808 + ret = 0; 809 + meta = host_stage2_encode_gfn_meta(vm, gfn); 810 + WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE, 811 + PKVM_ID_GUEST, meta)); 812 + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, 813 + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED), 814 + &vcpu->vcpu.arch.pkvm_memcache, 0)); 815 + unlock: 816 + guest_unlock_component(vm); 957 817 host_unlock_component(); 958 818 959 819 return ret; ··· 1230 958 1231 959 *size = block_size; 1232 960 return 0; 961 + } 962 + 963 + static void hyp_poison_page(phys_addr_t phys) 964 + { 965 + void *addr = hyp_fixmap_map(phys); 966 + 967 + memset(addr, 0, PAGE_SIZE); 968 + /* 969 + * Prefer kvm_flush_dcache_to_poc() over __clean_dcache_guest_page() 970 + * here as the latter may elide the 
CMO under the assumption that FWB 971 + * will be enabled on CPUs that support it. This is incorrect for the 972 + * host stage-2 and would otherwise lead to a malicious host potentially 973 + * being able to read the contents of newly reclaimed guest pages. 974 + */ 975 + kvm_flush_dcache_to_poc(addr, PAGE_SIZE); 976 + hyp_fixmap_unmap(); 977 + } 978 + 979 + static int host_stage2_get_guest_info(phys_addr_t phys, struct pkvm_hyp_vm **vm, 980 + u64 *gfn) 981 + { 982 + enum pkvm_page_state state; 983 + kvm_pte_t pte; 984 + s8 level; 985 + int ret; 986 + 987 + if (!addr_is_memory(phys)) 988 + return -EFAULT; 989 + 990 + state = get_host_state(hyp_phys_to_page(phys)); 991 + switch (state) { 992 + case PKVM_PAGE_OWNED: 993 + case PKVM_PAGE_SHARED_OWNED: 994 + case PKVM_PAGE_SHARED_BORROWED: 995 + /* The access should no longer fault; try again. */ 996 + return -EAGAIN; 997 + case PKVM_NOPAGE: 998 + break; 999 + default: 1000 + return -EPERM; 1001 + } 1002 + 1003 + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, &level); 1004 + if (ret) 1005 + return ret; 1006 + 1007 + if (WARN_ON(level != KVM_PGTABLE_LAST_LEVEL)) 1008 + return -EINVAL; 1009 + 1010 + return host_stage2_decode_gfn_meta(pte, vm, gfn); 1011 + } 1012 + 1013 + int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys) 1014 + { 1015 + struct pkvm_hyp_vm *vm; 1016 + u64 gfn, ipa, pa; 1017 + kvm_pte_t pte; 1018 + int ret; 1019 + 1020 + phys &= PAGE_MASK; 1021 + 1022 + hyp_spin_lock(&vm_table_lock); 1023 + host_lock_component(); 1024 + 1025 + ret = host_stage2_get_guest_info(phys, &vm, &gfn); 1026 + if (ret) 1027 + goto unlock_host; 1028 + 1029 + ipa = hyp_pfn_to_phys(gfn); 1030 + guest_lock_component(vm); 1031 + ret = get_valid_guest_pte(vm, ipa, &pte, &pa); 1032 + if (ret) 1033 + goto unlock_guest; 1034 + 1035 + WARN_ON(pa != phys); 1036 + if (guest_get_page_state(pte, ipa) != PKVM_PAGE_OWNED) { 1037 + ret = -EPERM; 1038 + goto unlock_guest; 1039 + } 1040 + 1041 + /* We really shouldn't be allocating, so 
don't pass a memcache */ 1042 + ret = kvm_pgtable_stage2_annotate(&vm->pgt, ipa, PAGE_SIZE, NULL, 1043 + KVM_GUEST_INVALID_PTE_TYPE_POISONED, 1044 + 0); 1045 + if (ret) 1046 + goto unlock_guest; 1047 + 1048 + hyp_poison_page(phys); 1049 + WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST)); 1050 + unlock_guest: 1051 + guest_unlock_component(vm); 1052 + unlock_host: 1053 + host_unlock_component(); 1054 + hyp_spin_unlock(&vm_table_lock); 1055 + 1056 + return ret; 1057 + } 1058 + 1059 + int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm) 1060 + { 1061 + u64 ipa = hyp_pfn_to_phys(gfn); 1062 + kvm_pte_t pte; 1063 + u64 phys; 1064 + int ret; 1065 + 1066 + host_lock_component(); 1067 + guest_lock_component(vm); 1068 + 1069 + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); 1070 + if (ret) 1071 + goto unlock; 1072 + 1073 + switch (guest_get_page_state(pte, ipa)) { 1074 + case PKVM_PAGE_OWNED: 1075 + WARN_ON(__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE)); 1076 + hyp_poison_page(phys); 1077 + break; 1078 + case PKVM_PAGE_SHARED_OWNED: 1079 + WARN_ON(__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED)); 1080 + break; 1081 + default: 1082 + ret = -EPERM; 1083 + goto unlock; 1084 + } 1085 + 1086 + WARN_ON(kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE)); 1087 + WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST)); 1088 + 1089 + unlock: 1090 + guest_unlock_component(vm); 1091 + host_unlock_component(); 1092 + 1093 + /* 1094 + * -EHWPOISON implies that the page was forcefully reclaimed already 1095 + * so return success for the GUP pin to be dropped. 1096 + */ 1097 + return ret && ret != -EHWPOISON ? 
ret : 0; 1098 + } 1099 + 1100 + int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu) 1101 + { 1102 + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); 1103 + u64 phys = hyp_pfn_to_phys(pfn); 1104 + u64 ipa = hyp_pfn_to_phys(gfn); 1105 + u64 meta; 1106 + int ret; 1107 + 1108 + host_lock_component(); 1109 + guest_lock_component(vm); 1110 + 1111 + ret = __host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_OWNED); 1112 + if (ret) 1113 + goto unlock; 1114 + 1115 + ret = __guest_check_page_state_range(vm, ipa, PAGE_SIZE, PKVM_NOPAGE); 1116 + if (ret) 1117 + goto unlock; 1118 + 1119 + meta = host_stage2_encode_gfn_meta(vm, gfn); 1120 + WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE, 1121 + PKVM_ID_GUEST, meta)); 1122 + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, 1123 + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED), 1124 + &vcpu->vcpu.arch.pkvm_memcache, 0)); 1125 + 1126 + unlock: 1127 + guest_unlock_component(vm); 1128 + host_unlock_component(); 1129 + 1130 + return ret; 1233 1131 } 1234 1132 1235 1133 int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, ··· 1648 1206 1649 1207 static struct pkvm_expected_state selftest_state; 1650 1208 static struct hyp_page *selftest_page; 1651 - 1652 - static struct pkvm_hyp_vm selftest_vm = { 1653 - .kvm = { 1654 - .arch = { 1655 - .mmu = { 1656 - .arch = &selftest_vm.kvm.arch, 1657 - .pgt = &selftest_vm.pgt, 1658 - }, 1659 - }, 1660 - }, 1661 - }; 1662 - 1663 - static struct pkvm_hyp_vcpu selftest_vcpu = { 1664 - .vcpu = { 1665 - .arch = { 1666 - .hw_mmu = &selftest_vm.kvm.arch.mmu, 1667 - }, 1668 - .kvm = &selftest_vm.kvm, 1669 - }, 1670 - }; 1671 - 1672 - static void init_selftest_vm(void *virt) 1673 - { 1674 - struct hyp_page *p = hyp_virt_to_page(virt); 1675 - int i; 1676 - 1677 - selftest_vm.kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr; 1678 - WARN_ON(kvm_guest_prepare_stage2(&selftest_vm, virt)); 1679 - 1680 - for (i = 0; i 
< pkvm_selftest_pages(); i++) { 1681 - if (p[i].refcount) 1682 - continue; 1683 - p[i].refcount = 1; 1684 - hyp_put_page(&selftest_vm.pool, hyp_page_to_virt(&p[i])); 1685 - } 1686 - } 1209 + static struct pkvm_hyp_vcpu *selftest_vcpu; 1687 1210 1688 1211 static u64 selftest_ipa(void) 1689 1212 { 1690 - return BIT(selftest_vm.pgt.ia_bits - 1); 1213 + return BIT(selftest_vcpu->vcpu.arch.hw_mmu->pgt->ia_bits - 1); 1691 1214 } 1692 1215 1693 1216 static void assert_page_state(void) 1694 1217 { 1695 1218 void *virt = hyp_page_to_virt(selftest_page); 1696 1219 u64 size = PAGE_SIZE << selftest_page->order; 1697 - struct pkvm_hyp_vcpu *vcpu = &selftest_vcpu; 1220 + struct pkvm_hyp_vcpu *vcpu = selftest_vcpu; 1698 1221 u64 phys = hyp_virt_to_phys(virt); 1699 1222 u64 ipa[2] = { selftest_ipa(), selftest_ipa() + PAGE_SIZE }; 1700 1223 struct pkvm_hyp_vm *vm; ··· 1674 1267 WARN_ON(__hyp_check_page_state_range(phys, size, selftest_state.hyp)); 1675 1268 hyp_unlock_component(); 1676 1269 1677 - guest_lock_component(&selftest_vm); 1270 + guest_lock_component(vm); 1678 1271 WARN_ON(__guest_check_page_state_range(vm, ipa[0], size, selftest_state.guest[0])); 1679 1272 WARN_ON(__guest_check_page_state_range(vm, ipa[1], size, selftest_state.guest[1])); 1680 - guest_unlock_component(&selftest_vm); 1273 + guest_unlock_component(vm); 1681 1274 } 1682 1275 1683 1276 #define assert_transition_res(res, fn, ...) 
\ ··· 1690 1283 { 1691 1284 enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RWX; 1692 1285 void *virt = hyp_alloc_pages(&host_s2_pool, 0); 1693 - struct pkvm_hyp_vcpu *vcpu = &selftest_vcpu; 1694 - struct pkvm_hyp_vm *vm = &selftest_vm; 1286 + struct pkvm_hyp_vcpu *vcpu; 1695 1287 u64 phys, size, pfn, gfn; 1288 + struct pkvm_hyp_vm *vm; 1696 1289 1697 1290 WARN_ON(!virt); 1698 1291 selftest_page = hyp_virt_to_page(virt); 1699 1292 selftest_page->refcount = 0; 1700 - init_selftest_vm(base); 1293 + selftest_vcpu = vcpu = init_selftest_vm(base); 1294 + vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); 1701 1295 1702 1296 size = PAGE_SIZE << selftest_page->order; 1703 1297 phys = hyp_virt_to_phys(virt); ··· 1717 1309 assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); 1718 1310 assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1719 1311 assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); 1312 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1720 1313 1721 1314 selftest_state.host = PKVM_PAGE_OWNED; 1722 1315 selftest_state.hyp = PKVM_NOPAGE; ··· 1737 1328 assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1738 1329 assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1739 1330 assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); 1331 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1740 1332 1741 1333 assert_transition_res(0, hyp_pin_shared_mem, virt, virt + size); 1742 1334 assert_transition_res(0, hyp_pin_shared_mem, virt, virt + size); ··· 1750 1340 assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1751 1341 assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1752 1342 assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); 1343 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1753 1344 1754 1345 hyp_unpin_shared_mem(virt, 
virt + size); 1755 1346 assert_page_state(); ··· 1770 1359 assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1771 1360 assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1772 1361 assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); 1362 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1773 1363 assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); 1774 1364 1775 1365 selftest_state.host = PKVM_PAGE_OWNED; ··· 1787 1375 assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); 1788 1376 assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); 1789 1377 assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1378 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1790 1379 assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); 1791 1380 1792 1381 selftest_state.guest[1] = PKVM_PAGE_SHARED_BORROWED; ··· 1802 1389 assert_transition_res(0, __pkvm_host_unshare_guest, gfn + 1, 1, vm); 1803 1390 1804 1391 selftest_state.host = PKVM_NOPAGE; 1392 + selftest_state.guest[0] = PKVM_PAGE_OWNED; 1393 + assert_transition_res(0, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1394 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1395 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); 1396 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1397 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); 1398 + assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1); 1399 + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); 1400 + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); 1401 + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); 1402 + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1403 + 1404 + selftest_state.host = PKVM_PAGE_SHARED_BORROWED; 1405 + 
selftest_state.guest[0] = PKVM_PAGE_SHARED_OWNED; 1406 + assert_transition_res(0, __pkvm_guest_share_host, vcpu, gfn); 1407 + assert_transition_res(-EPERM, __pkvm_guest_share_host, vcpu, gfn); 1408 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1409 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); 1410 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1411 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); 1412 + assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1); 1413 + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); 1414 + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); 1415 + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); 1416 + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1417 + 1418 + selftest_state.host = PKVM_NOPAGE; 1419 + selftest_state.guest[0] = PKVM_PAGE_OWNED; 1420 + assert_transition_res(0, __pkvm_guest_unshare_host, vcpu, gfn); 1421 + assert_transition_res(-EPERM, __pkvm_guest_unshare_host, vcpu, gfn); 1422 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1423 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); 1424 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1425 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); 1426 + assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1); 1427 + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); 1428 + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); 1429 + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); 1430 + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1431 + 1432 + selftest_state.host = PKVM_PAGE_OWNED; 1433 + selftest_state.guest[0] = PKVM_POISON; 1434 + assert_transition_res(0, __pkvm_host_force_reclaim_page_guest, phys); 1435 + 
assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1436 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1437 + assert_transition_res(-EHWPOISON, __pkvm_guest_share_host, vcpu, gfn); 1438 + assert_transition_res(-EHWPOISON, __pkvm_guest_unshare_host, vcpu, gfn); 1439 + 1440 + selftest_state.host = PKVM_NOPAGE; 1441 + selftest_state.guest[1] = PKVM_PAGE_OWNED; 1442 + assert_transition_res(0, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); 1443 + 1444 + selftest_state.host = PKVM_PAGE_OWNED; 1445 + selftest_state.guest[1] = PKVM_NOPAGE; 1446 + assert_transition_res(0, __pkvm_host_reclaim_page_guest, gfn + 1, vm); 1447 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1448 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1449 + 1450 + selftest_state.host = PKVM_NOPAGE; 1805 1451 selftest_state.hyp = PKVM_PAGE_OWNED; 1806 1452 assert_transition_res(0, __pkvm_host_donate_hyp, pfn, 1); 1807 1453 1454 + teardown_selftest_vm(); 1808 1455 selftest_page->refcount = 1; 1809 1456 hyp_put_page(&host_s2_pool, virt); 1810 1457 }

+2 -2

arch/arm64/kvm/hyp/nvhe/mm.c

···
244 244
245 245 void *hyp_fixmap_map(phys_addr_t phys)
246 246 {
247 - 	return fixmap_map_slot(this_cpu_ptr(&fixmap_slots), phys);
247 + 	return fixmap_map_slot(this_cpu_ptr(&fixmap_slots), phys) + offset_in_page(phys);
248 248 }
249 249
250 250 static void fixmap_clear_slot(struct hyp_fixmap_slot *slot)
···
366 366 #ifdef HAS_FIXBLOCK
367 367 	*size = PMD_SIZE;
368 368 	hyp_spin_lock(&hyp_fixblock_lock);
369 - 	return fixmap_map_slot(&hyp_fixblock_slot, phys);
369 + 	return fixmap_map_slot(&hyp_fixblock_slot, phys) + offset_in_page(phys);
370 370 #else
371 371 	*size = PAGE_SIZE;
372 372 	return hyp_fixmap_map(phys);
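
The two one-line changes in this file make the fixmap helpers return a pointer to `phys` itself rather than to the start of the page that contains it. A minimal sketch of that arithmetic follows. It assumes 4 KiB pages, and `sketch_map_page()` is a hypothetical stand-in for `fixmap_map_slot()`, which is taken here to return the base of the slot's mapping; none of the `sketch_*` names exist in the kernel.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Same arithmetic as the kernel's offset_in_page(): keep the low bits. */
static unsigned long sketch_offset_in_page(uint64_t phys)
{
	return (unsigned long)(phys & (SKETCH_PAGE_SIZE - 1));
}

/*
 * Hypothetical stand-in for fixmap_map_slot(): pretend the page holding
 * 'phys' has been mapped at 'slot_va' and return the base of that mapping.
 */
static uintptr_t sketch_map_page(uintptr_t slot_va, uint64_t phys)
{
	(void)phys;
	return slot_va;
}

/* Shape of the fixed helper: base of the mapping plus the in-page offset. */
static uintptr_t sketch_fixmap_map(uintptr_t slot_va, uint64_t phys)
{
	return sketch_map_page(slot_va, phys) + sketch_offset_in_page(phys);
}
```

Under these assumptions a page-aligned `phys` yields the same pointer before and after the change, so only callers that pass an unaligned physical address observe a difference.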

+228 -11

arch/arm64/kvm/hyp/nvhe/pkvm.c

··· 4 4 * Author: Fuad Tabba <tabba@google.com> 5 5 */ 6 6 7 + #include <kvm/arm_hypercalls.h> 8 + 7 9 #include <linux/kvm_host.h> 8 10 #include <linux/mm.h> 9 11 ··· 224 222 225 223 void pkvm_hyp_vm_table_init(void *tbl) 226 224 { 225 + BUILD_BUG_ON((u64)HANDLE_OFFSET + KVM_MAX_PVMS > (pkvm_handle_t)-1); 227 226 WARN_ON(vm_table); 228 227 vm_table = tbl; 229 228 } ··· 232 229 /* 233 230 * Return the hyp vm structure corresponding to the handle. 234 231 */ 235 - static struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) 232 + struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) 236 233 { 237 234 unsigned int idx = vm_handle_to_idx(handle); 235 + 236 + hyp_assert_lock_held(&vm_table_lock); 238 237 239 238 if (unlikely(idx >= KVM_MAX_PVMS)) 240 239 return NULL; ··· 260 255 261 256 hyp_spin_lock(&vm_table_lock); 262 257 hyp_vm = get_vm_by_handle(handle); 263 - if (!hyp_vm || hyp_vm->kvm.created_vcpus <= vcpu_idx) 258 + if (!hyp_vm || hyp_vm->kvm.arch.pkvm.is_dying) 259 + goto unlock; 260 + 261 + if (hyp_vm->kvm.created_vcpus <= vcpu_idx) 264 262 goto unlock; 265 263 266 264 hyp_vcpu = hyp_vm->vcpus[vcpu_idx]; ··· 727 719 hyp_spin_unlock(&vm_table_lock); 728 720 } 729 721 722 + #ifdef CONFIG_NVHE_EL2_DEBUG 723 + static struct pkvm_hyp_vm selftest_vm = { 724 + .kvm = { 725 + .arch = { 726 + .mmu = { 727 + .arch = &selftest_vm.kvm.arch, 728 + .pgt = &selftest_vm.pgt, 729 + }, 730 + }, 731 + }, 732 + }; 733 + 734 + static struct pkvm_hyp_vcpu selftest_vcpu = { 735 + .vcpu = { 736 + .arch = { 737 + .hw_mmu = &selftest_vm.kvm.arch.mmu, 738 + }, 739 + .kvm = &selftest_vm.kvm, 740 + }, 741 + }; 742 + 743 + struct pkvm_hyp_vcpu *init_selftest_vm(void *virt) 744 + { 745 + struct hyp_page *p = hyp_virt_to_page(virt); 746 + int i; 747 + 748 + selftest_vm.kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr; 749 + WARN_ON(kvm_guest_prepare_stage2(&selftest_vm, virt)); 750 + 751 + for (i = 0; i < pkvm_selftest_pages(); i++) { 752 + if (p[i].refcount) 753 + continue; 754 + 
p[i].refcount = 1; 755 + hyp_put_page(&selftest_vm.pool, hyp_page_to_virt(&p[i])); 756 + } 757 + 758 + selftest_vm.kvm.arch.pkvm.handle = __pkvm_reserve_vm(); 759 + insert_vm_table_entry(selftest_vm.kvm.arch.pkvm.handle, &selftest_vm); 760 + return &selftest_vcpu; 761 + } 762 + 763 + void teardown_selftest_vm(void) 764 + { 765 + hyp_spin_lock(&vm_table_lock); 766 + remove_vm_table_entry(selftest_vm.kvm.arch.pkvm.handle); 767 + hyp_spin_unlock(&vm_table_lock); 768 + } 769 + #endif /* CONFIG_NVHE_EL2_DEBUG */ 770 + 730 771 /* 731 772 * Initialize the hypervisor copy of the VM state using host-donated memory. 732 773 * ··· 916 859 unmap_donated_memory_noclear(addr, size); 917 860 } 918 861 919 - int __pkvm_teardown_vm(pkvm_handle_t handle) 862 + int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn) 863 + { 864 + struct pkvm_hyp_vm *hyp_vm = get_pkvm_hyp_vm(handle); 865 + int ret = -EINVAL; 866 + 867 + if (!hyp_vm) 868 + return ret; 869 + 870 + if (hyp_vm->kvm.arch.pkvm.is_dying) 871 + ret = __pkvm_host_reclaim_page_guest(gfn, hyp_vm); 872 + 873 + put_pkvm_hyp_vm(hyp_vm); 874 + return ret; 875 + } 876 + 877 + static struct pkvm_hyp_vm *get_pkvm_unref_hyp_vm_locked(pkvm_handle_t handle) 878 + { 879 + struct pkvm_hyp_vm *hyp_vm; 880 + 881 + hyp_assert_lock_held(&vm_table_lock); 882 + 883 + hyp_vm = get_vm_by_handle(handle); 884 + if (!hyp_vm || hyp_page_count(hyp_vm)) 885 + return NULL; 886 + 887 + return hyp_vm; 888 + } 889 + 890 + int __pkvm_start_teardown_vm(pkvm_handle_t handle) 891 + { 892 + struct pkvm_hyp_vm *hyp_vm; 893 + int ret = 0; 894 + 895 + hyp_spin_lock(&vm_table_lock); 896 + hyp_vm = get_pkvm_unref_hyp_vm_locked(handle); 897 + if (!hyp_vm || hyp_vm->kvm.arch.pkvm.is_dying) { 898 + ret = -EINVAL; 899 + goto unlock; 900 + } 901 + 902 + hyp_vm->kvm.arch.pkvm.is_dying = true; 903 + unlock: 904 + hyp_spin_unlock(&vm_table_lock); 905 + 906 + return ret; 907 + } 908 + 909 + int __pkvm_finalize_teardown_vm(pkvm_handle_t handle) 920 910 { 921 911 
struct kvm_hyp_memcache *mc, *stage2_mc; 922 912 struct pkvm_hyp_vm *hyp_vm; ··· 973 869 int err; 974 870 975 871 hyp_spin_lock(&vm_table_lock); 976 - hyp_vm = get_vm_by_handle(handle); 977 - if (!hyp_vm) { 978 - err = -ENOENT; 979 - goto err_unlock; 980 - } 981 - 982 - if (WARN_ON(hyp_page_count(hyp_vm))) { 983 - err = -EBUSY; 872 + hyp_vm = get_pkvm_unref_hyp_vm_locked(handle); 873 + if (!hyp_vm || !hyp_vm->kvm.arch.pkvm.is_dying) { 874 + err = -EINVAL; 984 875 goto err_unlock; 985 876 } 986 877 ··· 1020 921 err_unlock: 1021 922 hyp_spin_unlock(&vm_table_lock); 1022 923 return err; 924 + } 925 + 926 + static u64 __pkvm_memshare_page_req(struct kvm_vcpu *vcpu, u64 ipa) 927 + { 928 + u64 elr; 929 + 930 + /* Fake up a data abort (level 3 translation fault on write) */ 931 + vcpu->arch.fault.esr_el2 = (ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT) | 932 + ESR_ELx_WNR | ESR_ELx_FSC_FAULT | 933 + FIELD_PREP(ESR_ELx_FSC_LEVEL, 3); 934 + 935 + /* Shuffle the IPA around into the HPFAR */ 936 + vcpu->arch.fault.hpfar_el2 = (HPFAR_EL2_NS | (ipa >> 8)) & HPFAR_MASK; 937 + 938 + /* This is a virtual address. 0's good. Let's go with 0. */ 939 + vcpu->arch.fault.far_el2 = 0; 940 + 941 + /* Rewind the ELR so we return to the HVC once the IPA is mapped */ 942 + elr = read_sysreg(elr_el2); 943 + elr -= 4; 944 + write_sysreg(elr, elr_el2); 945 + 946 + return ARM_EXCEPTION_TRAP; 947 + } 948 + 949 + static bool pkvm_memshare_call(u64 *ret, struct kvm_vcpu *vcpu, u64 *exit_code) 950 + { 951 + struct pkvm_hyp_vcpu *hyp_vcpu; 952 + u64 ipa = smccc_get_arg1(vcpu); 953 + 954 + if (!PAGE_ALIGNED(ipa)) 955 + goto out_guest; 956 + 957 + hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu); 958 + switch (__pkvm_guest_share_host(hyp_vcpu, hyp_phys_to_pfn(ipa))) { 959 + case 0: 960 + ret[0] = SMCCC_RET_SUCCESS; 961 + goto out_guest; 962 + case -ENOENT: 963 + /* 964 + * Convert the exception into a data abort so that the page 965 + * being shared is mapped into the guest next time. 
966 + */ 967 + *exit_code = __pkvm_memshare_page_req(vcpu, ipa); 968 + goto out_host; 969 + } 970 + 971 + out_guest: 972 + return true; 973 + out_host: 974 + return false; 975 + } 976 + 977 + static void pkvm_memunshare_call(u64 *ret, struct kvm_vcpu *vcpu) 978 + { 979 + struct pkvm_hyp_vcpu *hyp_vcpu; 980 + u64 ipa = smccc_get_arg1(vcpu); 981 + 982 + if (!PAGE_ALIGNED(ipa)) 983 + return; 984 + 985 + hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu); 986 + if (!__pkvm_guest_unshare_host(hyp_vcpu, hyp_phys_to_pfn(ipa))) 987 + ret[0] = SMCCC_RET_SUCCESS; 988 + } 989 + 990 + /* 991 + * Handler for protected VM HVC calls. 992 + * 993 + * Returns true if the hypervisor has handled the exit (and control 994 + * should return to the guest) or false if it hasn't (and the handling 995 + * should be performed by the host). 996 + */ 997 + bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code) 998 + { 999 + u64 val[4] = { SMCCC_RET_INVALID_PARAMETER }; 1000 + bool handled = true; 1001 + 1002 + switch (smccc_get_function(vcpu)) { 1003 + case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: 1004 + val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); 1005 + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_HYP_MEMINFO); 1006 + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_SHARE); 1007 + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_UNSHARE); 1008 + break; 1009 + case ARM_SMCCC_VENDOR_HYP_KVM_HYP_MEMINFO_FUNC_ID: 1010 + if (smccc_get_arg1(vcpu) || 1011 + smccc_get_arg2(vcpu) || 1012 + smccc_get_arg3(vcpu)) { 1013 + break; 1014 + } 1015 + 1016 + val[0] = PAGE_SIZE; 1017 + break; 1018 + case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID: 1019 + if (smccc_get_arg2(vcpu) || 1020 + smccc_get_arg3(vcpu)) { 1021 + break; 1022 + } 1023 + 1024 + handled = pkvm_memshare_call(val, vcpu, exit_code); 1025 + break; 1026 + case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID: 1027 + if (smccc_get_arg2(vcpu) || 1028 + smccc_get_arg3(vcpu)) { 1029 + break; 1030 + } 1031 + 1032 + pkvm_memunshare_call(val, vcpu); 1033 + break; 1034 + 
default: 1035 + /* Punt everything else back to the host, for now. */ 1036 + handled = false; 1037 + } 1038 + 1039 + if (handled) 1040 + smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]); 1041 + return handled; 1023 1042 }
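
The control flow of `pkvm_memshare_call()` above can be hard to follow in the flattened hunk, so here is a stand-alone sketch of its three outcomes. Everything prefixed `sk_` is hypothetical: `struct sk_vcpu` models only the saved return address and the result of the share attempt, and the return-code values mirror the SMCCC definitions as an assumption rather than being taken from this diff.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SIZE			4096ULL	/* assumption: 4 KiB pages */
#define SK_RET_SUCCESS			0L	/* assumed SMCCC_RET_SUCCESS */
#define SK_RET_INVALID_PARAMETER	(-3L)	/* assumed SMCCC_RET_INVALID_PARAMETER */

/* Hypothetical model of the state the handler touches. */
struct sk_vcpu {
	uint64_t elr;		/* return address, just past the HVC */
	int share_result;	/* what the share attempt would return */
};

/*
 * Returns true when the guest can be resumed directly, false when the
 * exit must be forwarded to the host. '*ret' is expected to start out
 * as SK_RET_INVALID_PARAMETER, as val[0] does in the real handler.
 */
static bool sk_memshare_call(long *ret, struct sk_vcpu *vcpu, uint64_t ipa)
{
	/* Unaligned IPA: resume the guest with the default error code. */
	if (ipa & (SK_PAGE_SIZE - 1))
		return true;

	switch (vcpu->share_result) {
	case 0:
		*ret = SK_RET_SUCCESS;
		return true;
	case -ENOENT:
		/*
		 * Page not mapped yet: step the return address back by one
		 * 4-byte instruction so the HVC is retried once the host
		 * has mapped the page.
		 */
		vcpu->elr -= 4;
		return false;
	}

	/* Any other error is reported to the guest as an invalid parameter. */
	return true;
}
```

The `-ENOENT` branch is the only one that leaves the hypervisor for the host. In the real code that branch also fakes up a data abort, which this sketch leaves out.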

+29 -16

arch/arm64/kvm/hyp/nvhe/psci-relay.c

···
6 6
7 7 #include <asm/kvm_asm.h>
8 8 #include <asm/kvm_hyp.h>
9 + #include <asm/kvm_hypevents.h>
9 10 #include <asm/kvm_mmu.h>
10 - #include <linux/arm-smccc.h>
11 11 #include <linux/kvm_host.h>
12 12 #include <uapi/linux/psci.h>
13 13
14 + #include <nvhe/arm-smccc.h>
14 15 #include <nvhe/memory.h>
15 16 #include <nvhe/trap_handler.h>
16 17
···
66 65 {
67 66 	struct arm_smccc_res res;
68 67
69 - 	arm_smccc_1_1_smc(fn, arg0, arg1, arg2, &res);
68 + 	hyp_smccc_1_1_smc(fn, arg0, arg1, arg2, &res);
70 69 	return res.a0;
71 70 }
72 71
···
201 200 			  __hyp_pa(init_params), 0);
202 201 }
203 202
204 - asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on)
203 + static void __noreturn __kvm_host_psci_cpu_entry(unsigned long pc, unsigned long r0)
205 204 {
206 - 	struct psci_boot_args *boot_args;
207 - 	struct kvm_cpu_context *host_ctxt;
205 + 	struct kvm_cpu_context *host_ctxt = host_data_ptr(host_ctxt);
208 206
209 - 	host_ctxt = host_data_ptr(host_ctxt);
207 + 	trace_hyp_enter(host_ctxt, HYP_REASON_PSCI);
210 208
211 - 	if (is_cpu_on)
212 - 		boot_args = this_cpu_ptr(&cpu_on_args);
213 - 	else
214 - 		boot_args = this_cpu_ptr(&suspend_args);
215 -
216 - 	cpu_reg(host_ctxt, 0) = boot_args->r0;
217 - 	write_sysreg_el2(boot_args->pc, SYS_ELR);
218 -
219 - 	if (is_cpu_on)
220 - 		release_boot_args(boot_args);
209 + 	cpu_reg(host_ctxt, 0) = r0;
210 + 	write_sysreg_el2(pc, SYS_ELR);
221 211
222 212 	write_sysreg_el1(INIT_SCTLR_EL1_MMU_OFF, SYS_SCTLR);
223 213 	write_sysreg(INIT_PSTATE_EL1, SPSR_EL2);
224 214
215 + 	trace_hyp_exit(host_ctxt, HYP_REASON_PSCI);
225 216 	__host_enter(host_ctxt);
217 + }
218 +
219 + asmlinkage void __noreturn __kvm_host_psci_cpu_on_entry(void)
220 + {
221 + 	struct psci_boot_args *boot_args = this_cpu_ptr(&cpu_on_args);
222 + 	unsigned long pc, r0;
223 +
224 + 	pc = READ_ONCE(boot_args->pc);
225 + 	r0 = READ_ONCE(boot_args->r0);
226 +
227 + 	release_boot_args(boot_args);
228 +
229 + 	__kvm_host_psci_cpu_entry(pc, r0);
230 + }
231 +
232 + asmlinkage void __noreturn __kvm_host_psci_cpu_resume_entry(void)
233 + {
234 + 	struct psci_boot_args *boot_args = this_cpu_ptr(&suspend_args);
235 +
236 + 	__kvm_host_psci_cpu_entry(boot_args->pc, boot_args->r0);
226 237 }
227 238
228 239 static unsigned long psci_0_1_handler(u64 func_id, struct kvm_cpu_context *host_ctxt)

+1 -3

arch/arm64/kvm/hyp/nvhe/setup.c

···
341 341 	__host_enter(host_ctxt);
342 342 }
343 343
344 - int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
345 - 		unsigned long *per_cpu_base, u32 hyp_va_bits)
344 + int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long *per_cpu_base, u32 hyp_va_bits)
346 345 {
347 346 	struct kvm_nvhe_init_params *params;
348 347 	void *virt = hyp_phys_to_virt(phys);
···
354 355 		return -EINVAL;
355 356
356 357 	hyp_spin_lock_init(&pkvm_pgd_lock);
357 - 	hyp_nr_cpus = nr_cpus;
358 358
359 359 	ret = divide_memory_pool(virt, size);
360 360 	if (ret)

+3 -3

arch/arm64/kvm/hyp/nvhe/stacktrace.c

···
34 34 	stacktrace_info->pc = pc;
35 35 }
36 36
37 - #ifdef CONFIG_PROTECTED_NVHE_STACKTRACE
37 + #ifdef CONFIG_PKVM_STACKTRACE
38 38 #include <asm/stacktrace/nvhe.h>
39 39
40 40 DEFINE_PER_CPU(unsigned long [NVHE_STACKTRACE_SIZE/sizeof(long)], pkvm_stacktrace);
···
134 134
135 135 	unwind(&state, pkvm_save_backtrace_entry, &idx);
136 136 }
137 - #else /* !CONFIG_PROTECTED_NVHE_STACKTRACE */
137 + #else /* !CONFIG_PKVM_STACKTRACE */
138 138 static void pkvm_save_backtrace(unsigned long fp, unsigned long pc)
139 139 {
140 140 }
141 - #endif /* CONFIG_PROTECTED_NVHE_STACKTRACE */
141 + #endif /* CONFIG_PKVM_STACKTRACE */
142 142
143 143 /*
144 144  * kvm_nvhe_prepare_backtrace - prepare to dump the nVHE backtrace

+21 -2

arch/arm64/kvm/hyp/nvhe/switch.c

···
7 7 #include <hyp/switch.h>
8 8 #include <hyp/sysreg-sr.h>
9 9
10 - #include <linux/arm-smccc.h>
11 10 #include <linux/kvm_host.h>
12 11 #include <linux/types.h>
13 12 #include <linux/jump_label.h>
···
20 21 #include <asm/kvm_asm.h>
21 22 #include <asm/kvm_emulate.h>
22 23 #include <asm/kvm_hyp.h>
24 + #include <asm/kvm_hypevents.h>
23 25 #include <asm/kvm_mmu.h>
24 26 #include <asm/fpsimd.h>
25 27 #include <asm/debug-monitors.h>
···
44 44 struct fgt_masks hfgitr2_masks;
45 45 struct fgt_masks hdfgrtr2_masks;
46 46 struct fgt_masks hdfgwtr2_masks;
47 + struct fgt_masks ich_hfgrtr_masks;
48 + struct fgt_masks ich_hfgwtr_masks;
49 + struct fgt_masks ich_hfgitr_masks;
47 50
48 51 extern void kvm_nvhe_prepare_backtrace(unsigned long fp, unsigned long pc);
49 52
···
113 110 /* Save VGICv3 state on non-VHE systems */
114 111 static void __hyp_vgic_save_state(struct kvm_vcpu *vcpu)
115 112 {
113 + 	if (vgic_is_v5(kern_hyp_va(vcpu->kvm))) {
114 + 		__vgic_v5_save_state(&vcpu->arch.vgic_cpu.vgic_v5);
115 + 		__vgic_v5_save_ppi_state(&vcpu->arch.vgic_cpu.vgic_v5);
116 + 		return;
117 + 	}
118 +
116 119 	if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) {
117 120 		__vgic_v3_save_state(&vcpu->arch.vgic_cpu.vgic_v3);
118 121 		__vgic_v3_deactivate_traps(&vcpu->arch.vgic_cpu.vgic_v3);
···
128 119 /* Restore VGICv3 state on non-VHE systems */
129 120 static void __hyp_vgic_restore_state(struct kvm_vcpu *vcpu)
130 121 {
122 + 	if (vgic_is_v5(kern_hyp_va(vcpu->kvm))) {
123 + 		__vgic_v5_restore_state(&vcpu->arch.vgic_cpu.vgic_v5);
124 + 		__vgic_v5_restore_ppi_state(&vcpu->arch.vgic_cpu.vgic_v5);
125 + 		return;
126 + 	}
127 +
131 128 	if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) {
132 129 		__vgic_v3_activate_traps(&vcpu->arch.vgic_cpu.vgic_v3);
133 130 		__vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3);
···
205 190
206 191 static const exit_handler_fn pvm_exit_handlers[] = {
207 192 	[0 ... ESR_ELx_EC_MAX] = NULL,
193 + 	[ESR_ELx_EC_HVC64] = kvm_handle_pvm_hvc64,
208 194 	[ESR_ELx_EC_SYS64] = kvm_handle_pvm_sys64,
209 195 	[ESR_ELx_EC_SVE] = kvm_handle_pvm_restricted,
210 196 	[ESR_ELx_EC_FP_ASIMD] = kvm_hyp_handle_fpsimd,
···
294 278 	 * We're about to restore some new MMU state. Make sure
295 279 	 * ongoing page-table walks that have started before we
296 280 	 * trapped to EL2 have completed. This also synchronises the
297 - 	 * above disabling of BRBE, SPE and TRBE.
281 + 	 * above disabling of BRBE.
298 282 	 *
299 283 	 * See DDI0487I.a D8.1.5 "Out-of-context translation regimes",
300 284 	 * rule R_LFHQG and subsequent information statements.
···
324 308 	__debug_switch_to_guest(vcpu);
325 309
326 310 	do {
311 + 		trace_hyp_exit(host_ctxt, HYP_REASON_ERET_GUEST);
312 +
327 313 		/* Jump in the fire! */
328 314 		exit_code = __guest_enter(vcpu);
329 315
330 316 		/* And we're baaack! */
317 + 		trace_hyp_enter(host_ctxt, HYP_REASON_GUEST_EXIT);
331 318 	} while (fixup_guest_exit(vcpu, &exit_code));
332 319
333 320 	__sysreg_save_state_nvhe(guest_ctxt);

+17 -1

arch/arm64/kvm/hyp/nvhe/sys_regs.c

··· 20 20 */ 21 21 u64 id_aa64pfr0_el1_sys_val; 22 22 u64 id_aa64pfr1_el1_sys_val; 23 + u64 id_aa64pfr2_el1_sys_val; 23 24 u64 id_aa64isar0_el1_sys_val; 24 25 u64 id_aa64isar1_el1_sys_val; 25 26 u64 id_aa64isar2_el1_sys_val; ··· 106 105 MAX_FEAT(ID_AA64PFR1_EL1, BT, IMP), 107 106 MAX_FEAT(ID_AA64PFR1_EL1, SSBS, SSBS2), 108 107 MAX_FEAT_ENUM(ID_AA64PFR1_EL1, MTE_frac, NI), 108 + FEAT_END 109 + }; 110 + 111 + static const struct pvm_ftr_bits pvmid_aa64pfr2[] = { 112 + MAX_FEAT(ID_AA64PFR2_EL1, GCIE, NI), 109 113 FEAT_END 110 114 }; 111 115 ··· 227 221 return get_restricted_features(vcpu, id_aa64pfr0_el1_sys_val, pvmid_aa64pfr0); 228 222 case SYS_ID_AA64PFR1_EL1: 229 223 return get_restricted_features(vcpu, id_aa64pfr1_el1_sys_val, pvmid_aa64pfr1); 224 + case SYS_ID_AA64PFR2_EL1: 225 + return get_restricted_features(vcpu, id_aa64pfr2_el1_sys_val, pvmid_aa64pfr2); 230 226 case SYS_ID_AA64ISAR0_EL1: 231 227 return id_aa64isar0_el1_sys_val; 232 228 case SYS_ID_AA64ISAR1_EL1: ··· 400 392 /* Cache maintenance by set/way operations are restricted. */ 401 393 402 394 /* Debug and Trace Registers are restricted. */ 395 + RAZ_WI(SYS_DBGBVRn_EL1(0)), 396 + RAZ_WI(SYS_DBGBCRn_EL1(0)), 397 + RAZ_WI(SYS_DBGWVRn_EL1(0)), 398 + RAZ_WI(SYS_DBGWCRn_EL1(0)), 399 + RAZ_WI(SYS_MDSCR_EL1), 400 + RAZ_WI(SYS_OSLAR_EL1), 401 + RAZ_WI(SYS_OSLSR_EL1), 402 + RAZ_WI(SYS_OSDLR_EL1), 403 403 404 404 /* Group 1 ID registers */ 405 405 HOST_HANDLED(SYS_REVIDR_EL1), ··· 447 431 /* CRm=4 */ 448 432 AARCH64(SYS_ID_AA64PFR0_EL1), 449 433 AARCH64(SYS_ID_AA64PFR1_EL1), 450 - ID_UNALLOCATED(4,2), 434 + AARCH64(SYS_ID_AA64PFR2_EL1), 451 435 ID_UNALLOCATED(4,3), 452 436 AARCH64(SYS_ID_AA64ZFR0_EL1), 453 437 ID_UNALLOCATED(4,5),

+306

arch/arm64/kvm/hyp/nvhe/trace.c

··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * Copyright (C) 2025 Google LLC 4 + * Author: Vincent Donnefort <vdonnefort@google.com> 5 + */ 6 + 7 + #include <nvhe/clock.h> 8 + #include <nvhe/mem_protect.h> 9 + #include <nvhe/mm.h> 10 + #include <nvhe/trace.h> 11 + 12 + #include <asm/percpu.h> 13 + #include <asm/kvm_mmu.h> 14 + #include <asm/local.h> 15 + 16 + #include "simple_ring_buffer.c" 17 + 18 + static DEFINE_PER_CPU(struct simple_rb_per_cpu, __simple_rbs); 19 + 20 + static struct hyp_trace_buffer { 21 + struct simple_rb_per_cpu __percpu *simple_rbs; 22 + void *bpages_backing_start; 23 + size_t bpages_backing_size; 24 + hyp_spinlock_t lock; 25 + } trace_buffer = { 26 + .simple_rbs = &__simple_rbs, 27 + .lock = __HYP_SPIN_LOCK_UNLOCKED, 28 + }; 29 + 30 + static bool hyp_trace_buffer_loaded(struct hyp_trace_buffer *trace_buffer) 31 + { 32 + return trace_buffer->bpages_backing_size > 0; 33 + } 34 + 35 + void *tracing_reserve_entry(unsigned long length) 36 + { 37 + return simple_ring_buffer_reserve(this_cpu_ptr(trace_buffer.simple_rbs), length, 38 + trace_clock()); 39 + } 40 + 41 + void tracing_commit_entry(void) 42 + { 43 + simple_ring_buffer_commit(this_cpu_ptr(trace_buffer.simple_rbs)); 44 + } 45 + 46 + static int __admit_host_mem(void *start, u64 size) 47 + { 48 + if (!PAGE_ALIGNED(start) || !PAGE_ALIGNED(size) || !size) 49 + return -EINVAL; 50 + 51 + if (!is_protected_kvm_enabled()) 52 + return 0; 53 + 54 + return __pkvm_host_donate_hyp(hyp_virt_to_pfn(start), size >> PAGE_SHIFT); 55 + } 56 + 57 + static void __release_host_mem(void *start, u64 size) 58 + { 59 + if (!is_protected_kvm_enabled()) 60 + return; 61 + 62 + WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(start), size >> PAGE_SHIFT)); 63 + } 64 + 65 + static int hyp_trace_buffer_load_bpage_backing(struct hyp_trace_buffer *trace_buffer, 66 + struct hyp_trace_desc *desc) 67 + { 68 + void *start = (void *)kern_hyp_va(desc->bpages_backing_start); 69 + size_t size = 
desc->bpages_backing_size; 70 + int ret; 71 + 72 + ret = __admit_host_mem(start, size); 73 + if (ret) 74 + return ret; 75 + 76 + memset(start, 0, size); 77 + 78 + trace_buffer->bpages_backing_start = start; 79 + trace_buffer->bpages_backing_size = size; 80 + 81 + return 0; 82 + } 83 + 84 + static void hyp_trace_buffer_unload_bpage_backing(struct hyp_trace_buffer *trace_buffer) 85 + { 86 + void *start = trace_buffer->bpages_backing_start; 87 + size_t size = trace_buffer->bpages_backing_size; 88 + 89 + if (!size) 90 + return; 91 + 92 + memset(start, 0, size); 93 + 94 + __release_host_mem(start, size); 95 + 96 + trace_buffer->bpages_backing_start = 0; 97 + trace_buffer->bpages_backing_size = 0; 98 + } 99 + 100 + static void *__pin_shared_page(unsigned long kern_va) 101 + { 102 + void *va = kern_hyp_va((void *)kern_va); 103 + 104 + if (!is_protected_kvm_enabled()) 105 + return va; 106 + 107 + return hyp_pin_shared_mem(va, va + PAGE_SIZE) ? NULL : va; 108 + } 109 + 110 + static void __unpin_shared_page(void *va) 111 + { 112 + if (!is_protected_kvm_enabled()) 113 + return; 114 + 115 + hyp_unpin_shared_mem(va, va + PAGE_SIZE); 116 + } 117 + 118 + static void hyp_trace_buffer_unload(struct hyp_trace_buffer *trace_buffer) 119 + { 120 + int cpu; 121 + 122 + hyp_assert_lock_held(&trace_buffer->lock); 123 + 124 + if (!hyp_trace_buffer_loaded(trace_buffer)) 125 + return; 126 + 127 + for (cpu = 0; cpu < hyp_nr_cpus; cpu++) 128 + simple_ring_buffer_unload_mm(per_cpu_ptr(trace_buffer->simple_rbs, cpu), 129 + __unpin_shared_page); 130 + 131 + hyp_trace_buffer_unload_bpage_backing(trace_buffer); 132 + } 133 + 134 + static int hyp_trace_buffer_load(struct hyp_trace_buffer *trace_buffer, 135 + struct hyp_trace_desc *desc) 136 + { 137 + struct simple_buffer_page *bpages; 138 + struct ring_buffer_desc *rb_desc; 139 + int ret, cpu; 140 + 141 + hyp_assert_lock_held(&trace_buffer->lock); 142 + 143 + if (hyp_trace_buffer_loaded(trace_buffer)) 144 + return -EINVAL; 145 + 146 + ret = 
hyp_trace_buffer_load_bpage_backing(trace_buffer, desc); 147 + if (ret) 148 + return ret; 149 + 150 + bpages = trace_buffer->bpages_backing_start; 151 + for_each_ring_buffer_desc(rb_desc, cpu, &desc->trace_buffer_desc) { 152 + ret = simple_ring_buffer_init_mm(per_cpu_ptr(trace_buffer->simple_rbs, cpu), 153 + bpages, rb_desc, __pin_shared_page, 154 + __unpin_shared_page); 155 + if (ret) 156 + break; 157 + 158 + bpages += rb_desc->nr_page_va; 159 + } 160 + 161 + if (ret) 162 + hyp_trace_buffer_unload(trace_buffer); 163 + 164 + return ret; 165 + } 166 + 167 + static bool hyp_trace_desc_validate(struct hyp_trace_desc *desc, size_t desc_size) 168 + { 169 + struct ring_buffer_desc *rb_desc; 170 + unsigned int cpu; 171 + size_t nr_bpages; 172 + void *desc_end; 173 + 174 + /* 175 + * Both desc_size and bpages_backing_size are untrusted host-provided 176 + * values. We rely on __pkvm_host_donate_hyp() to enforce their validity. 177 + */ 178 + desc_end = (void *)desc + desc_size; 179 + nr_bpages = desc->bpages_backing_size / sizeof(struct simple_buffer_page); 180 + 181 + for_each_ring_buffer_desc(rb_desc, cpu, &desc->trace_buffer_desc) { 182 + /* Can we read nr_page_va? */ 183 + if ((void *)rb_desc + struct_size(rb_desc, page_va, 0) > desc_end) 184 + return false; 185 + 186 + /* Overflow desc? */ 187 + if ((void *)rb_desc + struct_size(rb_desc, page_va, rb_desc->nr_page_va) > desc_end) 188 + return false; 189 + 190 + /* Overflow bpages backing memory? 
*/ 191 + if (nr_bpages < rb_desc->nr_page_va) 192 + return false; 193 + 194 + if (cpu >= hyp_nr_cpus) 195 + return false; 196 + 197 + if (cpu != rb_desc->cpu) 198 + return false; 199 + 200 + nr_bpages -= rb_desc->nr_page_va; 201 + } 202 + 203 + return true; 204 + } 205 + 206 + int __tracing_load(unsigned long desc_hva, size_t desc_size) 207 + { 208 + struct hyp_trace_desc *desc = (struct hyp_trace_desc *)kern_hyp_va(desc_hva); 209 + int ret; 210 + 211 + ret = __admit_host_mem(desc, desc_size); 212 + if (ret) 213 + return ret; 214 + 215 + if (!hyp_trace_desc_validate(desc, desc_size)) 216 + goto err_release_desc; 217 + 218 + hyp_spin_lock(&trace_buffer.lock); 219 + 220 + ret = hyp_trace_buffer_load(&trace_buffer, desc); 221 + 222 + hyp_spin_unlock(&trace_buffer.lock); 223 + 224 + err_release_desc: 225 + __release_host_mem(desc, desc_size); 226 + return ret; 227 + } 228 + 229 + void __tracing_unload(void) 230 + { 231 + hyp_spin_lock(&trace_buffer.lock); 232 + hyp_trace_buffer_unload(&trace_buffer); 233 + hyp_spin_unlock(&trace_buffer.lock); 234 + } 235 + 236 + int __tracing_enable(bool enable) 237 + { 238 + int cpu, ret = enable ? 
-EINVAL : 0; 239 + 240 + hyp_spin_lock(&trace_buffer.lock); 241 + 242 + if (!hyp_trace_buffer_loaded(&trace_buffer)) 243 + goto unlock; 244 + 245 + for (cpu = 0; cpu < hyp_nr_cpus; cpu++) 246 + simple_ring_buffer_enable_tracing(per_cpu_ptr(trace_buffer.simple_rbs, cpu), 247 + enable); 248 + 249 + ret = 0; 250 + 251 + unlock: 252 + hyp_spin_unlock(&trace_buffer.lock); 253 + 254 + return ret; 255 + } 256 + 257 + int __tracing_swap_reader(unsigned int cpu) 258 + { 259 + int ret = -ENODEV; 260 + 261 + if (cpu >= hyp_nr_cpus) 262 + return -EINVAL; 263 + 264 + hyp_spin_lock(&trace_buffer.lock); 265 + 266 + if (hyp_trace_buffer_loaded(&trace_buffer)) 267 + ret = simple_ring_buffer_swap_reader_page( 268 + per_cpu_ptr(trace_buffer.simple_rbs, cpu)); 269 + 270 + hyp_spin_unlock(&trace_buffer.lock); 271 + 272 + return ret; 273 + } 274 + 275 + void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) 276 + { 277 + int cpu; 278 + 279 + /* After this loop, all CPUs are observing the new bank... */ 280 + for (cpu = 0; cpu < hyp_nr_cpus; cpu++) { 281 + struct simple_rb_per_cpu *simple_rb = per_cpu_ptr(trace_buffer.simple_rbs, cpu); 282 + 283 + while (READ_ONCE(simple_rb->status) == SIMPLE_RB_WRITING) 284 + ; 285 + } 286 + 287 + /* ...we can now override the old one and swap. */ 288 + trace_clock_update(mult, shift, epoch_ns, epoch_cyc); 289 + } 290 + 291 + int __tracing_reset(unsigned int cpu) 292 + { 293 + int ret = -ENODEV; 294 + 295 + if (cpu >= hyp_nr_cpus) 296 + return -EINVAL; 297 + 298 + hyp_spin_lock(&trace_buffer.lock); 299 + 300 + if (hyp_trace_buffer_loaded(&trace_buffer)) 301 + ret = simple_ring_buffer_reset(per_cpu_ptr(trace_buffer.simple_rbs, cpu)); 302 + 303 + hyp_spin_unlock(&trace_buffer.lock); 304 + 305 + return ret; 306 + }
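
`hyp_trace_desc_validate()` in this file walks a host-provided descriptor made of variable-length per-CPU records and rejects it if any record runs past the end of the descriptor or asks for more buffer pages than the backing memory holds. The sketch below illustrates that kind of walk in plain C. It is not the kernel code: `struct sk_rb_desc`, `sk_validate()` and the explicit `nr_descs` argument are hypothetical, and the bounds checks use remaining-length subtraction where the original uses `struct_size()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU record: a header followed by nr_page_va entries. */
struct sk_rb_desc {
	int cpu;
	unsigned int nr_page_va;
	unsigned long page_va[];
};

/*
 * Validate 'nr_descs' back-to-back records stored in buf[0..size).
 * 'nr_bpages' is the number of buffer-page slots available in total.
 */
static bool sk_validate(const void *buf, size_t size, unsigned int nr_descs,
			unsigned int nr_cpus, size_t nr_bpages)
{
	const char *p = buf, *end = p + size;
	unsigned int i;

	for (i = 0; i < nr_descs; i++) {
		const struct sk_rb_desc *d = (const void *)p;
		size_t left = (size_t)(end - p);

		/* Can the header (and so nr_page_va) be read at all? */
		if (left < sizeof(*d))
			return false;

		/* Does page_va[] stay inside the descriptor? */
		if (d->nr_page_va > (left - sizeof(*d)) / sizeof(d->page_va[0]))
			return false;

		/* Are enough backing pages left for this CPU? */
		if (nr_bpages < d->nr_page_va)
			return false;

		/* Records must be ordered by CPU and name a valid CPU. */
		if (i >= nr_cpus || d->cpu != (int)i)
			return false;

		nr_bpages -= d->nr_page_va;
		p += sizeof(*d) + d->nr_page_va * sizeof(d->page_va[0]);
	}

	return true;
}
```

Comparing against the remaining length, rather than adding to a pointer and comparing afterwards, keeps every intermediate value in range even when `nr_page_va` is hostile. That is a choice made for this sketch, not a claim about how the kernel helper behaves.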

+20 -13

arch/arm64/kvm/hyp/pgtable.c

··· 114 114 return pte; 115 115 } 116 116 117 - static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id) 118 - { 119 - return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id); 120 - } 121 - 122 117 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, 123 118 const struct kvm_pgtable_visit_ctx *ctx, 124 119 enum kvm_pgtable_walk_flags visit) ··· 576 581 struct stage2_map_data { 577 582 const u64 phys; 578 583 kvm_pte_t attr; 579 - u8 owner_id; 584 + kvm_pte_t pte_annot; 580 585 581 586 kvm_pte_t *anchor; 582 587 kvm_pte_t *childp; ··· 793 798 794 799 static bool stage2_pte_is_locked(kvm_pte_t pte) 795 800 { 796 - return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED); 801 + if (kvm_pte_valid(pte)) 802 + return false; 803 + 804 + return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) == 805 + KVM_INVALID_PTE_TYPE_LOCKED; 797 806 } 798 807 799 808 static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new) ··· 828 829 struct kvm_s2_mmu *mmu) 829 830 { 830 831 struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops; 832 + kvm_pte_t locked_pte; 831 833 832 834 if (stage2_pte_is_locked(ctx->old)) { 833 835 /* ··· 839 839 return false; 840 840 } 841 841 842 - if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED)) 842 + locked_pte = FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK, 843 + KVM_INVALID_PTE_TYPE_LOCKED); 844 + if (!stage2_try_set_pte(ctx, locked_pte)) 843 845 return false; 844 846 845 847 if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) { ··· 966 964 if (!data->annotation) 967 965 new = kvm_init_valid_leaf_pte(phys, data->attr, ctx->level); 968 966 else 969 - new = kvm_init_invalid_leaf_owner(data->owner_id); 967 + new = data->pte_annot; 970 968 971 969 /* 972 970 * Skip updating the PTE if we are trying to recreate the exact ··· 1120 1118 return ret; 1121 1119 } 1122 1120 1123 - int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size, 1124 - void *mc, u8 owner_id) 1121 + int kvm_pgtable_stage2_annotate(struct kvm_pgtable 
*pgt, u64 addr, u64 size, 1122 + void *mc, enum kvm_invalid_pte_type type, 1123 + kvm_pte_t pte_annot) 1125 1124 { 1126 1125 int ret; 1127 1126 struct stage2_map_data map_data = { 1128 1127 .mmu = pgt->mmu, 1129 1128 .memcache = mc, 1130 - .owner_id = owner_id, 1131 1129 .force_pte = true, 1132 1130 .annotation = true, 1131 + .pte_annot = pte_annot | 1132 + FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK, type), 1133 1133 }; 1134 1134 struct kvm_pgtable_walker walker = { 1135 1135 .cb = stage2_map_walker, ··· 1140 1136 .arg = &map_data, 1141 1137 }; 1142 1138 1143 - if (owner_id > KVM_MAX_OWNER_ID) 1139 + if (pte_annot & ~KVM_INVALID_PTE_ANNOT_MASK) 1140 + return -EINVAL; 1141 + 1142 + if (!type || type == KVM_INVALID_PTE_TYPE_LOCKED) 1144 1143 return -EINVAL; 1145 1144 1146 1145 ret = kvm_pgtable_walk(pgt, addr, size, &walker);

+166

arch/arm64/kvm/hyp/vgic-v5-sr.c

···
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025, 2026 - Arm Ltd
+ */
+
+#include <linux/irqchip/arm-gic-v5.h>
+
+#include <asm/kvm_hyp.h>
+
+void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if)
+{
+	cpu_if->vgic_apr = read_sysreg_s(SYS_ICH_APR_EL2);
+}
+
+static void __vgic_v5_compat_mode_disable(void)
+{
+	sysreg_clear_set_s(SYS_ICH_VCTLR_EL2, ICH_VCTLR_EL2_V3, 0);
+	isb();
+}
+
+void __vgic_v5_restore_vmcr_apr(struct vgic_v5_cpu_if *cpu_if)
+{
+	__vgic_v5_compat_mode_disable();
+
+	write_sysreg_s(cpu_if->vgic_vmcr, SYS_ICH_VMCR_EL2);
+	write_sysreg_s(cpu_if->vgic_apr, SYS_ICH_APR_EL2);
+}
+
+void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if)
+{
+	/*
+	 * The following code assumes that the bitmap storage that we have for
+	 * PPIs is either 64 (architected PPIs, only) or 128 bits (architected &
+	 * impdef PPIs).
+	 */
+	BUILD_BUG_ON(VGIC_V5_NR_PRIVATE_IRQS % 64);
+
+	bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit,
+		     read_sysreg_s(SYS_ICH_PPI_ACTIVER0_EL2), 0, 64);
+	bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr,
+		     read_sysreg_s(SYS_ICH_PPI_PENDR0_EL2), 0, 64);
+
+	cpu_if->vgic_ppi_priorityr[0] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR0_EL2);
+	cpu_if->vgic_ppi_priorityr[1] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR1_EL2);
+	cpu_if->vgic_ppi_priorityr[2] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR2_EL2);
+	cpu_if->vgic_ppi_priorityr[3] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR3_EL2);
+	cpu_if->vgic_ppi_priorityr[4] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR4_EL2);
+	cpu_if->vgic_ppi_priorityr[5] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR5_EL2);
+	cpu_if->vgic_ppi_priorityr[6] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR6_EL2);
+	cpu_if->vgic_ppi_priorityr[7] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR7_EL2);
+
+	if (VGIC_V5_NR_PRIVATE_IRQS == 128) {
+		bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit,
+			     read_sysreg_s(SYS_ICH_PPI_ACTIVER1_EL2), 64, 64);
+		bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr,
+			     read_sysreg_s(SYS_ICH_PPI_PENDR1_EL2), 64, 64);
+
+		cpu_if->vgic_ppi_priorityr[8] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR8_EL2);
+		cpu_if->vgic_ppi_priorityr[9] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR9_EL2);
+		cpu_if->vgic_ppi_priorityr[10] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR10_EL2);
+		cpu_if->vgic_ppi_priorityr[11] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR11_EL2);
+		cpu_if->vgic_ppi_priorityr[12] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR12_EL2);
+		cpu_if->vgic_ppi_priorityr[13] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR13_EL2);
+		cpu_if->vgic_ppi_priorityr[14] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR14_EL2);
+		cpu_if->vgic_ppi_priorityr[15] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR15_EL2);
+	}
+
+	/* Now that we are done, disable DVI */
+	write_sysreg_s(0, SYS_ICH_PPI_DVIR0_EL2);
+	write_sysreg_s(0, SYS_ICH_PPI_DVIR1_EL2);
+}
+
+void __vgic_v5_restore_ppi_state(struct vgic_v5_cpu_if *cpu_if)
+{
+	DECLARE_BITMAP(pendr, VGIC_V5_NR_PRIVATE_IRQS);
+
+	/* We assume 64 or 128 PPIs - see above comment */
+	BUILD_BUG_ON(VGIC_V5_NR_PRIVATE_IRQS % 64);
+
+	/* Enable DVI so that the guest's interrupt config takes over */
+	write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_dvir, 0, 64),
+		       SYS_ICH_PPI_DVIR0_EL2);
+
+	write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_activer, 0, 64),
+		       SYS_ICH_PPI_ACTIVER0_EL2);
+	write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_enabler, 0, 64),
+		       SYS_ICH_PPI_ENABLER0_EL2);
+
+	/* Update the pending state of the NON-DVI'd PPIs, only */
+	bitmap_andnot(pendr, host_data_ptr(vgic_v5_ppi_state)->pendr,
+		      cpu_if->vgic_ppi_dvir, VGIC_V5_NR_PRIVATE_IRQS);
+	write_sysreg_s(bitmap_read(pendr, 0, 64), SYS_ICH_PPI_PENDR0_EL2);
+
+	write_sysreg_s(cpu_if->vgic_ppi_priorityr[0],
+		       SYS_ICH_PPI_PRIORITYR0_EL2);
+	write_sysreg_s(cpu_if->vgic_ppi_priorityr[1],
+		       SYS_ICH_PPI_PRIORITYR1_EL2);
+	write_sysreg_s(cpu_if->vgic_ppi_priorityr[2],
+		       SYS_ICH_PPI_PRIORITYR2_EL2);
+	write_sysreg_s(cpu_if->vgic_ppi_priorityr[3],
+		       SYS_ICH_PPI_PRIORITYR3_EL2);
+	write_sysreg_s(cpu_if->vgic_ppi_priorityr[4],
+		       SYS_ICH_PPI_PRIORITYR4_EL2);
+	write_sysreg_s(cpu_if->vgic_ppi_priorityr[5],
+		       SYS_ICH_PPI_PRIORITYR5_EL2);
+	write_sysreg_s(cpu_if->vgic_ppi_priorityr[6],
+		       SYS_ICH_PPI_PRIORITYR6_EL2);
+	write_sysreg_s(cpu_if->vgic_ppi_priorityr[7],
+		       SYS_ICH_PPI_PRIORITYR7_EL2);
+
+	if (VGIC_V5_NR_PRIVATE_IRQS == 128) {
+		/* Enable DVI so that the guest's interrupt config takes over */
+		write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_dvir, 64, 64),
+			       SYS_ICH_PPI_DVIR1_EL2);
+
+		write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_activer, 64, 64),
+			       SYS_ICH_PPI_ACTIVER1_EL2);
+		write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_enabler, 64, 64),
+			       SYS_ICH_PPI_ENABLER1_EL2);
+		write_sysreg_s(bitmap_read(pendr, 64, 64),
+			       SYS_ICH_PPI_PENDR1_EL2);
+
+		write_sysreg_s(cpu_if->vgic_ppi_priorityr[8],
+			       SYS_ICH_PPI_PRIORITYR8_EL2);
+		write_sysreg_s(cpu_if->vgic_ppi_priorityr[9],
+			       SYS_ICH_PPI_PRIORITYR9_EL2);
+		write_sysreg_s(cpu_if->vgic_ppi_priorityr[10],
+			       SYS_ICH_PPI_PRIORITYR10_EL2);
+		write_sysreg_s(cpu_if->vgic_ppi_priorityr[11],
+			       SYS_ICH_PPI_PRIORITYR11_EL2);
+		write_sysreg_s(cpu_if->vgic_ppi_priorityr[12],
+			       SYS_ICH_PPI_PRIORITYR12_EL2);
+		write_sysreg_s(cpu_if->vgic_ppi_priorityr[13],
+			       SYS_ICH_PPI_PRIORITYR13_EL2);
+		write_sysreg_s(cpu_if->vgic_ppi_priorityr[14],
+			       SYS_ICH_PPI_PRIORITYR14_EL2);
+		write_sysreg_s(cpu_if->vgic_ppi_priorityr[15],
+			       SYS_ICH_PPI_PRIORITYR15_EL2);
+	} else {
+		write_sysreg_s(0, SYS_ICH_PPI_DVIR1_EL2);
+
+		write_sysreg_s(0, SYS_ICH_PPI_ACTIVER1_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_ENABLER1_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_PENDR1_EL2);
+
+		write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR8_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR9_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR10_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR11_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR12_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR13_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR14_EL2);
+		write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR15_EL2);
+	}
+}
+
+void __vgic_v5_save_state(struct vgic_v5_cpu_if *cpu_if)
+{
+	cpu_if->vgic_vmcr = read_sysreg_s(SYS_ICH_VMCR_EL2);
+	cpu_if->vgic_icsr = read_sysreg_s(SYS_ICC_ICSR_EL1);
+}
+
+void __vgic_v5_restore_state(struct vgic_v5_cpu_if *cpu_if)
+{
+	write_sysreg_s(cpu_if->vgic_icsr, SYS_ICC_ICSR_EL1);
+}

+1 -1

arch/arm64/kvm/hyp/vhe/Makefile

···
 
 obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o
 obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
-	 ../fpsimd.o ../hyp-entry.o ../exception.o
+	 ../fpsimd.o ../hyp-entry.o ../exception.o ../vgic-v5-sr.o

+442

arch/arm64/kvm/hyp_trace.c

···
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 Google LLC
+ * Author: Vincent Donnefort <vdonnefort@google.com>
+ */
+
+#include <linux/cpumask.h>
+#include <linux/trace_remote.h>
+#include <linux/tracefs.h>
+#include <linux/simple_ring_buffer.h>
+
+#include <asm/arch_timer.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_hyptrace.h>
+#include <asm/kvm_mmu.h>
+
+#include "hyp_trace.h"
+
+/* Same 10min used by clocksource when width is more than 32-bits */
+#define CLOCK_MAX_CONVERSION_S	600
+/*
+ * Time to give for the clock init. Long enough to get a good mult/shift
+ * estimation. Short enough to not delay the tracing start too much.
+ */
+#define CLOCK_INIT_MS		100
+/*
+ * Time between clock checks. Must be small enough to catch clock deviation when
+ * it is still tiny.
+ */
+#define CLOCK_UPDATE_MS		500
+
+static struct hyp_trace_clock {
+	u64			cycles;
+	u64			cyc_overflow64;
+	u64			boot;
+	u32			mult;
+	u32			shift;
+	struct delayed_work	work;
+	struct completion	ready;
+	struct mutex		lock;
+	bool			running;
+} hyp_clock;
+
+static void __hyp_clock_work(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct hyp_trace_clock *hyp_clock;
+	struct system_time_snapshot snap;
+	u64 rate, delta_cycles;
+	u64 boot, delta_boot;
+
+	hyp_clock = container_of(dwork, struct hyp_trace_clock, work);
+
+	ktime_get_snapshot(&snap);
+	boot = ktime_to_ns(snap.boot);
+
+	delta_boot = boot - hyp_clock->boot;
+	delta_cycles = snap.cycles - hyp_clock->cycles;
+
+	/* Compare hyp clock with the kernel boot clock */
+	if (hyp_clock->mult) {
+		u64 err, cur = delta_cycles;
+
+		if (WARN_ON_ONCE(cur >= hyp_clock->cyc_overflow64)) {
+			__uint128_t tmp = (__uint128_t)cur * hyp_clock->mult;
+
+			cur = tmp >> hyp_clock->shift;
+		} else {
+			cur *= hyp_clock->mult;
+			cur >>= hyp_clock->shift;
+		}
+		cur += hyp_clock->boot;
+
+		err = abs_diff(cur, boot);
+		/* No deviation, only update epoch if necessary */
+		if (!err) {
+			if (delta_cycles >= (hyp_clock->cyc_overflow64 >> 1))
+				goto fast_forward;
+
+			goto resched;
+		}
+
+		/* Warn if the error is above tracing precision (1us) */
+		if (err > NSEC_PER_USEC)
+			pr_warn_ratelimited("hyp trace clock off by %lluus\n",
+					    err / NSEC_PER_USEC);
+	}
+
+	rate = div64_u64(delta_cycles * NSEC_PER_SEC, delta_boot);
+
+	clocks_calc_mult_shift(&hyp_clock->mult, &hyp_clock->shift,
+			       rate, NSEC_PER_SEC, CLOCK_MAX_CONVERSION_S);
+
+	/* Add a comfortable 50% margin */
+	hyp_clock->cyc_overflow64 = (U64_MAX / hyp_clock->mult) >> 1;
+
+fast_forward:
+	hyp_clock->cycles = snap.cycles;
+	hyp_clock->boot = boot;
+	kvm_call_hyp_nvhe(__tracing_update_clock, hyp_clock->mult,
+			  hyp_clock->shift, hyp_clock->boot, hyp_clock->cycles);
+	complete(&hyp_clock->ready);
+
+resched:
+	schedule_delayed_work(&hyp_clock->work,
+			      msecs_to_jiffies(CLOCK_UPDATE_MS));
+}
+
+static void hyp_trace_clock_enable(struct hyp_trace_clock *hyp_clock, bool enable)
+{
+	struct system_time_snapshot snap;
+
+	if (hyp_clock->running == enable)
+		return;
+
+	if (!enable) {
+		cancel_delayed_work_sync(&hyp_clock->work);
+		hyp_clock->running = false;
+	}
+
+	ktime_get_snapshot(&snap);
+
+	hyp_clock->boot = ktime_to_ns(snap.boot);
+	hyp_clock->cycles = snap.cycles;
+	hyp_clock->mult = 0;
+
+	init_completion(&hyp_clock->ready);
+	INIT_DELAYED_WORK(&hyp_clock->work, __hyp_clock_work);
+	schedule_delayed_work(&hyp_clock->work, msecs_to_jiffies(CLOCK_INIT_MS));
+	wait_for_completion(&hyp_clock->ready);
+	hyp_clock->running = true;
+}
+
+/* Access to this struct within the trace_remote_callbacks are protected by the trace_remote lock */
+static struct hyp_trace_buffer {
+	struct hyp_trace_desc	*desc;
+	size_t			desc_size;
+} trace_buffer;
+
+static int __map_hyp(void *start, size_t size)
+{
+	if (is_protected_kvm_enabled())
+		return 0;
+
+	return create_hyp_mappings(start, start + size, PAGE_HYP);
+}
+
+static int __share_page(unsigned long va)
+{
+	return kvm_share_hyp((void *)va, (void *)va + 1);
+}
+
+static void __unshare_page(unsigned long va)
+{
+	kvm_unshare_hyp((void *)va, (void *)va + 1);
+}
+
+static int hyp_trace_buffer_alloc_bpages_backing(struct hyp_trace_buffer *trace_buffer, size_t size)
+{
+	int nr_bpages = (PAGE_ALIGN(size) / PAGE_SIZE) + 1;
+	size_t backing_size;
+	void *start;
+
+	backing_size = PAGE_ALIGN(sizeof(struct simple_buffer_page) * nr_bpages *
+				  num_possible_cpus());
+
+	start = alloc_pages_exact(backing_size, GFP_KERNEL_ACCOUNT);
+	if (!start)
+		return -ENOMEM;
+
+	trace_buffer->desc->bpages_backing_start = (unsigned long)start;
+	trace_buffer->desc->bpages_backing_size = backing_size;
+
+	return __map_hyp(start, backing_size);
+}
+
+static void hyp_trace_buffer_free_bpages_backing(struct hyp_trace_buffer *trace_buffer)
+{
+	free_pages_exact((void *)trace_buffer->desc->bpages_backing_start,
+			 trace_buffer->desc->bpages_backing_size);
+}
+
+static void hyp_trace_buffer_unshare_hyp(struct hyp_trace_buffer *trace_buffer, int last_cpu)
+{
+	struct ring_buffer_desc *rb_desc;
+	int cpu, p;
+
+	for_each_ring_buffer_desc(rb_desc, cpu, &trace_buffer->desc->trace_buffer_desc) {
+		if (cpu > last_cpu)
+			break;
+
+		__share_page(rb_desc->meta_va);
+		for (p = 0; p < rb_desc->nr_page_va; p++)
+			__unshare_page(rb_desc->page_va[p]);
+	}
+}
+
+static int hyp_trace_buffer_share_hyp(struct hyp_trace_buffer *trace_buffer)
+{
+	struct ring_buffer_desc *rb_desc;
+	int cpu, p, ret = 0;
+
+	for_each_ring_buffer_desc(rb_desc, cpu, &trace_buffer->desc->trace_buffer_desc) {
+		ret = __share_page(rb_desc->meta_va);
+		if (ret)
+			break;
+
+		for (p = 0; p < rb_desc->nr_page_va; p++) {
+			ret = __share_page(rb_desc->page_va[p]);
+			if (ret)
+				break;
+		}
+
+		if (ret) {
+			for (p--; p >= 0; p--)
+				__unshare_page(rb_desc->page_va[p]);
+			break;
+		}
+	}
+
+	if (ret)
+		hyp_trace_buffer_unshare_hyp(trace_buffer, cpu--);
+
+	return ret;
+}
+
+static struct trace_buffer_desc *hyp_trace_load(unsigned long size, void *priv)
+{
+	struct hyp_trace_buffer *trace_buffer = priv;
+	struct hyp_trace_desc *desc;
+	size_t desc_size;
+	int ret;
+
+	if (WARN_ON(trace_buffer->desc))
+		return ERR_PTR(-EINVAL);
+
+	desc_size = trace_buffer_desc_size(size, num_possible_cpus());
+	if (desc_size == SIZE_MAX)
+		return ERR_PTR(-E2BIG);
+
+	desc_size = PAGE_ALIGN(desc_size);
+	desc = (struct hyp_trace_desc *)alloc_pages_exact(desc_size, GFP_KERNEL);
+	if (!desc)
+		return ERR_PTR(-ENOMEM);
+
+	ret = __map_hyp(desc, desc_size);
+	if (ret)
+		goto err_free_desc;
+
+	trace_buffer->desc = desc;
+
+	ret = hyp_trace_buffer_alloc_bpages_backing(trace_buffer, size);
+	if (ret)
+		goto err_free_desc;
+
+	ret = trace_remote_alloc_buffer(&desc->trace_buffer_desc, desc_size, size,
+					cpu_possible_mask);
+	if (ret)
+		goto err_free_backing;
+
+	ret = hyp_trace_buffer_share_hyp(trace_buffer);
+	if (ret)
+		goto err_free_buffer;
+
+	ret = kvm_call_hyp_nvhe(__tracing_load, (unsigned long)desc, desc_size);
+	if (ret)
+		goto err_unload_pages;
+
+	return &desc->trace_buffer_desc;
+
+err_unload_pages:
+	hyp_trace_buffer_unshare_hyp(trace_buffer, INT_MAX);
+
+err_free_buffer:
+	trace_remote_free_buffer(&desc->trace_buffer_desc);
+
+err_free_backing:
+	hyp_trace_buffer_free_bpages_backing(trace_buffer);
+
+err_free_desc:
+	free_pages_exact(desc, desc_size);
+	trace_buffer->desc = NULL;
+
+	return ERR_PTR(ret);
+}
+
+static void hyp_trace_unload(struct trace_buffer_desc *desc, void *priv)
+{
+	struct hyp_trace_buffer *trace_buffer = priv;
+
+	if (WARN_ON(desc != &trace_buffer->desc->trace_buffer_desc))
+		return;
+
+	kvm_call_hyp_nvhe(__tracing_unload);
+	hyp_trace_buffer_unshare_hyp(trace_buffer, INT_MAX);
+	trace_remote_free_buffer(desc);
+	hyp_trace_buffer_free_bpages_backing(trace_buffer);
+	free_pages_exact(trace_buffer->desc, trace_buffer->desc_size);
+	trace_buffer->desc = NULL;
+}
+
+static int hyp_trace_enable_tracing(bool enable, void *priv)
+{
+	hyp_trace_clock_enable(&hyp_clock, enable);
+
+	return kvm_call_hyp_nvhe(__tracing_enable, enable);
+}
+
+static int hyp_trace_swap_reader_page(unsigned int cpu, void *priv)
+{
+	return kvm_call_hyp_nvhe(__tracing_swap_reader, cpu);
+}
+
+static int hyp_trace_reset(unsigned int cpu, void *priv)
+{
+	return kvm_call_hyp_nvhe(__tracing_reset, cpu);
+}
+
+static int hyp_trace_enable_event(unsigned short id, bool enable, void *priv)
+{
+	struct hyp_event_id *event_id = lm_alias(&__hyp_event_ids_start[id]);
+	struct page *page;
+	atomic_t *enabled;
+	void *map;
+
+	if (is_protected_kvm_enabled())
+		return kvm_call_hyp_nvhe(__tracing_enable_event, id, enable);
+
+	enabled = &event_id->enabled;
+	page = virt_to_page(enabled);
+	map = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
+	if (!map)
+		return -ENOMEM;
+
+	enabled = map + offset_in_page(enabled);
+	atomic_set(enabled, enable);
+
+	vunmap(map);
+
+	return 0;
+}
+
+static int hyp_trace_clock_show(struct seq_file *m, void *v)
+{
+	seq_puts(m, "[boot]\n");
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(hyp_trace_clock);
+
+static ssize_t hyp_trace_write_event_write(struct file *f, const char __user *ubuf,
+					    size_t cnt, loff_t *pos)
+{
+	unsigned long val;
+	int ret;
+
+	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+	if (ret)
+		return ret;
+
+	kvm_call_hyp_nvhe(__tracing_write_event, val);
+
+	return cnt;
+}
+
+static const struct file_operations hyp_trace_write_event_fops = {
+	.write = hyp_trace_write_event_write,
+};
+
+static int hyp_trace_init_tracefs(struct dentry *d, void *priv)
+{
+	if (!tracefs_create_file("write_event", 0200, d, NULL, &hyp_trace_write_event_fops))
+		return -ENOMEM;
+
+	return tracefs_create_file("trace_clock", 0440, d, NULL, &hyp_trace_clock_fops) ?
+		0 : -ENOMEM;
+}
+
+static struct trace_remote_callbacks trace_remote_callbacks = {
+	.init			= hyp_trace_init_tracefs,
+	.load_trace_buffer	= hyp_trace_load,
+	.unload_trace_buffer	= hyp_trace_unload,
+	.enable_tracing		= hyp_trace_enable_tracing,
+	.swap_reader_page	= hyp_trace_swap_reader_page,
+	.reset			= hyp_trace_reset,
+	.enable_event		= hyp_trace_enable_event,
+};
+
+static const char *__hyp_enter_exit_reason_str(u8 reason);
+
+#include <asm/kvm_define_hypevents.h>
+
+static const char *__hyp_enter_exit_reason_str(u8 reason)
+{
+	static const char strs[][12] = {
+		"smc",
+		"hvc",
+		"psci",
+		"host_abort",
+		"guest_exit",
+		"eret_host",
+		"eret_guest",
+		"unknown",
+	};
+
+	return strs[min(reason, HYP_REASON_UNKNOWN)];
+}
+
+static void __init hyp_trace_init_events(void)
+{
+	struct hyp_event_id *hyp_event_id = __hyp_event_ids_start;
+	struct remote_event *event = __hyp_events_start;
+	int id = 0;
+
+	/* Events on both sides hypervisor are sorted */
+	for (; event < __hyp_events_end; event++, hyp_event_id++, id++)
+		event->id = hyp_event_id->id = id;
+}
+
+int __init kvm_hyp_trace_init(void)
+{
+	int cpu;
+
+	if (is_kernel_in_hyp_mode())
+		return 0;
+
+	for_each_possible_cpu(cpu) {
+		const struct arch_timer_erratum_workaround *wa =
+			per_cpu(timer_unstable_counter_workaround, cpu);
+
+		if (IS_ENABLED(CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND) &&
+		    wa && wa->read_cntvct_el0) {
+			pr_warn("hyp trace can't handle CNTVCT workaround '%s'\n", wa->desc);
+			return -EOPNOTSUPP;
+		}
+	}
+
+	hyp_trace_init_events();
+
+	return trace_remote_register("hypervisor", &trace_remote_callbacks, &trace_buffer,
+				     __hyp_events_start, __hyp_events_end - __hyp_events_start);
+}
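
The cycles-to-nanoseconds conversion that `__hyp_clock_work()` checks against the boot clock can be illustrated outside the kernel. What follows is only a minimal user-space sketch, not the kernel's code: `struct clock_epoch`, `overflow_limit()` and `cycles_to_ns()` are hypothetical names, and `unsigned __int128` is assumed to be available (a GCC/Clang extension). It mirrors the two ideas in the patch: a 64-bit fast path guarded by a threshold set to half of `U64_MAX / mult`, and a 128-bit fallback once the cycle delta reaches that threshold.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the epoch fields kept in struct hyp_trace_clock. */
struct clock_epoch {
	uint64_t cycles;         /* counter value at the epoch */
	uint64_t boot_ns;        /* boot-clock time at the epoch, in ns */
	uint32_t mult;           /* ns = (delta_cycles * mult) >> shift */
	uint32_t shift;
	uint64_t cyc_overflow64; /* delta at or above this takes the 128-bit path */
};

/* Half of the largest delta whose product with mult still fits in 64 bits. */
static uint64_t overflow_limit(uint32_t mult)
{
	assert(mult != 0);
	return (UINT64_MAX / mult) >> 1;
}

/* Convert an absolute counter value to boot-clock nanoseconds. */
static uint64_t cycles_to_ns(const struct clock_epoch *e, uint64_t cycles)
{
	uint64_t delta = cycles - e->cycles;
	uint64_t ns;

	assert(e->shift < 64);

	if (delta >= e->cyc_overflow64) {
		/* Slow path: widen so the multiplication cannot wrap. */
		unsigned __int128 tmp = (unsigned __int128)delta * e->mult;

		ns = (uint64_t)(tmp >> e->shift);
	} else {
		/* Fast path: the product is known to fit in 64 bits. */
		ns = (delta * e->mult) >> e->shift;
	}

	return e->boot_ns + ns;
}
```

With `mult = 1024` and `shift = 10` (one nanosecond per cycle), a counter value 500 cycles past the epoch maps to 500 ns past the epoch's boot time; a very large delta goes through the widened multiplication instead of wrapping.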

+11

arch/arm64/kvm/hyp_trace.h

···
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ARM64_KVM_HYP_TRACE_H__
+#define __ARM64_KVM_HYP_TRACE_H__
+
+#ifdef CONFIG_NVHE_EL2_TRACING
+int kvm_hyp_trace_init(void);
+#else
+static inline int kvm_hyp_trace_init(void) { return 0; }
+#endif
+#endif

+401 -223

arch/arm64/kvm/mmu.c

···
 void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
 			    u64 size, bool may_block)
 {
+	if (kvm_vm_is_protected(kvm_s2_mmu_to_kvm(mmu)))
+		return;
+
 	__unmap_stage2_range(mmu, start, size, may_block);
 }
 
···
 	u64 mmfr0, mmfr1;
 	u32 phys_shift;
 
-	if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
-		return -EINVAL;
-
 	phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
 	if (is_protected_kvm_enabled()) {
 		phys_shift = kvm_ipa_limit;
···
 
 out_destroy_pgtable:
 	kvm_stage2_destroy(pgt);
+	mmu->pgt = NULL;
 out_free_pgtable:
 	kfree(pgt);
 	return err;
···
  */
 static long
 transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
-			    unsigned long hva, kvm_pfn_t *pfnp,
-			    phys_addr_t *ipap)
+			    unsigned long hva, kvm_pfn_t *pfnp, gfn_t *gfnp)
 {
 	kvm_pfn_t pfn = *pfnp;
+	gfn_t gfn = *gfnp;
 
 	/*
 	 * Make sure the adjustment is done only for THP pages. Also make
···
 		if (sz < PMD_SIZE)
 			return PAGE_SIZE;
 
-		*ipap &= PMD_MASK;
+		gfn &= ~(PTRS_PER_PMD - 1);
+		*gfnp = gfn;
 		pfn &= ~(PTRS_PER_PMD - 1);
 		*pfnp = pfn;
 
···
 	}
 }
 
-static int prepare_mmu_memcache(struct kvm_vcpu *vcpu, bool topup_memcache,
-				void **memcache)
+static void *get_mmu_memcache(struct kvm_vcpu *vcpu)
 {
-	int min_pages;
-
 	if (!is_protected_kvm_enabled())
-		*memcache = &vcpu->arch.mmu_page_cache;
+		return &vcpu->arch.mmu_page_cache;
 	else
-		*memcache = &vcpu->arch.pkvm_memcache;
+		return &vcpu->arch.pkvm_memcache;
+}
 
-	if (!topup_memcache)
-		return 0;
-
-	min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu);
+static int topup_mmu_memcache(struct kvm_vcpu *vcpu, void *memcache)
+{
+	int min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu);
 
 	if (!is_protected_kvm_enabled())
-		return kvm_mmu_topup_memory_cache(*memcache, min_pages);
+		return kvm_mmu_topup_memory_cache(memcache, min_pages);
 
-	return topup_hyp_memcache(*memcache, min_pages);
+	return topup_hyp_memcache(memcache, min_pages);
 }
 
 /*
···
  * TLB invalidation from the guest and used to limit the invalidation scope if a
  * TTL hint or a range isn't provided.
  */
-static void adjust_nested_fault_perms(struct kvm_s2_trans *nested,
-				      enum kvm_pgtable_prot *prot,
-				      bool *writable)
+static enum kvm_pgtable_prot adjust_nested_fault_perms(struct kvm_s2_trans *nested,
+							enum kvm_pgtable_prot prot)
 {
-	*writable &= kvm_s2_trans_writable(nested);
+	if (!kvm_s2_trans_writable(nested))
+		prot &= ~KVM_PGTABLE_PROT_W;
 	if (!kvm_s2_trans_readable(nested))
-		*prot &= ~KVM_PGTABLE_PROT_R;
+		prot &= ~KVM_PGTABLE_PROT_R;
 
-	*prot |= kvm_encode_nested_level(nested);
+	return prot | kvm_encode_nested_level(nested);
 }
 
-static void adjust_nested_exec_perms(struct kvm *kvm,
-				     struct kvm_s2_trans *nested,
-				     enum kvm_pgtable_prot *prot)
+static enum kvm_pgtable_prot adjust_nested_exec_perms(struct kvm *kvm,
+						       struct kvm_s2_trans *nested,
+						       enum kvm_pgtable_prot prot)
 {
 	if (!kvm_s2_trans_exec_el0(kvm, nested))
-		*prot &= ~KVM_PGTABLE_PROT_UX;
+		prot &= ~KVM_PGTABLE_PROT_UX;
 	if (!kvm_s2_trans_exec_el1(kvm, nested))
-		*prot &= ~KVM_PGTABLE_PROT_PX;
+		prot &= ~KVM_PGTABLE_PROT_PX;
+
+	return prot;
 }
 
-static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
-		      struct kvm_s2_trans *nested,
-		      struct kvm_memory_slot *memslot, bool is_perm)
+struct kvm_s2_fault_desc {
+	struct kvm_vcpu		*vcpu;
+	phys_addr_t		fault_ipa;
+	struct kvm_s2_trans	*nested;
+	struct kvm_memory_slot	*memslot;
+	unsigned long		hva;
+};
+
+static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
 {
-	bool write_fault, exec_fault, writable;
+	bool write_fault, exec_fault;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
-	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
+	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
 	unsigned long mmu_seq;
 	struct page *page;
-	struct kvm *kvm = vcpu->kvm;
+	struct kvm *kvm = s2fd->vcpu->kvm;
 	void *memcache;
 	kvm_pfn_t pfn;
 	gfn_t gfn;
 	int ret;
 
-	ret = prepare_mmu_memcache(vcpu, true, &memcache);
+	memcache = get_mmu_memcache(s2fd->vcpu);
+	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
 	if (ret)
 		return ret;
 
-	if (nested)
-		gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
+	if (s2fd->nested)
+		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
 	else
-		gfn = fault_ipa >> PAGE_SHIFT;
+		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
 
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+	write_fault = kvm_is_write_fault(s2fd->vcpu);
+	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
 
 	VM_WARN_ON_ONCE(write_fault && exec_fault);
 
···
 	/* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). */
 	smp_rmb();
 
-	ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
+	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
 	if (ret) {
-		kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
+		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
 					      write_fault, exec_fault, false);
 		return ret;
 	}
 
-	writable = !(memslot->flags & KVM_MEM_READONLY);
-
-	if (nested)
-		adjust_nested_fault_perms(nested, &prot, &writable);
-
-	if (writable)
+	if (!(s2fd->memslot->flags & KVM_MEM_READONLY))
 		prot |= KVM_PGTABLE_PROT_W;
+
+	if (s2fd->nested)
+		prot = adjust_nested_fault_perms(s2fd->nested, prot);
 
 	if (exec_fault || cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
 		prot |= KVM_PGTABLE_PROT_X;
 
-	if (nested)
-		adjust_nested_exec_perms(kvm, nested, &prot);
+	if (s2fd->nested)
+		prot = adjust_nested_exec_perms(kvm, s2fd->nested, prot);
 
 	kvm_fault_lock(kvm);
 	if (mmu_invalidate_retry(kvm, mmu_seq)) {
···
 		goto out_unlock;
 	}
 
-	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE,
+	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
 						 __pfn_to_phys(pfn), prot,
 						 memcache, flags);
 
 out_unlock:
-	kvm_release_faultin_page(kvm, page, !!ret, writable);
+	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
 	kvm_fault_unlock(kvm);
 
-	if (writable && !ret)
-		mark_page_dirty_in_slot(kvm, memslot, gfn);
+	if ((prot & KVM_PGTABLE_PROT_W) && !ret)
+		mark_page_dirty_in_slot(kvm, s2fd->memslot, gfn);
 
 	return ret != -EAGAIN ? ret : 0;
 }
 
-static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
-			  struct kvm_s2_trans *nested,
-			  struct kvm_memory_slot *memslot, unsigned long hva,
-			  bool fault_is_perm)
+struct kvm_s2_fault_vma_info {
+	unsigned long	mmu_seq;
+	long		vma_pagesize;
+	vm_flags_t	vm_flags;
+	unsigned long	max_map_size;
+	struct page	*page;
+	kvm_pfn_t	pfn;
+	gfn_t		gfn;
+	bool		device;
+	bool		mte_allowed;
+	bool		is_vma_cacheable;
+	bool		map_writable;
+	bool		map_non_cacheable;
+};
+
+static int pkvm_mem_abort(const struct kvm_s2_fault_desc *s2fd)
 {
-	int ret = 0;
-	bool topup_memcache;
-	bool write_fault, writable;
-	bool exec_fault, mte_allowed, is_vma_cacheable;
-	bool s2_force_noncacheable = false, vfio_allow_any_uc = false;
-	unsigned long mmu_seq;
-	phys_addr_t ipa = fault_ipa;
+	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
+	struct kvm_vcpu *vcpu = s2fd->vcpu;
+	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
+	struct mm_struct *mm = current->mm;
 	struct kvm *kvm = vcpu->kvm;
-	struct vm_area_struct *vma;
-	short vma_shift;
-	void *memcache;
-	gfn_t gfn;
-	kvm_pfn_t pfn;
-	bool logging_active = memslot_is_logging(memslot);
-	bool force_pte = logging_active;
-	long vma_pagesize, fault_granule;
-	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
-	struct kvm_pgtable *pgt;
+	void *hyp_memcache;
 	struct page *page;
-	vm_flags_t vm_flags;
-	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
+	int ret;
 
-	if (fault_is_perm)
-		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
-	VM_WARN_ON_ONCE(write_fault && exec_fault);
+	hyp_memcache = get_mmu_memcache(vcpu);
+	ret = topup_mmu_memcache(vcpu, hyp_memcache);
+	if (ret)
+		return -ENOMEM;
 
-	/*
-	 * Permission faults just need to update the existing leaf entry,
-	 * and so normally don't require allocations from the memcache. The
-	 * only exception to this is when dirty logging is enabled at runtime
-	 * and a write fault needs to collapse a block entry into a table.
-	 */
-	topup_memcache = !fault_is_perm || (logging_active && write_fault);
-	ret = prepare_mmu_memcache(vcpu, topup_memcache, &memcache);
+	ret = account_locked_vm(mm, 1, true);
 	if (ret)
 		return ret;
 
-	/*
-	 * Let's check if we will get back a huge page backed by hugetlbfs, or
-	 * get block mapping for device MMIO region.
-	 */
-	mmap_read_lock(current->mm);
-	vma = vma_lookup(current->mm, hva);
-	if (unlikely(!vma)) {
-		kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
-		mmap_read_unlock(current->mm);
-		return -EFAULT;
+	mmap_read_lock(mm);
+	ret = pin_user_pages(s2fd->hva, 1, flags, &page);
+	mmap_read_unlock(mm);
+
+	if (ret == -EHWPOISON) {
+		kvm_send_hwpoison_signal(s2fd->hva, PAGE_SHIFT);
+		ret = 0;
+		goto dec_account;
+	} else if (ret != 1) {
+		ret = -EFAULT;
+		goto dec_account;
+	} else if (!folio_test_swapbacked(page_folio(page))) {
+		/*
+		 * We really can't deal with page-cache pages returned by GUP
+		 * because (a) we may trigger writeback of a page for which we
+		 * no longer have access and (b) page_mkclean() won't find the
+		 * stage-2 mapping in the rmap so we can get out-of-whack with
+		 * the filesystem when marking the page dirty during unpinning
+		 * (see cc5095747edf ("ext4: don't BUG if someone dirty pages
+		 * without asking ext4 first")).
+		 *
+		 * Ideally we'd just restrict ourselves to anonymous pages, but
+		 * we also want to allow memfd (i.e. shmem) pages, so check for
+		 * pages backed by swap in the knowledge that the GUP pin will
+		 * prevent try_to_unmap() from succeeding.
+		 */
+		ret = -EIO;
+		goto unpin;
 	}
 
-	if (force_pte)
+	write_lock(&kvm->mmu_lock);
+	ret = pkvm_pgtable_stage2_map(pgt, s2fd->fault_ipa, PAGE_SIZE,
+				      page_to_phys(page), KVM_PGTABLE_PROT_RWX,
+				      hyp_memcache, 0);
+	write_unlock(&kvm->mmu_lock);
+	if (ret) {
+		if (ret == -EAGAIN)
+			ret = 0;
+		goto unpin;
+	}
+
+	return 0;
+unpin:
+	unpin_user_pages(&page, 1);
+dec_account:
+	account_locked_vm(mm, 1, false);
+	return ret;
+}
+
+static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd,
+				     struct kvm_s2_fault_vma_info *s2vi,
+				     struct vm_area_struct *vma)
+{
+	short vma_shift;
+
+	if (memslot_is_logging(s2fd->memslot)) {
+		s2vi->max_map_size = PAGE_SIZE;
 		vma_shift = PAGE_SHIFT;
-	else
-		vma_shift = get_vma_page_shift(vma, hva);
+	} else {
+		s2vi->max_map_size = PUD_SIZE;
+		vma_shift = get_vma_page_shift(vma, s2fd->hva);
+	}
 
 	switch (vma_shift) {
 #ifndef __PAGETABLE_PMD_FOLDED
 	case PUD_SHIFT:
-		if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
+		if (fault_supports_stage2_huge_mapping(s2fd->memslot, s2fd->hva, PUD_SIZE))
 			break;
 		fallthrough;
 #endif
···
 		vma_shift = PMD_SHIFT;
 		fallthrough;
 	case PMD_SHIFT:
-		if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
+		if (fault_supports_stage2_huge_mapping(s2fd->memslot, s2fd->hva, PMD_SIZE))
 			break;
 		fallthrough;
 	case CONT_PTE_SHIFT:
 		vma_shift = PAGE_SHIFT;
-		force_pte = true;
+		s2vi->max_map_size = PAGE_SIZE;
 		fallthrough;
 	case PAGE_SHIFT:
 		break;
···
 		WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
 	}
 
-	vma_pagesize = 1UL << vma_shift;
-
-	if (nested) {
+	if (s2fd->nested) {
 		unsigned long max_map_size;
 
-		max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE;
-
-		ipa = kvm_s2_trans_output(nested);
+		max_map_size = min(s2vi->max_map_size, PUD_SIZE);
 
 		/*
 		 * If we're about to create a shadow stage 2 entry, then we
 		 * can only create a block mapping if the guest stage 2 page
 		 * table uses at least as big a mapping.
 		 */
-		max_map_size = min(kvm_s2_trans_size(nested), max_map_size);
+		max_map_size = min(kvm_s2_trans_size(s2fd->nested), max_map_size);
 
 		/*
 		 * Be careful that if the mapping size falls between
···
 		else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE)
 			max_map_size = PAGE_SIZE;
 
-		force_pte = (max_map_size == PAGE_SIZE);
-		vma_pagesize = min_t(long, vma_pagesize, max_map_size);
-		vma_shift = __ffs(vma_pagesize);
+		s2vi->max_map_size = max_map_size;
+		vma_shift = min_t(short, vma_shift, __ffs(max_map_size));
 	}
+
+	return vma_shift;
+}
+
+static bool kvm_s2_fault_is_perm(const struct kvm_s2_fault_desc *s2fd)
+{
+	return kvm_vcpu_trap_is_permission_fault(s2fd->vcpu);
+}
+
+static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
+				     struct kvm_s2_fault_vma_info *s2vi)
+{
+	struct vm_area_struct *vma;
+	struct kvm *kvm = s2fd->vcpu->kvm;
+
+	mmap_read_lock(current->mm);
+	vma = vma_lookup(current->mm, s2fd->hva);
+	if (unlikely(!vma)) {
+		kvm_err("Failed to find VMA for hva 0x%lx\n", s2fd->hva);
+		mmap_read_unlock(current->mm);
+		return -EFAULT;
+	}
+
+	s2vi->vma_pagesize = BIT(kvm_s2_resolve_vma_size(s2fd, s2vi, vma));
 
 	/*
 	 * Both the canonical IPA and fault IPA must be aligned to the
 	 * mapping size to ensure we find the right PFN and lay down the
 	 * mapping in the right place.
 	 */
-	fault_ipa = ALIGN_DOWN(fault_ipa, vma_pagesize);
-	ipa = ALIGN_DOWN(ipa, vma_pagesize);
+	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
 
-	gfn = ipa >> PAGE_SHIFT;
-	mte_allowed = kvm_vma_mte_allowed(vma);
+	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
 
-	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
+	s2vi->vm_flags = vma->vm_flags;
 
-	vm_flags = vma->vm_flags;
-
-	is_vma_cacheable = kvm_vma_is_cacheable(vma);
-
-	/* Don't use the VMA after the unlock -- it may have vanished */
-	vma = NULL;
+	s2vi->is_vma_cacheable = kvm_vma_is_cacheable(vma);
 
 	/*
 	 * Read mmu_invalidate_seq so that KVM can detect if the results of
···
 	 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
 	 * with the smp_wmb() in kvm_mmu_invalidate_end().
 	 */
-	mmu_seq = kvm->mmu_invalidate_seq;
+	s2vi->mmu_seq = kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
-	pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
-				&writable, &page);
-	if (pfn == KVM_PFN_ERR_HWPOISON) {
-		kvm_send_hwpoison_signal(hva, vma_shift);
-		return 0;
-	}
-	if (is_error_noslot_pfn(pfn))
+	return 0;
+}
+
+static gfn_t get_canonical_gfn(const struct kvm_s2_fault_desc *s2fd,
+			       const struct kvm_s2_fault_vma_info *s2vi)
+{
+	phys_addr_t ipa;
+
+	if (!s2fd->nested)
+		return s2vi->gfn;
+
+	ipa = kvm_s2_trans_output(s2fd->nested);
+	return ALIGN_DOWN(ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
+}
+
+static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd,
+				struct kvm_s2_fault_vma_info *s2vi)
+{
+	int ret;
+
+	ret = kvm_s2_fault_get_vma_info(s2fd, s2vi);
+	if (ret)
+		return ret;
+
+	s2vi->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, s2vi),
+				      kvm_is_write_fault(s2fd->vcpu) ? FOLL_WRITE : 0,
+				      &s2vi->map_writable, &s2vi->page);
+	if (unlikely(is_error_noslot_pfn(s2vi->pfn))) {
+		if (s2vi->pfn == KVM_PFN_ERR_HWPOISON) {
+			kvm_send_hwpoison_signal(s2fd->hva, __ffs(s2vi->vma_pagesize));
+			return 0;
+		}
 		return -EFAULT;
+	}
 
 	/*
 	 * Check if this is non-struct page memory PFN, and cannot support
 	 * CMOs. It could potentially be unsafe to access as cacheable.
 	 */
-	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
-		if (is_vma_cacheable) {
+	if (s2vi->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(s2vi->pfn)) {
+		if (s2vi->is_vma_cacheable) {
 			/*
 			 * Whilst the VMA owner expects cacheable mapping to this
 			 * PFN, hardware also has to support the FWB and CACHE DIC
···
 			 * S2FWB and CACHE DIC are mandatory to avoid the need for
 			 * cache maintenance.
1895 1814 */ 1896 - if (!kvm_supports_cacheable_pfnmap()) 1897 - ret = -EFAULT; 1815 + if (!kvm_supports_cacheable_pfnmap()) { 1816 + kvm_release_faultin_page(s2fd->vcpu->kvm, s2vi->page, true, false); 1817 + return -EFAULT; 1818 + } 1898 1819 } else { 1899 1820 /* 1900 1821 * If the page was identified as device early by looking at ··· 1908 1825 * In both cases, we don't let transparent_hugepage_adjust() 1909 1826 * change things at the last minute. 1910 1827 */ 1911 - s2_force_noncacheable = true; 1828 + s2vi->map_non_cacheable = true; 1912 1829 } 1913 - } else if (logging_active && !write_fault) { 1914 - /* 1915 - * Only actually map the page as writable if this was a write 1916 - * fault. 1917 - */ 1918 - writable = false; 1830 + 1831 + s2vi->device = true; 1919 1832 } 1920 1833 1921 - if (exec_fault && s2_force_noncacheable) 1922 - ret = -ENOEXEC; 1834 + return 1; 1835 + } 1923 1836 1924 - if (ret) 1925 - goto out_put_page; 1837 + static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, 1838 + const struct kvm_s2_fault_vma_info *s2vi, 1839 + enum kvm_pgtable_prot *prot) 1840 + { 1841 + struct kvm *kvm = s2fd->vcpu->kvm; 1842 + 1843 + if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu) && s2vi->map_non_cacheable) 1844 + return -ENOEXEC; 1926 1845 1927 1846 /* 1928 1847 * Guest performs atomic/exclusive operations on memory with unsupported ··· 1932 1847 * and trigger the exception here. Since the memslot is valid, inject 1933 1848 * the fault back to the guest. 
1934 1849 */ 1935 - if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(vcpu))) { 1936 - kvm_inject_dabt_excl_atomic(vcpu, kvm_vcpu_get_hfar(vcpu)); 1937 - ret = 1; 1938 - goto out_put_page; 1850 + if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(s2fd->vcpu))) { 1851 + kvm_inject_dabt_excl_atomic(s2fd->vcpu, kvm_vcpu_get_hfar(s2fd->vcpu)); 1852 + return 1; 1939 1853 } 1940 1854 1941 - if (nested) 1942 - adjust_nested_fault_perms(nested, &prot, &writable); 1855 + *prot = KVM_PGTABLE_PROT_R; 1856 + 1857 + if (s2vi->map_writable && (s2vi->device || 1858 + !memslot_is_logging(s2fd->memslot) || 1859 + kvm_is_write_fault(s2fd->vcpu))) 1860 + *prot |= KVM_PGTABLE_PROT_W; 1861 + 1862 + if (s2fd->nested) 1863 + *prot = adjust_nested_fault_perms(s2fd->nested, *prot); 1864 + 1865 + if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu)) 1866 + *prot |= KVM_PGTABLE_PROT_X; 1867 + 1868 + if (s2vi->map_non_cacheable) 1869 + *prot |= (s2vi->vm_flags & VM_ALLOW_ANY_UNCACHED) ? 1870 + KVM_PGTABLE_PROT_NORMAL_NC : KVM_PGTABLE_PROT_DEVICE; 1871 + else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) 1872 + *prot |= KVM_PGTABLE_PROT_X; 1873 + 1874 + if (s2fd->nested) 1875 + *prot = adjust_nested_exec_perms(kvm, s2fd->nested, *prot); 1876 + 1877 + if (!kvm_s2_fault_is_perm(s2fd) && !s2vi->map_non_cacheable && kvm_has_mte(kvm)) { 1878 + /* Check the VMM hasn't introduced a new disallowed VMA */ 1879 + if (!s2vi->mte_allowed) 1880 + return -EFAULT; 1881 + } 1882 + 1883 + return 0; 1884 + } 1885 + 1886 + static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, 1887 + const struct kvm_s2_fault_vma_info *s2vi, 1888 + enum kvm_pgtable_prot prot, 1889 + void *memcache) 1890 + { 1891 + enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; 1892 + bool writable = prot & KVM_PGTABLE_PROT_W; 1893 + struct kvm *kvm = s2fd->vcpu->kvm; 1894 + struct kvm_pgtable *pgt; 1895 + long perm_fault_granule; 1896 + long mapping_size; 1897 + kvm_pfn_t pfn; 1898 + gfn_t gfn; 1899 + int ret; 1943 1900 1944 1901 
kvm_fault_lock(kvm); 1945 - pgt = vcpu->arch.hw_mmu->pgt; 1946 - if (mmu_invalidate_retry(kvm, mmu_seq)) { 1947 - ret = -EAGAIN; 1902 + pgt = s2fd->vcpu->arch.hw_mmu->pgt; 1903 + ret = -EAGAIN; 1904 + if (mmu_invalidate_retry(kvm, s2vi->mmu_seq)) 1948 1905 goto out_unlock; 1949 - } 1906 + 1907 + perm_fault_granule = (kvm_s2_fault_is_perm(s2fd) ? 1908 + kvm_vcpu_trap_get_perm_fault_granule(s2fd->vcpu) : 0); 1909 + mapping_size = s2vi->vma_pagesize; 1910 + pfn = s2vi->pfn; 1911 + gfn = s2vi->gfn; 1950 1912 1951 1913 /* 1952 1914 * If we are not forced to use page mapping, check if we are 1953 1915 * backed by a THP and thus use block mapping if possible. 1954 1916 */ 1955 - if (vma_pagesize == PAGE_SIZE && !(force_pte || s2_force_noncacheable)) { 1956 - if (fault_is_perm && fault_granule > PAGE_SIZE) 1957 - vma_pagesize = fault_granule; 1958 - else 1959 - vma_pagesize = transparent_hugepage_adjust(kvm, memslot, 1960 - hva, &pfn, 1961 - &fault_ipa); 1962 - 1963 - if (vma_pagesize < 0) { 1964 - ret = vma_pagesize; 1965 - goto out_unlock; 1966 - } 1967 - } 1968 - 1969 - if (!fault_is_perm && !s2_force_noncacheable && kvm_has_mte(kvm)) { 1970 - /* Check the VMM hasn't introduced a new disallowed VMA */ 1971 - if (mte_allowed) { 1972 - sanitise_mte_tags(kvm, pfn, vma_pagesize); 1917 + if (mapping_size == PAGE_SIZE && 1918 + !(s2vi->max_map_size == PAGE_SIZE || s2vi->map_non_cacheable)) { 1919 + if (perm_fault_granule > PAGE_SIZE) { 1920 + mapping_size = perm_fault_granule; 1973 1921 } else { 1974 - ret = -EFAULT; 1975 - goto out_unlock; 1922 + mapping_size = transparent_hugepage_adjust(kvm, s2fd->memslot, 1923 + s2fd->hva, &pfn, 1924 + &gfn); 1925 + if (mapping_size < 0) { 1926 + ret = mapping_size; 1927 + goto out_unlock; 1928 + } 1976 1929 } 1977 1930 } 1978 1931 1979 - if (writable) 1980 - prot |= KVM_PGTABLE_PROT_W; 1981 - 1982 - if (exec_fault) 1983 - prot |= KVM_PGTABLE_PROT_X; 1984 - 1985 - if (s2_force_noncacheable) { 1986 - if (vfio_allow_any_uc) 1987 - prot |= 
KVM_PGTABLE_PROT_NORMAL_NC; 1988 - else 1989 - prot |= KVM_PGTABLE_PROT_DEVICE; 1990 - } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) { 1991 - prot |= KVM_PGTABLE_PROT_X; 1992 - } 1993 - 1994 - if (nested) 1995 - adjust_nested_exec_perms(kvm, nested, &prot); 1932 + if (!perm_fault_granule && !s2vi->map_non_cacheable && kvm_has_mte(kvm)) 1933 + sanitise_mte_tags(kvm, pfn, mapping_size); 1996 1934 1997 1935 /* 1998 1936 * Under the premise of getting a FSC_PERM fault, we just need to relax 1999 - * permissions only if vma_pagesize equals fault_granule. Otherwise, 1937 + * permissions only if mapping_size equals perm_fault_granule. Otherwise, 2000 1938 * kvm_pgtable_stage2_map() should be called to change block size. 2001 1939 */ 2002 - if (fault_is_perm && vma_pagesize == fault_granule) { 1940 + if (mapping_size == perm_fault_granule) { 2003 1941 /* 2004 1942 * Drop the SW bits in favour of those stored in the 2005 1943 * PTE, which will be preserved. 2006 1944 */ 2007 1945 prot &= ~KVM_NV_GUEST_MAP_SZ; 2008 - ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags); 1946 + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn), 1947 + prot, flags); 2009 1948 } else { 2010 - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize, 2011 - __pfn_to_phys(pfn), prot, 2012 - memcache, flags); 1949 + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size, 1950 + __pfn_to_phys(pfn), prot, 1951 + memcache, flags); 2013 1952 } 2014 1953 2015 1954 out_unlock: 2016 - kvm_release_faultin_page(kvm, page, !!ret, writable); 1955 + kvm_release_faultin_page(kvm, s2vi->page, !!ret, writable); 2017 1956 kvm_fault_unlock(kvm); 2018 1957 2019 - /* Mark the page dirty only if the fault is handled successfully */ 2020 - if (writable && !ret) 2021 - mark_page_dirty_in_slot(kvm, memslot, gfn); 1958 + /* 1959 + * Mark the page dirty only if the fault is handled successfully, 1960 + * making sure we adjust the 
canonical IPA if the mapping size has 1961 + * been updated (via a THP upgrade, for example). 1962 + */ 1963 + if (writable && !ret) { 1964 + phys_addr_t ipa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi)); 1965 + ipa &= ~(mapping_size - 1); 1966 + mark_page_dirty_in_slot(kvm, s2fd->memslot, gpa_to_gfn(ipa)); 1967 + } 2022 1968 2023 - return ret != -EAGAIN ? ret : 0; 1969 + if (ret != -EAGAIN) 1970 + return ret; 1971 + return 0; 1972 + } 2024 1973 2025 - out_put_page: 2026 - kvm_release_page_unused(page); 2027 - return ret; 1974 + static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) 1975 + { 1976 + bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); 1977 + struct kvm_s2_fault_vma_info s2vi = {}; 1978 + enum kvm_pgtable_prot prot; 1979 + void *memcache; 1980 + int ret; 1981 + 1982 + /* 1983 + * Permission faults just need to update the existing leaf entry, 1984 + * and so normally don't require allocations from the memcache. The 1985 + * only exception to this is when dirty logging is enabled at runtime 1986 + * and a write fault needs to collapse a block entry into a table. 1987 + */ 1988 + memcache = get_mmu_memcache(s2fd->vcpu); 1989 + if (!perm_fault || (memslot_is_logging(s2fd->memslot) && 1990 + kvm_is_write_fault(s2fd->vcpu))) { 1991 + ret = topup_mmu_memcache(s2fd->vcpu, memcache); 1992 + if (ret) 1993 + return ret; 1994 + } 1995 + 1996 + /* 1997 + * Let's check if we will get back a huge page backed by hugetlbfs, or 1998 + * get block mapping for device MMIO region. 1999 + */ 2000 + ret = kvm_s2_fault_pin_pfn(s2fd, &s2vi); 2001 + if (ret != 1) 2002 + return ret; 2003 + 2004 + ret = kvm_s2_fault_compute_prot(s2fd, &s2vi, &prot); 2005 + if (ret) { 2006 + kvm_release_page_unused(s2vi.page); 2007 + return ret; 2008 + } 2009 + 2010 + return kvm_s2_fault_map(s2fd, &s2vi, prot, memcache); 2028 2011 } 2029 2012 2030 2013 /* Resolve the access fault by making the page young again. 
*/ ··· 2355 2202 goto out_unlock; 2356 2203 } 2357 2204 2358 - VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) && 2359 - !write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu)); 2205 + const struct kvm_s2_fault_desc s2fd = { 2206 + .vcpu = vcpu, 2207 + .fault_ipa = fault_ipa, 2208 + .nested = nested, 2209 + .memslot = memslot, 2210 + .hva = hva, 2211 + }; 2360 2212 2361 - if (kvm_slot_has_gmem(memslot)) 2362 - ret = gmem_abort(vcpu, fault_ipa, nested, memslot, 2363 - esr_fsc_is_permission_fault(esr)); 2364 - else 2365 - ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva, 2366 - esr_fsc_is_permission_fault(esr)); 2213 + if (kvm_vm_is_protected(vcpu->kvm)) { 2214 + ret = pkvm_mem_abort(&s2fd); 2215 + } else { 2216 + VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) && 2217 + !write_fault && 2218 + !kvm_vcpu_trap_is_exec_fault(vcpu)); 2219 + 2220 + if (kvm_slot_has_gmem(memslot)) 2221 + ret = gmem_abort(&s2fd); 2222 + else 2223 + ret = user_mem_abort(&s2fd); 2224 + } 2225 + 2367 2226 if (ret == 0) 2368 2227 ret = 1; 2369 2228 out: ··· 2388 2223 2389 2224 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) 2390 2225 { 2391 - if (!kvm->arch.mmu.pgt) 2226 + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) 2392 2227 return false; 2393 2228 2394 2229 __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT, ··· 2403 2238 { 2404 2239 u64 size = (range->end - range->start) << PAGE_SHIFT; 2405 2240 2406 - if (!kvm->arch.mmu.pgt) 2241 + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) 2407 2242 return false; 2408 2243 2409 2244 return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt, ··· 2419 2254 { 2420 2255 u64 size = (range->end - range->start) << PAGE_SHIFT; 2421 2256 2422 - if (!kvm->arch.mmu.pgt) 2257 + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) 2423 2258 return false; 2424 2259 2425 2260 return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt, ··· 2575 2410 { 2576 2411 hva_t 
hva, reg_end; 2577 2412 int ret = 0; 2413 + 2414 + if (kvm_vm_is_protected(kvm)) { 2415 + /* Cannot modify memslots once a pVM has run. */ 2416 + if (pkvm_hyp_vm_is_created(kvm) && 2417 + (change == KVM_MR_DELETE || change == KVM_MR_MOVE)) { 2418 + return -EPERM; 2419 + } 2420 + 2421 + if (new && 2422 + new->flags & (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)) { 2423 + return -EPERM; 2424 + } 2425 + } 2578 2426 2579 2427 if (change != KVM_MR_CREATE && change != KVM_MR_MOVE && 2580 2428 change != KVM_MR_FLAGS_ONLY)
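
The mmu.c hunks above split `user_mem_abort()` into three stages that share one descriptor: `kvm_s2_fault_pin_pfn()` returns 1 to continue, 0 when the fault is already handled, or a negative errno; `kvm_s2_fault_compute_prot()` stops on any non-zero value; `kvm_s2_fault_map()` folds `-EAGAIN` into success so the guest simply retries. The following is a minimal user-space sketch of that return-code convention only. Every name in it (`sketch_fault`, `sketch_pin`, and so on) is hypothetical and none of it is kernel API.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inputs that steer each stage; not a kernel structure. */
struct sketch_fault {
	bool hwpoison;        /* stage 1: page is poisoned, a signal is sent */
	bool bad_pfn;         /* stage 1: no usable PFN */
	bool exec_on_device;  /* stage 2: exec fault on a non-cacheable mapping */
	bool lost_race;       /* stage 3: an MMU invalidation raced with us */
};

/* Stage 1: 1 = continue, 0 = fault fully handled, <0 = error. */
static int sketch_pin(const struct sketch_fault *f)
{
	if (f->hwpoison)
		return 0;
	if (f->bad_pfn)
		return -EFAULT;
	return 1;
}

/* Stage 2: 0 = continue, anything else stops the pipeline. */
static int sketch_compute_prot(const struct sketch_fault *f)
{
	return f->exec_on_device ? -ENOEXEC : 0;
}

/* Stage 3: a lost race (-EAGAIN) is reported as success. */
static int sketch_map(const struct sketch_fault *f)
{
	int ret = f->lost_race ? -EAGAIN : 0;

	if (ret != -EAGAIN)
		return ret;
	return 0;
}

/* Mirrors the control flow of the refactored user_mem_abort(). */
static int sketch_user_mem_abort(const struct sketch_fault *f)
{
	int ret = sketch_pin(f);

	if (ret != 1)
		return ret;

	ret = sketch_compute_prot(f);
	if (ret)
		return ret;

	return sketch_map(f);
}
```

The design choice worth noting is that only the first stage uses 1 as "keep going"; this lets it report "handled, nothing more to do" as 0 without being mistaken for an error.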

+10 -1

arch/arm64/kvm/nested.c

··· 735 735 kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size; 736 736 737 737 /* Make sure we don't forget to do the laundry */ 738 - if (kvm_s2_mmu_valid(s2_mmu)) 738 + if (kvm_s2_mmu_valid(s2_mmu)) { 739 + kvm_nested_s2_ptdump_remove_debugfs(s2_mmu); 739 740 s2_mmu->pending_unmap = true; 741 + } 740 742 741 743 /* 742 744 * The virtual VMID (modulo CnP) will be used as a key when matching ··· 751 749 s2_mmu->tlb_vttbr = vcpu_read_sys_reg(vcpu, VTTBR_EL2) & ~VTTBR_CNP_BIT; 752 750 s2_mmu->tlb_vtcr = vcpu_read_sys_reg(vcpu, VTCR_EL2); 753 751 s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, HCR_EL2) & HCR_VM; 752 + 753 + kvm_nested_s2_ptdump_create_debugfs(s2_mmu); 754 754 755 755 out: 756 756 atomic_inc(&s2_mmu->refcnt); ··· 1560 1556 ID_AA64PFR1_EL1_RES0 | 1561 1557 ID_AA64PFR1_EL1_MPAM_frac | 1562 1558 ID_AA64PFR1_EL1_MTE); 1559 + break; 1560 + 1561 + case SYS_ID_AA64PFR2_EL1: 1562 + /* GICv5 is not yet supported for NV */ 1563 + val &= ~ID_AA64PFR2_EL1_GCIE; 1563 1564 break; 1564 1565 1565 1566 case SYS_ID_AA64MMFR0_EL1:

+131 -26

arch/arm64/kvm/pkvm.c

··· 88 88 static void __pkvm_destroy_hyp_vm(struct kvm *kvm) 89 89 { 90 90 if (pkvm_hyp_vm_is_created(kvm)) { 91 - WARN_ON(kvm_call_hyp_nvhe(__pkvm_teardown_vm, 91 + WARN_ON(kvm_call_hyp_nvhe(__pkvm_finalize_teardown_vm, 92 92 kvm->arch.pkvm.handle)); 93 93 } else if (kvm->arch.pkvm.handle) { 94 94 /* ··· 192 192 { 193 193 int ret = 0; 194 194 195 + /* 196 + * Synchronise with kvm_arch_prepare_memory_region(), as we 197 + * prevent memslot modifications on a pVM that has been run. 198 + */ 199 + mutex_lock(&kvm->slots_lock); 195 200 mutex_lock(&kvm->arch.config_lock); 196 201 if (!pkvm_hyp_vm_is_created(kvm)) 197 202 ret = __pkvm_create_hyp_vm(kvm); 198 203 mutex_unlock(&kvm->arch.config_lock); 204 + mutex_unlock(&kvm->slots_lock); 199 205 200 206 return ret; 201 207 } ··· 225 219 mutex_unlock(&kvm->arch.config_lock); 226 220 } 227 221 228 - int pkvm_init_host_vm(struct kvm *kvm) 222 + int pkvm_init_host_vm(struct kvm *kvm, unsigned long type) 229 223 { 230 224 int ret; 225 + bool protected = type & KVM_VM_TYPE_ARM_PROTECTED; 231 226 232 227 if (pkvm_hyp_vm_is_created(kvm)) 233 228 return -EINVAL; ··· 243 236 return ret; 244 237 245 238 kvm->arch.pkvm.handle = ret; 239 + kvm->arch.pkvm.is_protected = protected; 240 + if (protected) { 241 + pr_warn_once("kvm: protected VMs are experimental and for development only, tainting kernel\n"); 242 + add_taint(TAINT_USER, LOCKDEP_STILL_OK); 243 + } 246 244 247 245 return 0; 248 246 } ··· 334 322 return 0; 335 323 } 336 324 337 - static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 end) 325 + static int __pkvm_pgtable_stage2_reclaim(struct kvm_pgtable *pgt, u64 start, u64 end) 338 326 { 339 327 struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); 340 328 pkvm_handle_t handle = kvm->arch.pkvm.handle; 341 329 struct pkvm_mapping *mapping; 342 330 int ret; 343 331 344 - if (!handle) 345 - return 0; 332 + for_each_mapping_in_range_safe(pgt, start, end, mapping) { 333 + struct page *page; 334 + 335 + ret = 
kvm_call_hyp_nvhe(__pkvm_reclaim_dying_guest_page, 336 + handle, mapping->gfn); 337 + if (WARN_ON(ret)) 338 + continue; 339 + 340 + page = pfn_to_page(mapping->pfn); 341 + WARN_ON_ONCE(mapping->nr_pages != 1); 342 + unpin_user_pages_dirty_lock(&page, 1, true); 343 + account_locked_vm(current->mm, 1, false); 344 + pkvm_mapping_remove(mapping, &pgt->pkvm_mappings); 345 + kfree(mapping); 346 + } 347 + 348 + return 0; 349 + } 350 + 351 + static int __pkvm_pgtable_stage2_unshare(struct kvm_pgtable *pgt, u64 start, u64 end) 352 + { 353 + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); 354 + pkvm_handle_t handle = kvm->arch.pkvm.handle; 355 + struct pkvm_mapping *mapping; 356 + int ret; 346 357 347 358 for_each_mapping_in_range_safe(pgt, start, end, mapping) { 348 359 ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_guest, handle, mapping->gfn, ··· 382 347 void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, 383 348 u64 addr, u64 size) 384 349 { 385 - __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); 350 + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); 351 + pkvm_handle_t handle = kvm->arch.pkvm.handle; 352 + 353 + if (!handle) 354 + return; 355 + 356 + if (pkvm_hyp_vm_is_created(kvm) && !kvm->arch.pkvm.is_dying) { 357 + WARN_ON(kvm_call_hyp_nvhe(__pkvm_start_teardown_vm, handle)); 358 + kvm->arch.pkvm.is_dying = true; 359 + } 360 + 361 + if (kvm_vm_is_protected(kvm)) 362 + __pkvm_pgtable_stage2_reclaim(pgt, addr, addr + size); 363 + else 364 + __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); 386 365 } 387 366 388 367 void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) ··· 414 365 struct kvm_hyp_memcache *cache = mc; 415 366 u64 gfn = addr >> PAGE_SHIFT; 416 367 u64 pfn = phys >> PAGE_SHIFT; 368 + u64 end = addr + size; 417 369 int ret; 418 370 419 - if (size != PAGE_SIZE && size != PMD_SIZE) 420 - return -EINVAL; 421 - 422 371 lockdep_assert_held_write(&kvm->mmu_lock); 372 + mapping = pkvm_mapping_iter_first(&pgt->pkvm_mappings, addr, end - 1); 
423 373 424 - /* 425 - * Calling stage2_map() on top of existing mappings is either happening because of a race 426 - * with another vCPU, or because we're changing between page and block mappings. As per 427 - * user_mem_abort(), same-size permission faults are handled in the relax_perms() path. 428 - */ 429 - mapping = pkvm_mapping_iter_first(&pgt->pkvm_mappings, addr, addr + size - 1); 430 - if (mapping) { 431 - if (size == (mapping->nr_pages * PAGE_SIZE)) 432 - return -EAGAIN; 374 + if (kvm_vm_is_protected(kvm)) { 375 + /* Protected VMs are mapped using RWX page-granular mappings */ 376 + if (WARN_ON_ONCE(size != PAGE_SIZE)) 377 + return -EINVAL; 433 378 434 - /* Remove _any_ pkvm_mapping overlapping with the range, bigger or smaller. */ 435 - ret = __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); 436 - if (ret) 437 - return ret; 438 - mapping = NULL; 379 + if (WARN_ON_ONCE(prot != KVM_PGTABLE_PROT_RWX)) 380 + return -EINVAL; 381 + 382 + /* 383 + * We either raced with another vCPU or the guest PTE 384 + * has been poisoned by an erroneous host access. 385 + */ 386 + if (mapping) { 387 + ret = kvm_call_hyp_nvhe(__pkvm_vcpu_in_poison_fault); 388 + return ret ? -EFAULT : -EAGAIN; 389 + } 390 + 391 + ret = kvm_call_hyp_nvhe(__pkvm_host_donate_guest, pfn, gfn); 392 + } else { 393 + if (WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE)) 394 + return -EINVAL; 395 + 396 + /* 397 + * We either raced with another vCPU or we're changing between 398 + * page and block mappings. As per user_mem_abort(), same-size 399 + * permission faults are handled in the relax_perms() path. 400 + */ 401 + if (mapping) { 402 + if (size == (mapping->nr_pages * PAGE_SIZE)) 403 + return -EAGAIN; 404 + 405 + /* 406 + * Remove _any_ pkvm_mapping overlapping with the range, 407 + * bigger or smaller. 
408 + */ 409 + ret = __pkvm_pgtable_stage2_unshare(pgt, addr, end); 410 + if (ret) 411 + return ret; 412 + 413 + mapping = NULL; 414 + } 415 + 416 + ret = kvm_call_hyp_nvhe(__pkvm_host_share_guest, pfn, gfn, 417 + size / PAGE_SIZE, prot); 439 418 } 440 419 441 - ret = kvm_call_hyp_nvhe(__pkvm_host_share_guest, pfn, gfn, size / PAGE_SIZE, prot); 442 420 if (WARN_ON(ret)) 443 421 return ret; 444 422 ··· 480 404 481 405 int pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size) 482 406 { 483 - lockdep_assert_held_write(&kvm_s2_mmu_to_kvm(pgt->mmu)->mmu_lock); 407 + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); 484 408 485 - return __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); 409 + if (WARN_ON(kvm_vm_is_protected(kvm))) 410 + return -EPERM; 411 + 412 + lockdep_assert_held_write(&kvm->mmu_lock); 413 + 414 + return __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); 486 415 } 487 416 488 417 int pkvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size) ··· 496 415 pkvm_handle_t handle = kvm->arch.pkvm.handle; 497 416 struct pkvm_mapping *mapping; 498 417 int ret = 0; 418 + 419 + if (WARN_ON(kvm_vm_is_protected(kvm))) 420 + return -EPERM; 499 421 500 422 lockdep_assert_held(&kvm->mmu_lock); 501 423 for_each_mapping_in_range_safe(pgt, addr, addr + size, mapping) { ··· 531 447 struct pkvm_mapping *mapping; 532 448 bool young = false; 533 449 450 + if (WARN_ON(kvm_vm_is_protected(kvm))) 451 + return false; 452 + 534 453 lockdep_assert_held(&kvm->mmu_lock); 535 454 for_each_mapping_in_range_safe(pgt, addr, addr + size, mapping) 536 455 young |= kvm_call_hyp_nvhe(__pkvm_host_test_clear_young_guest, handle, mapping->gfn, ··· 545 458 int pkvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr, enum kvm_pgtable_prot prot, 546 459 enum kvm_pgtable_walk_flags flags) 547 460 { 461 + if (WARN_ON(kvm_vm_is_protected(kvm_s2_mmu_to_kvm(pgt->mmu)))) 462 + return -EPERM; 463 + 548 464 return 
kvm_call_hyp_nvhe(__pkvm_host_relax_perms_guest, addr >> PAGE_SHIFT, prot); 549 465 } 550 466 551 467 void pkvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr, 552 468 enum kvm_pgtable_walk_flags flags) 553 469 { 470 + if (WARN_ON(kvm_vm_is_protected(kvm_s2_mmu_to_kvm(pgt->mmu)))) 471 + return; 472 + 554 473 WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_mkyoung_guest, addr >> PAGE_SHIFT)); 555 474 } 556 475 ··· 577 484 { 578 485 WARN_ON_ONCE(1); 579 486 return -EINVAL; 487 + } 488 + 489 + /* 490 + * Forcefully reclaim a page from the guest, zeroing its contents and 491 + * poisoning the stage-2 pte so that pages can no longer be mapped at 492 + * the same IPA. The page remains pinned until the guest is destroyed. 493 + */ 494 + bool pkvm_force_reclaim_guest_page(phys_addr_t phys) 495 + { 496 + int ret = kvm_call_hyp_nvhe(__pkvm_force_reclaim_guest_page, phys); 497 + 498 + return !ret || ret == -EAGAIN; 580 499 }
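
The pkvm.c changes make `pkvm_pgtable_stage2_map()` apply different pre-checks depending on whether the VM is protected, and make `pkvm_force_reclaim_guest_page()` treat `-EAGAIN` as success. A rough sketch of just that decision logic follows, assuming a 4 KiB page and a 2 MiB PMD; the struct, constants and function names are hypothetical stand-ins, and the hypercalls themselves are not modelled.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL                      /* assumed granule */
#define SKETCH_PMD_SIZE  (512UL * SKETCH_PAGE_SIZE)  /* assumed block size */
#define SKETCH_PROT_RWX  0x7u                        /* placeholder value */

struct sketch_map_req {
	bool protected_vm;
	unsigned long size;
	unsigned int prot;
	bool overlaps;           /* an existing mapping overlaps the range */
	unsigned long old_size;  /* size of that mapping, if any */
	bool poison_fault;       /* hypervisor reports a poisoned guest PTE */
};

/* Returns 0 when the request may go on to the hypercall, else -errno. */
static int sketch_check_map(const struct sketch_map_req *r)
{
	if (r->protected_vm) {
		/* Protected VMs: page-granular RWX mappings only. */
		if (r->size != SKETCH_PAGE_SIZE || r->prot != SKETCH_PROT_RWX)
			return -EINVAL;
		/* Overlap means a vCPU race or a poisoned PTE. */
		if (r->overlaps)
			return r->poison_fault ? -EFAULT : -EAGAIN;
		return 0;
	}

	if (r->size != SKETCH_PAGE_SIZE && r->size != SKETCH_PMD_SIZE)
		return -EINVAL;
	/* Same-size overlap: another vCPU got there first. */
	if (r->overlaps && r->size == r->old_size)
		return -EAGAIN;
	/* A different-size overlap is unshared first, then remapped. */
	return 0;
}

/* Same predicate as the bool returned by pkvm_force_reclaim_guest_page(). */
static bool sketch_force_reclaim_ok(int hyp_ret)
{
	return !hyp_ret || hyp_ret == -EAGAIN;
}
```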

+15 -5

arch/arm64/kvm/pmu-emul.c

···
939 939 		 * number against the dimensions of the vgic and make sure
940 940 		 * it's valid.
941 941 		 */
942 - 		if (!irq_is_ppi(irq) && !vgic_valid_spi(vcpu->kvm, irq))
942 + 		if (!irq_is_ppi(vcpu->kvm, irq) &&
943 + 		    !vgic_valid_spi(vcpu->kvm, irq))
943 944 			return -EINVAL;
944 945 	} else if (kvm_arm_pmu_irq_initialized(vcpu)) {
945 946 		return -EINVAL;
···
962 961 	if (!vgic_initialized(vcpu->kvm))
963 962 		return -ENODEV;
964 963
965 - 	if (!kvm_arm_pmu_irq_initialized(vcpu))
966 - 		return -ENXIO;
964 + 	if (!kvm_arm_pmu_irq_initialized(vcpu)) {
965 + 		if (!vgic_is_v5(vcpu->kvm))
966 + 			return -ENXIO;
967 +
968 + 		/* Use the architected irq number for GICv5. */
969 + 		vcpu->arch.pmu.irq_num = KVM_ARMV8_PMU_GICV5_IRQ;
970 + 	}
967 971
968 972 	ret = kvm_vgic_set_owner(vcpu, vcpu->arch.pmu.irq_num,
969 973 				 &vcpu->arch.pmu);
···
993 987 	unsigned long i;
994 988 	struct kvm_vcpu *vcpu;
995 989
990 + 	/* On GICv5, the PMUIRQ is architecturally mandated to be PPI 23 */
991 + 	if (vgic_is_v5(kvm) && irq != KVM_ARMV8_PMU_GICV5_IRQ)
992 + 		return false;
993 +
996 994 	kvm_for_each_vcpu(i, vcpu, kvm) {
997 995 		if (!kvm_arm_pmu_irq_initialized(vcpu))
998 996 			continue;
999 997
1000 - 		if (irq_is_ppi(irq)) {
998 + 		if (irq_is_ppi(vcpu->kvm, irq)) {
1001 999 			if (vcpu->arch.pmu.irq_num != irq)
1002 1000 				return false;
1003 1001 		} else {
···
1152 1142 			return -EFAULT;
1153 1143
1154 1144 		/* The PMU overflow interrupt can be a PPI or a valid SPI. */
1155 - 		if (!(irq_is_ppi(irq) || irq_is_spi(irq)))
1145 + 		if (!(irq_is_ppi(vcpu->kvm, irq) || irq_is_spi(vcpu->kvm, irq)))
1156 1146 			return -EINVAL;
1157 1147
1158 1148 		if (!pmu_irq_is_valid(kvm, irq))
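
The `pmu_irq_is_valid()` hunk adds an early rejection for GICv5: the PMU interrupt must be the architected PPI, and only then do the per-vCPU consistency checks run. A simplified sketch of that ordering is shown here. It assumes plain integer interrupt IDs with a GICv3-style PPI range of 16..31 and uses 23 for the GICv5 PMU PPI, as the patch comment states; the real `KVM_ARMV8_PMU_GICV5_IRQ` encoding and the real `irq_is_ppi()` are not reproduced, the SPI branch (no two vCPUs may share an SPI) is filled in from memory of the surrounding kernel code rather than from the hunk, and all `sketch_` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_GICV5_PMU_PPI 23  /* "PPI 23" per the comment in the patch */

/* Assumed GICv3-style INTID layout: PPIs occupy 16..31. */
static bool sketch_irq_is_ppi(int irq)
{
	return irq >= 16 && irq < 32;
}

struct sketch_vcpu {
	bool irq_set;  /* PMU interrupt already configured on this vCPU */
	int irq;
};

static bool sketch_pmu_irq_is_valid(bool gicv5, const struct sketch_vcpu *v,
				    size_t n, int irq)
{
	size_t i;

	/* GICv5: only the architected PPI is acceptable. */
	if (gicv5 && irq != SKETCH_GICV5_PMU_PPI)
		return false;

	for (i = 0; i < n; i++) {
		if (!v[i].irq_set)
			continue;

		if (sketch_irq_is_ppi(irq)) {
			/* A PPI must be the same number on every vCPU. */
			if (v[i].irq != irq)
				return false;
		} else {
			/* An SPI must not be shared between vCPUs. */
			if (v[i].irq == irq)
				return false;
		}
	}

	return true;
}
```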

+50 -29

arch/arm64/kvm/ptdump.c

··· 10 10 #include <linux/kvm_host.h> 11 11 #include <linux/seq_file.h> 12 12 13 + #include <asm/cpufeature.h> 13 14 #include <asm/kvm_mmu.h> 14 15 #include <asm/kvm_pgtable.h> 15 16 #include <asm/ptdump.h> 16 17 17 18 #define MARKERS_LEN 2 18 19 #define KVM_PGTABLE_MAX_LEVELS (KVM_PGTABLE_LAST_LEVEL + 1) 20 + #define S2FNAMESZ sizeof("0x0123456789abcdef-0x0123456789abcdef-s2-disabled") 19 21 20 22 struct kvm_ptdump_guest_state { 21 - struct kvm *kvm; 23 + struct kvm_s2_mmu *mmu; 22 24 struct ptdump_pg_state parser_state; 23 25 struct addr_marker ipa_marker[MARKERS_LEN]; 24 26 struct ptdump_pg_level level[KVM_PGTABLE_MAX_LEVELS]; 25 - struct ptdump_range range[MARKERS_LEN]; 26 27 }; 27 28 28 29 static const struct ptdump_prot_bits stage2_pte_bits[] = { ··· 113 112 return 0; 114 113 } 115 114 116 - static struct kvm_ptdump_guest_state *kvm_ptdump_parser_create(struct kvm *kvm) 115 + static struct kvm_ptdump_guest_state *kvm_ptdump_parser_create(struct kvm_s2_mmu *mmu) 117 116 { 118 117 struct kvm_ptdump_guest_state *st; 119 - struct kvm_s2_mmu *mmu = &kvm->arch.mmu; 120 118 struct kvm_pgtable *pgtable = mmu->pgt; 121 119 int ret; 122 120 ··· 131 131 132 132 st->ipa_marker[0].name = "Guest IPA"; 133 133 st->ipa_marker[1].start_address = BIT(pgtable->ia_bits); 134 - st->range[0].end = BIT(pgtable->ia_bits); 135 134 136 - st->kvm = kvm; 137 - st->parser_state = (struct ptdump_pg_state) { 138 - .marker = &st->ipa_marker[0], 139 - .level = -1, 140 - .pg_level = &st->level[0], 141 - .ptdump.range = &st->range[0], 142 - .start_address = 0, 143 - }; 144 - 135 + st->mmu = mmu; 145 136 return st; 146 137 } 147 138 ··· 140 149 { 141 150 int ret; 142 151 struct kvm_ptdump_guest_state *st = m->private; 143 - struct kvm *kvm = st->kvm; 144 - struct kvm_s2_mmu *mmu = &kvm->arch.mmu; 145 - struct ptdump_pg_state *parser_state = &st->parser_state; 152 + struct kvm_s2_mmu *mmu = st->mmu; 153 + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); 146 154 struct kvm_pgtable_walker walker = 
(struct kvm_pgtable_walker) { 147 155 .cb = kvm_ptdump_visitor, 148 - .arg = parser_state, 156 + .arg = &st->parser_state, 149 157 .flags = KVM_PGTABLE_WALK_LEAF, 150 158 }; 151 159 152 - parser_state->seq = m; 160 + st->parser_state = (struct ptdump_pg_state) { 161 + .marker = &st->ipa_marker[0], 162 + .level = -1, 163 + .pg_level = &st->level[0], 164 + .seq = m, 165 + }; 153 166 154 167 write_lock(&kvm->mmu_lock); 155 168 ret = kvm_pgtable_walk(mmu->pgt, 0, BIT(mmu->pgt->ia_bits), &walker); ··· 164 169 165 170 static int kvm_ptdump_guest_open(struct inode *m, struct file *file) 166 171 { 167 - struct kvm *kvm = m->i_private; 172 + struct kvm_s2_mmu *mmu = m->i_private; 173 + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); 168 174 struct kvm_ptdump_guest_state *st; 169 175 int ret; 170 176 171 177 if (!kvm_get_kvm_safe(kvm)) 172 178 return -ENOENT; 173 179 174 - st = kvm_ptdump_parser_create(kvm); 180 + st = kvm_ptdump_parser_create(mmu); 175 181 if (IS_ERR(st)) { 176 182 ret = PTR_ERR(st); 177 183 goto err_with_kvm_ref; ··· 190 194 191 195 static int kvm_ptdump_guest_close(struct inode *m, struct file *file) 192 196 { 193 - struct kvm *kvm = m->i_private; 197 + struct kvm *kvm = kvm_s2_mmu_to_kvm(m->i_private); 194 198 void *st = ((struct seq_file *)file->private_data)->private; 195 199 196 200 kfree(st); ··· 225 229 static int kvm_pgtable_debugfs_open(struct inode *m, struct file *file, 226 230 int (*show)(struct seq_file *, void *)) 227 231 { 228 - struct kvm *kvm = m->i_private; 232 + struct kvm_s2_mmu *mmu = m->i_private; 233 + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); 229 234 struct kvm_pgtable *pgtable; 230 235 int ret; 231 236 232 237 if (!kvm_get_kvm_safe(kvm)) 233 238 return -ENOENT; 234 239 235 - pgtable = kvm->arch.mmu.pgt; 240 + pgtable = mmu->pgt; 236 241 237 242 ret = single_open(file, show, pgtable); 238 243 if (ret < 0) ··· 253 256 254 257 static int kvm_pgtable_debugfs_close(struct inode *m, struct file *file) 255 258 { 256 - struct kvm *kvm = 
m->i_private; 259 + struct kvm *kvm = kvm_s2_mmu_to_kvm(m->i_private); 257 260 258 261 kvm_put_kvm(kvm); 259 262 return single_release(m, file); ··· 273 276 .release = kvm_pgtable_debugfs_close, 274 277 }; 275 278 279 + void kvm_nested_s2_ptdump_create_debugfs(struct kvm_s2_mmu *mmu) 280 + { 281 + struct dentry *dent; 282 + char file_name[S2FNAMESZ]; 283 + 284 + snprintf(file_name, sizeof(file_name), "0x%016llx-0x%016llx-s2-%sabled", 285 + mmu->tlb_vttbr, 286 + mmu->tlb_vtcr, 287 + mmu->nested_stage2_enabled ? "en" : "dis"); 288 + 289 + dent = debugfs_create_file(file_name, 0400, 290 + mmu->arch->debugfs_nv_dentry, mmu, 291 + &kvm_ptdump_guest_fops); 292 + 293 + mmu->shadow_pt_debugfs_dentry = dent; 294 + } 295 + 296 + void kvm_nested_s2_ptdump_remove_debugfs(struct kvm_s2_mmu *mmu) 297 + { 298 + debugfs_remove(mmu->shadow_pt_debugfs_dentry); 299 + } 300 + 276 301 void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) 277 302 { 278 303 debugfs_create_file("stage2_page_tables", 0400, kvm->debugfs_dentry, 279 - kvm, &kvm_ptdump_guest_fops); 280 - debugfs_create_file("ipa_range", 0400, kvm->debugfs_dentry, kvm, 281 - &kvm_pgtable_range_fops); 304 + &kvm->arch.mmu, &kvm_ptdump_guest_fops); 305 + debugfs_create_file("ipa_range", 0400, kvm->debugfs_dentry, 306 + &kvm->arch.mmu, &kvm_pgtable_range_fops); 282 307 debugfs_create_file("stage2_levels", 0400, kvm->debugfs_dentry, 283 - kvm, &kvm_pgtable_levels_fops); 308 + &kvm->arch.mmu, &kvm_pgtable_levels_fops); 309 + if (cpus_have_final_cap(ARM64_HAS_NESTED_VIRT)) 310 + kvm->arch.debugfs_nv_dentry = debugfs_create_dir("nested", kvm->debugfs_dentry); 284 311 }
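
The ptdump.c change sizes the debugfs file name with `S2FNAMESZ`, defined as `sizeof()` of a sample string in the longest form the name can take ("dis" + "abled"), so the buffer always holds the formatted name plus its terminator. The sketch below checks that sizing in user space with the same format string; `sketch_format_s2_name()` is a hypothetical wrapper, and `unsigned long long` stands in for the kernel's `u64`.

```c
#include <stddef.h>
#include <stdio.h>

/* Same definition as in the patch: worst-case name, including the NUL. */
#define S2FNAMESZ sizeof("0x0123456789abcdef-0x0123456789abcdef-s2-disabled")

/*
 * Formats "<vttbr>-<vtcr>-s2-{en,dis}abled" into buf and returns the
 * number of characters the full name needs, excluding the NUL.
 */
static int sketch_format_s2_name(char *buf, size_t len,
				 unsigned long long vttbr,
				 unsigned long long vtcr, int enabled)
{
	return snprintf(buf, len, "0x%016llx-0x%016llx-s2-%sabled",
			vttbr, vtcr, enabled ? "en" : "dis");
}
```

Because `%016llx` always emits exactly 16 hex digits for a 64-bit value, the longest name is 49 characters, one less than `S2FNAMESZ`, so the `snprintf()` in the patch should never truncate.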

+4 -4

arch/arm64/kvm/stacktrace.c

@@ -197,7 +197,7 @@
 	kvm_nvhe_dump_backtrace_end();
 }
 
-#ifdef CONFIG_PROTECTED_NVHE_STACKTRACE
+#ifdef CONFIG_PKVM_STACKTRACE
 DECLARE_KVM_NVHE_PER_CPU(unsigned long [NVHE_STACKTRACE_SIZE/sizeof(long)],
 			 pkvm_stacktrace);
 
@@ -225,12 +225,12 @@
 		kvm_nvhe_dump_backtrace_entry((void *)hyp_offset, stacktrace[i]);
 	kvm_nvhe_dump_backtrace_end();
 }
-#else	/* !CONFIG_PROTECTED_NVHE_STACKTRACE */
+#else	/* !CONFIG_PKVM_STACKTRACE */
 static void pkvm_dump_backtrace(unsigned long hyp_offset)
 {
-	kvm_err("Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE\n");
+	kvm_err("Cannot dump pKVM nVHE stacktrace: !CONFIG_PKVM_STACKTRACE\n");
 }
-#endif /* CONFIG_PROTECTED_NVHE_STACKTRACE */
+#endif /* CONFIG_PKVM_STACKTRACE */
 
 /*
  * kvm_nvhe_dump_backtrace - Dump KVM nVHE hypervisor backtrace.

+170 -30

arch/arm64/kvm/sys_regs.c

··· 681 681 return true; 682 682 } 683 683 684 + static bool access_gicv5_idr0(struct kvm_vcpu *vcpu, struct sys_reg_params *p, 685 + const struct sys_reg_desc *r) 686 + { 687 + if (p->is_write) 688 + return undef_access(vcpu, p, r); 689 + 690 + /* 691 + * Expose KVM's priority- and ID-bits to the guest, but not GCIE_LEGACY. 692 + * 693 + * Note: for GICv5 the mimic the way that the num_pri_bits and 694 + * num_id_bits fields are used with GICv3: 695 + * - num_pri_bits stores the actual number of priority bits, whereas the 696 + * register field stores num_pri_bits - 1. 697 + * - num_id_bits stores the raw field value, which is 0b0000 for 16 bits 698 + * and 0b0001 for 24 bits. 699 + */ 700 + p->regval = FIELD_PREP(ICC_IDR0_EL1_PRI_BITS, vcpu->arch.vgic_cpu.num_pri_bits - 1) | 701 + FIELD_PREP(ICC_IDR0_EL1_ID_BITS, vcpu->arch.vgic_cpu.num_id_bits); 702 + 703 + return true; 704 + } 705 + 706 + static bool access_gicv5_iaffid(struct kvm_vcpu *vcpu, struct sys_reg_params *p, 707 + const struct sys_reg_desc *r) 708 + { 709 + if (p->is_write) 710 + return undef_access(vcpu, p, r); 711 + 712 + /* 713 + * For GICv5 VMs, the IAFFID value is the same as the VPE ID. The VPE ID 714 + * is the same as the VCPU's ID. 715 + */ 716 + p->regval = FIELD_PREP(ICC_IAFFIDR_EL1_IAFFID, vcpu->vcpu_id); 717 + 718 + return true; 719 + } 720 + 721 + static bool access_gicv5_ppi_enabler(struct kvm_vcpu *vcpu, 722 + struct sys_reg_params *p, 723 + const struct sys_reg_desc *r) 724 + { 725 + unsigned long *mask = vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask; 726 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 727 + int i; 728 + 729 + /* We never expect to get here with a read! */ 730 + if (WARN_ON_ONCE(!p->is_write)) 731 + return undef_access(vcpu, p, r); 732 + 733 + /* 734 + * If we're only handling architected PPIs and the guest writes to the 735 + * enable for the non-architected PPIs, we just return as there's 736 + * nothing to do at all. 
We don't even allocate the storage for them in 737 + * this case. 738 + */ 739 + if (VGIC_V5_NR_PRIVATE_IRQS == 64 && p->Op2 % 2) 740 + return true; 741 + 742 + /* 743 + * Merge the raw guest write into out bitmap at an offset of either 0 or 744 + * 64, then and it with our PPI mask. 745 + */ 746 + bitmap_write(cpu_if->vgic_ppi_enabler, p->regval, 64 * (p->Op2 % 2), 64); 747 + bitmap_and(cpu_if->vgic_ppi_enabler, cpu_if->vgic_ppi_enabler, mask, 748 + VGIC_V5_NR_PRIVATE_IRQS); 749 + 750 + /* 751 + * Sync the change in enable states to the vgic_irqs. We consider all 752 + * PPIs as we don't expose many to the guest. 753 + */ 754 + for_each_set_bit(i, mask, VGIC_V5_NR_PRIVATE_IRQS) { 755 + u32 intid = vgic_v5_make_ppi(i); 756 + struct vgic_irq *irq; 757 + 758 + irq = vgic_get_vcpu_irq(vcpu, intid); 759 + 760 + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) 761 + irq->enabled = test_bit(i, cpu_if->vgic_ppi_enabler); 762 + 763 + vgic_put_irq(vcpu->kvm, irq); 764 + } 765 + 766 + return true; 767 + } 768 + 684 769 static bool trap_raz_wi(struct kvm_vcpu *vcpu, 685 770 struct sys_reg_params *p, 686 771 const struct sys_reg_desc *r) ··· 1843 1758 1844 1759 static u64 sanitise_id_aa64pfr0_el1(const struct kvm_vcpu *vcpu, u64 val); 1845 1760 static u64 sanitise_id_aa64pfr1_el1(const struct kvm_vcpu *vcpu, u64 val); 1761 + static u64 sanitise_id_aa64pfr2_el1(const struct kvm_vcpu *vcpu, u64 val); 1846 1762 static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val); 1847 1763 1848 1764 /* Read a sanitised cpufeature ID register by sys_reg_desc */ ··· 1869 1783 val = sanitise_id_aa64pfr1_el1(vcpu, val); 1870 1784 break; 1871 1785 case SYS_ID_AA64PFR2_EL1: 1872 - val &= ID_AA64PFR2_EL1_FPMR | 1873 - (kvm_has_mte(vcpu->kvm) ? 
1874 - ID_AA64PFR2_EL1_MTEFAR | ID_AA64PFR2_EL1_MTESTOREONLY : 1875 - 0); 1786 + val = sanitise_id_aa64pfr2_el1(vcpu, val); 1876 1787 break; 1877 1788 case SYS_ID_AA64ISAR1_EL1: 1878 1789 if (!vcpu_has_ptrauth(vcpu)) ··· 2068 1985 val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, CSV3, IMP); 2069 1986 } 2070 1987 2071 - if (vgic_is_v3(vcpu->kvm)) { 1988 + if (vgic_host_has_gicv3()) { 2072 1989 val &= ~ID_AA64PFR0_EL1_GIC_MASK; 2073 1990 val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); 2074 1991 } ··· 2106 2023 val &= ~ID_AA64PFR1_EL1_MTEX; 2107 2024 val &= ~ID_AA64PFR1_EL1_PFAR; 2108 2025 val &= ~ID_AA64PFR1_EL1_MPAM_frac; 2026 + 2027 + return val; 2028 + } 2029 + 2030 + static u64 sanitise_id_aa64pfr2_el1(const struct kvm_vcpu *vcpu, u64 val) 2031 + { 2032 + val &= ID_AA64PFR2_EL1_FPMR | 2033 + ID_AA64PFR2_EL1_MTEFAR | 2034 + ID_AA64PFR2_EL1_MTESTOREONLY; 2035 + 2036 + if (!kvm_has_mte(vcpu->kvm)) { 2037 + val &= ~ID_AA64PFR2_EL1_MTEFAR; 2038 + val &= ~ID_AA64PFR2_EL1_MTESTOREONLY; 2039 + } 2040 + 2041 + if (vgic_host_has_gicv5()) 2042 + val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR2_EL1, GCIE, IMP); 2109 2043 2110 2044 return val; 2111 2045 } ··· 2277 2177 (vcpu_has_nv(vcpu) && !FIELD_GET(ID_AA64PFR0_EL1_EL2, user_val))) 2278 2178 return -EINVAL; 2279 2179 2280 - /* 2281 - * If we are running on a GICv5 host and support FEAT_GCIE_LEGACY, then 2282 - * we support GICv3. Fail attempts to do anything but set that to IMP. 
2283 - */ 2284 - if (vgic_is_v3_compat(vcpu->kvm) && 2285 - FIELD_GET(ID_AA64PFR0_EL1_GIC_MASK, user_val) != ID_AA64PFR0_EL1_GIC_IMP) 2286 - return -EINVAL; 2287 - 2288 2180 return set_id_reg(vcpu, rd, user_val); 2289 2181 } 2290 2182 ··· 2313 2221 user_val |= hw_val & ID_AA64PFR1_EL1_MTE_frac_MASK; 2314 2222 } 2315 2223 2224 + return set_id_reg(vcpu, rd, user_val); 2225 + } 2226 + 2227 + static int set_id_aa64pfr2_el1(struct kvm_vcpu *vcpu, 2228 + const struct sys_reg_desc *rd, u64 user_val) 2229 + { 2316 2230 return set_id_reg(vcpu, rd, user_val); 2317 2231 } 2318 2232 ··· 3303 3205 ID_AA64PFR1_EL1_RES0 | 3304 3206 ID_AA64PFR1_EL1_MPAM_frac | 3305 3207 ID_AA64PFR1_EL1_MTE)), 3306 - ID_WRITABLE(ID_AA64PFR2_EL1, 3307 - ID_AA64PFR2_EL1_FPMR | 3308 - ID_AA64PFR2_EL1_MTEFAR | 3309 - ID_AA64PFR2_EL1_MTESTOREONLY), 3208 + ID_FILTERED(ID_AA64PFR2_EL1, id_aa64pfr2_el1, 3209 + (ID_AA64PFR2_EL1_FPMR | 3210 + ID_AA64PFR2_EL1_MTEFAR | 3211 + ID_AA64PFR2_EL1_MTESTOREONLY | 3212 + ID_AA64PFR2_EL1_GCIE)), 3310 3213 ID_UNALLOCATED(4,3), 3311 3214 ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0), 3312 3215 ID_HIDDEN(ID_AA64SMFR0_EL1), ··· 3490 3391 { SYS_DESC(SYS_ICC_AP1R1_EL1), undef_access }, 3491 3392 { SYS_DESC(SYS_ICC_AP1R2_EL1), undef_access }, 3492 3393 { SYS_DESC(SYS_ICC_AP1R3_EL1), undef_access }, 3394 + { SYS_DESC(SYS_ICC_IDR0_EL1), access_gicv5_idr0 }, 3395 + { SYS_DESC(SYS_ICC_IAFFIDR_EL1), access_gicv5_iaffid }, 3396 + { SYS_DESC(SYS_ICC_PPI_ENABLER0_EL1), access_gicv5_ppi_enabler }, 3397 + { SYS_DESC(SYS_ICC_PPI_ENABLER1_EL1), access_gicv5_ppi_enabler }, 3493 3398 { SYS_DESC(SYS_ICC_DIR_EL1), access_gic_dir }, 3494 3399 { SYS_DESC(SYS_ICC_RPR_EL1), undef_access }, 3495 3400 { SYS_DESC(SYS_ICC_SGI1R_EL1), access_gic_sgi }, ··· 5750 5647 compute_fgu(kvm, HFGRTR2_GROUP); 5751 5648 compute_fgu(kvm, HFGITR2_GROUP); 5752 5649 compute_fgu(kvm, HDFGRTR2_GROUP); 5650 + compute_fgu(kvm, ICH_HFGRTR_GROUP); 5651 + compute_fgu(kvm, ICH_HFGITR_GROUP); 5753 5652 5754 5653 
set_bit(KVM_ARCH_FLAG_FGU_INITIALIZED, &kvm->arch.flags); 5755 5654 out: ··· 5772 5667 5773 5668 guard(mutex)(&kvm->arch.config_lock); 5774 5669 5775 - /* 5776 - * This hacks into the ID registers, so only perform it when the 5777 - * first vcpu runs, or the kvm_set_vm_id_reg() helper will scream. 5778 - */ 5779 - if (!irqchip_in_kernel(kvm) && !kvm_vm_has_ran_once(kvm)) { 5780 - u64 val; 5781 - 5782 - val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC; 5783 - kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, val); 5784 - val = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC; 5785 - kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, val); 5786 - } 5787 - 5788 5670 if (vcpu_has_nv(vcpu)) { 5789 5671 int ret = kvm_init_nv_sysregs(vcpu); 5790 5672 if (ret) 5791 5673 return ret; 5674 + } 5675 + 5676 + if (kvm_vm_has_ran_once(kvm)) 5677 + return 0; 5678 + 5679 + /* 5680 + * This hacks into the ID registers, so only perform it when the 5681 + * first vcpu runs, or the kvm_set_vm_id_reg() helper will scream. 5682 + */ 5683 + if (!irqchip_in_kernel(kvm)) { 5684 + u64 val; 5685 + 5686 + val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC; 5687 + kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, val); 5688 + val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1) & ~ID_AA64PFR2_EL1_GCIE; 5689 + kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1, val); 5690 + val = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC; 5691 + kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, val); 5692 + } else { 5693 + /* 5694 + * Certain userspace software - QEMU - samples the system 5695 + * register state without creating an irqchip, then blindly 5696 + * restores the state prior to running the final guest. This 5697 + * means that it restores the virtualization & emulation 5698 + * capabilities of the host system, rather than something that 5699 + * reflects the final guest state. 
Moreover, it checks that the 5700 + * state was "correctly" restored (i.e., verbatim), bailing if 5701 + * it isn't, so masking off invalid state isn't an option. 5702 + * 5703 + * On GICv5 hardware that supports FEAT_GCIE_LEGACY we can run 5704 + * both GICv3- and GICv5-based guests. Therefore, we initially 5705 + * present both ID_AA64PFR0.GIC and ID_AA64PFR2.GCIE as IMP to 5706 + * reflect that userspace can create EITHER a vGICv3 or a 5707 + * vGICv5. This is an architecturally invalid combination, of 5708 + * course. Once an in-kernel GIC is created, the sysreg state is 5709 + * updated to reflect the actual, valid configuration. 5710 + * 5711 + * Setting both the GIC and GCIE features to IMP unsurprisingly 5712 + * results in guests falling over, and hence we need to fix up 5713 + * this mess in KVM. Before running for the first time we yet 5714 + * again ensure that the GIC and GCIE fields accurately reflect 5715 + * the actual hardware the guest should see. 5716 + * 5717 + * This hack allows legacy QEMU-based GICv3 guests to run 5718 + * unmodified on compatible GICv5 hosts, and avoids the inverse 5719 + * problem for GICv5-based guests in the future. 5720 + */ 5721 + kvm_vgic_finalize_idregs(kvm); 5792 5722 } 5793 5723 5794 5724 return 0;
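
The `access_gicv5_ppi_enabler()` handler in this file merges a 64-bit guest write into a per-vCPU enable bitmap at bit offset 0 or 64, then ANDs the result with the VM-wide PPI mask so that only exposed PPIs can end up enabled. A minimal user-space sketch of that merge-then-mask step follows. The two-word array layout and the helper name `ppi_enabler_write()` are assumptions for illustration, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PPI_WORDS 2	/* 128 PPIs; an assumption for this sketch */

/*
 * Merge a guest write to enable register 0 or 1 into a two-word bitmap,
 * then drop every bit that is not set in the VM-wide PPI mask.
 */
static void ppi_enabler_write(uint64_t enabler[NR_PPI_WORDS],
			      const uint64_t mask[NR_PPI_WORDS],
			      unsigned int reg_idx, uint64_t regval)
{
	assert(reg_idx < NR_PPI_WORDS);

	/* Raw write lands at bit offset 64 * reg_idx. */
	enabler[reg_idx] = regval;

	/* Only PPIs exposed to the guest may remain enabled. */
	for (unsigned int i = 0; i < NR_PPI_WORDS; i++)
		enabler[i] &= mask[i];
}
```

Masking after the write, rather than before, means a guest that sets bits for unexposed PPIs reads back a bitmap with those bits cleared.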

+164 -66

arch/arm64/kvm/vgic/vgic-init.c

··· 66 66 * or through the generic KVM_CREATE_DEVICE API ioctl. 67 67 * irqchip_in_kernel() tells you if this function succeeded or not. 68 68 * @kvm: kvm struct pointer 69 - * @type: KVM_DEV_TYPE_ARM_VGIC_V[23] 69 + * @type: KVM_DEV_TYPE_ARM_VGIC_V[235] 70 70 */ 71 71 int kvm_vgic_create(struct kvm *kvm, u32 type) 72 72 { 73 73 struct kvm_vcpu *vcpu; 74 - u64 aa64pfr0, pfr1; 75 74 unsigned long i; 76 75 int ret; 77 76 ··· 131 132 132 133 if (type == KVM_DEV_TYPE_ARM_VGIC_V2) 133 134 kvm->max_vcpus = VGIC_V2_MAX_CPUS; 134 - else 135 + else if (type == KVM_DEV_TYPE_ARM_VGIC_V3) 135 136 kvm->max_vcpus = VGIC_V3_MAX_CPUS; 137 + else if (type == KVM_DEV_TYPE_ARM_VGIC_V5) 138 + kvm->max_vcpus = min(VGIC_V5_MAX_CPUS, 139 + kvm_vgic_global_state.max_gic_vcpus); 136 140 137 141 if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) { 138 142 ret = -E2BIG; ··· 147 145 kvm->arch.vgic.implementation_rev = KVM_VGIC_IMP_REV_LATEST; 148 146 kvm->arch.vgic.vgic_dist_base = VGIC_ADDR_UNDEF; 149 147 150 - aa64pfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC; 151 - pfr1 = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC; 152 - 153 - if (type == KVM_DEV_TYPE_ARM_VGIC_V2) { 148 + switch (type) { 149 + case KVM_DEV_TYPE_ARM_VGIC_V2: 154 150 kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF; 155 - } else { 151 + break; 152 + case KVM_DEV_TYPE_ARM_VGIC_V3: 156 153 INIT_LIST_HEAD(&kvm->arch.vgic.rd_regions); 157 - aa64pfr0 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); 158 - pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3); 154 + break; 159 155 } 160 156 161 - kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, aa64pfr0); 162 - kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, pfr1); 157 + /* 158 + * We've now created the GIC. Update the system register state 159 + * to accurately reflect what we've created. 
160 + */ 161 + kvm_vgic_finalize_idregs(kvm); 163 162 164 163 kvm_for_each_vcpu(i, vcpu, kvm) { 165 164 ret = vgic_allocate_private_irqs_locked(vcpu, type); ··· 181 178 182 179 if (type == KVM_DEV_TYPE_ARM_VGIC_V3) 183 180 kvm->arch.vgic.nassgicap = system_supports_direct_sgis(); 181 + 182 + /* 183 + * We now know that we have a GICv5. The Arch Timer PPI interrupts may 184 + * have been initialised at this stage, but will have done so assuming 185 + * that we have an older GIC, meaning that the IntIDs won't be 186 + * correct. We init them again, and this time they will be correct. 187 + */ 188 + if (type == KVM_DEV_TYPE_ARM_VGIC_V5) 189 + kvm_timer_init_vm(kvm); 184 190 185 191 out_unlock: 186 192 mutex_unlock(&kvm->arch.config_lock); ··· 271 259 return ret; 272 260 } 273 261 262 + static void vgic_allocate_private_irq(struct kvm_vcpu *vcpu, int i, u32 type) 263 + { 264 + struct vgic_irq *irq = &vcpu->arch.vgic_cpu.private_irqs[i]; 265 + 266 + INIT_LIST_HEAD(&irq->ap_list); 267 + raw_spin_lock_init(&irq->irq_lock); 268 + irq->vcpu = NULL; 269 + irq->target_vcpu = vcpu; 270 + refcount_set(&irq->refcount, 0); 271 + 272 + irq->intid = i; 273 + if (vgic_irq_is_sgi(i)) { 274 + /* SGIs */ 275 + irq->enabled = 1; 276 + irq->config = VGIC_CONFIG_EDGE; 277 + } else { 278 + /* PPIs */ 279 + irq->config = VGIC_CONFIG_LEVEL; 280 + } 281 + 282 + switch (type) { 283 + case KVM_DEV_TYPE_ARM_VGIC_V3: 284 + irq->group = 1; 285 + irq->mpidr = kvm_vcpu_get_mpidr_aff(vcpu); 286 + break; 287 + case KVM_DEV_TYPE_ARM_VGIC_V2: 288 + irq->group = 0; 289 + irq->targets = BIT(vcpu->vcpu_id); 290 + break; 291 + } 292 + } 293 + 294 + static void vgic_v5_allocate_private_irq(struct kvm_vcpu *vcpu, int i, u32 type) 295 + { 296 + struct vgic_irq *irq = &vcpu->arch.vgic_cpu.private_irqs[i]; 297 + u32 intid = vgic_v5_make_ppi(i); 298 + 299 + INIT_LIST_HEAD(&irq->ap_list); 300 + raw_spin_lock_init(&irq->irq_lock); 301 + irq->vcpu = NULL; 302 + irq->target_vcpu = vcpu; 303 + 
refcount_set(&irq->refcount, 0); 304 + 305 + irq->intid = intid; 306 + 307 + /* The only Edge architected PPI is the SW_PPI */ 308 + if (i == GICV5_ARCH_PPI_SW_PPI) 309 + irq->config = VGIC_CONFIG_EDGE; 310 + else 311 + irq->config = VGIC_CONFIG_LEVEL; 312 + 313 + /* Register the GICv5-specific PPI ops */ 314 + vgic_v5_set_ppi_ops(vcpu, intid); 315 + } 316 + 274 317 static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type) 275 318 { 276 319 struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu; 320 + u32 num_private_irqs; 277 321 int i; 278 322 279 323 lockdep_assert_held(&vcpu->kvm->arch.config_lock); ··· 337 269 if (vgic_cpu->private_irqs) 338 270 return 0; 339 271 272 + if (vgic_is_v5(vcpu->kvm)) 273 + num_private_irqs = VGIC_V5_NR_PRIVATE_IRQS; 274 + else 275 + num_private_irqs = VGIC_NR_PRIVATE_IRQS; 276 + 340 277 vgic_cpu->private_irqs = kzalloc_objs(struct vgic_irq, 341 - VGIC_NR_PRIVATE_IRQS, 278 + num_private_irqs, 342 279 GFP_KERNEL_ACCOUNT); 343 280 344 281 if (!vgic_cpu->private_irqs) ··· 353 280 * Enable and configure all SGIs to be edge-triggered and 354 281 * configure all PPIs as level-triggered. 
355 282 */ 356 - for (i = 0; i < VGIC_NR_PRIVATE_IRQS; i++) { 357 - struct vgic_irq *irq = &vgic_cpu->private_irqs[i]; 358 - 359 - INIT_LIST_HEAD(&irq->ap_list); 360 - raw_spin_lock_init(&irq->irq_lock); 361 - irq->intid = i; 362 - irq->vcpu = NULL; 363 - irq->target_vcpu = vcpu; 364 - refcount_set(&irq->refcount, 0); 365 - if (vgic_irq_is_sgi(i)) { 366 - /* SGIs */ 367 - irq->enabled = 1; 368 - irq->config = VGIC_CONFIG_EDGE; 369 - } else { 370 - /* PPIs */ 371 - irq->config = VGIC_CONFIG_LEVEL; 372 - } 373 - 374 - switch (type) { 375 - case KVM_DEV_TYPE_ARM_VGIC_V3: 376 - irq->group = 1; 377 - irq->mpidr = kvm_vcpu_get_mpidr_aff(vcpu); 378 - break; 379 - case KVM_DEV_TYPE_ARM_VGIC_V2: 380 - irq->group = 0; 381 - irq->targets = BIT(vcpu->vcpu_id); 382 - break; 383 - } 283 + for (i = 0; i < num_private_irqs; i++) { 284 + if (vgic_is_v5(vcpu->kvm)) 285 + vgic_v5_allocate_private_irq(vcpu, i, type); 286 + else 287 + vgic_allocate_private_irq(vcpu, i, type); 384 288 } 385 289 386 290 return 0; ··· 416 366 417 367 static void kvm_vgic_vcpu_reset(struct kvm_vcpu *vcpu) 418 368 { 419 - if (kvm_vgic_global_state.type == VGIC_V2) 369 + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 370 + 371 + if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V5) 372 + vgic_v5_reset(vcpu); 373 + else if (kvm_vgic_global_state.type == VGIC_V2) 420 374 vgic_v2_reset(vcpu); 421 375 else 422 376 vgic_v3_reset(vcpu); ··· 451 397 if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus)) 452 398 return -EBUSY; 453 399 454 - /* freeze the number of spis */ 455 - if (!dist->nr_spis) 456 - dist->nr_spis = VGIC_NR_IRQS_LEGACY - VGIC_NR_PRIVATE_IRQS; 400 + if (!vgic_is_v5(kvm)) { 401 + /* freeze the number of spis */ 402 + if (!dist->nr_spis) 403 + dist->nr_spis = VGIC_NR_IRQS_LEGACY - VGIC_NR_PRIVATE_IRQS; 457 404 458 - ret = kvm_vgic_dist_init(kvm, dist->nr_spis); 459 - if (ret) 460 - goto out; 461 - 462 - /* 463 - * Ensure vPEs are allocated if direct IRQ injection (e.g. 
vSGIs, 464 - * vLPIs) is supported. 465 - */ 466 - if (vgic_supports_direct_irqs(kvm)) { 467 - ret = vgic_v4_init(kvm); 405 + ret = kvm_vgic_dist_init(kvm, dist->nr_spis); 468 406 if (ret) 469 - goto out; 407 + return ret; 408 + 409 + /* 410 + * Ensure vPEs are allocated if direct IRQ injection (e.g. vSGIs, 411 + * vLPIs) is supported. 412 + */ 413 + if (vgic_supports_direct_irqs(kvm)) { 414 + ret = vgic_v4_init(kvm); 415 + if (ret) 416 + return ret; 417 + } 418 + } else { 419 + ret = vgic_v5_init(kvm); 420 + if (ret) 421 + return ret; 470 422 } 471 423 472 424 kvm_for_each_vcpu(idx, vcpu, kvm) ··· 480 420 481 421 ret = kvm_vgic_setup_default_irq_routing(kvm); 482 422 if (ret) 483 - goto out; 423 + return ret; 484 424 485 425 vgic_debug_init(kvm); 486 426 dist->initialized = true; 487 - out: 488 - return ret; 427 + 428 + return 0; 489 429 } 490 430 491 431 static void kvm_vgic_dist_destroy(struct kvm *kvm) ··· 629 569 int kvm_vgic_map_resources(struct kvm *kvm) 630 570 { 631 571 struct vgic_dist *dist = &kvm->arch.vgic; 572 + bool needs_dist = true; 632 573 enum vgic_type type; 633 574 gpa_t dist_base; 634 575 int ret = 0; ··· 648 587 if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V2) { 649 588 ret = vgic_v2_map_resources(kvm); 650 589 type = VGIC_V2; 651 - } else { 590 + } else if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) { 652 591 ret = vgic_v3_map_resources(kvm); 653 592 type = VGIC_V3; 593 + } else { 594 + ret = vgic_v5_map_resources(kvm); 595 + type = VGIC_V5; 596 + needs_dist = false; 654 597 } 655 598 656 599 if (ret) 657 600 goto out; 658 601 659 - dist_base = dist->vgic_dist_base; 660 - mutex_unlock(&kvm->arch.config_lock); 602 + if (needs_dist) { 603 + dist_base = dist->vgic_dist_base; 604 + mutex_unlock(&kvm->arch.config_lock); 661 605 662 - ret = vgic_register_dist_iodev(kvm, dist_base, type); 663 - if (ret) { 664 - kvm_err("Unable to register VGIC dist MMIO regions\n"); 665 - goto out_slots; 606 + ret = vgic_register_dist_iodev(kvm, dist_base, 
type); 607 + if (ret) { 608 + kvm_err("Unable to register VGIC dist MMIO regions\n"); 609 + goto out_slots; 610 + } 611 + } else { 612 + mutex_unlock(&kvm->arch.config_lock); 666 613 } 667 614 668 615 smp_store_release(&dist->ready, true); ··· 684 615 mutex_unlock(&kvm->slots_lock); 685 616 686 617 return ret; 618 + } 619 + 620 + void kvm_vgic_finalize_idregs(struct kvm *kvm) 621 + { 622 + u32 type = kvm->arch.vgic.vgic_model; 623 + u64 aa64pfr0, aa64pfr2, pfr1; 624 + 625 + aa64pfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC; 626 + aa64pfr2 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1) & ~ID_AA64PFR2_EL1_GCIE; 627 + pfr1 = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC; 628 + 629 + switch (type) { 630 + case KVM_DEV_TYPE_ARM_VGIC_V2: 631 + break; 632 + case KVM_DEV_TYPE_ARM_VGIC_V3: 633 + aa64pfr0 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP); 634 + if (kvm_supports_32bit_el0()) 635 + pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3); 636 + break; 637 + case KVM_DEV_TYPE_ARM_VGIC_V5: 638 + aa64pfr2 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR2_EL1, GCIE, IMP); 639 + break; 640 + default: 641 + WARN_ONCE(1, "Unknown VGIC type!!!\n"); 642 + } 643 + 644 + kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, aa64pfr0); 645 + kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1, aa64pfr2); 646 + kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, pfr1); 687 647 } 688 648 689 649 /* GENERIC PROBE */
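
`kvm_vgic_finalize_idregs()` above follows a clear-then-set pattern: it first clears the GIC and GCIE ID fields, then sets only the field that matches the vGIC model that was actually created. The sketch below shows that pattern on plain integers. The field shifts, the `struct idregs` container and the enum names are illustrative assumptions, and the `ID_PFR1_EL1` handling is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model enum standing in for KVM_DEV_TYPE_ARM_VGIC_V{2,3,5}. */
enum gic_model { GIC_MODEL_V2, GIC_MODEL_V3, GIC_MODEL_V5 };

/* Illustrative 4-bit field positions; verify against the Arm ARM before use. */
#define PFR0_GIC_SHIFT	24
#define PFR2_GCIE_SHIFT	12
#define FIELD_MASK(s)	(UINT64_C(0xf) << (s))
#define FIELD_IMP(s)	(UINT64_C(0x1) << (s))

struct idregs {
	uint64_t aa64pfr0;
	uint64_t aa64pfr2;
};

/* Clear both interrupt-controller fields, then advertise only one of them. */
static void finalize_idregs(struct idregs *r, enum gic_model model)
{
	assert(r);

	r->aa64pfr0 &= ~FIELD_MASK(PFR0_GIC_SHIFT);
	r->aa64pfr2 &= ~FIELD_MASK(PFR2_GCIE_SHIFT);

	switch (model) {
	case GIC_MODEL_V3:
		r->aa64pfr0 |= FIELD_IMP(PFR0_GIC_SHIFT);
		break;
	case GIC_MODEL_V5:
		r->aa64pfr2 |= FIELD_IMP(PFR2_GCIE_SHIFT);
		break;
	case GIC_MODEL_V2:
		/* A vGICv2 guest sees neither system-register interface. */
		break;
	}
}
```

Starting from a state where both fields read as implemented (the combination the QEMU comment in sys_regs.c describes), a call with the V3 model leaves only the GIC field set, and a call with the V5 model leaves only GCIE set.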

+106 -1

arch/arm64/kvm/vgic/vgic-kvm-device.c

··· 336 336 break; 337 337 ret = kvm_vgic_register_its_device(); 338 338 break; 339 + case KVM_DEV_TYPE_ARM_VGIC_V5: 340 + ret = kvm_register_device_ops(&kvm_arm_vgic_v5_ops, 341 + KVM_DEV_TYPE_ARM_VGIC_V5); 342 + break; 339 343 } 340 344 341 345 return ret; ··· 643 639 if (vgic_initialized(dev->kvm)) 644 640 return -EBUSY; 645 641 646 - if (!irq_is_ppi(val)) 642 + if (!irq_is_ppi(dev->kvm, val)) 647 643 return -EINVAL; 648 644 649 645 dev->kvm->arch.vgic.mi_intid = val; ··· 718 714 .set_attr = vgic_v3_set_attr, 719 715 .get_attr = vgic_v3_get_attr, 720 716 .has_attr = vgic_v3_has_attr, 717 + }; 718 + 719 + static int vgic_v5_get_userspace_ppis(struct kvm_device *dev, 720 + struct kvm_device_attr *attr) 721 + { 722 + struct vgic_v5_vm *gicv5_vm = &dev->kvm->arch.vgic.gicv5_vm; 723 + u64 __user *uaddr = (u64 __user *)(long)attr->addr; 724 + int ret; 725 + 726 + guard(mutex)(&dev->kvm->arch.config_lock); 727 + 728 + /* 729 + * We either support 64 or 128 PPIs. In the former case, we need to 730 + * return 0s for the second 64 bits as we have no storage backing those. 
731 + */ 732 + ret = put_user(bitmap_read(gicv5_vm->userspace_ppis, 0, 64), uaddr); 733 + if (ret) 734 + return ret; 735 + uaddr++; 736 + 737 + if (VGIC_V5_NR_PRIVATE_IRQS == 128) 738 + ret = put_user(bitmap_read(gicv5_vm->userspace_ppis, 64, 128), uaddr); 739 + else 740 + ret = put_user(0, uaddr); 741 + 742 + return ret; 743 + } 744 + 745 + static int vgic_v5_set_attr(struct kvm_device *dev, 746 + struct kvm_device_attr *attr) 747 + { 748 + switch (attr->group) { 749 + case KVM_DEV_ARM_VGIC_GRP_ADDR: 750 + case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS: 751 + case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: 752 + return -ENXIO; 753 + case KVM_DEV_ARM_VGIC_GRP_CTRL: 754 + switch (attr->attr) { 755 + case KVM_DEV_ARM_VGIC_CTRL_INIT: 756 + return vgic_set_common_attr(dev, attr); 757 + case KVM_DEV_ARM_VGIC_USERSPACE_PPIS: 758 + default: 759 + return -ENXIO; 760 + } 761 + default: 762 + return -ENXIO; 763 + } 764 + 765 + } 766 + 767 + static int vgic_v5_get_attr(struct kvm_device *dev, 768 + struct kvm_device_attr *attr) 769 + { 770 + switch (attr->group) { 771 + case KVM_DEV_ARM_VGIC_GRP_ADDR: 772 + case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS: 773 + case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: 774 + return -ENXIO; 775 + case KVM_DEV_ARM_VGIC_GRP_CTRL: 776 + switch (attr->attr) { 777 + case KVM_DEV_ARM_VGIC_CTRL_INIT: 778 + return vgic_get_common_attr(dev, attr); 779 + case KVM_DEV_ARM_VGIC_USERSPACE_PPIS: 780 + return vgic_v5_get_userspace_ppis(dev, attr); 781 + default: 782 + return -ENXIO; 783 + } 784 + default: 785 + return -ENXIO; 786 + } 787 + } 788 + 789 + static int vgic_v5_has_attr(struct kvm_device *dev, 790 + struct kvm_device_attr *attr) 791 + { 792 + switch (attr->group) { 793 + case KVM_DEV_ARM_VGIC_GRP_ADDR: 794 + case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS: 795 + case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: 796 + return -ENXIO; 797 + case KVM_DEV_ARM_VGIC_GRP_CTRL: 798 + switch (attr->attr) { 799 + case KVM_DEV_ARM_VGIC_CTRL_INIT: 800 + return 0; 801 + case KVM_DEV_ARM_VGIC_USERSPACE_PPIS: 802 + return 0; 
803 + default: 804 + return -ENXIO; 805 + } 806 + default: 807 + return -ENXIO; 808 + } 809 + } 810 + 811 + struct kvm_device_ops kvm_arm_vgic_v5_ops = { 812 + .name = "kvm-arm-vgic-v5", 813 + .create = vgic_create, 814 + .destroy = vgic_destroy, 815 + .set_attr = vgic_v5_set_attr, 816 + .get_attr = vgic_v5_get_attr, 817 + .has_attr = vgic_v5_has_attr, 721 818 };

+34 -6

arch/arm64/kvm/vgic/vgic-mmio.c

@@ -842,18 +842,46 @@
 
 void vgic_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr)
 {
-	if (kvm_vgic_global_state.type == VGIC_V2)
-		vgic_v2_set_vmcr(vcpu, vmcr);
-	else
+	const struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+
+	switch (dist->vgic_model) {
+	case KVM_DEV_TYPE_ARM_VGIC_V5:
+		vgic_v5_set_vmcr(vcpu, vmcr);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V3:
 		vgic_v3_set_vmcr(vcpu, vmcr);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V2:
+		if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif))
+			vgic_v3_set_vmcr(vcpu, vmcr);
+		else
+			vgic_v2_set_vmcr(vcpu, vmcr);
+		break;
+	default:
+		BUG();
+	}
 }
 
 void vgic_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr)
 {
-	if (kvm_vgic_global_state.type == VGIC_V2)
-		vgic_v2_get_vmcr(vcpu, vmcr);
-	else
+	const struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+
+	switch (dist->vgic_model) {
+	case KVM_DEV_TYPE_ARM_VGIC_V5:
+		vgic_v5_get_vmcr(vcpu, vmcr);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V3:
 		vgic_v3_get_vmcr(vcpu, vmcr);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V2:
+		if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif))
+			vgic_v3_get_vmcr(vcpu, vmcr);
+		else
+			vgic_v2_get_vmcr(vcpu, vmcr);
+		break;
+	default:
+		BUG();
+	}
 }
 
 /*
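
The VMCR accessors above now dispatch on the guest's vGIC model instead of the host GIC type. The only case that still consults the host is a vGICv2 guest, which uses the v3 backend when the host exposes a GICv3 CPU interface. A small sketch of that decision as a pure function follows. The enums and `pick_vmcr_backend()` are hypothetical names for illustration, with a boolean standing in for the `gicv3_cpuif` static key.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the guest model and the chosen backend. */
enum guest_gic_model { GUEST_GIC_V2, GUEST_GIC_V3, GUEST_GIC_V5 };
enum vmcr_backend { VMCR_BACKEND_V2, VMCR_BACKEND_V3, VMCR_BACKEND_V5 };

/*
 * Pick the VMCR backend from the guest's vGIC model. Only a vGICv2 guest
 * depends on the host: it is served by the v3 backend when the host has a
 * GICv3 CPU interface.
 */
static enum vmcr_backend pick_vmcr_backend(enum guest_gic_model model,
					   bool host_has_v3_cpuif)
{
	switch (model) {
	case GUEST_GIC_V5:
		return VMCR_BACKEND_V5;
	case GUEST_GIC_V3:
		return VMCR_BACKEND_V3;
	case GUEST_GIC_V2:
		return host_has_v3_cpuif ? VMCR_BACKEND_V3 : VMCR_BACKEND_V2;
	}

	/* Mirrors the BUG() in the default case of the kernel switch. */
	assert(!"unknown vGIC model");
	return VMCR_BACKEND_V2;
}
```

Keying on the guest model matters on a GICv5 host with legacy support, where the host type alone no longer says whether a given guest is a vGICv3 or a vGICv5.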

+1 -1

arch/arm64/kvm/vgic/vgic-v3.c

@@ -499,7 +499,7 @@
 {
 	struct vgic_v3_cpu_if *vgic_v3 = &vcpu->arch.vgic_cpu.vgic_v3;
 
-	if (!vgic_is_v3(vcpu->kvm))
+	if (!vgic_host_has_gicv3())
 		return;
 
 	/* Hide GICv3 sysreg if necessary */

+490 -9

arch/arm64/kvm/vgic/vgic-v5.c

··· 1 1 // SPDX-License-Identifier: GPL-2.0-only 2 + /* 3 + * Copyright (C) 2025, 2026 Arm Ltd. 4 + */ 2 5 3 6 #include <kvm/arm_vgic.h> 7 + 8 + #include <linux/bitops.h> 4 9 #include <linux/irqchip/arm-vgic-info.h> 5 10 6 11 #include "vgic.h" 7 12 13 + static struct vgic_v5_ppi_caps ppi_caps; 14 + 15 + /* 16 + * Not all PPIs are guaranteed to be implemented for GICv5. Deterermine which 17 + * ones are, and generate a mask. 18 + */ 19 + static void vgic_v5_get_implemented_ppis(void) 20 + { 21 + if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) 22 + return; 23 + 24 + /* 25 + * If we have KVM, we have EL2, which means that we have support for the 26 + * EL1 and EL2 Physical & Virtual timers. 27 + */ 28 + __assign_bit(GICV5_ARCH_PPI_CNTHP, ppi_caps.impl_ppi_mask, 1); 29 + __assign_bit(GICV5_ARCH_PPI_CNTV, ppi_caps.impl_ppi_mask, 1); 30 + __assign_bit(GICV5_ARCH_PPI_CNTHV, ppi_caps.impl_ppi_mask, 1); 31 + __assign_bit(GICV5_ARCH_PPI_CNTP, ppi_caps.impl_ppi_mask, 1); 32 + 33 + /* The SW_PPI should be available */ 34 + __assign_bit(GICV5_ARCH_PPI_SW_PPI, ppi_caps.impl_ppi_mask, 1); 35 + 36 + /* The PMUIRQ is available if we have the PMU */ 37 + __assign_bit(GICV5_ARCH_PPI_PMUIRQ, ppi_caps.impl_ppi_mask, system_supports_pmuv3()); 38 + } 39 + 8 40 /* 9 41 * Probe for a vGICv5 compatible interrupt controller, returning 0 on success. 10 - * Currently only supports GICv3-based VMs on a GICv5 host, and hence only 11 - * registers a VGIC_V3 device. 
12 42 */ 13 43 int vgic_v5_probe(const struct gic_kvm_info *info) 14 44 { 45 + bool v5_registered = false; 15 46 u64 ich_vtr_el2; 16 47 int ret; 17 48 18 - if (!cpus_have_final_cap(ARM64_HAS_GICV5_LEGACY)) 19 - return -ENODEV; 20 - 21 49 kvm_vgic_global_state.type = VGIC_V5; 22 - kvm_vgic_global_state.has_gcie_v3_compat = true; 23 - 24 - /* We only support v3 compat mode - use vGICv3 limits */ 25 - kvm_vgic_global_state.max_gic_vcpus = VGIC_V3_MAX_CPUS; 26 50 27 51 kvm_vgic_global_state.vcpu_base = 0; 28 52 kvm_vgic_global_state.vctrl_base = NULL; ··· 54 30 kvm_vgic_global_state.has_gicv4 = false; 55 31 kvm_vgic_global_state.has_gicv4_1 = false; 56 32 33 + /* 34 + * GICv5 is currently not supported in Protected mode. Skip the 35 + * registration of GICv5 completely to make sure no guests can create a 36 + * GICv5-based guest. 37 + */ 38 + if (is_protected_kvm_enabled()) { 39 + kvm_info("GICv5-based guests are not supported with pKVM\n"); 40 + goto skip_v5; 41 + } 42 + 43 + kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS; 44 + 45 + vgic_v5_get_implemented_ppis(); 46 + 47 + ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V5); 48 + if (ret) { 49 + kvm_err("Cannot register GICv5 KVM device.\n"); 50 + goto skip_v5; 51 + } 52 + 53 + v5_registered = true; 54 + kvm_info("GCIE system register CPU interface\n"); 55 + 56 + skip_v5: 57 + /* If we don't support the GICv3 compat mode we're done. 
*/ 58 + if (!cpus_have_final_cap(ARM64_HAS_GICV5_LEGACY)) { 59 + if (!v5_registered) 60 + return -ENODEV; 61 + return 0; 62 + } 63 + 64 + kvm_vgic_global_state.has_gcie_v3_compat = true; 57 65 ich_vtr_el2 = kvm_call_hyp_ret(__vgic_v3_get_gic_config); 58 66 kvm_vgic_global_state.ich_vtr_el2 = (u32)ich_vtr_el2; 59 67 ··· 101 45 return ret; 102 46 } 103 47 48 + /* We potentially limit the max VCPUs further than we need to here */ 49 + kvm_vgic_global_state.max_gic_vcpus = min(VGIC_V3_MAX_CPUS, 50 + VGIC_V5_MAX_CPUS); 51 + 104 52 static_branch_enable(&kvm_vgic_global_state.gicv3_cpuif); 105 53 kvm_info("GCIE legacy system register CPU interface\n"); 106 54 107 55 vgic_v3_enable_cpuif_traps(); 108 56 109 57 return 0; 58 + } 59 + 60 + void vgic_v5_reset(struct kvm_vcpu *vcpu) 61 + { 62 + /* 63 + * We always present 16-bits of ID space to the guest, irrespective of 64 + * the host allowing more. 65 + */ 66 + vcpu->arch.vgic_cpu.num_id_bits = ICC_IDR0_EL1_ID_BITS_16BITS; 67 + 68 + /* 69 + * The GICv5 architeture only supports 5-bits of priority in the 70 + * CPUIF (but potentially fewer in the IRS). 71 + */ 72 + vcpu->arch.vgic_cpu.num_pri_bits = 5; 73 + } 74 + 75 + int vgic_v5_init(struct kvm *kvm) 76 + { 77 + struct kvm_vcpu *vcpu; 78 + unsigned long idx; 79 + 80 + if (vgic_initialized(kvm)) 81 + return 0; 82 + 83 + kvm_for_each_vcpu(idx, vcpu, kvm) { 84 + if (vcpu_has_nv(vcpu)) { 85 + kvm_err("Nested GICv5 VMs are currently unsupported\n"); 86 + return -EINVAL; 87 + } 88 + } 89 + 90 + /* We only allow userspace to drive the SW_PPI, if it is implemented. 
*/ 91 + bitmap_zero(kvm->arch.vgic.gicv5_vm.userspace_ppis, 92 + VGIC_V5_NR_PRIVATE_IRQS); 93 + __assign_bit(GICV5_ARCH_PPI_SW_PPI, 94 + kvm->arch.vgic.gicv5_vm.userspace_ppis, 95 + VGIC_V5_NR_PRIVATE_IRQS); 96 + bitmap_and(kvm->arch.vgic.gicv5_vm.userspace_ppis, 97 + kvm->arch.vgic.gicv5_vm.userspace_ppis, 98 + ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); 99 + 100 + return 0; 101 + } 102 + 103 + int vgic_v5_map_resources(struct kvm *kvm) 104 + { 105 + if (!vgic_initialized(kvm)) 106 + return -EBUSY; 107 + 108 + return 0; 109 + } 110 + 111 + int vgic_v5_finalize_ppi_state(struct kvm *kvm) 112 + { 113 + struct kvm_vcpu *vcpu0; 114 + int i; 115 + 116 + if (!vgic_is_v5(kvm)) 117 + return 0; 118 + 119 + guard(mutex)(&kvm->arch.config_lock); 120 + 121 + /* 122 + * If SW_PPI has been advertised, then we know we already 123 + * initialised the whole thing, and we can return early. Yes, 124 + * this is pretty hackish as far as state tracking goes... 125 + */ 126 + if (test_bit(GICV5_ARCH_PPI_SW_PPI, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask)) 127 + return 0; 128 + 129 + /* The PPI state for all VCPUs should be the same. Pick the first. 
*/ 130 + vcpu0 = kvm_get_vcpu(kvm, 0); 131 + 132 + bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); 133 + bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr, VGIC_V5_NR_PRIVATE_IRQS); 134 + 135 + for_each_set_bit(i, ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) { 136 + const u32 intid = vgic_v5_make_ppi(i); 137 + struct vgic_irq *irq; 138 + 139 + irq = vgic_get_vcpu_irq(vcpu0, intid); 140 + 141 + /* Expose PPIs with an owner or the SW_PPI, only */ 142 + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { 143 + if (irq->owner || i == GICV5_ARCH_PPI_SW_PPI) { 144 + __assign_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, 1); 145 + __assign_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr, 146 + irq->config == VGIC_CONFIG_LEVEL); 147 + } 148 + } 149 + 150 + vgic_put_irq(vcpu0->kvm, irq); 151 + } 152 + 153 + return 0; 154 + } 155 + 156 + static u32 vgic_v5_get_effective_priority_mask(struct kvm_vcpu *vcpu) 157 + { 158 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 159 + u32 highest_ap, priority_mask, apr; 160 + 161 + /* 162 + * If the guest's CPU has not opted to receive interrupts, then the 163 + * effective running priority is the highest priority. Just return 0 164 + * (the highest priority). 165 + */ 166 + if (!FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_EN, cpu_if->vgic_vmcr)) 167 + return 0; 168 + 169 + /* 170 + * Counting the number of trailing zeros gives the current active 171 + * priority. Explicitly use the 32-bit version here as we have 32 172 + * priorities. 32 then means that there are no active priorities. 173 + */ 174 + apr = cpu_if->vgic_apr; 175 + highest_ap = apr ? __builtin_ctz(apr) : 32; 176 + 177 + /* 178 + * An interrupt is of sufficient priority if it is equal to or 179 + * greater than the priority mask. Add 1 to the priority mask 180 + * (i.e., lower priority) to match the APR logic before taking 181 + * the min. This gives us the lowest priority that is masked. 
182 + */ 183 + priority_mask = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_VPMR, cpu_if->vgic_vmcr); 184 + 185 + return min(highest_ap, priority_mask + 1); 186 + } 187 + 188 + /* 189 + * For GICv5, the PPIs are mostly directly managed by the hardware. We (the 190 + * hypervisor) handle the pending, active, enable state save/restore, but don't 191 + * need the PPIs to be queued on a per-VCPU AP list. Therefore, sanity check the 192 + * state, unlock, and return. 193 + */ 194 + bool vgic_v5_ppi_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, 195 + unsigned long flags) 196 + __releases(&irq->irq_lock) 197 + { 198 + struct kvm_vcpu *vcpu; 199 + 200 + lockdep_assert_held(&irq->irq_lock); 201 + 202 + if (WARN_ON_ONCE(!__irq_is_ppi(KVM_DEV_TYPE_ARM_VGIC_V5, irq->intid))) 203 + goto out_unlock_fail; 204 + 205 + vcpu = irq->target_vcpu; 206 + if (WARN_ON_ONCE(!vcpu)) 207 + goto out_unlock_fail; 208 + 209 + raw_spin_unlock_irqrestore(&irq->irq_lock, flags); 210 + 211 + /* Directly kick the target VCPU to make sure it sees the IRQ */ 212 + kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu); 213 + kvm_vcpu_kick(vcpu); 214 + 215 + return true; 216 + 217 + out_unlock_fail: 218 + raw_spin_unlock_irqrestore(&irq->irq_lock, flags); 219 + 220 + return false; 221 + } 222 + 223 + /* 224 + * Sets/clears the corresponding bit in the ICH_PPI_DVIR register. 
225 + */ 226 + void vgic_v5_set_ppi_dvi(struct kvm_vcpu *vcpu, struct vgic_irq *irq, bool dvi) 227 + { 228 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 229 + u32 ppi; 230 + 231 + lockdep_assert_held(&irq->irq_lock); 232 + 233 + ppi = vgic_v5_get_hwirq_id(irq->intid); 234 + __assign_bit(ppi, cpu_if->vgic_ppi_dvir, dvi); 235 + } 236 + 237 + static struct irq_ops vgic_v5_ppi_irq_ops = { 238 + .queue_irq_unlock = vgic_v5_ppi_queue_irq_unlock, 239 + .set_direct_injection = vgic_v5_set_ppi_dvi, 240 + }; 241 + 242 + void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid) 243 + { 244 + kvm_vgic_set_irq_ops(vcpu, vintid, &vgic_v5_ppi_irq_ops); 245 + } 246 + 247 + /* 248 + * Sync back the PPI priorities to the vgic_irq shadow state for any interrupts 249 + * exposed to the guest (skipping all others). 250 + */ 251 + static void vgic_v5_sync_ppi_priorities(struct kvm_vcpu *vcpu) 252 + { 253 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 254 + u64 priorityr; 255 + int i; 256 + 257 + /* 258 + * We have up to 16 PPI Priority regs, but only have a few interrupts 259 + * that the guest is allowed to use. Limit our sync of PPI priorities to 260 + * those actually exposed to the guest by first iterating over the mask 261 + * of exposed PPIs. 262 + */ 263 + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) { 264 + u32 intid = vgic_v5_make_ppi(i); 265 + struct vgic_irq *irq; 266 + int pri_idx, pri_reg, pri_bit; 267 + u8 priority; 268 + 269 + /* 270 + * Determine which priority register and the field within it to 271 + * extract. 
272 + */ 273 + pri_reg = i / 8; 274 + pri_idx = i % 8; 275 + pri_bit = pri_idx * 8; 276 + 277 + priorityr = cpu_if->vgic_ppi_priorityr[pri_reg]; 278 + priority = field_get(GENMASK(pri_bit + 4, pri_bit), priorityr); 279 + 280 + irq = vgic_get_vcpu_irq(vcpu, intid); 281 + 282 + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) 283 + irq->priority = priority; 284 + 285 + vgic_put_irq(vcpu->kvm, irq); 286 + } 287 + } 288 + 289 + bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu) 290 + { 291 + unsigned int priority_mask; 292 + int i; 293 + 294 + priority_mask = vgic_v5_get_effective_priority_mask(vcpu); 295 + 296 + /* 297 + * If the combined priority mask is 0, nothing can be signalled! In the 298 + * case where the guest has disabled interrupt delivery for the vcpu 299 + * (via ICV_CR0_EL1.EN->ICH_VMCR_EL2.EN), we calculate the priority mask 300 + * as 0 too (the highest possible priority). 301 + */ 302 + if (!priority_mask) 303 + return false; 304 + 305 + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) { 306 + u32 intid = vgic_v5_make_ppi(i); 307 + bool has_pending = false; 308 + struct vgic_irq *irq; 309 + 310 + irq = vgic_get_vcpu_irq(vcpu, intid); 311 + 312 + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) 313 + if (irq->enabled && irq->priority < priority_mask) 314 + has_pending = irq->hw ? vgic_get_phys_line_level(irq) : irq_is_pending(irq); 315 + 316 + vgic_put_irq(vcpu->kvm, irq); 317 + 318 + if (has_pending) 319 + return true; 320 + } 321 + 322 + return false; 323 + } 324 + 325 + /* 326 + * Detect any PPIs state changes, and propagate the state with KVM's 327 + * shadow structures. 
328 + */ 329 + void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu) 330 + { 331 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 332 + unsigned long *activer, *pendr; 333 + int i; 334 + 335 + activer = host_data_ptr(vgic_v5_ppi_state)->activer_exit; 336 + pendr = host_data_ptr(vgic_v5_ppi_state)->pendr; 337 + 338 + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, 339 + VGIC_V5_NR_PRIVATE_IRQS) { 340 + u32 intid = vgic_v5_make_ppi(i); 341 + struct vgic_irq *irq; 342 + 343 + irq = vgic_get_vcpu_irq(vcpu, intid); 344 + 345 + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { 346 + irq->active = test_bit(i, activer); 347 + 348 + /* This is an OR to avoid losing incoming edges! */ 349 + if (irq->config == VGIC_CONFIG_EDGE) 350 + irq->pending_latch |= test_bit(i, pendr); 351 + } 352 + 353 + vgic_put_irq(vcpu->kvm, irq); 354 + } 355 + 356 + /* 357 + * Re-inject the exit state as entry state next time! 358 + * 359 + * Note that the write of the Enable state is trapped, and hence there 360 + * is nothing to explicitly sync back here as we already have the latest 361 + * copy by definition. 362 + */ 363 + bitmap_copy(cpu_if->vgic_ppi_activer, activer, VGIC_V5_NR_PRIVATE_IRQS); 364 + } 365 + 366 + void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu) 367 + { 368 + DECLARE_BITMAP(pendr, VGIC_V5_NR_PRIVATE_IRQS); 369 + int i; 370 + 371 + /* 372 + * Time to enter the guest - we first need to build the guest's 373 + * ICC_PPI_PENDRx_EL1, however. 
374 + */ 375 + bitmap_zero(pendr, VGIC_V5_NR_PRIVATE_IRQS); 376 + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, 377 + VGIC_V5_NR_PRIVATE_IRQS) { 378 + u32 intid = vgic_v5_make_ppi(i); 379 + struct vgic_irq *irq; 380 + 381 + irq = vgic_get_vcpu_irq(vcpu, intid); 382 + 383 + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { 384 + __assign_bit(i, pendr, irq_is_pending(irq)); 385 + if (irq->config == VGIC_CONFIG_EDGE) 386 + irq->pending_latch = false; 387 + } 388 + 389 + vgic_put_irq(vcpu->kvm, irq); 390 + } 391 + 392 + /* 393 + * Copy the shadow state to the pending reg that will be written to the 394 + * ICH_PPI_PENDRx_EL2 regs. While the guest is running we track any 395 + * incoming changes to the pending state in the vgic_irq structures. The 396 + * incoming changes are merged with the outgoing changes on the return 397 + * path. 398 + */ 399 + bitmap_copy(host_data_ptr(vgic_v5_ppi_state)->pendr, pendr, 400 + VGIC_V5_NR_PRIVATE_IRQS); 401 + } 402 + 403 + void vgic_v5_load(struct kvm_vcpu *vcpu) 404 + { 405 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 406 + 407 + /* 408 + * On the WFI path, vgic_load is called a second time. The first is when 409 + * scheduling in the vcpu thread again, and the second is when leaving 410 + * WFI. Skip the second instance as it serves no purpose and just 411 + * restores the same state again. 412 + */ 413 + if (cpu_if->gicv5_vpe.resident) 414 + return; 415 + 416 + kvm_call_hyp(__vgic_v5_restore_vmcr_apr, cpu_if); 417 + 418 + cpu_if->gicv5_vpe.resident = true; 419 + } 420 + 421 + void vgic_v5_put(struct kvm_vcpu *vcpu) 422 + { 423 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 424 + 425 + /* 426 + * Do nothing if we're not resident. This can happen in the WFI path 427 + * where we do a vgic_put in the WFI path and again later when 428 + * descheduling the thread. We risk losing VMCR state if we sync it 429 + * twice, so instead return early in this case. 
430 + */ 431 + if (!cpu_if->gicv5_vpe.resident) 432 + return; 433 + 434 + kvm_call_hyp(__vgic_v5_save_apr, cpu_if); 435 + 436 + cpu_if->gicv5_vpe.resident = false; 437 + 438 + /* The shadow priority is only updated on entering WFI */ 439 + if (vcpu_get_flag(vcpu, IN_WFI)) 440 + vgic_v5_sync_ppi_priorities(vcpu); 441 + } 442 + 443 + void vgic_v5_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcrp) 444 + { 445 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 446 + u64 vmcr = cpu_if->vgic_vmcr; 447 + 448 + vmcrp->en = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_EN, vmcr); 449 + vmcrp->pmr = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_VPMR, vmcr); 450 + } 451 + 452 + void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcrp) 453 + { 454 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 455 + u64 vmcr; 456 + 457 + vmcr = FIELD_PREP(FEAT_GCIE_ICH_VMCR_EL2_VPMR, vmcrp->pmr) | 458 + FIELD_PREP(FEAT_GCIE_ICH_VMCR_EL2_EN, vmcrp->en); 459 + 460 + cpu_if->vgic_vmcr = vmcr; 461 + } 462 + 463 + void vgic_v5_restore_state(struct kvm_vcpu *vcpu) 464 + { 465 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 466 + 467 + __vgic_v5_restore_state(cpu_if); 468 + __vgic_v5_restore_ppi_state(cpu_if); 469 + dsb(sy); 470 + } 471 + 472 + void vgic_v5_save_state(struct kvm_vcpu *vcpu) 473 + { 474 + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; 475 + 476 + __vgic_v5_save_state(cpu_if); 477 + __vgic_v5_save_ppi_state(cpu_if); 478 + dsb(sy); 110 479 }
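
The priority arithmetic in vgic_v5_get_effective_priority_mask() in the diff above may be easier to follow in isolation. The following is a minimal user-space sketch, not kernel code: the function names, the plain integer parameters standing in for the EN, VPMR and APR fields, and the assumption that VPMR is a 5-bit value are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The diff tracks 32 priorities in a 32-bit active priority register. */
#define NR_PRIORITIES 32u

/*
 * Hypothetical stand-alone model of the effective priority mask.
 * 'enabled' and 'vpmr' stand in for the EN and VPMR fields of the saved
 * VMCR value, 'apr' for the saved active priority bitmap.
 */
static uint32_t effective_priority_mask(bool enabled, uint32_t vpmr, uint32_t apr)
{
    uint32_t highest_active = NR_PRIORITIES; /* 32 means nothing is active */
    uint32_t mask_limit;
    uint32_t bit;

    /* Assumption of this sketch: VPMR is a 5-bit priority value. */
    assert(vpmr < NR_PRIORITIES);

    /* Interrupt delivery disabled: report 0, the highest priority. */
    if (!enabled)
        return 0;

    /* The lowest set bit is the highest-priority active interrupt. */
    for (bit = 0; bit < NR_PRIORITIES; bit++) {
        if (apr & (UINT32_C(1) << bit)) {
            highest_active = bit;
            break;
        }
    }

    /* Add 1 to the mask so both limits are exclusive upper bounds. */
    mask_limit = vpmr + 1;

    return highest_active < mask_limit ? highest_active : mask_limit;
}

/* An interrupt can be signalled only if its priority is below the limit. */
static bool priority_can_signal(uint32_t priority, uint32_t mask)
{
    return priority < mask;
}
```

With delivery enabled, VPMR set to 15 and no active priorities, the sketch yields 16, so priorities 0 to 15 can be signalled. If priority 4 is active, the result drops to 4 and only priorities 0 to 3 pass. A result of 0 means nothing can be signalled, which corresponds to the early return in vgic_v5_has_pending_ppi().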

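The register and field indexing used by vgic_v5_sync_ppi_priorities() above can be sketched as a small stand-alone helper. This is an illustration only: the array of 16 64-bit values mirrors the "up to 16 PPI Priority regs" comment, with eight 8-bit slots per register and a 5-bit priority in the low bits of each slot, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PPI_PRIORITY_REGS   16u              /* from the comment in the diff */
#define PPIS_PER_REG        8u               /* eight 8-bit slots per register */
#define PRIORITY_FIELD_MASK UINT64_C(0x1f)   /* 5 bits of priority per slot */

/* Hypothetical helper: extract the priority of one PPI from the register set. */
static uint8_t ppi_priority(const uint64_t regs[PPI_PRIORITY_REGS], unsigned int ppi)
{
    unsigned int reg = ppi / PPIS_PER_REG;          /* which register */
    unsigned int shift = (ppi % PPIS_PER_REG) * 8u; /* bit offset of the slot */

    assert(reg < PPI_PRIORITY_REGS);

    return (uint8_t)((regs[reg] >> shift) & PRIORITY_FIELD_MASK);
}
```

For example, PPI 10 lives in register 1, slot 2, so its priority occupies bits 20:16 of that register. This matches the GENMASK(pri_bit + 4, pri_bit) extraction in the diff.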
+139 -34

arch/arm64/kvm/vgic/vgic.c

··· 86 86 */ 87 87 struct vgic_irq *vgic_get_irq(struct kvm *kvm, u32 intid) 88 88 { 89 + /* Non-private IRQs are not yet implemented for GICv5 */ 90 + if (vgic_is_v5(kvm)) 91 + return NULL; 92 + 89 93 /* SPIs */ 90 94 if (intid >= VGIC_NR_PRIVATE_IRQS && 91 95 intid < (kvm->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS)) { ··· 98 94 } 99 95 100 96 /* LPIs */ 101 - if (intid >= VGIC_MIN_LPI) 97 + if (irq_is_lpi(kvm, intid)) 102 98 return vgic_get_lpi(kvm, intid); 103 99 104 100 return NULL; ··· 108 104 { 109 105 if (WARN_ON(!vcpu)) 110 106 return NULL; 107 + 108 + if (vgic_is_v5(vcpu->kvm)) { 109 + u32 int_num, hwirq_id; 110 + 111 + if (!__irq_is_ppi(KVM_DEV_TYPE_ARM_VGIC_V5, intid)) 112 + return NULL; 113 + 114 + hwirq_id = FIELD_GET(GICV5_HWIRQ_ID, intid); 115 + int_num = array_index_nospec(hwirq_id, VGIC_V5_NR_PRIVATE_IRQS); 116 + 117 + return &vcpu->arch.vgic_cpu.private_irqs[int_num]; 118 + } 111 119 112 120 /* SGIs and PPIs */ 113 121 if (intid < VGIC_NR_PRIVATE_IRQS) { ··· 139 123 140 124 static __must_check bool __vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq) 141 125 { 142 - if (irq->intid < VGIC_MIN_LPI) 126 + if (!irq_is_lpi(kvm, irq->intid)) 143 127 return false; 144 128 145 129 return refcount_dec_and_test(&irq->refcount); ··· 164 148 * Acquire/release it early on lockdep kernels to make locking issues 165 149 * in rare release paths a bit more obvious. 
166 150 */ 167 - if (IS_ENABLED(CONFIG_LOCKDEP) && irq->intid >= VGIC_MIN_LPI) { 151 + if (IS_ENABLED(CONFIG_LOCKDEP) && irq_is_lpi(kvm, irq->intid)) { 168 152 guard(spinlock_irqsave)(&dist->lpi_xa.xa_lock); 169 153 } 170 154 ··· 202 186 raw_spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags); 203 187 204 188 list_for_each_entry_safe(irq, tmp, &vgic_cpu->ap_list_head, ap_list) { 205 - if (irq->intid >= VGIC_MIN_LPI) { 189 + if (irq_is_lpi(vcpu->kvm, irq->intid)) { 206 190 raw_spin_lock(&irq->irq_lock); 207 191 list_del(&irq->ap_list); 208 192 irq->vcpu = NULL; ··· 420 404 421 405 lockdep_assert_held(&irq->irq_lock); 422 406 407 + if (irq->ops && irq->ops->queue_irq_unlock) 408 + return irq->ops->queue_irq_unlock(kvm, irq, flags); 409 + 423 410 retry: 424 411 vcpu = vgic_target_oracle(irq); 425 412 if (irq->vcpu || !vcpu) { ··· 540 521 if (ret) 541 522 return ret; 542 523 543 - if (!vcpu && intid < VGIC_NR_PRIVATE_IRQS) 524 + if (!vcpu && irq_is_private(kvm, intid)) 544 525 return -EINVAL; 545 526 546 527 trace_vgic_update_irq_pending(vcpu ? 
vcpu->vcpu_idx : 0, intid, level); 547 528 548 - if (intid < VGIC_NR_PRIVATE_IRQS) 529 + if (irq_is_private(kvm, intid)) 549 530 irq = vgic_get_vcpu_irq(vcpu, intid); 550 531 else 551 532 irq = vgic_get_irq(kvm, intid); ··· 572 553 return 0; 573 554 } 574 555 556 + void kvm_vgic_set_irq_ops(struct kvm_vcpu *vcpu, u32 vintid, 557 + struct irq_ops *ops) 558 + { 559 + struct vgic_irq *irq = vgic_get_vcpu_irq(vcpu, vintid); 560 + 561 + BUG_ON(!irq); 562 + 563 + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) 564 + irq->ops = ops; 565 + 566 + vgic_put_irq(vcpu->kvm, irq); 567 + } 568 + 569 + void kvm_vgic_clear_irq_ops(struct kvm_vcpu *vcpu, u32 vintid) 570 + { 571 + kvm_vgic_set_irq_ops(vcpu, vintid, NULL); 572 + } 573 + 575 574 /* @irq->irq_lock must be held */ 576 575 static int kvm_vgic_map_irq(struct kvm_vcpu *vcpu, struct vgic_irq *irq, 577 - unsigned int host_irq, 578 - struct irq_ops *ops) 576 + unsigned int host_irq) 579 577 { 580 578 struct irq_desc *desc; 581 579 struct irq_data *data; ··· 612 576 irq->hw = true; 613 577 irq->host_irq = host_irq; 614 578 irq->hwintid = data->hwirq; 615 - irq->ops = ops; 579 + 580 + if (irq->ops && irq->ops->set_direct_injection) 581 + irq->ops->set_direct_injection(vcpu, irq, true); 582 + 616 583 return 0; 617 584 } 618 585 619 586 /* @irq->irq_lock must be held */ 620 587 static inline void kvm_vgic_unmap_irq(struct vgic_irq *irq) 621 588 { 589 + if (irq->ops && irq->ops->set_direct_injection) 590 + irq->ops->set_direct_injection(irq->target_vcpu, irq, false); 591 + 622 592 irq->hw = false; 623 593 irq->hwintid = 0; 624 - irq->ops = NULL; 625 594 } 626 595 627 596 int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, 628 - u32 vintid, struct irq_ops *ops) 597 + u32 vintid) 629 598 { 630 599 struct vgic_irq *irq = vgic_get_vcpu_irq(vcpu, vintid); 631 600 unsigned long flags; ··· 639 598 BUG_ON(!irq); 640 599 641 600 raw_spin_lock_irqsave(&irq->irq_lock, flags); 642 - ret = kvm_vgic_map_irq(vcpu, irq, 
host_irq, ops); 601 + ret = kvm_vgic_map_irq(vcpu, irq, host_irq); 643 602 raw_spin_unlock_irqrestore(&irq->irq_lock, flags); 644 603 vgic_put_irq(vcpu->kvm, irq); 645 604 ··· 726 685 return -EAGAIN; 727 686 728 687 /* SGIs and LPIs cannot be wired up to any device */ 729 - if (!irq_is_ppi(intid) && !vgic_valid_spi(vcpu->kvm, intid)) 688 + if (!irq_is_ppi(vcpu->kvm, intid) && !vgic_valid_spi(vcpu->kvm, intid)) 730 689 return -EINVAL; 731 690 732 691 irq = vgic_get_vcpu_irq(vcpu, intid); ··· 853 812 vgic_release_deleted_lpis(vcpu->kvm); 854 813 } 855 814 856 - static inline void vgic_fold_lr_state(struct kvm_vcpu *vcpu) 815 + static void vgic_fold_state(struct kvm_vcpu *vcpu) 857 816 { 817 + if (vgic_is_v5(vcpu->kvm)) { 818 + vgic_v5_fold_ppi_state(vcpu); 819 + return; 820 + } 821 + 858 822 if (!*host_data_ptr(last_lr_irq)) 859 823 return; 860 824 ··· 1048 1002 1049 1003 static inline void vgic_save_state(struct kvm_vcpu *vcpu) 1050 1004 { 1051 - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1005 + /* No switch statement here. See comment in vgic_restore_state() */ 1006 + if (vgic_is_v5(vcpu->kvm)) 1007 + vgic_v5_save_state(vcpu); 1008 + else if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1052 1009 vgic_v2_save_state(vcpu); 1053 1010 else 1054 1011 __vgic_v3_save_state(&vcpu->arch.vgic_cpu.vgic_v3); ··· 1060 1011 /* Sync back the hardware VGIC state into our emulation after a guest's run. 
*/ 1061 1012 void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu) 1062 1013 { 1063 - /* If nesting, emulate the HW effect from L0 to L1 */ 1064 - if (vgic_state_is_nested(vcpu)) { 1065 - vgic_v3_sync_nested(vcpu); 1066 - return; 1067 - } 1014 + if (vgic_is_v3(vcpu->kvm)) { 1015 + /* If nesting, emulate the HW effect from L0 to L1 */ 1016 + if (vgic_state_is_nested(vcpu)) { 1017 + vgic_v3_sync_nested(vcpu); 1018 + return; 1019 + } 1068 1020 1069 - if (vcpu_has_nv(vcpu)) 1070 - vgic_v3_nested_update_mi(vcpu); 1021 + if (vcpu_has_nv(vcpu)) 1022 + vgic_v3_nested_update_mi(vcpu); 1023 + } 1071 1024 1072 1025 if (can_access_vgic_from_kernel()) 1073 1026 vgic_save_state(vcpu); 1074 1027 1075 - vgic_fold_lr_state(vcpu); 1076 - vgic_prune_ap_list(vcpu); 1028 + vgic_fold_state(vcpu); 1029 + 1030 + if (!vgic_is_v5(vcpu->kvm)) 1031 + vgic_prune_ap_list(vcpu); 1077 1032 } 1078 1033 1079 1034 /* Sync interrupts that were deactivated through a DIR trap */ ··· 1093 1040 1094 1041 static inline void vgic_restore_state(struct kvm_vcpu *vcpu) 1095 1042 { 1096 - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1043 + /* 1044 + * As nice as it would be to restructure this code into a switch 1045 + * statement as can be found elsewhere, the logic quickly gets ugly. 1046 + * 1047 + * __vgic_v3_restore_state() is doing a lot of heavy lifting here. It is 1048 + * required for GICv3-on-GICv3, GICv2-on-GICv3, GICv3-on-GICv5, and the 1049 + * no-in-kernel-irqchip case on GICv3 hardware. Hence, adding a switch 1050 + * here results in much more complex code. 
1051 + */ 1052 + if (vgic_is_v5(vcpu->kvm)) 1053 + vgic_v5_restore_state(vcpu); 1054 + else if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1097 1055 vgic_v2_restore_state(vcpu); 1098 1056 else 1099 1057 __vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3); 1058 + } 1059 + 1060 + static void vgic_flush_state(struct kvm_vcpu *vcpu) 1061 + { 1062 + if (vgic_is_v5(vcpu->kvm)) { 1063 + vgic_v5_flush_ppi_state(vcpu); 1064 + return; 1065 + } 1066 + 1067 + scoped_guard(raw_spinlock, &vcpu->arch.vgic_cpu.ap_list_lock) 1068 + vgic_flush_lr_state(vcpu); 1100 1069 } 1101 1070 1102 1071 /* Flush our emulation state into the GIC hardware before entering the guest. */ ··· 1157 1082 1158 1083 DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); 1159 1084 1160 - scoped_guard(raw_spinlock, &vcpu->arch.vgic_cpu.ap_list_lock) 1161 - vgic_flush_lr_state(vcpu); 1085 + vgic_flush_state(vcpu); 1162 1086 1163 1087 if (can_access_vgic_from_kernel()) 1164 1088 vgic_restore_state(vcpu); 1165 1089 1166 - if (vgic_supports_direct_irqs(vcpu->kvm)) 1090 + if (vgic_supports_direct_irqs(vcpu->kvm) && kvm_vgic_global_state.has_gicv4) 1167 1091 vgic_v4_commit(vcpu); 1168 1092 } 1169 1093 1170 1094 void kvm_vgic_load(struct kvm_vcpu *vcpu) 1171 1095 { 1096 + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 1097 + 1172 1098 if (unlikely(!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm))) { 1173 1099 if (has_vhe() && static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1174 1100 __vgic_v3_activate_traps(&vcpu->arch.vgic_cpu.vgic_v3); 1175 1101 return; 1176 1102 } 1177 1103 1178 - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1179 - vgic_v2_load(vcpu); 1180 - else 1104 + switch (dist->vgic_model) { 1105 + case KVM_DEV_TYPE_ARM_VGIC_V5: 1106 + vgic_v5_load(vcpu); 1107 + break; 1108 + case KVM_DEV_TYPE_ARM_VGIC_V3: 1181 1109 vgic_v3_load(vcpu); 1110 + break; 1111 + case KVM_DEV_TYPE_ARM_VGIC_V2: 1112 + if 
(static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1113 + vgic_v3_load(vcpu); 1114 + else 1115 + vgic_v2_load(vcpu); 1116 + break; 1117 + default: 1118 + BUG(); 1119 + } 1182 1120 } 1183 1121 1184 1122 void kvm_vgic_put(struct kvm_vcpu *vcpu) 1185 1123 { 1124 + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; 1125 + 1186 1126 if (unlikely(!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm))) { 1187 1127 if (has_vhe() && static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1188 1128 __vgic_v3_deactivate_traps(&vcpu->arch.vgic_cpu.vgic_v3); 1189 1129 return; 1190 1130 } 1191 1131 1192 - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1193 - vgic_v2_put(vcpu); 1194 - else 1132 + switch (dist->vgic_model) { 1133 + case KVM_DEV_TYPE_ARM_VGIC_V5: 1134 + vgic_v5_put(vcpu); 1135 + break; 1136 + case KVM_DEV_TYPE_ARM_VGIC_V3: 1195 1137 vgic_v3_put(vcpu); 1138 + break; 1139 + case KVM_DEV_TYPE_ARM_VGIC_V2: 1140 + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) 1141 + vgic_v3_put(vcpu); 1142 + else 1143 + vgic_v2_put(vcpu); 1144 + break; 1145 + default: 1146 + BUG(); 1147 + } 1196 1148 } 1197 1149 1198 1150 int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu) ··· 1229 1127 bool pending = false; 1230 1128 unsigned long flags; 1231 1129 struct vgic_vmcr vmcr; 1130 + 1131 + if (vgic_is_v5(vcpu->kvm)) 1132 + return vgic_v5_has_pending_ppi(vcpu); 1232 1133 1233 1134 if (!vcpu->kvm->arch.vgic.enabled) 1234 1135 return false;
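
The model-based dispatch that kvm_vgic_load() and kvm_vgic_put() switch to in this diff can be summarised as a small decision function. The sketch below is a simplified model under stated assumptions: the enum names are hypothetical stand-ins for the KVM_DEV_TYPE_ARM_VGIC_* constants, and the boolean stands in for the gicv3_cpuif static branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the guest-visible vGIC model. */
enum vgic_model { MODEL_VGIC_V2, MODEL_VGIC_V3, MODEL_VGIC_V5 };

/* Which CPU interface backend handles load/put for that guest. */
enum cpuif_backend { BACKEND_NONE, BACKEND_V2, BACKEND_V3, BACKEND_V5 };

static enum cpuif_backend pick_backend(enum vgic_model model, bool host_v3_cpuif)
{
    switch (model) {
    case MODEL_VGIC_V5:
        return BACKEND_V5;
    case MODEL_VGIC_V3:
        return BACKEND_V3;
    case MODEL_VGIC_V2:
        /* A GICv2 guest on a host with a GICv3 CPU interface uses the v3 path. */
        return host_v3_cpuif ? BACKEND_V3 : BACKEND_V2;
    default:
        /* The kernel code calls BUG() here; the sketch asserts instead. */
        assert(!"unknown vGIC model");
        return BACKEND_NONE;
    }
}
```

Keying on the guest model rather than only on the host static branch is what lets a GICv5 guest be routed to its own load and put helpers, while the GICv2-on-GICv3 case keeps its previous behaviour.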

+40 -13

arch/arm64/kvm/vgic/vgic.h

··· 187 187 * registers regardless of the hardware backed GIC used. 188 188 */ 189 189 struct vgic_vmcr { 190 + u32 en; /* GICv5-specific */ 190 191 u32 grpen0; 191 192 u32 grpen1; 192 193 ··· 364 363 void vgic_debug_destroy(struct kvm *kvm); 365 364 366 365 int vgic_v5_probe(const struct gic_kvm_info *info); 366 + void vgic_v5_reset(struct kvm_vcpu *vcpu); 367 + int vgic_v5_init(struct kvm *kvm); 368 + int vgic_v5_map_resources(struct kvm *kvm); 369 + void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid); 370 + bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu); 371 + void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu); 372 + void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu); 373 + void vgic_v5_load(struct kvm_vcpu *vcpu); 374 + void vgic_v5_put(struct kvm_vcpu *vcpu); 375 + void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); 376 + void vgic_v5_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); 377 + void vgic_v5_restore_state(struct kvm_vcpu *vcpu); 378 + void vgic_v5_save_state(struct kvm_vcpu *vcpu); 367 379 368 380 static inline int vgic_v3_max_apr_idx(struct kvm_vcpu *vcpu) 369 381 { ··· 439 425 int vgic_its_inv_lpi(struct kvm *kvm, struct vgic_irq *irq); 440 426 int vgic_its_invall(struct kvm_vcpu *vcpu); 441 427 442 - bool system_supports_direct_sgis(void); 443 - bool vgic_supports_direct_msis(struct kvm *kvm); 444 - bool vgic_supports_direct_sgis(struct kvm *kvm); 445 - 446 - static inline bool vgic_supports_direct_irqs(struct kvm *kvm) 447 - { 448 - return vgic_supports_direct_msis(kvm) || vgic_supports_direct_sgis(kvm); 449 - } 450 - 451 428 int vgic_v4_init(struct kvm *kvm); 452 429 void vgic_v4_teardown(struct kvm *kvm); 453 430 void vgic_v4_configure_vsgis(struct kvm *kvm); ··· 452 447 return kvm_has_feat(kvm, ID_AA64PFR0_EL1, GIC, IMP); 453 448 } 454 449 450 + static inline bool kvm_has_gicv5(struct kvm *kvm) 451 + { 452 + return kvm_has_feat(kvm, ID_AA64PFR2_EL1, GCIE, IMP); 453 + } 454 + 455 455 void 
vgic_v3_flush_nested(struct kvm_vcpu *vcpu); 456 456 void vgic_v3_sync_nested(struct kvm_vcpu *vcpu); 457 457 void vgic_v3_load_nested(struct kvm_vcpu *vcpu); ··· 464 454 void vgic_v3_handle_nested_maint_irq(struct kvm_vcpu *vcpu); 465 455 void vgic_v3_nested_update_mi(struct kvm_vcpu *vcpu); 466 456 467 - static inline bool vgic_is_v3_compat(struct kvm *kvm) 457 + static inline bool vgic_host_has_gicv3(void) 468 458 { 469 - return cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF) && 459 + /* 460 + * Either the host is a native GICv3, or it is GICv5 with 461 + * FEAT_GCIE_LEGACY. 462 + */ 463 + return kvm_vgic_global_state.type == VGIC_V3 || 470 464 kvm_vgic_global_state.has_gcie_v3_compat; 471 465 } 472 466 473 - static inline bool vgic_is_v3(struct kvm *kvm) 467 + static inline bool vgic_host_has_gicv5(void) 474 468 { 475 - return kvm_vgic_global_state.type == VGIC_V3 || vgic_is_v3_compat(kvm); 469 + return kvm_vgic_global_state.type == VGIC_V5; 470 + } 471 + 472 + bool system_supports_direct_sgis(void); 473 + bool vgic_supports_direct_msis(struct kvm *kvm); 474 + bool vgic_supports_direct_sgis(struct kvm *kvm); 475 + 476 + static inline bool vgic_supports_direct_irqs(struct kvm *kvm) 477 + { 478 + /* GICv5 always supports direct IRQs */ 479 + if (vgic_is_v5(kvm)) 480 + return true; 481 + 482 + return vgic_supports_direct_msis(kvm) || vgic_supports_direct_sgis(kvm); 476 483 } 477 484 478 485 int vgic_its_debug_init(struct kvm_device *dev);
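
Since vgic_supports_direct_irqs() now returns true for every GICv5 guest, the flush path in vgic.c additionally checks has_gicv4 before calling vgic_v4_commit(). One plausible reading of how the two checks interact is sketched below. The struct and function names are hypothetical, and the booleans stand in for the kernel predicates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-VM summary of the predicates used in the diff. */
struct vm_cfg {
    bool is_v5;        /* stands in for vgic_is_v5(kvm) */
    bool direct_msis;  /* stands in for vgic_supports_direct_msis(kvm) */
    bool direct_sgis;  /* stands in for vgic_supports_direct_sgis(kvm) */
};

/* Mirrors the updated vgic_supports_direct_irqs(): GICv5 is always direct. */
static bool supports_direct_irqs(const struct vm_cfg *vm)
{
    assert(vm);

    if (vm->is_v5)
        return true;

    return vm->direct_msis || vm->direct_sgis;
}

/* The GICv4 commit step only applies when the host actually has GICv4. */
static bool should_commit_v4(const struct vm_cfg *vm, bool host_has_gicv4)
{
    return supports_direct_irqs(vm) && host_has_gicv4;
}
```

Under this reading, a GICv5 guest reports direct IRQ support but never reaches the GICv4 commit on a host without GICv4, which appears to be the purpose of the extra condition.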

+30 -3

arch/arm64/mm/fault.c

··· 43 43 #include <asm/system_misc.h> 44 44 #include <asm/tlbflush.h> 45 45 #include <asm/traps.h> 46 + #include <asm/virt.h> 46 47 47 48 struct fault_info { 48 49 int (*fn)(unsigned long far, unsigned long esr, ··· 270 269 return false; 271 270 } 272 271 272 + static bool is_pkvm_stage2_abort(unsigned int esr) 273 + { 274 + /* 275 + * S1PTW should only ever be set in ESR_EL1 if the pkvm hypervisor 276 + * injected a stage-2 abort -- see host_inject_mem_abort(). 277 + */ 278 + return is_pkvm_initialized() && (esr & ESR_ELx_S1PTW); 279 + } 280 + 273 281 static bool __kprobes is_spurious_el1_translation_fault(unsigned long addr, 274 282 unsigned long esr, 275 283 struct pt_regs *regs) ··· 299 289 * If we now have a valid translation, treat the translation fault as 300 290 * spurious. 301 291 */ 302 - if (!(par & SYS_PAR_EL1_F)) 292 + if (!(par & SYS_PAR_EL1_F)) { 293 + if (is_pkvm_stage2_abort(esr)) { 294 + par &= SYS_PAR_EL1_PA; 295 + return pkvm_force_reclaim_guest_page(par); 296 + } 297 + 303 298 return true; 299 + } 304 300 305 301 /* 306 302 * If we got a different type of fault from the AT instruction, ··· 392 376 if (!is_el1_instruction_abort(esr) && fixup_exception(regs, esr)) 393 377 return; 394 378 395 - if (WARN_RATELIMIT(is_spurious_el1_translation_fault(addr, esr, regs), 396 - "Ignoring spurious kernel translation fault at virtual address %016lx\n", addr)) 379 + if (is_spurious_el1_translation_fault(addr, esr, regs)) { 380 + WARN_RATELIMIT(!is_pkvm_stage2_abort(esr), 381 + "Ignoring spurious kernel translation fault at virtual address %016lx\n", addr); 397 382 return; 383 + } 398 384 399 385 if (is_el1_mte_sync_tag_check_fault(esr)) { 400 386 do_tag_recovery(addr, esr, regs); ··· 413 395 msg = "read from unreadable memory"; 414 396 } else if (addr < PAGE_SIZE) { 415 397 msg = "NULL pointer dereference"; 398 + } else if (is_pkvm_stage2_abort(esr)) { 399 + msg = "access to hypervisor-protected memory"; 416 400 } else { 417 401 if 
(esr_fsc_is_translation_fault(esr) && 418 402 kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs)) ··· 639 619 if (!insn_may_access_user(regs->pc, esr)) 640 620 die_kernel_fault("access to user memory outside uaccess routines", 641 621 addr, esr, regs); 622 + } 623 + 624 + if (is_pkvm_stage2_abort(esr)) { 625 + if (!user_mode(regs)) 626 + goto no_context; 627 + arm64_force_sig_fault(SIGSEGV, SEGV_ACCERR, far, "stage-2 fault"); 628 + return 0; 642 629 } 643 630 644 631 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
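
The new branch in is_spurious_el1_translation_fault() can be read as a three-way decision on the result of the address translation. The sketch below models only that step. The bit positions (S1PTW as bit 7 of the syndrome, F as bit 0 of PAR_EL1) are taken from the Arm architecture rather than from this diff, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, per the Arm architecture; not shown in the diff. */
#define ESR_S1PTW_BIT (UINT64_C(1) << 7)  /* stage-2 fault on a stage-1 walk */
#define PAR_F_BIT     (UINT64_C(1) << 0)  /* address translation failed */

enum fault_action {
    FAULT_CHECK_FURTHER,   /* translation still fails: later checks apply */
    FAULT_SPURIOUS,        /* translation now valid: ignore the fault */
    FAULT_PKVM_RECLAIM,    /* valid translation, but pKVM injected the abort */
};

/* Mirrors is_pkvm_stage2_abort(): S1PTW only counts once pKVM is up. */
static bool is_pkvm_stage2_abort(bool pkvm_initialized, uint64_t esr)
{
    return pkvm_initialized && (esr & ESR_S1PTW_BIT);
}

static enum fault_action classify_translation_fault(uint64_t par, uint64_t esr,
                                                    bool pkvm_initialized)
{
    if (par & PAR_F_BIT)
        return FAULT_CHECK_FURTHER;

    if (is_pkvm_stage2_abort(pkvm_initialized, esr))
        return FAULT_PKVM_RECLAIM;

    return FAULT_SPURIOUS;
}
```

In the diff, the reclaim case masks PAR_EL1 down to the physical address and passes it to pkvm_force_reclaim_guest_page(), whose return value then decides whether the fault is treated as spurious.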

+480

arch/arm64/tools/sysreg

··· 3243 3243 EndEnum 3244 3244 EndSysreg 3245 3245 3246 + Sysreg ICC_HPPIR_EL1 3 0 12 10 3 3247 + Res0 63:33 3248 + Field 32 HPPIV 3249 + Field 31:29 TYPE 3250 + Res0 28:24 3251 + Field 23:0 ID 3252 + EndSysreg 3253 + 3246 3254 Sysreg ICC_ICSR_EL1 3 0 12 10 4 3247 3255 Res0 63:48 3248 3256 Field 47:32 IAFFID ··· 3263 3255 Field 2 Pending 3264 3256 Field 1 Enabled 3265 3257 Field 0 F 3258 + EndSysreg 3259 + 3260 + Sysreg ICC_IAFFIDR_EL1 3 0 12 10 5 3261 + Res0 63:16 3262 + Field 15:0 IAFFID 3266 3263 EndSysreg 3267 3264 3268 3265 SysregFields ICC_PPI_ENABLERx_EL1 ··· 3674 3661 Field 15 SMPS 3675 3662 Res0 14:12 3676 3663 Field 11:0 AFFINITY 3664 + EndSysreg 3665 + 3666 + Sysreg ICC_APR_EL1 3 1 12 0 0 3667 + Res0 63:32 3668 + Field 31 P31 3669 + Field 30 P30 3670 + Field 29 P29 3671 + Field 28 P28 3672 + Field 27 P27 3673 + Field 26 P26 3674 + Field 25 P25 3675 + Field 24 P24 3676 + Field 23 P23 3677 + Field 22 P22 3678 + Field 21 P21 3679 + Field 20 P20 3680 + Field 19 P19 3681 + Field 18 P18 3682 + Field 17 P17 3683 + Field 16 P16 3684 + Field 15 P15 3685 + Field 14 P14 3686 + Field 13 P13 3687 + Field 12 P12 3688 + Field 11 P11 3689 + Field 10 P10 3690 + Field 9 P9 3691 + Field 8 P8 3692 + Field 7 P7 3693 + Field 6 P6 3694 + Field 5 P5 3695 + Field 4 P4 3696 + Field 3 P3 3697 + Field 2 P2 3698 + Field 1 P1 3699 + Field 0 P0 3677 3700 EndSysreg 3678 3701 3679 3702 Sysreg ICC_CR0_EL1 3 1 12 0 1 ··· 4736 4687 Field 15:0 PhyPARTID28 4737 4688 EndSysreg 4738 4689 4690 + Sysreg ICH_APR_EL2 3 4 12 8 4 4691 + Res0 63:32 4692 + Field 31 P31 4693 + Field 30 P30 4694 + Field 29 P29 4695 + Field 28 P28 4696 + Field 27 P27 4697 + Field 26 P26 4698 + Field 25 P25 4699 + Field 24 P24 4700 + Field 23 P23 4701 + Field 22 P22 4702 + Field 21 P21 4703 + Field 20 P20 4704 + Field 19 P19 4705 + Field 18 P18 4706 + Field 17 P17 4707 + Field 16 P16 4708 + Field 15 P15 4709 + Field 14 P14 4710 + Field 13 P13 4711 + Field 12 P12 4712 + Field 11 P11 4713 + Field 10 P10 4714 + Field 9 P9 
4715 + Field 8 P8 4716 + Field 7 P7 4717 + Field 6 P6 4718 + Field 5 P5 4719 + Field 4 P4 4720 + Field 3 P3 4721 + Field 2 P2 4722 + Field 1 P1 4723 + Field 0 P0 4724 + EndSysreg 4725 + 4739 4726 Sysreg ICH_HFGRTR_EL2 3 4 12 9 4 4740 4727 Res0 63:21 4741 4728 Field 20 ICC_PPI_ACTIVERn_EL1 ··· 4818 4733 Field 2 GICCDPRI 4819 4734 Field 1 GICCDDIS 4820 4735 Field 0 GICCDEN 4736 + EndSysreg 4737 + 4738 + SysregFields ICH_PPI_DVIRx_EL2 4739 + Field 63 DVI63 4740 + Field 62 DVI62 4741 + Field 61 DVI61 4742 + Field 60 DVI60 4743 + Field 59 DVI59 4744 + Field 58 DVI58 4745 + Field 57 DVI57 4746 + Field 56 DVI56 4747 + Field 55 DVI55 4748 + Field 54 DVI54 4749 + Field 53 DVI53 4750 + Field 52 DVI52 4751 + Field 51 DVI51 4752 + Field 50 DVI50 4753 + Field 49 DVI49 4754 + Field 48 DVI48 4755 + Field 47 DVI47 4756 + Field 46 DVI46 4757 + Field 45 DVI45 4758 + Field 44 DVI44 4759 + Field 43 DVI43 4760 + Field 42 DVI42 4761 + Field 41 DVI41 4762 + Field 40 DVI40 4763 + Field 39 DVI39 4764 + Field 38 DVI38 4765 + Field 37 DVI37 4766 + Field 36 DVI36 4767 + Field 35 DVI35 4768 + Field 34 DVI34 4769 + Field 33 DVI33 4770 + Field 32 DVI32 4771 + Field 31 DVI31 4772 + Field 30 DVI30 4773 + Field 29 DVI29 4774 + Field 28 DVI28 4775 + Field 27 DVI27 4776 + Field 26 DVI26 4777 + Field 25 DVI25 4778 + Field 24 DVI24 4779 + Field 23 DVI23 4780 + Field 22 DVI22 4781 + Field 21 DVI21 4782 + Field 20 DVI20 4783 + Field 19 DVI19 4784 + Field 18 DVI18 4785 + Field 17 DVI17 4786 + Field 16 DVI16 4787 + Field 15 DVI15 4788 + Field 14 DVI14 4789 + Field 13 DVI13 4790 + Field 12 DVI12 4791 + Field 11 DVI11 4792 + Field 10 DVI10 4793 + Field 9 DVI9 4794 + Field 8 DVI8 4795 + Field 7 DVI7 4796 + Field 6 DVI6 4797 + Field 5 DVI5 4798 + Field 4 DVI4 4799 + Field 3 DVI3 4800 + Field 2 DVI2 4801 + Field 1 DVI1 4802 + Field 0 DVI0 4803 + EndSysregFields 4804 + 4805 + Sysreg ICH_PPI_DVIR0_EL2 3 4 12 10 0 4806 + Fields ICH_PPI_DVIRx_EL2 4807 + EndSysreg 4808 + 4809 + Sysreg ICH_PPI_DVIR1_EL2 3 4 12 10 1 
4810 + Fields ICH_PPI_DVIRx_EL2 4811 + EndSysreg 4812 + 4813 + SysregFields ICH_PPI_ENABLERx_EL2 4814 + Field 63 EN63 4815 + Field 62 EN62 4816 + Field 61 EN61 4817 + Field 60 EN60 4818 + Field 59 EN59 4819 + Field 58 EN58 4820 + Field 57 EN57 4821 + Field 56 EN56 4822 + Field 55 EN55 4823 + Field 54 EN54 4824 + Field 53 EN53 4825 + Field 52 EN52 4826 + Field 51 EN51 4827 + Field 50 EN50 4828 + Field 49 EN49 4829 + Field 48 EN48 4830 + Field 47 EN47 4831 + Field 46 EN46 4832 + Field 45 EN45 4833 + Field 44 EN44 4834 + Field 43 EN43 4835 + Field 42 EN42 4836 + Field 41 EN41 4837 + Field 40 EN40 4838 + Field 39 EN39 4839 + Field 38 EN38 4840 + Field 37 EN37 4841 + Field 36 EN36 4842 + Field 35 EN35 4843 + Field 34 EN34 4844 + Field 33 EN33 4845 + Field 32 EN32 4846 + Field 31 EN31 4847 + Field 30 EN30 4848 + Field 29 EN29 4849 + Field 28 EN28 4850 + Field 27 EN27 4851 + Field 26 EN26 4852 + Field 25 EN25 4853 + Field 24 EN24 4854 + Field 23 EN23 4855 + Field 22 EN22 4856 + Field 21 EN21 4857 + Field 20 EN20 4858 + Field 19 EN19 4859 + Field 18 EN18 4860 + Field 17 EN17 4861 + Field 16 EN16 4862 + Field 15 EN15 4863 + Field 14 EN14 4864 + Field 13 EN13 4865 + Field 12 EN12 4866 + Field 11 EN11 4867 + Field 10 EN10 4868 + Field 9 EN9 4869 + Field 8 EN8 4870 + Field 7 EN7 4871 + Field 6 EN6 4872 + Field 5 EN5 4873 + Field 4 EN4 4874 + Field 3 EN3 4875 + Field 2 EN2 4876 + Field 1 EN1 4877 + Field 0 EN0 4878 + EndSysregFields 4879 + 4880 + Sysreg ICH_PPI_ENABLER0_EL2 3 4 12 10 2 4881 + Fields ICH_PPI_ENABLERx_EL2 4882 + EndSysreg 4883 + 4884 + Sysreg ICH_PPI_ENABLER1_EL2 3 4 12 10 3 4885 + Fields ICH_PPI_ENABLERx_EL2 4886 + EndSysreg 4887 + 4888 + SysregFields ICH_PPI_PENDRx_EL2 4889 + Field 63 PEND63 4890 + Field 62 PEND62 4891 + Field 61 PEND61 4892 + Field 60 PEND60 4893 + Field 59 PEND59 4894 + Field 58 PEND58 4895 + Field 57 PEND57 4896 + Field 56 PEND56 4897 + Field 55 PEND55 4898 + Field 54 PEND54 4899 + Field 53 PEND53 4900 + Field 52 PEND52 4901 + Field 51 
PEND51 4902 + Field 50 PEND50 4903 + Field 49 PEND49 4904 + Field 48 PEND48 4905 + Field 47 PEND47 4906 + Field 46 PEND46 4907 + Field 45 PEND45 4908 + Field 44 PEND44 4909 + Field 43 PEND43 4910 + Field 42 PEND42 4911 + Field 41 PEND41 4912 + Field 40 PEND40 4913 + Field 39 PEND39 4914 + Field 38 PEND38 4915 + Field 37 PEND37 4916 + Field 36 PEND36 4917 + Field 35 PEND35 4918 + Field 34 PEND34 4919 + Field 33 PEND33 4920 + Field 32 PEND32 4921 + Field 31 PEND31 4922 + Field 30 PEND30 4923 + Field 29 PEND29 4924 + Field 28 PEND28 4925 + Field 27 PEND27 4926 + Field 26 PEND26 4927 + Field 25 PEND25 4928 + Field 24 PEND24 4929 + Field 23 PEND23 4930 + Field 22 PEND22 4931 + Field 21 PEND21 4932 + Field 20 PEND20 4933 + Field 19 PEND19 4934 + Field 18 PEND18 4935 + Field 17 PEND17 4936 + Field 16 PEND16 4937 + Field 15 PEND15 4938 + Field 14 PEND14 4939 + Field 13 PEND13 4940 + Field 12 PEND12 4941 + Field 11 PEND11 4942 + Field 10 PEND10 4943 + Field 9 PEND9 4944 + Field 8 PEND8 4945 + Field 7 PEND7 4946 + Field 6 PEND6 4947 + Field 5 PEND5 4948 + Field 4 PEND4 4949 + Field 3 PEND3 4950 + Field 2 PEND2 4951 + Field 1 PEND1 4952 + Field 0 PEND0 4953 + EndSysregFields 4954 + 4955 + Sysreg ICH_PPI_PENDR0_EL2 3 4 12 10 4 4956 + Fields ICH_PPI_PENDRx_EL2 4957 + EndSysreg 4958 + 4959 + Sysreg ICH_PPI_PENDR1_EL2 3 4 12 10 5 4960 + Fields ICH_PPI_PENDRx_EL2 4961 + EndSysreg 4962 + 4963 + SysregFields ICH_PPI_ACTIVERx_EL2 4964 + Field 63 ACTIVE63 4965 + Field 62 ACTIVE62 4966 + Field 61 ACTIVE61 4967 + Field 60 ACTIVE60 4968 + Field 59 ACTIVE59 4969 + Field 58 ACTIVE58 4970 + Field 57 ACTIVE57 4971 + Field 56 ACTIVE56 4972 + Field 55 ACTIVE55 4973 + Field 54 ACTIVE54 4974 + Field 53 ACTIVE53 4975 + Field 52 ACTIVE52 4976 + Field 51 ACTIVE51 4977 + Field 50 ACTIVE50 4978 + Field 49 ACTIVE49 4979 + Field 48 ACTIVE48 4980 + Field 47 ACTIVE47 4981 + Field 46 ACTIVE46 4982 + Field 45 ACTIVE45 4983 + Field 44 ACTIVE44 4984 + Field 43 ACTIVE43 4985 + Field 42 ACTIVE42 4986 + Field 
41 ACTIVE41 4987 + Field 40 ACTIVE40 4988 + Field 39 ACTIVE39 4989 + Field 38 ACTIVE38 4990 + Field 37 ACTIVE37 4991 + Field 36 ACTIVE36 4992 + Field 35 ACTIVE35 4993 + Field 34 ACTIVE34 4994 + Field 33 ACTIVE33 4995 + Field 32 ACTIVE32 4996 + Field 31 ACTIVE31 4997 + Field 30 ACTIVE30 4998 + Field 29 ACTIVE29 4999 + Field 28 ACTIVE28 5000 + Field 27 ACTIVE27 5001 + Field 26 ACTIVE26 5002 + Field 25 ACTIVE25 5003 + Field 24 ACTIVE24 5004 + Field 23 ACTIVE23 5005 + Field 22 ACTIVE22 5006 + Field 21 ACTIVE21 5007 + Field 20 ACTIVE20 5008 + Field 19 ACTIVE19 5009 + Field 18 ACTIVE18 5010 + Field 17 ACTIVE17 5011 + Field 16 ACTIVE16 5012 + Field 15 ACTIVE15 5013 + Field 14 ACTIVE14 5014 + Field 13 ACTIVE13 5015 + Field 12 ACTIVE12 5016 + Field 11 ACTIVE11 5017 + Field 10 ACTIVE10 5018 + Field 9 ACTIVE9 5019 + Field 8 ACTIVE8 5020 + Field 7 ACTIVE7 5021 + Field 6 ACTIVE6 5022 + Field 5 ACTIVE5 5023 + Field 4 ACTIVE4 5024 + Field 3 ACTIVE3 5025 + Field 2 ACTIVE2 5026 + Field 1 ACTIVE1 5027 + Field 0 ACTIVE0 5028 + EndSysregFields 5029 + 5030 + Sysreg ICH_PPI_ACTIVER0_EL2 3 4 12 10 6 5031 + Fields ICH_PPI_ACTIVERx_EL2 5032 + EndSysreg 5033 + 5034 + Sysreg ICH_PPI_ACTIVER1_EL2 3 4 12 10 7 5035 + Fields ICH_PPI_ACTIVERx_EL2 4821 5036 EndSysreg 4822 5037 4823 5038 Sysreg ICH_HCR_EL2 3 4 12 11 0 ··· 5174 4789 Field 0 En 5175 4790 EndSysreg 5176 4791 4792 + Sysreg ICH_CONTEXTR_EL2 3 4 12 11 6 4793 + Field 63 V 4794 + Field 62 F 4795 + Field 61 IRICHPPIDIS 4796 + Field 60 DB 4797 + Field 59:55 DBPM 4798 + Res0 54:48 4799 + Field 47:32 VPE 4800 + Res0 31:16 4801 + Field 15:0 VM 4802 + EndSysreg 4803 + 5177 4804 Sysreg ICH_VMCR_EL2 3 4 12 11 7 5178 4805 Prefix FEAT_GCIE 5179 4806 Res0 63:32 ··· 5205 4808 Field 2 VAckCtl 5206 4809 Field 1 VENG1 5207 4810 Field 0 VENG0 4811 + EndSysreg 4812 + 4813 + SysregFields ICH_PPI_PRIORITYRx_EL2 4814 + Res0 63:61 4815 + Field 60:56 Priority7 4816 + Res0 55:53 4817 + Field 52:48 Priority6 4818 + Res0 47:45 4819 + Field 44:40 Priority5 4820 + 
Res0 39:37 4821 + Field 36:32 Priority4 4822 + Res0 31:29 4823 + Field 28:24 Priority3 4824 + Res0 23:21 4825 + Field 20:16 Priority2 4826 + Res0 15:13 4827 + Field 12:8 Priority1 4828 + Res0 7:5 4829 + Field 4:0 Priority0 4830 + EndSysregFields 4831 + 4832 + Sysreg ICH_PPI_PRIORITYR0_EL2 3 4 12 14 0 4833 + Fields ICH_PPI_PRIORITYRx_EL2 4834 + EndSysreg 4835 + 4836 + Sysreg ICH_PPI_PRIORITYR1_EL2 3 4 12 14 1 4837 + Fields ICH_PPI_PRIORITYRx_EL2 4838 + EndSysreg 4839 + 4840 + Sysreg ICH_PPI_PRIORITYR2_EL2 3 4 12 14 2 4841 + Fields ICH_PPI_PRIORITYRx_EL2 4842 + EndSysreg 4843 + 4844 + Sysreg ICH_PPI_PRIORITYR3_EL2 3 4 12 14 3 4845 + Fields ICH_PPI_PRIORITYRx_EL2 4846 + EndSysreg 4847 + 4848 + Sysreg ICH_PPI_PRIORITYR4_EL2 3 4 12 14 4 4849 + Fields ICH_PPI_PRIORITYRx_EL2 4850 + EndSysreg 4851 + 4852 + Sysreg ICH_PPI_PRIORITYR5_EL2 3 4 12 14 5 4853 + Fields ICH_PPI_PRIORITYRx_EL2 4854 + EndSysreg 4855 + 4856 + Sysreg ICH_PPI_PRIORITYR6_EL2 3 4 12 14 6 4857 + Fields ICH_PPI_PRIORITYRx_EL2 4858 + EndSysreg 4859 + 4860 + Sysreg ICH_PPI_PRIORITYR7_EL2 3 4 12 14 7 4861 + Fields ICH_PPI_PRIORITYRx_EL2 4862 + EndSysreg 4863 + 4864 + Sysreg ICH_PPI_PRIORITYR8_EL2 3 4 12 15 0 4865 + Fields ICH_PPI_PRIORITYRx_EL2 4866 + EndSysreg 4867 + 4868 + Sysreg ICH_PPI_PRIORITYR9_EL2 3 4 12 15 1 4869 + Fields ICH_PPI_PRIORITYRx_EL2 4870 + EndSysreg 4871 + 4872 + Sysreg ICH_PPI_PRIORITYR10_EL2 3 4 12 15 2 4873 + Fields ICH_PPI_PRIORITYRx_EL2 4874 + EndSysreg 4875 + 4876 + Sysreg ICH_PPI_PRIORITYR11_EL2 3 4 12 15 3 4877 + Fields ICH_PPI_PRIORITYRx_EL2 4878 + EndSysreg 4879 + 4880 + Sysreg ICH_PPI_PRIORITYR12_EL2 3 4 12 15 4 4881 + Fields ICH_PPI_PRIORITYRx_EL2 4882 + EndSysreg 4883 + 4884 + Sysreg ICH_PPI_PRIORITYR13_EL2 3 4 12 15 5 4885 + Fields ICH_PPI_PRIORITYRx_EL2 4886 + EndSysreg 4887 + 4888 + Sysreg ICH_PPI_PRIORITYR14_EL2 3 4 12 15 6 4889 + Fields ICH_PPI_PRIORITYRx_EL2 4890 + EndSysreg 4891 + 4892 + Sysreg ICH_PPI_PRIORITYR15_EL2 3 4 12 15 7 4893 + Fields ICH_PPI_PRIORITYRx_EL2 5208 
4894 EndSysreg 5209 4895 5210 4896 Sysreg CONTEXTIDR_EL2 3 4 13 0 1
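The `ICH_PPI_PRIORITYRx_EL2` layout above packs eight 5-bit priorities into each 64-bit register, one per byte, with the top three bits of every byte reserved. A minimal userspace sketch of the indexing this implies follows. The `sketch_` helper names are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

#define SKETCH_PRIO_PER_REG 8        /* Priority0..Priority7 in one register */
#define SKETCH_PRIO_MASK    0x1fULL  /* bits [4:0] of each byte; [7:5] are Res0 */

/* Read the priority of PPI number `ppi` from an array of PRIORITYR values. */
static unsigned int sketch_ppi_prio_get(const uint64_t *regs, unsigned int ppi)
{
	unsigned int shift = (ppi % SKETCH_PRIO_PER_REG) * 8;

	return (unsigned int)((regs[ppi / SKETCH_PRIO_PER_REG] >> shift) &
			      SKETCH_PRIO_MASK);
}

/* Write a priority, discarding any bits that would land in a Res0 field. */
static void sketch_ppi_prio_set(uint64_t *regs, unsigned int ppi,
				unsigned int prio)
{
	unsigned int shift = (ppi % SKETCH_PRIO_PER_REG) * 8;
	uint64_t *reg = &regs[ppi / SKETCH_PRIO_PER_REG];

	*reg &= ~(SKETCH_PRIO_MASK << shift);
	*reg |= ((uint64_t)prio & SKETCH_PRIO_MASK) << shift;
}
```

With sixteen registers (`PRIORITYR0` to `PRIORITYR15`) this addressing covers 128 PPIs. PPI 23, for instance, lands in register 2 at bits 60:56.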

+18

drivers/irqchip/irq-gic-v5.c

··· 511 511 return !!(read_ppi_sysreg_s(hwirq, PPI_HM) & bit); 512 512 } 513 513 514 + static int gicv5_ppi_irq_set_type(struct irq_data *d, unsigned int type) 515 + { 516 + /* 517 + * GICv5's PPIs do not have a configurable trigger or handling 518 + * mode. Check that the attempt to set a type matches what the 519 + * hardware reports in the HMR, and error on a mismatch. 520 + */ 521 + 522 + if (type & IRQ_TYPE_EDGE_BOTH && gicv5_ppi_irq_is_level(d->hwirq)) 523 + return -EINVAL; 524 + 525 + if (type & IRQ_TYPE_LEVEL_MASK && !gicv5_ppi_irq_is_level(d->hwirq)) 526 + return -EINVAL; 527 + 528 + return 0; 529 + } 530 + 514 531 static int gicv5_ppi_irq_set_vcpu_affinity(struct irq_data *d, void *vcpu) 515 532 { 516 533 if (vcpu) ··· 543 526 .irq_mask = gicv5_ppi_irq_mask, 544 527 .irq_unmask = gicv5_ppi_irq_unmask, 545 528 .irq_eoi = gicv5_ppi_irq_eoi, 529 + .irq_set_type = gicv5_ppi_irq_set_type, 546 530 .irq_get_irqchip_state = gicv5_ppi_irq_get_irqchip_state, 547 531 .irq_set_irqchip_state = gicv5_ppi_irq_set_irqchip_state, 548 532 .irq_set_vcpu_affinity = gicv5_ppi_irq_set_vcpu_affinity,
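The check in `gicv5_ppi_irq_set_type()` can be exercised outside the kernel. The sketch below restates the same two conditions as a plain function. The `SK_IRQ_TYPE_*` values are copied from my reading of `include/linux/irq.h`, and the `hw_is_level` parameter stands in for `gicv5_ppi_irq_is_level()`. Both are assumptions made so that the example builds in userspace.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's IRQ_TYPE_* trigger flags. */
#define SK_IRQ_TYPE_EDGE_RISING  0x1
#define SK_IRQ_TYPE_EDGE_FALLING 0x2
#define SK_IRQ_TYPE_EDGE_BOTH    0x3
#define SK_IRQ_TYPE_LEVEL_HIGH   0x4
#define SK_IRQ_TYPE_LEVEL_LOW    0x8
#define SK_IRQ_TYPE_LEVEL_MASK   0xc

/*
 * The trigger mode is fixed by hardware, so a request is only accepted
 * when it agrees with what the hardware reports.
 */
static int sketch_ppi_check_type(unsigned int type, bool hw_is_level)
{
	/* Edge requested, but the PPI is level-sensitive. */
	if ((type & SK_IRQ_TYPE_EDGE_BOTH) && hw_is_level)
		return -EINVAL;

	/* Level requested, but the PPI is edge-triggered. */
	if ((type & SK_IRQ_TYPE_LEVEL_MASK) && !hw_is_level)
		return -EINVAL;

	return 0;
}
```

A type of 0 (no trigger bits set) passes both tests, so callers that do not specify a trigger are not rejected.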

+1 -1

drivers/virt/coco/pkvm-guest/Kconfig

··· 1 1 config ARM_PKVM_GUEST 2 2 bool "Arm pKVM protected guest driver" 3 - depends on ARM64 3 + depends on ARM64 && DMA_RESTRICTED_POOL 4 4 help 5 5 Protected guests running under the pKVM hypervisor on arm64 6 6 are isolated from the host and must issue hypercalls to enable

+1

fs/tracefs/inode.c

··· 664 664 fsnotify_create(d_inode(dentry->d_parent), dentry); 665 665 return tracefs_end_creating(dentry); 666 666 } 667 + EXPORT_SYMBOL_GPL(tracefs_create_file); 667 668 668 669 static struct dentry *__create_dir(const char *name, struct dentry *parent, 669 670 const struct inode_operations *ops)

+7 -1

include/kvm/arm_arch_timer.h

··· 10 10 #include <linux/clocksource.h> 11 11 #include <linux/hrtimer.h> 12 12 13 + #include <linux/irqchip/arm-gic-v5.h> 14 + 13 15 enum kvm_arch_timers { 14 16 TIMER_PTIMER, 15 17 TIMER_VTIMER, ··· 49 47 u64 poffset; 50 48 51 49 /* The PPI for each timer, global to the VM */ 52 - u8 ppi[NR_KVM_TIMERS]; 50 + u32 ppi[NR_KVM_TIMERS]; 53 51 }; 54 52 55 53 struct arch_timer_context { ··· 131 129 #define timer_context_to_vcpu(ctx) container_of((ctx), struct kvm_vcpu, arch.timer_cpu.timers[(ctx)->timer_id]) 132 130 #define timer_vm_data(ctx) (&(timer_context_to_vcpu(ctx)->kvm->arch.timer_data)) 133 131 #define timer_irq(ctx) (timer_vm_data(ctx)->ppi[arch_timer_ctx_index(ctx)]) 132 + 133 + #define get_vgic_ppi(k, i) (((k)->arch.vgic.vgic_model != KVM_DEV_TYPE_ARM_VGIC_V5) ? \ 134 + (i) : (FIELD_PREP(GICV5_HWIRQ_ID, i) | \ 135 + FIELD_PREP(GICV5_HWIRQ_TYPE, GICV5_HWIRQ_TYPE_PPI))) 134 136 135 137 u64 kvm_arm_timer_read_sysreg(struct kvm_vcpu *vcpu, 136 138 enum kvm_arch_timers tmr,

+4 -1

include/kvm/arm_pmu.h

··· 12 12 13 13 #define KVM_ARMV8_PMU_MAX_COUNTERS 32 14 14 15 + /* PPI #23 - architecturally specified for GICv5 */ 16 + #define KVM_ARMV8_PMU_GICV5_IRQ 0x20000017 17 + 15 18 #if IS_ENABLED(CONFIG_HW_PERF_EVENTS) && IS_ENABLED(CONFIG_KVM) 16 19 struct kvm_pmc { 17 20 u8 idx; /* index into the pmu->pmc array */ ··· 41 38 }; 42 39 43 40 bool kvm_supports_guest_pmuv3(void); 44 - #define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS) 41 + #define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num != 0) 45 42 u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx); 46 43 void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val); 47 44 void kvm_pmu_set_counter_value_user(struct kvm_vcpu *vcpu, u64 select_idx, u64 val);
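The constant `0x20000017` is PPI 23 (`0x17`) with the interrupt type folded into the upper bits, which is also why the timer `ppi[]` array grows from `u8` to `u32` in this series. The sketch below reproduces that encoding in userspace. It assumes the type field occupies bits 31:29 and the ID bits 23:0 (matching the `TYPE` and `ID` fields of `ICC_HPPIR_EL1`), and that the PPI type value is `0x1`. These assumptions are consistent with the constant but are not spelled out in this hunk, and the `sketch_` names are hypothetical.

```c
#include <stdint.h>

#define SK_HWIRQ_ID_MASK    0x00ffffffU  /* assumed: ID in bits [23:0] */
#define SK_HWIRQ_TYPE_SHIFT 29           /* assumed: type in bits [31:29] */
#define SK_HWIRQ_TYPE_MASK  0x7U
#define SK_HWIRQ_TYPE_PPI   0x1U         /* assumed value for PPIs */

/* Combine a type and an ID into a GICv5-style 32-bit interrupt number. */
static uint32_t sketch_make_intid(uint32_t type, uint32_t id)
{
	return ((type & SK_HWIRQ_TYPE_MASK) << SK_HWIRQ_TYPE_SHIFT) |
	       (id & SK_HWIRQ_ID_MASK);
}

/* Extract the type field. */
static uint32_t sketch_intid_type(uint32_t intid)
{
	return (intid >> SK_HWIRQ_TYPE_SHIFT) & SK_HWIRQ_TYPE_MASK;
}

/* Extract the ID field. */
static uint32_t sketch_intid_id(uint32_t intid)
{
	return intid & SK_HWIRQ_ID_MASK;
}
```

Under these assumptions `sketch_make_intid(SK_HWIRQ_TYPE_PPI, 23)` yields the same value as `KVM_ARMV8_PMU_GICV5_IRQ`.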

+185 -6

include/kvm/arm_vgic.h

··· 19 19 #include <linux/jump_label.h> 20 20 21 21 #include <linux/irqchip/arm-gic-v4.h> 22 + #include <linux/irqchip/arm-gic-v5.h> 22 23 24 + #define VGIC_V5_MAX_CPUS 512 23 25 #define VGIC_V3_MAX_CPUS 512 24 26 #define VGIC_V2_MAX_CPUS 8 25 27 #define VGIC_NR_IRQS_LEGACY 256 ··· 33 31 #define VGIC_MIN_LPI 8192 34 32 #define KVM_IRQCHIP_NUM_PINS (1020 - 32) 35 33 36 - #define irq_is_ppi(irq) ((irq) >= VGIC_NR_SGIS && (irq) < VGIC_NR_PRIVATE_IRQS) 37 - #define irq_is_spi(irq) ((irq) >= VGIC_NR_PRIVATE_IRQS && \ 38 - (irq) <= VGIC_MAX_SPI) 34 + /* 35 + * GICv5 supports 128 PPIs, but only the first 64 are architected. We only 36 + * support the timers and PMU in KVM, both of which are architected. Rather than 37 + * handling twice the state, we instead opt to only support the architected set 38 + * in KVM for now. At a future stage, this can be bumped up to 128, if required. 39 + */ 40 + #define VGIC_V5_NR_PRIVATE_IRQS 64 41 + 42 + #define is_v5_type(t, i) (FIELD_GET(GICV5_HWIRQ_TYPE, (i)) == (t)) 43 + 44 + #define __irq_is_sgi(t, i) \ 45 + ({ \ 46 + bool __ret; \ 47 + \ 48 + switch (t) { \ 49 + case KVM_DEV_TYPE_ARM_VGIC_V5: \ 50 + __ret = false; \ 51 + break; \ 52 + default: \ 53 + __ret = (i) < VGIC_NR_SGIS; \ 54 + } \ 55 + \ 56 + __ret; \ 57 + }) 58 + 59 + #define __irq_is_ppi(t, i) \ 60 + ({ \ 61 + bool __ret; \ 62 + \ 63 + switch (t) { \ 64 + case KVM_DEV_TYPE_ARM_VGIC_V5: \ 65 + __ret = is_v5_type(GICV5_HWIRQ_TYPE_PPI, (i)); \ 66 + break; \ 67 + default: \ 68 + __ret = (i) >= VGIC_NR_SGIS; \ 69 + __ret &= (i) < VGIC_NR_PRIVATE_IRQS; \ 70 + } \ 71 + \ 72 + __ret; \ 73 + }) 74 + 75 + #define __irq_is_spi(t, i) \ 76 + ({ \ 77 + bool __ret; \ 78 + \ 79 + switch (t) { \ 80 + case KVM_DEV_TYPE_ARM_VGIC_V5: \ 81 + __ret = is_v5_type(GICV5_HWIRQ_TYPE_SPI, (i)); \ 82 + break; \ 83 + default: \ 84 + __ret = (i) <= VGIC_MAX_SPI; \ 85 + __ret &= (i) >= VGIC_NR_PRIVATE_IRQS; \ 86 + } \ 87 + \ 88 + __ret; \ 89 + }) 90 + 91 + #define __irq_is_lpi(t, i) \ 92 + ({ \ 93 + bool 
__ret; \ 94 + \ 95 + switch (t) { \ 96 + case KVM_DEV_TYPE_ARM_VGIC_V5: \ 97 + __ret = is_v5_type(GICV5_HWIRQ_TYPE_LPI, (i)); \ 98 + break; \ 99 + default: \ 100 + __ret = (i) >= 8192; \ 101 + } \ 102 + \ 103 + __ret; \ 104 + }) 105 + 106 + #define irq_is_sgi(k, i) __irq_is_sgi((k)->arch.vgic.vgic_model, i) 107 + #define irq_is_ppi(k, i) __irq_is_ppi((k)->arch.vgic.vgic_model, i) 108 + #define irq_is_spi(k, i) __irq_is_spi((k)->arch.vgic.vgic_model, i) 109 + #define irq_is_lpi(k, i) __irq_is_lpi((k)->arch.vgic.vgic_model, i) 110 + 111 + #define irq_is_private(k, i) (irq_is_ppi(k, i) || irq_is_sgi(k, i)) 112 + 113 + #define vgic_v5_get_hwirq_id(x) FIELD_GET(GICV5_HWIRQ_ID, (x)) 114 + #define vgic_v5_set_hwirq_id(x) FIELD_PREP(GICV5_HWIRQ_ID, (x)) 115 + 116 + #define __vgic_v5_set_type(t) (FIELD_PREP(GICV5_HWIRQ_TYPE, GICV5_HWIRQ_TYPE_##t)) 117 + #define vgic_v5_make_ppi(x) (__vgic_v5_set_type(PPI) | vgic_v5_set_hwirq_id(x)) 118 + #define vgic_v5_make_spi(x) (__vgic_v5_set_type(SPI) | vgic_v5_set_hwirq_id(x)) 119 + #define vgic_v5_make_lpi(x) (__vgic_v5_set_type(LPI) | vgic_v5_set_hwirq_id(x)) 120 + 121 + #define __vgic_is_v(k, v) ((k)->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V##v) 122 + #define vgic_is_v3(k) (__vgic_is_v(k, 3)) 123 + #define vgic_is_v5(k) (__vgic_is_v(k, 5)) 39 124 40 125 enum vgic_type { 41 126 VGIC_V2, /* Good ol' GICv2 */ ··· 190 101 VGIC_CONFIG_LEVEL 191 102 }; 192 103 104 + struct vgic_irq; 105 + 193 106 /* 194 107 * Per-irq ops overriding some common behavious. 195 108 * ··· 210 119 * peaking into the physical GIC. 211 120 */ 212 121 bool (*get_input_level)(int vintid); 122 + 123 + /* 124 + * Function pointer to override the queuing of an IRQ. 125 + */ 126 + bool (*queue_irq_unlock)(struct kvm *kvm, struct vgic_irq *irq, 127 + unsigned long flags) __releases(&irq->irq_lock); 128 + 129 + /* 130 + * Callback function pointer to either enable or disable direct 131 + * injection for a mapped interrupt. 
132 + */ 133 + void (*set_direct_injection)(struct kvm_vcpu *vcpu, 134 + struct vgic_irq *irq, bool direct); 213 135 }; 214 136 215 137 struct vgic_irq { ··· 342 238 struct list_head list; 343 239 }; 344 240 241 + struct vgic_v5_vm { 242 + /* 243 + * We only expose a subset of PPIs to the guest. This subset is a 244 + * combination of the PPIs that are actually implemented and what we 245 + * actually choose to expose. 246 + */ 247 + DECLARE_BITMAP(vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); 248 + 249 + /* A mask of the PPIs that are exposed for userspace to drive. */ 250 + DECLARE_BITMAP(userspace_ppis, VGIC_V5_NR_PRIVATE_IRQS); 251 + 252 + /* 253 + * The HMR itself is handled by the hardware, but we still need to have 254 + * a mask that we can use when merging in pending state (only the state 255 + * of Edge PPIs is merged back in from the guest and the HMR provides a 256 + * convenient way to do that). 257 + */ 258 + DECLARE_BITMAP(vgic_ppi_hmr, VGIC_V5_NR_PRIVATE_IRQS); 259 + }; 260 + 345 261 struct vgic_dist { 346 262 bool in_kernel; 347 263 bool ready; ··· 434 310 * else. 435 311 */ 436 312 struct its_vm its_vm; 313 + 314 + /* 315 + * GICv5 per-VM data. 316 + */ 317 + struct vgic_v5_vm gicv5_vm; 437 318 }; 438 319 439 320 struct vgic_v2_cpu_if { ··· 469 340 unsigned int used_lrs; 470 341 }; 471 342 343 + struct vgic_v5_cpu_if { 344 + u64 vgic_apr; 345 + u64 vgic_vmcr; 346 + 347 + /* PPI register state */ 348 + DECLARE_BITMAP(vgic_ppi_dvir, VGIC_V5_NR_PRIVATE_IRQS); 349 + DECLARE_BITMAP(vgic_ppi_activer, VGIC_V5_NR_PRIVATE_IRQS); 350 + DECLARE_BITMAP(vgic_ppi_enabler, VGIC_V5_NR_PRIVATE_IRQS); 351 + /* We have one byte (of which 5 bits are used) per PPI for priority */ 352 + u64 vgic_ppi_priorityr[VGIC_V5_NR_PRIVATE_IRQS / 8]; 353 + 354 + /* 355 + * The ICSR is re-used across host and guest, and hence it needs to be 356 + * saved/restored.
Only one copy is required as the host should block 357 + * preemption between executing GIC CDRCFG and accessing the 358 + * ICC_ICSR_EL1. A guest, of course, can never guarantee this, and hence 359 + * it is the hyp's responsibility to keep the state consistent. 360 + */ 361 + u64 vgic_icsr; 362 + 363 + struct gicv5_vpe gicv5_vpe; 364 + }; 365 + 366 + /* What PPI capabilities does a GICv5 host have */ 367 + struct vgic_v5_ppi_caps { 368 + DECLARE_BITMAP(impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS); 369 + }; 370 + 472 371 struct vgic_cpu { 473 372 /* CPU vif control registers for world switch */ 474 373 union { 475 374 struct vgic_v2_cpu_if vgic_v2; 476 375 struct vgic_v3_cpu_if vgic_v3; 376 + struct vgic_v5_cpu_if vgic_v5; 477 377 }; 478 378 479 379 struct vgic_irq *private_irqs; ··· 550 392 void kvm_vgic_destroy(struct kvm *kvm); 551 393 void kvm_vgic_vcpu_destroy(struct kvm_vcpu *vcpu); 552 394 int kvm_vgic_map_resources(struct kvm *kvm); 395 + void kvm_vgic_finalize_idregs(struct kvm *kvm); 553 396 int kvm_vgic_hyp_init(void); 554 397 void kvm_vgic_init_cpu_hardware(void); 555 398 556 399 int kvm_vgic_inject_irq(struct kvm *kvm, struct kvm_vcpu *vcpu, 557 400 unsigned int intid, bool level, void *owner); 401 + void kvm_vgic_set_irq_ops(struct kvm_vcpu *vcpu, u32 vintid, 402 + struct irq_ops *ops); 403 + void kvm_vgic_clear_irq_ops(struct kvm_vcpu *vcpu, u32 vintid); 558 404 int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, 559 - u32 vintid, struct irq_ops *ops); 405 + u32 vintid); 560 406 int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, unsigned int vintid); 561 407 int kvm_vgic_get_map(struct kvm_vcpu *vcpu, unsigned int vintid); 562 408 bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int vintid); ··· 576 414 577 415 #define irqchip_in_kernel(k) (!!((k)->arch.vgic.in_kernel)) 578 416 #define vgic_initialized(k) ((k)->arch.vgic.initialized) 579 - #define vgic_valid_spi(k, i) (((i) >= VGIC_NR_PRIVATE_IRQS) && \ 580 - ((i) <
(k)->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS)) 417 + #define vgic_valid_spi(k, i) \ 418 + ({ \ 419 + bool __ret = irq_is_spi(k, i); \ 420 + \ 421 + switch ((k)->arch.vgic.vgic_model) { \ 422 + case KVM_DEV_TYPE_ARM_VGIC_V5: \ 423 + __ret &= FIELD_GET(GICV5_HWIRQ_ID, i) < (k)->arch.vgic.nr_spis; \ 424 + break; \ 425 + default: \ 426 + __ret &= (i) < ((k)->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS); \ 427 + } \ 428 + \ 429 + __ret; \ 430 + }) 581 431 582 432 bool kvm_vcpu_has_pending_irqs(struct kvm_vcpu *vcpu); 583 433 void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu); ··· 628 454 int vgic_v4_load(struct kvm_vcpu *vcpu); 629 455 void vgic_v4_commit(struct kvm_vcpu *vcpu); 630 456 int vgic_v4_put(struct kvm_vcpu *vcpu); 457 + 458 + int vgic_v5_finalize_ppi_state(struct kvm *kvm); 459 + bool vgic_v5_ppi_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, 460 + unsigned long flags); 461 + void vgic_v5_set_ppi_dvi(struct kvm_vcpu *vcpu, struct vgic_irq *irq, bool dvi); 631 462 632 463 bool vgic_state_is_nested(struct kvm_vcpu *vcpu); 633 464
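The reworked `vgic_valid_spi()` treats the interrupt number differently per model: older GICs compare the raw number against `nr_spis` offset by the private range, while GICv5 first checks the type field and then compares only the ID. A hedged userspace sketch of that distinction follows. The legacy limits (32 private interrupts, SPIs up to 1019) are what I understand `VGIC_NR_PRIVATE_IRQS` and `VGIC_MAX_SPI` to be, the field positions are assumed as bits 31:29 and 23:0, and `is_v5` replaces the `vgic_model` lookup.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_NR_PRIVATE_IRQS 32    /* assumed VGIC_NR_PRIVATE_IRQS */
#define SK_MAX_SPI         1019  /* assumed VGIC_MAX_SPI */
#define SK_V5_TYPE_SPI     0x3U  /* GICV5_HWIRQ_TYPE_SPI */

#define SK_V5_TYPE(i) (((i) >> 29) & 0x7U)   /* assumed field position */
#define SK_V5_ID(i)   ((i) & 0x00ffffffU)    /* assumed field position */

/* Decide whether `intid` names an SPI that the VM actually has. */
static bool sketch_valid_spi(bool is_v5, uint32_t intid, uint32_t nr_spis)
{
	if (is_v5)
		return SK_V5_TYPE(intid) == SK_V5_TYPE_SPI &&
		       SK_V5_ID(intid) < nr_spis;

	/* Legacy models: SPIs start right after the 32 private interrupts. */
	return intid >= SK_NR_PRIVATE_IRQS && intid <= SK_MAX_SPI &&
	       intid < nr_spis + SK_NR_PRIVATE_IRQS;
}
```

With 64 SPIs configured, a legacy VM accepts interrupt numbers 32 to 95, whereas a GICv5 VM accepts SPI-typed numbers whose ID is 0 to 63.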

+27

include/linux/irqchip/arm-gic-v5.h

··· 25 25 #define GICV5_HWIRQ_TYPE_SPI UL(0x3) 26 26 27 27 /* 28 + * Architected PPIs 29 + */ 30 + #define GICV5_ARCH_PPI_S_DB_PPI 0x0 31 + #define GICV5_ARCH_PPI_RL_DB_PPI 0x1 32 + #define GICV5_ARCH_PPI_NS_DB_PPI 0x2 33 + #define GICV5_ARCH_PPI_SW_PPI 0x3 34 + #define GICV5_ARCH_PPI_HACDBSIRQ 0xf 35 + #define GICV5_ARCH_PPI_CNTHVS 0x13 36 + #define GICV5_ARCH_PPI_CNTHPS 0x14 37 + #define GICV5_ARCH_PPI_PMBIRQ 0x15 38 + #define GICV5_ARCH_PPI_COMMIRQ 0x16 39 + #define GICV5_ARCH_PPI_PMUIRQ 0x17 40 + #define GICV5_ARCH_PPI_CTIIRQ 0x18 41 + #define GICV5_ARCH_PPI_GICMNT 0x19 42 + #define GICV5_ARCH_PPI_CNTHP 0x1a 43 + #define GICV5_ARCH_PPI_CNTV 0x1b 44 + #define GICV5_ARCH_PPI_CNTHV 0x1c 45 + #define GICV5_ARCH_PPI_CNTPS 0x1d 46 + #define GICV5_ARCH_PPI_CNTP 0x1e 47 + #define GICV5_ARCH_PPI_TRBIRQ 0x1f 48 + 49 + /* 28 50 * Tables attributes 29 51 */ 30 52 #define GICV5_NO_READ_ALLOC 0b0 ··· 386 364 int gicv5_spi_irq_set_type(struct irq_data *d, unsigned int type); 387 365 int gicv5_irs_iste_alloc(u32 lpi); 388 366 void gicv5_irs_syncr(void); 367 + 368 + /* Embedded in kvm.arch */ 369 + struct gicv5_vpe { 370 + bool resident; 371 + }; 389 372 390 373 struct gicv5_its_devtab_cfg { 391 374 union {

+1

include/linux/kvm_host.h

··· 2366 2366 extern struct kvm_device_ops kvm_mpic_ops; 2367 2367 extern struct kvm_device_ops kvm_arm_vgic_v2_ops; 2368 2368 extern struct kvm_device_ops kvm_arm_vgic_v3_ops; 2369 + extern struct kvm_device_ops kvm_arm_vgic_v5_ops; 2369 2370 2370 2371 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT 2371 2372

+58

include/linux/ring_buffer.h

··· 251 251 void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu); 252 252 int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); 253 253 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); 254 + 255 + struct ring_buffer_desc { 256 + int cpu; 257 + unsigned int nr_page_va; /* excludes the meta page */ 258 + unsigned long meta_va; 259 + unsigned long page_va[] __counted_by(nr_page_va); 260 + }; 261 + 262 + struct trace_buffer_desc { 263 + int nr_cpus; 264 + size_t struct_len; 265 + char __data[]; /* list of ring_buffer_desc */ 266 + }; 267 + 268 + static inline struct ring_buffer_desc *__next_ring_buffer_desc(struct ring_buffer_desc *desc) 269 + { 270 + size_t len = struct_size(desc, page_va, desc->nr_page_va); 271 + 272 + return (struct ring_buffer_desc *)((void *)desc + len); 273 + } 274 + 275 + static inline struct ring_buffer_desc *__first_ring_buffer_desc(struct trace_buffer_desc *desc) 276 + { 277 + return (struct ring_buffer_desc *)(&desc->__data[0]); 278 + } 279 + 280 + static inline size_t trace_buffer_desc_size(size_t buffer_size, unsigned int nr_cpus) 281 + { 282 + unsigned int nr_pages = max(DIV_ROUND_UP(buffer_size, PAGE_SIZE), 2UL) + 1; 283 + struct ring_buffer_desc *rbdesc; 284 + 285 + return size_add(offsetof(struct trace_buffer_desc, __data), 286 + size_mul(nr_cpus, struct_size(rbdesc, page_va, nr_pages))); 287 + } 288 + 289 + #define for_each_ring_buffer_desc(__pdesc, __cpu, __trace_pdesc) \ 290 + for (__pdesc = __first_ring_buffer_desc(__trace_pdesc), __cpu = 0; \ 291 + (__cpu) < (__trace_pdesc)->nr_cpus; \ 292 + (__cpu)++, __pdesc = __next_ring_buffer_desc(__pdesc)) 293 + 294 + struct ring_buffer_remote { 295 + struct trace_buffer_desc *desc; 296 + int (*swap_reader_page)(unsigned int cpu, void *priv); 297 + int (*reset)(unsigned int cpu, void *priv); 298 + void *priv; 299 + }; 300 + 301 + int ring_buffer_poll_remote(struct trace_buffer *buffer, int cpu); 302 + 303 + struct trace_buffer * 304 + 
__ring_buffer_alloc_remote(struct ring_buffer_remote *remote, 305 + struct lock_class_key *key); 306 + 307 + #define ring_buffer_alloc_remote(remote) \ 308 + ({ \ 309 + static struct lock_class_key __key; \ 310 + __ring_buffer_alloc_remote(remote, &__key); \ 311 + }) 254 312 #endif /* _LINUX_RING_BUFFER_H */
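`struct trace_buffer_desc` is a header followed by a packed list of variable-length `struct ring_buffer_desc` entries, so both the total size and the step to the next entry depend on the page count. The sketch below mirrors that arithmetic in userspace. The `sketch_` types copy the field layout shown above, `SKETCH_PAGE_SIZE` is an assumed 4 KiB page, and the kernel's overflow-checked helpers (`struct_size()`, `size_add()`, `size_mul()`) are replaced by plain arithmetic.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

struct sketch_rb_desc {
	int cpu;
	unsigned int nr_page_va;     /* excludes the meta page */
	unsigned long meta_va;
	unsigned long page_va[];
};

struct sketch_tb_desc {
	int nr_cpus;
	size_t struct_len;
	char data[];                 /* list of sketch_rb_desc */
};

/* Size of one per-CPU descriptor holding nr_pages page addresses. */
static size_t sketch_rb_desc_len(size_t nr_pages)
{
	return sizeof(struct sketch_rb_desc) + nr_pages * sizeof(unsigned long);
}

/* Total descriptor size: at least two pages per CPU, plus one extra. */
static size_t sketch_tb_desc_size(size_t buffer_size, unsigned int nr_cpus)
{
	size_t nr_pages = (buffer_size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	if (nr_pages < 2)
		nr_pages = 2;
	nr_pages += 1;

	return offsetof(struct sketch_tb_desc, data) +
	       nr_cpus * sketch_rb_desc_len(nr_pages);
}

/* Step to the descriptor that follows `desc` in the packed list. */
static struct sketch_rb_desc *sketch_rb_desc_next(struct sketch_rb_desc *desc)
{
	return (struct sketch_rb_desc *)((char *)desc +
					 sketch_rb_desc_len(desc->nr_page_va));
}
```

The extra page added to the count is most likely the reader page, although this hunk does not say so. Because each entry carries its own `nr_page_va`, a walker such as `for_each_ring_buffer_desc()` cannot index the list directly and has to step from one entry to the next.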

+41

include/linux/ring_buffer_types.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _LINUX_RING_BUFFER_TYPES_H 3 + #define _LINUX_RING_BUFFER_TYPES_H 4 + 5 + #include <asm/local.h> 6 + 7 + #define TS_SHIFT 27 8 + #define TS_MASK ((1ULL << TS_SHIFT) - 1) 9 + #define TS_DELTA_TEST (~TS_MASK) 10 + 11 + /* 12 + * We need to fit the time_stamp delta into 27 bits. 13 + */ 14 + static inline bool test_time_stamp(u64 delta) 15 + { 16 + return !!(delta & TS_DELTA_TEST); 17 + } 18 + 19 + #define BUF_PAGE_HDR_SIZE offsetof(struct buffer_data_page, data) 20 + 21 + #define RB_EVNT_HDR_SIZE (offsetof(struct ring_buffer_event, array)) 22 + #define RB_ALIGNMENT 4U 23 + #define RB_MAX_SMALL_DATA (RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX) 24 + #define RB_EVNT_MIN_SIZE 8U /* two 32bit words */ 25 + 26 + #ifndef CONFIG_HAVE_64BIT_ALIGNED_ACCESS 27 + # define RB_FORCE_8BYTE_ALIGNMENT 0 28 + # define RB_ARCH_ALIGNMENT RB_ALIGNMENT 29 + #else 30 + # define RB_FORCE_8BYTE_ALIGNMENT 1 31 + # define RB_ARCH_ALIGNMENT 8U 32 + #endif 33 + 34 + #define RB_ALIGN_DATA __aligned(RB_ARCH_ALIGNMENT) 35 + 36 + struct buffer_data_page { 37 + u64 time_stamp; /* page time stamp */ 38 + local_t commit; /* write committed index */ 39 + unsigned char data[] RB_ALIGN_DATA; /* data of buffer page */ 40 + }; 41 + #endif
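Despite its name, `test_time_stamp()` returns true when the delta does not fit in the 27-bit field, meaning any bit at position 27 or above is set. A small userspace restatement, with hypothetical `SK_`/`sketch_` names, makes the boundary explicit:

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_TS_SHIFT      27
#define SK_TS_MASK       ((1ULL << SK_TS_SHIFT) - 1)
#define SK_TS_DELTA_TEST (~SK_TS_MASK)

/* True when `delta` needs more than 27 bits and cannot be stored inline. */
static bool sketch_delta_overflows(uint64_t delta)
{
	return !!(delta & SK_TS_DELTA_TEST);
}
```

If timestamps are in nanoseconds, which is an assumption here, the largest inline delta is 2^27 - 1 ns, roughly 134 ms. A writer that sees a larger gap has to record the time some other way.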

+65

include/linux/simple_ring_buffer.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _LINUX_SIMPLE_RING_BUFFER_H 3 + #define _LINUX_SIMPLE_RING_BUFFER_H 4 + 5 + #include <linux/list.h> 6 + #include <linux/ring_buffer.h> 7 + #include <linux/ring_buffer_types.h> 8 + #include <linux/types.h> 9 + 10 + /* 11 + * Ideally those structs would stay private but the caller needs to know 12 + * the allocation size for simple_ring_buffer_init(). 13 + */ 14 + struct simple_buffer_page { 15 + struct list_head link; 16 + struct buffer_data_page *page; 17 + u64 entries; 18 + u32 write; 19 + u32 id; 20 + }; 21 + 22 + struct simple_rb_per_cpu { 23 + struct simple_buffer_page *tail_page; 24 + struct simple_buffer_page *reader_page; 25 + struct simple_buffer_page *head_page; 26 + struct simple_buffer_page *bpages; 27 + struct trace_buffer_meta *meta; 28 + u32 nr_pages; 29 + 30 + #define SIMPLE_RB_UNAVAILABLE 0 31 + #define SIMPLE_RB_READY 1 32 + #define SIMPLE_RB_WRITING 2 33 + u32 status; 34 + 35 + u64 last_overrun; 36 + u64 write_stamp; 37 + 38 + struct simple_rb_cbs *cbs; 39 + }; 40 + 41 + int simple_ring_buffer_init(struct simple_rb_per_cpu *cpu_buffer, struct simple_buffer_page *bpages, 42 + const struct ring_buffer_desc *desc); 43 + 44 + void simple_ring_buffer_unload(struct simple_rb_per_cpu *cpu_buffer); 45 + 46 + void *simple_ring_buffer_reserve(struct simple_rb_per_cpu *cpu_buffer, unsigned long length, 47 + u64 timestamp); 48 + 49 + void simple_ring_buffer_commit(struct simple_rb_per_cpu *cpu_buffer); 50 + 51 + int simple_ring_buffer_enable_tracing(struct simple_rb_per_cpu *cpu_buffer, bool enable); 52 + 53 + int simple_ring_buffer_reset(struct simple_rb_per_cpu *cpu_buffer); 54 + 55 + int simple_ring_buffer_swap_reader_page(struct simple_rb_per_cpu *cpu_buffer); 56 + 57 + int simple_ring_buffer_init_mm(struct simple_rb_per_cpu *cpu_buffer, 58 + struct simple_buffer_page *bpages, 59 + const struct ring_buffer_desc *desc, 60 + void *(*load_page)(unsigned long va), 61 + void (*unload_page)(void *va));
62 + 63 + void simple_ring_buffer_unload_mm(struct simple_rb_per_cpu *cpu_buffer, 64 + void (*unload_page)(void *)); 65 + #endif

+48

include/linux/trace_remote.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + 3 + #ifndef _LINUX_TRACE_REMOTE_H 4 + #define _LINUX_TRACE_REMOTE_H 5 + 6 + #include <linux/dcache.h> 7 + #include <linux/ring_buffer.h> 8 + #include <linux/trace_remote_event.h> 9 + 10 + /** 11 + * struct trace_remote_callbacks - Callbacks used by Tracefs to control the remote 12 + * @init: Called once the remote has been registered. Allows the 13 + * caller to extend the Tracefs remote directory 14 + * @load_trace_buffer: Called before Tracefs accesses the trace buffer for the first 15 + * time. Must return a &trace_buffer_desc 16 + * (most likely filled with trace_remote_alloc_buffer()) 17 + * @unload_trace_buffer: 18 + * Called once Tracefs has no use for the trace buffer 19 + * (most likely call trace_remote_free_buffer()) 20 + * @enable_tracing: Called on Tracefs tracing_on. It is expected from the 21 + * remote to allow writing. 22 + * @swap_reader_page: Called when Tracefs consumes a new page from a 23 + * ring-buffer. It is expected from the remote to isolate a 24 + * new reader-page from the @cpu ring-buffer. 25 + * @reset: Called on `echo 0 > trace`. It is expected from the 26 + * remote to reset all ring-buffer pages. 27 + * @enable_event: Called on events/event_name/enable. It is expected from 28 + * the remote to allow writing the event @id.
29 + */ 30 + struct trace_remote_callbacks { 31 + int (*init)(struct dentry *d, void *priv); 32 + struct trace_buffer_desc *(*load_trace_buffer)(unsigned long size, void *priv); 33 + void (*unload_trace_buffer)(struct trace_buffer_desc *desc, void *priv); 34 + int (*enable_tracing)(bool enable, void *priv); 35 + int (*swap_reader_page)(unsigned int cpu, void *priv); 36 + int (*reset)(unsigned int cpu, void *priv); 37 + int (*enable_event)(unsigned short id, bool enable, void *priv); 38 + }; 39 + 40 + int trace_remote_register(const char *name, struct trace_remote_callbacks *cbs, void *priv, 41 + struct remote_event *events, size_t nr_events); 42 + 43 + int trace_remote_alloc_buffer(struct trace_buffer_desc *desc, size_t desc_size, size_t buffer_size, 44 + const struct cpumask *cpumask); 45 + 46 + void trace_remote_free_buffer(struct trace_buffer_desc *desc); 47 + 48 + #endif

+33

include/linux/trace_remote_event.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + 3 + #ifndef _LINUX_TRACE_REMOTE_EVENTS_H 4 + #define _LINUX_TRACE_REMOTE_EVENTS_H 5 + 6 + struct trace_remote; 7 + struct trace_event_fields; 8 + struct trace_seq; 9 + 10 + struct remote_event_hdr { 11 + unsigned short id; 12 + }; 13 + 14 + #define REMOTE_EVENT_NAME_MAX 30 15 + struct remote_event { 16 + char name[REMOTE_EVENT_NAME_MAX]; 17 + unsigned short id; 18 + bool enabled; 19 + struct trace_remote *remote; 20 + struct trace_event_fields *fields; 21 + char *print_fmt; 22 + void (*print)(void *evt, struct trace_seq *seq); 23 + }; 24 + 25 + #define RE_STRUCT(__args...) __args 26 + #define re_field(__type, __field) __type __field; 27 + 28 + #define REMOTE_EVENT_FORMAT(__name, __struct) \ 29 + struct remote_event_format_##__name { \ 30 + struct remote_event_hdr hdr; \ 31 + __struct \ 32 + } 33 + #endif
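`REMOTE_EVENT_FORMAT()` turns a list of `re_field()` entries into a struct whose first member is the common `remote_event_hdr`. The sketch below rebuilds the three macros in userspace to show what a declaration expands to. It uses standard `__VA_ARGS__` instead of the GNU named variadic form in the header, and the `sk_`/`SK_` names as well as the `sketch_evt` event with its `addr` and `len` fields are made up for illustration.

```c
#include <stddef.h>

struct sk_remote_event_hdr {
	unsigned short id;
};

/* Pass the field list through unchanged. */
#define SK_RE_STRUCT(...) __VA_ARGS__

/* One struct member per field. */
#define sk_re_field(type, field) type field;

/* Wrap the fields in a struct that starts with the common header. */
#define SK_REMOTE_EVENT_FORMAT(name, fields)		\
	struct sk_remote_event_format_##name {		\
		struct sk_remote_event_hdr hdr;		\
		fields					\
	}

/*
 * Expands to:
 *   struct sk_remote_event_format_sketch_evt {
 *           struct sk_remote_event_hdr hdr;
 *           unsigned long addr;
 *           unsigned int len;
 *   };
 */
SK_REMOTE_EVENT_FORMAT(sketch_evt,
	SK_RE_STRUCT(
		sk_re_field(unsigned long, addr)
		sk_re_field(unsigned int, len)
	));
```

Keeping the header first means a reader can pull the event `id` out of any record before it knows which format struct applies.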

+73

include/trace/define_remote_events.h

···
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/trace_events.h>
+#include <linux/trace_remote_event.h>
+#include <linux/trace_seq.h>
+#include <linux/stringify.h>
+
+#define REMOTE_EVENT_INCLUDE(__file) __stringify(../../__file)
+
+#ifdef REMOTE_EVENT_SECTION
+# define __REMOTE_EVENT_SECTION(__name) __used __section(REMOTE_EVENT_SECTION"."#__name)
+#else
+# define __REMOTE_EVENT_SECTION(__name)
+#endif
+
+#define REMOTE_PRINTK_COUNT_ARGS(__args...) \
+	__COUNT_ARGS(, ##__args, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0)
+
+#define __remote_printk0() \
+	trace_seq_putc(seq, '\n')
+
+#define __remote_printk1(__fmt) \
+	trace_seq_puts(seq, " " __fmt "\n") \
+
+#define __remote_printk2(__fmt, __args...) \
+do { \
+	trace_seq_putc(seq, ' '); \
+	trace_seq_printf(seq, __fmt, __args); \
+	trace_seq_putc(seq, '\n'); \
+} while (0)
+
+/* Apply the appropriate trace_seq sequence according to the number of arguments */
+#define remote_printk(__args...) \
+	CONCATENATE(__remote_printk, REMOTE_PRINTK_COUNT_ARGS(__args))(__args)
+
+#define RE_PRINTK(__args...) __args
+
+#define REMOTE_EVENT(__name, __id, __struct, __printk) \
+	REMOTE_EVENT_FORMAT(__name, __struct); \
+	static void remote_event_print_##__name(void *evt, struct trace_seq *seq) \
+	{ \
+		struct remote_event_format_##__name __maybe_unused *__entry = evt; \
+		trace_seq_puts(seq, #__name); \
+		remote_printk(__printk); \
+	}
+#include REMOTE_EVENT_INCLUDE(REMOTE_EVENT_INCLUDE_FILE)
+
+#undef REMOTE_EVENT
+#undef RE_PRINTK
+#undef re_field
+#define re_field(__type, __field) \
+	{ \
+		.type = #__type, .name = #__field, \
+		.size = sizeof(__type), .align = __alignof__(__type), \
+		.is_signed = is_signed_type(__type), \
+	},
+#define __entry REC
+#define RE_PRINTK(__fmt, __args...) "\"" __fmt "\", " __stringify(__args)
+#define REMOTE_EVENT(__name, __id, __struct, __printk) \
+	static struct trace_event_fields remote_event_fields_##__name[] = { \
+		__struct \
+		{} \
+	}; \
+	static char remote_event_print_fmt_##__name[] = __printk; \
+	static struct remote_event __REMOTE_EVENT_SECTION(__name) \
+	remote_event_##__name = { \
+		.name = #__name, \
+		.id = __id, \
+		.fields = remote_event_fields_##__name, \
+		.print_fmt = remote_event_print_fmt_##__name, \
+		.print = remote_event_print_##__name, \
+	}
+#include REMOTE_EVENT_INCLUDE(REMOTE_EVENT_INCLUDE_FILE)

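The `remote_printk()` dispatch in this header counts its arguments and pastes the result onto a helper name, so that an event with no format, a bare format, or a format plus values each gets a different `trace_seq` sequence. A minimal user-space sketch of that trick follows. It assumes GNU C comma deletion (`, ##__VA_ARGS__`), and `struct demo_seq`, `demo_seq_printf()` and every `DEMO_*`/`demo_*` name are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct trace_seq. */
struct demo_seq {
	char buf[128];
	size_t len;
};

/* Append formatted text, truncating when the buffer is full. */
static void demo_seq_printf(struct demo_seq *s, const char *fmt, ...)
{
	size_t avail = sizeof(s->buf) - s->len;
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(s->buf + s->len, avail, fmt, ap);
	va_end(ap);

	assert(n >= 0);
	s->len += (size_t)n < avail ? (size_t)n : avail - 1;
}

/* Evaluates to 0, 1 or 2 depending on how many variadic arguments follow s. */
#define DEMO_PICK(_0, _1, _2, _3, _4, N, ...) N
#define DEMO_KIND(s, ...) DEMO_PICK(, ##__VA_ARGS__, 2, 2, 2, 1, 0, unused)

#define DEMO_CAT_(a, b) a##b
#define DEMO_CAT(a, b) DEMO_CAT_(a, b)

/* fmt must be a string literal so it can be concatenated at compile time. */
#define demo_printk0(s)			demo_seq_printf(s, "\n")
#define demo_printk1(s, fmt)		demo_seq_printf(s, " " fmt "\n")
#define demo_printk2(s, fmt, ...)	demo_seq_printf(s, " " fmt "\n", __VA_ARGS__)

/* Accepts no format, a bare format, or a format plus up to three values. */
#define demo_printk(s, ...) \
	DEMO_CAT(demo_printk, DEMO_KIND(s, ##__VA_ARGS__))(s, ##__VA_ARGS__)
```

The sketch keeps a named first parameter on every variadic macro because GNU comma deletion is applied most predictably when the variadic part is omitted entirely; the kernel macro instead takes only `__args...` and is built with GNU extensions enabled.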
+7

include/uapi/linux/kvm.h

···
 #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
 #define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
 	((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+
+#define KVM_VM_TYPE_ARM_PROTECTED (1UL << 31)
+#define KVM_VM_TYPE_ARM_MASK (KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
+			      KVM_VM_TYPE_ARM_PROTECTED)
+
 /*
  * ioctls for /dev/kvm fds:
  */
···
 #define KVM_DEV_TYPE_LOONGARCH_PCHPIC KVM_DEV_TYPE_LOONGARCH_PCHPIC
 	KVM_DEV_TYPE_LOONGARCH_DMSINTC,
 #define KVM_DEV_TYPE_LOONGARCH_DMSINTC KVM_DEV_TYPE_LOONGARCH_DMSINTC
+	KVM_DEV_TYPE_ARM_VGIC_V5,
+#define KVM_DEV_TYPE_ARM_VGIC_V5 KVM_DEV_TYPE_ARM_VGIC_V5

 	KVM_DEV_TYPE_MAX,
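The new `KVM_VM_TYPE_ARM_PROTECTED` bit shares the machine type word with the IPA size field. A short sketch of how a VMM might compose and sanity-check that word is given here. The macros are copied from the hunk above so the example builds without kernel headers; `demo_vm_type()` and `demo_vm_type_valid()` are hypothetical helpers, and whether the kernel validates the type in exactly this way is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Copied from the hunk above so the sketch builds without kernel headers. */
#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK	0xffULL
#define KVM_VM_TYPE_ARM_IPA_SIZE(x)	((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
#define KVM_VM_TYPE_ARM_PROTECTED	(1UL << 31)
#define KVM_VM_TYPE_ARM_MASK		(KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
					 KVM_VM_TYPE_ARM_PROTECTED)

/* Hypothetical helper: build the machine type argument for KVM_CREATE_VM. */
static unsigned long demo_vm_type(unsigned int ipa_bits, bool protected_vm)
{
	unsigned long type = KVM_VM_TYPE_ARM_IPA_SIZE(ipa_bits);

	if (protected_vm)
		type |= KVM_VM_TYPE_ARM_PROTECTED;

	return type;
}

/* Hypothetical check: reject any bit outside the documented mask. */
static bool demo_vm_type_valid(unsigned long type)
{
	return !(type & ~KVM_VM_TYPE_ARM_MASK);
}
```

A caller would pass the result of `demo_vm_type(40, true)` as the machine type when requesting a protected VM with a 40-bit IPA space.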

+4 -4

include/uapi/linux/trace_mmap.h

···
  * @entries: Number of entries in the ring-buffer.
  * @overrun: Number of entries lost in the ring-buffer.
  * @read: Number of entries that have been read.
- * @Reserved1: Internal use only.
- * @Reserved2: Internal use only.
+ * @pages_lost: Number of pages overwritten by the writer.
+ * @pages_touched: Number of pages written by the writer.
  */
 struct trace_buffer_meta {
 	__u32 meta_page_size;
···
 	__u64 overrun;
 	__u64 read;

-	__u64 Reserved1;
-	__u64 Reserved2;
+	__u64 pages_lost;
+	__u64 pages_touched;
 };

 #define TRACE_MMAP_IOCTL_GET_READER _IO('R', 0x20)

+14

kernel/trace/Kconfig

···

 source "kernel/trace/rv/Kconfig"

+config TRACE_REMOTE
+	bool
+
+config SIMPLE_RING_BUFFER
+	bool
+
+config TRACE_REMOTE_TEST
+	tristate "Test module for remote tracing"
+	select TRACE_REMOTE
+	select SIMPLE_RING_BUFFER
+	help
+	  This trace remote includes a ring-buffer writer implementation using
+	  "simple_ring_buffer". This is solely intended for testing.
+
 endif # FTRACE

+59

kernel/trace/Makefile

···
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
 obj-$(CONFIG_RV) += rv/

+obj-$(CONFIG_TRACE_REMOTE) += trace_remote.o
+obj-$(CONFIG_SIMPLE_RING_BUFFER) += simple_ring_buffer.o
+obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
+
+#
+# simple_ring_buffer is used by the pKVM hypervisor which does not have access
+# to all kernel symbols. Fail the build if forbidden symbols are found.
+#
+# undefsyms_base generates a set of compiler and tooling-generated symbols that can
+# safely be ignored for simple_ring_buffer.
+#
+filechk_undefsyms_base = \
+	echo '$(pound)include <linux/atomic.h>'; \
+	echo '$(pound)include <linux/string.h>'; \
+	echo '$(pound)include <asm/page.h>'; \
+	echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);'; \
+	echo 'void undefsyms_base(void *p, int n);'; \
+	echo 'void undefsyms_base(void *p, int n) {'; \
+	echo ' char buffer[256] = { 0 };'; \
+	echo ' u32 u = 0;'; \
+	echo ' memset((char * volatile)page, 8, PAGE_SIZE);'; \
+	echo ' memset((char * volatile)buffer, 8, sizeof(buffer));'; \
+	echo ' memcpy((void * volatile)p, buffer, sizeof(buffer));'; \
+	echo ' cmpxchg((u32 * volatile)&u, 0, 8);'; \
+	echo ' WARN_ON(n == 0xdeadbeef);'; \
+	echo '}'
+
+$(obj)/undefsyms_base.c: FORCE
+	$(call filechk,undefsyms_base)
+
+clean-files += undefsyms_base.c
+
+$(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
+
+targets += undefsyms_base.o
+
+# Ensure KASAN is enabled to avoid logic that may disable FORTIFY_SOURCE when
+# KASAN is not enabled. undefsyms_base.o does not automatically get KASAN flags
+# because it is not linked into vmlinux.
+KASAN_SANITIZE_undefsyms_base.o := y
+
+UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
+		      __msan simple_ring_buffer \
+		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
+
+quiet_cmd_check_undefined = NM $<
+      cmd_check_undefined = \
+	undefsyms=$$($(NM) -u $< | grep -v $(addprefix -e , $(UNDEFINED_ALLOWLIST)) || true); \
+	if [ -n "$$undefsyms" ]; then \
+		echo "Unexpected symbols in $<:" >&2; \
+		echo "$$undefsyms" >&2; \
+		false; \
+	fi
+
+$(obj)/%.o.checked: $(obj)/%.o $(obj)/undefsyms_base.o FORCE
+	$(call if_changed,check_undefined)
+
+always-$(CONFIG_SIMPLE_RING_BUFFER) += simple_ring_buffer.o.checked
+
 libftrace-y := ftrace.o

+261

kernel/trace/remote_test.c

···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2025 - Google LLC
+ * Author: Vincent Donnefort <vdonnefort@google.com>
+ */
+
+#include <linux/module.h>
+#include <linux/simple_ring_buffer.h>
+#include <linux/trace_remote.h>
+#include <linux/tracefs.h>
+#include <linux/types.h>
+
+#define REMOTE_EVENT_INCLUDE_FILE kernel/trace/remote_test_events.h
+#include <trace/define_remote_events.h>
+
+static DEFINE_PER_CPU(struct simple_rb_per_cpu *, simple_rbs);
+static struct trace_buffer_desc *remote_test_buffer_desc;
+
+/*
+ * The trace_remote lock already serializes accesses from the trace_remote_callbacks.
+ * However write_event can still race with load/unload.
+ */
+static DEFINE_MUTEX(simple_rbs_lock);
+
+static int remote_test_load_simple_rb(int cpu, struct ring_buffer_desc *rb_desc)
+{
+	struct simple_rb_per_cpu *cpu_buffer;
+	struct simple_buffer_page *bpages;
+	int ret = -ENOMEM;
+
+	cpu_buffer = kmalloc_obj(*cpu_buffer);
+	if (!cpu_buffer)
+		return ret;
+
+	bpages = kmalloc_objs(*bpages, rb_desc->nr_page_va);
+	if (!bpages)
+		goto err_free_cpu_buffer;
+
+	ret = simple_ring_buffer_init(cpu_buffer, bpages, rb_desc);
+	if (ret)
+		goto err_free_bpages;
+
+	scoped_guard(mutex, &simple_rbs_lock) {
+		WARN_ON(*per_cpu_ptr(&simple_rbs, cpu));
+		*per_cpu_ptr(&simple_rbs, cpu) = cpu_buffer;
+	}
+
+	return 0;
+
+err_free_bpages:
+	kfree(bpages);
+
+err_free_cpu_buffer:
+	kfree(cpu_buffer);
+
+	return ret;
+}
+
+static void remote_test_unload_simple_rb(int cpu)
+{
+	struct simple_rb_per_cpu *cpu_buffer = *per_cpu_ptr(&simple_rbs, cpu);
+	struct simple_buffer_page *bpages;
+
+	if (!cpu_buffer)
+		return;
+
+	guard(mutex)(&simple_rbs_lock);
+
+	bpages = cpu_buffer->bpages;
+	simple_ring_buffer_unload(cpu_buffer);
+	kfree(bpages);
+	kfree(cpu_buffer);
+	*per_cpu_ptr(&simple_rbs, cpu) = NULL;
+}
+
+static struct trace_buffer_desc *remote_test_load(unsigned long size, void *unused)
+{
+	struct ring_buffer_desc *rb_desc;
+	struct trace_buffer_desc *desc;
+	size_t desc_size;
+	int cpu, ret;
+
+	if (WARN_ON(remote_test_buffer_desc))
+		return ERR_PTR(-EINVAL);
+
+	desc_size = trace_buffer_desc_size(size, num_possible_cpus());
+	if (desc_size == SIZE_MAX) {
+		ret = -E2BIG;
+		goto err;
+	}
+
+	desc = kmalloc(desc_size, GFP_KERNEL);
+	if (!desc) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	ret = trace_remote_alloc_buffer(desc, desc_size, size, cpu_possible_mask);
+	if (ret)
+		goto err_free_desc;
+
+	for_each_ring_buffer_desc(rb_desc, cpu, desc) {
+		ret = remote_test_load_simple_rb(rb_desc->cpu, rb_desc);
+		if (ret)
+			goto err_unload;
+	}
+
+	remote_test_buffer_desc = desc;
+
+	return remote_test_buffer_desc;
+
+err_unload:
+	/* remote_test_buffer_desc is still NULL here: unwind the local desc */
+	for_each_ring_buffer_desc(rb_desc, cpu, desc)
+		remote_test_unload_simple_rb(rb_desc->cpu);
+	trace_remote_free_buffer(desc);
+
+err_free_desc:
+	kfree(desc);
+
+err:
+	return ERR_PTR(ret);
+}
+
+static void remote_test_unload(struct trace_buffer_desc *desc, void *unused)
+{
+	struct ring_buffer_desc *rb_desc;
+	int cpu;
+
+	if (WARN_ON(desc != remote_test_buffer_desc))
+		return;
+
+	for_each_ring_buffer_desc(rb_desc, cpu, desc)
+		remote_test_unload_simple_rb(rb_desc->cpu);
+
+	remote_test_buffer_desc = NULL;
+	trace_remote_free_buffer(desc);
+	kfree(desc);
+}
+
+static int remote_test_enable_tracing(bool enable, void *unused)
+{
+	struct ring_buffer_desc *rb_desc;
+	int cpu;
+
+	if (!remote_test_buffer_desc)
+		return -ENODEV;
+
+	for_each_ring_buffer_desc(rb_desc, cpu, remote_test_buffer_desc)
+		WARN_ON(simple_ring_buffer_enable_tracing(*per_cpu_ptr(&simple_rbs, rb_desc->cpu),
+							  enable));
+	return 0;
+}
+
+static int remote_test_swap_reader_page(unsigned int cpu, void *unused)
+{
+	struct simple_rb_per_cpu *cpu_buffer;
+
+	if (cpu >= NR_CPUS)
+		return -EINVAL;
+
+	cpu_buffer = *per_cpu_ptr(&simple_rbs, cpu);
+	if (!cpu_buffer)
+		return -EINVAL;
+
+	return simple_ring_buffer_swap_reader_page(cpu_buffer);
+}
+
+static int remote_test_reset(unsigned int cpu, void *unused)
+{
+	struct simple_rb_per_cpu *cpu_buffer;
+
+	if (cpu >= NR_CPUS)
+		return -EINVAL;
+
+	cpu_buffer = *per_cpu_ptr(&simple_rbs, cpu);
+	if (!cpu_buffer)
+		return -EINVAL;
+
+	return simple_ring_buffer_reset(cpu_buffer);
+}
+
+static int remote_test_enable_event(unsigned short id, bool enable, void *unused)
+{
+	if (id != REMOTE_TEST_EVENT_ID)
+		return -EINVAL;
+
+	/*
+	 * Let's just use the struct remote_event enabled field that is turned on and off by
+	 * trace_remote. This is a bit racy but good enough for a simple test module.
+	 */
+	return 0;
+}
+
+static ssize_t
+write_event_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *pos)
+{
+	struct remote_event_format_selftest *evt_test;
+	struct simple_rb_per_cpu *cpu_buffer;
+	unsigned long val;
+	int ret;
+
+	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&simple_rbs_lock);
+
+	if (!remote_event_selftest.enabled)
+		return -ENODEV;
+
+	guard(preempt)();
+
+	cpu_buffer = *this_cpu_ptr(&simple_rbs);
+	if (!cpu_buffer)
+		return -ENODEV;
+
+	evt_test = simple_ring_buffer_reserve(cpu_buffer,
+					      sizeof(struct remote_event_format_selftest),
+					      trace_clock_global());
+	if (!evt_test)
+		return -ENODEV;
+
+	evt_test->hdr.id = REMOTE_TEST_EVENT_ID;
+	evt_test->id = val;
+
+	simple_ring_buffer_commit(cpu_buffer);
+
+	return cnt;
+}
+
+static const struct file_operations write_event_fops = {
+	.write = write_event_write,
+};
+
+static int remote_test_init_tracefs(struct dentry *d, void *unused)
+{
+	return tracefs_create_file("write_event", 0200, d, NULL, &write_event_fops) ?
+		0 : -ENOMEM;
+}
+
+static struct trace_remote_callbacks trace_remote_callbacks = {
+	.init = remote_test_init_tracefs,
+	.load_trace_buffer = remote_test_load,
+	.unload_trace_buffer = remote_test_unload,
+	.enable_tracing = remote_test_enable_tracing,
+	.swap_reader_page = remote_test_swap_reader_page,
+	.reset = remote_test_reset,
+	.enable_event = remote_test_enable_event,
+};
+
+static int __init remote_test_init(void)
+{
+	return trace_remote_register("test", &trace_remote_callbacks, NULL,
+				     &remote_event_selftest, 1);
+}
+
+module_init(remote_test_init);
+
+MODULE_DESCRIPTION("Test module for the trace remote interface");
+MODULE_AUTHOR("Vincent Donnefort");
+MODULE_LICENSE("GPL");

+10

kernel/trace/remote_test_events.h

···
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define REMOTE_TEST_EVENT_ID 1
+
+REMOTE_EVENT(selftest, REMOTE_TEST_EVENT_ID,
+	RE_STRUCT(
+		re_field(u64, id)
+	),
+	RE_PRINTK("id=%llu", __entry->id)
+);
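This events file is consumed twice by `define_remote_events.h`: once with `REMOTE_EVENT` defined to emit a print function, and once to emit a `struct remote_event` descriptor pointing at it. The same two-pass idea can be sketched in plain C with a list macro, under a few assumptions: the kernel re-includes a file rather than using a list macro, the second event `other` is made up, and all `demo_*`/`DEMO_*` names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical event list. The kernel header re-includes
 * REMOTE_EVENT_INCLUDE_FILE instead; "other" is an invented second event.
 */
#define DEMO_EVENTS(EVENT)		\
	EVENT(selftest, 1, "id=%llu")	\
	EVENT(other, 2, "val=%llu")

/* Pass 1: emit one print function per event. */
#define DEMO_PRINT_FN(name, id, fmt)					\
	static int demo_print_##name(char *buf, size_t len,		\
				     unsigned long long val)		\
	{								\
		return snprintf(buf, len, #name " " fmt "\n", val);	\
	}
DEMO_EVENTS(DEMO_PRINT_FN)

struct demo_event {
	const char *name;
	unsigned short id;
	const char *print_fmt;
	int (*print)(char *buf, size_t len, unsigned long long val);
};

/* Pass 2: emit one descriptor per event, pointing at the pass 1 function. */
#define DEMO_DESC(name, id, fmt) { #name, id, fmt, demo_print_##name },
static const struct demo_event demo_events[] = {
	DEMO_EVENTS(DEMO_DESC)
};

/* Look an event up by ID; one plausible use when decoding a record. */
static const struct demo_event *demo_find_event(unsigned short id)
{
	size_t i;

	for (i = 0; i < sizeof(demo_events) / sizeof(demo_events[0]); i++) {
		if (demo_events[i].id == id)
			return &demo_events[i];
	}

	return NULL;
}
```

Keeping the event list in a single place means the print function and its descriptor cannot drift apart, which appears to be the motivation for the double inclusion in the kernel header.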

+303 -51

kernel/trace/ring_buffer.c

···
  *
  * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
  */
+#include <linux/ring_buffer_types.h>
 #include <linux/sched/isolation.h>
 #include <linux/trace_recursion.h>
 #include <linux/trace_events.h>
···
 /* Used for individual buffers (after the counter) */
 #define RB_BUFFER_OFF (1 << 20)

-#define BUF_PAGE_HDR_SIZE offsetof(struct buffer_data_page, data)
-
-#define RB_EVNT_HDR_SIZE (offsetof(struct ring_buffer_event, array))
-#define RB_ALIGNMENT 4U
-#define RB_MAX_SMALL_DATA (RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
-#define RB_EVNT_MIN_SIZE 8U /* two 32bit words */
-
-#ifndef CONFIG_HAVE_64BIT_ALIGNED_ACCESS
-# define RB_FORCE_8BYTE_ALIGNMENT 0
-# define RB_ARCH_ALIGNMENT RB_ALIGNMENT
-#else
-# define RB_FORCE_8BYTE_ALIGNMENT 1
-# define RB_ARCH_ALIGNMENT 8U
-#endif
-
-#define RB_ALIGN_DATA __aligned(RB_ARCH_ALIGNMENT)
-
 /* define RINGBUF_TYPE_DATA for 'case RINGBUF_TYPE_DATA:' */
 #define RINGBUF_TYPE_DATA 0 ... RINGBUF_TYPE_DATA_TYPE_LEN_MAX

···
 #define for_each_online_buffer_cpu(buffer, cpu) \
 	for_each_cpu_and(cpu, buffer->cpumask, cpu_online_mask)

-#define TS_SHIFT 27
-#define TS_MASK ((1ULL << TS_SHIFT) - 1)
-#define TS_DELTA_TEST (~TS_MASK)
-
 static u64 rb_event_time_stamp(struct ring_buffer_event *event)
 {
 	u64 ts;
···
 #define RB_MISSED_STORED (1 << 30)

 #define RB_MISSED_MASK (3 << 30)
-
-struct buffer_data_page {
-	u64 time_stamp; /* page time stamp */
-	local_t commit; /* write committed index */
-	unsigned char data[] RB_ALIGN_DATA; /* data of buffer page */
-};

 struct buffer_data_read_page {
 	unsigned order; /* order of the page */
···
 	rb_init_page(dpage);

 	return dpage;
-}
-
-/*
- * We need to fit the time_stamp delta into 27 bits.
- */
-static inline bool test_time_stamp(u64 delta)
-{
-	return !!(delta & TS_DELTA_TEST);
 }

 struct rb_irq_work {
···
 	unsigned int mapped;
 	unsigned int user_mapped; /* user space mapping */
 	struct mutex mapping_lock;
-	unsigned long *subbuf_ids; /* ID to subbuf VA */
+	struct buffer_page **subbuf_ids; /* ID to subbuf VA */
 	struct trace_buffer_meta *meta_page;
 	struct ring_buffer_cpu_meta *ring_meta;
+
+	struct ring_buffer_remote *remote;

 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long nr_pages_to_update;
···
 	struct mutex mutex;

 	struct ring_buffer_per_cpu **buffers;
+
+	struct ring_buffer_remote *remote;

 	struct hlist_node node;
 	u64 (*clock)(void);
···
 	trace_seq_printf(s, "\tfield: char data;\t"
 			 "offset:%u;\tsize:%u;\tsigned:%u;\n",
 			 (unsigned int)offsetof(typeof(field), data),
-			 (unsigned int)buffer->subbuf_size,
+			 (unsigned int)(buffer ? buffer->subbuf_size :
+						  PAGE_SIZE - BUF_PAGE_HDR_SIZE),
 			 (unsigned int)is_signed_type(char));

 	return !trace_seq_has_overflowed(s);
···
 	}
 }

+static struct ring_buffer_desc *ring_buffer_desc(struct trace_buffer_desc *trace_desc, int cpu)
+{
+	struct ring_buffer_desc *desc, *end;
+	size_t len;
+	int i;
+
+	if (!trace_desc)
+		return NULL;
+
+	if (cpu >= trace_desc->nr_cpus)
+		return NULL;
+
+	end = (struct ring_buffer_desc *)((void *)trace_desc + trace_desc->struct_len);
+	desc = __first_ring_buffer_desc(trace_desc);
+	len = struct_size(desc, page_va, desc->nr_page_va);
+	desc = (struct ring_buffer_desc *)((void *)desc + (len * cpu));
+
+	if (desc < end && desc->cpu == cpu)
+		return desc;
+
+	/* Missing CPUs, need to linear search */
+	for_each_ring_buffer_desc(desc, i, trace_desc) {
+		if (desc->cpu == cpu)
+			return desc;
+	}
+
+	return NULL;
+}
+
+static void *ring_buffer_desc_page(struct ring_buffer_desc *desc, int page_id)
+{
+	/* page_va[] holds nr_page_va entries: valid IDs are 0..nr_page_va - 1 */
+	return page_id >= desc->nr_page_va ? NULL : (void *)desc->page_va[page_id];
+}
+
 static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		long nr_pages, struct list_head *pages)
 {
···
 	struct ring_buffer_cpu_meta *meta = NULL;
 	struct buffer_page *bpage, *tmp;
 	bool user_thread = current->mm != NULL;
+	struct ring_buffer_desc *desc = NULL;
 	long i;

 	/*
···
 	if (buffer->range_addr_start)
 		meta = rb_range_meta(buffer, nr_pages, cpu_buffer->cpu);

+	if (buffer->remote) {
+		desc = ring_buffer_desc(buffer->remote->desc, cpu_buffer->cpu);
+		if (!desc || WARN_ON(desc->nr_page_va != (nr_pages + 1)))
+			return -EINVAL;
+	}
+
 	for (i = 0; i < nr_pages; i++) {

 		bpage = alloc_cpu_page(cpu_buffer->cpu);
···
 				rb_meta_buffer_update(cpu_buffer, bpage);
 			bpage->range = 1;
 			bpage->id = i + 1;
+		} else if (desc) {
+			void *p = ring_buffer_desc_page(desc, i + 1);
+
+			if (WARN_ON(!p))
+				goto free_pages;
+
+			bpage->page = p;
+			bpage->range = 1; /* bpage->page can't be freed */
+			bpage->id = i + 1;
+			cpu_buffer->subbuf_ids[i + 1] = bpage;
 		} else {
 			int order = cpu_buffer->buffer->subbuf_order;
 			bpage->page = alloc_cpu_data(cpu_buffer->cpu, order);
···
 		if (cpu_buffer->ring_meta->head_buffer)
 			rb_meta_buffer_update(cpu_buffer, bpage);
 		bpage->range = 1;
+	} else if (buffer->remote) {
+		struct ring_buffer_desc *desc = ring_buffer_desc(buffer->remote->desc, cpu);
+
+		if (!desc)
+			goto fail_free_reader;
+
+		cpu_buffer->remote = buffer->remote;
+		cpu_buffer->meta_page = (struct trace_buffer_meta *)(void *)desc->meta_va;
+		cpu_buffer->nr_pages = nr_pages;
+		cpu_buffer->subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1,
+						 sizeof(*cpu_buffer->subbuf_ids), GFP_KERNEL);
+		if (!cpu_buffer->subbuf_ids)
+			goto fail_free_reader;
+
+		/* Remote buffers are read-only and immutable */
+		atomic_inc(&cpu_buffer->record_disabled);
+		atomic_inc(&cpu_buffer->resize_disabled);
+
+		bpage->page = ring_buffer_desc_page(desc, cpu_buffer->meta_page->reader.id);
+		if (!bpage->page)
+			goto fail_free_reader;
+
+		bpage->range = 1;
+		cpu_buffer->subbuf_ids[0] = bpage;
 	} else {
 		int order = cpu_buffer->buffer->subbuf_order;
 		bpage->page = alloc_cpu_data(cpu, order);
···

 	irq_work_sync(&cpu_buffer->irq_work.work);

+	if (cpu_buffer->remote)
+		kfree(cpu_buffer->subbuf_ids);
+
 	free_buffer_page(cpu_buffer->reader_page);

 	if (head) {
···
 					 int order, unsigned long start,
 					 unsigned long end,
 					 unsigned long scratch_size,
-					 struct lock_class_key *key)
+					 struct lock_class_key *key,
+					 struct ring_buffer_remote *remote)
 {
 	struct trace_buffer *buffer __free(kfree) = NULL;
 	long nr_pages;
···
 				  GFP_KERNEL);
 	if (!buffer->buffers)
 		goto fail_free_cpumask;
+
+	cpu = raw_smp_processor_id();

 	/* If start/end are specified, then that overrides size */
 	if (start && end) {
···
 		buffer->range_addr_end = end;

 		rb_range_meta_init(buffer, nr_pages, scratch_size);
+	} else if (remote) {
+		struct ring_buffer_desc *desc = ring_buffer_desc(remote->desc, cpu);
+
+		buffer->remote = remote;
+		/* The writer is remote. This ring-buffer is read-only */
+		atomic_inc(&buffer->record_disabled);
+		nr_pages = desc->nr_page_va - 1;
+		if (nr_pages < 2)
+			goto fail_free_buffers;
 	} else {

 		/* need at least two pages */
···
 			nr_pages = 2;
 	}

-	cpu = raw_smp_processor_id();
 	cpumask_set_cpu(cpu, buffer->cpumask);
 	buffer->buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
 	if (!buffer->buffers[cpu])
···
 					struct lock_class_key *key)
 {
 	/* Default buffer page size - one system page */
-	return alloc_buffer(size, flags, 0, 0, 0, 0, key);
+	return alloc_buffer(size, flags, 0, 0, 0, 0, key, NULL);

 }
 EXPORT_SYMBOL_GPL(__ring_buffer_alloc);
···
 					       struct lock_class_key *key)
 {
 	return alloc_buffer(size, flags, order, start, start + range_size,
-			    scratch_size, key);
+			    scratch_size, key, NULL);
+}
+
+/**
+ * __ring_buffer_alloc_remote - allocate a new ring_buffer from a remote
+ * @remote: Contains a description of the ring-buffer pages and remote callbacks.
+ * @key: ring buffer reader_lock_key.
+ */
+struct trace_buffer *__ring_buffer_alloc_remote(struct ring_buffer_remote *remote,
+						struct lock_class_key *key)
+{
+	return alloc_buffer(0, 0, 0, 0, 0, 0, key, remote);
 }

 void *ring_buffer_meta_scratch(struct trace_buffer *buffer, unsigned int *size)
···
 }
 EXPORT_SYMBOL_GPL(ring_buffer_overruns);

+static bool rb_read_remote_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	local_set(&cpu_buffer->entries, READ_ONCE(cpu_buffer->meta_page->entries));
+	local_set(&cpu_buffer->overrun, READ_ONCE(cpu_buffer->meta_page->overrun));
+	local_set(&cpu_buffer->pages_touched, READ_ONCE(cpu_buffer->meta_page->pages_touched));
+	local_set(&cpu_buffer->pages_lost, READ_ONCE(cpu_buffer->meta_page->pages_lost));
+
+	return rb_num_of_entries(cpu_buffer);
+}
+
+static void rb_update_remote_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct buffer_page *next, *orig;
+	int retry = 3;
+
+	orig = next = cpu_buffer->head_page;
+	rb_inc_page(&next);
+
+	/* Run after the writer */
+	while (cpu_buffer->head_page->page->time_stamp > next->page->time_stamp) {
+		rb_inc_page(&next);
+
+		rb_list_head_clear(cpu_buffer->head_page->list.prev);
+		rb_inc_page(&cpu_buffer->head_page);
+		rb_set_list_to_head(cpu_buffer->head_page->list.prev);
+
+		if (cpu_buffer->head_page == orig) {
+			if (WARN_ON_ONCE(!(--retry)))
+				return;
+		}
+	}
+
+	orig = cpu_buffer->commit_page = cpu_buffer->head_page;
+	retry = 3;
+
+	while (cpu_buffer->commit_page->page->time_stamp < next->page->time_stamp) {
+		rb_inc_page(&next);
+		rb_inc_page(&cpu_buffer->commit_page);
+
+		if (cpu_buffer->commit_page == orig) {
+			if (WARN_ON_ONCE(!(--retry)))
+				return;
+		}
+	}
+}
+
 static void rb_iter_reset(struct ring_buffer_iter *iter)
 {
 	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	if (cpu_buffer->remote) {
+		rb_read_remote_meta_page(cpu_buffer);
+		rb_update_remote_head(cpu_buffer);
+	}

 	/* Iterator usage is expected to have record disabled */
 	iter->head_page = cpu_buffer->reader_page;
···
 }

 static struct buffer_page *
-rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
+__rb_get_reader_page_from_remote(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct buffer_page *new_reader, *prev_reader, *prev_head, *new_head, *last;
+
+	if (!rb_read_remote_meta_page(cpu_buffer))
+		return NULL;
+
+	/* More to read on the reader page */
+	if (cpu_buffer->reader_page->read < rb_page_size(cpu_buffer->reader_page)) {
+		if (!cpu_buffer->reader_page->read)
+			cpu_buffer->read_stamp = cpu_buffer->reader_page->page->time_stamp;
+		return cpu_buffer->reader_page;
+	}
+
+	prev_reader = cpu_buffer->subbuf_ids[cpu_buffer->meta_page->reader.id];
+
+	WARN_ON_ONCE(cpu_buffer->remote->swap_reader_page(cpu_buffer->cpu,
+							  cpu_buffer->remote->priv));
+	/* nr_pages doesn't include the reader page */
+	if (WARN_ON_ONCE(cpu_buffer->meta_page->reader.id > cpu_buffer->nr_pages))
+		return NULL;
+
+	new_reader = cpu_buffer->subbuf_ids[cpu_buffer->meta_page->reader.id];
+
+	WARN_ON_ONCE(prev_reader == new_reader);
+
+	prev_head = new_reader; /* New reader was also the previous head */
+	new_head = prev_head;
+	rb_inc_page(&new_head);
+	last = prev_head;
+	rb_dec_page(&last);
+
+	/* Clear the old HEAD flag */
+	rb_list_head_clear(cpu_buffer->head_page->list.prev);
+
+	prev_reader->list.next = prev_head->list.next;
+	prev_reader->list.prev = prev_head->list.prev;
+
+	/* Swap prev_reader with new_reader */
+	last->list.next = &prev_reader->list;
+	new_head->list.prev = &prev_reader->list;
+
+	new_reader->list.prev = &new_reader->list;
+	new_reader->list.next = &new_head->list;
+
+	/* Reactivate the HEAD flag */
+	rb_set_list_to_head(&last->list);
+
+	cpu_buffer->head_page = new_head;
+	cpu_buffer->reader_page = new_reader;
+	cpu_buffer->pages = &new_head->list;
+	cpu_buffer->read_stamp = new_reader->page->time_stamp;
+	cpu_buffer->lost_events = cpu_buffer->meta_page->reader.lost_events;
+
+	return rb_page_size(cpu_buffer->reader_page) ? cpu_buffer->reader_page : NULL;
+}
+
+static struct buffer_page *
+__rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct buffer_page *reader = NULL;
 	unsigned long bsize = READ_ONCE(cpu_buffer->buffer->subbuf_size);
···


 	return reader;
+}
+
+static struct buffer_page *
+rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->remote ? __rb_get_reader_page_from_remote(cpu_buffer) :
+				    __rb_get_reader_page(cpu_buffer);
 }

 static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
···
 	meta->entries = local_read(&cpu_buffer->entries);
 	meta->overrun = local_read(&cpu_buffer->overrun);
 	meta->read = cpu_buffer->read;
+	meta->pages_lost = local_read(&cpu_buffer->pages_lost);
+	meta->pages_touched = local_read(&cpu_buffer->pages_touched);

 	/* Some archs do not have data cache coherency between kernel and user-space */
 	flush_kernel_vmap_range(cpu_buffer->meta_page, PAGE_SIZE);
···
 rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct buffer_page *page;
+
+	if (cpu_buffer->remote) {
+		if (!cpu_buffer->remote->reset)
+			return;
+
+		cpu_buffer->remote->reset(cpu_buffer->cpu, cpu_buffer->remote->priv);
+		rb_read_remote_meta_page(cpu_buffer);
+
+		/* Read related values, not covered by the meta-page */
+		local_set(&cpu_buffer->pages_read, 0);
+		cpu_buffer->read = 0;
+		cpu_buffer->read_bytes = 0;
+		cpu_buffer->last_overrun = 0;
+		cpu_buffer->reader_page->read = 0;
+
+		return;
+	}

 	rb_head_page_deactivate(cpu_buffer);

···
 	return ret;
 }
 EXPORT_SYMBOL_GPL(ring_buffer_empty_cpu);
+
+int ring_buffer_poll_remote(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (cpu != RING_BUFFER_ALL_CPUS) {
+		if (!cpumask_test_cpu(cpu, buffer->cpumask))
+			return -EINVAL;
+
+		cpu_buffer = buffer->buffers[cpu];
+
+		guard(raw_spinlock)(&cpu_buffer->reader_lock);
+		if (rb_read_remote_meta_page(cpu_buffer))
+			rb_wakeups(buffer, cpu_buffer);
+
+		return 0;
+	}
+
+	guard(cpus_read_lock)();
+
+	/*
+	 * Make sure all the ring buffers are up to date before we start reading
+	 * them.
+	 */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+
+		guard(raw_spinlock)(&cpu_buffer->reader_lock);
+		rb_read_remote_meta_page(cpu_buffer);
+	}
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+
+		if (rb_num_of_entries(cpu_buffer))
+			rb_wakeups(buffer, cpu_buffer);
+	}
+
+	return 0;
+}

 #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
 /**
···
 	unsigned int commit;
 	unsigned int read;
 	u64 save_timestamp;
+	bool force_memcpy;

 	if (!cpumask_test_cpu(cpu, buffer->cpumask))
 		return -1;
···
 	/* Check if any events were dropped */
 	missed_events = cpu_buffer->lost_events;

+	force_memcpy = cpu_buffer->mapped || cpu_buffer->remote;
+
 	/*
 	 * If this page has been partially read or
 	 * if len is not big enough to read the rest of the page or
···
 	 */
 	if (read || (len < (commit - read)) ||
 	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
-	    cpu_buffer->mapped) {
+	    force_memcpy) {
 		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
 		unsigned int rpos = read;
 		unsigned int pos = 0;
···
 }

 static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
-				   unsigned long *subbuf_ids)
+				   struct buffer_page **subbuf_ids)
 {
 	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
 	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
···
 	int id = 0;

 	id = rb_page_id(cpu_buffer, cpu_buffer->reader_page, id);
-	subbuf_ids[id++] = (unsigned long)cpu_buffer->reader_page->page;
+	subbuf_ids[id++] = cpu_buffer->reader_page;
 	cnt++;

 	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
···
 		if (WARN_ON(id >= nr_subbufs))
 			break;

-		subbuf_ids[id] = (unsigned long)subbuf->page;
+		subbuf_ids[id] = subbuf;

 		rb_inc_page(&subbuf);
 		id++;
···

 	WARN_ON(cnt != nr_subbufs);

-	/* install subbuf ID to kern VA translation */
+	/* install subbuf ID to bpage translation */
 	cpu_buffer->subbuf_ids = subbuf_ids;

 	meta->meta_struct_len = sizeof(*meta);
···
 	}

 	while (p < nr_pages) {
+		struct buffer_page *subbuf;
 		struct page *page;
 		int off = 0;

 		if (WARN_ON_ONCE(s >= nr_subbufs))
 			return -EINVAL;

-		page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
+		subbuf = cpu_buffer->subbuf_ids[s];
+		page = virt_to_page((void *)subbuf->page);

 		for (; off < (1 << (subbuf_order)); off++, page++) {
 			if (p >= nr_pages)
···
 		    struct vm_area_struct *vma)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
-	unsigned long flags, *subbuf_ids;
+	struct buffer_page **subbuf_ids;
+	unsigned long flags;
 	int err;

-	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+	if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->remote)
 		return -EINVAL;

 	cpu_buffer = buffer->buffers[cpu];
···
 	if (err)
 		return err;

-	/* subbuf_ids include the reader while nr_pages does not */
+	/* subbuf_ids includes the reader while nr_pages does not */
 	subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), GFP_KERNEL);
 	if (!subbuf_ids) {
 		rb_free_meta_page(cpu_buffer);

+517

kernel/trace/simple_ring_buffer.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2025 - Google LLC 4 + * Author: Vincent Donnefort <vdonnefort@google.com> 5 + */ 6 + 7 + #include <linux/atomic.h> 8 + #include <linux/simple_ring_buffer.h> 9 + 10 + #include <asm/barrier.h> 11 + #include <asm/local.h> 12 + 13 + enum simple_rb_link_type { 14 + SIMPLE_RB_LINK_NORMAL = 0, 15 + SIMPLE_RB_LINK_HEAD = 1, 16 + SIMPLE_RB_LINK_HEAD_MOVING 17 + }; 18 + 19 + #define SIMPLE_RB_LINK_MASK ~(SIMPLE_RB_LINK_HEAD | SIMPLE_RB_LINK_HEAD_MOVING) 20 + 21 + static void simple_bpage_set_head_link(struct simple_buffer_page *bpage) 22 + { 23 + unsigned long link = (unsigned long)bpage->link.next; 24 + 25 + link &= SIMPLE_RB_LINK_MASK; 26 + link |= SIMPLE_RB_LINK_HEAD; 27 + 28 + /* 29 + * Paired with simple_rb_find_head() to order access between the head 30 + * link and overrun. It ensures we always report an up-to-date value 31 + * after swapping the reader page. 32 + */ 33 + smp_store_release(&bpage->link.next, (struct list_head *)link); 34 + } 35 + 36 + static bool simple_bpage_unset_head_link(struct simple_buffer_page *bpage, 37 + struct simple_buffer_page *dst, 38 + enum simple_rb_link_type new_type) 39 + { 40 + unsigned long *link = (unsigned long *)(&bpage->link.next); 41 + unsigned long old = (*link & SIMPLE_RB_LINK_MASK) | SIMPLE_RB_LINK_HEAD; 42 + unsigned long new = (unsigned long)(&dst->link) | new_type; 43 + 44 + return try_cmpxchg(link, &old, new); 45 + } 46 + 47 + static void simple_bpage_set_normal_link(struct simple_buffer_page *bpage) 48 + { 49 + unsigned long link = (unsigned long)bpage->link.next; 50 + 51 + WRITE_ONCE(bpage->link.next, (struct list_head *)(link & SIMPLE_RB_LINK_MASK)); 52 + } 53 + 54 + static struct simple_buffer_page *simple_bpage_from_link(struct list_head *link) 55 + { 56 + unsigned long ptr = (unsigned long)link & SIMPLE_RB_LINK_MASK; 57 + 58 + return container_of((struct list_head *)ptr, struct simple_buffer_page, link); 59 + } 60 + 61 + static struct 
simple_buffer_page *simple_bpage_next_page(struct simple_buffer_page *bpage) 62 + { 63 + return simple_bpage_from_link(bpage->link.next); 64 + } 65 + 66 + static void simple_bpage_reset(struct simple_buffer_page *bpage) 67 + { 68 + bpage->write = 0; 69 + bpage->entries = 0; 70 + 71 + local_set(&bpage->page->commit, 0); 72 + } 73 + 74 + static void simple_bpage_init(struct simple_buffer_page *bpage, void *page) 75 + { 76 + INIT_LIST_HEAD(&bpage->link); 77 + bpage->page = (struct buffer_data_page *)page; 78 + 79 + simple_bpage_reset(bpage); 80 + } 81 + 82 + #define simple_rb_meta_inc(__meta, __inc) \ 83 + WRITE_ONCE((__meta), (__meta + __inc)) 84 + 85 + static bool simple_rb_loaded(struct simple_rb_per_cpu *cpu_buffer) 86 + { 87 + return !!cpu_buffer->bpages; 88 + } 89 + 90 + static int simple_rb_find_head(struct simple_rb_per_cpu *cpu_buffer) 91 + { 92 + int retry = cpu_buffer->nr_pages * 2; 93 + struct simple_buffer_page *head; 94 + 95 + head = cpu_buffer->head_page; 96 + 97 + while (retry--) { 98 + unsigned long link; 99 + 100 + spin: 101 + /* See smp_store_release in simple_bpage_set_head_link() */ 102 + link = (unsigned long)smp_load_acquire(&head->link.prev->next); 103 + 104 + switch (link & ~SIMPLE_RB_LINK_MASK) { 105 + /* Found the head */ 106 + case SIMPLE_RB_LINK_HEAD: 107 + cpu_buffer->head_page = head; 108 + return 0; 109 + /* The writer caught the head, we can spin, that won't be long */ 110 + case SIMPLE_RB_LINK_HEAD_MOVING: 111 + goto spin; 112 + } 113 + 114 + head = simple_bpage_next_page(head); 115 + } 116 + 117 + return -EBUSY; 118 + } 119 + 120 + /** 121 + * simple_ring_buffer_swap_reader_page - Swap ring-buffer head with the reader 122 + * @cpu_buffer: A simple_rb_per_cpu 123 + * 124 + * This function enables consuming reading. It ensures the current head page will not be overwritten 125 + * and can be safely read. 126 + * 127 + * Returns 0 on success, -ENODEV if @cpu_buffer was unloaded or -EBUSY if we failed to catch the 128 + * head page. 
129 + */ 130 + int simple_ring_buffer_swap_reader_page(struct simple_rb_per_cpu *cpu_buffer) 131 + { 132 + struct simple_buffer_page *last, *head, *reader; 133 + unsigned long overrun; 134 + int retry = 8; 135 + int ret; 136 + 137 + if (!simple_rb_loaded(cpu_buffer)) 138 + return -ENODEV; 139 + 140 + reader = cpu_buffer->reader_page; 141 + 142 + do { 143 + /* Run after the writer to find the head */ 144 + ret = simple_rb_find_head(cpu_buffer); 145 + if (ret) 146 + return ret; 147 + 148 + head = cpu_buffer->head_page; 149 + 150 + /* Connect the reader page around the header page */ 151 + reader->link.next = head->link.next; 152 + reader->link.prev = head->link.prev; 153 + 154 + /* The last page before the head */ 155 + last = simple_bpage_from_link(head->link.prev); 156 + 157 + /* The reader page points to the new header page */ 158 + simple_bpage_set_head_link(reader); 159 + 160 + overrun = cpu_buffer->meta->overrun; 161 + } while (!simple_bpage_unset_head_link(last, reader, SIMPLE_RB_LINK_NORMAL) && retry--); 162 + 163 + if (!retry) 164 + return -EINVAL; 165 + 166 + cpu_buffer->head_page = simple_bpage_from_link(reader->link.next); 167 + cpu_buffer->head_page->link.prev = &reader->link; 168 + cpu_buffer->reader_page = head; 169 + cpu_buffer->meta->reader.lost_events = overrun - cpu_buffer->last_overrun; 170 + cpu_buffer->meta->reader.id = cpu_buffer->reader_page->id; 171 + cpu_buffer->last_overrun = overrun; 172 + 173 + return 0; 174 + } 175 + EXPORT_SYMBOL_GPL(simple_ring_buffer_swap_reader_page); 176 + 177 + static struct simple_buffer_page *simple_rb_move_tail(struct simple_rb_per_cpu *cpu_buffer) 178 + { 179 + struct simple_buffer_page *tail, *new_tail; 180 + 181 + tail = cpu_buffer->tail_page; 182 + new_tail = simple_bpage_next_page(tail); 183 + 184 + if (simple_bpage_unset_head_link(tail, new_tail, SIMPLE_RB_LINK_HEAD_MOVING)) { 185 + /* 186 + * Oh no! we've caught the head. There is none anymore and 187 + * swap_reader will spin until we set the new one. 
Overrun must 188 + * be written first, to make sure we report the correct number 189 + * of lost events. 190 + */ 191 + simple_rb_meta_inc(cpu_buffer->meta->overrun, new_tail->entries); 192 + simple_rb_meta_inc(cpu_buffer->meta->pages_lost, 1); 193 + 194 + simple_bpage_set_head_link(new_tail); 195 + simple_bpage_set_normal_link(tail); 196 + } 197 + 198 + simple_bpage_reset(new_tail); 199 + cpu_buffer->tail_page = new_tail; 200 + 201 + simple_rb_meta_inc(cpu_buffer->meta->pages_touched, 1); 202 + 203 + return new_tail; 204 + } 205 + 206 + static unsigned long rb_event_size(unsigned long length) 207 + { 208 + struct ring_buffer_event *event; 209 + 210 + return length + RB_EVNT_HDR_SIZE + sizeof(event->array[0]); 211 + } 212 + 213 + static struct ring_buffer_event * 214 + rb_event_add_ts_extend(struct ring_buffer_event *event, u64 delta) 215 + { 216 + event->type_len = RINGBUF_TYPE_TIME_EXTEND; 217 + event->time_delta = delta & TS_MASK; 218 + event->array[0] = delta >> TS_SHIFT; 219 + 220 + return (struct ring_buffer_event *)((unsigned long)event + 8); 221 + } 222 + 223 + static struct ring_buffer_event * 224 + simple_rb_reserve_next(struct simple_rb_per_cpu *cpu_buffer, unsigned long length, u64 timestamp) 225 + { 226 + unsigned long ts_ext_size = 0, event_size = rb_event_size(length); 227 + struct simple_buffer_page *tail = cpu_buffer->tail_page; 228 + struct ring_buffer_event *event; 229 + u32 write, prev_write; 230 + u64 time_delta; 231 + 232 + time_delta = timestamp - cpu_buffer->write_stamp; 233 + 234 + if (test_time_stamp(time_delta)) 235 + ts_ext_size = 8; 236 + 237 + prev_write = tail->write; 238 + write = prev_write + event_size + ts_ext_size; 239 + 240 + if (unlikely(write > (PAGE_SIZE - BUF_PAGE_HDR_SIZE))) 241 + tail = simple_rb_move_tail(cpu_buffer); 242 + 243 + if (!tail->entries) { 244 + tail->page->time_stamp = timestamp; 245 + time_delta = 0; 246 + ts_ext_size = 0; 247 + write = event_size; 248 + prev_write = 0; 249 + } 250 + 251 + tail->write = 
write; 252 + tail->entries++; 253 + 254 + cpu_buffer->write_stamp = timestamp; 255 + 256 + event = (struct ring_buffer_event *)(tail->page->data + prev_write); 257 + if (ts_ext_size) { 258 + event = rb_event_add_ts_extend(event, time_delta); 259 + time_delta = 0; 260 + } 261 + 262 + event->type_len = 0; 263 + event->time_delta = time_delta; 264 + event->array[0] = event_size - RB_EVNT_HDR_SIZE; 265 + 266 + return event; 267 + } 268 + 269 + /** 270 + * simple_ring_buffer_reserve - Reserve an entry in @cpu_buffer 271 + * @cpu_buffer: A simple_rb_per_cpu 272 + * @length: Size of the entry in bytes 273 + * @timestamp: Timestamp of the entry 274 + * 275 + * Returns the address of the entry where to write data or NULL 276 + */ 277 + void *simple_ring_buffer_reserve(struct simple_rb_per_cpu *cpu_buffer, unsigned long length, 278 + u64 timestamp) 279 + { 280 + struct ring_buffer_event *rb_event; 281 + 282 + if (cmpxchg(&cpu_buffer->status, SIMPLE_RB_READY, SIMPLE_RB_WRITING) != SIMPLE_RB_READY) 283 + return NULL; 284 + 285 + rb_event = simple_rb_reserve_next(cpu_buffer, length, timestamp); 286 + 287 + return &rb_event->array[1]; 288 + } 289 + EXPORT_SYMBOL_GPL(simple_ring_buffer_reserve); 290 + 291 + /** 292 + * simple_ring_buffer_commit - Commit the entry reserved with simple_ring_buffer_reserve() 293 + * @cpu_buffer: The simple_rb_per_cpu where the entry has been reserved 294 + */ 295 + void simple_ring_buffer_commit(struct simple_rb_per_cpu *cpu_buffer) 296 + { 297 + local_set(&cpu_buffer->tail_page->page->commit, 298 + cpu_buffer->tail_page->write); 299 + simple_rb_meta_inc(cpu_buffer->meta->entries, 1); 300 + 301 + /* 302 + * Paired with simple_rb_enable_tracing() to ensure data is 303 + * written to the ring-buffer before teardown. 
304 + */ 305 + smp_store_release(&cpu_buffer->status, SIMPLE_RB_READY); 306 + } 307 + EXPORT_SYMBOL_GPL(simple_ring_buffer_commit); 308 + 309 + static u32 simple_rb_enable_tracing(struct simple_rb_per_cpu *cpu_buffer, bool enable) 310 + { 311 + u32 prev_status; 312 + 313 + if (enable) 314 + return cmpxchg(&cpu_buffer->status, SIMPLE_RB_UNAVAILABLE, SIMPLE_RB_READY); 315 + 316 + /* Wait for the buffer to be released */ 317 + do { 318 + prev_status = cmpxchg_acquire(&cpu_buffer->status, 319 + SIMPLE_RB_READY, 320 + SIMPLE_RB_UNAVAILABLE); 321 + } while (prev_status == SIMPLE_RB_WRITING); 322 + 323 + return prev_status; 324 + } 325 + 326 + /** 327 + * simple_ring_buffer_reset - Reset @cpu_buffer 328 + * @cpu_buffer: A simple_rb_per_cpu 329 + * 330 + * This will not clear the content of the data, only reset counters and pointers 331 + * 332 + * Returns 0 on success or -ENODEV if @cpu_buffer was unloaded. 333 + */ 334 + int simple_ring_buffer_reset(struct simple_rb_per_cpu *cpu_buffer) 335 + { 336 + struct simple_buffer_page *bpage; 337 + u32 prev_status; 338 + int ret; 339 + 340 + if (!simple_rb_loaded(cpu_buffer)) 341 + return -ENODEV; 342 + 343 + prev_status = simple_rb_enable_tracing(cpu_buffer, false); 344 + 345 + ret = simple_rb_find_head(cpu_buffer); 346 + if (ret) 347 + return ret; 348 + 349 + bpage = cpu_buffer->tail_page = cpu_buffer->head_page; 350 + do { 351 + simple_bpage_reset(bpage); 352 + bpage = simple_bpage_next_page(bpage); 353 + } while (bpage != cpu_buffer->head_page); 354 + 355 + simple_bpage_reset(cpu_buffer->reader_page); 356 + 357 + cpu_buffer->last_overrun = 0; 358 + cpu_buffer->write_stamp = 0; 359 + 360 + cpu_buffer->meta->reader.read = 0; 361 + cpu_buffer->meta->reader.lost_events = 0; 362 + cpu_buffer->meta->entries = 0; 363 + cpu_buffer->meta->overrun = 0; 364 + cpu_buffer->meta->read = 0; 365 + cpu_buffer->meta->pages_lost = 0; 366 + cpu_buffer->meta->pages_touched = 0; 367 + 368 + if (prev_status == SIMPLE_RB_READY) 369 + 
simple_rb_enable_tracing(cpu_buffer, true); 370 + 371 + return 0; 372 + } 373 + EXPORT_SYMBOL_GPL(simple_ring_buffer_reset); 374 + 375 + int simple_ring_buffer_init_mm(struct simple_rb_per_cpu *cpu_buffer, 376 + struct simple_buffer_page *bpages, 377 + const struct ring_buffer_desc *desc, 378 + void *(*load_page)(unsigned long va), 379 + void (*unload_page)(void *va)) 380 + { 381 + struct simple_buffer_page *bpage = bpages; 382 + int ret = 0; 383 + void *page; 384 + int i; 385 + 386 + /* At least 1 reader page and two pages in the ring-buffer */ 387 + if (desc->nr_page_va < 3) 388 + return -EINVAL; 389 + 390 + memset(cpu_buffer, 0, sizeof(*cpu_buffer)); 391 + 392 + cpu_buffer->meta = load_page(desc->meta_va); 393 + if (!cpu_buffer->meta) 394 + return -EINVAL; 395 + 396 + memset(cpu_buffer->meta, 0, sizeof(*cpu_buffer->meta)); 397 + cpu_buffer->meta->meta_page_size = PAGE_SIZE; 398 + cpu_buffer->meta->nr_subbufs = cpu_buffer->nr_pages; 399 + 400 + /* The reader page is not part of the ring initially */ 401 + page = load_page(desc->page_va[0]); 402 + if (!page) { 403 + unload_page(cpu_buffer->meta); 404 + return -EINVAL; 405 + } 406 + 407 + simple_bpage_init(bpage, page); 408 + bpage->id = 0; 409 + 410 + cpu_buffer->nr_pages = 1; 411 + 412 + cpu_buffer->reader_page = bpage; 413 + cpu_buffer->tail_page = bpage + 1; 414 + cpu_buffer->head_page = bpage + 1; 415 + 416 + for (i = 1; i < desc->nr_page_va; i++) { 417 + page = load_page(desc->page_va[i]); 418 + if (!page) { 419 + ret = -EINVAL; 420 + break; 421 + } 422 + 423 + simple_bpage_init(++bpage, page); 424 + 425 + bpage->link.next = &(bpage + 1)->link; 426 + bpage->link.prev = &(bpage - 1)->link; 427 + bpage->id = i; 428 + 429 + cpu_buffer->nr_pages = i + 1; 430 + } 431 + 432 + if (ret) { 433 + for (i--; i >= 0; i--) 434 + unload_page((void *)desc->page_va[i]); 435 + unload_page(cpu_buffer->meta); 436 + 437 + return ret; 438 + } 439 + 440 + /* Close the ring */ 441 + bpage->link.next = &cpu_buffer->tail_page->link; 
442 + cpu_buffer->tail_page->link.prev = &bpage->link; 443 + 444 + /* The last init'ed page points to the head page */ 445 + simple_bpage_set_head_link(bpage); 446 + 447 + cpu_buffer->bpages = bpages; 448 + 449 + return 0; 450 + } 451 + 452 + static void *__load_page(unsigned long page) 453 + { 454 + return (void *)page; 455 + } 456 + 457 + static void __unload_page(void *page) { } 458 + 459 + /** 460 + * simple_ring_buffer_init - Init @cpu_buffer based on @desc 461 + * @cpu_buffer: A simple_rb_per_cpu buffer to init, allocated by the caller. 462 + * @bpages: Array of simple_buffer_pages, with as many elements as @desc->nr_page_va 463 + * @desc: A ring_buffer_desc 464 + * 465 + * Returns 0 on success or -EINVAL if the content of @desc is invalid 466 + */ 467 + int simple_ring_buffer_init(struct simple_rb_per_cpu *cpu_buffer, struct simple_buffer_page *bpages, 468 + const struct ring_buffer_desc *desc) 469 + { 470 + return simple_ring_buffer_init_mm(cpu_buffer, bpages, desc, __load_page, __unload_page); 471 + } 472 + EXPORT_SYMBOL_GPL(simple_ring_buffer_init); 473 + 474 + void simple_ring_buffer_unload_mm(struct simple_rb_per_cpu *cpu_buffer, 475 + void (*unload_page)(void *)) 476 + { 477 + int p; 478 + 479 + if (!simple_rb_loaded(cpu_buffer)) 480 + return; 481 + 482 + simple_rb_enable_tracing(cpu_buffer, false); 483 + 484 + unload_page(cpu_buffer->meta); 485 + for (p = 0; p < cpu_buffer->nr_pages; p++) 486 + unload_page(cpu_buffer->bpages[p].page); 487 + 488 + cpu_buffer->bpages = NULL; 489 + } 490 + 491 + /** 492 + * simple_ring_buffer_unload - Prepare @cpu_buffer for deletion 493 + * @cpu_buffer: A simple_rb_per_cpu that will be deleted. 
494 + */ 495 + void simple_ring_buffer_unload(struct simple_rb_per_cpu *cpu_buffer) 496 + { 497 + return simple_ring_buffer_unload_mm(cpu_buffer, __unload_page); 498 + } 499 + EXPORT_SYMBOL_GPL(simple_ring_buffer_unload); 500 + 501 + /** 502 + * simple_ring_buffer_enable_tracing - Enable or disable writing to @cpu_buffer 503 + * @cpu_buffer: A simple_rb_per_cpu 504 + * @enable: True to enable tracing, False to disable it 505 + * 506 + * Returns 0 on success or -ENODEV if @cpu_buffer was unloaded 507 + */ 508 + int simple_ring_buffer_enable_tracing(struct simple_rb_per_cpu *cpu_buffer, bool enable) 509 + { 510 + if (!simple_rb_loaded(cpu_buffer)) 511 + return -ENODEV; 512 + 513 + simple_rb_enable_tracing(cpu_buffer, enable); 514 + 515 + return 0; 516 + } 517 + EXPORT_SYMBOL_GPL(simple_ring_buffer_enable_tracing);
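
The head-page tracking in simple_ring_buffer.c stores a small flag in the low bits of each `link.next` pointer (normal, head, or head-moving), and the reader finds the head by looking for the page whose predecessor carries the head flag. The following is a minimal user-space sketch of that pointer-tagging idea, not the kernel code itself: `struct page_node`, `link_tag()`, `link_ptr()`, `link_flags()` and `find_head()` are hypothetical names, the ring is singly linked, and no atomics or memory barriers are modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag values mirror enum simple_rb_link_type from the patch. */
enum link_type {
	LINK_NORMAL      = 0,
	LINK_HEAD        = 1,
	LINK_HEAD_MOVING = 2,
};

#define LINK_FLAGS ((uintptr_t)(LINK_HEAD | LINK_HEAD_MOVING))

/* Hypothetical stand-in for struct simple_buffer_page (illustration only). */
struct page_node {
	struct page_node *next;	/* the two low bits carry the link type */
	int id;
};

/* Strip the flag bits and return the real pointer. */
static struct page_node *link_ptr(struct page_node *tagged)
{
	return (struct page_node *)((uintptr_t)tagged & ~LINK_FLAGS);
}

/* Extract only the flag bits. */
static enum link_type link_flags(struct page_node *tagged)
{
	return (enum link_type)((uintptr_t)tagged & LINK_FLAGS);
}

/* Combine a pointer (pointer-aligned, so the low bits are free) with a flag. */
static struct page_node *link_tag(struct page_node *ptr, enum link_type type)
{
	return (struct page_node *)(((uintptr_t)ptr & ~LINK_FLAGS) | (uintptr_t)type);
}

/*
 * Walk the ring from @start and return the page that is pointed to by a
 * link tagged LINK_HEAD, or NULL if no such link is found in @nr_pages steps.
 */
static struct page_node *find_head(struct page_node *start, int nr_pages)
{
	struct page_node *page = start;

	for (int i = 0; i < nr_pages; i++) {
		if (link_flags(page->next) == LINK_HEAD)
			return link_ptr(page->next);
		page = link_ptr(page->next);
	}

	return NULL;
}
```

Because the flag lives in the same word as the pointer, the patch can move the head with a single `try_cmpxchg()` on that word; this sketch leaves that part out.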

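The reserve/commit/enable paths above gate the writer with a three-state status word: a write is only allowed when the status can be moved from "ready" to "writing", and disabling tracing waits for any in-flight write to commit. Below is a rough user-space sketch of that state machine using C11 atomics. The names (`struct writer_gate`, `gate_*`) are hypothetical, the numeric values of the states are an assumption (the header defining `SIMPLE_RB_*` is not shown here), and C11 orderings only approximate the kernel primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed state values; the real SIMPLE_RB_* definitions are not shown. */
enum rb_status { RB_UNAVAILABLE, RB_READY, RB_WRITING };

/* Hypothetical stand-in for the status field of simple_rb_per_cpu. */
struct writer_gate {
	_Atomic int status;
};

/* Writer side: succeeds only if tracing is enabled and no write is in flight. */
static bool gate_reserve(struct writer_gate *g)
{
	int expected = RB_READY;

	return atomic_compare_exchange_strong(&g->status, &expected, RB_WRITING);
}

/* Writer side: publish the data, then hand the buffer back. */
static void gate_commit(struct writer_gate *g)
{
	atomic_store_explicit(&g->status, RB_READY, memory_order_release);
}

/* Control side: UNAVAILABLE -> READY, returns the previous status. */
static int gate_enable(struct writer_gate *g)
{
	int prev = RB_UNAVAILABLE;

	atomic_compare_exchange_strong(&g->status, &prev, RB_READY);
	return prev;
}

/*
 * Control side: READY -> UNAVAILABLE. Spins while a writer holds the gate
 * and returns the status observed before the transition.
 */
static int gate_disable(struct writer_gate *g)
{
	int prev;

	do {
		prev = RB_READY;
		atomic_compare_exchange_strong_explicit(&g->status, &prev,
							RB_UNAVAILABLE,
							memory_order_acquire,
							memory_order_acquire);
	} while (prev == RB_WRITING);

	return prev;
}
```

Returning the previous status from `gate_disable()` is what lets a reset-style caller re-enable tracing only if it was enabled beforehand, as `simple_ring_buffer_reset()` does in the patch.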
+2 -2

kernel/trace/trace.c

···
3856 3856  * Should be used after trace_array_get(), trace_types_lock
3857 3857  * ensures that i_cdev was already initialized.
3858 3858  */
3859 - static inline int tracing_get_cpu(struct inode *inode)
3859 + int tracing_get_cpu(struct inode *inode)
3860 3860 {
3861 3861 	if (inode->i_cdev) /* See trace_create_cpu_file() */
3862 3862 		return (long)inode->i_cdev - 1;
···
8606 8606 	return tr->percpu_dir;
8607 8607 }
8608 8608
8609 - static struct dentry *
8609 + struct dentry *
8610 8610 trace_create_cpu_file(const char *name, umode_t mode, struct dentry *parent,
8611 8611 		      void *data, long cpu, const struct file_operations *fops)
8612 8612 {
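
`tracing_get_cpu()` recovers a CPU number from `inode->i_cdev` by subtracting one, which implies that the per-CPU file stores `cpu + 1` there so that CPU 0 stays distinguishable from an unset (NULL) field. A small stand-alone sketch of that encoding follows. `cpu_to_cookie()` and `cookie_to_cpu()` are hypothetical names, and the `-1` fallback for "all CPUs" is an assumption mirroring `RING_BUFFER_ALL_CPUS`, since the rest of the function is not visible in this hunk.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: stands in for RING_BUFFER_ALL_CPUS. */
#define ALL_CPUS (-1)

/* Store cpu + 1 so that CPU 0 does not collapse into a NULL pointer. */
static void *cpu_to_cookie(int cpu)
{
	return (void *)(intptr_t)(cpu + 1);
}

/* A NULL cookie means the file is not bound to a single CPU. */
static int cookie_to_cpu(void *cookie)
{
	if (cookie)
		return (int)((intptr_t)cookie - 1);

	return ALL_CPUS;
}
```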

+7

kernel/trace/trace.h

···
689 689 					 struct dentry *parent,
690 690 					 void *data,
691 691 					 const struct file_operations *fops);
692 + struct dentry *trace_create_cpu_file(const char *name,
693 + 				     umode_t mode,
694 + 				     struct dentry *parent,
695 + 				     void *data,
696 + 				     long cpu,
697 + 				     const struct file_operations *fops);
698 + int tracing_get_cpu(struct inode *inode);
692 699
693 700
694 701 /**

+1384

kernel/trace/trace_remote.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright (C) 2025 - Google LLC 4 + * Author: Vincent Donnefort <vdonnefort@google.com> 5 + */ 6 + 7 + #include <linux/kstrtox.h> 8 + #include <linux/lockdep.h> 9 + #include <linux/mutex.h> 10 + #include <linux/tracefs.h> 11 + #include <linux/trace_remote.h> 12 + #include <linux/trace_seq.h> 13 + #include <linux/types.h> 14 + 15 + #include "trace.h" 16 + 17 + #define TRACEFS_DIR "remotes" 18 + #define TRACEFS_MODE_WRITE 0640 19 + #define TRACEFS_MODE_READ 0440 20 + 21 + enum tri_type { 22 + TRI_CONSUMING, 23 + TRI_NONCONSUMING, 24 + }; 25 + 26 + struct trace_remote_iterator { 27 + struct trace_remote *remote; 28 + struct trace_seq seq; 29 + struct delayed_work poll_work; 30 + unsigned long lost_events; 31 + u64 ts; 32 + struct ring_buffer_iter *rb_iter; 33 + struct ring_buffer_iter **rb_iters; 34 + struct remote_event_hdr *evt; 35 + int cpu; 36 + int evt_cpu; 37 + loff_t pos; 38 + enum tri_type type; 39 + }; 40 + 41 + struct trace_remote { 42 + struct trace_remote_callbacks *cbs; 43 + void *priv; 44 + struct trace_buffer *trace_buffer; 45 + struct trace_buffer_desc *trace_buffer_desc; 46 + struct dentry *dentry; 47 + struct eventfs_inode *eventfs; 48 + struct remote_event *events; 49 + unsigned long nr_events; 50 + unsigned long trace_buffer_size; 51 + struct ring_buffer_remote rb_remote; 52 + struct mutex lock; 53 + struct rw_semaphore reader_lock; 54 + struct rw_semaphore *pcpu_reader_locks; 55 + unsigned int nr_readers; 56 + unsigned int poll_ms; 57 + bool tracing_on; 58 + }; 59 + 60 + static bool trace_remote_loaded(struct trace_remote *remote) 61 + { 62 + return !!remote->trace_buffer; 63 + } 64 + 65 + static int trace_remote_load(struct trace_remote *remote) 66 + { 67 + struct ring_buffer_remote *rb_remote = &remote->rb_remote; 68 + struct trace_buffer_desc *desc; 69 + 70 + lockdep_assert_held(&remote->lock); 71 + 72 + if (trace_remote_loaded(remote)) 73 + return 0; 74 + 75 + desc = 
remote->cbs->load_trace_buffer(remote->trace_buffer_size, remote->priv); 76 + if (IS_ERR(desc)) 77 + return PTR_ERR(desc); 78 + 79 + rb_remote->desc = desc; 80 + rb_remote->swap_reader_page = remote->cbs->swap_reader_page; 81 + rb_remote->priv = remote->priv; 82 + rb_remote->reset = remote->cbs->reset; 83 + remote->trace_buffer = ring_buffer_alloc_remote(rb_remote); 84 + if (!remote->trace_buffer) { 85 + remote->cbs->unload_trace_buffer(desc, remote->priv); 86 + return -ENOMEM; 87 + } 88 + 89 + remote->trace_buffer_desc = desc; 90 + 91 + return 0; 92 + } 93 + 94 + static void trace_remote_try_unload(struct trace_remote *remote) 95 + { 96 + lockdep_assert_held(&remote->lock); 97 + 98 + if (!trace_remote_loaded(remote)) 99 + return; 100 + 101 + /* The buffer is being read or writable */ 102 + if (remote->nr_readers || remote->tracing_on) 103 + return; 104 + 105 + /* The buffer has readable data */ 106 + if (!ring_buffer_empty(remote->trace_buffer)) 107 + return; 108 + 109 + ring_buffer_free(remote->trace_buffer); 110 + remote->trace_buffer = NULL; 111 + remote->cbs->unload_trace_buffer(remote->trace_buffer_desc, remote->priv); 112 + } 113 + 114 + static int trace_remote_enable_tracing(struct trace_remote *remote) 115 + { 116 + int ret; 117 + 118 + lockdep_assert_held(&remote->lock); 119 + 120 + if (remote->tracing_on) 121 + return 0; 122 + 123 + ret = trace_remote_load(remote); 124 + if (ret) 125 + return ret; 126 + 127 + ret = remote->cbs->enable_tracing(true, remote->priv); 128 + if (ret) { 129 + trace_remote_try_unload(remote); 130 + return ret; 131 + } 132 + 133 + remote->tracing_on = true; 134 + 135 + return 0; 136 + } 137 + 138 + static int trace_remote_disable_tracing(struct trace_remote *remote) 139 + { 140 + int ret; 141 + 142 + lockdep_assert_held(&remote->lock); 143 + 144 + if (!remote->tracing_on) 145 + return 0; 146 + 147 + ret = remote->cbs->enable_tracing(false, remote->priv); 148 + if (ret) 149 + return ret; 150 + 151 + 
ring_buffer_poll_remote(remote->trace_buffer, RING_BUFFER_ALL_CPUS); 152 + remote->tracing_on = false; 153 + trace_remote_try_unload(remote); 154 + 155 + return 0; 156 + } 157 + 158 + static void trace_remote_reset(struct trace_remote *remote, int cpu) 159 + { 160 + lockdep_assert_held(&remote->lock); 161 + 162 + if (!trace_remote_loaded(remote)) 163 + return; 164 + 165 + if (cpu == RING_BUFFER_ALL_CPUS) 166 + ring_buffer_reset(remote->trace_buffer); 167 + else 168 + ring_buffer_reset_cpu(remote->trace_buffer, cpu); 169 + 170 + trace_remote_try_unload(remote); 171 + } 172 + 173 + static ssize_t 174 + tracing_on_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) 175 + { 176 + struct seq_file *seq = filp->private_data; 177 + struct trace_remote *remote = seq->private; 178 + unsigned long val; 179 + int ret; 180 + 181 + ret = kstrtoul_from_user(ubuf, cnt, 10, &val); 182 + if (ret) 183 + return ret; 184 + 185 + guard(mutex)(&remote->lock); 186 + 187 + ret = val ? trace_remote_enable_tracing(remote) : trace_remote_disable_tracing(remote); 188 + if (ret) 189 + return ret; 190 + 191 + return cnt; 192 + } 193 + static int tracing_on_show(struct seq_file *s, void *unused) 194 + { 195 + struct trace_remote *remote = s->private; 196 + 197 + seq_printf(s, "%d\n", remote->tracing_on); 198 + 199 + return 0; 200 + } 201 + DEFINE_SHOW_STORE_ATTRIBUTE(tracing_on); 202 + 203 + static ssize_t buffer_size_kb_write(struct file *filp, const char __user *ubuf, size_t cnt, 204 + loff_t *ppos) 205 + { 206 + struct seq_file *seq = filp->private_data; 207 + struct trace_remote *remote = seq->private; 208 + unsigned long val; 209 + int ret; 210 + 211 + ret = kstrtoul_from_user(ubuf, cnt, 10, &val); 212 + if (ret) 213 + return ret; 214 + 215 + /* KiB to Bytes */ 216 + if (!val || check_shl_overflow(val, 10, &val)) 217 + return -EINVAL; 218 + 219 + guard(mutex)(&remote->lock); 220 + 221 + if (trace_remote_loaded(remote)) 222 + return -EBUSY; 223 + 224 + 
remote->trace_buffer_size = val; 225 + 226 + return cnt; 227 + } 228 + 229 + static int buffer_size_kb_show(struct seq_file *s, void *unused) 230 + { 231 + struct trace_remote *remote = s->private; 232 + 233 + seq_printf(s, "%lu (%s)\n", remote->trace_buffer_size >> 10, 234 + trace_remote_loaded(remote) ? "loaded" : "unloaded"); 235 + 236 + return 0; 237 + } 238 + DEFINE_SHOW_STORE_ATTRIBUTE(buffer_size_kb); 239 + 240 + static int trace_remote_get(struct trace_remote *remote, int cpu) 241 + { 242 + int ret; 243 + 244 + if (remote->nr_readers == UINT_MAX) 245 + return -EBUSY; 246 + 247 + ret = trace_remote_load(remote); 248 + if (ret) 249 + return ret; 250 + 251 + if (cpu != RING_BUFFER_ALL_CPUS && !remote->pcpu_reader_locks) { 252 + int lock_cpu; 253 + 254 + remote->pcpu_reader_locks = kcalloc(nr_cpu_ids, sizeof(*remote->pcpu_reader_locks), 255 + GFP_KERNEL); 256 + if (!remote->pcpu_reader_locks) { 257 + trace_remote_try_unload(remote); 258 + return -ENOMEM; 259 + } 260 + 261 + for_each_possible_cpu(lock_cpu) 262 + init_rwsem(&remote->pcpu_reader_locks[lock_cpu]); 263 + } 264 + 265 + remote->nr_readers++; 266 + 267 + return 0; 268 + } 269 + 270 + static void trace_remote_put(struct trace_remote *remote) 271 + { 272 + if (WARN_ON(!remote->nr_readers)) 273 + return; 274 + 275 + remote->nr_readers--; 276 + if (remote->nr_readers) 277 + return; 278 + 279 + kfree(remote->pcpu_reader_locks); 280 + remote->pcpu_reader_locks = NULL; 281 + 282 + trace_remote_try_unload(remote); 283 + } 284 + 285 + static bool trace_remote_has_cpu(struct trace_remote *remote, int cpu) 286 + { 287 + if (cpu == RING_BUFFER_ALL_CPUS) 288 + return true; 289 + 290 + return ring_buffer_poll_remote(remote->trace_buffer, cpu) == 0; 291 + } 292 + 293 + static void __poll_remote(struct work_struct *work) 294 + { 295 + struct delayed_work *dwork = to_delayed_work(work); 296 + struct trace_remote_iterator *iter; 297 + 298 + iter = container_of(dwork, struct trace_remote_iterator, poll_work); 299 + 
ring_buffer_poll_remote(iter->remote->trace_buffer, iter->cpu); 300 + schedule_delayed_work((struct delayed_work *)work, 301 + msecs_to_jiffies(iter->remote->poll_ms)); 302 + } 303 + 304 + static void __free_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu) 305 + { 306 + if (cpu != RING_BUFFER_ALL_CPUS) { 307 + ring_buffer_read_finish(iter->rb_iter); 308 + return; 309 + } 310 + 311 + for_each_possible_cpu(cpu) { 312 + if (iter->rb_iters[cpu]) 313 + ring_buffer_read_finish(iter->rb_iters[cpu]); 314 + } 315 + 316 + kfree(iter->rb_iters); 317 + } 318 + 319 + static int __alloc_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu) 320 + { 321 + if (cpu != RING_BUFFER_ALL_CPUS) { 322 + iter->rb_iter = ring_buffer_read_start(iter->remote->trace_buffer, cpu, GFP_KERNEL); 323 + 324 + return iter->rb_iter ? 0 : -ENOMEM; 325 + } 326 + 327 + iter->rb_iters = kcalloc(nr_cpu_ids, sizeof(*iter->rb_iters), GFP_KERNEL); 328 + if (!iter->rb_iters) 329 + return -ENOMEM; 330 + 331 + for_each_possible_cpu(cpu) { 332 + iter->rb_iters[cpu] = ring_buffer_read_start(iter->remote->trace_buffer, cpu, 333 + GFP_KERNEL); 334 + if (!iter->rb_iters[cpu]) { 335 + /* This CPU isn't part of trace_buffer. 
Skip it */ 336 + if (!trace_remote_has_cpu(iter->remote, cpu)) 337 + continue; 338 + 339 + __free_ring_buffer_iter(iter, RING_BUFFER_ALL_CPUS); 340 + return -ENOMEM; 341 + } 342 + } 343 + 344 + return 0; 345 + } 346 + 347 + static struct trace_remote_iterator 348 + *trace_remote_iter(struct trace_remote *remote, int cpu, enum tri_type type) 349 + { 350 + struct trace_remote_iterator *iter = NULL; 351 + int ret; 352 + 353 + lockdep_assert_held(&remote->lock); 354 + 355 + if (type == TRI_NONCONSUMING && !trace_remote_loaded(remote)) 356 + return NULL; 357 + 358 + ret = trace_remote_get(remote, cpu); 359 + if (ret) 360 + return ERR_PTR(ret); 361 + 362 + if (!trace_remote_has_cpu(remote, cpu)) { 363 + ret = -ENODEV; 364 + goto err; 365 + } 366 + 367 + iter = kzalloc_obj(*iter); 368 + if (iter) { 369 + iter->remote = remote; 370 + iter->cpu = cpu; 371 + iter->type = type; 372 + trace_seq_init(&iter->seq); 373 + 374 + switch (type) { 375 + case TRI_CONSUMING: 376 + ring_buffer_poll_remote(remote->trace_buffer, cpu); 377 + INIT_DELAYED_WORK(&iter->poll_work, __poll_remote); 378 + schedule_delayed_work(&iter->poll_work, msecs_to_jiffies(remote->poll_ms)); 379 + break; 380 + case TRI_NONCONSUMING: 381 + ret = __alloc_ring_buffer_iter(iter, cpu); 382 + break; 383 + } 384 + 385 + if (ret) 386 + goto err; 387 + 388 + return iter; 389 + } 390 + ret = -ENOMEM; 391 + 392 + err: 393 + kfree(iter); 394 + trace_remote_put(remote); 395 + 396 + return ERR_PTR(ret); 397 + } 398 + 399 + static void trace_remote_iter_free(struct trace_remote_iterator *iter) 400 + { 401 + struct trace_remote *remote; 402 + 403 + if (!iter) 404 + return; 405 + 406 + remote = iter->remote; 407 + 408 + lockdep_assert_held(&remote->lock); 409 + 410 + switch (iter->type) { 411 + case TRI_CONSUMING: 412 + cancel_delayed_work_sync(&iter->poll_work); 413 + break; 414 + case TRI_NONCONSUMING: 415 + __free_ring_buffer_iter(iter, iter->cpu); 416 + break; 417 + } 418 + 419 + kfree(iter); 420 + 
+	trace_remote_put(remote);
+}
+
+static void trace_remote_iter_read_start(struct trace_remote_iterator *iter)
+{
+	struct trace_remote *remote = iter->remote;
+	int cpu = iter->cpu;
+
+	/* Acquire global reader lock */
+	if (cpu == RING_BUFFER_ALL_CPUS && iter->type == TRI_CONSUMING)
+		down_write(&remote->reader_lock);
+	else
+		down_read(&remote->reader_lock);
+
+	if (cpu == RING_BUFFER_ALL_CPUS)
+		return;
+
+	/*
+	 * No need for the remote lock here, iter holds a reference on
+	 * remote->nr_readers
+	 */
+
+	/* Get the per-CPU one */
+	if (WARN_ON_ONCE(!remote->pcpu_reader_locks))
+		return;
+
+	if (iter->type == TRI_CONSUMING)
+		down_write(&remote->pcpu_reader_locks[cpu]);
+	else
+		down_read(&remote->pcpu_reader_locks[cpu]);
+}
+
+static void trace_remote_iter_read_finished(struct trace_remote_iterator *iter)
+{
+	struct trace_remote *remote = iter->remote;
+	int cpu = iter->cpu;
+
+	/* Release per-CPU reader lock */
+	if (cpu != RING_BUFFER_ALL_CPUS) {
+		/*
+		 * No need for the remote lock here, iter holds a reference on
+		 * remote->nr_readers
+		 */
+		if (iter->type == TRI_CONSUMING)
+			up_write(&remote->pcpu_reader_locks[cpu]);
+		else
+			up_read(&remote->pcpu_reader_locks[cpu]);
+	}
+
+	/* Release global reader lock */
+	if (cpu == RING_BUFFER_ALL_CPUS && iter->type == TRI_CONSUMING)
+		up_write(&remote->reader_lock);
+	else
+		up_read(&remote->reader_lock);
+}
+
+static struct ring_buffer_iter *__get_rb_iter(struct trace_remote_iterator *iter, int cpu)
+{
+	return iter->cpu != RING_BUFFER_ALL_CPUS ? iter->rb_iter : iter->rb_iters[cpu];
+}
+
+static struct ring_buffer_event *
+__peek_event(struct trace_remote_iterator *iter, int cpu, u64 *ts, unsigned long *lost_events)
+{
+	struct ring_buffer_event *rb_evt;
+	struct ring_buffer_iter *rb_iter;
+
+	switch (iter->type) {
+	case TRI_CONSUMING:
+		return ring_buffer_peek(iter->remote->trace_buffer, cpu, ts, lost_events);
+	case TRI_NONCONSUMING:
+		rb_iter = __get_rb_iter(iter, cpu);
+		if (!rb_iter)
+			return NULL;
+
+		rb_evt = ring_buffer_iter_peek(rb_iter, ts);
+		if (!rb_evt)
+			return NULL;
+
+		*lost_events = ring_buffer_iter_dropped(rb_iter);
+
+		return rb_evt;
+	}
+
+	return NULL;
+}
+
+static bool trace_remote_iter_read_event(struct trace_remote_iterator *iter)
+{
+	struct trace_buffer *trace_buffer = iter->remote->trace_buffer;
+	struct ring_buffer_event *rb_evt;
+	int cpu = iter->cpu;
+
+	if (cpu != RING_BUFFER_ALL_CPUS) {
+		if (ring_buffer_empty_cpu(trace_buffer, cpu))
+			return false;
+
+		rb_evt = __peek_event(iter, cpu, &iter->ts, &iter->lost_events);
+		if (!rb_evt)
+			return false;
+
+		iter->evt_cpu = cpu;
+		iter->evt = ring_buffer_event_data(rb_evt);
+		return true;
+	}
+
+	iter->ts = U64_MAX;
+	for_each_possible_cpu(cpu) {
+		unsigned long lost_events;
+		u64 ts;
+
+		if (ring_buffer_empty_cpu(trace_buffer, cpu))
+			continue;
+
+		rb_evt = __peek_event(iter, cpu, &ts, &lost_events);
+		if (!rb_evt)
+			continue;
+
+		if (ts >= iter->ts)
+			continue;
+
+		iter->ts = ts;
+		iter->evt_cpu = cpu;
+		iter->evt = ring_buffer_event_data(rb_evt);
+		iter->lost_events = lost_events;
+	}
+
+	return iter->ts != U64_MAX;
+}
+
+static void trace_remote_iter_move(struct trace_remote_iterator *iter)
+{
+	struct trace_buffer *trace_buffer = iter->remote->trace_buffer;
+
+	switch (iter->type) {
+	case TRI_CONSUMING:
+		ring_buffer_consume(trace_buffer, iter->evt_cpu, NULL, NULL);
+		break;
+	case TRI_NONCONSUMING:
+		ring_buffer_iter_advance(__get_rb_iter(iter, iter->evt_cpu));
+		break;
+	}
+}
+
+static struct remote_event *trace_remote_find_event(struct trace_remote *remote, unsigned short id);
+
+static int trace_remote_iter_print_event(struct trace_remote_iterator *iter)
+{
+	struct remote_event *evt;
+	unsigned long usecs_rem;
+	u64 ts = iter->ts;
+
+	if (iter->lost_events)
+		trace_seq_printf(&iter->seq, "CPU:%d [LOST %lu EVENTS]\n",
+				 iter->evt_cpu, iter->lost_events);
+
+	do_div(ts, 1000);
+	usecs_rem = do_div(ts, USEC_PER_SEC);
+
+	trace_seq_printf(&iter->seq, "[%03d]\t%5llu.%06lu: ", iter->evt_cpu,
+			 ts, usecs_rem);
+
+	evt = trace_remote_find_event(iter->remote, iter->evt->id);
+	if (!evt)
+		trace_seq_printf(&iter->seq, "UNKNOWN id=%d\n", iter->evt->id);
+	else
+		evt->print(iter->evt, &iter->seq);
+
+	return trace_seq_has_overflowed(&iter->seq) ? -EOVERFLOW : 0;
+}
+
+static int trace_pipe_open(struct inode *inode, struct file *filp)
+{
+	struct trace_remote *remote = inode->i_private;
+	struct trace_remote_iterator *iter;
+	int cpu = tracing_get_cpu(inode);
+
+	guard(mutex)(&remote->lock);
+
+	iter = trace_remote_iter(remote, cpu, TRI_CONSUMING);
+	if (IS_ERR(iter))
+		return PTR_ERR(iter);
+
+	filp->private_data = iter;
+
+	return IS_ERR(iter) ? PTR_ERR(iter) : 0;
+}
+
+static int trace_pipe_release(struct inode *inode, struct file *filp)
+{
+	struct trace_remote_iterator *iter = filp->private_data;
+	struct trace_remote *remote = iter->remote;
+
+	guard(mutex)(&remote->lock);
+
+	trace_remote_iter_free(iter);
+
+	return 0;
+}
+
+static ssize_t trace_pipe_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+	struct trace_remote_iterator *iter = filp->private_data;
+	struct trace_buffer *trace_buffer = iter->remote->trace_buffer;
+	int ret;
+
+copy_to_user:
+	ret = trace_seq_to_user(&iter->seq, ubuf, cnt);
+	if (ret != -EBUSY)
+		return ret;
+
+	trace_seq_init(&iter->seq);
+
+	ret = ring_buffer_wait(trace_buffer, iter->cpu, 0, NULL, NULL);
+	if (ret < 0)
+		return ret;
+
+	trace_remote_iter_read_start(iter);
+
+	while (trace_remote_iter_read_event(iter)) {
+		int prev_len = iter->seq.seq.len;
+
+		if (trace_remote_iter_print_event(iter)) {
+			iter->seq.seq.len = prev_len;
+			break;
+		}
+
+		trace_remote_iter_move(iter);
+	}
+
+	trace_remote_iter_read_finished(iter);
+
+	goto copy_to_user;
+}
+
+static const struct file_operations trace_pipe_fops = {
+	.open = trace_pipe_open,
+	.read = trace_pipe_read,
+	.release = trace_pipe_release,
+};
+
+static void *trace_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct trace_remote_iterator *iter = m->private;
+
+	++*pos;
+
+	if (!iter || !trace_remote_iter_read_event(iter))
+		return NULL;
+
+	trace_remote_iter_move(iter);
+	iter->pos++;
+
+	return iter;
+}
+
+static void *trace_start(struct seq_file *m, loff_t *pos)
+{
+	struct trace_remote_iterator *iter = m->private;
+	loff_t i;
+
+	if (!iter)
+		return NULL;
+
+	trace_remote_iter_read_start(iter);
+
+	if (!*pos) {
+		iter->pos = -1;
+		return trace_next(m, NULL, &i);
+	}
+
+	i = iter->pos;
+	while (i < *pos) {
+		iter = trace_next(m, NULL, &i);
+		if (!iter)
+			return NULL;
+	}
+
+	return iter;
+}
+
+static int trace_show(struct seq_file *m, void *v)
+{
+	struct trace_remote_iterator *iter = v;
+
+	trace_seq_init(&iter->seq);
+
+	if (trace_remote_iter_print_event(iter)) {
+		seq_printf(m, "[EVENT %d PRINT TOO BIG]\n", iter->evt->id);
+		return 0;
+	}
+
+	return trace_print_seq(m, &iter->seq);
+}
+
+static void trace_stop(struct seq_file *m, void *v)
+{
+	struct trace_remote_iterator *iter = m->private;
+
+	if (iter)
+		trace_remote_iter_read_finished(iter);
+}
+
+static const struct seq_operations trace_sops = {
+	.start = trace_start,
+	.next = trace_next,
+	.show = trace_show,
+	.stop = trace_stop,
+};
+
+static int trace_open(struct inode *inode, struct file *filp)
+{
+	struct trace_remote *remote = inode->i_private;
+	struct trace_remote_iterator *iter = NULL;
+	int cpu = tracing_get_cpu(inode);
+	int ret;
+
+	if (!(filp->f_mode & FMODE_READ))
+		return 0;
+
+	guard(mutex)(&remote->lock);
+
+	iter = trace_remote_iter(remote, cpu, TRI_NONCONSUMING);
+	if (IS_ERR(iter))
+		return PTR_ERR(iter);
+
+	ret = seq_open(filp, &trace_sops);
+	if (ret) {
+		trace_remote_iter_free(iter);
+		return ret;
+	}
+
+	((struct seq_file *)filp->private_data)->private = (void *)iter;
+
+	return 0;
+}
+
+static int trace_release(struct inode *inode, struct file *filp)
+{
+	struct trace_remote_iterator *iter;
+
+	if (!(filp->f_mode & FMODE_READ))
+		return 0;
+
+	iter = ((struct seq_file *)filp->private_data)->private;
+	seq_release(inode, filp);
+
+	if (!iter)
+		return 0;
+
+	guard(mutex)(&iter->remote->lock);
+
+	trace_remote_iter_free(iter);
+
+	return 0;
+}
+
+static ssize_t trace_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+	struct inode *inode = file_inode(filp);
+	struct trace_remote *remote = inode->i_private;
+	int cpu = tracing_get_cpu(inode);
+
+	guard(mutex)(&remote->lock);
+
+	trace_remote_reset(remote, cpu);
+
+	return cnt;
+}
+
+static const struct file_operations trace_fops = {
+	.open = trace_open,
+	.write = trace_write,
+	.read = seq_read,
+	.read_iter = seq_read_iter,
+	.release = trace_release,
+};
+
+static int trace_remote_init_tracefs(const char *name, struct trace_remote *remote)
+{
+	struct dentry *remote_d, *percpu_d, *d;
+	static struct dentry *root;
+	static DEFINE_MUTEX(lock);
+	bool root_inited = false;
+	int cpu;
+
+	guard(mutex)(&lock);
+
+	if (!root) {
+		root = tracefs_create_dir(TRACEFS_DIR, NULL);
+		if (!root) {
+			pr_err("Failed to create tracefs dir "TRACEFS_DIR"\n");
+			return -ENOMEM;
+		}
+		root_inited = true;
+	}
+
+	remote_d = tracefs_create_dir(name, root);
+	if (!remote_d) {
+		pr_err("Failed to create tracefs dir "TRACEFS_DIR"%s/\n", name);
+		goto err;
+	}
+
+	d = trace_create_file("tracing_on", TRACEFS_MODE_WRITE, remote_d, remote, &tracing_on_fops);
+	if (!d)
+		goto err;
+
+	d = trace_create_file("buffer_size_kb", TRACEFS_MODE_WRITE, remote_d, remote,
+			      &buffer_size_kb_fops);
+	if (!d)
+		goto err;
+
+	d = trace_create_file("trace_pipe", TRACEFS_MODE_READ, remote_d, remote, &trace_pipe_fops);
+	if (!d)
+		goto err;
+
+	d = trace_create_file("trace", TRACEFS_MODE_WRITE, remote_d, remote, &trace_fops);
+	if (!d)
+		goto err;
+
+	percpu_d = tracefs_create_dir("per_cpu", remote_d);
+	if (!percpu_d) {
+		pr_err("Failed to create tracefs dir "TRACEFS_DIR"%s/per_cpu/\n", name);
+		goto err;
+	}
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *cpu_d;
+		char cpu_name[16];
+
+		snprintf(cpu_name, sizeof(cpu_name), "cpu%d", cpu);
+		cpu_d = tracefs_create_dir(cpu_name, percpu_d);
+		if (!cpu_d) {
+			pr_err("Failed to create tracefs dir "TRACEFS_DIR"%s/percpu/cpu%d\n",
+			       name, cpu);
+			goto err;
+		}
+
+		d = trace_create_cpu_file("trace_pipe", TRACEFS_MODE_READ, cpu_d, remote, cpu,
+					  &trace_pipe_fops);
+		if (!d)
+			goto err;
+
+		d = trace_create_cpu_file("trace", TRACEFS_MODE_WRITE, cpu_d, remote, cpu,
+					  &trace_fops);
+		if (!d)
+			goto err;
+	}
+
+	remote->dentry = remote_d;
+
+	return 0;
+
+err:
+	if (root_inited) {
+		tracefs_remove(root);
+		root = NULL;
+	} else {
+		tracefs_remove(remote_d);
+	}
+
+	return -ENOMEM;
+}
+
+static int trace_remote_register_events(const char *remote_name, struct trace_remote *remote,
+					struct remote_event *events, size_t nr_events);
+
+/**
+ * trace_remote_register() - Register a Tracefs remote
+ * @name:	Name of the remote, used for the Tracefs remotes/ directory.
+ * @cbs:	Set of callbacks used to control the remote.
+ * @priv:	Private data, passed to each callback from @cbs.
+ * @events:	Array of events. &remote_event.name and &remote_event.id must be
+ *		filled by the caller.
+ * @nr_events:	Number of events in the @events array.
+ *
+ * A trace remote is an entity, outside of the kernel (most likely firmware or
+ * hypervisor) capable of writing events into a Tracefs compatible ring-buffer.
+ * The kernel would then act as a reader.
+ *
+ * The registered remote will be found under the Tracefs directory
+ * remotes/<name>.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int trace_remote_register(const char *name, struct trace_remote_callbacks *cbs, void *priv,
+			  struct remote_event *events, size_t nr_events)
+{
+	struct trace_remote *remote;
+	int ret;
+
+	remote = kzalloc_obj(*remote);
+	if (!remote)
+		return -ENOMEM;
+
+	remote->cbs = cbs;
+	remote->priv = priv;
+	remote->trace_buffer_size = 7 << 10;
+	remote->poll_ms = 100;
+	mutex_init(&remote->lock);
+	init_rwsem(&remote->reader_lock);
+
+	if (trace_remote_init_tracefs(name, remote)) {
+		kfree(remote);
+		return -ENOMEM;
+	}
+
+	ret = trace_remote_register_events(name, remote, events, nr_events);
+	if (ret) {
+		pr_err("Failed to register events for trace remote '%s' (%d)\n",
+		       name, ret);
+		return ret;
+	}
+
+	ret = cbs->init ? cbs->init(remote->dentry, priv) : 0;
+	if (ret)
+		pr_err("Init failed for trace remote '%s' (%d)\n", name, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(trace_remote_register);
+
+/**
+ * trace_remote_free_buffer() - Free trace buffer allocated with trace_remote_alloc_buffer()
+ * @desc:	Descriptor of the per-CPU ring-buffers, originally filled by
+ *		trace_remote_alloc_buffer()
+ *
+ * Most likely called from &trace_remote_callbacks.unload_trace_buffer.
+ */
+void trace_remote_free_buffer(struct trace_buffer_desc *desc)
+{
+	struct ring_buffer_desc *rb_desc;
+	int cpu;
+
+	for_each_ring_buffer_desc(rb_desc, cpu, desc) {
+		unsigned int id;
+
+		free_page(rb_desc->meta_va);
+
+		for (id = 0; id < rb_desc->nr_page_va; id++)
+			free_page(rb_desc->page_va[id]);
+	}
+}
+EXPORT_SYMBOL_GPL(trace_remote_free_buffer);
+
+/**
+ * trace_remote_alloc_buffer() - Dynamically allocate a trace buffer
+ * @desc:		Uninitialized trace_buffer_desc
+ * @desc_size:		Size of the trace_buffer_desc. Must be at least equal to
+ *			trace_buffer_desc_size()
+ * @buffer_size:	Size in bytes of each per-CPU ring-buffer
+ * @cpumask:		CPUs to allocate a ring-buffer for
+ *
+ * Helper to dynamically allocate a set of pages (enough to cover @buffer_size)
+ * for each CPU from @cpumask and fill @desc. Most likely called from
+ * &trace_remote_callbacks.load_trace_buffer.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int trace_remote_alloc_buffer(struct trace_buffer_desc *desc, size_t desc_size, size_t buffer_size,
+			      const struct cpumask *cpumask)
+{
+	unsigned int nr_pages = max(DIV_ROUND_UP(buffer_size, PAGE_SIZE), 2UL) + 1;
+	void *desc_end = desc + desc_size;
+	struct ring_buffer_desc *rb_desc;
+	int cpu, ret = -ENOMEM;
+
+	if (desc_size < struct_size(desc, __data, 0))
+		return -EINVAL;
+
+	desc->nr_cpus = 0;
+	desc->struct_len = struct_size(desc, __data, 0);
+
+	rb_desc = (struct ring_buffer_desc *)&desc->__data[0];
+
+	for_each_cpu(cpu, cpumask) {
+		unsigned int id;
+
+		if ((void *)rb_desc + struct_size(rb_desc, page_va, nr_pages) > desc_end) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		rb_desc->cpu = cpu;
+		rb_desc->nr_page_va = 0;
+		rb_desc->meta_va = (unsigned long)__get_free_page(GFP_KERNEL);
+		if (!rb_desc->meta_va)
+			goto err;
+
+		for (id = 0; id < nr_pages; id++) {
+			rb_desc->page_va[id] = (unsigned long)__get_free_page(GFP_KERNEL);
+			if (!rb_desc->page_va[id])
+				goto err;
+
+			rb_desc->nr_page_va++;
+		}
+		desc->nr_cpus++;
+		desc->struct_len += offsetof(struct ring_buffer_desc, page_va);
+		desc->struct_len += struct_size(rb_desc, page_va, rb_desc->nr_page_va);
+		rb_desc = __next_ring_buffer_desc(rb_desc);
+	}
+
+	return 0;
+
+err:
+	trace_remote_free_buffer(desc);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(trace_remote_alloc_buffer);
+
+static int
+trace_remote_enable_event(struct trace_remote *remote, struct remote_event *evt, bool enable)
+{
+	int ret;
+
+	lockdep_assert_held(&remote->lock);
+
+	if (evt->enabled == enable)
+		return 0;
+
+	ret = remote->cbs->enable_event(evt->id, enable, remote->priv);
+	if (ret)
+		return ret;
+
+	evt->enabled = enable;
+
+	return 0;
+}
+
+static int remote_event_enable_show(struct seq_file *s, void *unused)
+{
+	struct remote_event *evt = s->private;
+
+	seq_printf(s, "%d\n", evt->enabled);
+
+	return 0;
+}
+
+static ssize_t remote_event_enable_write(struct file *filp, const char __user *ubuf,
+					 size_t count, loff_t *ppos)
+{
+	struct seq_file *seq = filp->private_data;
+	struct remote_event *evt = seq->private;
+	struct trace_remote *remote = evt->remote;
+	u8 enable;
+	int ret;
+
+	ret = kstrtou8_from_user(ubuf, count, 10, &enable);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&remote->lock);
+
+	ret = trace_remote_enable_event(remote, evt, enable);
+	if (ret)
+		return ret;
+
+	return count;
+}
+DEFINE_SHOW_STORE_ATTRIBUTE(remote_event_enable);
+
+static int remote_event_id_show(struct seq_file *s, void *unused)
+{
+	struct remote_event *evt = s->private;
+
+	seq_printf(s, "%d\n", evt->id);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(remote_event_id);
+
+static int remote_event_format_show(struct seq_file *s, void *unused)
+{
+	size_t offset = sizeof(struct remote_event_hdr);
+	struct remote_event *evt = s->private;
+	struct trace_event_fields *field;
+
+	seq_printf(s, "name: %s\n", evt->name);
+	seq_printf(s, "ID: %d\n", evt->id);
+	seq_puts(s,
+		 "format:\n\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;\n\n");
+
+	field = &evt->fields[0];
+	while (field->name) {
+		seq_printf(s, "\tfield:%s %s;\toffset:%zu;\tsize:%u;\tsigned:%d;\n",
+			   field->type, field->name, offset, field->size,
+			   field->is_signed);
+		offset += field->size;
+		field++;
+	}
+
+	if (field != &evt->fields[0])
+		seq_puts(s, "\n");
+
+	seq_printf(s, "print fmt: %s\n", evt->print_fmt);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(remote_event_format);
+
+static int remote_event_callback(const char *name, umode_t *mode, void **data,
+				 const struct file_operations **fops)
+{
+	if (!strcmp(name, "enable")) {
+		*mode = TRACEFS_MODE_WRITE;
+		*fops = &remote_event_enable_fops;
+		return 1;
+	}
+
+	if (!strcmp(name, "id")) {
+		*mode = TRACEFS_MODE_READ;
+		*fops = &remote_event_id_fops;
+		return 1;
+	}
+
+	if (!strcmp(name, "format")) {
+		*mode = TRACEFS_MODE_READ;
+		*fops = &remote_event_format_fops;
+		return 1;
+	}
+
+	return 0;
+}
+
+static ssize_t remote_events_dir_enable_write(struct file *filp, const char __user *ubuf,
+					      size_t count, loff_t *ppos)
+{
+	struct trace_remote *remote = file_inode(filp)->i_private;
+	int i, ret;
+	u8 enable;
+
+	ret = kstrtou8_from_user(ubuf, count, 10, &enable);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&remote->lock);
+
+	for (i = 0; i < remote->nr_events; i++) {
+		struct remote_event *evt = &remote->events[i];
+
+		trace_remote_enable_event(remote, evt, enable);
+	}
+
+	return count;
+}
+
+static ssize_t remote_events_dir_enable_read(struct file *filp, char __user *ubuf, size_t cnt,
+					     loff_t *ppos)
+{
+	struct trace_remote *remote = file_inode(filp)->i_private;
+	const char enabled_char[] = {'0', '1', 'X'};
+	char enabled_str[] = " \n";
+	int i, enabled = -1;
+
+	guard(mutex)(&remote->lock);
+
+	for (i = 0; i < remote->nr_events; i++) {
+		struct remote_event *evt = &remote->events[i];
+
+		if (enabled == -1) {
+			enabled = evt->enabled;
+		} else if (enabled != evt->enabled) {
+			enabled = 2;
+			break;
+		}
+	}
+
+	enabled_str[0] = enabled_char[enabled == -1 ? 0 : enabled];
+
+	return simple_read_from_buffer(ubuf, cnt, ppos, enabled_str, 2);
+}
+
+static const struct file_operations remote_events_dir_enable_fops = {
+	.write = remote_events_dir_enable_write,
+	.read = remote_events_dir_enable_read,
+};
+
+static ssize_t
+remote_events_dir_header_page_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+	struct trace_seq *s;
+	int ret;
+
+	s = kmalloc(sizeof(*s), GFP_KERNEL);
+	if (!s)
+		return -ENOMEM;
+
+	trace_seq_init(s);
+
+	ring_buffer_print_page_header(NULL, s);
+	ret = simple_read_from_buffer(ubuf, cnt, ppos, s->buffer, trace_seq_used(s));
+	kfree(s);
+
+	return ret;
+}
+
+static const struct file_operations remote_events_dir_header_page_fops = {
+	.read = remote_events_dir_header_page_read,
+};
+
+static ssize_t
+remote_events_dir_header_event_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+	struct trace_seq *s;
+	int ret;
+
+	s = kmalloc(sizeof(*s), GFP_KERNEL);
+	if (!s)
+		return -ENOMEM;
+
+	trace_seq_init(s);
+
+	ring_buffer_print_entry_header(s);
+	ret = simple_read_from_buffer(ubuf, cnt, ppos, s->buffer, trace_seq_used(s));
+	kfree(s);
+
+	return ret;
+}
+
+static const struct file_operations remote_events_dir_header_event_fops = {
+	.read = remote_events_dir_header_event_read,
+};
+
+static int remote_events_dir_callback(const char *name, umode_t *mode, void **data,
+				      const struct file_operations **fops)
+{
+	if (!strcmp(name, "enable")) {
+		*mode = TRACEFS_MODE_WRITE;
+		*fops = &remote_events_dir_enable_fops;
+		return 1;
+	}
+
+	if (!strcmp(name, "header_page")) {
+		*mode = TRACEFS_MODE_READ;
+		*fops = &remote_events_dir_header_page_fops;
+		return 1;
+	}
+
+	if (!strcmp(name, "header_event")) {
+		*mode = TRACEFS_MODE_READ;
+		*fops = &remote_events_dir_header_event_fops;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int trace_remote_init_eventfs(const char *remote_name, struct trace_remote *remote,
+				     struct remote_event *evt)
+{
+	struct eventfs_inode *eventfs = remote->eventfs;
+	static struct eventfs_entry dir_entries[] = {
+		{
+			.name = "enable",
+			.callback = remote_events_dir_callback,
+		}, {
+			.name = "header_page",
+			.callback = remote_events_dir_callback,
+		}, {
+			.name = "header_event",
+			.callback = remote_events_dir_callback,
+		}
+	};
+	static struct eventfs_entry entries[] = {
+		{
+			.name = "enable",
+			.callback = remote_event_callback,
+		}, {
+			.name = "id",
+			.callback = remote_event_callback,
+		}, {
+			.name = "format",
+			.callback = remote_event_callback,
+		}
+	};
+	bool eventfs_create = false;
+
+	if (!eventfs) {
+		eventfs = eventfs_create_events_dir("events", remote->dentry, dir_entries,
+						    ARRAY_SIZE(dir_entries), remote);
+		if (IS_ERR(eventfs))
+			return PTR_ERR(eventfs);
+
+		/*
+		 * Create similar hierarchy as local events even if a single system is supported at
+		 * the moment
+		 */
+		eventfs = eventfs_create_dir(remote_name, eventfs, NULL, 0, NULL);
+		if (IS_ERR(eventfs))
+			return PTR_ERR(eventfs);
+
+		remote->eventfs = eventfs;
+		eventfs_create = true;
+	}
+
+	eventfs = eventfs_create_dir(evt->name, eventfs, entries, ARRAY_SIZE(entries), evt);
+	if (IS_ERR(eventfs)) {
+		if (eventfs_create) {
+			eventfs_remove_events_dir(remote->eventfs);
+			remote->eventfs = NULL;
+		}
+		return PTR_ERR(eventfs);
+	}
+
+	return 0;
+}
+
+static int trace_remote_attach_events(struct trace_remote *remote, struct remote_event *events,
+				      size_t nr_events)
+{
+	int i;
+
+	for (i = 0; i < nr_events; i++) {
+		struct remote_event *evt = &events[i];
+
+		if (evt->remote)
+			return -EEXIST;
+
+		evt->remote = remote;
+
+		/* We need events to be sorted for efficient lookup */
+		if (i && evt->id <= events[i - 1].id)
+			return -EINVAL;
+	}
+
+	remote->events = events;
+	remote->nr_events = nr_events;
+
+	return 0;
+}
+
+static int trace_remote_register_events(const char *remote_name, struct trace_remote *remote,
+					struct remote_event *events, size_t nr_events)
+{
+	int i, ret;
+
+	ret = trace_remote_attach_events(remote, events, nr_events);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < nr_events; i++) {
+		struct remote_event *evt = &events[i];
+
+		ret = trace_remote_init_eventfs(remote_name, remote, evt);
+		if (ret)
+			pr_warn("Failed to init eventfs for event '%s' (%d)",
+				evt->name, ret);
+	}
+
+	return 0;
+}
+
+static int __cmp_events(const void *key, const void *data)
+{
+	const struct remote_event *evt = data;
+	int id = (int)((long)key);
+
+	return id - (int)evt->id;
+}
+
+static struct remote_event *trace_remote_find_event(struct trace_remote *remote, unsigned short id)
+{
+	return bsearch((const void *)(unsigned long)id, remote->events, remote->nr_events,
+		       sizeof(*remote->events), __cmp_events);
+}
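The timestamp handling in trace_remote_iter_print_event() above can be mirrored in a few lines of shell, which may help when post-processing remote traces. This is only a sketch: `format_remote_ts` is a hypothetical helper name, and it assumes the ring-buffer timestamp is in nanoseconds, as the two do_div() calls suggest.

```shell
#!/bin/sh
# Sketch: convert a nanosecond timestamp into the "<secs>.<usecs>" form
# that trace_remote_iter_print_event() emits with "%5llu.%06lu".
# format_remote_ts is a hypothetical name, not a kernel or selftest helper.
format_remote_ts()
{
	ns=$1
	usecs=$((ns / 1000))		# do_div(ts, 1000): ns -> us
	secs=$((usecs / 1000000))	# quotient kept in ts by the second do_div()
	rem=$((usecs % 1000000))	# remainder returned as usecs_rem
	printf '%5d.%06d\n' "$secs" "$rem"
}

format_remote_ts 1234567890	# prints "    1.234567"
```

Sub-microsecond precision is dropped by the first division, so two events less than 1 us apart can print the same timestamp.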

+1

tools/arch/arm64/include/uapi/asm/kvm.h

···
 #define KVM_DEV_ARM_ITS_RESTORE_TABLES 2
 #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
 #define KVM_DEV_ARM_ITS_CTRL_RESET 4
+#define KVM_DEV_ARM_VGIC_USERSPACE_PPIS 5
 
 /* Device Control API on vcpu fd */
 #define KVM_ARM_VCPU_PMU_V3_CTRL 0

+2

tools/include/uapi/linux/kvm.h

···
 #define KVM_DEV_TYPE_LOONGARCH_EIOINTC KVM_DEV_TYPE_LOONGARCH_EIOINTC
 	KVM_DEV_TYPE_LOONGARCH_PCHPIC,
 #define KVM_DEV_TYPE_LOONGARCH_PCHPIC KVM_DEV_TYPE_LOONGARCH_PCHPIC
+	KVM_DEV_TYPE_ARM_VGIC_V5,
+#define KVM_DEV_TYPE_ARM_VGIC_V5 KVM_DEV_TYPE_ARM_VGIC_V5
 
 	KVM_DEV_TYPE_MAX,
 

+25

tools/testing/selftests/ftrace/test.d/remotes/buffer_size.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote buffer size
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+test_buffer_size()
+{
+	echo 0 > tracing_on
+	assert_unloaded
+
+	echo 4096 > buffer_size_kb
+	echo 1 > tracing_on
+	assert_loaded
+
+	echo 0 > tracing_on
+	echo 7 > buffer_size_kb
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+	set -e
+	setup_remote_test
+	test_buffer_size
+fi
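The test above depends on `buffer_size_kb` reporting whether the buffer is loaded. A minimal sketch of that check follows, using a temporary file in place of the tracefs file. `remote_buffer_state` is a hypothetical helper, and the sample contents ("7 (unloaded)", "4096 (loaded)") are assumptions: the only thing the selftests confirm is that the file contains the marker "(loaded)" or "(unloaded)".

```shell
#!/bin/sh
# Sketch: classify a buffer_size_kb-like file by its state marker.
# remote_buffer_state is a hypothetical helper; the file layout beyond
# the "(loaded)"/"(unloaded)" markers is assumed for illustration.
remote_buffer_state()
{
	# -F matches the parentheses literally; "(unloaded)" does not
	# contain "(loaded)", so the two markers cannot be confused.
	if grep -qF "(loaded)" "$1"; then
		echo loaded
	elif grep -qF "(unloaded)" "$1"; then
		echo unloaded
	else
		echo unknown
	fi
}

sample=$(mktemp)
echo "7 (unloaded)" > "$sample"
remote_buffer_state "$sample"	# prints "unloaded"
echo "4096 (loaded)" > "$sample"
remote_buffer_state "$sample"	# prints "loaded"
rm -f "$sample"
```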

+99

tools/testing/selftests/ftrace/test.d/remotes/functions

···
+# SPDX-License-Identifier: GPL-2.0
+
+setup_remote()
+{
+	local name=$1
+
+	[ -e $TRACING_DIR/remotes/$name/write_event ] || exit_unresolved
+
+	cd remotes/$name/
+	echo 0 > tracing_on
+	clear_trace
+	echo 7 > buffer_size_kb
+	echo 0 > events/enable
+	echo 1 > events/$name/selftest/enable
+	echo 1 > tracing_on
+}
+
+setup_remote_test()
+{
+	[ -d $TRACING_DIR/remotes/test/ ] || modprobe remote_test || exit_unresolved
+
+	setup_remote "test"
+}
+
+assert_loaded()
+{
+	grep -q "(loaded)" buffer_size_kb || return 1
+}
+
+assert_unloaded()
+{
+	grep -q "(unloaded)" buffer_size_kb || return 1
+}
+
+reload_remote()
+{
+	echo 0 > tracing_on
+	clear_trace
+	assert_unloaded
+	echo 1 > tracing_on
+	assert_loaded
+}
+
+dump_trace_pipe()
+{
+	output=$(mktemp $TMPDIR/remote_test.XXXXXX)
+	cat trace_pipe > $output &
+	pid=$!
+	sleep 1
+	kill -1 $pid
+
+	echo $output
+}
+
+check_trace()
+{
+	start_id="$1"
+	end_id="$2"
+	file="$3"
+
+	# Ensure the file is not empty
+	test -n "$(head $file)"
+
+	prev_ts=0
+	id=0
+
+	# Only keep <timestamp> <id>
+	tmp=$(mktemp $TMPDIR/remote_test.XXXXXX)
+	sed -e 's/\[[0-9]*\]\s*\([0-9]*.[0-9]*\): [a-z]* id=\([0-9]*\)/\1 \2/' $file > $tmp
+
+	while IFS= read -r line; do
+		ts=$(echo $line | cut -d ' ' -f 1)
+		id=$(echo $line | cut -d ' ' -f 2)
+
+		test $(echo "$ts>$prev_ts" | bc) -eq 1
+		test $id -eq $start_id
+
+		prev_ts=$ts
+		start_id=$((start_id + 1))
+	done < $tmp
+
+	test $id -eq $end_id
+	rm $tmp
+}
+
+get_cpu_ids()
+{
+	sed -n 's/^processor\s*:\s*\([0-9]\+\).*/\1/p' /proc/cpuinfo
+}
+
+get_page_size()
+{
+	sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+get_selftest_event_size()
+{
+	sed -ne 's/^.*field:.*;.*size:\([0-9][0-9]*\);.*/\1/p' events/*/selftest/format | awk '{s+=$1} END {print s}'
+}
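The "<timestamp> <id>" extraction that check_trace() performs can be tried in isolation. The sketch below assumes a trace line shaped like the output of trace_remote_iter_print_event() followed by a `selftest id=<n>` payload; the payload shape is inferred from the sed expression in check_trace(), not confirmed elsewhere in this diff. `parse_trace_line` is a hypothetical name, and it uses the POSIX class `[[:space:]]` where the selftest uses the GNU `\s` shorthand.

```shell
#!/bin/sh
# Sketch: reduce a remote trace line to "<timestamp> <id>".
# parse_trace_line is a hypothetical helper; the "selftest id=N" payload
# is an assumption inferred from check_trace().
parse_trace_line()
{
	sed -e 's/^\[[0-9]*\][[:space:]]*\([0-9]*\.[0-9]*\): [a-z_]* id=\([0-9]*\).*$/\1 \2/'
}

# Build a sample line with the same layout as "[%03d]\t%5llu.%06lu: "
printf '[%03d]\t%5d.%06d: selftest id=%d\n' 0 1 234567 5 | parse_trace_line
# prints "1.234567 5"
```

Keeping the timestamp and ID on one line is what lets the `while read` loop in check_trace() verify both monotonic timestamps and consecutive IDs in a single pass.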

+88

tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote read with an offline CPU
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+hotunplug_one_cpu()
+{
+	[ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 1
+
+	for cpu in $(get_cpu_ids); do
+		echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 1
+		break
+	done
+
+	echo $cpu
+}
+
+# Check non-consuming and consuming read
+check_read()
+{
+	for i in $(seq 1 8); do
+		echo $i > write_event
+	done
+
+	check_trace 1 8 trace
+
+	output=$(dump_trace_pipe)
+	check_trace 1 8 $output
+	rm $output
+}
+
+test_hotplug()
+{
+	echo 0 > trace
+	assert_loaded
+
+	#
+	# Test a trace buffer containing an offline CPU
+	#
+
+	cpu=$(hotunplug_one_cpu) || exit_unsupported
+	trap "echo 1 > /sys/devices/system/cpu/cpu$cpu/online" EXIT
+
+	check_read
+
+	#
+	# Test a trace buffer with a missing CPU
+	#
+
+	reload_remote
+
+	check_read
+
+	#
+	# Test a trace buffer with a CPU added later
+	#
+
+	echo 1 > /sys/devices/system/cpu/cpu$cpu/online
+	trap "" EXIT
+	assert_loaded
+
+	check_read
+
+	# Test if the ring-buffer for the newly added CPU is both writable and
+	# readable
+	for i in $(seq 1 8); do
+		taskset -c $cpu echo $i > write_event
+	done
+
+	cd per_cpu/cpu$cpu/
+
+	check_trace 1 8 trace
+
+	output=$(dump_trace_pipe)
+	check_trace 1 8 $output
+	rm $output
+
+	cd -
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+	set -e
+
+	setup_remote_test
+	test_hotplug
+fi
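test_hotplug() installs an EXIT trap so that the CPU is brought back online even if a check fails under `set -e`. The pattern can be sketched without touching sysfs: here a temporary file stands in for `/sys/devices/system/cpu/cpuN/online`, and `run_with_restore` is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Sketch of the restore-on-exit pattern used by test_hotplug().
# A temp file plays the role of the sysfs "online" file (assumption for
# illustration); run_with_restore is a hypothetical helper name.
run_with_restore()
{
	online_file=$1
	(
		# Restore the "online" state however the subshell exits
		trap 'echo 1 > "$online_file"' EXIT
		echo 0 > "$online_file"
		exit 1	# simulate a failing check while the CPU is offline
	) || true
	cat "$online_file"
}

fake_online=$(mktemp)
echo 1 > "$fake_online"
run_with_restore "$fake_online"	# prints "1": the trap restored the state
rm -f "$fake_online"
```

The selftest clears the trap with `trap "" EXIT` once it has brought the CPU back itself, so the restore does not run a second time at exit.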

+11

tools/testing/selftests/ftrace/test.d/remotes/hypervisor/buffer_size.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace buffer size
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/buffer_size.tc
+
+set -e
+setup_remote "hypervisor"
+test_buffer_size
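The hypervisor wrappers reuse the generic tests by setting `SOURCE_REMOTE_TEST` before sourcing them, which imports the test function without running it against the `test` remote. A self-contained sketch of that guard follows; `demo.tc` and `test_demo` are hypothetical names standing in for the real test files and functions.

```shell
#!/bin/sh
# Sketch of the SOURCE_REMOTE_TEST guard. demo.tc and test_demo are
# hypothetical stand-ins for files such as buffer_size.tc.
dir=$(mktemp -d)
cat > "$dir/demo.tc" <<'EOF'
test_demo()
{
	echo "demo ran"
}

if [ -z "$SOURCE_REMOTE_TEST" ]; then
	test_demo
fi
EOF

# Executed directly with the variable empty: the guard lets the test run
SOURCE_REMOTE_TEST= sh "$dir/demo.tc"

# Sourced with the variable set: only the function definition is imported
SOURCE_REMOTE_TEST=1
. "$dir/demo.tc"
echo "sourced without running"

# The wrapper then decides when, and against which remote, to call it
test_demo
rm -rf "$dir"
```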

+11

tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace read with an offline CPU
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/hotplug.tc
+
+set -e
+setup_remote "hypervisor"
+test_hotplug

+11

tools/testing/selftests/ftrace/test.d/remotes/hypervisor/reset.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace buffer reset
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/reset.tc
+
+set -e
+setup_remote "hypervisor"
+test_reset

+11

tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor non-consuming trace read
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/trace.tc
+
+set -e
+setup_remote "hypervisor"
+test_trace

+11

tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace_pipe.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor consuming trace read
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/trace_pipe.tc
+
+set -e
+setup_remote "hypervisor"
+test_trace_pipe

+11

tools/testing/selftests/ftrace/test.d/remotes/hypervisor/unloading.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace buffer unloading
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/unloading.tc
+
+set -e
+setup_remote "hypervisor"
+test_unloading

+90

tools/testing/selftests/ftrace/test.d/remotes/reset.tc

···
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote reset
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+check_reset()
+{
+	write_event_path="write_event"
+	taskset=""
+
+	clear_trace
+
+	# Is the buffer empty?
+	output=$(dump_trace_pipe)
+	test $(wc -l $output | cut -d ' ' -f1) -eq 0
+
+	if $(echo $(pwd) | grep -q "per_cpu/cpu"); then
+		write_event_path="../../write_event"
+		cpu_id=$(echo $(pwd) | sed -e 's/.*per_cpu\/cpu//')
+		taskset="taskset -c $cpu_id"
+	fi
+	rm $output
+
+	# Can we properly write a new event?
+	$taskset echo 7890 > $write_event_path
+	output=$(dump_trace_pipe)
+	test $(wc -l $output | cut -d ' ' -f1) -eq 1
+	grep -q "id=7890" $output
+	rm $output
+}
+
+test_global_interface()
+{
+	output=$(mktemp $TMPDIR/remote_test.XXXXXX)
+
+	# Confidence check
+	echo 123456 > write_event
+	output=$(dump_trace_pipe)
+	grep -q "id=123456" $output
+	rm $output
+
+	# Reset single event
+	echo 1 > write_event
+	check_reset
+
+	# Reset lost events
+	for i in $(seq 1 10000); do
+		echo 1 > write_event
+	done
+	check_reset
+}
+
+test_percpu_interface()
+{
+	[ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
+
+	for cpu in $(get_cpu_ids); do
+		taskset -c $cpu echo 1 > write_event
+	done
+
+	check_non_empty=0
+	for cpu in $(get_cpu_ids); do
+		cd per_cpu/cpu$cpu/
+
+		if [ $check_non_empty -eq 0 ]; then
+			check_reset
+			check_non_empty=1
+		else
+			# Check we have only reset 1 CPU
+			output=$(dump_trace_pipe)
+			test $(wc -l $output | cut -d ' ' -f1) -eq 1
+			rm $output
+		fi
+		cd -
+	done
+}
+
+test_reset()
+{
+	test_global_interface
+	test_percpu_interface
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+	set -e
+	setup_remote_test
+	test_reset
+fi
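check_reset() decides whether it is running in a per-CPU directory by inspecting the working directory, then derives the CPU ID from the path. That derivation can be sketched on plain strings; `cpu_from_path` is a hypothetical helper, and the example paths are assumptions about where tracefs happens to be mounted.

```shell
#!/bin/sh
# Sketch: extract the CPU ID from a per_cpu/cpuN working directory, as
# check_reset() does with $(pwd). cpu_from_path is a hypothetical name.
cpu_from_path()
{
	case "$1" in
	*per_cpu/cpu*)
		# Same sed expression as check_reset(): drop everything up
		# to and including "per_cpu/cpu"
		echo "$1" | sed -e 's/.*per_cpu\/cpu//'
		;;
	*)
		# Not a per-CPU directory: no CPU to pin to
		echo ""
		;;
	esac
}

cpu_from_path /sys/kernel/tracing/remotes/test/per_cpu/cpu3	# prints "3"
```

In the selftest, a non-empty result becomes `taskset -c <cpu>` so that the test event lands in the ring-buffer of the CPU whose directory is being checked.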

+102

tools/testing/selftests/ftrace/test.d/remotes/trace.tc

··· 1 + #!/bin/sh 2 + # SPDX-License-Identifier: GPL-2.0 3 + # description: Test trace remote non-consuming read 4 + # requires: remotes/test 5 + 6 + . $TEST_DIR/remotes/functions 7 + 8 + test_trace() 9 + { 10 + echo 0 > tracing_on 11 + assert_unloaded 12 + 13 + echo 7 > buffer_size_kb 14 + echo 1 > tracing_on 15 + assert_loaded 16 + 17 + # Simple test: Emit few events and try to read them 18 + for i in $(seq 1 8); do 19 + echo $i > write_event 20 + done 21 + 22 + check_trace 1 8 trace 23 + 24 + # 25 + # Test interaction with consuming read 26 + # 27 + 28 + cat trace_pipe > /dev/null & 29 + pid=$! 30 + 31 + sleep 1 32 + kill $pid 33 + 34 + test $(wc -l < trace) -eq 0 35 + 36 + for i in $(seq 16 32); do 37 + echo $i > write_event 38 + done 39 + 40 + check_trace 16 32 trace 41 + 42 + # 43 + # Test interaction with reset 44 + # 45 + 46 + echo 0 > trace 47 + 48 + test $(wc -l < trace) -eq 0 49 + 50 + for i in $(seq 1 8); do 51 + echo $i > write_event 52 + done 53 + 54 + check_trace 1 8 trace 55 + 56 + # 57 + # Test interaction with lost events 58 + # 59 + 60 + # Ensure the writer is not on the reader page by reloading the buffer 61 + reload_remote 62 + 63 + # Ensure ring-buffer overflow by emitting events from the same CPU 64 + for cpu in $(get_cpu_ids); do 65 + break 66 + done 67 + 68 + events_per_page=$(($(get_page_size) / $(get_selftest_event_size))) # Approx: does not take TS into account 69 + nr_events=$(($events_per_page * 2)) 70 + for i in $(seq 1 $nr_events); do 71 + taskset -c $cpu echo $i > write_event 72 + done 73 + 74 + id=$(sed -n -e '1s/\[[0-9]*\]\s*[0-9]*.[0-9]*: [a-z]* id=\([0-9]*\)/\1/p' trace) 75 + test $id -ne 1 76 + 77 + check_trace $id $nr_events trace 78 + 79 + # 80 + # Test per-CPU interface 81 + # 82 + echo 0 > trace 83 + 84 + for cpu in $(get_cpu_ids) ; do 85 + taskset -c $cpu echo $cpu > write_event 86 + done 87 + 88 + for cpu in $(get_cpu_ids); do 89 + cd per_cpu/cpu$cpu/ 90 + 91 + check_trace $cpu $cpu trace 92 + 93 + cd - > /dev/null 94 + 
done 95 + } 96 + 97 + if [ -z "$SOURCE_REMOTE_TEST" ]; then 98 + set -e 99 + 100 + setup_remote_test 101 + test_trace 102 + fi

+102

tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc

··· 1 + #!/bin/sh 2 + # SPDX-License-Identifier: GPL-2.0 3 + # description: Test trace remote consuming read 4 + # requires: remotes/test 5 + 6 + . $TEST_DIR/remotes/functions 7 + 8 + test_trace_pipe() 9 + { 10 + echo 0 > tracing_on 11 + assert_unloaded 12 + 13 + # Emit events from the same CPU 14 + for cpu in $(get_cpu_ids); do 15 + break 16 + done 17 + 18 + # 19 + # Simple test: Emit enough events to fill few pages 20 + # 21 + 22 + echo 1024 > buffer_size_kb 23 + echo 1 > tracing_on 24 + assert_loaded 25 + 26 + events_per_page=$(($(get_page_size) / $(get_selftest_event_size))) 27 + nr_events=$(($events_per_page * 4)) 28 + 29 + output=$(mktemp $TMPDIR/remote_test.XXXXXX) 30 + 31 + cat trace_pipe > $output & 32 + pid=$! 33 + 34 + for i in $(seq 1 $nr_events); do 35 + taskset -c $cpu echo $i > write_event 36 + done 37 + 38 + echo 0 > tracing_on 39 + sleep 1 40 + kill $pid 41 + 42 + check_trace 1 $nr_events $output 43 + 44 + rm $output 45 + 46 + # 47 + # Test interaction with lost events 48 + # 49 + 50 + assert_unloaded 51 + echo 7 > buffer_size_kb 52 + echo 1 > tracing_on 53 + assert_loaded 54 + 55 + nr_events=$((events_per_page * 2)) 56 + for i in $(seq 1 $nr_events); do 57 + taskset -c $cpu echo $i > write_event 58 + done 59 + 60 + output=$(dump_trace_pipe) 61 + 62 + lost_events=$(sed -n -e '1s/CPU:.*\[LOST \([0-9]*\) EVENTS\]/\1/p' $output) 63 + test -n "$lost_events" 64 + 65 + id=$(sed -n -e '2s/\[[0-9]*\]\s*[0-9]*.[0-9]*: [a-z]* id=\([0-9]*\)/\1/p' $output) 66 + test "$id" -eq $(($lost_events + 1)) 67 + 68 + # Drop [LOST EVENTS] line 69 + sed -i '1d' $output 70 + 71 + check_trace $id $nr_events $output 72 + 73 + rm $output 74 + 75 + # 76 + # Test per-CPU interface 77 + # 78 + 79 + echo 0 > trace 80 + echo 1 > tracing_on 81 + 82 + for cpu in $(get_cpu_ids); do 83 + taskset -c $cpu echo $cpu > write_event 84 + done 85 + 86 + for cpu in $(get_cpu_ids); do 87 + cd per_cpu/cpu$cpu/ 88 + output=$(dump_trace_pipe) 89 + 90 + check_trace $cpu $cpu $output 91 + 92 + rm 
$output 93 + cd - > /dev/null 94 + done 95 + } 96 + 97 + if [ -z "$SOURCE_REMOTE_TEST" ]; then 98 + set -e 99 + 100 + setup_remote_test 101 + test_trace_pipe 102 + fi

+41

tools/testing/selftests/ftrace/test.d/remotes/unloading.tc

··· 1 + #!/bin/sh 2 + # SPDX-License-Identifier: GPL-2.0 3 + # description: Test trace remote unloading 4 + # requires: remotes/test 5 + 6 + . $TEST_DIR/remotes/functions 7 + 8 + test_unloading() 9 + { 10 + # No reader, writing 11 + assert_loaded 12 + 13 + # No reader, no writing 14 + echo 0 > tracing_on 15 + assert_unloaded 16 + 17 + # 1 reader, no writing 18 + cat trace_pipe & 19 + pid=$! 20 + sleep 1 21 + assert_loaded 22 + kill $pid 23 + assert_unloaded 24 + 25 + # No reader, no writing, events 26 + echo 1 > tracing_on 27 + echo 1 > write_event 28 + echo 0 > tracing_on 29 + assert_loaded 30 + 31 + # Test reset 32 + clear_trace 33 + assert_unloaded 34 + } 35 + 36 + if [ -z "$SOURCE_REMOTE_TEST" ]; then 37 + set -e 38 + 39 + setup_remote_test 40 + test_unloading 41 + fi

+2 -1

tools/testing/selftests/kvm/Makefile.kvm

··· 177 177 TEST_GEN_PROGS_arm64 += arm64/vgic_init 178 178 TEST_GEN_PROGS_arm64 += arm64/vgic_irq 179 179 TEST_GEN_PROGS_arm64 += arm64/vgic_lpi_stress 180 + TEST_GEN_PROGS_arm64 += arm64/vgic_v5 180 181 TEST_GEN_PROGS_arm64 += arm64/vpmu_counter_access 181 - TEST_GEN_PROGS_arm64 += arm64/no-vgic-v3 182 + TEST_GEN_PROGS_arm64 += arm64/no-vgic 182 183 TEST_GEN_PROGS_arm64 += arm64/idreg-idst 183 184 TEST_GEN_PROGS_arm64 += arm64/kvm-uuid 184 185 TEST_GEN_PROGS_arm64 += access_tracking_perf_test

+2 -12

tools/testing/selftests/kvm/arm64/at.c

··· 13 13 14 14 enum { 15 15 CLEAR_ACCESS_FLAG, 16 - TEST_ACCESS_FLAG, 17 16 }; 18 17 19 18 static u64 *ptep_hva; ··· 48 49 GUEST_ASSERT_EQ(FIELD_GET(SYS_PAR_EL1_ATTR, par), MAIR_ATTR_NORMAL); \ 49 50 GUEST_ASSERT_EQ(FIELD_GET(SYS_PAR_EL1_SH, par), PTE_SHARED >> 8); \ 50 51 GUEST_ASSERT_EQ(par & SYS_PAR_EL1_PA, TEST_ADDR); \ 51 - GUEST_SYNC(TEST_ACCESS_FLAG); \ 52 52 } \ 53 53 } while (0) 54 54 ··· 83 85 if (!SYS_FIELD_GET(ID_AA64MMFR1_EL1, HAFDBS, read_sysreg(id_aa64mmfr1_el1))) 84 86 GUEST_DONE(); 85 87 86 - /* 87 - * KVM's software PTW makes the implementation choice that the AT 88 - * instruction sets the access flag. 89 - */ 90 88 sysreg_clear_set(tcr_el1, 0, TCR_HA); 91 89 isb(); 92 90 test_at(false); ··· 96 102 case CLEAR_ACCESS_FLAG: 97 103 /* 98 104 * Delete + reinstall the memslot to invalidate stage-2 99 - * mappings of the stage-1 page tables, forcing KVM to 100 - * use the 'slow' AT emulation path. 105 + * mappings of the stage-1 page tables, allowing KVM to 106 + * potentially use the 'slow' AT emulation path. 101 107 * 102 108 * This and clearing the access flag from host userspace 103 109 * ensures that the access flag cannot be set speculatively ··· 105 111 */ 106 112 clear_bit(__ffs(PTE_AF), ptep_hva); 107 113 vm_mem_region_reload(vcpu->vm, vcpu->vm->memslots[MEM_REGION_PT]); 108 - break; 109 - case TEST_ACCESS_FLAG: 110 - TEST_ASSERT(test_bit(__ffs(PTE_AF), ptep_hva), 111 - "Expected access flag to be set (desc: %lu)", *ptep_hva); 112 114 break; 113 115 default: 114 116 TEST_FAIL("Unexpected SYNC arg: %lu", uc->args[1]);

-177

tools/testing/selftests/kvm/arm64/no-vgic-v3.c

··· 1 - // SPDX-License-Identifier: GPL-2.0 2 - 3 - // Check that, on a GICv3 system, not configuring GICv3 correctly 4 - // results in all of the sysregs generating an UNDEF exception. 5 - 6 - #include <test_util.h> 7 - #include <kvm_util.h> 8 - #include <processor.h> 9 - 10 - static volatile bool handled; 11 - 12 - #define __check_sr_read(r) \ 13 - ({ \ 14 - uint64_t val; \ 15 - \ 16 - handled = false; \ 17 - dsb(sy); \ 18 - val = read_sysreg_s(SYS_ ## r); \ 19 - val; \ 20 - }) 21 - 22 - #define __check_sr_write(r) \ 23 - do { \ 24 - handled = false; \ 25 - dsb(sy); \ 26 - write_sysreg_s(0, SYS_ ## r); \ 27 - isb(); \ 28 - } while(0) 29 - 30 - /* Fatal checks */ 31 - #define check_sr_read(r) \ 32 - do { \ 33 - __check_sr_read(r); \ 34 - __GUEST_ASSERT(handled, #r " no read trap"); \ 35 - } while(0) 36 - 37 - #define check_sr_write(r) \ 38 - do { \ 39 - __check_sr_write(r); \ 40 - __GUEST_ASSERT(handled, #r " no write trap"); \ 41 - } while(0) 42 - 43 - #define check_sr_rw(r) \ 44 - do { \ 45 - check_sr_read(r); \ 46 - check_sr_write(r); \ 47 - } while(0) 48 - 49 - static void guest_code(void) 50 - { 51 - uint64_t val; 52 - 53 - /* 54 - * Check that we advertise that ID_AA64PFR0_EL1.GIC == 0, having 55 - * hidden the feature at runtime without any other userspace action. 56 - */ 57 - __GUEST_ASSERT(FIELD_GET(ID_AA64PFR0_EL1_GIC, 58 - read_sysreg(id_aa64pfr0_el1)) == 0, 59 - "GICv3 wrongly advertised"); 60 - 61 - /* 62 - * Access all GICv3 registers, and fail if we don't get an UNDEF. 63 - * Note that we happily access all the APxRn registers without 64 - * checking their existance, as all we want to see is a failure. 
65 - */ 66 - check_sr_rw(ICC_PMR_EL1); 67 - check_sr_read(ICC_IAR0_EL1); 68 - check_sr_write(ICC_EOIR0_EL1); 69 - check_sr_rw(ICC_HPPIR0_EL1); 70 - check_sr_rw(ICC_BPR0_EL1); 71 - check_sr_rw(ICC_AP0R0_EL1); 72 - check_sr_rw(ICC_AP0R1_EL1); 73 - check_sr_rw(ICC_AP0R2_EL1); 74 - check_sr_rw(ICC_AP0R3_EL1); 75 - check_sr_rw(ICC_AP1R0_EL1); 76 - check_sr_rw(ICC_AP1R1_EL1); 77 - check_sr_rw(ICC_AP1R2_EL1); 78 - check_sr_rw(ICC_AP1R3_EL1); 79 - check_sr_write(ICC_DIR_EL1); 80 - check_sr_read(ICC_RPR_EL1); 81 - check_sr_write(ICC_SGI1R_EL1); 82 - check_sr_write(ICC_ASGI1R_EL1); 83 - check_sr_write(ICC_SGI0R_EL1); 84 - check_sr_read(ICC_IAR1_EL1); 85 - check_sr_write(ICC_EOIR1_EL1); 86 - check_sr_rw(ICC_HPPIR1_EL1); 87 - check_sr_rw(ICC_BPR1_EL1); 88 - check_sr_rw(ICC_CTLR_EL1); 89 - check_sr_rw(ICC_IGRPEN0_EL1); 90 - check_sr_rw(ICC_IGRPEN1_EL1); 91 - 92 - /* 93 - * ICC_SRE_EL1 may not be trappable, as ICC_SRE_EL2.Enable can 94 - * be RAO/WI. Engage in non-fatal accesses, starting with a 95 - * write of 0 to try and disable SRE, and let's see if it 96 - * sticks. 97 - */ 98 - __check_sr_write(ICC_SRE_EL1); 99 - if (!handled) 100 - GUEST_PRINTF("ICC_SRE_EL1 write not trapping (OK)\n"); 101 - 102 - val = __check_sr_read(ICC_SRE_EL1); 103 - if (!handled) { 104 - __GUEST_ASSERT((val & BIT(0)), 105 - "ICC_SRE_EL1 not trapped but ICC_SRE_EL1.SRE not set\n"); 106 - GUEST_PRINTF("ICC_SRE_EL1 read not trapping (OK)\n"); 107 - } 108 - 109 - GUEST_DONE(); 110 - } 111 - 112 - static void guest_undef_handler(struct ex_regs *regs) 113 - { 114 - /* Success, we've gracefully exploded! 
*/ 115 - handled = true; 116 - regs->pc += 4; 117 - } 118 - 119 - static void test_run_vcpu(struct kvm_vcpu *vcpu) 120 - { 121 - struct ucall uc; 122 - 123 - do { 124 - vcpu_run(vcpu); 125 - 126 - switch (get_ucall(vcpu, &uc)) { 127 - case UCALL_ABORT: 128 - REPORT_GUEST_ASSERT(uc); 129 - break; 130 - case UCALL_PRINTF: 131 - printf("%s", uc.buffer); 132 - break; 133 - case UCALL_DONE: 134 - break; 135 - default: 136 - TEST_FAIL("Unknown ucall %lu", uc.cmd); 137 - } 138 - } while (uc.cmd != UCALL_DONE); 139 - } 140 - 141 - static void test_guest_no_gicv3(void) 142 - { 143 - struct kvm_vcpu *vcpu; 144 - struct kvm_vm *vm; 145 - 146 - /* Create a VM without a GICv3 */ 147 - vm = vm_create_with_one_vcpu(&vcpu, guest_code); 148 - 149 - vm_init_descriptor_tables(vm); 150 - vcpu_init_descriptor_tables(vcpu); 151 - 152 - vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, 153 - ESR_ELx_EC_UNKNOWN, guest_undef_handler); 154 - 155 - test_run_vcpu(vcpu); 156 - 157 - kvm_vm_free(vm); 158 - } 159 - 160 - int main(int argc, char *argv[]) 161 - { 162 - struct kvm_vcpu *vcpu; 163 - struct kvm_vm *vm; 164 - uint64_t pfr0; 165 - 166 - test_disable_default_vgic(); 167 - 168 - vm = vm_create_with_one_vcpu(&vcpu, NULL); 169 - pfr0 = vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1)); 170 - __TEST_REQUIRE(FIELD_GET(ID_AA64PFR0_EL1_GIC, pfr0), 171 - "GICv3 not supported."); 172 - kvm_vm_free(vm); 173 - 174 - test_guest_no_gicv3(); 175 - 176 - return 0; 177 - }

+297

tools/testing/selftests/kvm/arm64/no-vgic.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + // Check that, on a GICv3-capable system (GICv3 native, or GICv5 with 4 + // FEAT_GCIE_LEGACY), not configuring GICv3 correctly results in all 5 + // of the sysregs generating an UNDEF exception. Do the same for GICv5 6 + // on a GICv5 host. 7 + 8 + #include <test_util.h> 9 + #include <kvm_util.h> 10 + #include <processor.h> 11 + 12 + #include <arm64/gic_v5.h> 13 + 14 + static volatile bool handled; 15 + 16 + #define __check_sr_read(r) \ 17 + ({ \ 18 + uint64_t val; \ 19 + \ 20 + handled = false; \ 21 + dsb(sy); \ 22 + val = read_sysreg_s(SYS_ ## r); \ 23 + val; \ 24 + }) 25 + 26 + #define __check_sr_write(r) \ 27 + do { \ 28 + handled = false; \ 29 + dsb(sy); \ 30 + write_sysreg_s(0, SYS_ ## r); \ 31 + isb(); \ 32 + } while (0) 33 + 34 + #define __check_gicv5_gicr_op(r) \ 35 + ({ \ 36 + uint64_t val; \ 37 + \ 38 + handled = false; \ 39 + dsb(sy); \ 40 + val = read_sysreg_s(GICV5_OP_GICR_ ## r); \ 41 + val; \ 42 + }) 43 + 44 + #define __check_gicv5_gic_op(r) \ 45 + do { \ 46 + handled = false; \ 47 + dsb(sy); \ 48 + write_sysreg_s(0, GICV5_OP_GIC_ ## r); \ 49 + isb(); \ 50 + } while (0) 51 + 52 + /* Fatal checks */ 53 + #define check_sr_read(r) \ 54 + do { \ 55 + __check_sr_read(r); \ 56 + __GUEST_ASSERT(handled, #r " no read trap"); \ 57 + } while (0) 58 + 59 + #define check_sr_write(r) \ 60 + do { \ 61 + __check_sr_write(r); \ 62 + __GUEST_ASSERT(handled, #r " no write trap"); \ 63 + } while (0) 64 + 65 + #define check_sr_rw(r) \ 66 + do { \ 67 + check_sr_read(r); \ 68 + check_sr_write(r); \ 69 + } while (0) 70 + 71 + #define check_gicv5_gicr_op(r) \ 72 + do { \ 73 + __check_gicv5_gicr_op(r); \ 74 + __GUEST_ASSERT(handled, #r " no read trap"); \ 75 + } while (0) 76 + 77 + #define check_gicv5_gic_op(r) \ 78 + do { \ 79 + __check_gicv5_gic_op(r); \ 80 + __GUEST_ASSERT(handled, #r " no write trap"); \ 81 + } while (0) 82 + 83 + static void guest_code_gicv3(void) 84 + { 85 + uint64_t val; 86 + 87 + /* 88 + * Check 
that we advertise that ID_AA64PFR0_EL1.GIC == 0, having 89 + * hidden the feature at runtime without any other userspace action. 90 + */ 91 + __GUEST_ASSERT(FIELD_GET(ID_AA64PFR0_EL1_GIC, 92 + read_sysreg(id_aa64pfr0_el1)) == 0, 93 + "GICv3 wrongly advertised"); 94 + 95 + /* 96 + * Access all GICv3 registers, and fail if we don't get an UNDEF. 97 + * Note that we happily access all the APxRn registers without 98 + * checking their existence, as all we want to see is a failure. 99 + */ 100 + check_sr_rw(ICC_PMR_EL1); 101 + check_sr_read(ICC_IAR0_EL1); 102 + check_sr_write(ICC_EOIR0_EL1); 103 + check_sr_rw(ICC_HPPIR0_EL1); 104 + check_sr_rw(ICC_BPR0_EL1); 105 + check_sr_rw(ICC_AP0R0_EL1); 106 + check_sr_rw(ICC_AP0R1_EL1); 107 + check_sr_rw(ICC_AP0R2_EL1); 108 + check_sr_rw(ICC_AP0R3_EL1); 109 + check_sr_rw(ICC_AP1R0_EL1); 110 + check_sr_rw(ICC_AP1R1_EL1); 111 + check_sr_rw(ICC_AP1R2_EL1); 112 + check_sr_rw(ICC_AP1R3_EL1); 113 + check_sr_write(ICC_DIR_EL1); 114 + check_sr_read(ICC_RPR_EL1); 115 + check_sr_write(ICC_SGI1R_EL1); 116 + check_sr_write(ICC_ASGI1R_EL1); 117 + check_sr_write(ICC_SGI0R_EL1); 118 + check_sr_read(ICC_IAR1_EL1); 119 + check_sr_write(ICC_EOIR1_EL1); 120 + check_sr_rw(ICC_HPPIR1_EL1); 121 + check_sr_rw(ICC_BPR1_EL1); 122 + check_sr_rw(ICC_CTLR_EL1); 123 + check_sr_rw(ICC_IGRPEN0_EL1); 124 + check_sr_rw(ICC_IGRPEN1_EL1); 125 + 126 + /* 127 + * ICC_SRE_EL1 may not be trappable, as ICC_SRE_EL2.Enable can 128 + * be RAO/WI. Engage in non-fatal accesses, starting with a 129 + * write of 0 to try and disable SRE, and let's see if it 130 + * sticks. 
131 + */ 132 + __check_sr_write(ICC_SRE_EL1); 133 + if (!handled) 134 + GUEST_PRINTF("ICC_SRE_EL1 write not trapping (OK)\n"); 135 + 136 + val = __check_sr_read(ICC_SRE_EL1); 137 + if (!handled) { 138 + __GUEST_ASSERT((val & BIT(0)), 139 + "ICC_SRE_EL1 not trapped but ICC_SRE_EL1.SRE not set\n"); 140 + GUEST_PRINTF("ICC_SRE_EL1 read not trapping (OK)\n"); 141 + } 142 + 143 + GUEST_DONE(); 144 + } 145 + 146 + static void guest_code_gicv5(void) 147 + { 148 + /* 149 + * Check that we advertise that ID_AA64PFR2_EL1.GCIE == 0, having 150 + * hidden the feature at runtime without any other userspace action. 151 + */ 152 + __GUEST_ASSERT(FIELD_GET(ID_AA64PFR2_EL1_GCIE, 153 + read_sysreg_s(SYS_ID_AA64PFR2_EL1)) == 0, 154 + "GICv5 wrongly advertised"); 155 + 156 + /* 157 + * Try all GICv5 instructions, and fail if we don't get an UNDEF. 158 + */ 159 + check_gicv5_gic_op(CDAFF); 160 + check_gicv5_gic_op(CDDI); 161 + check_gicv5_gic_op(CDDIS); 162 + check_gicv5_gic_op(CDEOI); 163 + check_gicv5_gic_op(CDHM); 164 + check_gicv5_gic_op(CDPEND); 165 + check_gicv5_gic_op(CDPRI); 166 + check_gicv5_gic_op(CDRCFG); 167 + check_gicv5_gicr_op(CDIA); 168 + check_gicv5_gicr_op(CDNMIA); 169 + 170 + /* Check General System Register accesses */ 171 + check_sr_rw(ICC_APR_EL1); 172 + check_sr_rw(ICC_CR0_EL1); 173 + check_sr_read(ICC_HPPIR_EL1); 174 + check_sr_read(ICC_IAFFIDR_EL1); 175 + check_sr_rw(ICC_ICSR_EL1); 176 + check_sr_read(ICC_IDR0_EL1); 177 + check_sr_rw(ICC_PCR_EL1); 178 + 179 + /* Check PPI System Register accesses */ 180 + check_sr_rw(ICC_PPI_CACTIVER0_EL1); 181 + check_sr_rw(ICC_PPI_CACTIVER1_EL1); 182 + check_sr_rw(ICC_PPI_SACTIVER0_EL1); 183 + check_sr_rw(ICC_PPI_SACTIVER1_EL1); 184 + check_sr_rw(ICC_PPI_CPENDR0_EL1); 185 + check_sr_rw(ICC_PPI_CPENDR1_EL1); 186 + check_sr_rw(ICC_PPI_SPENDR0_EL1); 187 + check_sr_rw(ICC_PPI_SPENDR1_EL1); 188 + check_sr_rw(ICC_PPI_ENABLER0_EL1); 189 + check_sr_rw(ICC_PPI_ENABLER1_EL1); 190 + check_sr_read(ICC_PPI_HMR0_EL1); 191 + 
check_sr_read(ICC_PPI_HMR1_EL1); 192 + check_sr_rw(ICC_PPI_PRIORITYR0_EL1); 193 + check_sr_rw(ICC_PPI_PRIORITYR1_EL1); 194 + check_sr_rw(ICC_PPI_PRIORITYR2_EL1); 195 + check_sr_rw(ICC_PPI_PRIORITYR3_EL1); 196 + check_sr_rw(ICC_PPI_PRIORITYR4_EL1); 197 + check_sr_rw(ICC_PPI_PRIORITYR5_EL1); 198 + check_sr_rw(ICC_PPI_PRIORITYR6_EL1); 199 + check_sr_rw(ICC_PPI_PRIORITYR7_EL1); 200 + check_sr_rw(ICC_PPI_PRIORITYR8_EL1); 201 + check_sr_rw(ICC_PPI_PRIORITYR9_EL1); 202 + check_sr_rw(ICC_PPI_PRIORITYR10_EL1); 203 + check_sr_rw(ICC_PPI_PRIORITYR11_EL1); 204 + check_sr_rw(ICC_PPI_PRIORITYR12_EL1); 205 + check_sr_rw(ICC_PPI_PRIORITYR13_EL1); 206 + check_sr_rw(ICC_PPI_PRIORITYR14_EL1); 207 + check_sr_rw(ICC_PPI_PRIORITYR15_EL1); 208 + 209 + GUEST_DONE(); 210 + } 211 + 212 + static void guest_undef_handler(struct ex_regs *regs) 213 + { 214 + /* Success, we've gracefully exploded! */ 215 + handled = true; 216 + regs->pc += 4; 217 + } 218 + 219 + static void test_run_vcpu(struct kvm_vcpu *vcpu) 220 + { 221 + struct ucall uc; 222 + 223 + do { 224 + vcpu_run(vcpu); 225 + 226 + switch (get_ucall(vcpu, &uc)) { 227 + case UCALL_ABORT: 228 + REPORT_GUEST_ASSERT(uc); 229 + break; 230 + case UCALL_PRINTF: 231 + printf("%s", uc.buffer); 232 + break; 233 + case UCALL_DONE: 234 + break; 235 + default: 236 + TEST_FAIL("Unknown ucall %lu", uc.cmd); 237 + } 238 + } while (uc.cmd != UCALL_DONE); 239 + } 240 + 241 + static void test_guest_no_vgic(void *guest_code) 242 + { 243 + struct kvm_vcpu *vcpu; 244 + struct kvm_vm *vm; 245 + 246 + /* Create a VM without a GIC */ 247 + vm = vm_create_with_one_vcpu(&vcpu, guest_code); 248 + 249 + vm_init_descriptor_tables(vm); 250 + vcpu_init_descriptor_tables(vcpu); 251 + 252 + vm_install_sync_handler(vm, VECTOR_SYNC_CURRENT, 253 + ESR_ELx_EC_UNKNOWN, guest_undef_handler); 254 + 255 + test_run_vcpu(vcpu); 256 + 257 + kvm_vm_free(vm); 258 + } 259 + 260 + int main(int argc, char *argv[]) 261 + { 262 + struct kvm_vcpu *vcpu; 263 + struct kvm_vm *vm; 264 + bool 
has_v3, has_v5; 265 + uint64_t pfr; 266 + 267 + test_disable_default_vgic(); 268 + 269 + vm = vm_create_with_one_vcpu(&vcpu, NULL); 270 + 271 + pfr = vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR0_EL1)); 272 + has_v3 = !!FIELD_GET(ID_AA64PFR0_EL1_GIC, pfr); 273 + 274 + pfr = vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(SYS_ID_AA64PFR2_EL1)); 275 + has_v5 = !!FIELD_GET(ID_AA64PFR2_EL1_GCIE, pfr); 276 + 277 + kvm_vm_free(vm); 278 + 279 + __TEST_REQUIRE(has_v3 || has_v5, 280 + "Neither GICv3 nor GICv5 supported."); 281 + 282 + if (has_v3) { 283 + pr_info("Testing no-vgic-v3\n"); 284 + test_guest_no_vgic(guest_code_gicv3); 285 + } else { 286 + pr_info("No GICv3 support: skipping no-vgic-v3 test\n"); 287 + } 288 + 289 + if (has_v5) { 290 + pr_info("Testing no-vgic-v5\n"); 291 + test_guest_no_vgic(guest_code_gicv5); 292 + } else { 293 + pr_info("No GICv5 support: skipping no-vgic-v5 test\n"); 294 + } 295 + 296 + return 0; 297 + }
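
Reviewer note on the probing step in main() above: the test decides which guest payloads to run by extracting a 4-bit ID field from each feature register and treating any non-zero value as "supported". The following is a minimal user-space sketch of that decision, not the selftest's code. `id_field()`, `gic_v3_supported()` and `PFR0_GIC_SHIFT` are hypothetical names; the shift of 24 for ID_AA64PFR0_EL1.GIC is an assumption taken from the architecture's register layout. The GCIE check follows the same pattern with its own shift, which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: ID_AA64PFR0_EL1.GIC occupies bits [27:24]. */
#define PFR0_GIC_SHIFT 24u

/* Extract one 4-bit ID register field starting at 'shift'. */
static unsigned int id_field(uint64_t reg, unsigned int shift)
{
	return (unsigned int)((reg >> shift) & 0xfu);
}

/* Mirrors 'has_v3 = !!FIELD_GET(...)': any non-zero field value counts. */
static bool gic_v3_supported(uint64_t pfr0)
{
	return id_field(pfr0, PFR0_GIC_SHIFT) != 0;
}

/* Small self-check with made-up register values. */
static void demo_probe(void)
{
	assert(gic_v3_supported(0x0000000001000000ULL));  /* GIC field == 1 */
	assert(!gic_v3_supported(0x0000000010000011ULL)); /* GIC field == 0 */
}
```

The test only skips when neither field is set, so a host with one of the two interfaces still exercises the matching guest payload.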

+45 -7

tools/testing/selftests/kvm/arm64/set_id_regs.c

··· 37 37 * For FTR_LOWER_SAFE, safe_val is used as the minimal safe value. 38 38 */ 39 39 int64_t safe_val; 40 + 41 + /* Allowed to be changed by the host after run */ 42 + bool mutable; 40 43 }; 41 44 42 45 struct test_feature_reg { ··· 47 44 const struct reg_ftr_bits *ftr_bits; 48 45 }; 49 46 50 - #define __REG_FTR_BITS(NAME, SIGNED, TYPE, SHIFT, MASK, SAFE_VAL) \ 47 + #define __REG_FTR_BITS(NAME, SIGNED, TYPE, SHIFT, MASK, SAFE_VAL, MUT) \ 51 48 { \ 52 49 .name = #NAME, \ 53 50 .sign = SIGNED, \ ··· 55 52 .shift = SHIFT, \ 56 53 .mask = MASK, \ 57 54 .safe_val = SAFE_VAL, \ 55 + .mutable = MUT, \ 58 56 } 59 57 60 58 #define REG_FTR_BITS(type, reg, field, safe_val) \ 61 59 __REG_FTR_BITS(reg##_##field, FTR_UNSIGNED, type, reg##_##field##_SHIFT, \ 62 - reg##_##field##_MASK, safe_val) 60 + reg##_##field##_MASK, safe_val, false) 61 + 62 + #define REG_FTR_BITS_MUTABLE(type, reg, field, safe_val) \ 63 + __REG_FTR_BITS(reg##_##field, FTR_UNSIGNED, type, reg##_##field##_SHIFT, \ 64 + reg##_##field##_MASK, safe_val, true) 63 65 64 66 #define S_REG_FTR_BITS(type, reg, field, safe_val) \ 65 67 __REG_FTR_BITS(reg##_##field, FTR_SIGNED, type, reg##_##field##_SHIFT, \ 66 - reg##_##field##_MASK, safe_val) 68 + reg##_##field##_MASK, safe_val, false) 67 69 68 70 #define REG_FTR_END \ 69 71 { \ ··· 142 134 REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, CSV2, 0), 143 135 REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, DIT, 0), 144 136 REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, SEL2, 0), 145 - REG_FTR_BITS(FTR_EXACT, ID_AA64PFR0_EL1, GIC, 0), 137 + /* GICv3 support will be forced at run time if available */ 138 + REG_FTR_BITS_MUTABLE(FTR_EXACT, ID_AA64PFR0_EL1, GIC, 0), 146 139 REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, EL3, 1), 147 140 REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, EL2, 1), 148 141 REG_FTR_BITS(FTR_LOWER_SAFE, ID_AA64PFR0_EL1, EL1, 1), ··· 643 634 ksft_test_result_pass("ID_AA64PFR1_EL1.MTE_frac no longer 0xF\n"); 644 635 } 645 636 637 + static uint64_t 
reset_mutable_bits(uint32_t id, uint64_t val) 638 + { 639 + struct test_feature_reg *reg = NULL; 640 + 641 + for (int i = 0; i < ARRAY_SIZE(test_regs); i++) { 642 + if (test_regs[i].reg == id) { 643 + reg = &test_regs[i]; 644 + break; 645 + } 646 + } 647 + 648 + if (!reg) 649 + return val; 650 + 651 + for (const struct reg_ftr_bits *bits = reg->ftr_bits; bits->type != FTR_END; bits++) { 652 + if (bits->mutable) { 653 + val &= ~bits->mask; 654 + val |= bits->safe_val << bits->shift; 655 + } 656 + } 657 + 658 + return val; 659 + } 660 + 646 661 static void test_guest_reg_read(struct kvm_vcpu *vcpu) 647 662 { 648 663 bool done = false; 649 664 struct ucall uc; 650 665 651 666 while (!done) { 667 + uint64_t val; 668 + 652 669 vcpu_run(vcpu); 653 670 654 671 switch (get_ucall(vcpu, &uc)) { ··· 682 647 REPORT_GUEST_ASSERT(uc); 683 648 break; 684 649 case UCALL_SYNC: 650 + val = test_reg_vals[encoding_to_range_idx(uc.args[2])]; 651 + val = reset_mutable_bits(uc.args[2], val); 652 + 685 653 /* Make sure the written values are seen by guest */ 686 - TEST_ASSERT_EQ(test_reg_vals[encoding_to_range_idx(uc.args[2])], 687 - uc.args[3]); 654 + TEST_ASSERT_EQ(val, reset_mutable_bits(uc.args[2], uc.args[3])); 688 655 break; 689 656 case UCALL_DONE: 690 657 done = true; ··· 777 740 uint64_t observed; 778 741 779 742 observed = vcpu_get_reg(vcpu, KVM_ARM64_SYS_REG(encoding)); 780 - TEST_ASSERT_EQ(test_reg_vals[idx], observed); 743 + TEST_ASSERT_EQ(reset_mutable_bits(encoding, test_reg_vals[idx]), 744 + reset_mutable_bits(encoding, observed)); 781 745 } 782 746 783 747 static void test_reset_preserves_id_regs(struct kvm_vcpu *vcpu)
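
Reviewer note on reset_mutable_bits() above: the helper makes the comparison tolerant of fields that KVM may rewrite at run time by forcing those fields to their safe value on both sides before comparing. Below is a minimal, self-contained sketch of that idea under stated assumptions. `struct ftr_field`, `normalise_mutable()` and `regs_match()` are hypothetical stand-ins for the selftest's own types and are not the names used in the patch; the field positions in `demo_mutable()` are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the selftest's struct reg_ftr_bits. */
struct ftr_field {
	uint8_t shift;     /* bit position of the field's least significant bit */
	uint64_t mask;     /* field mask, already shifted into place */
	uint64_t safe_val; /* value substituted for a mutable field */
	bool mutable;      /* true if the host may change the field after run */
};

/* Replace every mutable field in 'val' with its safe value. */
static uint64_t normalise_mutable(const struct ftr_field *f, size_t n, uint64_t val)
{
	for (size_t i = 0; i < n; i++) {
		if (!f[i].mutable)
			continue;
		val &= ~f[i].mask;
		val |= f[i].safe_val << f[i].shift;
	}
	return val;
}

/* Compare two register values while ignoring the mutable fields. */
static bool regs_match(const struct ftr_field *f, size_t n,
		       uint64_t written, uint64_t observed)
{
	return normalise_mutable(f, n, written) == normalise_mutable(f, n, observed);
}

/* Self-check: a changed mutable field is tolerated, a changed fixed field is not. */
static void demo_mutable(void)
{
	const struct ftr_field fields[] = {
		{ .shift = 24, .mask = 0xfULL << 24, .safe_val = 0, .mutable = true },
		{ .shift = 28, .mask = 0xfULL << 28, .safe_val = 0, .mutable = false },
	};

	assert(regs_match(fields, 2, 0x10000000ULL, 0x11000000ULL));
	assert(!regs_match(fields, 2, 0x10000000ULL, 0x21000000ULL));
}
```

Normalising both operands, rather than only the observed value, keeps the check symmetric: it does not matter whether the written value or the value read back carries the host-modified field.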

+228

tools/testing/selftests/kvm/arm64/vgic_v5.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <linux/kernel.h> 4 + #include <sys/syscall.h> 5 + #include <asm/kvm.h> 6 + #include <asm/kvm_para.h> 7 + 8 + #include <arm64/gic_v5.h> 9 + 10 + #include "test_util.h" 11 + #include "kvm_util.h" 12 + #include "processor.h" 13 + #include "vgic.h" 14 + 15 + #define NR_VCPUS 1 16 + 17 + struct vm_gic { 18 + struct kvm_vm *vm; 19 + int gic_fd; 20 + uint32_t gic_dev_type; 21 + }; 22 + 23 + static uint64_t max_phys_size; 24 + 25 + #define GUEST_CMD_IRQ_CDIA 10 26 + #define GUEST_CMD_IRQ_DIEOI 11 27 + #define GUEST_CMD_IS_AWAKE 12 28 + #define GUEST_CMD_IS_READY 13 29 + 30 + static void guest_irq_handler(struct ex_regs *regs) 31 + { 32 + bool valid; 33 + u32 hwirq; 34 + u64 ia; 35 + static int count; 36 + 37 + /* 38 + * We have pending interrupts. Should never actually enter WFI 39 + * here! 40 + */ 41 + wfi(); 42 + GUEST_SYNC(GUEST_CMD_IS_AWAKE); 43 + 44 + ia = gicr_insn(CDIA); 45 + valid = GICV5_GICR_CDIA_VALID(ia); 46 + 47 + GUEST_SYNC(GUEST_CMD_IRQ_CDIA); 48 + 49 + if (!valid) 50 + return; 51 + 52 + gsb_ack(); 53 + isb(); 54 + 55 + hwirq = FIELD_GET(GICV5_GICR_CDIA_INTID, ia); 56 + 57 + gic_insn(hwirq, CDDI); 58 + gic_insn(0, CDEOI); 59 + 60 + GUEST_SYNC(GUEST_CMD_IRQ_DIEOI); 61 + 62 + if (++count >= 2) 63 + GUEST_DONE(); 64 + 65 + /* Ask for the next interrupt to be injected */ 66 + GUEST_SYNC(GUEST_CMD_IS_READY); 67 + } 68 + 69 + static void guest_code(void) 70 + { 71 + local_irq_disable(); 72 + 73 + gicv5_cpu_enable_interrupts(); 74 + local_irq_enable(); 75 + 76 + /* Enable the SW_PPI (3) */ 77 + write_sysreg_s(BIT_ULL(3), SYS_ICC_PPI_ENABLER0_EL1); 78 + 79 + /* Ask for the first interrupt to be injected */ 80 + GUEST_SYNC(GUEST_CMD_IS_READY); 81 + 82 + /* Loop forever waiting for interrupts */ 83 + while (1); 84 + } 85 + 86 + 87 + /* we don't want to assert on run execution, hence that helper */ 88 + static int run_vcpu(struct kvm_vcpu *vcpu) 89 + { 90 + return __vcpu_run(vcpu) ? 
-errno : 0; 91 + } 92 + 93 + static void vm_gic_destroy(struct vm_gic *v) 94 + { 95 + close(v->gic_fd); 96 + kvm_vm_free(v->vm); 97 + } 98 + 99 + static void test_vgic_v5_ppis(uint32_t gic_dev_type) 100 + { 101 + struct kvm_vcpu *vcpus[NR_VCPUS]; 102 + struct ucall uc; 103 + u64 user_ppis[2]; 104 + struct vm_gic v; 105 + int ret, i; 106 + 107 + v.gic_dev_type = gic_dev_type; 108 + v.vm = __vm_create(VM_SHAPE_DEFAULT, NR_VCPUS, 0); 109 + 110 + v.gic_fd = kvm_create_device(v.vm, gic_dev_type); 111 + 112 + for (i = 0; i < NR_VCPUS; i++) 113 + vcpus[i] = vm_vcpu_add(v.vm, i, guest_code); 114 + 115 + vm_init_descriptor_tables(v.vm); 116 + vm_install_exception_handler(v.vm, VECTOR_IRQ_CURRENT, guest_irq_handler); 117 + 118 + for (i = 0; i < NR_VCPUS; i++) 119 + vcpu_init_descriptor_tables(vcpus[i]); 120 + 121 + kvm_device_attr_set(v.gic_fd, KVM_DEV_ARM_VGIC_GRP_CTRL, 122 + KVM_DEV_ARM_VGIC_CTRL_INIT, NULL); 123 + 124 + /* Read out the PPIs that user space is allowed to drive. */ 125 + kvm_device_attr_get(v.gic_fd, KVM_DEV_ARM_VGIC_GRP_CTRL, 126 + KVM_DEV_ARM_VGIC_USERSPACE_PPIS, &user_ppis); 127 + 128 + /* We should always be able to drive the SW_PPI. */ 129 + TEST_ASSERT(user_ppis[0] & BIT(GICV5_ARCH_PPI_SW_PPI), 130 + "SW_PPI is not drivable by userspace"); 131 + 132 + while (1) { 133 + ret = run_vcpu(vcpus[0]); 134 + 135 + switch (get_ucall(vcpus[0], &uc)) { 136 + case UCALL_SYNC: 137 + /* 138 + * The guest is ready for the next level change. Set 139 + * high if ready, and lower if it has been consumed. 140 + */ 141 + if (uc.args[1] == GUEST_CMD_IS_READY || 142 + uc.args[1] == GUEST_CMD_IRQ_DIEOI) { 143 + u64 irq; 144 + bool level = uc.args[1] == GUEST_CMD_IRQ_DIEOI ? 
0 : 1; 145 + 146 + irq = FIELD_PREP(KVM_ARM_IRQ_NUM_MASK, 3); 147 + irq |= KVM_ARM_IRQ_TYPE_PPI << KVM_ARM_IRQ_TYPE_SHIFT; 148 + 149 + _kvm_irq_line(v.vm, irq, level); 150 + } else if (uc.args[1] == GUEST_CMD_IS_AWAKE) { 151 + pr_info("Guest skipping WFI due to pending IRQ\n"); 152 + } else if (uc.args[1] == GUEST_CMD_IRQ_CDIA) { 153 + pr_info("Guest acknowledged IRQ\n"); 154 + } 155 + 156 + continue; 157 + case UCALL_ABORT: 158 + REPORT_GUEST_ASSERT(uc); 159 + break; 160 + case UCALL_DONE: 161 + goto done; 162 + default: 163 + TEST_FAIL("Unknown ucall %lu", uc.cmd); 164 + } 165 + } 166 + 167 + done: 168 + TEST_ASSERT(ret == 0, "Failed to test GICv5 PPIs"); 169 + 170 + vm_gic_destroy(&v); 171 + } 172 + 173 + /* 174 + * Returns 0 if it's possible to create GIC device of a given type (V5). 175 + */ 176 + int test_kvm_device(uint32_t gic_dev_type) 177 + { 178 + struct kvm_vcpu *vcpus[NR_VCPUS]; 179 + struct vm_gic v; 180 + int ret; 181 + 182 + v.vm = vm_create_with_vcpus(NR_VCPUS, guest_code, vcpus); 183 + 184 + /* try to create a non existing KVM device */ 185 + ret = __kvm_test_create_device(v.vm, 0); 186 + TEST_ASSERT(ret && errno == ENODEV, "unsupported device"); 187 + 188 + /* trial mode */ 189 + ret = __kvm_test_create_device(v.vm, gic_dev_type); 190 + if (ret) 191 + return ret; 192 + v.gic_fd = kvm_create_device(v.vm, gic_dev_type); 193 + 194 + ret = __kvm_create_device(v.vm, gic_dev_type); 195 + TEST_ASSERT(ret < 0 && errno == EEXIST, "create GIC device twice"); 196 + 197 + vm_gic_destroy(&v); 198 + 199 + return 0; 200 + } 201 + 202 + void run_tests(uint32_t gic_dev_type) 203 + { 204 + pr_info("Test VGICv5 PPIs\n"); 205 + test_vgic_v5_ppis(gic_dev_type); 206 + } 207 + 208 + int main(int ac, char **av) 209 + { 210 + int ret; 211 + int pa_bits; 212 + 213 + test_disable_default_vgic(); 214 + 215 + pa_bits = vm_guest_mode_params[VM_MODE_DEFAULT].pa_bits; 216 + max_phys_size = 1ULL << pa_bits; 217 + 218 + ret = test_kvm_device(KVM_DEV_TYPE_ARM_VGIC_V5); 219 + if 
(ret) { 220 + pr_info("No GICv5 support; Not running GIC_v5 tests.\n"); 221 + exit(KSFT_SKIP); 222 + } 223 + 224 + pr_info("Running VGIC_V5 tests.\n"); 225 + run_tests(KVM_DEV_TYPE_ARM_VGIC_V5); 226 + 227 + return 0; 228 + }
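
Reviewer note on the host loop in test_vgic_v5_ppis() above: each guest sync either raises or lowers the SW_PPI line, and the interrupt is identified by a packed value combining an interrupt type and an interrupt number. A minimal sketch of that packing and of the level decision follows. The `IRQ_*` constants are my reading of the arm64 KVM uapi layout (type in bits [27:24], number in bits [15:0], PPI type value 2) and should be treated as assumptions; `encode_ppi_irq()` and `level_for_cmd()` are hypothetical helpers, and the vCPU index field is left at zero as in the single-vCPU test.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror KVM_ARM_IRQ_TYPE_SHIFT, KVM_ARM_IRQ_TYPE_PPI, KVM_ARM_IRQ_NUM_MASK. */
#define IRQ_TYPE_SHIFT 24u
#define IRQ_TYPE_PPI   2u
#define IRQ_NUM_MASK   0xffffu

/* Guest command values as defined in the test above. */
#define GUEST_CMD_IRQ_DIEOI 11
#define GUEST_CMD_IS_READY  13

/* Pack a PPI number into the irq argument (vCPU index left as 0). */
static uint32_t encode_ppi_irq(uint32_t ppi)
{
	return (IRQ_TYPE_PPI << IRQ_TYPE_SHIFT) | (ppi & IRQ_NUM_MASK);
}

/*
 * Decide the line level for a guest command: raise when the guest is
 * ready, lower once it has deactivated and EOIed the interrupt.
 * Returns -1 when the command does not change the line.
 */
static int level_for_cmd(int cmd)
{
	switch (cmd) {
	case GUEST_CMD_IS_READY:
		return 1;
	case GUEST_CMD_IRQ_DIEOI:
		return 0;
	default:
		return -1;
	}
}

/* Self-check: SW_PPI is PPI 3 in the test above. */
static void demo_irq_line(void)
{
	assert(encode_ppi_irq(3) == 0x02000003u);
	assert(level_for_cmd(GUEST_CMD_IS_READY) == 1);
	assert(level_for_cmd(GUEST_CMD_IRQ_DIEOI) == 0);
}
```

Because the PPI is level-sensitive in this test, lowering the line after the guest's CDDI/CDEOI sequence is what lets the second GUEST_CMD_IS_READY produce a fresh interrupt rather than an immediately re-pending one.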

+150

tools/testing/selftests/kvm/include/arm64/gic_v5.h

···
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef __SELFTESTS_GIC_V5_H
+#define __SELFTESTS_GIC_V5_H
+
+#include <asm/barrier.h>
+#include <asm/sysreg.h>
+
+#include <linux/bitfield.h>
+
+#include "processor.h"
+
+/*
+ * Definitions for GICv5 instructions for the Current Domain
+ */
+#define GICV5_OP_GIC_CDAFF	sys_insn(1, 0, 12, 1, 3)
+#define GICV5_OP_GIC_CDDI	sys_insn(1, 0, 12, 2, 0)
+#define GICV5_OP_GIC_CDDIS	sys_insn(1, 0, 12, 1, 0)
+#define GICV5_OP_GIC_CDHM	sys_insn(1, 0, 12, 2, 1)
+#define GICV5_OP_GIC_CDEN	sys_insn(1, 0, 12, 1, 1)
+#define GICV5_OP_GIC_CDEOI	sys_insn(1, 0, 12, 1, 7)
+#define GICV5_OP_GIC_CDPEND	sys_insn(1, 0, 12, 1, 4)
+#define GICV5_OP_GIC_CDPRI	sys_insn(1, 0, 12, 1, 2)
+#define GICV5_OP_GIC_CDRCFG	sys_insn(1, 0, 12, 1, 5)
+#define GICV5_OP_GICR_CDIA	sys_insn(1, 0, 12, 3, 0)
+#define GICV5_OP_GICR_CDNMIA	sys_insn(1, 0, 12, 3, 1)
+
+/* Definitions for GIC CDAFF */
+#define GICV5_GIC_CDAFF_IAFFID_MASK	GENMASK_ULL(47, 32)
+#define GICV5_GIC_CDAFF_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GIC_CDAFF_IRM_MASK	BIT_ULL(28)
+#define GICV5_GIC_CDAFF_ID_MASK	GENMASK_ULL(23, 0)
+
+/* Definitions for GIC CDDI */
+#define GICV5_GIC_CDDI_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GIC_CDDI_ID_MASK	GENMASK_ULL(23, 0)
+
+/* Definitions for GIC CDDIS */
+#define GICV5_GIC_CDDIS_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GIC_CDDIS_TYPE(r)	FIELD_GET(GICV5_GIC_CDDIS_TYPE_MASK, r)
+#define GICV5_GIC_CDDIS_ID_MASK	GENMASK_ULL(23, 0)
+#define GICV5_GIC_CDDIS_ID(r)	FIELD_GET(GICV5_GIC_CDDIS_ID_MASK, r)
+
+/* Definitions for GIC CDEN */
+#define GICV5_GIC_CDEN_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GIC_CDEN_ID_MASK	GENMASK_ULL(23, 0)
+
+/* Definitions for GIC CDHM */
+#define GICV5_GIC_CDHM_HM_MASK	BIT_ULL(32)
+#define GICV5_GIC_CDHM_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GIC_CDHM_ID_MASK	GENMASK_ULL(23, 0)
+
+/* Definitions for GIC CDPEND */
+#define GICV5_GIC_CDPEND_PENDING_MASK	BIT_ULL(32)
+#define GICV5_GIC_CDPEND_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GIC_CDPEND_ID_MASK	GENMASK_ULL(23, 0)
+
+/* Definitions for GIC CDPRI */
+#define GICV5_GIC_CDPRI_PRIORITY_MASK	GENMASK_ULL(39, 35)
+#define GICV5_GIC_CDPRI_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GIC_CDPRI_ID_MASK	GENMASK_ULL(23, 0)
+
+/* Definitions for GIC CDRCFG */
+#define GICV5_GIC_CDRCFG_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GIC_CDRCFG_ID_MASK	GENMASK_ULL(23, 0)
+
+/* Definitions for GICR CDIA */
+#define GICV5_GICR_CDIA_VALID_MASK	BIT_ULL(32)
+#define GICV5_GICR_CDIA_VALID(r)	FIELD_GET(GICV5_GICR_CDIA_VALID_MASK, r)
+#define GICV5_GICR_CDIA_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GICR_CDIA_ID_MASK	GENMASK_ULL(23, 0)
+#define GICV5_GICR_CDIA_INTID	GENMASK_ULL(31, 0)
+
+/* Definitions for GICR CDNMIA */
+#define GICV5_GICR_CDNMIA_VALID_MASK	BIT_ULL(32)
+#define GICV5_GICR_CDNMIA_VALID(r)	FIELD_GET(GICV5_GICR_CDNMIA_VALID_MASK, r)
+#define GICV5_GICR_CDNMIA_TYPE_MASK	GENMASK_ULL(31, 29)
+#define GICV5_GICR_CDNMIA_ID_MASK	GENMASK_ULL(23, 0)
+
+#define gicr_insn(insn)		read_sysreg_s(GICV5_OP_GICR_##insn)
+#define gic_insn(v, insn)	write_sysreg_s(v, GICV5_OP_GIC_##insn)
+
+#define __GIC_BARRIER_INSN(op0, op1, CRn, CRm, op2, Rt)		\
+	__emit_inst(0xd5000000 |				\
+		    sys_insn((op0), (op1), (CRn), (CRm), (op2)) |	\
+		    ((Rt) & 0x1f))
+
+#define GSB_SYS_BARRIER_INSN	__GIC_BARRIER_INSN(1, 0, 12, 0, 0, 31)
+#define GSB_ACK_BARRIER_INSN	__GIC_BARRIER_INSN(1, 0, 12, 0, 1, 31)
+
+#define gsb_ack()	asm volatile(GSB_ACK_BARRIER_INSN : : : "memory")
+#define gsb_sys()	asm volatile(GSB_SYS_BARRIER_INSN : : : "memory")
+
+#define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))
+
+#define GICV5_IRQ_DEFAULT_PRI	0b10000
+
+#define GICV5_ARCH_PPI_SW_PPI	0x3
+
+void gicv5_ppi_priority_init(void)
+{
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR0_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR1_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR2_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR3_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR4_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR5_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR6_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR7_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR8_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR9_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR10_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR11_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR12_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR13_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR14_EL1);
+	write_sysreg_s(REPEAT_BYTE(GICV5_IRQ_DEFAULT_PRI), SYS_ICC_PPI_PRIORITYR15_EL1);
+
+	/*
+	 * Context synchronization is required to make sure the effects of
+	 * the system register writes are synchronised.
+	 */
+	isb();
+}
+
+void gicv5_cpu_disable_interrupts(void)
+{
+	u64 cr0;
+
+	cr0 = FIELD_PREP(ICC_CR0_EL1_EN, 0);
+	write_sysreg_s(cr0, SYS_ICC_CR0_EL1);
+}
+
+void gicv5_cpu_enable_interrupts(void)
+{
+	u64 cr0, pcr;
+
+	write_sysreg_s(0, SYS_ICC_PPI_ENABLER0_EL1);
+	write_sysreg_s(0, SYS_ICC_PPI_ENABLER1_EL1);
+
+	gicv5_ppi_priority_init();
+
+	pcr = FIELD_PREP(ICC_PCR_EL1_PRIORITY, GICV5_IRQ_DEFAULT_PRI);
+	write_sysreg_s(pcr, SYS_ICC_PCR_EL1);
+
+	cr0 = FIELD_PREP(ICC_CR0_EL1_EN, 1);
+	write_sysreg_s(cr0, SYS_ICC_CR0_EL1);
+}
+
+#endif
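
The TYPE/ID masks and REPEAT_BYTE() in this header can be checked on a host without any GICv5 hardware. The following is a minimal userspace sketch, not part of the patch: GENMASK_ULL(), field_prep() and field_get() are local stand-ins for the kernel helpers of similar name, encode_type_id() and check_examples() are hypothetical helpers, and the type value 1 for a PPI is an assumption that the header itself does not define.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK_ULL(h, l): bits h..l set. */
#define GENMASK_ULL(h, l) ((~0ULL << (l)) & (~0ULL >> (63 - (h))))

/* Mask layouts copied from the header (CDEN operand, CDIA result). */
#define CDEN_TYPE_MASK		GENMASK_ULL(31, 29)
#define CDEN_ID_MASK		GENMASK_ULL(23, 0)
#define CDIA_VALID_MASK		(1ULL << 32)

/* Assumption: type value 1 selects a PPI; not defined by the header. */
#define HYPOTHETICAL_TYPE_PPI	1ULL
/* Taken from the header: the software PPI used by the selftest. */
#define GICV5_ARCH_PPI_SW_PPI	0x3ULL

/* Stand-in for FIELD_PREP(): shift val into the field described by mask. */
static uint64_t field_prep(uint64_t mask, uint64_t val)
{
	return (val * (mask & -mask)) & mask;
}

/* Stand-in for FIELD_GET(): extract the field described by mask. */
static uint64_t field_get(uint64_t mask, uint64_t reg)
{
	return (reg & mask) / (mask & -mask);
}

/* Same arithmetic as REPEAT_BYTE(), fixed to 64 bits. */
static uint64_t repeat_byte(uint64_t x)
{
	return (~0ULL / 0xff) * x;
}

/* Hypothetical helper: build a TYPE/ID operand such as the one CDEN takes. */
static uint64_t encode_type_id(uint64_t type, uint64_t id)
{
	return field_prep(CDEN_TYPE_MASK, type) | field_prep(CDEN_ID_MASK, id);
}

static void check_examples(void)
{
	uint64_t op = encode_type_id(HYPOTHETICAL_TYPE_PPI, GICV5_ARCH_PPI_SW_PPI);
	uint64_t ack = CDIA_VALID_MASK | op;	/* shape of a valid CDIA result */

	assert(op == 0x20000003ULL);
	assert(field_get(CDEN_TYPE_MASK, op) == 1);
	assert(field_get(CDEN_ID_MASK, op) == 3);
	assert(field_get(CDIA_VALID_MASK, ack) == 1);

	/* 0b10000 replicated into every byte of a 64-bit register value. */
	assert(repeat_byte(0x10) == 0x1010101010101010ULL);
}
```

Calling check_examples() from a test harness exercises all of the encodings. In the selftest itself, a value built this way would presumably be passed to gic_insn(v, CDEN), and the value returned by gicr_insn(CDIA) decoded with GICV5_GICR_CDIA_VALID().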
