Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch kvm-arm64/pkvm-protected-guest into kvmarm-master/next

* kvm-arm64/pkvm-protected-guest: (41 commits)
: .
: pKVM support for protected guests, implementing the very long
: awaited support for anonymous memory, as the elusive guestmem
: has failed to deliver on its promises despite a multi-year
: effort. Patches courtesy of Will Deacon. From the initial cover
: letter:
:
: "[...] this patch series implements support for protected guest
: memory with pKVM, where pages are unmapped from the host as they are
: faulted into the guest and can be shared back from the guest using pKVM
: hypercalls. Protected guests are created using a new machine type
: identifier and can be booted to a shell using the kvmtool patches
: available at [2], which finally means that we are able to test the pVM
: logic in pKVM. Since this is an incremental step towards full isolation
: from the host (for example, the CPU register state and DMA accesses are
: not yet isolated), creating a pVM requires a developer Kconfig option to
: be enabled in addition to booting with 'kvm-arm.mode=protected' and
: results in a kernel taint."
: .
KVM: arm64: Don't hold 'vm_table_lock' across guest page reclaim
KVM: arm64: Allow get_pkvm_hyp_vm() to take a reference to a dying VM
KVM: arm64: Prevent teardown finalisation of referenced 'hyp_vm'
drivers/virt: pkvm: Add Kconfig dependency on DMA_RESTRICTED_POOL
KVM: arm64: Rename PKVM_PAGE_STATE_MASK
KVM: arm64: Extend pKVM page ownership selftests to cover guest hvcs
KVM: arm64: Extend pKVM page ownership selftests to cover forced reclaim
KVM: arm64: Register 'selftest_vm' in the VM table
KVM: arm64: Extend pKVM page ownership selftests to cover guest donation
KVM: arm64: Add some initial documentation for pKVM
KVM: arm64: Allow userspace to create protected VMs when pKVM is enabled
KVM: arm64: Implement the MEM_UNSHARE hypercall for protected VMs
KVM: arm64: Implement the MEM_SHARE hypercall for protected VMs
KVM: arm64: Add hvc handler at EL2 for hypercalls from protected VMs
KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte
KVM: arm64: Reclaim faulting page from pKVM in spurious fault handler
KVM: arm64: Introduce hypercall to force reclaim of a protected page
KVM: arm64: Annotate guest donations with handle and gfn in host stage-2
KVM: arm64: Change 'pkvm_handle_t' to u16
KVM: arm64: Introduce host_stage2_set_owner_metadata_locked()
...

Signed-off-by: Marc Zyngier <maz@kernel.org>

+1383 -231
+2 -2
Documentation/admin-guide/kernel-parameters.txt
···
 			for the host. To force nVHE on VHE hardware, add
 			"arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the
 			command-line.
-			"nested" is experimental and should be used with
-			extreme caution.
+			"nested" and "protected" are experimental and should be
+			used with extreme caution.

 	kvm-arm.vgic_v3_group0_trap=
 			[KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0
+1
Documentation/virt/kvm/arm/index.rst
···
    fw-pseudo-registers
    hyp-abi
    hypercalls
+   pkvm
    pvtime
    ptp_kvm
    vcpu-features
+106
Documentation/virt/kvm/arm/pkvm.rst
···
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Protected KVM (pKVM)
+====================
+
+**NOTE**: pKVM is currently an experimental, development feature and
+subject to breaking changes as new isolation features are implemented.
+Please reach out to the developers at kvmarm@lists.linux.dev if you have
+any questions.
+
+Overview
+========
+
+Booting a host kernel with '``kvm-arm.mode=protected``' enables
+"Protected KVM" (pKVM). During boot, pKVM installs a stage-2 identity
+map page-table for the host and uses it to isolate the hypervisor
+running at EL2 from the rest of the host running at EL1/0.
+
+pKVM permits creation of protected virtual machines (pVMs) by passing
+the ``KVM_VM_TYPE_ARM_PROTECTED`` machine type identifier to the
+``KVM_CREATE_VM`` ioctl(). The hypervisor isolates pVMs from the host by
+unmapping pages from the stage-2 identity map as they are accessed by a
+pVM. Hypercalls are provided for a pVM to share specific regions of its
+IPA space back with the host, allowing for communication with the VMM.
+A Linux guest must be configured with ``CONFIG_ARM_PKVM_GUEST=y`` in
+order to issue these hypercalls.
+
+See hypercalls.rst for more details.
+
+Isolation mechanisms
+====================
+
+pKVM relies on a number of mechanisms to isolate PVMs from the host:
+
+CPU memory isolation
+--------------------
+
+Status: Isolation of anonymous memory and metadata pages.
+
+Metadata pages (e.g. page-table pages and '``struct kvm_vcpu``' pages)
+are donated from the host to the hypervisor during pVM creation and
+are consequently unmapped from the stage-2 identity map until the pVM is
+destroyed.
+
+Similarly to regular KVM, pages are lazily mapped into the guest in
+response to stage-2 page faults handled by the host. However, when
+running a pVM, these pages are first pinned and then unmapped from the
+stage-2 identity map as part of the donation procedure. This gives rise
+to some user-visible differences when compared to non-protected VMs,
+largely due to the lack of MMU notifiers:
+
+* Memslots cannot be moved or deleted once the pVM has started running.
+* Read-only memslots and dirty logging are not supported.
+* With the exception of swap, file-backed pages cannot be mapped into a
+  pVM.
+* Donated pages are accounted against ``RLIMIT_MLOCK`` and so the VMM
+  must have a sufficient resource limit or be granted ``CAP_IPC_LOCK``.
+  The lack of a runtime reclaim mechanism means that memory locked for
+  a pVM will remain locked until the pVM is destroyed.
+* Changes to the VMM address space (e.g. a ``MAP_FIXED`` mmap() over a
+  mapping associated with a memslot) are not reflected in the guest and
+  may lead to loss of coherency.
+* Accessing pVM memory that has not been shared back will result in the
+  delivery of a SIGSEGV.
+* If a system call accesses pVM memory that has not been shared back
+  then it will either return ``-EFAULT`` or forcefully reclaim the
+  memory pages. Reclaimed memory is zeroed by the hypervisor and a
+  subsequent attempt to access it in the pVM will return ``-EFAULT``
+  from the ``VCPU_RUN`` ioctl().
+
+CPU state isolation
+-------------------
+
+Status: **Unimplemented.**
+
+DMA isolation using an IOMMU
+----------------------------
+
+Status: **Unimplemented.**
+
+Proxying of Trustzone services
+------------------------------
+
+Status: FF-A and PSCI calls from the host are proxied by the pKVM
+hypervisor.
+
+The FF-A proxy ensures that the host cannot share pVM or hypervisor
+memory with Trustzone as part of a "confused deputy" attack.
+
+The PSCI proxy ensures that CPUs always have the stage-2 identity map
+installed when they are executing in the host.
+
+Protected VM firmware (pvmfw)
+-----------------------------
+
+Status: **Unimplemented.**
+
+Resources
+=========
+
+Quentin Perret's KVM Forum 2022 talk entitled "Protected KVM on arm64: A
+technical deep dive" remains a good resource for learning more about
+pKVM, despite some of the details having changed in the meantime:
+
+https://www.youtube.com/watch?v=9npebeVFbFw
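The new documentation says that donated pages count against the VMM's locked-memory limit and stay locked until the pVM is destroyed. A VMM could therefore compare the guest size with that limit before starting a pVM. The following is a minimal userspace sketch, not code from this merge. It assumes that the limit the documentation calls ``RLIMIT_MLOCK`` is the one userspace reads through getrlimit() as RLIMIT_MEMLOCK. Both helper names are hypothetical, and the check ignores CAP_IPC_LOCK and any memory the process has already locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of the merge): does a soft limit of
 * `soft_limit` bytes cover `wanted_bytes` of locked memory?
 */
bool memlock_limit_covers(rlim_t soft_limit, uint64_t wanted_bytes)
{
	if (soft_limit == RLIM_INFINITY)
		return true;
	return wanted_bytes <= (uint64_t)soft_limit;
}

/*
 * Hypothetical helper: compare the guest size with this process's
 * RLIMIT_MEMLOCK soft limit. Returns false if the limit cannot be read.
 */
bool vmm_memlock_limit_ok(uint64_t guest_bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return false;
	return memlock_limit_covers(rl.rlim_cur, guest_bytes);
}

/* Expected behaviour of the pure comparison, written as assertions. */
void memlock_example(void)
{
	assert(memlock_limit_covers(RLIM_INFINITY, UINT64_MAX));
	assert(memlock_limit_covers(4096, 4096));
	assert(!memlock_limit_covers(4096, 4097));
}
```

A VMM would call vmm_memlock_limit_ok() with the total size of its memslots and, if it returns false, either raise the limit or report the problem before the first vCPU runs.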
+20 -11
arch/arm64/include/asm/kvm_asm.h
···
 #include <linux/mm.h>

 enum __kvm_host_smccc_func {
-	/* Hypercalls available only prior to pKVM finalisation */
+	/* Hypercalls that are unavailable once pKVM has finalised. */
 	/* __KVM_HOST_SMCCC_FUNC___kvm_hyp_init */
 	__KVM_HOST_SMCCC_FUNC___pkvm_init = __KVM_HOST_SMCCC_FUNC___kvm_hyp_init + 1,
 	__KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping,
···
 	__KVM_HOST_SMCCC_FUNC___vgic_v3_init_lrs,
 	__KVM_HOST_SMCCC_FUNC___vgic_v3_get_gic_config,
 	__KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize,
+	__KVM_HOST_SMCCC_FUNC_MIN_PKVM = __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize,

-	/* Hypercalls available after pKVM finalisation */
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_wrprotect_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_test_clear_young_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_mkyoung_guest,
+	/* Hypercalls that are always available and common to [nh]VHE/pKVM. */
 	__KVM_HOST_SMCCC_FUNC___kvm_adjust_pc,
 	__KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
 	__KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
···
 	__KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs,
 	__KVM_HOST_SMCCC_FUNC___vgic_v5_save_apr,
 	__KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr,
+	__KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM = __KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr,
+
+	/* Hypercalls that are available only when pKVM has finalised. */
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_donate_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_wrprotect_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_test_clear_young_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_mkyoung_guest,
 	__KVM_HOST_SMCCC_FUNC___pkvm_reserve_vm,
 	__KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm,
 	__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
 	__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
-	__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
+	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_in_poison_fault,
+	__KVM_HOST_SMCCC_FUNC___pkvm_force_reclaim_guest_page,
+	__KVM_HOST_SMCCC_FUNC___pkvm_reclaim_dying_guest_page,
+	__KVM_HOST_SMCCC_FUNC___pkvm_start_teardown_vm,
+	__KVM_HOST_SMCCC_FUNC___pkvm_finalize_teardown_vm,
 	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_load,
 	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_put,
 	__KVM_HOST_SMCCC_FUNC___pkvm_tlb_flush_vmid,
+8 -1
arch/arm64/include/asm/kvm_host.h
···
 	unsigned long vendor_hyp_bmap_2; /* Function numbers 64-127 */
 };

-typedef unsigned int pkvm_handle_t;
+typedef u16 pkvm_handle_t;

 struct kvm_protected_vm {
 	pkvm_handle_t handle;
···
 	struct kvm_hyp_memcache stage2_teardown_mc;
 	bool is_protected;
 	bool is_created;
+
+	/*
+	 * True when the guest is being torn down. When in this state, the
+	 * guest's vCPUs can't be loaded anymore, but its pages can be
+	 * reclaimed by the host.
+	 */
+	bool is_dying;
 };

 struct kvm_mpidr_data {
+33 -12
arch/arm64/include/asm/kvm_pgtable.h
···
 					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
 					 KVM_PTE_LEAF_ATTR_HI_S2_XN)

-#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID		1
+/* pKVM invalid pte encodings */
+#define KVM_INVALID_PTE_TYPE_MASK	GENMASK(63, 60)
+#define KVM_INVALID_PTE_ANNOT_MASK	~(KVM_PTE_VALID | \
+					  KVM_INVALID_PTE_TYPE_MASK)

-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED		BIT(10)
+enum kvm_invalid_pte_type {
+	/*
+	 * Used to indicate a pte for which a 'break-before-make'
+	 * sequence is in progress.
+	 */
+	KVM_INVALID_PTE_TYPE_LOCKED	= 1,
+
+	/*
+	 * pKVM has unmapped the page from the host due to a change of
+	 * ownership.
+	 */
+	KVM_HOST_INVALID_PTE_TYPE_DONATION,
+
+	/*
+	 * The page has been forcefully reclaimed from the guest by the
+	 * host.
+	 */
+	KVM_GUEST_INVALID_PTE_TYPE_POISONED,
+};

 static inline bool kvm_pte_valid(kvm_pte_t pte)
 {
···
 			   void *mc, enum kvm_pgtable_walk_flags flags);

 /**
- * kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space to
- *				    track ownership.
+ * kvm_pgtable_stage2_annotate() - Unmap and annotate pages in the IPA space
+ *				   to track ownership (and more).
  * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
  * @addr:	Base intermediate physical address to annotate.
  * @size:	Size of the annotated range.
  * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
  *		page-table pages.
- * @owner_id:	Unique identifier for the owner of the page.
+ * @type:	The type of the annotation, determining its meaning and format.
+ * @annotation:	A 59-bit value that will be stored in the page tables.
+ *		@annotation[0] and @annotation[63:60] must be 0.
+ *		@annotation[59:1] is stored in the page tables, along
+ *		with @type.
  *
  * By default, all page-tables are owned by identifier 0. This function can be
  * used to mark portions of the IPA space as owned by other entities. When a
···
  *
  * Return: 0 on success, negative error code on failure.
  */
-int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
-				 void *mc, u8 owner_id);
+int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size,
+				void *mc, enum kvm_invalid_pte_type type,
+				kvm_pte_t annotation);

 /**
  * kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 page-table.
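The kernel-doc above describes an invalid pte as a 4-bit type in bits [63:60] plus an annotation in bits [59:1], with bit 0 (the valid bit) left clear. The following userspace sketch illustrates that layout. It is not the kernel implementation: it assumes the type and annotation are simply OR-ed together, and the helper names make_invalid_pte(), invalid_pte_type() and invalid_pte_annotation() are hypothetical. BIT_ULL() and GENMASK_ULL() are local stand-ins for the kernel macros.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t kvm_pte_t;

/* Userspace stand-ins for the kernel's BIT()/GENMASK() helpers. */
#define BIT_ULL(n)		(1ULL << (n))
#define GENMASK_ULL(h, l)	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

#define KVM_PTE_VALID			BIT_ULL(0)
#define KVM_INVALID_PTE_TYPE_MASK	GENMASK_ULL(63, 60)
#define KVM_INVALID_PTE_ANNOT_MASK	(~(KVM_PTE_VALID | KVM_INVALID_PTE_TYPE_MASK))

enum kvm_invalid_pte_type {
	KVM_INVALID_PTE_TYPE_LOCKED = 1,
	KVM_HOST_INVALID_PTE_TYPE_DONATION,
	KVM_GUEST_INVALID_PTE_TYPE_POISONED,
};

/*
 * Hypothetical helper: build an invalid pte. Returns 0 and fills *pte, or -1
 * if the type needs more than 4 bits or the annotation strays outside [59:1].
 */
int make_invalid_pte(enum kvm_invalid_pte_type type, kvm_pte_t annotation,
		     kvm_pte_t *pte)
{
	if ((unsigned int)type > 0xf)
		return -1;
	if (annotation & ~KVM_INVALID_PTE_ANNOT_MASK)
		return -1;

	*pte = ((kvm_pte_t)type << 60) | annotation;
	return 0;
}

/* Hypothetical helper: extract the type field from bits [63:60]. */
enum kvm_invalid_pte_type invalid_pte_type(kvm_pte_t pte)
{
	return (enum kvm_invalid_pte_type)((pte & KVM_INVALID_PTE_TYPE_MASK) >> 60);
}

/* Hypothetical helper: extract the annotation held in bits [59:1]. */
kvm_pte_t invalid_pte_annotation(kvm_pte_t pte)
{
	return pte & KVM_INVALID_PTE_ANNOT_MASK;
}

/* Round trip of a donation annotation, written as assertions. */
void invalid_pte_example(void)
{
	kvm_pte_t pte;

	assert(make_invalid_pte(KVM_HOST_INVALID_PTE_TYPE_DONATION,
				0x1234ULL << 4, &pte) == 0);
	assert(!(pte & KVM_PTE_VALID));
	assert(invalid_pte_type(pte) == KVM_HOST_INVALID_PTE_TYPE_DONATION);
	assert(invalid_pte_annotation(pte) == (0x1234ULL << 4));
}
```

Keeping the type in the top nibble leaves the whole of bits [59:1] free for per-type payloads, which is what lets one annotate call serve locking, donation and poisoning.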
+1 -3
arch/arm64/include/asm/kvm_pkvm.h
···

 #define HYP_MEMBLOCK_REGIONS 128

-int pkvm_init_host_vm(struct kvm *kvm);
+int pkvm_init_host_vm(struct kvm *kvm, unsigned long type);
 int pkvm_create_hyp_vm(struct kvm *kvm);
 bool pkvm_hyp_vm_is_created(struct kvm *kvm);
 void pkvm_destroy_hyp_vm(struct kvm *kvm);
···
 	case KVM_CAP_MAX_VCPU_ID:
 	case KVM_CAP_MSI_DEVID:
 	case KVM_CAP_ARM_VM_IPA_SIZE:
-	case KVM_CAP_ARM_PMU_V3:
-	case KVM_CAP_ARM_SVE:
 	case KVM_CAP_ARM_PTRAUTH_ADDRESS:
 	case KVM_CAP_ARM_PTRAUTH_GENERIC:
 		return true;
+9
arch/arm64/include/asm/virt.h
···
 		static_branch_likely(&kvm_protected_mode_initialized);
 }

+#ifdef CONFIG_KVM
+bool pkvm_force_reclaim_guest_page(phys_addr_t phys);
+#else
+static inline bool pkvm_force_reclaim_guest_page(phys_addr_t phys)
+{
+	return false;
+}
+#endif
+
 /* Reports the availability of HYP mode */
 static inline bool is_hyp_mode_available(void)
 {
+10 -2
arch/arm64/kvm/arm.c
···
 {
 	int ret;

+	if (type & ~KVM_VM_TYPE_ARM_MASK)
+		return -EINVAL;
+
 	mutex_init(&kvm->arch.config_lock);

 #ifdef CONFIG_LOCKDEP
···
 		 * If any failures occur after this is successful, make sure to
 		 * call __pkvm_unreserve_vm to unreserve the VM in hyp.
 		 */
-		ret = pkvm_init_host_vm(kvm);
+		ret = pkvm_init_host_vm(kvm, type);
 		if (ret)
-			goto err_free_cpumask;
+			goto err_uninit_mmu;
+	} else if (type & KVM_VM_TYPE_ARM_PROTECTED) {
+		ret = -EINVAL;
+		goto err_uninit_mmu;
 	}

 	kvm_vgic_early_init(kvm);
···

 	return 0;

+err_uninit_mmu:
+	kvm_uninit_stage2_mmu(kvm);
 err_free_cpumask:
 	free_cpumask_var(kvm->arch.supported_cpus);
 err_unshare_kvm:
+9 -1
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
···
 enum pkvm_component_id {
 	PKVM_ID_HOST,
 	PKVM_ID_HYP,
-	PKVM_ID_FFA,
+	PKVM_ID_GUEST,
 };

 int __pkvm_prot_finalize(void);
 int __pkvm_host_share_hyp(u64 pfn);
+int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn);
+int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn);
 int __pkvm_host_unshare_hyp(u64 pfn);
 int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
 int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
 int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
 int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
+int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu);
+int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu);
+int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys);
+int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm);
 int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu,
 			    enum kvm_pgtable_prot prot);
 int __pkvm_host_unshare_guest(u64 gfn, u64 nr_pages, struct pkvm_hyp_vm *hyp_vm);
···

 #ifdef CONFIG_NVHE_EL2_DEBUG
 void pkvm_ownership_selftest(void *base);
+struct pkvm_hyp_vcpu *init_selftest_vm(void *virt);
+void teardown_selftest_vm(void);
 #else
 static inline void pkvm_ownership_selftest(void *base) { }
 #endif
+9 -3
arch/arm64/kvm/hyp/include/nvhe/memory.h
···
 	 * struct hyp_page.
 	 */
 	PKVM_NOPAGE			= BIT(0) | BIT(1),
+
+	/*
+	 * 'Meta-states' which aren't encoded directly in the PTE's SW bits (or
+	 * the hyp_vmemmap entry for the host)
+	 */
+	PKVM_POISON			= BIT(2),
 };
-#define PKVM_PAGE_STATE_MASK		(BIT(0) | BIT(1))
+#define PKVM_PAGE_STATE_VMEMMAP_MASK	(BIT(0) | BIT(1))

 #define PKVM_PAGE_STATE_PROT_MASK	(KVM_PGTABLE_PROT_SW0 | KVM_PGTABLE_PROT_SW1)
 static inline enum kvm_pgtable_prot pkvm_mkstate(enum kvm_pgtable_prot prot,
···

 static inline enum pkvm_page_state get_hyp_state(struct hyp_page *p)
 {
-	return p->__hyp_state_comp ^ PKVM_PAGE_STATE_MASK;
+	return p->__hyp_state_comp ^ PKVM_PAGE_STATE_VMEMMAP_MASK;
 }

 static inline void set_hyp_state(struct hyp_page *p, enum pkvm_page_state state)
 {
-	p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_MASK;
+	p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_VMEMMAP_MASK;
 }

 /*
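get_hyp_state() and set_hyp_state() store the state XOR-ed with the two-bit mask, so a zero-filled field decodes as PKVM_NOPAGE rather than as an owned page. The sketch below shows this in userspace. struct hyp_page is cut down to the single field, its 8-bit width is an assumption, and PKVM_PAGE_OWNED = 0 is assumed because the hunk above does not show it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BIT(n)	(1U << (n))

enum pkvm_page_state {
	PKVM_PAGE_OWNED	= 0,			/* assumed; not shown in the hunk */
	PKVM_NOPAGE	= BIT(0) | BIT(1),
};
#define PKVM_PAGE_STATE_VMEMMAP_MASK	(BIT(0) | BIT(1))

/* Cut-down stand-in for the kernel's struct hyp_page (field width assumed). */
struct hyp_page {
	uint8_t __hyp_state_comp;
};

/* Decode: XOR with the mask turns the stored complement back into a state. */
enum pkvm_page_state get_hyp_state(const struct hyp_page *p)
{
	return p->__hyp_state_comp ^ PKVM_PAGE_STATE_VMEMMAP_MASK;
}

/* Encode: store the state XOR-ed with the mask. */
void set_hyp_state(struct hyp_page *p, enum pkvm_page_state state)
{
	p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_VMEMMAP_MASK;
}

/* A zeroed entry reads back as "no page"; an owned page is stored as 3. */
void hyp_state_example(void)
{
	struct hyp_page page;

	memset(&page, 0, sizeof(page));
	assert(get_hyp_state(&page) == PKVM_NOPAGE);

	set_hyp_state(&page, PKVM_PAGE_OWNED);
	assert(page.__hyp_state_comp == 3);
	assert(get_hyp_state(&page) == PKVM_PAGE_OWNED);
}
```

Because the mask only covers bits 0 and 1, a meta-state such as PKVM_POISON (bit 2) falls outside this encoding, which may be why the merge renames the mask to PKVM_PAGE_STATE_VMEMMAP_MASK.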
+6 -1
arch/arm64/kvm/hyp/include/nvhe/pkvm.h
···
 		   unsigned long pgd_hva);
 int __pkvm_init_vcpu(pkvm_handle_t handle, struct kvm_vcpu *host_vcpu,
 		     unsigned long vcpu_hva);
-int __pkvm_teardown_vm(pkvm_handle_t handle);

+int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn);
+int __pkvm_start_teardown_vm(pkvm_handle_t handle);
+int __pkvm_finalize_teardown_vm(pkvm_handle_t handle);
+
+struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle);
 struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle,
 					 unsigned int vcpu_idx);
 void pkvm_put_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu);
···
 struct pkvm_hyp_vm *get_np_pkvm_hyp_vm(pkvm_handle_t handle);
 void put_pkvm_hyp_vm(struct pkvm_hyp_vm *hyp_vm);

+bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code);
 bool kvm_handle_pvm_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code);
 bool kvm_handle_pvm_restricted(struct kvm_vcpu *vcpu, u64 *exit_code);
 void kvm_init_pvm_id_regs(struct kvm_vcpu *vcpu);
+2
arch/arm64/kvm/hyp/include/nvhe/trap_handler.h
···
 	__always_unused int ___check_reg_ ## reg;			\
 	type name = (type)cpu_reg(ctxt, (reg))

+void inject_host_exception(u64 esr);
+
 #endif /* __ARM64_KVM_NVHE_TRAP_HANDLER_H__ */
+113 -71
arch/arm64/kvm/hyp/nvhe/hyp-main.c
···
 	DECLARE_REG(u64, hcr_el2, host_ctxt, 3);
 	struct pkvm_hyp_vcpu *hyp_vcpu;

-	if (!is_protected_kvm_enabled())
-		return;
-
 	hyp_vcpu = pkvm_load_hyp_vcpu(handle, vcpu_idx);
 	if (!hyp_vcpu)
 		return;
···

 static void handle___pkvm_vcpu_put(struct kvm_cpu_context *host_ctxt)
 {
-	struct pkvm_hyp_vcpu *hyp_vcpu;
+	struct pkvm_hyp_vcpu *hyp_vcpu = pkvm_get_loaded_hyp_vcpu();

-	if (!is_protected_kvm_enabled())
-		return;
-
-	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 	if (hyp_vcpu)
 		pkvm_put_hyp_vcpu(hyp_vcpu);
 }
···
 			       &host_vcpu->arch.pkvm_memcache);
 }

+static void handle___pkvm_host_donate_guest(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(u64, pfn, host_ctxt, 1);
+	DECLARE_REG(u64, gfn, host_ctxt, 2);
+	struct pkvm_hyp_vcpu *hyp_vcpu;
+	int ret = -EINVAL;
+
+	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
+	if (!hyp_vcpu || !pkvm_hyp_vcpu_is_protected(hyp_vcpu))
+		goto out;
+
+	ret = pkvm_refill_memcache(hyp_vcpu);
+	if (ret)
+		goto out;
+
+	ret = __pkvm_host_donate_guest(pfn, gfn, hyp_vcpu);
+out:
+	cpu_reg(host_ctxt, 1) = ret;
+}
+
 static void handle___pkvm_host_share_guest(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(u64, pfn, host_ctxt, 1);
···
 	DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4);
 	struct pkvm_hyp_vcpu *hyp_vcpu;
 	int ret = -EINVAL;
-
-	if (!is_protected_kvm_enabled())
-		goto out;

 	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 	if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu))
···
 	struct pkvm_hyp_vm *hyp_vm;
 	int ret = -EINVAL;

-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vm = get_np_pkvm_hyp_vm(handle);
 	if (!hyp_vm)
 		goto out;
···
 	struct pkvm_hyp_vcpu *hyp_vcpu;
 	int ret = -EINVAL;

-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 	if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu))
 		goto out;
···
 	DECLARE_REG(u64, nr_pages, host_ctxt, 3);
 	struct pkvm_hyp_vm *hyp_vm;
 	int ret = -EINVAL;
-
-	if (!is_protected_kvm_enabled())
-		goto out;

 	hyp_vm = get_np_pkvm_hyp_vm(handle);
 	if (!hyp_vm)
···
 	struct pkvm_hyp_vm *hyp_vm;
 	int ret = -EINVAL;

-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vm = get_np_pkvm_hyp_vm(handle);
 	if (!hyp_vm)
 		goto out;
···
 	DECLARE_REG(u64, gfn, host_ctxt, 1);
 	struct pkvm_hyp_vcpu *hyp_vcpu;
 	int ret = -EINVAL;
-
-	if (!is_protected_kvm_enabled())
-		goto out;

 	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 	if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu))
···
 static void handle___pkvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);
-	struct pkvm_hyp_vm *hyp_vm;
+	struct pkvm_hyp_vm *hyp_vm = get_np_pkvm_hyp_vm(handle);

-	if (!is_protected_kvm_enabled())
-		return;
-
-	hyp_vm = get_np_pkvm_hyp_vm(handle);
 	if (!hyp_vm)
 		return;

···
 	cpu_reg(host_ctxt, 1) = __pkvm_init_vcpu(handle, host_vcpu, vcpu_hva);
 }

-static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt)
+static void handle___pkvm_vcpu_in_poison_fault(struct kvm_cpu_context *host_ctxt)
+{
+	int ret;
+	struct pkvm_hyp_vcpu *hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
+
+	ret = hyp_vcpu ? __pkvm_vcpu_in_poison_fault(hyp_vcpu) : -EINVAL;
+	cpu_reg(host_ctxt, 1) = ret;
+}
+
+static void handle___pkvm_force_reclaim_guest_page(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
+
+	cpu_reg(host_ctxt, 1) = __pkvm_host_force_reclaim_page_guest(phys);
+}
+
+static void handle___pkvm_reclaim_dying_guest_page(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);
+	DECLARE_REG(u64, gfn, host_ctxt, 2);
+
+	cpu_reg(host_ctxt, 1) = __pkvm_reclaim_dying_guest_page(handle, gfn);
+}
+
+static void handle___pkvm_start_teardown_vm(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);

-	cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle);
+	cpu_reg(host_ctxt, 1) = __pkvm_start_teardown_vm(handle);
+}
+
+static void handle___pkvm_finalize_teardown_vm(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);
+
+	cpu_reg(host_ctxt, 1) = __pkvm_finalize_teardown_vm(handle);
 }

 static void handle___tracing_load(struct kvm_cpu_context *host_ctxt)
···
 	HANDLE_FUNC(__vgic_v3_get_gic_config),
 	HANDLE_FUNC(__pkvm_prot_finalize),

-	HANDLE_FUNC(__pkvm_host_share_hyp),
-	HANDLE_FUNC(__pkvm_host_unshare_hyp),
-	HANDLE_FUNC(__pkvm_host_share_guest),
-	HANDLE_FUNC(__pkvm_host_unshare_guest),
-	HANDLE_FUNC(__pkvm_host_relax_perms_guest),
-	HANDLE_FUNC(__pkvm_host_wrprotect_guest),
-	HANDLE_FUNC(__pkvm_host_test_clear_young_guest),
-	HANDLE_FUNC(__pkvm_host_mkyoung_guest),
 	HANDLE_FUNC(__kvm_adjust_pc),
 	HANDLE_FUNC(__kvm_vcpu_run),
 	HANDLE_FUNC(__kvm_flush_vm_context),
···
 	HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs),
 	HANDLE_FUNC(__vgic_v5_save_apr),
 	HANDLE_FUNC(__vgic_v5_restore_vmcr_apr),
+
+	HANDLE_FUNC(__pkvm_host_share_hyp),
+	HANDLE_FUNC(__pkvm_host_unshare_hyp),
+	HANDLE_FUNC(__pkvm_host_donate_guest),
+	HANDLE_FUNC(__pkvm_host_share_guest),
+	HANDLE_FUNC(__pkvm_host_unshare_guest),
+	HANDLE_FUNC(__pkvm_host_relax_perms_guest),
+	HANDLE_FUNC(__pkvm_host_wrprotect_guest),
+	HANDLE_FUNC(__pkvm_host_test_clear_young_guest),
+	HANDLE_FUNC(__pkvm_host_mkyoung_guest),
 	HANDLE_FUNC(__pkvm_reserve_vm),
 	HANDLE_FUNC(__pkvm_unreserve_vm),
 	HANDLE_FUNC(__pkvm_init_vm),
 	HANDLE_FUNC(__pkvm_init_vcpu),
-	HANDLE_FUNC(__pkvm_teardown_vm),
+	HANDLE_FUNC(__pkvm_vcpu_in_poison_fault),
+	HANDLE_FUNC(__pkvm_force_reclaim_guest_page),
+	HANDLE_FUNC(__pkvm_reclaim_dying_guest_page),
+	HANDLE_FUNC(__pkvm_start_teardown_vm),
+	HANDLE_FUNC(__pkvm_finalize_teardown_vm),
 	HANDLE_FUNC(__pkvm_vcpu_load),
 	HANDLE_FUNC(__pkvm_vcpu_put),
 	HANDLE_FUNC(__pkvm_tlb_flush_vmid),
···
 static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(unsigned long, id, host_ctxt, 0);
-	unsigned long hcall_min = 0;
+	unsigned long hcall_min = 0, hcall_max = -1;
 	hcall_t hfn;

 	/*
···
 	 * basis. This is all fine, however, since __pkvm_prot_finalize
 	 * returns -EPERM after the first call for a given CPU.
 	 */
-	if (static_branch_unlikely(&kvm_protected_mode_initialized))
-		hcall_min = __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize;
+	if (static_branch_unlikely(&kvm_protected_mode_initialized)) {
+		hcall_min = __KVM_HOST_SMCCC_FUNC_MIN_PKVM;
+	} else {
+		hcall_max = __KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM;
+	}

 	id &= ~ARM_SMCCC_CALL_HINTS;
 	id -= KVM_HOST_SMCCC_ID(0);

-	if (unlikely(id < hcall_min || id >= ARRAY_SIZE(host_hcall)))
+	if (unlikely(id < hcall_min || id > hcall_max ||
+		     id >= ARRAY_SIZE(host_hcall))) {
 		goto inval;
+	}

 	hfn = host_hcall[id];
 	if (unlikely(!hfn))
···
 	kvm_skip_host_instr();
 }

-/*
- * Inject an Undefined Instruction exception into the host.
- *
- * This is open-coded to allow control over PSTATE construction without
- * complicating the generic exception entry helpers.
- */
-static void inject_undef64(void)
+void inject_host_exception(u64 esr)
 {
-	u64 spsr_mask, vbar, sctlr, old_spsr, new_spsr, esr, offset;
+	u64 sctlr, spsr_el1, spsr_el2, exc_offset = except_type_sync;
+	const u64 spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT |
+			      PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT;

-	spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT;
+	spsr_el1 = spsr_el2 = read_sysreg_el2(SYS_SPSR);
+	switch (spsr_el1 & (PSR_MODE_MASK | PSR_MODE32_BIT)) {
+	case PSR_MODE_EL0t:
+		exc_offset += LOWER_EL_AArch64_VECTOR;
+		break;
+	case PSR_MODE_EL0t | PSR_MODE32_BIT:
+		exc_offset += LOWER_EL_AArch32_VECTOR;
+		break;
+	default:
+		exc_offset += CURRENT_EL_SP_ELx_VECTOR;
+	}

-	vbar = read_sysreg_el1(SYS_VBAR);
+	spsr_el2 &= spsr_mask;
+	spsr_el2 |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT |
+		    PSR_MODE_EL1h;
+
 	sctlr = read_sysreg_el1(SYS_SCTLR);
-	old_spsr = read_sysreg_el2(SYS_SPSR);
-
-	new_spsr = old_spsr & spsr_mask;
-	new_spsr |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT;
-	new_spsr |= PSR_MODE_EL1h;
-
 	if (!(sctlr & SCTLR_EL1_SPAN))
-		new_spsr |= PSR_PAN_BIT;
+		spsr_el2 |= PSR_PAN_BIT;

 	if (sctlr & SCTLR_ELx_DSSBS)
-		new_spsr |= PSR_SSBS_BIT;
+		spsr_el2 |= PSR_SSBS_BIT;

 	if (system_supports_mte())
-		new_spsr |= PSR_TCO_BIT;
+		spsr_el2 |= PSR_TCO_BIT;

-	esr = (ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT) | ESR_ELx_IL;
-	offset = CURRENT_EL_SP_ELx_VECTOR + except_type_sync;
+	if (esr_fsc_is_translation_fault(esr))
+		write_sysreg_el1(read_sysreg_el2(SYS_FAR), SYS_FAR);

 	write_sysreg_el1(esr, SYS_ESR);
 	write_sysreg_el1(read_sysreg_el2(SYS_ELR), SYS_ELR);
-	write_sysreg_el1(old_spsr, SYS_SPSR);
-	write_sysreg_el2(vbar + offset, SYS_ELR);
-	write_sysreg_el2(new_spsr, SYS_SPSR);
+	write_sysreg_el1(spsr_el1, SYS_SPSR);
+	write_sysreg_el2(read_sysreg_el1(SYS_VBAR) + exc_offset, SYS_ELR);
+	write_sysreg_el2(spsr_el2, SYS_SPSR);
+}
+
+static void inject_host_undef64(void)
+{
+	inject_host_exception((ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT) |
+			      ESR_ELx_IL);
 }

 static bool handle_host_mte(u64 esr)
···
 		return false;
 	}

-	inject_undef64();
+	inject_host_undef64();
 	return true;
 }
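The reworked handle_host_hcall() bounds the accepted hypercall IDs from both sides: once pKVM has finalised, IDs below __KVM_HOST_SMCCC_FUNC_MIN_PKVM are rejected, and before that, IDs above __KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM are rejected. The sketch below models only that range check, using a made-up, shortened ID table in which just the ordering mirrors the real enum. All DEMO_* names and demo_hcall_allowed() are hypothetical, and the kernel's follow-up check for a NULL handler is not modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Made-up, shortened ID table: only the ordering mirrors the real enum. */
enum demo_hcall {
	DEMO_HYP_INIT,				/* pre-finalisation only */
	DEMO_PKVM_INIT,
	DEMO_PROT_FINALIZE,
	DEMO_MIN_PKVM = DEMO_PROT_FINALIZE,

	DEMO_ADJUST_PC,				/* always available */
	DEMO_VCPU_RUN,
	DEMO_MAX_NO_PKVM = DEMO_VCPU_RUN,

	DEMO_HOST_SHARE_HYP,			/* pKVM only */
	DEMO_HOST_DONATE_GUEST,

	DEMO_NR_HCALLS,
};

/* Hypothetical model of the range check; true means the ID is accepted. */
bool demo_hcall_allowed(unsigned long id, bool pkvm_finalised)
{
	/* ULONG_MAX plays the role of the kernel's "hcall_max = -1". */
	unsigned long hcall_min = 0, hcall_max = ULONG_MAX;

	if (pkvm_finalised)
		hcall_min = DEMO_MIN_PKVM;
	else
		hcall_max = DEMO_MAX_NO_PKVM;

	return !(id < hcall_min || id > hcall_max || id >= DEMO_NR_HCALLS);
}

/* Which IDs each mode accepts, written as assertions. */
void demo_hcall_example(void)
{
	/* Before finalisation: init and common calls pass, pKVM-only calls fail. */
	assert(demo_hcall_allowed(DEMO_PKVM_INIT, false));
	assert(demo_hcall_allowed(DEMO_VCPU_RUN, false));
	assert(!demo_hcall_allowed(DEMO_HOST_SHARE_HYP, false));

	/* After finalisation: init calls fail, everything from MIN_PKVM passes. */
	assert(!demo_hcall_allowed(DEMO_PKVM_INIT, true));
	assert(demo_hcall_allowed(DEMO_PROT_FINALIZE, true));
	assert(demo_hcall_allowed(DEMO_HOST_DONATE_GUEST, true));
}
```

Moving the pKVM-only IDs to the end of the enum is what makes a single upper bound sufficient, and it appears to be why the per-handler is_protected_kvm_enabled() checks could be dropped in this merge.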
+530 -57
arch/arm64/kvm/hyp/nvhe/mem_protect.c
···
 #include <nvhe/memory.h>
 #include <nvhe/mem_protect.h>
 #include <nvhe/mm.h>
+#include <nvhe/trap_handler.h>

 #define KVM_HOST_S2_FLAGS (KVM_PGTABLE_S2_AS_S1 | KVM_PGTABLE_S2_IDMAP)

···
 static inline int __host_stage2_idmap(u64 start, u64 end,
 				      enum kvm_pgtable_prot prot)
 {
+	/*
+	 * We don't make permission changes to the host idmap after
+	 * initialisation, so we can squash -EAGAIN to save callers
+	 * having to treat it like success in the case that they try to
+	 * map something that is already mapped.
+	 */
 	return kvm_pgtable_stage2_map(&host_mmu.pgt, start, end - start, start,
-				      prot, &host_s2_pool, 0);
+				      prot, &host_s2_pool,
+				      KVM_PGTABLE_WALK_IGNORE_EAGAIN);
 }

 /*
···
 		return ret;

 	if (kvm_pte_valid(pte))
-		return -EAGAIN;
+		return -EEXIST;

 	if (pte) {
 		WARN_ON(addr_is_memory(addr) &&
···
 		set_host_state(page, state);
 }

-int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
+#define KVM_HOST_DONATION_PTE_OWNER_MASK	GENMASK(3, 1)
+#define KVM_HOST_DONATION_PTE_EXTRA_MASK	GENMASK(59, 4)
+static int host_stage2_set_owner_metadata_locked(phys_addr_t addr, u64 size,
+						 u8 owner_id, u64 meta)
 {
+	kvm_pte_t annotation;
 	int ret;
+
+	if (owner_id == PKVM_ID_HOST)
+		return -EINVAL;

 	if (!range_is_memory(addr, addr + size))
 		return -EPERM;

-	ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt,
-			      addr, size, &host_s2_pool, owner_id);
-	if (ret)
-		return ret;
+	if (!FIELD_FIT(KVM_HOST_DONATION_PTE_OWNER_MASK, owner_id))
+		return -EINVAL;

-	/* Don't forget to update the vmemmap tracking for the host */
-	if (owner_id == PKVM_ID_HOST)
-		__host_update_page_state(addr, size, PKVM_PAGE_OWNED);
-	else
+	if (!FIELD_FIT(KVM_HOST_DONATION_PTE_EXTRA_MASK, meta))
+		return -EINVAL;
+
+	annotation = FIELD_PREP(KVM_HOST_DONATION_PTE_OWNER_MASK, owner_id) |
+		     FIELD_PREP(KVM_HOST_DONATION_PTE_EXTRA_MASK, meta);
+	ret = host_stage2_try(kvm_pgtable_stage2_annotate, &host_mmu.pgt,
+			      addr, size, &host_s2_pool,
+			      KVM_HOST_INVALID_PTE_TYPE_DONATION, annotation);
+	if (!ret)
 		__host_update_page_state(addr, size, PKVM_NOPAGE);

+	return ret;
+}
+
+int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
+{
+	int ret = -EINVAL;
+
+	switch (owner_id) {
+	case PKVM_ID_HOST:
+		if (!range_is_memory(addr, addr + size))
+			return -EPERM;
+
+		ret = host_stage2_idmap_locked(addr, size, PKVM_HOST_MEM_PROT);
+		if (!ret)
+			__host_update_page_state(addr, size, PKVM_PAGE_OWNED);
+		break;
+	case PKVM_ID_HYP:
+		ret = host_stage2_set_owner_metadata_locked(addr, size,
+							    owner_id, 0);
+		break;
+	}
+
+	return ret;
+}
+
+#define KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK	GENMASK(15, 0)
+/* We need 40 bits for the GFN to cover a 52-bit IPA with 4k pages and LPA2 */
+#define KVM_HOST_PTE_OWNER_GUEST_GFN_MASK	GENMASK(55, 16)
+static u64 host_stage2_encode_gfn_meta(struct pkvm_hyp_vm *vm, u64 gfn)
+{
+	pkvm_handle_t handle = vm->kvm.arch.pkvm.handle;
+
+	BUILD_BUG_ON((pkvm_handle_t)-1 > KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK);
+	WARN_ON(!FIELD_FIT(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn));
+
+	return FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK, handle) |
+	       FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn);
+}
+
+static int host_stage2_decode_gfn_meta(kvm_pte_t pte, struct pkvm_hyp_vm **vm,
+				       u64 *gfn)
+{
+	pkvm_handle_t handle;
+	u64 meta;
+
+	if (WARN_ON(kvm_pte_valid(pte)))
+		return -EINVAL;
+
+	if (FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) !=
+	    KVM_HOST_INVALID_PTE_TYPE_DONATION) {
+		return -EINVAL;
+	}
+
+	if (FIELD_GET(KVM_HOST_DONATION_PTE_OWNER_MASK, pte) != PKVM_ID_GUEST)
+		return -EPERM;
+
+	meta = FIELD_GET(KVM_HOST_DONATION_PTE_EXTRA_MASK, pte);
+	handle = FIELD_GET(KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK, meta);
+	*vm = get_vm_by_handle(handle);
+	if (!*vm) {
+		/* We probably raced with teardown; try again */
+		return -EAGAIN;
+	}
+
+	*gfn = FIELD_GET(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, meta);
 	return 0;
 }

···
 	return ret;
 }

+static void host_inject_mem_abort(struct kvm_cpu_context *host_ctxt)
+{
+	u64 ec, esr, spsr;
+
+	esr = read_sysreg_el2(SYS_ESR);
+	spsr = read_sysreg_el2(SYS_SPSR);
+
+	/* Repaint the ESR to report a same-level fault if taken from EL1 */
+	if ((spsr & PSR_MODE_MASK) != PSR_MODE_EL0t) {
+		ec = ESR_ELx_EC(esr);
+		if (ec == ESR_ELx_EC_DABT_LOW)
+			ec = ESR_ELx_EC_DABT_CUR;
+		else if (ec == ESR_ELx_EC_IABT_LOW)
+			ec = ESR_ELx_EC_IABT_CUR;
+		else
+			WARN_ON(1);
+		esr &= ~ESR_ELx_EC_MASK;
+		esr |= ec << ESR_ELx_EC_SHIFT;
+	}
+
+	/*
+	 * Since S1PTW should only ever be set for stage-2 faults, we're pretty
+	 * much guaranteed that it won't be set in ESR_EL1 by the hardware. So,
+	 * let's use that bit to allow the host abort handler to differentiate
+	 * this abort from normal userspace faults.
+	 *
+	 * Note: although S1PTW is RES0 at EL1, it is guaranteed by the
+	 * architecture to be backed by flops, so it should be safe to use.
636 + */ 637 + esr |= ESR_ELx_S1PTW; 638 + inject_host_exception(esr); 639 + } 640 + 691 641 void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt) 692 642 { 693 643 struct kvm_vcpu_fault_info fault; 694 644 u64 esr, addr; 695 - int ret = 0; 696 645 697 646 esr = read_sysreg_el2(SYS_ESR); 698 647 if (!__get_fault_info(esr, &fault)) { ··· 743 628 BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS)); 744 629 addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12; 745 630 746 - ret = host_stage2_idmap(addr); 747 - BUG_ON(ret && ret != -EAGAIN); 631 + switch (host_stage2_idmap(addr)) { 632 + case -EPERM: 633 + host_inject_mem_abort(host_ctxt); 634 + fallthrough; 635 + case -EEXIST: 636 + case 0: 637 + break; 638 + default: 639 + BUG(); 640 + } 748 641 } 749 642 750 643 struct check_walk_data { ··· 830 707 return 0; 831 708 } 832 709 710 + static bool guest_pte_is_poisoned(kvm_pte_t pte) 711 + { 712 + if (kvm_pte_valid(pte)) 713 + return false; 714 + 715 + return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) == 716 + KVM_GUEST_INVALID_PTE_TYPE_POISONED; 717 + } 718 + 833 719 static enum pkvm_page_state guest_get_page_state(kvm_pte_t pte, u64 addr) 834 720 { 721 + if (guest_pte_is_poisoned(pte)) 722 + return PKVM_POISON; 723 + 835 724 if (!kvm_pte_valid(pte)) 836 725 return PKVM_NOPAGE; 837 726 ··· 860 725 861 726 hyp_assert_lock_held(&vm->lock); 862 727 return check_page_state_range(&vm->pgt, addr, size, &d); 728 + } 729 + 730 + static int get_valid_guest_pte(struct pkvm_hyp_vm *vm, u64 ipa, kvm_pte_t *ptep, u64 *physp) 731 + { 732 + kvm_pte_t pte; 733 + u64 phys; 734 + s8 level; 735 + int ret; 736 + 737 + ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level); 738 + if (ret) 739 + return ret; 740 + if (guest_pte_is_poisoned(pte)) 741 + return -EHWPOISON; 742 + if (!kvm_pte_valid(pte)) 743 + return -ENOENT; 744 + if (level != KVM_PGTABLE_LAST_LEVEL) 745 + return -E2BIG; 746 + 747 + phys = kvm_pte_to_phys(pte); 748 + ret = check_range_allowed_memory(phys, phys + PAGE_SIZE); 
749 + if (WARN_ON(ret)) 750 + return ret; 751 + 752 + *ptep = pte; 753 + *physp = phys; 754 + 755 + return 0; 756 + } 757 + 758 + int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu) 759 + { 760 + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(hyp_vcpu); 761 + kvm_pte_t pte; 762 + s8 level; 763 + u64 ipa; 764 + int ret; 765 + 766 + switch (kvm_vcpu_trap_get_class(&hyp_vcpu->vcpu)) { 767 + case ESR_ELx_EC_DABT_LOW: 768 + case ESR_ELx_EC_IABT_LOW: 769 + if (kvm_vcpu_trap_is_translation_fault(&hyp_vcpu->vcpu)) 770 + break; 771 + fallthrough; 772 + default: 773 + return -EINVAL; 774 + } 775 + 776 + /* 777 + * The host has the faulting IPA when it calls us from the guest 778 + * fault handler but we retrieve it ourselves from the FAR so as 779 + * to avoid exposing an "oracle" that could reveal data access 780 + * patterns of the guest after initial donation of its pages. 781 + */ 782 + ipa = kvm_vcpu_get_fault_ipa(&hyp_vcpu->vcpu); 783 + ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(&hyp_vcpu->vcpu)); 784 + 785 + guest_lock_component(vm); 786 + ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level); 787 + if (ret) 788 + goto unlock; 789 + 790 + if (level != KVM_PGTABLE_LAST_LEVEL) { 791 + ret = -EINVAL; 792 + goto unlock; 793 + } 794 + 795 + ret = guest_pte_is_poisoned(pte); 796 + unlock: 797 + guest_unlock_component(vm); 798 + return ret; 863 799 } 864 800 865 801 int __pkvm_host_share_hyp(u64 pfn) ··· 954 748 955 749 unlock: 956 750 hyp_unlock_component(); 751 + host_unlock_component(); 752 + 753 + return ret; 754 + } 755 + 756 + int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn) 757 + { 758 + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); 759 + u64 phys, ipa = hyp_pfn_to_phys(gfn); 760 + kvm_pte_t pte; 761 + int ret; 762 + 763 + host_lock_component(); 764 + guest_lock_component(vm); 765 + 766 + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); 767 + if (ret) 768 + goto unlock; 769 + 770 + ret = -EPERM; 771 + if 
(pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte)) != PKVM_PAGE_OWNED) 772 + goto unlock; 773 + if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE)) 774 + goto unlock; 775 + 776 + ret = 0; 777 + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, 778 + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED), 779 + &vcpu->vcpu.arch.pkvm_memcache, 0)); 780 + WARN_ON(__host_set_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED)); 781 + unlock: 782 + guest_unlock_component(vm); 783 + host_unlock_component(); 784 + 785 + return ret; 786 + } 787 + 788 + int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn) 789 + { 790 + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); 791 + u64 meta, phys, ipa = hyp_pfn_to_phys(gfn); 792 + kvm_pte_t pte; 793 + int ret; 794 + 795 + host_lock_component(); 796 + guest_lock_component(vm); 797 + 798 + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); 799 + if (ret) 800 + goto unlock; 801 + 802 + ret = -EPERM; 803 + if (pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte)) != PKVM_PAGE_SHARED_OWNED) 804 + goto unlock; 805 + if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED)) 806 + goto unlock; 807 + 808 + ret = 0; 809 + meta = host_stage2_encode_gfn_meta(vm, gfn); 810 + WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE, 811 + PKVM_ID_GUEST, meta)); 812 + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, 813 + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED), 814 + &vcpu->vcpu.arch.pkvm_memcache, 0)); 815 + unlock: 816 + guest_unlock_component(vm); 957 817 host_unlock_component(); 958 818 959 819 return ret; ··· 1230 958 1231 959 *size = block_size; 1232 960 return 0; 961 + } 962 + 963 + static void hyp_poison_page(phys_addr_t phys) 964 + { 965 + void *addr = hyp_fixmap_map(phys); 966 + 967 + memset(addr, 0, PAGE_SIZE); 968 + /* 969 + * Prefer kvm_flush_dcache_to_poc() over __clean_dcache_guest_page() 970 + * here as the latter may elide the 
CMO under the assumption that FWB 971 + * will be enabled on CPUs that support it. This is incorrect for the 972 + * host stage-2 and would otherwise lead to a malicious host potentially 973 + * being able to read the contents of newly reclaimed guest pages. 974 + */ 975 + kvm_flush_dcache_to_poc(addr, PAGE_SIZE); 976 + hyp_fixmap_unmap(); 977 + } 978 + 979 + static int host_stage2_get_guest_info(phys_addr_t phys, struct pkvm_hyp_vm **vm, 980 + u64 *gfn) 981 + { 982 + enum pkvm_page_state state; 983 + kvm_pte_t pte; 984 + s8 level; 985 + int ret; 986 + 987 + if (!addr_is_memory(phys)) 988 + return -EFAULT; 989 + 990 + state = get_host_state(hyp_phys_to_page(phys)); 991 + switch (state) { 992 + case PKVM_PAGE_OWNED: 993 + case PKVM_PAGE_SHARED_OWNED: 994 + case PKVM_PAGE_SHARED_BORROWED: 995 + /* The access should no longer fault; try again. */ 996 + return -EAGAIN; 997 + case PKVM_NOPAGE: 998 + break; 999 + default: 1000 + return -EPERM; 1001 + } 1002 + 1003 + ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, &level); 1004 + if (ret) 1005 + return ret; 1006 + 1007 + if (WARN_ON(level != KVM_PGTABLE_LAST_LEVEL)) 1008 + return -EINVAL; 1009 + 1010 + return host_stage2_decode_gfn_meta(pte, vm, gfn); 1011 + } 1012 + 1013 + int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys) 1014 + { 1015 + struct pkvm_hyp_vm *vm; 1016 + u64 gfn, ipa, pa; 1017 + kvm_pte_t pte; 1018 + int ret; 1019 + 1020 + phys &= PAGE_MASK; 1021 + 1022 + hyp_spin_lock(&vm_table_lock); 1023 + host_lock_component(); 1024 + 1025 + ret = host_stage2_get_guest_info(phys, &vm, &gfn); 1026 + if (ret) 1027 + goto unlock_host; 1028 + 1029 + ipa = hyp_pfn_to_phys(gfn); 1030 + guest_lock_component(vm); 1031 + ret = get_valid_guest_pte(vm, ipa, &pte, &pa); 1032 + if (ret) 1033 + goto unlock_guest; 1034 + 1035 + WARN_ON(pa != phys); 1036 + if (guest_get_page_state(pte, ipa) != PKVM_PAGE_OWNED) { 1037 + ret = -EPERM; 1038 + goto unlock_guest; 1039 + } 1040 + 1041 + /* We really shouldn't be allocating, so 
don't pass a memcache */ 1042 + ret = kvm_pgtable_stage2_annotate(&vm->pgt, ipa, PAGE_SIZE, NULL, 1043 + KVM_GUEST_INVALID_PTE_TYPE_POISONED, 1044 + 0); 1045 + if (ret) 1046 + goto unlock_guest; 1047 + 1048 + hyp_poison_page(phys); 1049 + WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST)); 1050 + unlock_guest: 1051 + guest_unlock_component(vm); 1052 + unlock_host: 1053 + host_unlock_component(); 1054 + hyp_spin_unlock(&vm_table_lock); 1055 + 1056 + return ret; 1057 + } 1058 + 1059 + int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm) 1060 + { 1061 + u64 ipa = hyp_pfn_to_phys(gfn); 1062 + kvm_pte_t pte; 1063 + u64 phys; 1064 + int ret; 1065 + 1066 + host_lock_component(); 1067 + guest_lock_component(vm); 1068 + 1069 + ret = get_valid_guest_pte(vm, ipa, &pte, &phys); 1070 + if (ret) 1071 + goto unlock; 1072 + 1073 + switch (guest_get_page_state(pte, ipa)) { 1074 + case PKVM_PAGE_OWNED: 1075 + WARN_ON(__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE)); 1076 + hyp_poison_page(phys); 1077 + break; 1078 + case PKVM_PAGE_SHARED_OWNED: 1079 + WARN_ON(__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED)); 1080 + break; 1081 + default: 1082 + ret = -EPERM; 1083 + goto unlock; 1084 + } 1085 + 1086 + WARN_ON(kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE)); 1087 + WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST)); 1088 + 1089 + unlock: 1090 + guest_unlock_component(vm); 1091 + host_unlock_component(); 1092 + 1093 + /* 1094 + * -EHWPOISON implies that the page was forcefully reclaimed already 1095 + * so return success for the GUP pin to be dropped. 1096 + */ 1097 + return ret && ret != -EHWPOISON ? 
ret : 0; 1098 + } 1099 + 1100 + int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu) 1101 + { 1102 + struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); 1103 + u64 phys = hyp_pfn_to_phys(pfn); 1104 + u64 ipa = hyp_pfn_to_phys(gfn); 1105 + u64 meta; 1106 + int ret; 1107 + 1108 + host_lock_component(); 1109 + guest_lock_component(vm); 1110 + 1111 + ret = __host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_OWNED); 1112 + if (ret) 1113 + goto unlock; 1114 + 1115 + ret = __guest_check_page_state_range(vm, ipa, PAGE_SIZE, PKVM_NOPAGE); 1116 + if (ret) 1117 + goto unlock; 1118 + 1119 + meta = host_stage2_encode_gfn_meta(vm, gfn); 1120 + WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE, 1121 + PKVM_ID_GUEST, meta)); 1122 + WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys, 1123 + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED), 1124 + &vcpu->vcpu.arch.pkvm_memcache, 0)); 1125 + 1126 + unlock: 1127 + guest_unlock_component(vm); 1128 + host_unlock_component(); 1129 + 1130 + return ret; 1233 1131 } 1234 1132 1235 1133 int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, ··· 1648 1206 1649 1207 static struct pkvm_expected_state selftest_state; 1650 1208 static struct hyp_page *selftest_page; 1651 - 1652 - static struct pkvm_hyp_vm selftest_vm = { 1653 - .kvm = { 1654 - .arch = { 1655 - .mmu = { 1656 - .arch = &selftest_vm.kvm.arch, 1657 - .pgt = &selftest_vm.pgt, 1658 - }, 1659 - }, 1660 - }, 1661 - }; 1662 - 1663 - static struct pkvm_hyp_vcpu selftest_vcpu = { 1664 - .vcpu = { 1665 - .arch = { 1666 - .hw_mmu = &selftest_vm.kvm.arch.mmu, 1667 - }, 1668 - .kvm = &selftest_vm.kvm, 1669 - }, 1670 - }; 1671 - 1672 - static void init_selftest_vm(void *virt) 1673 - { 1674 - struct hyp_page *p = hyp_virt_to_page(virt); 1675 - int i; 1676 - 1677 - selftest_vm.kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr; 1678 - WARN_ON(kvm_guest_prepare_stage2(&selftest_vm, virt)); 1679 - 1680 - for (i = 0; i 
< pkvm_selftest_pages(); i++) { 1681 - if (p[i].refcount) 1682 - continue; 1683 - p[i].refcount = 1; 1684 - hyp_put_page(&selftest_vm.pool, hyp_page_to_virt(&p[i])); 1685 - } 1686 - } 1209 + static struct pkvm_hyp_vcpu *selftest_vcpu; 1687 1210 1688 1211 static u64 selftest_ipa(void) 1689 1212 { 1690 - return BIT(selftest_vm.pgt.ia_bits - 1); 1213 + return BIT(selftest_vcpu->vcpu.arch.hw_mmu->pgt->ia_bits - 1); 1691 1214 } 1692 1215 1693 1216 static void assert_page_state(void) 1694 1217 { 1695 1218 void *virt = hyp_page_to_virt(selftest_page); 1696 1219 u64 size = PAGE_SIZE << selftest_page->order; 1697 - struct pkvm_hyp_vcpu *vcpu = &selftest_vcpu; 1220 + struct pkvm_hyp_vcpu *vcpu = selftest_vcpu; 1698 1221 u64 phys = hyp_virt_to_phys(virt); 1699 1222 u64 ipa[2] = { selftest_ipa(), selftest_ipa() + PAGE_SIZE }; 1700 1223 struct pkvm_hyp_vm *vm; ··· 1674 1267 WARN_ON(__hyp_check_page_state_range(phys, size, selftest_state.hyp)); 1675 1268 hyp_unlock_component(); 1676 1269 1677 - guest_lock_component(&selftest_vm); 1270 + guest_lock_component(vm); 1678 1271 WARN_ON(__guest_check_page_state_range(vm, ipa[0], size, selftest_state.guest[0])); 1679 1272 WARN_ON(__guest_check_page_state_range(vm, ipa[1], size, selftest_state.guest[1])); 1680 - guest_unlock_component(&selftest_vm); 1273 + guest_unlock_component(vm); 1681 1274 } 1682 1275 1683 1276 #define assert_transition_res(res, fn, ...) 
\ ··· 1690 1283 { 1691 1284 enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RWX; 1692 1285 void *virt = hyp_alloc_pages(&host_s2_pool, 0); 1693 - struct pkvm_hyp_vcpu *vcpu = &selftest_vcpu; 1694 - struct pkvm_hyp_vm *vm = &selftest_vm; 1286 + struct pkvm_hyp_vcpu *vcpu; 1695 1287 u64 phys, size, pfn, gfn; 1288 + struct pkvm_hyp_vm *vm; 1696 1289 1697 1290 WARN_ON(!virt); 1698 1291 selftest_page = hyp_virt_to_page(virt); 1699 1292 selftest_page->refcount = 0; 1700 - init_selftest_vm(base); 1293 + selftest_vcpu = vcpu = init_selftest_vm(base); 1294 + vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu); 1701 1295 1702 1296 size = PAGE_SIZE << selftest_page->order; 1703 1297 phys = hyp_virt_to_phys(virt); ··· 1717 1309 assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); 1718 1310 assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1719 1311 assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); 1312 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1720 1313 1721 1314 selftest_state.host = PKVM_PAGE_OWNED; 1722 1315 selftest_state.hyp = PKVM_NOPAGE; ··· 1737 1328 assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1738 1329 assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1739 1330 assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); 1331 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1740 1332 1741 1333 assert_transition_res(0, hyp_pin_shared_mem, virt, virt + size); 1742 1334 assert_transition_res(0, hyp_pin_shared_mem, virt, virt + size); ··· 1750 1340 assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1751 1341 assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1752 1342 assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); 1343 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1753 1344 1754 1345 hyp_unpin_shared_mem(virt, 
virt + size); 1755 1346 assert_page_state(); ··· 1770 1359 assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1771 1360 assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1772 1361 assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm); 1362 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1773 1363 assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); 1774 1364 1775 1365 selftest_state.host = PKVM_PAGE_OWNED; ··· 1787 1375 assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); 1788 1376 assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); 1789 1377 assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1378 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1790 1379 assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size); 1791 1380 1792 1381 selftest_state.guest[1] = PKVM_PAGE_SHARED_BORROWED; ··· 1802 1389 assert_transition_res(0, __pkvm_host_unshare_guest, gfn + 1, 1, vm); 1803 1390 1804 1391 selftest_state.host = PKVM_NOPAGE; 1392 + selftest_state.guest[0] = PKVM_PAGE_OWNED; 1393 + assert_transition_res(0, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1394 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1395 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); 1396 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1397 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); 1398 + assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1); 1399 + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); 1400 + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); 1401 + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); 1402 + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1403 + 1404 + selftest_state.host = PKVM_PAGE_SHARED_BORROWED; 1405 + 
selftest_state.guest[0] = PKVM_PAGE_SHARED_OWNED; 1406 + assert_transition_res(0, __pkvm_guest_share_host, vcpu, gfn); 1407 + assert_transition_res(-EPERM, __pkvm_guest_share_host, vcpu, gfn); 1408 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1409 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); 1410 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1411 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); 1412 + assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1); 1413 + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); 1414 + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); 1415 + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); 1416 + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1417 + 1418 + selftest_state.host = PKVM_NOPAGE; 1419 + selftest_state.guest[0] = PKVM_PAGE_OWNED; 1420 + assert_transition_res(0, __pkvm_guest_unshare_host, vcpu, gfn); 1421 + assert_transition_res(-EPERM, __pkvm_guest_unshare_host, vcpu, gfn); 1422 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1423 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); 1424 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1425 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot); 1426 + assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1); 1427 + assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1); 1428 + assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn); 1429 + assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn); 1430 + assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1); 1431 + 1432 + selftest_state.host = PKVM_PAGE_OWNED; 1433 + selftest_state.guest[0] = PKVM_POISON; 1434 + assert_transition_res(0, __pkvm_host_force_reclaim_page_guest, phys); 1435 + 
assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1436 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1437 + assert_transition_res(-EHWPOISON, __pkvm_guest_share_host, vcpu, gfn); 1438 + assert_transition_res(-EHWPOISON, __pkvm_guest_unshare_host, vcpu, gfn); 1439 + 1440 + selftest_state.host = PKVM_NOPAGE; 1441 + selftest_state.guest[1] = PKVM_PAGE_OWNED; 1442 + assert_transition_res(0, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); 1443 + 1444 + selftest_state.host = PKVM_PAGE_OWNED; 1445 + selftest_state.guest[1] = PKVM_NOPAGE; 1446 + assert_transition_res(0, __pkvm_host_reclaim_page_guest, gfn + 1, vm); 1447 + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); 1448 + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); 1449 + 1450 + selftest_state.host = PKVM_NOPAGE; 1805 1451 selftest_state.hyp = PKVM_PAGE_OWNED; 1806 1452 assert_transition_res(0, __pkvm_host_donate_hyp, pfn, 1); 1807 1453 1454 + teardown_selftest_vm(); 1808 1455 selftest_page->refcount = 1; 1809 1456 hyp_put_page(&host_s2_pool, virt); 1810 1457 }
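Review note: the donation annotation introduced in mem_protect.c nests two bitfield layouts. The guest handle and GFN are packed into a metadata word, and that word is then packed next to the owner ID in the invalid host PTE. Below is a minimal userspace sketch of that packing, assuming only the masks shown in the hunks above. The helper names, the `GENMASK64` macro and the `EXAMPLE_ID_GUEST` value are illustrative stand-ins rather than the kernel's definitions, and the PTE type field is left out because its mask is not visible in this diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(); not the kernel macro. */
#define GENMASK64(h, l) (((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))

/* Field layout copied from the mem_protect.c hunks. */
#define DONATION_OWNER_MASK	GENMASK64(3, 1)
#define DONATION_EXTRA_MASK	GENMASK64(59, 4)
#define GUEST_HANDLE_MASK	GENMASK64(15, 0)
#define GUEST_GFN_MASK		GENMASK64(55, 16)

/* Hypothetical owner ID; the real PKVM_ID_GUEST value is not in the diff. */
#define EXAMPLE_ID_GUEST	3

/* Position of the lowest set bit of a non-zero mask. */
static unsigned int mask_shift(uint64_t mask)
{
	unsigned int shift = 0;

	while (!(mask & 1)) {
		mask >>= 1;
		shift++;
	}
	return shift;
}

/* True if val can be stored in the field described by mask. */
static bool field_fit(uint64_t mask, uint64_t val)
{
	return val <= (mask >> mask_shift(mask));
}

static uint64_t field_prep(uint64_t mask, uint64_t val)
{
	return (val << mask_shift(mask)) & mask;
}

static uint64_t field_get(uint64_t mask, uint64_t reg)
{
	return (reg & mask) >> mask_shift(mask);
}

/* Inner layer: VM handle in bits [15:0], GFN in bits [55:16]. */
static uint64_t encode_gfn_meta(uint16_t handle, uint64_t gfn)
{
	return field_prep(GUEST_HANDLE_MASK, handle) |
	       field_prep(GUEST_GFN_MASK, gfn);
}

/* Outer layer: owner in bits [3:1], metadata in bits [59:4]; bit 0 stays clear. */
static uint64_t encode_annotation(uint8_t owner, uint64_t meta)
{
	return field_prep(DONATION_OWNER_MASK, owner) |
	       field_prep(DONATION_EXTRA_MASK, meta);
}

/* Reverse both layers, as host_stage2_decode_gfn_meta() does for the real PTE. */
static void decode_annotation(uint64_t pte, uint8_t *owner,
			      uint16_t *handle, uint64_t *gfn)
{
	uint64_t meta = field_get(DONATION_EXTRA_MASK, pte);

	*owner = (uint8_t)field_get(DONATION_OWNER_MASK, pte);
	*handle = (uint16_t)field_get(GUEST_HANDLE_MASK, meta);
	*gfn = field_get(GUEST_GFN_MASK, meta);
}
```

The 56-bit metadata word (handle plus 40-bit GFN) is exactly as wide as the bits [59:4] field, which is presumably why both `FIELD_FIT()` checks appear before the annotation is written.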
+228 -11
arch/arm64/kvm/hyp/nvhe/pkvm.c
··· 4 4 * Author: Fuad Tabba <tabba@google.com> 5 5 */ 6 6 7 + #include <kvm/arm_hypercalls.h> 8 + 7 9 #include <linux/kvm_host.h> 8 10 #include <linux/mm.h> 9 11 ··· 224 222 225 223 void pkvm_hyp_vm_table_init(void *tbl) 226 224 { 225 + BUILD_BUG_ON((u64)HANDLE_OFFSET + KVM_MAX_PVMS > (pkvm_handle_t)-1); 227 226 WARN_ON(vm_table); 228 227 vm_table = tbl; 229 228 } ··· 232 229 /* 233 230 * Return the hyp vm structure corresponding to the handle. 234 231 */ 235 - static struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) 232 + struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) 236 233 { 237 234 unsigned int idx = vm_handle_to_idx(handle); 235 + 236 + hyp_assert_lock_held(&vm_table_lock); 238 237 239 238 if (unlikely(idx >= KVM_MAX_PVMS)) 240 239 return NULL; ··· 260 255 261 256 hyp_spin_lock(&vm_table_lock); 262 257 hyp_vm = get_vm_by_handle(handle); 263 - if (!hyp_vm || hyp_vm->kvm.created_vcpus <= vcpu_idx) 258 + if (!hyp_vm || hyp_vm->kvm.arch.pkvm.is_dying) 259 + goto unlock; 260 + 261 + if (hyp_vm->kvm.created_vcpus <= vcpu_idx) 264 262 goto unlock; 265 263 266 264 hyp_vcpu = hyp_vm->vcpus[vcpu_idx]; ··· 727 719 hyp_spin_unlock(&vm_table_lock); 728 720 } 729 721 722 + #ifdef CONFIG_NVHE_EL2_DEBUG 723 + static struct pkvm_hyp_vm selftest_vm = { 724 + .kvm = { 725 + .arch = { 726 + .mmu = { 727 + .arch = &selftest_vm.kvm.arch, 728 + .pgt = &selftest_vm.pgt, 729 + }, 730 + }, 731 + }, 732 + }; 733 + 734 + static struct pkvm_hyp_vcpu selftest_vcpu = { 735 + .vcpu = { 736 + .arch = { 737 + .hw_mmu = &selftest_vm.kvm.arch.mmu, 738 + }, 739 + .kvm = &selftest_vm.kvm, 740 + }, 741 + }; 742 + 743 + struct pkvm_hyp_vcpu *init_selftest_vm(void *virt) 744 + { 745 + struct hyp_page *p = hyp_virt_to_page(virt); 746 + int i; 747 + 748 + selftest_vm.kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr; 749 + WARN_ON(kvm_guest_prepare_stage2(&selftest_vm, virt)); 750 + 751 + for (i = 0; i < pkvm_selftest_pages(); i++) { 752 + if (p[i].refcount) 753 + continue; 754 + 
p[i].refcount = 1; 755 + hyp_put_page(&selftest_vm.pool, hyp_page_to_virt(&p[i])); 756 + } 757 + 758 + selftest_vm.kvm.arch.pkvm.handle = __pkvm_reserve_vm(); 759 + insert_vm_table_entry(selftest_vm.kvm.arch.pkvm.handle, &selftest_vm); 760 + return &selftest_vcpu; 761 + } 762 + 763 + void teardown_selftest_vm(void) 764 + { 765 + hyp_spin_lock(&vm_table_lock); 766 + remove_vm_table_entry(selftest_vm.kvm.arch.pkvm.handle); 767 + hyp_spin_unlock(&vm_table_lock); 768 + } 769 + #endif /* CONFIG_NVHE_EL2_DEBUG */ 770 + 730 771 /* 731 772 * Initialize the hypervisor copy of the VM state using host-donated memory. 732 773 * ··· 916 859 unmap_donated_memory_noclear(addr, size); 917 860 } 918 861 919 - int __pkvm_teardown_vm(pkvm_handle_t handle) 862 + int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn) 863 + { 864 + struct pkvm_hyp_vm *hyp_vm = get_pkvm_hyp_vm(handle); 865 + int ret = -EINVAL; 866 + 867 + if (!hyp_vm) 868 + return ret; 869 + 870 + if (hyp_vm->kvm.arch.pkvm.is_dying) 871 + ret = __pkvm_host_reclaim_page_guest(gfn, hyp_vm); 872 + 873 + put_pkvm_hyp_vm(hyp_vm); 874 + return ret; 875 + } 876 + 877 + static struct pkvm_hyp_vm *get_pkvm_unref_hyp_vm_locked(pkvm_handle_t handle) 878 + { 879 + struct pkvm_hyp_vm *hyp_vm; 880 + 881 + hyp_assert_lock_held(&vm_table_lock); 882 + 883 + hyp_vm = get_vm_by_handle(handle); 884 + if (!hyp_vm || hyp_page_count(hyp_vm)) 885 + return NULL; 886 + 887 + return hyp_vm; 888 + } 889 + 890 + int __pkvm_start_teardown_vm(pkvm_handle_t handle) 891 + { 892 + struct pkvm_hyp_vm *hyp_vm; 893 + int ret = 0; 894 + 895 + hyp_spin_lock(&vm_table_lock); 896 + hyp_vm = get_pkvm_unref_hyp_vm_locked(handle); 897 + if (!hyp_vm || hyp_vm->kvm.arch.pkvm.is_dying) { 898 + ret = -EINVAL; 899 + goto unlock; 900 + } 901 + 902 + hyp_vm->kvm.arch.pkvm.is_dying = true; 903 + unlock: 904 + hyp_spin_unlock(&vm_table_lock); 905 + 906 + return ret; 907 + } 908 + 909 + int __pkvm_finalize_teardown_vm(pkvm_handle_t handle) 920 910 { 921 911 
struct kvm_hyp_memcache *mc, *stage2_mc; 922 912 struct pkvm_hyp_vm *hyp_vm; ··· 973 869 int err; 974 870 975 871 hyp_spin_lock(&vm_table_lock); 976 - hyp_vm = get_vm_by_handle(handle); 977 - if (!hyp_vm) { 978 - err = -ENOENT; 979 - goto err_unlock; 980 - } 981 - 982 - if (WARN_ON(hyp_page_count(hyp_vm))) { 983 - err = -EBUSY; 872 + hyp_vm = get_pkvm_unref_hyp_vm_locked(handle); 873 + if (!hyp_vm || !hyp_vm->kvm.arch.pkvm.is_dying) { 874 + err = -EINVAL; 984 875 goto err_unlock; 985 876 } 986 877 ··· 1020 921 err_unlock: 1021 922 hyp_spin_unlock(&vm_table_lock); 1022 923 return err; 924 + } 925 + 926 + static u64 __pkvm_memshare_page_req(struct kvm_vcpu *vcpu, u64 ipa) 927 + { 928 + u64 elr; 929 + 930 + /* Fake up a data abort (level 3 translation fault on write) */ 931 + vcpu->arch.fault.esr_el2 = (ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT) | 932 + ESR_ELx_WNR | ESR_ELx_FSC_FAULT | 933 + FIELD_PREP(ESR_ELx_FSC_LEVEL, 3); 934 + 935 + /* Shuffle the IPA around into the HPFAR */ 936 + vcpu->arch.fault.hpfar_el2 = (HPFAR_EL2_NS | (ipa >> 8)) & HPFAR_MASK; 937 + 938 + /* This is a virtual address. 0's good. Let's go with 0. */ 939 + vcpu->arch.fault.far_el2 = 0; 940 + 941 + /* Rewind the ELR so we return to the HVC once the IPA is mapped */ 942 + elr = read_sysreg(elr_el2); 943 + elr -= 4; 944 + write_sysreg(elr, elr_el2); 945 + 946 + return ARM_EXCEPTION_TRAP; 947 + } 948 + 949 + static bool pkvm_memshare_call(u64 *ret, struct kvm_vcpu *vcpu, u64 *exit_code) 950 + { 951 + struct pkvm_hyp_vcpu *hyp_vcpu; 952 + u64 ipa = smccc_get_arg1(vcpu); 953 + 954 + if (!PAGE_ALIGNED(ipa)) 955 + goto out_guest; 956 + 957 + hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu); 958 + switch (__pkvm_guest_share_host(hyp_vcpu, hyp_phys_to_pfn(ipa))) { 959 + case 0: 960 + ret[0] = SMCCC_RET_SUCCESS; 961 + goto out_guest; 962 + case -ENOENT: 963 + /* 964 + * Convert the exception into a data abort so that the page 965 + * being shared is mapped into the guest next time. 
966 + */ 967 + *exit_code = __pkvm_memshare_page_req(vcpu, ipa); 968 + goto out_host; 969 + } 970 + 971 + out_guest: 972 + return true; 973 + out_host: 974 + return false; 975 + } 976 + 977 + static void pkvm_memunshare_call(u64 *ret, struct kvm_vcpu *vcpu) 978 + { 979 + struct pkvm_hyp_vcpu *hyp_vcpu; 980 + u64 ipa = smccc_get_arg1(vcpu); 981 + 982 + if (!PAGE_ALIGNED(ipa)) 983 + return; 984 + 985 + hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu); 986 + if (!__pkvm_guest_unshare_host(hyp_vcpu, hyp_phys_to_pfn(ipa))) 987 + ret[0] = SMCCC_RET_SUCCESS; 988 + } 989 + 990 + /* 991 + * Handler for protected VM HVC calls. 992 + * 993 + * Returns true if the hypervisor has handled the exit (and control 994 + * should return to the guest) or false if it hasn't (and the handling 995 + * should be performed by the host). 996 + */ 997 + bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code) 998 + { 999 + u64 val[4] = { SMCCC_RET_INVALID_PARAMETER }; 1000 + bool handled = true; 1001 + 1002 + switch (smccc_get_function(vcpu)) { 1003 + case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: 1004 + val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); 1005 + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_HYP_MEMINFO); 1006 + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_SHARE); 1007 + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_UNSHARE); 1008 + break; 1009 + case ARM_SMCCC_VENDOR_HYP_KVM_HYP_MEMINFO_FUNC_ID: 1010 + if (smccc_get_arg1(vcpu) || 1011 + smccc_get_arg2(vcpu) || 1012 + smccc_get_arg3(vcpu)) { 1013 + break; 1014 + } 1015 + 1016 + val[0] = PAGE_SIZE; 1017 + break; 1018 + case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID: 1019 + if (smccc_get_arg2(vcpu) || 1020 + smccc_get_arg3(vcpu)) { 1021 + break; 1022 + } 1023 + 1024 + handled = pkvm_memshare_call(val, vcpu, exit_code); 1025 + break; 1026 + case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID: 1027 + if (smccc_get_arg2(vcpu) || 1028 + smccc_get_arg3(vcpu)) { 1029 + break; 1030 + } 1031 + 1032 + pkvm_memunshare_call(val, vcpu); 1033 + break; 1034 + 
default: 1035 + /* Punt everything else back to the host, for now. */ 1036 + handled = false; 1037 + } 1038 + 1039 + if (handled) 1040 + smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]); 1041 + return handled; 1023 1042 }
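Review note: `__pkvm_memshare_page_req()` fakes a stage-2 fault by writing `(HPFAR_EL2_NS | (ipa >> 8)) & HPFAR_MASK`, while `handle_host_mem_abort()` recovers an address with `FIELD_GET(HPFAR_EL2_FIPA, ...) << 12`. The sketch below checks that the two expressions are inverses for page-aligned IPAs. It assumes that FIPA occupies bits [47:4], that NS is bit 63 and that `HPFAR_MASK` clears the low four bits. Those values are my reading of the architecture and kernel headers, not something this diff states, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed HPFAR_EL2 layout: NS at bit 63, FIPA (IPA[55:12]) at bits [47:4]. */
#define HPFAR_NS	(1ULL << 63)
#define HPFAR_FIPA_MASK	(((~0ULL) >> 16) & ((~0ULL) << 4))
/* Assumed equivalent of HPFAR_MASK: drop the four low RES0 bits. */
#define HPFAR_LOW_CLEAR	(~0xfULL)

/* IPA bit 12 must land on register bit 4, hence the shift by 8. */
static uint64_t ipa_to_hpfar(uint64_t ipa)
{
	return (HPFAR_NS | (ipa >> 8)) & HPFAR_LOW_CLEAR;
}

/* Extract FIPA and scale it back up to a page-aligned IPA. */
static uint64_t hpfar_to_ipa(uint64_t hpfar)
{
	return ((hpfar & HPFAR_FIPA_MASK) >> 4) << 12;
}
```

Any offset within the page is discarded by the round trip. That matches the hypercall path, which rejects IPAs that are not page aligned before building the fake abort.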
+1
arch/arm64/kvm/hyp/nvhe/switch.c
···

 static const exit_handler_fn pvm_exit_handlers[] = {
 	[0 ... ESR_ELx_EC_MAX] = NULL,
+	[ESR_ELx_EC_HVC64] = kvm_handle_pvm_hvc64,
 	[ESR_ELx_EC_SYS64] = kvm_handle_pvm_sys64,
 	[ESR_ELx_EC_SVE] = kvm_handle_pvm_restricted,
 	[ESR_ELx_EC_FP_ASIMD] = kvm_hyp_handle_fpsimd,
+8
arch/arm64/kvm/hyp/nvhe/sys_regs.c
···
 	/* Cache maintenance by set/way operations are restricted. */

 	/* Debug and Trace Registers are restricted. */
+	RAZ_WI(SYS_DBGBVRn_EL1(0)),
+	RAZ_WI(SYS_DBGBCRn_EL1(0)),
+	RAZ_WI(SYS_DBGWVRn_EL1(0)),
+	RAZ_WI(SYS_DBGWCRn_EL1(0)),
+	RAZ_WI(SYS_MDSCR_EL1),
+	RAZ_WI(SYS_OSLAR_EL1),
+	RAZ_WI(SYS_OSLSR_EL1),
+	RAZ_WI(SYS_OSDLR_EL1),

 	/* Group 1 ID registers */
 	HOST_HANDLED(SYS_REVIDR_EL1),
+20 -13
arch/arm64/kvm/hyp/pgtable.c
···
 	return pte;
 }

-static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
-{
-	return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
-}
-
 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data,
 				  const struct kvm_pgtable_visit_ctx *ctx,
 				  enum kvm_pgtable_walk_flags visit)
···
 struct stage2_map_data {
 	const u64 phys;
 	kvm_pte_t attr;
-	u8 owner_id;
+	kvm_pte_t pte_annot;

 	kvm_pte_t *anchor;
 	kvm_pte_t *childp;
···

 static bool stage2_pte_is_locked(kvm_pte_t pte)
 {
-	return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
+	if (kvm_pte_valid(pte))
+		return false;
+
+	return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) ==
+	       KVM_INVALID_PTE_TYPE_LOCKED;
 }

 static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
···
 				struct kvm_s2_mmu *mmu)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+	kvm_pte_t locked_pte;

 	if (stage2_pte_is_locked(ctx->old)) {
 		/*
···
 		return false;
 	}

-	if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED))
+	locked_pte = FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK,
+				KVM_INVALID_PTE_TYPE_LOCKED);
+	if (!stage2_try_set_pte(ctx, locked_pte))
 		return false;

 	if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) {
···
 	if (!data->annotation)
 		new = kvm_init_valid_leaf_pte(phys, data->attr, ctx->level);
 	else
-		new = kvm_init_invalid_leaf_owner(data->owner_id);
+		new = data->pte_annot;

 	/*
 	 * Skip updating the PTE if we are trying to recreate the exact
···
 	return ret;
 }

-int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
-				 void *mc, u8 owner_id)
+int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size,
+				void *mc, enum kvm_invalid_pte_type type,
+				kvm_pte_t pte_annot)
 {
 	int ret;
 	struct stage2_map_data map_data = {
 		.mmu = pgt->mmu,
 		.memcache = mc,
-		.owner_id = owner_id,
 		.force_pte = true,
 		.annotation = true,
+		.pte_annot = pte_annot |
+			     FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK, type),
 	};
 	struct kvm_pgtable_walker walker = {
 		.cb = stage2_map_walker,
···
 		.arg = &map_data,
 	};

-	if (owner_id > KVM_MAX_OWNER_ID)
+	if (pte_annot & ~KVM_INVALID_PTE_ANNOT_MASK)
+		return -EINVAL;
+
+	if (!type || type == KVM_INVALID_PTE_TYPE_LOCKED)
 		return -EINVAL;

 	ret = kvm_pgtable_walk(pgt, addr, size, &walker);
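The pgtable.c change replaces the single "locked" bit and owner ID with a typed annotation stored in an invalid PTE. A minimal standalone sketch of that encoding follows. The bit layout is an assumption (the real `KVM_INVALID_PTE_TYPE_MASK` and `KVM_INVALID_PTE_ANNOT_MASK` values are not shown in this diff), and every name below is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Hypothetical layout: bit 0 = valid, bits [3:1] = type, rest = annotation. */
#define PTE_VALID		(1ULL << 0)
#define INVALID_TYPE_SHIFT	1
#define INVALID_TYPE_MASK	(0x7ULL << INVALID_TYPE_SHIFT)
#define INVALID_ANNOT_MASK	(~(PTE_VALID | INVALID_TYPE_MASK))

/* Hypothetical type values; 0 is reserved for "no annotation". */
enum invalid_type { TYPE_NONE = 0, TYPE_LOCKED = 1, TYPE_OWNER = 2 };

static unsigned int invalid_type(pte_t pte)
{
	return (pte & INVALID_TYPE_MASK) >> INVALID_TYPE_SHIFT;
}

/* Mirrors stage2_pte_is_locked(): only an invalid PTE can be locked. */
static bool pte_is_locked(pte_t pte)
{
	if (pte & PTE_VALID)
		return false;

	return invalid_type(pte) == TYPE_LOCKED;
}

/*
 * Mirrors the argument checks in kvm_pgtable_stage2_annotate(): the
 * annotation must not spill into the valid/type bits, and callers may
 * not request the reserved "none" or "locked" types.
 */
static int make_annotation(enum invalid_type type, pte_t annot, pte_t *out)
{
	if (annot & ~INVALID_ANNOT_MASK)
		return -EINVAL;

	if (!type || type == TYPE_LOCKED)
		return -EINVAL;

	*out = annot | ((pte_t)type << INVALID_TYPE_SHIFT);
	return 0;
}
```

Keeping the type in its own field means the locked state becomes one value among several rather than a flag that could be combined with an owner annotation.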
+100 -13
arch/arm64/kvm/mmu.c
···
 void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
 			    u64 size, bool may_block)
 {
+	if (kvm_vm_is_protected(kvm_s2_mmu_to_kvm(mmu)))
+		return;
+
 	__unmap_stage2_range(mmu, start, size, may_block);
 }

···
 	u32 kvm_ipa_limit = get_kvm_ipa_limit();
 	u64 mmfr0, mmfr1;
 	u32 phys_shift;
-
-	if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
-		return -EINVAL;

 	phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
 	if (is_protected_kvm_enabled()) {
···
 	bool map_non_cacheable;
 };

+static int pkvm_mem_abort(const struct kvm_s2_fault_desc *s2fd)
+{
+	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
+	struct kvm_vcpu *vcpu = s2fd->vcpu;
+	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
+	struct mm_struct *mm = current->mm;
+	struct kvm *kvm = vcpu->kvm;
+	void *hyp_memcache;
+	struct page *page;
+	int ret;
+
+	hyp_memcache = get_mmu_memcache(vcpu);
+	ret = topup_mmu_memcache(vcpu, hyp_memcache);
+	if (ret)
+		return -ENOMEM;
+
+	ret = account_locked_vm(mm, 1, true);
+	if (ret)
+		return ret;
+
+	mmap_read_lock(mm);
+	ret = pin_user_pages(s2fd->hva, 1, flags, &page);
+	mmap_read_unlock(mm);
+
+	if (ret == -EHWPOISON) {
+		kvm_send_hwpoison_signal(s2fd->hva, PAGE_SHIFT);
+		ret = 0;
+		goto dec_account;
+	} else if (ret != 1) {
+		ret = -EFAULT;
+		goto dec_account;
+	} else if (!folio_test_swapbacked(page_folio(page))) {
+		/*
+		 * We really can't deal with page-cache pages returned by GUP
+		 * because (a) we may trigger writeback of a page for which we
+		 * no longer have access and (b) page_mkclean() won't find the
+		 * stage-2 mapping in the rmap so we can get out-of-whack with
+		 * the filesystem when marking the page dirty during unpinning
+		 * (see cc5095747edf ("ext4: don't BUG if someone dirty pages
+		 * without asking ext4 first")).
+		 *
+		 * Ideally we'd just restrict ourselves to anonymous pages, but
+		 * we also want to allow memfd (i.e. shmem) pages, so check for
+		 * pages backed by swap in the knowledge that the GUP pin will
+		 * prevent try_to_unmap() from succeeding.
+		 */
+		ret = -EIO;
+		goto unpin;
+	}
+
+	write_lock(&kvm->mmu_lock);
+	ret = pkvm_pgtable_stage2_map(pgt, s2fd->fault_ipa, PAGE_SIZE,
+				      page_to_phys(page), KVM_PGTABLE_PROT_RWX,
+				      hyp_memcache, 0);
+	write_unlock(&kvm->mmu_lock);
+	if (ret) {
+		if (ret == -EAGAIN)
+			ret = 0;
+		goto unpin;
+	}
+
+	return 0;
+unpin:
+	unpin_user_pages(&page, 1);
+dec_account:
+	account_locked_vm(mm, 1, false);
+	return ret;
+}
+
 static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd,
 				     struct kvm_s2_fault_vma_info *s2vi,
 				     struct vm_area_struct *vma)
···
 		goto out_unlock;
 	}

-	VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) &&
-			!write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu));
-
 	const struct kvm_s2_fault_desc s2fd = {
 		.vcpu = vcpu,
 		.fault_ipa = fault_ipa,
···
 		.hva = hva,
 	};

-	if (kvm_slot_has_gmem(memslot))
-		ret = gmem_abort(&s2fd);
-	else
-		ret = user_mem_abort(&s2fd);
+	if (kvm_vm_is_protected(vcpu->kvm)) {
+		ret = pkvm_mem_abort(&s2fd);
+	} else {
+		VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) &&
+				!write_fault &&
+				!kvm_vcpu_trap_is_exec_fault(vcpu));
+
+		if (kvm_slot_has_gmem(memslot))
+			ret = gmem_abort(&s2fd);
+		else
+			ret = user_mem_abort(&s2fd);
+	}

 	if (ret == 0)
 		ret = 1;
···

 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	if (!kvm->arch.mmu.pgt)
+	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
 		return false;

 	__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
···
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;

-	if (!kvm->arch.mmu.pgt)
+	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
 		return false;

 	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
···
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;

-	if (!kvm->arch.mmu.pgt)
+	if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
 		return false;

 	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
···
 {
 	hva_t hva, reg_end;
 	int ret = 0;
+
+	if (kvm_vm_is_protected(kvm)) {
+		/* Cannot modify memslots once a pVM has run. */
+		if (pkvm_hyp_vm_is_created(kvm) &&
+		    (change == KVM_MR_DELETE || change == KVM_MR_MOVE)) {
+			return -EPERM;
+		}
+
+		if (new &&
+		    new->flags & (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)) {
+			return -EPERM;
+		}
+	}

 	if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
 	    change != KVM_MR_FLAGS_ONLY)
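The memslot restrictions added to kvm_arch_prepare_memory_region() amount to a small policy function. The sketch below restates that policy outside the kernel so the cases are easy to enumerate. `pvm_memslot_check()` and its boolean parameters are hypothetical stand-ins for `kvm_vm_is_protected()`, `pkvm_hyp_vm_is_created()` and the `new` memslot pointer, and the `MR_*` enum is a local model rather than the kernel's `enum kvm_mr_change`. The two flag values match the uapi definitions of `KVM_MEM_LOG_DIRTY_PAGES` and `KVM_MEM_READONLY`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MEM_LOG_DIRTY_PAGES	(1U << 0)
#define MEM_READONLY		(1U << 1)

/* Local model of the memory-region change kinds. */
enum mr_change { MR_CREATE, MR_DELETE, MR_MOVE, MR_FLAGS_ONLY };

/*
 * Returns 0 if the change is allowed, -EPERM otherwise.
 *
 * is_protected: the VM is a protected VM
 * vm_created:   the hypervisor-side VM exists, i.e. the pVM has run
 * has_new:      the change supplies a new memslot
 * new_flags:    flags of that new memslot (ignored if !has_new)
 */
static int pvm_memslot_check(bool is_protected, bool vm_created,
			     enum mr_change change, bool has_new,
			     uint32_t new_flags)
{
	if (!is_protected)
		return 0;

	/* Memslots cannot be deleted or moved once a pVM has run. */
	if (vm_created && (change == MR_DELETE || change == MR_MOVE))
		return -EPERM;

	/* Dirty logging and read-only slots are refused outright. */
	if (has_new && (new_flags & (MEM_LOG_DIRTY_PAGES | MEM_READONLY)))
		return -EPERM;

	return 0;
}
```

Note that the flag check applies whether or not the pVM has run, whereas the delete and move restriction only starts once the hypervisor VM has been created.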
+131 -26
arch/arm64/kvm/pkvm.c
···
 static void __pkvm_destroy_hyp_vm(struct kvm *kvm)
 {
 	if (pkvm_hyp_vm_is_created(kvm)) {
-		WARN_ON(kvm_call_hyp_nvhe(__pkvm_teardown_vm,
+		WARN_ON(kvm_call_hyp_nvhe(__pkvm_finalize_teardown_vm,
 					  kvm->arch.pkvm.handle));
 	} else if (kvm->arch.pkvm.handle) {
 		/*
···
 {
 	int ret = 0;

+	/*
+	 * Synchronise with kvm_arch_prepare_memory_region(), as we
+	 * prevent memslot modifications on a pVM that has been run.
+	 */
+	mutex_lock(&kvm->slots_lock);
 	mutex_lock(&kvm->arch.config_lock);
 	if (!pkvm_hyp_vm_is_created(kvm))
 		ret = __pkvm_create_hyp_vm(kvm);
 	mutex_unlock(&kvm->arch.config_lock);
+	mutex_unlock(&kvm->slots_lock);

 	return ret;
 }
···
 	mutex_unlock(&kvm->arch.config_lock);
 }

-int pkvm_init_host_vm(struct kvm *kvm)
+int pkvm_init_host_vm(struct kvm *kvm, unsigned long type)
 {
 	int ret;
+	bool protected = type & KVM_VM_TYPE_ARM_PROTECTED;

 	if (pkvm_hyp_vm_is_created(kvm))
 		return -EINVAL;
···
 		return ret;

 	kvm->arch.pkvm.handle = ret;
+	kvm->arch.pkvm.is_protected = protected;
+	if (protected) {
+		pr_warn_once("kvm: protected VMs are experimental and for development only, tainting kernel\n");
+		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+	}

 	return 0;
 }
···
 	return 0;
 }

-static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 end)
+static int __pkvm_pgtable_stage2_reclaim(struct kvm_pgtable *pgt, u64 start, u64 end)
 {
 	struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu);
 	pkvm_handle_t handle = kvm->arch.pkvm.handle;
 	struct pkvm_mapping *mapping;
 	int ret;

-	if (!handle)
-		return 0;
+	for_each_mapping_in_range_safe(pgt, start, end, mapping) {
+		struct page *page;
+
+		ret = kvm_call_hyp_nvhe(__pkvm_reclaim_dying_guest_page,
+					handle, mapping->gfn);
+		if (WARN_ON(ret))
+			continue;
+
+		page = pfn_to_page(mapping->pfn);
+		WARN_ON_ONCE(mapping->nr_pages != 1);
+		unpin_user_pages_dirty_lock(&page, 1, true);
+		account_locked_vm(current->mm, 1, false);
+		pkvm_mapping_remove(mapping, &pgt->pkvm_mappings);
+		kfree(mapping);
+	}
+
+	return 0;
+}
+
+static int __pkvm_pgtable_stage2_unshare(struct kvm_pgtable *pgt, u64 start, u64 end)
+{
+	struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu);
+	pkvm_handle_t handle = kvm->arch.pkvm.handle;
+	struct pkvm_mapping *mapping;
+	int ret;

 	for_each_mapping_in_range_safe(pgt, start, end, mapping) {
 		ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_guest, handle, mapping->gfn,
···
 void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
 				       u64 addr, u64 size)
 {
-	__pkvm_pgtable_stage2_unmap(pgt, addr, addr + size);
+	struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu);
+	pkvm_handle_t handle = kvm->arch.pkvm.handle;
+
+	if (!handle)
+		return;
+
+	if (pkvm_hyp_vm_is_created(kvm) && !kvm->arch.pkvm.is_dying) {
+		WARN_ON(kvm_call_hyp_nvhe(__pkvm_start_teardown_vm, handle));
+		kvm->arch.pkvm.is_dying = true;
+	}
+
+	if (kvm_vm_is_protected(kvm))
+		__pkvm_pgtable_stage2_reclaim(pgt, addr, addr + size);
+	else
+		__pkvm_pgtable_stage2_unshare(pgt, addr, addr + size);
 }

 void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
···
 	struct kvm_hyp_memcache *cache = mc;
 	u64 gfn = addr >> PAGE_SHIFT;
 	u64 pfn = phys >> PAGE_SHIFT;
+	u64 end = addr + size;
 	int ret;

-	if (size != PAGE_SIZE && size != PMD_SIZE)
-		return -EINVAL;
-
 	lockdep_assert_held_write(&kvm->mmu_lock);
+	mapping = pkvm_mapping_iter_first(&pgt->pkvm_mappings, addr, end - 1);

-	/*
-	 * Calling stage2_map() on top of existing mappings is either happening because of a race
-	 * with another vCPU, or because we're changing between page and block mappings. As per
-	 * user_mem_abort(), same-size permission faults are handled in the relax_perms() path.
-	 */
-	mapping = pkvm_mapping_iter_first(&pgt->pkvm_mappings, addr, addr + size - 1);
-	if (mapping) {
-		if (size == (mapping->nr_pages * PAGE_SIZE))
-			return -EAGAIN;
+	if (kvm_vm_is_protected(kvm)) {
+		/* Protected VMs are mapped using RWX page-granular mappings */
+		if (WARN_ON_ONCE(size != PAGE_SIZE))
+			return -EINVAL;

-		/* Remove _any_ pkvm_mapping overlapping with the range, bigger or smaller. */
-		ret = __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size);
-		if (ret)
-			return ret;
-		mapping = NULL;
+		if (WARN_ON_ONCE(prot != KVM_PGTABLE_PROT_RWX))
+			return -EINVAL;
+
+		/*
+		 * We either raced with another vCPU or the guest PTE
+		 * has been poisoned by an erroneous host access.
+		 */
+		if (mapping) {
+			ret = kvm_call_hyp_nvhe(__pkvm_vcpu_in_poison_fault);
+			return ret ? -EFAULT : -EAGAIN;
+		}
+
+		ret = kvm_call_hyp_nvhe(__pkvm_host_donate_guest, pfn, gfn);
+	} else {
+		if (WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE))
+			return -EINVAL;
+
+		/*
+		 * We either raced with another vCPU or we're changing between
+		 * page and block mappings. As per user_mem_abort(), same-size
+		 * permission faults are handled in the relax_perms() path.
+		 */
+		if (mapping) {
+			if (size == (mapping->nr_pages * PAGE_SIZE))
+				return -EAGAIN;
+
+			/*
+			 * Remove _any_ pkvm_mapping overlapping with the range,
+			 * bigger or smaller.
+			 */
+			ret = __pkvm_pgtable_stage2_unshare(pgt, addr, end);
+			if (ret)
+				return ret;
+
+			mapping = NULL;
+		}
+
+		ret = kvm_call_hyp_nvhe(__pkvm_host_share_guest, pfn, gfn,
+					size / PAGE_SIZE, prot);
 	}

-	ret = kvm_call_hyp_nvhe(__pkvm_host_share_guest, pfn, gfn, size / PAGE_SIZE, prot);
 	if (WARN_ON(ret))
 		return ret;

···

 int pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size)
 {
-	lockdep_assert_held_write(&kvm_s2_mmu_to_kvm(pgt->mmu)->mmu_lock);
+	struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu);

-	return __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size);
+	if (WARN_ON(kvm_vm_is_protected(kvm)))
+		return -EPERM;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	return __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size);
 }

 int pkvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
···
 	pkvm_handle_t handle = kvm->arch.pkvm.handle;
 	struct pkvm_mapping *mapping;
 	int ret = 0;
+
+	if (WARN_ON(kvm_vm_is_protected(kvm)))
+		return -EPERM;

 	lockdep_assert_held(&kvm->mmu_lock);
 	for_each_mapping_in_range_safe(pgt, addr, addr + size, mapping) {
···
 	struct pkvm_mapping *mapping;
 	bool young = false;

+	if (WARN_ON(kvm_vm_is_protected(kvm)))
+		return false;
+
 	lockdep_assert_held(&kvm->mmu_lock);
 	for_each_mapping_in_range_safe(pgt, addr, addr + size, mapping)
 		young |= kvm_call_hyp_nvhe(__pkvm_host_test_clear_young_guest, handle, mapping->gfn,
···
 int pkvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr, enum kvm_pgtable_prot prot,
 				    enum kvm_pgtable_walk_flags flags)
 {
+	if (WARN_ON(kvm_vm_is_protected(kvm_s2_mmu_to_kvm(pgt->mmu))))
+		return -EPERM;
+
 	return kvm_call_hyp_nvhe(__pkvm_host_relax_perms_guest, addr >> PAGE_SHIFT, prot);
 }

 void pkvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr,
 				 enum kvm_pgtable_walk_flags flags)
 {
+	if (WARN_ON(kvm_vm_is_protected(kvm_s2_mmu_to_kvm(pgt->mmu))))
+		return;
+
 	WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_mkyoung_guest, addr >> PAGE_SHIFT));
 }

···
 {
 	WARN_ON_ONCE(1);
 	return -EINVAL;
+}
+
+/*
+ * Forcefully reclaim a page from the guest, zeroing its contents and
+ * poisoning the stage-2 pte so that pages can no longer be mapped at
+ * the same IPA. The page remains pinned until the guest is destroyed.
+ */
+bool pkvm_force_reclaim_guest_page(phys_addr_t phys)
+{
+	int ret = kvm_call_hyp_nvhe(__pkvm_force_reclaim_guest_page, phys);
+
+	return !ret || ret == -EAGAIN;
 }
+30 -3
arch/arm64/mm/fault.c
···
 #include <asm/system_misc.h>
 #include <asm/tlbflush.h>
 #include <asm/traps.h>
+#include <asm/virt.h>

 struct fault_info {
 	int (*fn)(unsigned long far, unsigned long esr,
···
 	return false;
 }

+static bool is_pkvm_stage2_abort(unsigned int esr)
+{
+	/*
+	 * S1PTW should only ever be set in ESR_EL1 if the pkvm hypervisor
+	 * injected a stage-2 abort -- see host_inject_mem_abort().
+	 */
+	return is_pkvm_initialized() && (esr & ESR_ELx_S1PTW);
+}
+
 static bool __kprobes is_spurious_el1_translation_fault(unsigned long addr,
 							unsigned long esr,
 							struct pt_regs *regs)
···
 	 * If we now have a valid translation, treat the translation fault as
 	 * spurious.
 	 */
-	if (!(par & SYS_PAR_EL1_F))
+	if (!(par & SYS_PAR_EL1_F)) {
+		if (is_pkvm_stage2_abort(esr)) {
+			par &= SYS_PAR_EL1_PA;
+			return pkvm_force_reclaim_guest_page(par);
+		}
+
 		return true;
+	}

 	/*
 	 * If we got a different type of fault from the AT instruction,
···
 	if (!is_el1_instruction_abort(esr) && fixup_exception(regs, esr))
 		return;

-	if (WARN_RATELIMIT(is_spurious_el1_translation_fault(addr, esr, regs),
-	    "Ignoring spurious kernel translation fault at virtual address %016lx\n", addr))
+	if (is_spurious_el1_translation_fault(addr, esr, regs)) {
+		WARN_RATELIMIT(!is_pkvm_stage2_abort(esr),
+			"Ignoring spurious kernel translation fault at virtual address %016lx\n", addr);
 		return;
+	}

 	if (is_el1_mte_sync_tag_check_fault(esr)) {
 		do_tag_recovery(addr, esr, regs);
···
 			msg = "read from unreadable memory";
 	} else if (addr < PAGE_SIZE) {
 		msg = "NULL pointer dereference";
+	} else if (is_pkvm_stage2_abort(esr)) {
+		msg = "access to hypervisor-protected memory";
 	} else {
 		if (esr_fsc_is_translation_fault(esr) &&
 		    kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
···
 		if (!insn_may_access_user(regs->pc, esr))
 			die_kernel_fault("access to user memory outside uaccess routines",
 					 addr, esr, regs);
+	}
+
+	if (is_pkvm_stage2_abort(esr)) {
+		if (!user_mode(regs))
+			goto no_context;
+		arm64_force_sig_fault(SIGSEGV, SEGV_ACCERR, far, "stage-2 fault");
+		return 0;
 	}

 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
+1 -1
drivers/virt/coco/pkvm-guest/Kconfig
···
 config ARM_PKVM_GUEST
 	bool "Arm pKVM protected guest driver"
-	depends on ARM64
+	depends on ARM64 && DMA_RESTRICTED_POOL
 	help
 	  Protected guests running under the pKVM hypervisor on arm64
 	  are isolated from the host and must issue hypercalls to enable
+5
include/uapi/linux/kvm.h
···
 #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
 #define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
 	((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+
+#define KVM_VM_TYPE_ARM_PROTECTED (1UL << 31)
+#define KVM_VM_TYPE_ARM_MASK (KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
+			      KVM_VM_TYPE_ARM_PROTECTED)
+
 /*
  * ioctls for /dev/kvm fds:
  */
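The new uapi bits suggest how a VMM might compose the machine type it passes to `KVM_CREATE_VM` when asking for a protected guest. The sketch below copies the macro values from the hunk above. `make_vm_type()` and `vm_type_is_valid()` are hypothetical helpers, and the mask-based validity check is an assumption about how `KVM_VM_TYPE_ARM_MASK` is meant to be used, since the kernel-side check is not visible in this part of the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the include/uapi/linux/kvm.h hunk. */
#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK	0xffULL
#define KVM_VM_TYPE_ARM_IPA_SIZE(x)	((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
#define KVM_VM_TYPE_ARM_PROTECTED	(1UL << 31)
#define KVM_VM_TYPE_ARM_MASK		(KVM_VM_TYPE_ARM_IPA_SIZE_MASK | \
					 KVM_VM_TYPE_ARM_PROTECTED)

/* Hypothetical helper: build a machine type from IPA bits and a flag. */
static uint64_t make_vm_type(unsigned int ipa_bits, bool protected_vm)
{
	uint64_t type = KVM_VM_TYPE_ARM_IPA_SIZE(ipa_bits);

	if (protected_vm)
		type |= KVM_VM_TYPE_ARM_PROTECTED;

	return type;
}

/* Hypothetical check: reject any bit outside the documented mask. */
static bool vm_type_is_valid(uint64_t type)
{
	return !(type & ~KVM_VM_TYPE_ARM_MASK);
}
```

A VMM would then pass the result as the argument of the `KVM_CREATE_VM` ioctl. Per the cover letter quoted in the merge message, this only works on a kernel built with the developer Kconfig option and booted with 'kvm-arm.mode=protected'.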