Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

KVM: X86: Synchronize the shadow pagetable before link it

If a gpte is changed from non-present to present, the guest doesn't need
to flush the TLB per the SDM. So the host must synchronize the sp before
linking it. Otherwise the guest might use a wrong mapping.

For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the TLB. The SDM allows this guest behavior, but the host KVM MMU
handles it incorrectly.
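
The scenario above can be illustrated with a deliberately tiny model. This is a sketch under stated assumptions, not KVM code: `struct toy_l1` and every `toy_*` function are hypothetical names invented for illustration, and a single integer stands in for a whole level-1 pagetable. It only shows why linking a stale (unsync) shadow table exposes an old mapping, and why synchronizing before the link avoids it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not KVM): one guest level-1 entry and the
 * host's shadow copy of it. */
struct toy_l1 {
	int guest_target;  /* frame the guest's level-1 entry points at */
	int shadow_target; /* frame the host's shadow entry points at */
	bool unsync;       /* guest wrote the table, shadow not updated yet */
};

/* Guest rewrites its level-1 entry; the shadow copy is left stale. */
static void toy_guest_write_l1(struct toy_l1 *t, int new_target)
{
	t->guest_target = new_target;
	t->unsync = true;
}

/* Host brings the shadow copy back in line with the guest table. */
static void toy_sync(struct toy_l1 *t)
{
	t->shadow_target = t->guest_target;
	t->unsync = false;
}

/* Host links the shadow table under a parent entry that used to be
 * non-present. Returns the frame the guest's next access reaches,
 * given that the guest does not flush the TLB first. */
static int toy_link_and_access(struct toy_l1 *t, bool sync_before_link)
{
	if (sync_before_link && t->unsync)
		toy_sync(t);
	return t->shadow_target;
}

/* Runs the scenario both ways; returns 0 if the model behaves as described. */
static int toy_demo(void)
{
	struct toy_l1 t = { .guest_target = 1, .shadow_target = 1, .unsync = false };

	toy_guest_write_l1(&t, 2);
	assert(toy_link_and_access(&t, false) == 1); /* stale mapping is used */
	assert(toy_link_and_access(&t, true) == 2);  /* sync first: new mapping */
	return 0;
}
```

In the model, the only difference between the two outcomes is whether the sync happens before the link, which is the ordering this patch enforces.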

Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Authored by Lai Jiangshan and committed by Paolo Bonzini
65855ed8 f8160295

+31 -9
+10 -7
arch/x86/kvm/mmu/mmu.c
@@ -2027,8 +2027,8 @@
 	} while (!sp->unsync_children);
 }
 
-static void mmu_sync_children(struct kvm_vcpu *vcpu,
-			      struct kvm_mmu_page *parent)
+static int mmu_sync_children(struct kvm_vcpu *vcpu,
+			     struct kvm_mmu_page *parent, bool can_yield)
 {
 	int i;
 	struct kvm_mmu_page *sp;
@@ -2055,12 +2055,18 @@
 		}
 		if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
 			kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
+			if (!can_yield) {
+				kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
+				return -EINTR;
+			}
+
 			cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
 			flush = false;
 		}
 	}
 
 	kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
+	return 0;
 }
 
 static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
@@ -2145,9 +2151,6 @@
 			WARN_ON(!list_empty(&invalid_list));
 			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 		}
-
-		if (sp->unsync_children)
-			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
 
 		__clear_sp_write_flooding_count(sp);
 
@@ -3684,7 +3687,7 @@
 		write_lock(&vcpu->kvm->mmu_lock);
 		kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
-		mmu_sync_children(vcpu, sp);
+		mmu_sync_children(vcpu, sp, true);
 
 		kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
 		write_unlock(&vcpu->kvm->mmu_lock);
@@ -3700,7 +3703,7 @@
 		if (IS_VALID_PAE_ROOT(root)) {
 			root &= PT64_BASE_ADDR_MASK;
 			sp = to_shadow_page(root);
-			mmu_sync_children(vcpu, sp);
+			mmu_sync_children(vcpu, sp, true);
 		}
 	}
 
+21 -2
arch/x86/kvm/mmu/paging_tmpl.h
@@ -707,8 +707,27 @@
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, addr, it.level-1,
-					      false, access);
+			sp = kvm_mmu_get_page(vcpu, table_gfn, addr,
+					      it.level-1, false, access);
+			/*
+			 * We must synchronize the pagetable before linking it
+			 * because the guest doesn't need to flush tlb when
+			 * the gpte is changed from non-present to present.
+			 * Otherwise, the guest may use the wrong mapping.
+			 *
+			 * For PG_LEVEL_4K, kvm_mmu_get_page() has already
+			 * synchronized it transiently via kvm_sync_page().
+			 *
+			 * For higher level pagetable, we synchronize it via
+			 * the slower mmu_sync_children(). If it needs to
+			 * break, some progress has been made; return
+			 * RET_PF_RETRY and retry on the next #PF.
+			 * KVM_REQ_MMU_SYNC is not necessary but it
+			 * expedites the process.
+			 */
+			if (sp->unsync_children &&
+			    mmu_sync_children(vcpu, sp, false))
+				return RET_PF_RETRY;
 		}
 
 		/*
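
The comment in the paging_tmpl.h hunk says that a sync which has to break still makes progress, so retrying on the next #PF eventually succeeds. A rough way to see this is the toy model below. It is a sketch under assumptions and does not reproduce the kernel's logic: the `toy_*` names, `TOY_NR_CHILDREN`, and `TOY_BUDGET` are hypothetical, and a fixed budget stands in for `need_resched() || rwlock_needbreak()` becoming true.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_NR_CHILDREN 8
#define TOY_BUDGET 3 /* hypothetical: children synced before a break is due */

struct toy_parent {
	bool child_unsync[TOY_NR_CHILDREN];
};

/* Number of children that still need synchronizing. */
static int toy_unsync_count(const struct toy_parent *p)
{
	int n = 0;

	for (int i = 0; i < TOY_NR_CHILDREN; i++)
		n += p->child_unsync[i];
	return n;
}

/* Sync unsync children. Every TOY_BUDGET children a break is due:
 * a caller that may yield keeps going, one that may not gets -EINTR. */
static int toy_sync_children(struct toy_parent *p, bool can_yield)
{
	int done = 0;

	for (int i = 0; i < TOY_NR_CHILDREN; i++) {
		if (!p->child_unsync[i])
			continue;
		p->child_unsync[i] = false; /* progress is kept across a break */
		if (++done == TOY_BUDGET) {
			if (!can_yield)
				return -EINTR;
			done = 0; /* stands in for yielding the lock and resuming */
		}
	}
	return 0;
}

/* Models the fault path: retry while the non-yielding sync reports a
 * break. Returns the number of faults taken before the link can proceed. */
static int toy_faults_until_linked(struct toy_parent *p)
{
	int faults = 1;

	while (toy_unsync_count(p) && toy_sync_children(p, false) != 0)
		faults++;
	return faults;
}
```

With all eight children unsync and a budget of three, the model needs three faults before linking: the first two each sync three children and break, and the third syncs the remaining two. A caller that passes `can_yield = true` finishes in a single call.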