From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: KVM: MMU: remove prefault from invlpg handler
Date: Sat, 05 Dec 2009 18:57:44 +0200
Message-ID: <4B1A9108.6090000@redhat.com>
References: <20091205143411.GA16237@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm
To: Marcelo Tosatti
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:55033 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757283AbZLEQ5j
	(ORCPT ); Sat, 5 Dec 2009 11:57:39 -0500
Received: from int-mx02.intmail.prod.int.phx2.redhat.com
	(int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) by mx1.redhat.com
	(8.13.8/8.13.8) with ESMTP id nB5Gvk65014691
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
	for ; Sat, 5 Dec 2009 11:57:46 -0500
In-Reply-To: <20091205143411.GA16237@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 12/05/2009 04:34 PM, Marcelo Tosatti wrote:
> The invlpg prefault optimization breaks Windows 2008 R2 occasionally.
>
> The visible effect is that the invlpg handler instantiates a pte which
> is, microseconds later, written with a different gfn by another vcpu.
>
> The OS could have other mechanisms to prevent a present translation from
> being used, which the hypervisor is unaware of.
>
> Fix by making invlpg emulation follow documented behaviour.

Good catch.  How did you track it down?

I don't think the OS has "other mechanisms", though - the processor can
speculate the tlb, so that would be an OS bug.
It looks like a race:

> Signed-off-by: Marcelo Tosatti
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index a601713..58a0f1e 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -455,8 +455,6 @@ out_unlock:
>  static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  {
>  	struct kvm_shadow_walk_iterator iterator;
> -	pt_element_t gpte;
> -	gpa_t pte_gpa = -1;
>  	int level;
>  	u64 *sptep;
>  	int need_flush = 0;
> @@ -470,10 +468,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  		if (level == PT_PAGE_TABLE_LEVEL ||
>  		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
>  		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
> -			struct kvm_mmu_page *sp = page_header(__pa(sptep));
> -
> -			pte_gpa = (sp->gfn << PAGE_SHIFT);
> -			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
>
>  			if (is_shadow_present_pte(*sptep)) {
>  				rmap_remove(vcpu->kvm, sptep);
> @@ -492,18 +486,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  	if (need_flush)
>  		kvm_flush_remote_tlbs(vcpu->kvm);
>  	spin_unlock(&vcpu->kvm->mmu_lock);
> -
> -	if (pte_gpa == -1)
> -		return;
> -	if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
> -				  sizeof(pt_element_t)))
> -		return;

Here, another vcpu updates the gpte and issues a new invlpg.

> -	if (is_present_gpte(gpte) && (gpte & PT_ACCESSED_MASK)) {
> -		if (mmu_topup_memory_caches(vcpu))
> -			return;
> -		kvm_mmu_pte_write(vcpu, pte_gpa, (const u8 *)&gpte,
> -				  sizeof(pt_element_t), 0);
> -	}

And here we undo the correct invlpg with the outdated gpte.

Looks like we considered this, since kvm_read_guest_atomic() is only
needed if called inside the spinlock, but some other change moved the
spin_unlock() upwards.  Will investigate history.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.