From: Avi Kivity
Subject: Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
Date: Sun, 22 Jun 2014 16:35:24 +0300
Message-ID: <53A6DB9C.7040107@gmail.com>
In-Reply-To: <20140619182627.GA32410@amt.cnet>
References: <20140618231203.846608908@amt.cnet>
 <20140618231521.569025131@amt.cnet>
 <53A298C2.4040005@gmail.com>
 <20140619182627.GA32410@amt.cnet>
To: Marcelo Tosatti
Cc: kvm@vger.kernel.org, ak@linux.intel.com, pbonzini@redhat.com,
 xiaoguangrong@linux.vnet.ibm.com, gleb@kernel.org

On 06/19/2014 09:26 PM, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 11:01:06AM +0300, Avi Kivity wrote:
>> On 06/19/2014 02:12 AM, mtosatti@redhat.com wrote:
>>> Allow vcpus to pin spte translations by:
>>>
>>> 1) Creating a per-vcpu list of pinned ranges.
>>> 2) On mmu reload request:
>>>    - Fault ranges.
>>>    - Mark sptes with a pinned bit.
>>>    - Mark shadow pages as pinned.
>>>
>>> 3) Then modify the following actions:
>>>    - Page age => skip spte flush.
>>>    - MMU notifiers => force mmu reload request (which kicks cpu out of
>>>      guest mode).
>>>    - GET_DIRTY_LOG => force mmu reload request.
>>>    - SLAB shrinker => skip shadow page deletion.
>>>
>>> TDP-only.
>>>
>>> +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
>>> +				  gfn_t base_gfn, unsigned long npages)
>>> +{
>>> +	struct kvm_pinned_page_range *p;
>>> +
>>> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
>>> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
>>> +		if (p->base_gfn == base_gfn && p->npages == npages) {
>>> +			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +			return -EEXIST;
>>> +		}
>>> +	}
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +
>>> +	if (vcpu->arch.nr_pinned_ranges >=
>>> +	    KVM_MAX_PER_VCPU_PINNED_RANGE)
>>> +		return -ENOSPC;
>>> +
>>> +	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
>>> +	if (!p)
>>> +		return -ENOMEM;
>>> +
>>> +	vcpu->arch.nr_pinned_ranges++;
>>> +
>>> +	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
>>> +
>>> +	INIT_LIST_HEAD(&p->link);
>>> +	p->base_gfn = base_gfn;
>>> +	p->npages = npages;
>>> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
>>> +	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>
>> What happens if ranges overlap (within a vcpu, cross-vcpu)?
> The page(s) are faulted multiple times if ranges overlap within a vcpu.
>
> I see no reason to disallow overlapping ranges. Do you?

Not really.  Just making sure nothing horrible happens.

>
>> Or if a range overflows and wraps around 0?
> Pagefault fails on vm-entry -> KVM_REQ_TRIPLE_FAULT.
>
> Will double check for overflows to make sure.

Will the loop terminate?

>> Looks like you're limiting the number of ranges, but not the number
>> of pages, so a guest can lock all of its memory.
> Yes. The page pinning at get_page time can also lock all of
> guest memory.

I'm sure that can't be good.  Maybe subject this pinning to the task
mlock limit.

>
>>> +
>>> +/*
>>> + * Pin KVM MMU page translations.
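For the wrap-around case, a validity check along these lines would keep the per-gfn fault loop finite (user-space sketch with a hypothetical helper name, not the actual patch code):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long gfn_t;	/* stand-in for the kernel's gfn_t */

/*
 * Reject empty ranges and ranges whose last gfn would wrap past the top
 * of the guest frame number space, so a per-gfn loop over the range
 * always terminates.
 */
static bool pinned_range_valid(gfn_t base_gfn, unsigned long npages)
{
	if (npages == 0)
		return false;
	/* the last gfn is base_gfn + npages - 1; it must not wrap around 0 */
	return base_gfn <= (gfn_t)-1 - (gfn_t)(npages - 1);
}
```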
>>> + * This guarantees, for valid
>>> + * addresses registered by kvm_mmu_register_pinned_range (valid address
>>> + * meaning address which possesses sufficient information for fault to
>>> + * be resolved), valid translations exist while in guest mode and
>>> + * therefore no VM-exits due to faults will occur.
>>> + *
>>> + * Failure to instantiate pages will abort guest entry.
>>> + *
>>> + * Page frames should be pinned with get_page in advance.
>>> + *
>>> + * Pinning is not guaranteed while executing as L2 guest.
>>
>> Does this undermine security?
> PEBS writes should not be enabled when L2 guest is executing.

What prevents L1 from setting up PEBS MSRs for L2?

>>> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
>>> +		gfn_t gfn_offset;
>>> +
>>> +		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
>>> +			gfn_t gfn = p->base_gfn + gfn_offset;
>>> +			int r;
>>> +			bool pinned = false;
>>> +
>>> +			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
>>> +						      PFERR_WRITE_MASK, false,
>>> +						      true, &pinned);
>>> +			/* MMU notifier sequence window: retry */
>>> +			if (!r && !pinned)
>>> +				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
>>> +			if (r) {
>>> +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>>> +				break;
>>> +			}
>>> +		}
>>> +	}
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +}
>>> +
>>>  int kvm_mmu_load(struct kvm_vcpu *vcpu)
>>>  {
>>>  	int r;
>>> @@ -3916,6 +4101,7 @@
>>>  		goto out;
>>>  	/* set_cr3() should ensure TLB has been flushed */
>>>  	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
>>> +	kvm_mmu_pin_pages(vcpu);
>>>  out:
>>>  	return r;
>>>  }
>>>
>>
>> I don't see where you unpin pages, so even if you limit the number
>> of pinned pages, a guest can pin all of memory by iterating over all
>> of memory and pinning it a chunk at a time.
> The caller should be responsible for limiting the number of pages pinned,
> as it is the one pinning the struct pages?

The caller would be the debug store area MSR callbacks.  How would they
know what the limit is?
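If the limit were enforced at this layer, the usual pattern would be to charge pinned pages against the task's mlock limit, as mlock and vfio do. Roughly (user-space model with hypothetical names; the real thing would charge against RLIMIT_MEMLOCK):

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of charging pinned pages against an mlock-style cap. */
struct pin_account {
	unsigned long locked;	/* pages already charged */
	unsigned long limit;	/* e.g. RLIMIT_MEMLOCK >> PAGE_SHIFT */
};

/* Charge npages; fail with no side effects if the cap would be exceeded. */
static bool pin_charge(struct pin_account *acct, unsigned long npages)
{
	unsigned long new_locked = acct->locked + npages;

	/* reject both arithmetic overflow and exceeding the cap */
	if (new_locked < acct->locked || new_locked > acct->limit)
		return false;
	acct->locked = new_locked;
	return true;
}

/* Undo a previous charge when the range is unregistered. */
static void pin_uncharge(struct pin_account *acct, unsigned long npages)
{
	acct->locked -= npages;
}
```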
>
> And in that case, we should remove any limiting from this interface, as
> that is confusing.
>
>> You might try something similar to guest MTRR handling.
> Please be more verbose.
>

mtrr_state already provides physical range attributes that are looked up
on every fault, so I thought you could get the pinned attribute from
there.  But I guess that's too late, since you want to pre-fault everything.
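For reference, the kind of lookup I had in mind is just a per-fault range query over the registered ranges, analogous to consulting mtrr_state (user-space sketch, hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long gfn_t;	/* stand-in for the kernel's gfn_t */

struct pinned_range {
	gfn_t base_gfn;
	unsigned long npages;
};

/* MTRR-style attribute lookup: is this gfn covered by a registered range? */
static bool gfn_is_pinned(const struct pinned_range *ranges, size_t n,
			  gfn_t gfn)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (gfn >= ranges[i].base_gfn &&
		    gfn - ranges[i].base_gfn < ranges[i].npages)
			return true;
	}
	return false;
}
```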