From: Avi Kivity
Subject: Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
Date: Sun, 22 Jun 2014 16:35:24 +0300
Message-ID: <53A6DB9C.7040107@gmail.com>
In-Reply-To: <20140619182627.GA32410@amt.cnet>
References: <20140618231203.846608908@amt.cnet>
 <20140618231521.569025131@amt.cnet>
 <53A298C2.4040005@gmail.com>
 <20140619182627.GA32410@amt.cnet>
To: Marcelo Tosatti
Cc: kvm@vger.kernel.org, ak@linux.intel.com, pbonzini@redhat.com,
 xiaoguangrong@linux.vnet.ibm.com, gleb@kernel.org

On 06/19/2014 09:26 PM, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 11:01:06AM +0300, Avi Kivity wrote:
>> On 06/19/2014 02:12 AM, mtosatti@redhat.com wrote:
>>> Allow vcpus to pin spte translations by:
>>>
>>> 1) Creating a per-vcpu list of pinned ranges.
>>> 2) On mmu reload request:
>>>    - Fault ranges.
>>>    - Mark sptes with a pinned bit.
>>>    - Mark shadow pages as pinned.
>>>
>>> 3) Then modify the following actions:
>>>    - Page age => skip spte flush.
>>>    - MMU notifiers => force mmu reload request (which kicks cpu out of
>>>      guest mode).
>>>    - GET_DIRTY_LOG => force mmu reload request.
>>>    - SLAB shrinker => skip shadow page deletion.
>>>
>>> TDP-only.
>>>
>>> +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
>>> +				  gfn_t base_gfn, unsigned long npages)
>>> +{
>>> +	struct kvm_pinned_page_range *p;
>>> +
>>> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
>>> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
>>> +		if (p->base_gfn == base_gfn && p->npages == npages) {
>>> +			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +			return -EEXIST;
>>> +		}
>>> +	}
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +
>>> +	if (vcpu->arch.nr_pinned_ranges >=
>>> +	    KVM_MAX_PER_VCPU_PINNED_RANGE)
>>> +		return -ENOSPC;
>>> +
>>> +	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
>>> +	if (!p)
>>> +		return -ENOMEM;
>>> +
>>> +	vcpu->arch.nr_pinned_ranges++;
>>> +
>>> +	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
>>> +
>>> +	INIT_LIST_HEAD(&p->link);
>>> +	p->base_gfn = base_gfn;
>>> +	p->npages = npages;
>>> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
>>> +	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>
>> What happens if ranges overlap (within a vcpu, cross-vcpu)?
> The page(s) are faulted multiple times if ranges overlap within a vcpu.
>
> I see no reason to disallow overlapping ranges. Do you?

Not really.  Just making sure nothing horrible happens.

>
>> Or if a range overflows and wraps around 0?
> Pagefault fails on vm-entry -> KVM_REQ_TRIPLE_FAULT.
>
> Will double check for overflows to make sure.

Will the loop terminate?

>> Looks like you're limiting the number of ranges, but not the number
>> of pages, so a guest can lock all of its memory.
> Yes. The page pinning at get_page time can also lock all of
> guest memory.

I'm sure that can't be good.  Maybe subject this pinning to the task
mlock limit.

>
>>> +
>>> +/*
>>> + * Pin KVM MMU page translations.
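For the wrap-around case, a validity check along these lines would keep the per-gfn fault loop finite (user-space sketch with a hypothetical helper name, not the actual patch code):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long gfn_t;	/* stand-in for the kernel's gfn_t */

/*
 * Reject empty ranges and ranges whose last gfn would wrap past the top
 * of the guest frame number space, so a per-gfn loop over the range
 * always terminates.
 */
static bool pinned_range_valid(gfn_t base_gfn, unsigned long npages)
{
	if (npages == 0)
		return false;
	/* the last gfn is base_gfn + npages - 1; it must not wrap around 0 */
	return base_gfn <= (gfn_t)-1 - (gfn_t)(npages - 1);
}
```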
>>> + * This guarantees, for valid
>>> + * addresses registered by kvm_mmu_register_pinned_range (valid address
>>> + * meaning address which possesses sufficient information for fault to
>>> + * be resolved), valid translations exist while in guest mode and
>>> + * therefore no VM-exits due to faults will occur.
>>> + *
>>> + * Failure to instantiate pages will abort guest entry.
>>> + *
>>> + * Page frames should be pinned with get_page in advance.
>>> + *
>>> + * Pinning is not guaranteed while executing as L2 guest.
>>
>> Does this undermine security?
> PEBS writes should not be enabled when L2 guest is executing.

What prevents L1 from setting up PEBS MSRs for L2?

>>> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
>>> +		gfn_t gfn_offset;
>>> +
>>> +		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
>>> +			gfn_t gfn = p->base_gfn + gfn_offset;
>>> +			int r;
>>> +			bool pinned = false;
>>> +
>>> +			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
>>> +						      PFERR_WRITE_MASK, false,
>>> +						      true, &pinned);
>>> +			/* MMU notifier sequence window: retry */
>>> +			if (!r && !pinned)
>>> +				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
>>> +			if (r) {
>>> +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>>> +				break;
>>> +			}
>>> +		}
>>> +	}
>>> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
>>> +}
>>> +
>>>  int kvm_mmu_load(struct kvm_vcpu *vcpu)
>>>  {
>>>  	int r;
>>> @@ -3916,6 +4101,7 @@
>>>  		goto out;
>>>  	/* set_cr3() should ensure TLB has been flushed */
>>>  	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
>>> +	kvm_mmu_pin_pages(vcpu);
>>>  out:
>>>  	return r;
>>>  }
>>>
>>
>> I don't see where you unpin pages, so even if you limit the number
>> of pinned pages, a guest can pin all of memory by iterating over all
>> of memory and pinning it a chunk at a time.
> The caller should be responsible for limiting the number of pages pinned,
> as it is the one pinning the struct pages?

The caller would be the debug store area MSR callbacks.  How would they
know what the limit is?
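If the limit were enforced at this layer, the usual pattern would be to charge pinned pages against the task's mlock limit, as mlock and vfio do. Roughly (user-space model with hypothetical names; the real thing would charge against RLIMIT_MEMLOCK):

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of charging pinned pages against an mlock-style cap. */
struct pin_account {
	unsigned long locked;	/* pages already charged */
	unsigned long limit;	/* e.g. RLIMIT_MEMLOCK >> PAGE_SHIFT */
};

/* Charge npages; fail with no side effects if the cap would be exceeded. */
static bool pin_charge(struct pin_account *acct, unsigned long npages)
{
	unsigned long new_locked = acct->locked + npages;

	/* reject both arithmetic overflow and exceeding the cap */
	if (new_locked < acct->locked || new_locked > acct->limit)
		return false;
	acct->locked = new_locked;
	return true;
}

/* Undo a previous charge when the range is unregistered. */
static void pin_uncharge(struct pin_account *acct, unsigned long npages)
{
	acct->locked -= npages;
}
```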
>
> And in that case, we should remove any limiting from this interface, as
> that is confusing.
>
>> You might try something similar to guest MTRR handling.
> Please be more verbose.
>

mtrr_state already provides physical range attributes that are looked up
on every fault, so I thought you could get the pinned attribute from
there.  But I guess that's too late, since you want to pre-fault everything.
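For reference, the kind of lookup I had in mind is just a per-fault range query over the registered ranges, analogous to consulting mtrr_state (user-space sketch, hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long gfn_t;	/* stand-in for the kernel's gfn_t */

struct pinned_range {
	gfn_t base_gfn;
	unsigned long npages;
};

/* MTRR-style attribute lookup: is this gfn covered by a registered range? */
static bool gfn_is_pinned(const struct pinned_range *ranges, size_t n,
			  gfn_t gfn)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (gfn >= ranges[i].base_gfn &&
		    gfn - ranges[i].base_gfn < ranges[i].npages)
			return true;
	}
	return false;
}
```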