From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
Date: Thu, 19 Jun 2014 11:01:06 +0300
Message-ID: <53A298C2.4040005@gmail.com>
References: <20140618231203.846608908@amt.cnet> <20140618231521.569025131@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: pbonzini@redhat.com, xiaoguangrong@linux.vnet.ibm.com, gleb@kernel.org
To: mtosatti@redhat.com, kvm@vger.kernel.org, ak@linux.intel.com
Return-path:
Received: from mail-we0-f172.google.com ([74.125.82.172]:44909 "EHLO
	mail-we0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750975AbaFSIBK (ORCPT );
	Thu, 19 Jun 2014 04:01:10 -0400
Received: by mail-we0-f172.google.com with SMTP id u57so1966299wes.3 for ;
	Thu, 19 Jun 2014 01:01:08 -0700 (PDT)
In-Reply-To: <20140618231521.569025131@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 06/19/2014 02:12 AM, mtosatti@redhat.com wrote:
> Allow vcpus to pin spte translations by:
>
> 1) Creating a per-vcpu list of pinned ranges.
> 2) On mmu reload request:
> 	- Fault ranges.
> 	- Mark sptes with a pinned bit.
> 	- Mark shadow pages as pinned.
>
> 3) Then modify the following actions:
> 	- Page age => skip spte flush.
> 	- MMU notifiers => force mmu reload request (which kicks cpu out of
> 	  guest mode).
> 	- GET_DIRTY_LOG => force mmu reload request.
> 	- SLAB shrinker => skip shadow page deletion.
>
> TDP-only.
>
>
> +int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
> +				   gfn_t base_gfn, unsigned long npages)
> +{
> +	struct kvm_pinned_page_range *p;
> +
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> +		if (p->base_gfn == base_gfn && p->npages == npages) {
> +			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +			return -EEXIST;
> +		}
> +	}
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +
> +	if (vcpu->arch.nr_pinned_ranges >=
> +	    KVM_MAX_PER_VCPU_PINNED_RANGE)
> +		return -ENOSPC;
> +
> +	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	vcpu->arch.nr_pinned_ranges++;
> +
> +	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
> +
> +	INIT_LIST_HEAD(&p->link);
> +	p->base_gfn = base_gfn;
> +	p->npages = npages;
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> +	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +
> +	return 0;
> +}
> +

What happens if ranges overlap (within a vcpu, or across vcpus)? Or if a 
range overflows and wraps around 0? Or if it does not refer to RAM?

Looks like you're limiting the number of ranges, but not the number of 
pages, so a guest can lock all of its memory.

> +
> +/*
> + * Pin KVM MMU page translations. This guarantees, for valid
> + * addresses registered by kvm_mmu_register_pinned_range (valid address
> + * meaning address which possesses sufficient information for fault to
> + * be resolved), valid translations exist while in guest mode and
> + * therefore no VM-exits due to faults will occur.
> + *
> + * Failure to instantiate pages will abort guest entry.
> + *
> + * Page frames should be pinned with get_page in advance.
> + *
> + * Pinning is not guaranteed while executing as L2 guest.

Does this undermine security?
> + *
> + */
> +
> +static void kvm_mmu_pin_pages(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pinned_page_range *p;
> +
> +	if (is_guest_mode(vcpu))
> +		return;
> +
> +	if (!vcpu->arch.mmu.direct_map)
> +		return;
> +
> +	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
> +
> +	mutex_lock(&vcpu->arch.pinned_mmu_mutex);

Is the mutex actually needed? It seems it's only taken in vcpu context, 
so the vcpu mutex should be sufficient.

> +	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> +		gfn_t gfn_offset;
> +
> +		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
> +			gfn_t gfn = p->base_gfn + gfn_offset;
> +			int r;
> +			bool pinned = false;
> +
> +			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
> +						      PFERR_WRITE_MASK, false,
> +						      true, &pinned);
> +			/* MMU notifier sequence window: retry */
> +			if (!r && !pinned)
> +				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +			if (r) {
> +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> +				break;
> +			}
> +
> +		}
> +	}
> +	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> +}
> +
>  int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  {
>  	int r;
> @@ -3916,6 +4101,7 @@
>  		goto out;
>  	/* set_cr3() should ensure TLB has been flushed */
>  	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
> +	kvm_mmu_pin_pages(vcpu);
>  out:
>  	return r;
>  }

I don't see where you unpin pages, so even if you limit the number of 
pinned pages, a guest can pin all of memory by iterating over all of 
memory and pinning it a chunk at a time. You might try something 
similar to guest MTRR handling.