From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
To: "Nadav Har'El" <nyh@il.ibm.com>
Cc: kvm@vger.kernel.org, Joerg.Roedel@amd.com, avi@redhat.com,
owasserm@redhat.com, abelg@il.ibm.com, eddie.dong@intel.com,
yang.z.zhang@intel.com
Subject: Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h
Date: Thu, 02 Aug 2012 12:00:57 +0800
Message-ID: <5019FB79.40407@linux.vnet.ibm.com>
In-Reply-To: <201208011437.q71Ebfif023819@rice.haifa.ibm.com>
On 08/01/2012 10:37 PM, Nadav Har'El wrote:
> This is the first patch in a series which adds nested EPT support to KVM's
> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> to set its own cr3 and take its own page faults without either of L0 or L1
> getting involved. This often significantly improves L2's performance over the
> previous two alternatives (shadow page tables over EPT, and shadow page
> tables over shadow page tables).
>
> This patch adds EPT support to paging_tmpl.h.
>
> paging_tmpl.h contains the code for reading and writing page tables. The code
> for 32-bit and 64-bit tables is very similar, but not identical, so
> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> with PTTYPE=64, and this generates the two sets of similar functions.
>
> There are subtle but important differences between the format of EPT tables
> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> third set of functions to read the guest EPT table and to write the shadow
> EPT table.
>
> So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> with "EPT") which correctly read and write EPT tables.
>
Now paging_tmpl.h becomes really untidy and hard to read; maybe we should
abstract the PTTYPE-specific operations instead of spreading #ifdefs through
the function bodies.
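
Just a rough sketch of what I mean; the PT_GUEST_* names below are invented
and do not exist today. Each PTTYPE block at the top of the file could
define the guest-PTE bits once, so the function bodies stay #ifdef-free:

	#if PTTYPE == PTTYPE_EPT
	/* illustrative names only */
	#define PT_GUEST_PRESENT_MASK	(VMX_EPT_READABLE_MASK | \
					 VMX_EPT_WRITABLE_MASK | \
					 VMX_EPT_EXECUTABLE_MASK)
	#define PT_GUEST_WRITABLE_MASK	VMX_EPT_WRITABLE_MASK
	#else
	#define PT_GUEST_PRESENT_MASK	PT_PRESENT_MASK
	#define PT_GUEST_WRITABLE_MASK	PT_WRITABLE_MASK
	#endif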
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> arch/x86/kvm/mmu.c | 14 +----
> arch/x86/kvm/paging_tmpl.h | 98 ++++++++++++++++++++++++++++++++---
> 2 files changed, 96 insertions(+), 16 deletions(-)
>
> --- .before/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.000000000 +0300
> +++ .after/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.000000000 +0300
> @@ -1971,15 +1971,6 @@ static void shadow_walk_next(struct kvm_
> return __shadow_walk_next(iterator, *iterator->sptep);
> }
>
> -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
> -{
> - u64 spte;
> -
> - spte = __pa(sp->spt)
> - | PT_PRESENT_MASK | PT_ACCESSED_MASK
> - | PT_WRITABLE_MASK | PT_USER_MASK;
> - mmu_spte_set(sptep, spte);
> -}
>
> static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> unsigned direct_access)
> @@ -3427,6 +3418,11 @@ static bool sync_mmio_spte(u64 *sptep, g
> return false;
> }
>
> +#define PTTYPE_EPT 18 /* arbitrary */
> +#define PTTYPE PTTYPE_EPT
> +#include "paging_tmpl.h"
> +#undef PTTYPE
> +
> #define PTTYPE 64
> #include "paging_tmpl.h"
> #undef PTTYPE
> --- .before/arch/x86/kvm/paging_tmpl.h 2012-08-01 17:22:46.000000000 +0300
> +++ .after/arch/x86/kvm/paging_tmpl.h 2012-08-01 17:22:46.000000000 +0300
> @@ -50,6 +50,22 @@
> #define PT_LEVEL_BITS PT32_LEVEL_BITS
> #define PT_MAX_FULL_LEVELS 2
> #define CMPXCHG cmpxchg
> +#elif PTTYPE == PTTYPE_EPT
> + #define pt_element_t u64
> + #define guest_walker guest_walkerEPT
> + #define FNAME(name) EPT_##name
> + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> + #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> + #define PT_LEVEL_BITS PT64_LEVEL_BITS
> + #ifdef CONFIG_X86_64
> + #define PT_MAX_FULL_LEVELS 4
> + #define CMPXCHG cmpxchg
> + #else
> + #define CMPXCHG cmpxchg64
> + #define PT_MAX_FULL_LEVELS 2
> + #endif
Is the PT_MAX_FULL_LEVELS == 3 case missing? Ah, you mentioned it as the PAE
case in PATCH 0.
> #else
> #error Invalid PTTYPE value
> #endif
> @@ -78,6 +94,7 @@ static gfn_t gpte_to_gfn_lvl(pt_element_
> return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> }
>
> +#if PTTYPE != PTTYPE_EPT
> static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> pt_element_t __user *ptep_user, unsigned index,
> pt_element_t orig_pte, pt_element_t new_pte)
> @@ -100,15 +117,22 @@ static int FNAME(cmpxchg_gpte)(struct kv
>
> return (ret != orig_pte);
> }
> +#endif
>
Note that EPT A/D bits are supported on new Intel CPUs, so this function
should be reworked for nEPT. I know you do not expose this feature to the
guest yet, but we can reduce the difference between nEPT and the other MMU
models once A/D is supported.
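
For reference, the EPT accessed and dirty bits live at bits 8 and 9 of an
EPT PTE per the Intel SDM, so the abstraction above could be extended along
these lines (again, the PT_GUEST_* names are invented):

	#if PTTYPE == PTTYPE_EPT
	/* EPT accessed/dirty bits, per the Intel SDM */
	#define PT_GUEST_ACCESSED_MASK	(1ull << 8)
	#define PT_GUEST_DIRTY_MASK	(1ull << 9)
	#else
	#define PT_GUEST_ACCESSED_MASK	PT_ACCESSED_MASK
	#define PT_GUEST_DIRTY_MASK	PT_DIRTY_MASK
	#endif

cmpxchg_gpte() itself is type-neutral (it only cmpxchg()es whatever new pte
the caller built), so with masks like these it would not need to be compiled
out for EPT at all.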
> static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
> bool last)
> {
> unsigned access;
>
> +#if PTTYPE == PTTYPE_EPT
> + /* We rely here that ACC_WRITE_MASK==VMX_EPT_WRITABLE_MASK */
> + access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
> + ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
> +#else
> access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> if (last && !is_dirty_gpte(gpte))
> access &= ~ACC_WRITE_MASK;
> +#endif
>
Maybe we can introduce PT_xxx_MASK macros to abstract the access bits.
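
Since the dirty-bit and NX handling make a purely mask-based version messy,
another option is to split the whole helper per type instead of #ifdef'ing
inside the body; a sketch:

	#if PTTYPE == PTTYPE_EPT
	static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu,
					   pt_element_t gpte, bool last)
	{
		/*
		 * EPT has no user/supervisor bit, so user access is always
		 * implied. Relies on ACC_WRITE_MASK == VMX_EPT_WRITABLE_MASK,
		 * the same trick the patch already uses.
		 */
		return (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
		       ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
	}
	#else
	/* the existing implementation, unchanged */
	#endif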
> #if PTTYPE == 64
> if (vcpu->arch.mmu.nx)
> @@ -135,6 +159,30 @@ static bool FNAME(is_last_gpte)(struct g
> return false;
> }
>
> +static inline int FNAME(is_present_gpte)(unsigned long pte)
> +{
> +#if PTTYPE == PTTYPE_EPT
> + return pte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> + VMX_EPT_EXECUTABLE_MASK);
> +#else
> + return is_present_gpte(pte);
> +#endif
> +}
> +
Introducing something like PT_PRESENT_BITS could eliminate this wrapper, and
we need to rework is_present_gpte() anyway, since it is also used outside of
paging_tmpl.h.
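
With the PT_GUEST_PRESENT_MASK sketched earlier, the helper would collapse
to the following, and the generic is_present_gpte() callers elsewhere could
be converted separately:

	static inline int FNAME(is_present_gpte)(unsigned long pte)
	{
		return pte & PT_GUEST_PRESENT_MASK;
	}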
> +static inline int FNAME(check_write_user_access)(struct kvm_vcpu *vcpu,
> + bool write_fault, bool user_fault,
> + unsigned long pte)
> +{
> +#if PTTYPE == PTTYPE_EPT
> + if (unlikely(write_fault && !(pte & VMX_EPT_WRITABLE_MASK)
> + && (user_fault || is_write_protection(vcpu))))
> + return false;
> + return true;
> +#else
> + return check_write_user_access(vcpu, write_fault, user_fault, pte);
> +#endif
> +}
> +
Ditto; check_write_user_access() needs the same rework.
> /*
> * Fetch a guest pte for a guest virtual address
> */
> @@ -155,7 +203,9 @@ static int FNAME(walk_addr_generic)(stru
> u16 errcode = 0;
>
> trace_kvm_mmu_pagetable_walk(addr, access);
> +#if PTTYPE != PTTYPE_EPT
> retry_walk:
> +#endif
> eperm = false;
> walker->level = mmu->root_level;
> pte = mmu->get_cr3(vcpu);
> @@ -202,7 +252,7 @@ retry_walk:
>
> trace_kvm_mmu_paging_element(pte, walker->level);
>
> - if (unlikely(!is_present_gpte(pte)))
> + if (unlikely(!FNAME(is_present_gpte)(pte)))
> goto error;
>
> if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
> @@ -211,13 +261,16 @@ retry_walk:
> goto error;
> }
>
> - if (!check_write_user_access(vcpu, write_fault, user_fault,
> - pte))
> + if (!FNAME(check_write_user_access)(vcpu, write_fault,
> + user_fault, pte))
> eperm = true;
>
> #if PTTYPE == 64
> if (unlikely(fetch_fault && (pte & PT64_NX_MASK)))
> eperm = true;
> +#elif PTTYPE == PTTYPE_EPT
> + if (unlikely(fetch_fault && !(pte & VMX_EPT_EXECUTABLE_MASK)))
> + eperm = true;
> #endif
>
> last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
> @@ -225,12 +278,15 @@ retry_walk:
> pte_access = pt_access &
> FNAME(gpte_access)(vcpu, pte, true);
> /* check if the kernel is fetching from user page */
> +#if PTTYPE != PTTYPE_EPT
> if (unlikely(pte_access & PT_USER_MASK) &&
> kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
> if (fetch_fault && !user_fault)
> eperm = true;
> +#endif
> }
>
> +#if PTTYPE != PTTYPE_EPT
> if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
> int ret;
> trace_kvm_mmu_set_accessed_bit(table_gfn, index,
> @@ -245,6 +301,7 @@ retry_walk:
> mark_page_dirty(vcpu->kvm, table_gfn);
> pte |= PT_ACCESSED_MASK;
> }
> +#endif
If A/D bits were supported, could these differences be removed?
>
> walker->ptes[walker->level - 1] = pte;
>
> @@ -283,6 +340,7 @@ retry_walk:
> goto error;
> }
>
> +#if PTTYPE != PTTYPE_EPT
> if (write_fault && unlikely(!is_dirty_gpte(pte))) {
> int ret;
>
> @@ -298,6 +356,7 @@ retry_walk:
> pte |= PT_DIRTY_MASK;
> walker->ptes[walker->level - 1] = pte;
> }
> +#endif
>
> walker->pt_access = pt_access;
> walker->pte_access = pte_access;
> @@ -328,6 +387,7 @@ static int FNAME(walk_addr)(struct guest
> access);
> }
>
> +#if PTTYPE != PTTYPE_EPT
> static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> struct kvm_vcpu *vcpu, gva_t addr,
> u32 access)
> @@ -335,6 +395,7 @@ static int FNAME(walk_addr_nested)(struc
> return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
> addr, access);
> }
> +#endif
>
Hmm, so you do not need the nested walking functions in the EPT case?
> static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> struct kvm_mmu_page *sp, u64 *spte,
> @@ -343,11 +404,13 @@ static bool FNAME(prefetch_invalid_gpte)
> if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> goto no_present;
>
> - if (!is_present_gpte(gpte))
> + if (!FNAME(is_present_gpte)(gpte))
> goto no_present;
>
> +#if PTTYPE != PTTYPE_EPT
> if (!(gpte & PT_ACCESSED_MASK))
> goto no_present;
> +#endif
>
> return false;
>
> @@ -458,6 +521,20 @@ static void FNAME(pte_prefetch)(struct k
> pfn, true, true);
> }
> }
> +static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
> +{
> + u64 spte;
> +
> + spte = __pa(sp->spt)
> +#if PTTYPE == PTTYPE_EPT
> + | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK
> + | VMX_EPT_EXECUTABLE_MASK;
> +#else
> + | PT_PRESENT_MASK | PT_ACCESSED_MASK
> + | PT_WRITABLE_MASK | PT_USER_MASK;
> +#endif
> + mmu_spte_set(sptep, spte);
> +}
>
> /*
> * Fetch a shadow pte for a specific level in the paging hierarchy.
> @@ -474,7 +551,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
> unsigned direct_access;
> struct kvm_shadow_walk_iterator it;
>
> - if (!is_present_gpte(gw->ptes[gw->level - 1]))
> + if (!FNAME(is_present_gpte)(gw->ptes[gw->level - 1]))
> return NULL;
>
> direct_access = gw->pte_access;
> @@ -514,7 +591,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
> goto out_gpte_changed;
>
> if (sp)
> - link_shadow_page(it.sptep, sp);
> + FNAME(link_shadow_page)(it.sptep, sp);
> }
>
> for (;
> @@ -534,10 +611,15 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>
> sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
> true, direct_access, it.sptep);
> - link_shadow_page(it.sptep, sp);
> + FNAME(link_shadow_page)(it.sptep, sp);
> }
>
> clear_sp_write_flooding_count(it.sptep);
> + /* TODO: Consider if everything that set_spte() does is correct when
> + the shadow page table is actually EPT. Most is fine (for direct_map)
> + but it appears there may be a few wrong corner cases with
> + PT_USER_MASK, PT64_NX_MASK, etc., and I need to review everything
> + */
Maybe it is OK, but you need to take care of the A/D bits: currently you do
not expose A/D bits to the guest, yet they may be supported on L0.
> mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
> user_fault, write_fault, emulate, it.level,
> gw->gfn, pfn, prefault, map_writable);
> @@ -733,6 +815,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kv
> return gpa;
> }
>
> +#if PTTYPE != PTTYPE_EPT
> static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> u32 access,
> struct x86_exception *exception)
> @@ -751,6 +834,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(st
>
> return gpa;
> }
> +#endif
Why is this one not needed?