* [PATCH 01/10] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
@ 2012-08-01 14:37 ` Nadav Har'El
2012-08-01 14:37 ` [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h Nadav Har'El
` (9 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:37 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577,
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (an L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER-switching feature.
To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).
Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.
Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.
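As a rough illustration (not part of the patch, and the helper name is
invented), an L1 hypervisor could detect the newly advertised control like
this; the high half of the capability MSR lists the allowed-1 bits, and when
L1 is our nested guest the rdmsr is intercepted and served from
nested_vmx_exit_ctls_high:

static bool l1_sees_efer_exit_control(void)
{
	u32 low, high;

	rdmsr(MSR_IA32_VMX_EXIT_CTLS, low, high);
	return high & VM_EXIT_LOAD_IA32_EFER;	/* allowed-1 => L1 may set it */
}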
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/kvm/vmx.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
--- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.000000000 +0300
@@ -1976,6 +1976,7 @@ static __init void nested_vmx_setup_ctls
#else
nested_vmx_exit_ctls_high = 0;
#endif
+ nested_vmx_exit_ctls_high |= VM_EXIT_LOAD_IA32_EFER;
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -1983,6 +1984,7 @@ static __init void nested_vmx_setup_ctls
nested_vmx_entry_ctls_low = 0;
nested_vmx_entry_ctls_high &=
VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
+ nested_vmx_entry_ctls_high |= VM_ENTRY_LOAD_IA32_EFER;
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
@@ -6768,10 +6770,18 @@ static void prepare_vmcs02(struct kvm_vc
vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
- /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
- vmcs_write32(VM_EXIT_CONTROLS,
- vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
- vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+ /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+ * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+ * bits are further modified by vmx_set_efer() below.
+ */
+ vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+ /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+ * emulated by vmx_set_efer(), below.
+ */
+ vmcs_write32(VM_ENTRY_CONTROLS,
+ (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+ ~VM_ENTRY_IA32E_MODE) |
(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
* [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
2012-08-01 14:37 ` [PATCH 01/10] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Nadav Har'El
@ 2012-08-01 14:37 ` Nadav Har'El
2012-08-02 4:00 ` Xiao Guangrong
2012-08-01 14:38 ` [PATCH 03/10] nEPT: MMU context for nested EPT Nadav Har'El
` (8 subsequent siblings)
10 siblings, 1 reply; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:37 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).
This patch adds EPT support to paging_tmpl.h.
paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.
There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.
So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with "EPT") which correctly read and write EPT tables.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/kvm/mmu.c | 14 +----
arch/x86/kvm/paging_tmpl.h | 98 ++++++++++++++++++++++++++++++++---
2 files changed, 96 insertions(+), 16 deletions(-)
--- .before/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.000000000 +0300
+++ .after/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.000000000 +0300
@@ -1971,15 +1971,6 @@ static void shadow_walk_next(struct kvm_
return __shadow_walk_next(iterator, *iterator->sptep);
}
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
-{
- u64 spte;
-
- spte = __pa(sp->spt)
- | PT_PRESENT_MASK | PT_ACCESSED_MASK
- | PT_WRITABLE_MASK | PT_USER_MASK;
- mmu_spte_set(sptep, spte);
-}
static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned direct_access)
@@ -3427,6 +3418,11 @@ static bool sync_mmio_spte(u64 *sptep, g
return false;
}
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
#define PTTYPE 64
#include "paging_tmpl.h"
#undef PTTYPE
--- .before/arch/x86/kvm/paging_tmpl.h 2012-08-01 17:22:46.000000000 +0300
+++ .after/arch/x86/kvm/paging_tmpl.h 2012-08-01 17:22:46.000000000 +0300
@@ -50,6 +50,22 @@
#define PT_LEVEL_BITS PT32_LEVEL_BITS
#define PT_MAX_FULL_LEVELS 2
#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+ #define pt_element_t u64
+ #define guest_walker guest_walkerEPT
+ #define FNAME(name) EPT_##name
+ #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+ #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+ #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+ #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define PT_LEVEL_BITS PT64_LEVEL_BITS
+ #ifdef CONFIG_X86_64
+ #define PT_MAX_FULL_LEVELS 4
+ #define CMPXCHG cmpxchg
+ #else
+ #define CMPXCHG cmpxchg64
+ #define PT_MAX_FULL_LEVELS 2
+ #endif
#else
#error Invalid PTTYPE value
#endif
@@ -78,6 +94,7 @@ static gfn_t gpte_to_gfn_lvl(pt_element_
return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
}
+#if PTTYPE != PTTYPE_EPT
static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
pt_element_t __user *ptep_user, unsigned index,
pt_element_t orig_pte, pt_element_t new_pte)
@@ -100,15 +117,22 @@ static int FNAME(cmpxchg_gpte)(struct kv
return (ret != orig_pte);
}
+#endif
static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
bool last)
{
unsigned access;
+#if PTTYPE == PTTYPE_EPT
+ /* We rely here that ACC_WRITE_MASK==VMX_EPT_WRITABLE_MASK */
+ access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
+ ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
+#else
access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
if (last && !is_dirty_gpte(gpte))
access &= ~ACC_WRITE_MASK;
+#endif
#if PTTYPE == 64
if (vcpu->arch.mmu.nx)
@@ -135,6 +159,30 @@ static bool FNAME(is_last_gpte)(struct g
return false;
}
+static inline int FNAME(is_present_gpte)(unsigned long pte)
+{
+#if PTTYPE == PTTYPE_EPT
+ return pte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+ VMX_EPT_EXECUTABLE_MASK);
+#else
+ return is_present_gpte(pte);
+#endif
+}
+
+static inline int FNAME(check_write_user_access)(struct kvm_vcpu *vcpu,
+ bool write_fault, bool user_fault,
+ unsigned long pte)
+{
+#if PTTYPE == PTTYPE_EPT
+ if (unlikely(write_fault && !(pte & VMX_EPT_WRITABLE_MASK)
+ && (user_fault || is_write_protection(vcpu))))
+ return false;
+ return true;
+#else
+ return check_write_user_access(vcpu, write_fault, user_fault, pte);
+#endif
+}
+
/*
* Fetch a guest pte for a guest virtual address
*/
@@ -155,7 +203,9 @@ static int FNAME(walk_addr_generic)(stru
u16 errcode = 0;
trace_kvm_mmu_pagetable_walk(addr, access);
+#if PTTYPE != PTTYPE_EPT
retry_walk:
+#endif
eperm = false;
walker->level = mmu->root_level;
pte = mmu->get_cr3(vcpu);
@@ -202,7 +252,7 @@ retry_walk:
trace_kvm_mmu_paging_element(pte, walker->level);
- if (unlikely(!is_present_gpte(pte)))
+ if (unlikely(!FNAME(is_present_gpte)(pte)))
goto error;
if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
@@ -211,13 +261,16 @@ retry_walk:
goto error;
}
- if (!check_write_user_access(vcpu, write_fault, user_fault,
- pte))
+ if (!FNAME(check_write_user_access)(vcpu, write_fault,
+ user_fault, pte))
eperm = true;
#if PTTYPE == 64
if (unlikely(fetch_fault && (pte & PT64_NX_MASK)))
eperm = true;
+#elif PTTYPE == PTTYPE_EPT
+ if (unlikely(fetch_fault && !(pte & VMX_EPT_EXECUTABLE_MASK)))
+ eperm = true;
#endif
last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
@@ -225,12 +278,15 @@ retry_walk:
pte_access = pt_access &
FNAME(gpte_access)(vcpu, pte, true);
/* check if the kernel is fetching from user page */
+#if PTTYPE != PTTYPE_EPT
if (unlikely(pte_access & PT_USER_MASK) &&
kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
if (fetch_fault && !user_fault)
eperm = true;
+#endif
}
+#if PTTYPE != PTTYPE_EPT
if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
int ret;
trace_kvm_mmu_set_accessed_bit(table_gfn, index,
@@ -245,6 +301,7 @@ retry_walk:
mark_page_dirty(vcpu->kvm, table_gfn);
pte |= PT_ACCESSED_MASK;
}
+#endif
walker->ptes[walker->level - 1] = pte;
@@ -283,6 +340,7 @@ retry_walk:
goto error;
}
+#if PTTYPE != PTTYPE_EPT
if (write_fault && unlikely(!is_dirty_gpte(pte))) {
int ret;
@@ -298,6 +356,7 @@ retry_walk:
pte |= PT_DIRTY_MASK;
walker->ptes[walker->level - 1] = pte;
}
+#endif
walker->pt_access = pt_access;
walker->pte_access = pte_access;
@@ -328,6 +387,7 @@ static int FNAME(walk_addr)(struct guest
access);
}
+#if PTTYPE != PTTYPE_EPT
static int FNAME(walk_addr_nested)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, gva_t addr,
u32 access)
@@ -335,6 +395,7 @@ static int FNAME(walk_addr_nested)(struc
return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
addr, access);
}
+#endif
static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp, u64 *spte,
@@ -343,11 +404,13 @@ static bool FNAME(prefetch_invalid_gpte)
if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
goto no_present;
- if (!is_present_gpte(gpte))
+ if (!FNAME(is_present_gpte)(gpte))
goto no_present;
+#if PTTYPE != PTTYPE_EPT
if (!(gpte & PT_ACCESSED_MASK))
goto no_present;
+#endif
return false;
@@ -458,6 +521,20 @@ static void FNAME(pte_prefetch)(struct k
pfn, true, true);
}
}
+static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
+{
+ u64 spte;
+
+ spte = __pa(sp->spt)
+#if PTTYPE == PTTYPE_EPT
+ | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK
+ | VMX_EPT_EXECUTABLE_MASK;
+#else
+ | PT_PRESENT_MASK | PT_ACCESSED_MASK
+ | PT_WRITABLE_MASK | PT_USER_MASK;
+#endif
+ mmu_spte_set(sptep, spte);
+}
/*
* Fetch a shadow pte for a specific level in the paging hierarchy.
@@ -474,7 +551,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
unsigned direct_access;
struct kvm_shadow_walk_iterator it;
- if (!is_present_gpte(gw->ptes[gw->level - 1]))
+ if (!FNAME(is_present_gpte)(gw->ptes[gw->level - 1]))
return NULL;
direct_access = gw->pte_access;
@@ -514,7 +591,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
goto out_gpte_changed;
if (sp)
- link_shadow_page(it.sptep, sp);
+ FNAME(link_shadow_page)(it.sptep, sp);
}
for (;
@@ -534,10 +611,15 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
true, direct_access, it.sptep);
- link_shadow_page(it.sptep, sp);
+ FNAME(link_shadow_page)(it.sptep, sp);
}
clear_sp_write_flooding_count(it.sptep);
+ /* TODO: Consider if everything that set_spte() does is correct when
+ the shadow page table is actually EPT. Most is fine (for direct_map)
+ but it appears there may be a few wrong corner cases with
+ PT_USER_MASK, PT64_NX_MASK, etc., and I need to review everything
+ */
mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
user_fault, write_fault, emulate, it.level,
gw->gfn, pfn, prefault, map_writable);
@@ -733,6 +815,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kv
return gpa;
}
+#if PTTYPE != PTTYPE_EPT
static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
u32 access,
struct x86_exception *exception)
@@ -751,6 +834,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(st
return gpa;
}
+#endif
/*
* Using the cached information from sp->gfns is safe because:
* Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h
2012-08-01 14:37 ` [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h Nadav Har'El
@ 2012-08-02 4:00 ` Xiao Guangrong
2012-08-02 21:25 ` Nadav Har'El
0 siblings, 1 reply; 15+ messages in thread
From: Xiao Guangrong @ 2012-08-02 4:00 UTC (permalink / raw)
To: Nadav Har'El
Cc: kvm, Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
On 08/01/2012 10:37 PM, Nadav Har'El wrote:
> This is the first patch in a series which adds nested EPT support to KVM's
> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> to set its own cr3 and take its own page faults without either of L0 or L1
> getting involved. This often significantly improves L2's performance over the
> previous two alternatives (shadow page tables over EPT, and shadow page
> tables over shadow page tables).
>
> This patch adds EPT support to paging_tmpl.h.
>
> paging_tmpl.h contains the code for reading and writing page tables. The code
> for 32-bit and 64-bit tables is very similar, but not identical, so
> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> with PTTYPE=64, and this generates the two sets of similar functions.
>
> There are subtle but important differences between the format of EPT tables
> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> third set of functions to read the guest EPT table and to write the shadow
> EPT table.
>
> So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> with "EPT") which correctly read and write EPT tables.
>
Now, paging_tmpl.h becomes really untidy and hard to read; maybe we need
to abstract the operations that depend on PTTYPE.
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> arch/x86/kvm/mmu.c | 14 +----
> arch/x86/kvm/paging_tmpl.h | 98 ++++++++++++++++++++++++++++++++---
> 2 files changed, 96 insertions(+), 16 deletions(-)
>
> --- .before/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.000000000 +0300
> +++ .after/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.000000000 +0300
> @@ -1971,15 +1971,6 @@ static void shadow_walk_next(struct kvm_
> return __shadow_walk_next(iterator, *iterator->sptep);
> }
>
> -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
> -{
> - u64 spte;
> -
> - spte = __pa(sp->spt)
> - | PT_PRESENT_MASK | PT_ACCESSED_MASK
> - | PT_WRITABLE_MASK | PT_USER_MASK;
> - mmu_spte_set(sptep, spte);
> -}
>
> static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> unsigned direct_access)
> @@ -3427,6 +3418,11 @@ static bool sync_mmio_spte(u64 *sptep, g
> return false;
> }
>
> +#define PTTYPE_EPT 18 /* arbitrary */
> +#define PTTYPE PTTYPE_EPT
> +#include "paging_tmpl.h"
> +#undef PTTYPE
> +
> #define PTTYPE 64
> #include "paging_tmpl.h"
> #undef PTTYPE
> --- .before/arch/x86/kvm/paging_tmpl.h 2012-08-01 17:22:46.000000000 +0300
> +++ .after/arch/x86/kvm/paging_tmpl.h 2012-08-01 17:22:46.000000000 +0300
> @@ -50,6 +50,22 @@
> #define PT_LEVEL_BITS PT32_LEVEL_BITS
> #define PT_MAX_FULL_LEVELS 2
> #define CMPXCHG cmpxchg
> +#elif PTTYPE == PTTYPE_EPT
> + #define pt_element_t u64
> + #define guest_walker guest_walkerEPT
> + #define FNAME(name) EPT_##name
> + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> + #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> + #define PT_LEVEL_BITS PT64_LEVEL_BITS
> + #ifdef CONFIG_X86_64
> + #define PT_MAX_FULL_LEVELS 4
> + #define CMPXCHG cmpxchg
> + #else
> + #define CMPXCHG cmpxchg64
> + #define PT_MAX_FULL_LEVELS 2
> + #endif
Missing the case of FULL_LEVELS == 3? Oh, you mentioned it
as the PAE case in PATCH 0.
> #else
> #error Invalid PTTYPE value
> #endif
> @@ -78,6 +94,7 @@ static gfn_t gpte_to_gfn_lvl(pt_element_
> return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> }
>
> +#if PTTYPE != PTTYPE_EPT
> static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> pt_element_t __user *ptep_user, unsigned index,
> pt_element_t orig_pte, pt_element_t new_pte)
> @@ -100,15 +117,22 @@ static int FNAME(cmpxchg_gpte)(struct kv
>
> return (ret != orig_pte);
> }
> +#endif
>
Note that A/D bits are supported on new Intel CPUs, so this function should be
reworked for nEPT. I know you did not export this feature to the guest, but we
can reduce the difference between nEPT and the other MMU models if A/D bits
are supported.
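A possible shape for that rework, as a sketch only: the bit positions below
are an assumption taken from the SDM's EPT accessed/dirty flags (bits 8 and 9,
meaningful only when the EPTP enables A/D), the helper name is hypothetical,
and real code would go through a user mapping the way FNAME(cmpxchg_gpte)
does:

#define EPT_ACCESSED_MASK	(1ull << 8)	/* assumed: SDM EPT 'A' flag */
#define EPT_DIRTY_MASK		(1ull << 9)	/* assumed: SDM EPT 'D' flag */

static bool ept_update_ad_bits(u64 *gptep, bool write_fault)
{
	u64 old = *gptep;
	u64 new = old | EPT_ACCESSED_MASK | (write_fault ? EPT_DIRTY_MASK : 0);

	if (new == old)
		return true;		/* bits already set, nothing to do */
	/* lost race => the caller retries the walk, as with cmpxchg_gpte() */
	return cmpxchg64(gptep, old, new) == old;
}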
> static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
> bool last)
> {
> unsigned access;
>
> +#if PTTYPE == PTTYPE_EPT
> + /* We rely here that ACC_WRITE_MASK==VMX_EPT_WRITABLE_MASK */
> + access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
> + ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
> +#else
> access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> if (last && !is_dirty_gpte(gpte))
> access &= ~ACC_WRITE_MASK;
> +#endif
>
Maybe we can introduce PT_xxx_MASK macros to abstract the access bits.
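Something along these lines, for example (hypothetical macro names, defined
per PTTYPE next to the other PT_* defines at the top of paging_tmpl.h, so that
gpte_access() and check_write_user_access() can test one mask instead of
carrying #ifdefs):

#if PTTYPE == PTTYPE_EPT
 #define PT_GUEST_WRITABLE_MASK	VMX_EPT_WRITABLE_MASK
 #define PT_GUEST_USER_MASK	0	/* EPT has no user/supervisor bit */
#else
 #define PT_GUEST_WRITABLE_MASK	PT_WRITABLE_MASK
 #define PT_GUEST_USER_MASK	PT_USER_MASK
#endif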
> #if PTTYPE == 64
> if (vcpu->arch.mmu.nx)
> @@ -135,6 +159,30 @@ static bool FNAME(is_last_gpte)(struct g
> return false;
> }
>
> +static inline int FNAME(is_present_gpte)(unsigned long pte)
> +{
> +#if PTTYPE == PTTYPE_EPT
> + return pte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> + VMX_EPT_EXECUTABLE_MASK);
> +#else
> + return is_present_gpte(pte);
> +#endif
> +}
> +
Introducing PT_PRESENT_BITS could eliminate the dependence, and we
need to rework is_present_gpte() since it is used outside of paging_tmpl.h.
> +static inline int FNAME(check_write_user_access)(struct kvm_vcpu *vcpu,
> + bool write_fault, bool user_fault,
> + unsigned long pte)
> +{
> +#if PTTYPE == PTTYPE_EPT
> + if (unlikely(write_fault && !(pte & VMX_EPT_WRITABLE_MASK)
> + && (user_fault || is_write_protection(vcpu))))
> + return false;
> + return true;
> +#else
> + return check_write_user_access(vcpu, write_fault, user_fault, pte);
> +#endif
> +}
> +
Ditto, need to rework check_write_user_access.
> /*
> * Fetch a guest pte for a guest virtual address
> */
> @@ -155,7 +203,9 @@ static int FNAME(walk_addr_generic)(stru
> u16 errcode = 0;
>
> trace_kvm_mmu_pagetable_walk(addr, access);
> +#if PTTYPE != PTTYPE_EPT
> retry_walk:
> +#endif
> eperm = false;
> walker->level = mmu->root_level;
> pte = mmu->get_cr3(vcpu);
> @@ -202,7 +252,7 @@ retry_walk:
>
> trace_kvm_mmu_paging_element(pte, walker->level);
>
> - if (unlikely(!is_present_gpte(pte)))
> + if (unlikely(!FNAME(is_present_gpte)(pte)))
> goto error;
>
> if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
> @@ -211,13 +261,16 @@ retry_walk:
> goto error;
> }
>
> - if (!check_write_user_access(vcpu, write_fault, user_fault,
> - pte))
> + if (!FNAME(check_write_user_access)(vcpu, write_fault,
> + user_fault, pte))
> eperm = true;
>
> #if PTTYPE == 64
> if (unlikely(fetch_fault && (pte & PT64_NX_MASK)))
> eperm = true;
> +#elif PTTYPE == PTTYPE_EPT
> + if (unlikely(fetch_fault && !(pte & VMX_EPT_EXECUTABLE_MASK)))
> + eperm = true;
> #endif
>
> last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
> @@ -225,12 +278,15 @@ retry_walk:
> pte_access = pt_access &
> FNAME(gpte_access)(vcpu, pte, true);
> /* check if the kernel is fetching from user page */
> +#if PTTYPE != PTTYPE_EPT
> if (unlikely(pte_access & PT_USER_MASK) &&
> kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
> if (fetch_fault && !user_fault)
> eperm = true;
> +#endif
> }
>
> +#if PTTYPE != PTTYPE_EPT
> if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
> int ret;
> trace_kvm_mmu_set_accessed_bit(table_gfn, index,
> @@ -245,6 +301,7 @@ retry_walk:
> mark_page_dirty(vcpu->kvm, table_gfn);
> pte |= PT_ACCESSED_MASK;
> }
> +#endif
If A/D bits are supported, can these differences be removed?
>
> walker->ptes[walker->level - 1] = pte;
>
> @@ -283,6 +340,7 @@ retry_walk:
> goto error;
> }
>
> +#if PTTYPE != PTTYPE_EPT
> if (write_fault && unlikely(!is_dirty_gpte(pte))) {
> int ret;
>
> @@ -298,6 +356,7 @@ retry_walk:
> pte |= PT_DIRTY_MASK;
> walker->ptes[walker->level - 1] = pte;
> }
> +#endif
>
> walker->pt_access = pt_access;
> walker->pte_access = pte_access;
> @@ -328,6 +387,7 @@ static int FNAME(walk_addr)(struct guest
> access);
> }
>
> +#if PTTYPE != PTTYPE_EPT
> static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> struct kvm_vcpu *vcpu, gva_t addr,
> u32 access)
> @@ -335,6 +395,7 @@ static int FNAME(walk_addr_nested)(struc
> return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
> addr, access);
> }
> +#endif
>
Hmm, you do not need the special walking functions?
> static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
> struct kvm_mmu_page *sp, u64 *spte,
> @@ -343,11 +404,13 @@ static bool FNAME(prefetch_invalid_gpte)
> if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
> goto no_present;
>
> - if (!is_present_gpte(gpte))
> + if (!FNAME(is_present_gpte)(gpte))
> goto no_present;
>
> +#if PTTYPE != PTTYPE_EPT
> if (!(gpte & PT_ACCESSED_MASK))
> goto no_present;
> +#endif
>
> return false;
>
> @@ -458,6 +521,20 @@ static void FNAME(pte_prefetch)(struct k
> pfn, true, true);
> }
> }
> +static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
> +{
> + u64 spte;
> +
> + spte = __pa(sp->spt)
> +#if PTTYPE == PTTYPE_EPT
> + | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK
> + | VMX_EPT_EXECUTABLE_MASK;
> +#else
> + | PT_PRESENT_MASK | PT_ACCESSED_MASK
> + | PT_WRITABLE_MASK | PT_USER_MASK;
> +#endif
> + mmu_spte_set(sptep, spte);
> +}
>
> /*
> * Fetch a shadow pte for a specific level in the paging hierarchy.
> @@ -474,7 +551,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
> unsigned direct_access;
> struct kvm_shadow_walk_iterator it;
>
> - if (!is_present_gpte(gw->ptes[gw->level - 1]))
> + if (!FNAME(is_present_gpte)(gw->ptes[gw->level - 1]))
> return NULL;
>
> direct_access = gw->pte_access;
> @@ -514,7 +591,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
> goto out_gpte_changed;
>
> if (sp)
> - link_shadow_page(it.sptep, sp);
> + FNAME(link_shadow_page)(it.sptep, sp);
> }
>
> for (;
> @@ -534,10 +611,15 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>
> sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
> true, direct_access, it.sptep);
> - link_shadow_page(it.sptep, sp);
> + FNAME(link_shadow_page)(it.sptep, sp);
> }
>
> clear_sp_write_flooding_count(it.sptep);
> + /* TODO: Consider if everything that set_spte() does is correct when
> + the shadow page table is actually EPT. Most is fine (for direct_map)
> + but it appears there may be a few wrong corner cases with
> + PT_USER_MASK, PT64_NX_MASK, etc., and I need to review everything
> + */
Maybe it is OK. But you need to take care of A/D bits (currently you do not
export A/D bits to the guest; however, they may be supported on L0).
> mmu_set_spte(vcpu, it.sptep, access, gw->pte_access,
> user_fault, write_fault, emulate, it.level,
> gw->gfn, pfn, prefault, map_writable);
> @@ -733,6 +815,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kv
> return gpa;
> }
>
> +#if PTTYPE != PTTYPE_EPT
> static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> u32 access,
> struct x86_exception *exception)
> @@ -751,6 +834,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(st
>
> return gpa;
> }
> +#endif
Why is it not needed?
* Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h
2012-08-02 4:00 ` Xiao Guangrong
@ 2012-08-02 21:25 ` Nadav Har'El
2012-08-03 8:08 ` Xiao Guangrong
0 siblings, 1 reply; 15+ messages in thread
From: Nadav Har'El @ 2012-08-02 21:25 UTC (permalink / raw)
To: Xiao Guangrong
Cc: kvm, Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
On Thu, Aug 02, 2012, Xiao Guangrong wrote about "Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h":
> > + #ifdef CONFIG_X86_64
> > + #define PT_MAX_FULL_LEVELS 4
> > + #define CMPXCHG cmpxchg
> > + #else
> > + #define CMPXCHG cmpxchg64
> > + #define PT_MAX_FULL_LEVELS 2
> > + #endif
>
> Missing the case of FULL_LEVELS == 3? Oh, you mentioned it
> as the PAE case in PATCH 0.
I understood this differently (and it would not be surprising if
wrongly...): With nested EPT, we only deal with two *EPT* tables -
the shadowed page table and shadow page table are both EPT.
And EPT tables cannot have three levels - even if PAE is used. Or at least,
that's what I thought...
> Note A/D bits are supported on new intel cpus, this function should be reworked
> for nept. I know you did not export this feather to guest, but we can reduce
> the difference between nept and other mmu models if A/D are supported.
I'm not sure what you meant: If the access/dirty bits are supported in
newer CPUs, do you think we *should* support them also in the L1 processor,
or are you saying that it would be easier to support them
because this is what the shadow page table code normally does anyway,
so *not* supporting them will take effort?
> > +#if PTTYPE != PTTYPE_EPT
> > static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> > struct kvm_vcpu *vcpu, gva_t addr,
> > u32 access)
> > @@ -335,6 +395,7 @@ static int FNAME(walk_addr_nested)(struc
> > return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
> > addr, access);
> > }
> > +#endif
> >
>
> Hmm, you do not need the special walking functions?
Since these functions are static, the compiler warns me on every
function that is never used, so I had to #if them out...
--
Nadav Har'El | Thursday, Aug 2 2012, 15 Av 5772
nyh@math.technion.ac.il |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |It's fortunate I have bad luck - without
http://nadav.harel.org.il |it I would have no luck at all!
* Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h
2012-08-02 21:25 ` Nadav Har'El
@ 2012-08-03 8:08 ` Xiao Guangrong
0 siblings, 0 replies; 15+ messages in thread
From: Xiao Guangrong @ 2012-08-03 8:08 UTC (permalink / raw)
To: Nadav Har'El
Cc: kvm, Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
On 08/03/2012 05:25 AM, Nadav Har'El wrote:
> On Thu, Aug 02, 2012, Xiao Guangrong wrote about "Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h":
>>> + #ifdef CONFIG_X86_64
>>> + #define PT_MAX_FULL_LEVELS 4
>>> + #define CMPXCHG cmpxchg
>>> + #else
>>> + #define CMPXCHG cmpxchg64
>>> + #define PT_MAX_FULL_LEVELS 2
>>> + #endif
>>
>> Missing the case of FULL_LEVELS == 3? Oh, you mentioned it
>> as the PAE case in PATCH 0.
>
> I understood this differently (and it would not be surprising if
> wrongly...): With nested EPT, we only deal with two *EPT* tables -
> the shadowed page table and shadow page table are both EPT.
> And EPT tables cannot have three levels - even if PAE is used. Or at least,
> that's what I thought...
>
>> Note A/D bits are supported on new intel cpus, this function should be reworked
>> for nept. I know you did not export this feather to guest, but we can reduce
>> the difference between nept and other mmu models if A/D are supported.
>
> I'm not sure what you meant: If the access/dirty bits are supported in
> newer CPUs, do you think we *should* support them also in the L1 processor,
> or are you saying that it would be easier to support them
> because this is what the shadow page table code normally does anyway,
> so *not* supporting them will take effort?
I mean "it would be easier to support them
because this is what the shadow page table code normally does anyway,
so *not* supporting them will take effort" :)
Then, we can drop the "#if PTTYPE != PTTYPE_EPT" blocks...
Actually, we can redefine some bits (like PRESENT, WRITABLE, DIRTY...) to
let the paging_tmpl code work for all models.
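For instance, a sketch of that direction (hypothetical macro names; a zero
mask makes the A/D updates in walk_addr_generic() compile away, so the
"#if PTTYPE != PTTYPE_EPT" blocks are no longer needed):

#if PTTYPE == PTTYPE_EPT
 #define PT_GUEST_ACCESSED_MASK	0	/* until EPT A/D is exposed to L1 */
 #define PT_GUEST_DIRTY_MASK	0
#else
 #define PT_GUEST_ACCESSED_MASK	PT_ACCESSED_MASK
 #define PT_GUEST_DIRTY_MASK	PT_DIRTY_MASK
#endif

/* in walk_addr_generic(), instead of the #ifdef around the accessed-bit update: */
if (PT_GUEST_ACCESSED_MASK &&
    !eperm && unlikely(!(pte & PT_GUEST_ACCESSED_MASK))) {
	/* ... set the accessed bit exactly as today ... */
}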
>
>>> +#if PTTYPE != PTTYPE_EPT
>>> static int FNAME(walk_addr_nested)(struct guest_walker *walker,
>>> struct kvm_vcpu *vcpu, gva_t addr,
>>> u32 access)
>>> @@ -335,6 +395,7 @@ static int FNAME(walk_addr_nested)(struc
>>> return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
>>> addr, access);
>>> }
>>> +#endif
>>>
>>
>> Hmm, you do not need the special walking functions?
>
> Since these functions are static, the compiler warns me on every
> function that is never used, so I had to #if them out...
>
>
IIUC, you did not implement the functions (like walk_addr_nested) that
translate an L2 VA to an L2 PA, yes? (That is needed for emulation.)
* [PATCH 03/10] nEPT: MMU context for nested EPT
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
2012-08-01 14:37 ` [PATCH 01/10] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Nadav Har'El
2012-08-01 14:37 ` [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h Nadav Har'El
@ 2012-08-01 14:38 ` Nadav Har'El
2012-08-01 14:38 ` [PATCH 04/10] nEPT: Fix cr3 handling in nested exit and entry Nadav Har'El
` (7 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:38 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).
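In condensed form (taken from the vmx.c hunks below), the switch points are:

/* entering L2 (prepare_vmcs02), when vmcs12 enables EPT: */
if (nested_cpu_has_ept(vmcs12)) {
	kvm_mmu_unload(vcpu);
	nested_ept_init_mmu_context(vcpu);	/* arch.mmu gets the EPT_* ops,
						   walk_mmu = &nested_mmu */
}

/* returning to L1 (load_vmcs12_host_state): */
if (nested_cpu_has_ept(vmcs12))
	nested_ept_uninit_mmu_context(vcpu);	/* walk_mmu = &arch.mmu again */
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);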
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/kvm/mmu.c | 38 +++++++++++++++++++++++++++++++
arch/x86/kvm/mmu.h | 1
arch/x86/kvm/vmx.c | 52 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 91 insertions(+)
--- .before/arch/x86/kvm/mmu.h 2012-08-01 17:22:46.000000000 +0300
+++ .after/arch/x86/kvm/mmu.h 2012-08-01 17:22:46.000000000 +0300
@@ -52,6 +52,7 @@ int kvm_mmu_get_spte_hierarchy(struct kv
void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
{
--- .before/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.000000000 +0300
+++ .after/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.000000000 +0300
@@ -3616,6 +3616,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
+{
+ ASSERT(vcpu);
+ ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+ context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+ context->nx = is_nx(vcpu); /* TODO: ? */
+ context->new_cr3 = paging_new_cr3;
+ context->page_fault = EPT_page_fault;
+ context->gva_to_gpa = EPT_gva_to_gpa;
+ context->sync_page = EPT_sync_page;
+ context->invlpg = EPT_invlpg;
+ context->update_pte = EPT_update_pte;
+ context->free = paging_free;
+ context->root_level = context->shadow_root_level;
+ context->root_hpa = INVALID_PAGE;
+ context->direct_map = false;
+
+ /* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
+ something different.
+ */
+ reset_rsvds_bits_mask(vcpu, context);
+
+
+ /* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
+ they are done, or why they write to vcpu->arch.mmu and not context
+ */
+ vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
+ vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu);
+ vcpu->arch.mmu.base_role.smep_andnot_wp =
+ kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
+ !is_write_protection(vcpu);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
+
static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
{
int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
--- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.000000000 +0300
@@ -901,6 +901,11 @@ static inline bool nested_cpu_has_virtua
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
}
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+ return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
static inline bool is_exception(u32 intr_info)
{
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -6591,6 +6596,46 @@ static void vmx_set_supported_cpuid(u32
entry->ecx |= bit(X86_FEATURE_VMX);
}
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+ /* return the page table to be shadowed - in our case, EPT12 */
+ return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+ struct x86_exception *fault)
+{
+ struct vmcs12 *vmcs12;
+ nested_vmx_vmexit(vcpu);
+ vmcs12 = get_vmcs12(vcpu);
+ /*
+ * Note no need to set vmcs12->vm_exit_reason as it is already copied
+ * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+ */
+ vmcs12->exit_qualification = fault->error_code;
+ vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+ int r = kvm_init_shadow_EPT_mmu(vcpu, &vcpu->arch.mmu);
+
+ vcpu->arch.mmu.set_cr3 = vmx_set_cr3;
+ vcpu->arch.mmu.get_cr3 = nested_ept_get_cr3;
+ vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+ vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
+
+ return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
/*
* prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
* L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -6808,6 +6853,11 @@ static void prepare_vmcs02(struct kvm_vc
vmx_flush_tlb(vcpu);
}
+ if (nested_cpu_has_ept(vmcs12)) {
+ kvm_mmu_unload(vcpu);
+ nested_ept_init_mmu_context(vcpu);
+ }
+
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
vcpu->arch.efer = vmcs12->guest_ia32_efer;
if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -7138,6 +7188,8 @@ void load_vmcs12_host_state(struct kvm_v
vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
kvm_set_cr4(vcpu, vmcs12->host_cr4);
+ if (nested_cpu_has_ept(vmcs12))
+ nested_ept_uninit_mmu_context(vcpu);
/* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);
* [PATCH 04/10] nEPT: Fix cr3 handling in nested exit and entry
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
` (2 preceding siblings ...)
2012-08-01 14:38 ` [PATCH 03/10] nEPT: MMU context for nested EPT Nadav Har'El
@ 2012-08-01 14:38 ` Nadav Har'El
2012-08-01 14:39 ` [PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3 Nadav Har'El
` (6 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:38 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:
If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.
If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/kvm/vmx.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
--- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.000000000 +0300
@@ -6885,6 +6885,17 @@ static void prepare_vmcs02(struct kvm_vc
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
+ /*
+ * Additionally, except when L0 is using shadow page tables, L1 or
+ * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+ */
+ if (enable_ept) {
+ vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+ vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+ vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+ vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+ }
+
kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
}
@@ -7116,6 +7127,25 @@ void prepare_vmcs12(struct kvm_vcpu *vcp
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+ /*
+ * In some cases (usually, nested EPT), L2 is allowed to change its
+ * own CR3 without exiting. If it has changed it, we must keep it.
+ * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+ * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+ */
+ if (enable_ept)
+ vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+ /*
+ * Additionally, except when L0 is using shadow page tables, L1 or
+ * L2 control guest_cr3 for L2, so save their PDPTEs
+ */
+ if (enable_ept) {
+ vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+ vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+ vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+ vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+ }
+
/* TODO: These cannot have changed unless we have MSR bitmaps and
* the relevant bit asks not to trap the change */
vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
* [PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
` (3 preceding siblings ...)
2012-08-01 14:38 ` [PATCH 04/10] nEPT: Fix cr3 handling in nested exit and entry Nadav Har'El
@ 2012-08-01 14:39 ` Nadav Har'El
2012-08-01 14:39 ` [PATCH 06/10] nEPT: Some additional comments Nadav Har'El
` (5 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:39 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.
As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.
Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/kvm/x86.c | 11 -----------
1 file changed, 11 deletions(-)
--- .before/arch/x86/kvm/x86.c 2012-08-01 17:22:47.000000000 +0300
+++ .after/arch/x86/kvm/x86.c 2012-08-01 17:22:47.000000000 +0300
@@ -659,17 +659,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, u
*/
}
- /*
- * Does the new cr3 value map to physical memory? (Note, we
- * catch an invalid cr3 even in real-mode, because it would
- * cause trouble later on when we turn on paging anyway.)
- *
- * A real CPU would silently accept an invalid cr3 and would
- * attempt to use it - with largely undefined (and often hard
- * to debug) behavior on the guest side.
- */
- if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
- return 1;
vcpu->arch.cr3 = cr3;
__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
vcpu->arch.mmu.new_cr3(vcpu);
* [PATCH 06/10] nEPT: Some additional comments
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
` (4 preceding siblings ...)
2012-08-01 14:39 ` [PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3 Nadav Har'El
@ 2012-08-01 14:39 ` Nadav Har'El
2012-08-01 14:40 ` [PATCH 07/10] nEPT: Advertise EPT to L1 Nadav Har'El
` (4 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:39 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/kvm/vmx.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
--- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.000000000 +0300
@@ -5952,7 +5952,20 @@ static bool nested_vmx_exit_handled(stru
return nested_cpu_has2(vmcs12,
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
case EXIT_REASON_EPT_VIOLATION:
+ /*
+ * L0 always deals with the EPT violation. If nested EPT is
+ * used, and the nested mmu code discovers that the address is
+ * missing in the guest EPT table (EPT12), the EPT violation
+ * will be injected with nested_ept_inject_page_fault()
+ */
+ return 0;
case EXIT_REASON_EPT_MISCONFIG:
+ /*
+ * L2 never uses directly L1's EPT, but rather L0's own EPT
+ * table (shadow on EPT) or a merged EPT table that L0 built
+ * (EPT on EPT). So any problems with the structure of the
+ * table is L0's fault.
+ */
return 0;
case EXIT_REASON_WBINVD:
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
@@ -6881,7 +6894,12 @@ static void prepare_vmcs02(struct kvm_vc
vmx_set_cr4(vcpu, vmcs12->guest_cr4);
vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
- /* shadow page tables on either EPT or shadow page tables */
+ /*
+ * Note that kvm_set_cr3() and kvm_mmu_reset_context() will do the
+ * right thing, and set GUEST_CR3 and/or EPT_POINTER in all supported
+ * settings: 1. shadow page tables on shadow page tables, 2. shadow
+ * page tables on EPT, 3. EPT on EPT.
+ */
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
@@ -7220,7 +7238,6 @@ void load_vmcs12_host_state(struct kvm_v
if (nested_cpu_has_ept(vmcs12))
nested_ept_uninit_mmu_context(vcpu);
- /* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);
* [PATCH 07/10] nEPT: Advertise EPT to L1
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
` (5 preceding siblings ...)
2012-08-01 14:39 ` [PATCH 06/10] nEPT: Some additional comments Nadav Har'El
@ 2012-08-01 14:40 ` Nadav Har'El
2012-08-01 14:40 ` [PATCH 08/10] nEPT: Nested INVEPT Nadav Har'El
` (3 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:40 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
Advertise the support of EPT to the L1 guest, through the appropriate MSR.
This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.
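For illustration only (the helper name is invented), an L1 KVM probing for the
feature after this patch would see something like:

static bool l1_sees_nested_ept(void)
{
	u32 ept_caps, vpid_caps;

	/* intercepted by L0; vmx_get_vmx_msr() returns nested_vmx_ept_caps
	 * in the low half (the vpid half stays 0 for now) */
	rdmsr(MSR_IA32_VMX_EPT_VPID_CAP, ept_caps, vpid_caps);
	return ept_caps & VMX_EPT_PAGE_WALK_4_BIT;	/* 4-level EPT walks */
}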
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/kvm/vmx.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
--- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.000000000 +0300
@@ -1946,6 +1946,7 @@ static u32 nested_vmx_secondary_ctls_low
static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static u32 nested_vmx_ept_caps;
static __init void nested_vmx_setup_ctls_msrs(void)
{
/*
@@ -2021,6 +2022,14 @@ static __init void nested_vmx_setup_ctls
nested_vmx_secondary_ctls_low = 0;
nested_vmx_secondary_ctls_high &=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+ if (enable_ept) {
+ /* nested EPT: emulate EPT also to L1 */
+ nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+ nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+ nested_vmx_ept_caps &= vmx_capability.ept;
+ } else
+ nested_vmx_ept_caps = 0;
+
}
static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2120,8 +2129,8 @@ static int vmx_get_vmx_msr(struct kvm_vc
nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
- /* Currently, no nested ept or nested vpid */
- *pdata = 0;
+ /* Currently, no nested vpid support */
+ *pdata = nested_vmx_ept_caps;
break;
default:
return 0;
* [PATCH 08/10] nEPT: Nested INVEPT
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
` (6 preceding siblings ...)
2012-08-01 14:40 ` [PATCH 07/10] nEPT: Advertise EPT to L1 Nadav Har'El
@ 2012-08-01 14:40 ` Nadav Har'El
2012-08-01 14:41 ` [PATCH 09/10] nEPT: Documentation Nadav Har'El
` (2 subsequent siblings)
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:40 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
If we let L1 use EPT, we should probably also support the INVEPT instruction.
In our current nested EPT implementation, when L1 changes its EPT table for
L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course
of this modification already calls INVEPT. Therefore, when L1 calls INVEPT,
we don't really need to do anything. In particular we *don't* need to call
the real INVEPT again. All we do in our INVEPT is verify the validity of the
call, and its parameters, and then do nothing.
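For reference, a rough sketch of what L1 executes to reach this handler (the
.byte sequence is the INVEPT opcode, 66 0F 38 80 /r, and the descriptor layout
matches the "operand" struct read below; this is roughly what KVM's own
__invept() helper does, but it is only an illustration):

struct invept_desc {
	u64 eptp, gpa;			/* gpa only used for the per-address type */
};

static inline void l1_invept_global(void)
{
	struct invept_desc desc = { 0, 0 };	/* ignored for the global type */
	unsigned long type = VMX_EPT_EXTENT_GLOBAL;

	asm volatile (".byte 0x66, 0x0f, 0x38, 0x80, 0x08"	/* invept (%rax), %rcx */
		      : : "a" (&desc), "c" (type) : "cc", "memory");
}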
In KVM Forum 2010, Dong et al. presented "Nested Virtualization Friendly KVM"
and classified our current nested EPT implementation as "shadow-like virtual
EPT". He recommended instead a different approach, which he called "VTLB-like
virtual EPT". If we had taken that alternative approach, INVEPT would have had
a bigger role: L0 would only rebuild the shadow EPT table when L1 calls INVEPT.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/include/asm/vmx.h | 2
arch/x86/kvm/vmx.c | 87 +++++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+)
--- .before/arch/x86/include/asm/vmx.h 2012-08-01 17:22:47.000000000 +0300
+++ .after/arch/x86/include/asm/vmx.h 2012-08-01 17:22:47.000000000 +0300
@@ -280,6 +280,7 @@ enum vmcs_field {
#define EXIT_REASON_APIC_ACCESS 44
#define EXIT_REASON_EPT_VIOLATION 48
#define EXIT_REASON_EPT_MISCONFIG 49
+#define EXIT_REASON_INVEPT 50
#define EXIT_REASON_WBINVD 54
#define EXIT_REASON_XSETBV 55
#define EXIT_REASON_INVPCID 58
@@ -406,6 +407,7 @@ enum vmcs_field {
#define VMX_EPTP_WB_BIT (1ull << 14)
#define VMX_EPT_2MB_PAGE_BIT (1ull << 16)
#define VMX_EPT_1GB_PAGE_BIT (1ull << 17)
+#define VMX_EPT_INVEPT_BIT (1ull << 20)
#define VMX_EPT_AD_BIT (1ull << 21)
#define VMX_EPT_EXTENT_INDIVIDUAL_BIT (1ull << 24)
#define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25)
--- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.000000000 +0300
@@ -2026,6 +2026,10 @@ static __init void nested_vmx_setup_ctls
/* nested EPT: emulate EPT also to L1 */
nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+ nested_vmx_ept_caps |=
+ VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+ VMX_EPT_EXTENT_CONTEXT_BIT |
+ VMX_EPT_EXTENT_INDIVIDUAL_BIT;
nested_vmx_ept_caps &= vmx_capability.ept;
} else
nested_vmx_ept_caps = 0;
@@ -5702,6 +5706,87 @@ static int handle_vmptrst(struct kvm_vcp
return 1;
}
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+ u32 vmx_instruction_info;
+ unsigned long type;
+ gva_t gva;
+ struct x86_exception e;
+ struct {
+ u64 eptp, gpa;
+ } operand;
+
+ if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+ !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ if (!nested_vmx_check_permission(vcpu))
+ return 1;
+
+ if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ /* According to the Intel VMX instruction reference, the memory
+ * operand is read even if it isn't needed (e.g., for type==global)
+ */
+ vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+ if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+ vmx_instruction_info, &gva))
+ return 1;
+ if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+ sizeof(operand), &e)) {
+ kvm_inject_page_fault(vcpu, &e);
+ return 1;
+ }
+
+ type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+ switch (type) {
+ case VMX_EPT_EXTENT_GLOBAL:
+ if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ else {
+ /*
+ * Do nothing: when L1 changes EPT12, we already
+ * update EPT02 (the shadow EPT table) and call INVEPT.
+ * So when L1 calls INVEPT, there's nothing left to do.
+ */
+ nested_vmx_succeed(vcpu);
+ }
+ break;
+ case VMX_EPT_EXTENT_CONTEXT:
+ if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ else {
+ /* Do nothing */
+ nested_vmx_succeed(vcpu);
+ }
+ break;
+ case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
+ if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ else {
+ /* Do nothing */
+ nested_vmx_succeed(vcpu);
+ }
+ break;
+ default:
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ }
+
+ skip_emulated_instruction(vcpu);
+ return 1;
+}
+
/*
* The exit handlers return 1 if the exit was handled fully and guest execution
* may resume. Otherwise they set the kvm_run parameter to indicate what needs
@@ -5744,6 +5829,7 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_PAUSE_INSTRUCTION] = handle_pause,
[EXIT_REASON_MWAIT_INSTRUCTION] = handle_invalid_op,
[EXIT_REASON_MONITOR_INSTRUCTION] = handle_invalid_op,
+ [EXIT_REASON_INVEPT] = handle_invept,
};
static const int kvm_vmx_max_exit_handlers =
@@ -5928,6 +6014,7 @@ static bool nested_vmx_exit_handled(stru
case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+ case EXIT_REASON_INVEPT:
/*
* VMX instructions trap unconditionally. This allows L1 to
* emulate them for its L2 guest, i.e., allows 3-level nesting!
* [PATCH 09/10] nEPT: Documentation
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
` (7 preceding siblings ...)
2012-08-01 14:40 ` [PATCH 08/10] nEPT: Nested INVEPT Nadav Har'El
@ 2012-08-01 14:41 ` Nadav Har'El
2012-08-01 14:41 ` [PATCH 10/10] nEPT: Miscellaneous cleanups Nadav Har'El
2012-08-01 15:07 ` [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Avi Kivity
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:41 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
Update the documentation to no longer say that nested EPT is not supported.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
Documentation/virtual/kvm/nested-vmx.txt | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- .before/Documentation/virtual/kvm/nested-vmx.txt 2012-08-01 17:22:47.000000000 +0300
+++ .after/Documentation/virtual/kvm/nested-vmx.txt 2012-08-01 17:22:47.000000000 +0300
@@ -38,8 +38,8 @@ The current code supports running Linux
Only 64-bit guest hypervisors are supported.
Additional patches for running Windows under guest KVM, and Linux under
-guest VMware server, and support for nested EPT, are currently running in
-the lab, and will be sent as follow-on patchsets.
+guest VMware server, are currently running in the lab, and will be sent as
+follow-on patchsets.
Running nested VMX
* [PATCH 10/10] nEPT: Miscellaneous cleanups
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
` (8 preceding siblings ...)
2012-08-01 14:41 ` [PATCH 09/10] nEPT: Documentation Nadav Har'El
@ 2012-08-01 14:41 ` Nadav Har'El
2012-08-01 15:07 ` [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Avi Kivity
10 siblings, 0 replies; 15+ messages in thread
From: Nadav Har'El @ 2012-08-01 14:41 UTC (permalink / raw)
To: kvm; +Cc: Joerg.Roedel, avi, owasserm, abelg, eddie.dong, yang.z.zhang
Some trivial code cleanups not really related to nested EPT.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
arch/x86/kvm/vmx.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
--- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.000000000 +0300
@@ -616,7 +616,6 @@ static void nested_release_page_clean(st
static u64 construct_eptp(unsigned long root_hpa);
static void kvm_cpu_vmxon(u64 addr);
static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
static void vmx_set_segment(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -895,8 +894,7 @@ static inline bool nested_cpu_has2(struc
(vmcs12->secondary_vm_exec_control & bit);
}
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
- struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
{
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
}
@@ -6135,7 +6133,7 @@ static int vmx_handle_exit(struct kvm_vc
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
!(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
- get_vmcs12(vcpu), vcpu)))) {
+ get_vmcs12(vcpu))))) {
if (vmx_interrupt_allowed(vcpu)) {
vmx->soft_vnmi_blocked = 0;
} else if (vmx->vnmi_blocked_time > 1000000000LL &&
* Re: [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX
2012-08-01 14:36 [PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX Nadav Har'El
` (9 preceding siblings ...)
2012-08-01 14:41 ` [PATCH 10/10] nEPT: Miscellaneous cleanups Nadav Har'El
@ 2012-08-01 15:07 ` Avi Kivity
10 siblings, 0 replies; 15+ messages in thread
From: Avi Kivity @ 2012-08-01 15:07 UTC (permalink / raw)
To: Nadav Har'El
Cc: kvm, Joerg.Roedel, owasserm, abelg, eddie.dong, yang.z.zhang
On 08/01/2012 05:36 PM, Nadav Har'El wrote:
> The following patches add nested EPT support to Nested VMX.
>
> This is the second version of this patch set. Most of the issues from the
> previous reviews were handled, and in particular there is now a new variant
> of paging_tmpl for EPT page tables.
Thanks for this repost.
> However, while this version does work in my tests, there are still some known
> problems/bugs with this version and unhandled issues from the previous review:
>
> 1. 32-bit *PAE* L2s currently don't work. non-PAE 32-bit L2s do work
> (and so do, of course, 64-bit L2s).
>
I'm guessing that this has to do with loading the PDPTEs; probably we're
loading them from L1 instead of L2 during mode transitions.
--
error compiling committee.c: too many arguments to function