kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/10] nEPT: Nested EPT support for Nested VMX
@ 2011-11-10  9:57 Nadav Har'El
  2011-11-10  9:58 ` [PATCH 01/10] nEPT: Module option Nadav Har'El
                   ` (10 more replies)
  0 siblings, 11 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10  9:57 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

The following patches add nested EPT support to Nested VMX.

Nested EPT means emulating EPT for an L1 guest, allowing it use EPT when
running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
its own cr3 and take its own page faults without either of L0 or L1 getting
involved. In many workloads this significanlty improves L2's performance over
the previous two alternatives (shadow page tables over ept, and shadow page
tables over shadow page tables). Our paper [1] described these three options,
and the advantages of nested EPT ("multidimensional paging").

Nested EPT is enabled by default (if the hardware supports EPT), so users do
not have to do anything special to enjoy the performance improvement that
this patch gives to L2 guests.

Just as a non-scientific, non-representative indication of the kind of
dramatic performance improvement you may see in workloads that have a lot of
context switches and page faults, here is a measurement of the time
an example single-threaded "make" took in L2 (kvm over kvm):

 shadow over shadow: 105 seconds
 ("ept=0" forces this)

 shadow over EPT: 87 seconds
 (the previous default; Can be forced now with "nested_ept=0")

 EPT over EPT: 29 seconds
 (the default after this patch)

Note that the same test on L1 (with EPT) took 25 seconds, so for this example
workload, performance of nested virtualization is now very close to that of
single-level virtualization.


[1] "The Turtles Project: Design and Implementation of Nested Virtualization",
    http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Patch statistics:
-----------------

 Documentation/virtual/kvm/nested-vmx.txt |    4 
 arch/x86/include/asm/kvm_host.h          |    1 
 arch/x86/include/asm/vmx.h               |    2 
 arch/x86/kvm/mmu.c                       |   16 -
 arch/x86/kvm/paging_tmpl.h               |    6 
 arch/x86/kvm/vmx.c                       |  259 ++++++++++++++++++++-
 arch/x86/kvm/x86.c                       |   11 
 7 files changed, 269 insertions(+), 30 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 01/10] nEPT: Module option
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
@ 2011-11-10  9:58 ` Nadav Har'El
  2011-11-10 12:23   ` Avi Kivity
  2011-11-10  9:58 ` [PATCH 02/10] nEPT: MMU context for nested EPT Nadav Har'El
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10  9:58 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

Add a module option "nested_ept" determining whether to enable Nested EPT.

Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when
running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
its own cr3 and take its own page faults without either of L0 or L1 getting
involved. This often significanlty improves L2's performance over the
previous two alternatives (shadow page tables over ept, and shadow page
tables over shadow page tables).

nested_ept is currently enabled by default (when nested VMX is enabled),
unless L0 doesn't have EPT or disabled it with ept=0.

Users would not normally want to explicitly disable this option. One reason
why one might want to disable it is to force L1 to make due without the EPT
capability, when anticipating a future need to migrate this L1 to another
host which doesn't have EPT. Note that currently there is no API to turn off
nested EPT for just a single L1 guest. However, obviously, an individual L1
guest may choose not to use EPT - the nested_cpu_has_ept() checks if L1
actually used EPT when running L2.

In the future, we can support emulation of EPT for L1 *always*, even when L0
itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
has some theoretical advantages over the baseline "shadow page tables on
shadow page tables" mode typically used when EPT is not available to L0 -
namely that L2's cr3 changes and page faults can be handled in L0 and do not
need to be propagated to L1. However, currently we do not support this mode,
and it is becoming less interesting as newer processors all support EPT.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.000000000 +0200
@@ -83,6 +83,10 @@ module_param(fasteoi, bool, S_IRUGO);
 static int __read_mostly nested = 0;
 module_param(nested, bool, S_IRUGO);
 
+/* Whether L0 emulates EPT for its L1 guests. It doesn't mean L1 must use it */
+static int __read_mostly nested_ept = 1;
+module_param(nested_ept, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\
@@ -875,6 +879,11 @@ static inline bool nested_cpu_has_virtua
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2642,6 +2651,9 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_ple())
 		ple_gap = 0;
 
+	if (!nested || !enable_ept)
+		nested_ept = 0;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
  2011-11-10  9:58 ` [PATCH 01/10] nEPT: Module option Nadav Har'El
@ 2011-11-10  9:58 ` Nadav Har'El
  2011-11-10 10:31   ` Avi Kivity
  2011-11-10 12:49   ` Avi Kivity
  2011-11-10  9:59 ` [PATCH 03/10] nEPT: Fix cr3 handling in nested exit and entry Nadav Har'El
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10  9:58 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). We then need to switch back and forth between this
nested context and the regular MMU context when switching between L1 and L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   60 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.000000000 +0200
@@ -6443,6 +6443,59 @@ static void vmx_set_supported_cpuid(u32 
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+	/* return the page table to be shadowed - in our case, EPT12 */
+	return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static u64 nested_ept_get_pdptr(struct kvm_vcpu *vcpu, int index)
+{
+	/*
+	 * This function is called (as mmu.get_pdptr()) in mmu.c to help read
+	 * a to-be-shadowed page table in PAE (3-level) format. However, the
+	 * EPT table we're now shadowing (this is the nested EPT mmu), must
+	 * always have 4 levels, and is not in PAE format, so this function
+	 * should never be called.
+	 */
+	kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+	struct x86_exception *fault)
+{
+	struct vmcs12 *vmcs12;
+	nested_vmx_vmexit(vcpu);
+	vmcs12 = get_vmcs12(vcpu);
+	/*
+	 * Note no need to set vmcs12->vm_exit_reason as it is already copied
+	 * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+	 */
+	vmcs12->exit_qualification = fault->error_code;
+	vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
+	vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
+	vcpu->arch.mmu.get_pdptr         = nested_ept_get_pdptr;
+	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+	vcpu->arch.mmu.shadow_root_level = get_ept_level();
+
+	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
+
+	return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -6652,6 +6705,11 @@ static void prepare_vmcs02(struct kvm_vc
 		vmx_flush_tlb(vcpu);
 	}
 
+	if (nested_cpu_has_ept(vmcs12)) {
+		kvm_mmu_unload(vcpu);
+		nested_ept_init_mmu_context(vcpu);
+	}
+
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->guest_ia32_efer;
 	if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -6982,6 +7040,8 @@ void load_vmcs12_host_state(struct kvm_v
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 	kvm_set_cr4(vcpu, vmcs12->host_cr4);
 
+	if (nested_cpu_has_ept(vmcs12))
+		nested_ept_uninit_mmu_context(vcpu);
 	/* shadow page tables on either EPT or shadow page tables */
 	kvm_set_cr3(vcpu, vmcs12->host_cr3);
 	kvm_mmu_reset_context(vcpu);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 03/10] nEPT: Fix cr3 handling in nested exit and entry
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
  2011-11-10  9:58 ` [PATCH 01/10] nEPT: Module option Nadav Har'El
  2011-11-10  9:58 ` [PATCH 02/10] nEPT: MMU context for nested EPT Nadav Har'El
@ 2011-11-10  9:59 ` Nadav Har'El
  2011-11-10  9:59 ` [PATCH 04/10] nEPT: Fix page table format in nested EPT Nadav Har'El
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10  9:59 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.000000000 +0200
@@ -6737,6 +6737,17 @@ static void prepare_vmcs02(struct kvm_vc
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);
 
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+	}
+
 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
 	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -6968,6 +6979,25 @@ void prepare_vmcs12(struct kvm_vcpu *vcp
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+	/*
+	 * In some cases (usually, nested EPT), L2 is allowed to change its
+	 * own CR3 without exiting. If it has changed it, we must keep it.
+	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+	 */
+	if (enable_ept)
+		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so save their PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+
 	/* TODO: These cannot have changed unless we have MSR bitmaps and
 	 * the relevant bit asks not to trap the change */
 	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 04/10] nEPT: Fix page table format in nested EPT
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
                   ` (2 preceding siblings ...)
  2011-11-10  9:59 ` [PATCH 03/10] nEPT: Fix cr3 handling in nested exit and entry Nadav Har'El
@ 2011-11-10  9:59 ` Nadav Har'El
  2011-11-10 10:37   ` Avi Kivity
  2011-11-10 13:07   ` Orit Wasserman
  2011-11-10 10:00 ` [PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3 Nadav Har'El
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10  9:59 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

When the existing KVM MMU code creates a shadow page table, it assumes it
has the normal x86 page table format. This is obviously correct for normal
shadow page tables, and also correct for AMD's NPT.
Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
page tables, so when we create a shadow EPT table (i.e., in nested EPT),
we need to slightly modify the way in which this table table is built.

In particular, when mmu.c's link_shadow_page() creates non-leaf page table
entries, it used to enable the "present", "accessed", "writable" and "user"
flags on these entries. While this is correct for ordinary page tables, it
is wrong in EPT tables - where these bits actually have completely different
meaning (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h).
In particular, leaving the code as-is causes bit 5 of the PTE to be turned on
(supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes
an "EPT Misconfiguration" failure.

So we must move link_shadow_page's list of extra bits to a new mmu context
field, which is set differently for nested EPT.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/mmu.c              |   16 +++++++++-------
 arch/x86/kvm/paging_tmpl.h      |    6 ++++--
 arch/x86/kvm/vmx.c              |    3 +++
 4 files changed, 17 insertions(+), 9 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.000000000 +0200
@@ -287,6 +287,7 @@ struct kvm_mmu {
 	bool nx;
 
 	u64 pdptrs[4]; /* pae */
+	u64 link_shadow_page_set_bits;
 };
 
 struct kvm_vcpu_arch {
--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
@@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
 	vcpu->arch.mmu.get_pdptr         = nested_ept_get_pdptr;
 	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
 	vcpu->arch.mmu.shadow_root_level = get_ept_level();
+	vcpu->arch.mmu.link_shadow_page_set_bits =
+		VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+		VMX_EPT_EXECUTABLE_MASK;
 
 	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
 
--- .before/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.000000000 +0200
@@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
 	return __shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
 {
-	u64 spte;
-
-	spte = __pa(sp->spt)
-		| PT_PRESENT_MASK | PT_ACCESSED_MASK
-		| PT_WRITABLE_MASK | PT_USER_MASK;
-	mmu_spte_set(sptep, spte);
+	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
 }
 
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
@@ -3366,6 +3361,13 @@ int kvm_init_shadow_mmu(struct kvm_vcpu 
 	vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
 	vcpu->arch.mmu.base_role.smep_andnot_wp
 		= smep && !is_write_protection(vcpu);
+	/*
+	 * link_shadow() should apply these bits in shadow page tables, and
+	 * in shadow NPT tables (nested NPT). For nested EPT, different bits
+	 * apply.
+	 */
+	vcpu->arch.mmu.link_shadow_page_set_bits = PT_PRESENT_MASK |
+		PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
 	return r;
 }
--- .before/arch/x86/kvm/paging_tmpl.h	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/kvm/paging_tmpl.h	2011-11-10 11:33:59.000000000 +0200
@@ -515,7 +515,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 			goto out_gpte_changed;
 
 		if (sp)
-			link_shadow_page(it.sptep, sp);
+			link_shadow_page(it.sptep, sp,
+				vcpu->arch.mmu.link_shadow_page_set_bits);
 	}
 
 	for (;
@@ -535,7 +536,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access, it.sptep);
-		link_shadow_page(it.sptep, sp);
+		link_shadow_page(it.sptep, sp,
+			vcpu->arch.mmu.link_shadow_page_set_bits);
 	}
 
 	clear_sp_write_flooding_count(it.sptep);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
                   ` (3 preceding siblings ...)
  2011-11-10  9:59 ` [PATCH 04/10] nEPT: Fix page table format in nested EPT Nadav Har'El
@ 2011-11-10 10:00 ` Nadav Har'El
  2011-11-10 10:00 ` [PATCH 06/10] nEPT: Some additional comments Nadav Har'El
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 10:00 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch just
comments it out.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/x86.c |   11 -----------
 1 file changed, 11 deletions(-)

--- .before/arch/x86/kvm/x86.c	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2011-11-10 11:33:59.000000000 +0200
@@ -690,17 +690,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, u
 		 */
 	}
 
-	/*
-	 * Does the new cr3 value map to physical memory? (Note, we
-	 * catch an invalid cr3 even in real-mode, because it would
-	 * cause trouble later on when we turn on paging anyway.)
-	 *
-	 * A real CPU would silently accept an invalid cr3 and would
-	 * attempt to use it - with largely undefined (and often hard
-	 * to debug) behavior on the guest side.
-	 */
-	if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-		return 1;
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
 	vcpu->arch.mmu.new_cr3(vcpu);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 06/10] nEPT: Some additional comments
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
                   ` (4 preceding siblings ...)
  2011-11-10 10:00 ` [PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3 Nadav Har'El
@ 2011-11-10 10:00 ` Nadav Har'El
  2011-11-10 10:01 ` [PATCH 07/10] nEPT: Advertise EPT to L1 Nadav Har'El
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 10:00 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
@@ -5815,7 +5815,20 @@ static bool nested_vmx_exit_handled(stru
 		return nested_cpu_has2(vmcs12,
 			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 	case EXIT_REASON_EPT_VIOLATION:
+		/*
+		 * L0 always deals with the EPT violation. If nested EPT is
+		 * used, and the nested mmu code discovers that the address is
+		 * missing in the guest EPT table (EPT12), the EPT violation
+		 * will be injected with nested_ept_inject_page_fault()
+		 */
+		return 0;
 	case EXIT_REASON_EPT_MISCONFIG:
+		/*
+		 * L2 never uses directly L1's EPT, but rather L0's own EPT
+		 * table (shadow on EPT) or a merged EPT table that L0 built
+		 * (EPT on EPT). So any problems with the structure of the
+		 * table is L0's fault.
+		 */
 		return 0;
 	case EXIT_REASON_WBINVD:
 		return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
@@ -6736,7 +6749,12 @@ static void prepare_vmcs02(struct kvm_vc
 	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
 	vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
-	/* shadow page tables on either EPT or shadow page tables */
+	/*
+	 * Note that kvm_set_cr3() and kvm_mmu_reset_context() will do the
+	 * right thing, and set GUEST_CR3 and/or EPT_POINTER in all supported
+	 * settings: 1. shadow page tables on shadow page tables, 2. shadow
+	 * page tables on EPT, 3. EPT on EPT.
+	 */
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);
 
@@ -7075,7 +7093,6 @@ void load_vmcs12_host_state(struct kvm_v
 
 	if (nested_cpu_has_ept(vmcs12))
 		nested_ept_uninit_mmu_context(vcpu);
-	/* shadow page tables on either EPT or shadow page tables */
 	kvm_set_cr3(vcpu, vmcs12->host_cr3);
 	kvm_mmu_reset_context(vcpu);
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 07/10] nEPT: Advertise EPT to L1
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
                   ` (5 preceding siblings ...)
  2011-11-10 10:00 ` [PATCH 06/10] nEPT: Some additional comments Nadav Har'El
@ 2011-11-10 10:01 ` Nadav Har'El
  2011-11-10 10:01 ` [PATCH 08/10] nEPT: Nested INVEPT Nadav Har'El
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 10:01 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
@@ -1908,6 +1908,7 @@ static u32 nested_vmx_secondary_ctls_low
 static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
 	/*
@@ -1980,6 +1981,16 @@ static __init void nested_vmx_setup_ctls
 	nested_vmx_secondary_ctls_low = 0;
 	nested_vmx_secondary_ctls_high &=
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	if (nested_ept)
+		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+
+	/* ept capabilities */
+	if (nested_ept) {
+		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+		nested_vmx_ept_caps &= vmx_capability.ept;
+	} else
+		nested_vmx_ept_caps = 0;
+
 }
 
 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2079,8 +2090,8 @@ static int vmx_get_vmx_msr(struct kvm_vc
 					nested_vmx_secondary_ctls_high);
 		break;
 	case MSR_IA32_VMX_EPT_VPID_CAP:
-		/* Currently, no nested ept or nested vpid */
-		*pdata = 0;
+		/* Currently, no nested vpid support */
+		*pdata = nested_vmx_ept_caps;
 		break;
 	default:
 		return 0;

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 08/10] nEPT: Nested INVEPT
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
                   ` (6 preceding siblings ...)
  2011-11-10 10:01 ` [PATCH 07/10] nEPT: Advertise EPT to L1 Nadav Har'El
@ 2011-11-10 10:01 ` Nadav Har'El
  2011-11-10 12:17   ` Avi Kivity
  2011-11-10 10:02 ` [PATCH 09/10] nEPT: Documentation Nadav Har'El
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 10:01 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

If we let L1 use EPT, we should probably also support the INVEPT instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/vmx.h |    2 
 arch/x86/kvm/vmx.c         |  112 +++++++++++++++++++++++++++++++++++
 2 files changed, 114 insertions(+)

--- .before/arch/x86/include/asm/vmx.h	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/include/asm/vmx.h	2011-11-10 11:33:59.000000000 +0200
@@ -279,6 +279,7 @@ enum vmcs_field {
 #define EXIT_REASON_APIC_ACCESS         44
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_INVEPT		50
 #define EXIT_REASON_WBINVD		54
 #define EXIT_REASON_XSETBV		55
 
@@ -404,6 +405,7 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT			(1ull << 17)
+#define VMX_EPT_INVEPT_BIT			(1ull << 20)
 #define VMX_EPT_EXTENT_INDIVIDUAL_BIT		(1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT		(1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT		(1ull << 26)
--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
@@ -351,6 +351,8 @@ struct nested_vmx {
 	struct list_head vmcs02_pool;
 	int vmcs02_num;
 	u64 vmcs01_tsc_offset;
+	/* Remember last EPT02, for single-context INVEPT optimization */
+	u64 last_eptp02;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
 	/*
@@ -1987,6 +1989,10 @@ static __init void nested_vmx_setup_ctls
 	/* ept capabilities */
 	if (nested_ept) {
 		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+		nested_vmx_ept_caps |=
+			VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+			VMX_EPT_EXTENT_CONTEXT_BIT |
+			VMX_EPT_EXTENT_INDIVIDUAL_BIT;
 		nested_vmx_ept_caps &= vmx_capability.ept;
 	} else
 		nested_vmx_ept_caps = 0;
@@ -5568,6 +5574,105 @@ static int handle_vmptrst(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+	u32 vmx_instruction_info;
+	unsigned long type;
+	gva_t gva;
+	struct x86_exception e;
+	struct {
+		u64 eptp, gpa;
+	} operand;
+
+
+	if (!nested_ept || !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/* According to the Intel VMX instruction reference, the memory
+	 * operand is read even if it isn't needed (e.g., for type==global)
+	 */
+	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmx_instruction_info, &gva))
+		return 1;
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+				sizeof(operand), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+	switch (type) {
+	case VMX_EPT_EXTENT_GLOBAL:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			ept_sync_global();
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	case VMX_EPT_EXTENT_CONTEXT:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			/*
+			 * We efficiently handle the common case, of L1
+			 * invalidating the last eptp it used to run L2.
+			 * TODO: Instead of saving one last_eptp02, look up
+			 * operand.eptp in the shadow EPT table cache, to
+			 * find its shadow. Then last_eptp02 won't be needed.
+			 */
+			struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+			struct vcpu_vmx *vmx = to_vmx(vcpu);
+			if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
+			    (vmcs12->ept_pointer == operand.eptp) &&
+			    vmx->nested.last_eptp02)
+				ept_sync_context(vmx->nested.last_eptp02);
+			else
+				ept_sync_global();
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+			struct vcpu_vmx *vmx = to_vmx(vcpu);
+			if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
+			    (vmcs12->ept_pointer == operand.eptp) &&
+			    vmx->nested.last_eptp02)
+				ept_sync_individual_addr(
+					vmx->nested.last_eptp02, operand.gpa);
+			else
+				ept_sync_global();
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	default:
+		nested_vmx_failValid(vcpu,
+			VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+	}
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -5609,6 +5714,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
 	[EXIT_REASON_MWAIT_INSTRUCTION]	      = handle_invalid_op,
 	[EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
+	[EXIT_REASON_INVEPT]                  = handle_invept,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -5793,6 +5899,7 @@ static bool nested_vmx_exit_handled(stru
 	case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
 	case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
 	case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+	case EXIT_REASON_INVEPT:
 		/*
 		 * VMX instructions trap unconditionally. This allows L1 to
 		 * emulate them for its L2 guest, i.e., allows 3-level nesting!
@@ -7056,6 +7163,11 @@ void prepare_vmcs12(struct kvm_vcpu *vcp
 	/* clear vm-entry fields which are to be cleared on exit */
 	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
 		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+
+	/* For single-context INVEPT optimization */
+	if (nested_cpu_has_ept(vmcs12))
+		to_vmx(vcpu)->nested.last_eptp02 = vmcs_read64(EPT_POINTER);
+
 }
 
 /*

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 09/10] nEPT: Documentation
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
                   ` (7 preceding siblings ...)
  2011-11-10 10:01 ` [PATCH 08/10] nEPT: Nested INVEPT Nadav Har'El
@ 2011-11-10 10:02 ` Nadav Har'El
  2011-11-10 10:02 ` [PATCH 10/10] nEPT: Miscelleneous cleanups Nadav Har'El
  2011-11-10 12:26 ` [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Avi Kivity
  10 siblings, 0 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 10:02 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

Update the documentation to no longer say that nested EPT is not supported.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 Documentation/virtual/kvm/nested-vmx.txt |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- .before/Documentation/virtual/kvm/nested-vmx.txt	2011-11-10 11:33:59.000000000 +0200
+++ .after/Documentation/virtual/kvm/nested-vmx.txt	2011-11-10 11:33:59.000000000 +0200
@@ -38,8 +38,8 @@ The current code supports running Linux 
 Only 64-bit guest hypervisors are supported.
 
 Additional patches for running Windows under guest KVM, and Linux under
-guest VMware server, and support for nested EPT, are currently running in
-the lab, and will be sent as follow-on patchsets.
+guest VMware server, are currently running in the lab, and will be sent as
+follow-on patchsets.
 
 
 Running nested VMX

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 10/10] nEPT: Miscelleneous cleanups
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
                   ` (8 preceding siblings ...)
  2011-11-10 10:02 ` [PATCH 09/10] nEPT: Documentation Nadav Har'El
@ 2011-11-10 10:02 ` Nadav Har'El
  2011-11-10 12:26 ` [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Avi Kivity
  10 siblings, 0 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 10:02 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, avi, owasserm, abelg

Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:34:00.000000000 +0200
@@ -611,7 +611,6 @@ static void nested_release_page_clean(st
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
@@ -875,8 +874,7 @@ static inline bool nested_cpu_has2(struc
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-	struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6020,7 +6018,7 @@ static int vmx_handle_exit(struct kvm_vc
 
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
 	    !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-	                                get_vmcs12(vcpu), vcpu)))) {
+					get_vmcs12(vcpu))))) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-10  9:58 ` [PATCH 02/10] nEPT: MMU context for nested EPT Nadav Har'El
@ 2011-11-10 10:31   ` Avi Kivity
  2011-11-10 12:49   ` Avi Kivity
  1 sibling, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 10:31 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> KVM's existing shadow MMU code already supports nested TDP. To use it, we
> need to set up a new "MMU context" for nested EPT, and create a few callbacks
> for it (nested_ept_*()). We then need to switch back and forth between this
> nested context and the regular MMU context when switching between L1 and L2.
>
> +
> +static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
> +	struct x86_exception *fault)
> +{
> +	struct vmcs12 *vmcs12;
> +	nested_vmx_vmexit(vcpu);
> +	vmcs12 = get_vmcs12(vcpu);
> +	/*
> +	 * Note no need to set vmcs12->vm_exit_reason as it is already copied
> +	 * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
> +	 */

Not in all cases.  For example, L0 may emulate an L2 instruction, which
then faults at the EPT level.

> +	vmcs12->exit_qualification = fault->error_code;
> +	vmcs12->guest_physical_address = fault->address;
> +}

What about the guest linear address field?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT
  2011-11-10  9:59 ` [PATCH 04/10] nEPT: Fix page table format in nested EPT Nadav Har'El
@ 2011-11-10 10:37   ` Avi Kivity
  2011-11-10 11:03     ` Nadav Har'El
  2011-11-10 13:07   ` Orit Wasserman
  1 sibling, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 10:37 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 11:59 AM, Nadav Har'El wrote:
> When the existing KVM MMU code creates a shadow page table, it assumes it
> has the normal x86 page table format. This is obviously correct for normal
> shadow page tables, and also correct for AMD's NPT.
> Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
> page tables, so when we create a shadow EPT table (i.e., in nested EPT),
> we need to slightly modify the way in which this table table is built.
>
> In particular, when mmu.c's link_shadow_page() creates non-leaf page table
> entries, it used to enable the "present", "accessed", "writable" and "user"
> flags on these entries. While this is correct for ordinary page tables, it
> is wrong in EPT tables - where these bits actually have completely different
> meaning (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h).
> In particular, leaving the code as-is causes bit 5 of the PTE to be turned on
> (supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes
> an "EPT Misconfiguration" failure.
>
> So we must move link_shadow_page's list of extra bits to a new mmu context
> field, which is set differently for nested EPT.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    1 +
>  arch/x86/kvm/mmu.c              |   16 +++++++++-------
>  arch/x86/kvm/paging_tmpl.h      |    6 ++++--
>  arch/x86/kvm/vmx.c              |    3 +++
>  4 files changed, 17 insertions(+), 9 deletions(-)
>
> --- .before/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.000000000 +0200
> +++ .after/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.000000000 +0200
> @@ -287,6 +287,7 @@ struct kvm_mmu {
>  	bool nx;
>  
>  	u64 pdptrs[4]; /* pae */
> +	u64 link_shadow_page_set_bits;
>  };
>  
>  struct kvm_vcpu_arch {
> --- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
> @@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
>  	vcpu->arch.mmu.get_pdptr         = nested_ept_get_pdptr;
>  	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
>  	vcpu->arch.mmu.shadow_root_level = get_ept_level();
> +	vcpu->arch.mmu.link_shadow_page_set_bits =
> +		VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> +		VMX_EPT_EXECUTABLE_MASK;
>  
>  	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
>  
> --- .before/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.000000000 +0200
> +++ .after/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.000000000 +0200
> @@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
>  	return __shadow_walk_next(iterator, *iterator->sptep);
>  }
>  
> -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
> +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
>  {
> -	u64 spte;
> -
> -	spte = __pa(sp->spt)
> -		| PT_PRESENT_MASK | PT_ACCESSED_MASK
> -		| PT_WRITABLE_MASK | PT_USER_MASK;
> -	mmu_spte_set(sptep, spte);
> +	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
>  }
>

Minor nit: you can just use link_shadow_page_set_bits here instead of
passing it around (unless later you have a different value for the
parameter?)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT
  2011-11-10 10:37   ` Avi Kivity
@ 2011-11-10 11:03     ` Nadav Har'El
  2011-11-10 12:21       ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 11:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT":
> > @@ -287,6 +287,7 @@ struct kvm_mmu {
> >  	bool nx;
> >  
> >  	u64 pdptrs[4]; /* pae */
> > +	u64 link_shadow_page_set_bits;
>...
> > +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
> >  {
> > -	u64 spte;
> > -
> > -	spte = __pa(sp->spt)
> > -		| PT_PRESENT_MASK | PT_ACCESSED_MASK
> > -		| PT_WRITABLE_MASK | PT_USER_MASK;
> > -	mmu_spte_set(sptep, spte);
> > +	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
> >  }
> >
> 
> Minor nit: you can just use link_shadow_page_set_bits here instead of
> passing it around (unless later you have a different value for the
> parameter?)

The problem was that link_shadow_page did not take an kvm_mmu parameter,
so I don't know where to find this link_shadow_page_set_bits. So either
I pass the pointer to the entire kvm_mmu to link_shadow_page, or I just
pass the only field which I need... I thought that passing the single
field I need was cleaner - but I can easily change it if you prefer to
pass the kvm_mmu.

Thanks,

Nadav.

-- 
Nadav Har'El                        |                  Thursday, Nov 10 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I had a lovely evening. Unfortunately,
http://nadav.harel.org.il           |this wasn't it. - Groucho Marx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/10] nEPT: Nested INVEPT
  2011-11-10 10:01 ` [PATCH 08/10] nEPT: Nested INVEPT Nadav Har'El
@ 2011-11-10 12:17   ` Avi Kivity
  2011-12-11 14:24     ` Nadav Har'El
  0 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 12:17 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 12:01 PM, Nadav Har'El wrote:
> If we let L1 use EPT, we should probably also support the INVEPT instruction.
>
> +	case VMX_EPT_EXTENT_CONTEXT:
> +		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
> +			nested_vmx_failValid(vcpu,
> +				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +		else {
> +			/*
> +			 * We efficiently handle the common case, of L1
> +			 * invalidating the last eptp it used to run L2.
> +			 * TODO: Instead of saving one last_eptp02, look up
> +			 * operand.eptp in the shadow EPT table cache, to
> +			 * find its shadow. Then last_eptp02 won't be needed.
> +			 */
> +			struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +			struct vcpu_vmx *vmx = to_vmx(vcpu);
> +			if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
> +			    (vmcs12->ept_pointer == operand.eptp) &&
> +			    vmx->nested.last_eptp02)
> +				ept_sync_context(vmx->nested.last_eptp02);
> +			else
> +				ept_sync_global();

Are either of these needed?  Won't a write to a shadowed EPT table cause
them anyway?

> +			nested_vmx_succeed(vcpu);
> +		}
> +		break;
> +	case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
> +		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
> +			nested_vmx_failValid(vcpu,
> +				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +		else {
> +			struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +			struct vcpu_vmx *vmx = to_vmx(vcpu);
> +			if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
> +			    (vmcs12->ept_pointer == operand.eptp) &&
> +			    vmx->nested.last_eptp02)
> +				ept_sync_individual_addr(
> +					vmx->nested.last_eptp02, operand.gpa);

Same here.

> +			else
> +				ept_sync_global();
> +			nested_vmx_succeed(vcpu);
> +		}
> +		break;
> +	default:
> +		nested_vmx_failValid(vcpu,
> +			VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +	}
> +
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
> +
>

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT
  2011-11-10 11:03     ` Nadav Har'El
@ 2011-11-10 12:21       ` Avi Kivity
  2011-11-10 12:50         ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 12:21 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 01:03 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT":
> > > @@ -287,6 +287,7 @@ struct kvm_mmu {
> > >  	bool nx;
> > >  
> > >  	u64 pdptrs[4]; /* pae */
> > > +	u64 link_shadow_page_set_bits;
> >...
> > > +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
> > >  {
> > > -	u64 spte;
> > > -
> > > -	spte = __pa(sp->spt)
> > > -		| PT_PRESENT_MASK | PT_ACCESSED_MASK
> > > -		| PT_WRITABLE_MASK | PT_USER_MASK;
> > > -	mmu_spte_set(sptep, spte);
> > > +	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
> > >  }
> > >
> > 
> > Minor nit: you can just use link_shadow_page_set_bits here instead of
> > passing it around (unless later you have a different value for the
> > parameter?)
>
> The problem was that link_shadow_page did not take an kvm_mmu parameter,
> so I don't know where to find this link_shadow_page_set_bits. So either
> I pass the pointer to the entire kvm_mmu to link_shadow_page, or I just
> pass the only field which I need... I thought that passing the single
> field I need was cleaner - but I can easily change it if you prefer to
> pass the kvm_mmu.

Ah, doesn't matter either way.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/10] nEPT: Module option
  2011-11-10  9:58 ` [PATCH 01/10] nEPT: Module option Nadav Har'El
@ 2011-11-10 12:23   ` Avi Kivity
  2011-11-10 14:21     ` Nadav Har'El
  0 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 12:23 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> Add a module option "nested_ept" determining whether to enable Nested EPT.
>
> Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when
> running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
> its own cr3 and take its own page faults without either of L0 or L1 getting
> involved. This often significanlty improves L2's performance over the
> previous two alternatives (shadow page tables over ept, and shadow page
> tables over shadow page tables).
>
> nested_ept is currently enabled by default (when nested VMX is enabled),
> unless L0 doesn't have EPT or disabled it with ept=0.
>
> Users would not normally want to explicitly disable this option. One reason
> why one might want to disable it is to force L1 to make due without the EPT
> capability, when anticipating a future need to migrate this L1 to another
> host which doesn't have EPT. Note that currently there is no API to turn off
> nested EPT for just a single L1 guest. However, obviously, an individual L1
> guest may choose not to use EPT - the nested_cpu_has_ept() checks if L1
> actually used EPT when running L2.
>
> In the future, we can support emulation of EPT for L1 *always*, even when L0
> itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
> has some theoretical advantages over the baseline "shadow page tables on
> shadow page tables" mode typically used when EPT is not available to L0 -
> namely that L2's cr3 changes and page faults can be handled in L0 and do not
> need to be propagated to L1. However, currently we do not support this mode,
> and it is becoming less interesting as newer processors all support EPT.
>
>

I think we can live without this.  But we do need a way to control what
features are exposed to the guest, for compatibility and live migration
purposes, as we do with cpuid.  So we need some way for host userspace
to write to the vmx read-only feature reporting MSRs.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX
  2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
                   ` (9 preceding siblings ...)
  2011-11-10 10:02 ` [PATCH 10/10] nEPT: Miscelleneous cleanups Nadav Har'El
@ 2011-11-10 12:26 ` Avi Kivity
  2011-11-13  8:52   ` Nadav Har'El
  10 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 12:26 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 11:57 AM, Nadav Har'El wrote:
> The following patches add nested EPT support to Nested VMX.
>
> Nested EPT means emulating EPT for an L1 guest, allowing it use EPT when
> running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
> its own cr3 and take its own page faults without either of L0 or L1 getting
> involved. In many workloads this significanlty improves L2's performance over
> the previous two alternatives (shadow page tables over ept, and shadow page
> tables over shadow page tables). Our paper [1] described these three options,
> and the advantages of nested EPT ("multidimensional paging").
>
> Nested EPT is enabled by default (if the hardware supports EPT), so users do
> not have to do anything special to enjoy the performance improvement that
> this patch gives to L2 guests.
>
> Just as a non-scientific, non-representative indication of the kind of
> dramatic performance improvement you may see in workloads that have a lot of
> context switches and page faults, here is a measurement of the time
> an example single-threaded "make" took in L2 (kvm over kvm):
>
>  shadow over shadow: 105 seconds
>  ("ept=0" forces this)
>
>  shadow over EPT: 87 seconds
>  (the previous default; Can be forced now with "nested_ept=0")
>
>  EPT over EPT: 29 seconds
>  (the default after this patch)
>
> Note that the same test on L1 (with EPT) took 25 seconds, so for this example
> workload, performance of nested virtualization is now very close to that of
> single-level virtualization.
>
>

This patchset is missing a fairly hairy patch that makes reading L2
virtual addresses work.  The standard example is L1 passing a bit of
hardware (emulated in L0) to a L2; when L2 accesses it, the instruction
will fault and need to be handled in L0, transparently to L1.  The
emulation can cause a fault to be injected to L2, or and EPT violation
or misconfiguration injected to L1.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-10  9:58 ` [PATCH 02/10] nEPT: MMU context for nested EPT Nadav Har'El
  2011-11-10 10:31   ` Avi Kivity
@ 2011-11-10 12:49   ` Avi Kivity
  2011-11-10 14:40     ` Nadav Har'El
  1 sibling, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 12:49 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> KVM's existing shadow MMU code already supports nested TDP. To use it, we
> need to set up a new "MMU context" for nested EPT, and create a few callbacks
> for it (nested_ept_*()). We then need to switch back and forth between this
> nested context and the regular MMU context when switching between L1 and L2.
>
> +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> +{
> +	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> +
> +	vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
> +	vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
> +	vcpu->arch.mmu.get_pdptr         = nested_ept_get_pdptr;
> +	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
> +	vcpu->arch.mmu.shadow_root_level = get_ept_level();
> +
> +	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
> +
> +	return r;
> +}
> +
>

kvm_init_shadow_mmu() will cause ->page_fault to be set to something
like paging64_page_fault(), which is geared to reading EPT ptes.  How
does this work?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT
  2011-11-10 12:21       ` Avi Kivity
@ 2011-11-10 12:50         ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 12:50 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 02:21 PM, Avi Kivity wrote:
> On 11/10/2011 01:03 PM, Nadav Har'El wrote:
> > On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT":
> > > > @@ -287,6 +287,7 @@ struct kvm_mmu {
> > > >  	bool nx;
> > > >  
> > > >  	u64 pdptrs[4]; /* pae */
> > > > +	u64 link_shadow_page_set_bits;
> > >...
> > > > +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
> > > >  {
> > > > -	u64 spte;
> > > > -
> > > > -	spte = __pa(sp->spt)
> > > > -		| PT_PRESENT_MASK | PT_ACCESSED_MASK
> > > > -		| PT_WRITABLE_MASK | PT_USER_MASK;
> > > > -	mmu_spte_set(sptep, spte);
> > > > +	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
> > > >  }
> > > >
> > > 
> > > Minor nit: you can just use link_shadow_page_set_bits here instead of
> > > passing it around (unless later you have a different value for the
> > > parameter?)
> >
> > The problem was that link_shadow_page did not take an kvm_mmu parameter,
> > so I don't know where to find this link_shadow_page_set_bits. So either
> > I pass the pointer to the entire kvm_mmu to link_shadow_page, or I just
> > pass the only field which I need... I thought that passing the single
> > field I need was cleaner - but I can easily change it if you prefer to
> > pass the kvm_mmu.
>
> Ah, doesn't matter either way.
>

On second thoughts, passing the mmu is better for future maintainability.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT
  2011-11-10  9:59 ` [PATCH 04/10] nEPT: Fix page table format in nested EPT Nadav Har'El
  2011-11-10 10:37   ` Avi Kivity
@ 2011-11-10 13:07   ` Orit Wasserman
  1 sibling, 0 replies; 49+ messages in thread
From: Orit Wasserman @ 2011-11-10 13:07 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, avi, abelg

On 11/10/2011 11:59 AM, Nadav Har'El wrote:
> When the existing KVM MMU code creates a shadow page table, it assumes it
> has the normal x86 page table format. This is obviously correct for normal
> shadow page tables, and also correct for AMD's NPT.
> Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
> page tables, so when we create a shadow EPT table (i.e., in nested EPT),
> we need to slightly modify the way in which this table table is built.
> 
> In particular, when mmu.c's link_shadow_page() creates non-leaf page table
> entries, it used to enable the "present", "accessed", "writable" and "user"
> flags on these entries. While this is correct for ordinary page tables, it
> is wrong in EPT tables - where these bits actually have completely different
> meaning (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h).
> In particular, leaving the code as-is causes bit 5 of the PTE to be turned on
> (supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes
> an "EPT Misconfiguration" failure.
> 
> So we must move link_shadow_page's list of extra bits to a new mmu context
> field, which is set differently for nested EPT.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    1 +
>  arch/x86/kvm/mmu.c              |   16 +++++++++-------
>  arch/x86/kvm/paging_tmpl.h      |    6 ++++--
>  arch/x86/kvm/vmx.c              |    3 +++
>  4 files changed, 17 insertions(+), 9 deletions(-)
> 
> --- .before/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.000000000 +0200
> +++ .after/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.000000000 +0200
> @@ -287,6 +287,7 @@ struct kvm_mmu {
>  	bool nx;
>  
>  	u64 pdptrs[4]; /* pae */
> +	u64 link_shadow_page_set_bits;
>  };
>  
>  struct kvm_vcpu_arch {
> --- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.000000000 +0200
> @@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
>  	vcpu->arch.mmu.get_pdptr         = nested_ept_get_pdptr;
>  	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
>  	vcpu->arch.mmu.shadow_root_level = get_ept_level();
> +	vcpu->arch.mmu.link_shadow_page_set_bits =
> +		VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> +		VMX_EPT_EXECUTABLE_MASK;
>  
>  	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
>  
> --- .before/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.000000000 +0200
> +++ .after/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.000000000 +0200
> @@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
>  	return __shadow_walk_next(iterator, *iterator->sptep);
>  }
>  
> -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
> +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
>  {
> -	u64 spte;
> -
> -	spte = __pa(sp->spt)
> -		| PT_PRESENT_MASK | PT_ACCESSED_MASK
> -		| PT_WRITABLE_MASK | PT_USER_MASK;
> -	mmu_spte_set(sptep, spte);
> +	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
>  }
>  
>  static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> @@ -3366,6 +3361,13 @@ int kvm_init_shadow_mmu(struct kvm_vcpu 
>  	vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
>  	vcpu->arch.mmu.base_role.smep_andnot_wp
>  		= smep && !is_write_protection(vcpu);
> +	/*
> +	 * link_shadow() should apply these bits in shadow page tables, and
> +	 * in shadow NPT tables (nested NPT). For nested EPT, different bits
> +	 * apply.
> +	 */
> +	vcpu->arch.mmu.link_shadow_page_set_bits = PT_PRESENT_MASK |
> +		PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
>  
>  	return r;
>  }
> --- .before/arch/x86/kvm/paging_tmpl.h	2011-11-10 11:33:59.000000000 +0200
> +++ .after/arch/x86/kvm/paging_tmpl.h	2011-11-10 11:33:59.000000000 +0200
> @@ -515,7 +515,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>  			goto out_gpte_changed;
>  
>  		if (sp)
> -			link_shadow_page(it.sptep, sp);
> +			link_shadow_page(it.sptep, sp,
> +				vcpu->arch.mmu.link_shadow_page_set_bits);
>  	}
>  
>  	for (;
> @@ -535,7 +536,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>  
>  		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
>  				      true, direct_access, it.sptep);
> -		link_shadow_page(it.sptep, sp);
> +		link_shadow_page(it.sptep, sp,
> +			vcpu->arch.mmu.link_shadow_page_set_bits);
>  	}
>  
>  	clear_sp_write_flooding_count(it.sptep);
We need to consider the permission in L1's EPT table,what if an entry is read only ?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/10] nEPT: Module option
  2011-11-10 12:23   ` Avi Kivity
@ 2011-11-10 14:21     ` Nadav Har'El
  2011-11-10 14:38       ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 14:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module option":
> On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> > Add a module option "nested_ept" determining whether to enable Nested EPT.
>...
> > In the future, we can support emulation of EPT for L1 *always*, even when L0
> > itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
> > has some theoretical advantages over the baseline "shadow page tables on
> > shadow page tables" mode typically used when EPT is not available to L0 -
> > namely that L2's cr3 changes and page faults can be handled in L0 and do not
> > need to be propagated to L1. However, currently we do not support this mode,
> > and it is becoming less interesting as newer processors all support EPT.
> >
> >
> 
> I think we can live without this.

By "this", do you mean without the "nested_ept" option, or without the
hypothetical "EPT on shadow page tables" feature?

If the former, then I agree we can "live" without it, but since it was
trivial to add, I don't see what harm it can do, and its nice that we
can return with a single L0 option to the old shadow-on-ept paging.
Is there anything specific you don't like about having this option?

About the latter, I agree - as I said, there isn't much point to go and
write this (quite complicated) 3-level shadowing when all new processors
have EPT anyway. So I didn't.

> But we do need a way to control what
> features are exposed to the guest, for compatibility and live migration
> purposes, as we do with cpuid.  So we need some way for host userspace
> to write to the vmx read-only feature reporting MSRs.

I think this is a general issue (which we already discussed earlier),
of nested VMX and not specific to nested EPT. I already put all the
capabilities which the MSR report in variables initialized in a single
function, nested_vmx_setup_ctls_msrs(), so once we devise an appropriate
userspace interface to set these, we can do so easily.

Does nested SVM also have a similar problem, of whether or not it
advertises new or optional SVM features to L1? If it does have this
problem, how was it solved there?

-- 
Nadav Har'El                        |                  Thursday, Nov 10 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I considered atheism but there weren't
http://nadav.harel.org.il           |enough holidays.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/10] nEPT: Module option
  2011-11-10 14:21     ` Nadav Har'El
@ 2011-11-10 14:38       ` Avi Kivity
  2011-11-10 15:14         ` Nadav Har'El
  0 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 14:38 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 04:21 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module option":
> > On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> > > Add a module option "nested_ept" determining whether to enable Nested EPT.
> >...
> > > In the future, we can support emulation of EPT for L1 *always*, even when L0
> > > itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
> > > has some theoretical advantages over the baseline "shadow page tables on
> > > shadow page tables" mode typically used when EPT is not available to L0 -
> > > namely that L2's cr3 changes and page faults can be handled in L0 and do not
> > > need to be propagated to L1. However, currently we do not support this mode,
> > > and it is becoming less interesting as newer processors all support EPT.
> > >
> > >
> > 
> > I think we can live without this.
>
> By "this", do you mean without the "nested_ept" option, or without the
> hypothetical "EPT on shadow page tables" feature?

Er, both.  The feature should be controlled on a per-guest basis, not
per host.  And while emulating EPT on shadow is possible, we have enough
complexity already, I think, and non-EPT hosts are getting rarer.

> If the former, then I agree we can "live" without it, but since it was
> trivial to add, I don't see what harm it can do, and its nice that we
> can return with a single L0 option to the old shadow-on-ept paging.
> Is there anything specific you don't like about having this option?

It's just redundant, since we do need a per-guest control.

> About the latter, I agree - as I said, there isn't much point to go and
> write this (quite complicated) 3-level shadowing when all new processors
> have EPT anyway. So I didn't.
>
> > But we do need a way to control what
> > features are exposed to the guest, for compatibility and live migration
> > purposes, as we do with cpuid.  So we need some way for host userspace
> > to write to the vmx read-only feature reporting MSRs.
>
> I think this is a general issue (which we already discussed earlier),
> of nested VMX and not specific to nested EPT. I already put all the
> capabilities which the MSR report in variables initialized in a single
> function, nested_vmx_setup_ctls_msrs(), so once we devise an appropriate
> userspace interface to set these, we can do so easily.

Yes.

> Does nested SVM also have a similar problem, of whether or not it
> advertises new or optional SVM features to L1? If it does have this
> problem, how was it solved there?

svm cpu features are, funnily enough, reported by cpuid, so the existing
KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID2 method works.  We need a similar
KVM_SET_READONLY_MSRS or something.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-10 12:49   ` Avi Kivity
@ 2011-11-10 14:40     ` Nadav Har'El
  2011-11-10 15:19       ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 14:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > +{
> > +	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
>...
> > +	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
>...
> 
> kvm_init_shadow_mmu() will cause ->page_fault to be set to something
> like paging64_page_fault(), which is geared to reading EPT ptes.  How
> does this work?

Hi,

I'm afraid I didn't understand the problem.

Nested EPT's merging of two EPT tables (EPT01 and EPT12) works just like
normal shadow page tables' merging of two CR3s (host cr3 and guest cr3):

When L0 receives a "page fault" from L2 (actually an EPT violation - real
guest #PF don't cause exits), L0 first looks it up in the shadowed table,
which is basically EPT12. If the address is there, L0 handles the fault itself
(updating the shadow EPT table, EPT02 using the normal shadow pte building
code). But if the address wasn't in the shadowed page table (EPT12),
mmu->inject_page_fault() is called, which in our case actually causes L1 to
get an EPT-violation (not #PF - see kvm_propagate_fault()).

Please note that all this logic is shared with the existing nested NPT
code (which itself shared most of the code with the preexisting shadow
page tables code). All this code sharing makes it really difficult to
understand at first glance why the code is really working, but once you
understood why one of these cases works, the others work similarly.
And it does in fact work - in typical cases which I tried, at least.

If you still think I'm missing something, I won't be entirely surprised
( :-) ), so let me know.

Nadav.


-- 
Nadav Har'El                        |                  Thursday, Nov 10 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I put a dollar in one of those change
http://nadav.harel.org.il           |machines. Nothing changed.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/10] nEPT: Module option
  2011-11-10 14:38       ` Avi Kivity
@ 2011-11-10 15:14         ` Nadav Har'El
  2011-11-10 15:21           ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 15:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module option":
> > By "this", do you mean without the "nested_ept" option, or without the
> > hypothetical "EPT on shadow page tables" feature?
> 
> Er, both.  The feature should be controlled on a per-guest basis, not
> per host.
>..
> It's just redundant, since we do need a per-guest control.

I agreed that per-guest control would have been nicer, but since we
don't have an API for specifying that per guest since EPT is not,
unfortunately, a CPUID feature, I thought that at least a host-level
flag would be useful.

Why would it be useful? I agree it isn't the most important option since
sliced bread, but if, for example, one day we discover a bug with nested
EPT, L0 can disable it for all L1 guests and basically force them to use
shadow page tables on EPT.
It was also useful for me to have this option for benchmarking, because
I can force back the old shadow-on-EPT method with just a single option
in L0 (instead of needing to give "ept=0" option in L1s).

If you really don't like the existance of this option, I can easily
remove it of course.


-- 
Nadav Har'El                        |                  Thursday, Nov 10 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Guarantee: this email is 100% free of
http://nadav.harel.org.il           |magnetic monopoles, or your money back!

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-10 14:40     ` Nadav Har'El
@ 2011-11-10 15:19       ` Avi Kivity
  2011-11-10 20:05         ` Nadav Har'El
  0 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 15:19 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 04:40 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > > +{
> > > +	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> >...
> > > +	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
> >...
> > 
> > kvm_init_shadow_mmu() will cause ->page_fault to be set to something
> > like paging64_page_fault(), which is geared to reading EPT ptes.  How
> > does this work?

s/EPT/ia32/

>
> Hi,
>
> I'm afraid I didn't understand the problem.
>
> Nested EPT's merging of two EPT tables (EPT01 and EPT12) works just like
> normal shadow page tables' merging of two CR3s (host cr3 and guest cr3):
>
> When L0 receives a "page fault" from L2 (actually an EPT violation - real
> guest #PF don't cause exits), L0 first looks it up in the shadowed table,
> which is basically EPT12. If the address is there, L0 handles the fault itself
> (updating the shadow EPT table, EPT02 using the normal shadow pte building
> code). But if the address wasn't in the shadowed page table (EPT12),
> mmu->inject_page_fault() is called, which in our case actually causes L1 to
> get an EPT-violation (not #PF - see kvm_propagate_fault()).
>
> Please note that all this logic is shared with the existing nested NPT
> code (which itself shared most of the code with the preexisting shadow
> page tables code). All this code sharing makes it really difficult to
> understand at first glance why the code is really working, but once you
> understood why one of these cases works, the others work similarly.
> And it does in fact work - in typical cases which I tried, at least.
>
> If you still think I'm missing something, I won't be entirely surprised
> ( :-) ), so let me know.

This is all correct, but the code in question parses the EPT12 table
using the ia32 page table format.  They're sufficiently similar so that
it works, but it isn't correct.

Bit 0: EPT readable, ia32 present
Bit 1: Writable; ia32 meaning dependent on cr0.wp
Bit 2: EPT executable, ia32 user (so, this implementation will interpret
a non-executable EPT mapping, if someone could find a use for it, as a
L2 kernel only mapping)
Bits 3-5: EPT memory type, ia32 PWT/PCD (similar but different),
Accessed bit
Bit 6: EPT Ignore PAT, ia32 dirty
Bit 7: EPT ignored, ia32 PAT
Bit 8: EPT ignored, ia32 global
Bit 63: EPT ignored, ia32 NX

walk_addr() will also write to bits 6/7, which the L1 won't expect.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/10] nEPT: Module option
  2011-11-10 15:14         ` Nadav Har'El
@ 2011-11-10 15:21           ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-11-10 15:21 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 05:14 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module option":
> > > By "this", do you mean without the "nested_ept" option, or without the
> > > hypothetical "EPT on shadow page tables" feature?
> > 
> > Er, both.  The feature should be controlled on a per-guest basis, not
> > per host.
> >..
> > It's just redundant, since we do need a per-guest control.
>
> I agreed that per-guest control would have been nicer, but since we
> don't have an API for specifying that per guest since EPT is not,
> unfortunately, a CPUID feature, I thought that at least a host-level
> flag would be useful.
>
> Why would it be useful? I agree it isn't the most important option since
> sliced bread, but if, for example, one day we discover a bug with nested
> EPT, L0 can disable it for all L1 guests and basically force them to use
> shadow page tables on EPT.

Or we just fix the bug.

> It was also useful for me to have this option for benchmarking, because
> I can force back the old shadow-on-EPT method with just a single option
> in L0 (instead of needing to give "ept=0" option in L1s).

When we have the per-guest controls, we can tell userspace to tell the
kernel disable guest EPT.

> If you really don't like the existance of this option, I can easily
> remove it of course.

Yes please.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-10 15:19       ` Avi Kivity
@ 2011-11-10 20:05         ` Nadav Har'El
  2011-11-12 10:39           ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-10 20:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> This is all correct, but the code in question parses the EPT12 table
> using the ia32 page table format.  They're sufficiently similar so that
> it works, but it isn't correct.
> 
> Bit 0: EPT readable, ia32 present
> Bit 1: Writable; ia32 meaning dependent on cr0.wp
> Bit 2: EPT executable, ia32 user (so, this implementation will interpret
> a non-executable EPT mapping, if someone could find a use for it, as a
> L2 kernel only mapping)
>....

This is a very good point.

I was under the mistaken (?) impression that the page-table shadowing
code will just copy these bits as-is from the shadowed table (EPT12) to the
shadow table (EPT02), without caring what they actually mean. I knew we had
a problem when building, not copying, PTEs, and hence the patch to
link_shadow_page).

Also I realized we sometimes need to actually walk the TDP EPT12+cr3 (e.g.,
to see if an EPT violation is L1's fault), but I thought this was just the
normal TDP walk, which already knows how to correctly read the EPT
table.

> walk_addr() will also write to bits 6/7, which the L1 won't expect.

I didn't notice this :(

Back to the drawing board, I guess. I need to figure out exactly what
needs to be fixed, and how to do this with the least obtrusive changes to
the existing use case (normal shadow page tables, and nested EPT).

-- 
Nadav Har'El                        |                  Thursday, Nov 10 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Learn from mistakes of others; you won't
http://nadav.harel.org.il           |live long enough to make them all yourself

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-10 20:05         ` Nadav Har'El
@ 2011-11-12 10:39           ` Avi Kivity
  2011-11-12 21:37             ` Nadav Har'El
  0 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-12 10:39 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/10/2011 10:05 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > This is all correct, but the code in question parses the EPT12 table
> > using the ia32 page table format.  They're sufficiently similar so that
> > it works, but it isn't correct.
> > 
> > Bit 0: EPT readable, ia32 present
> > Bit 1: Writable; ia32 meaning dependent on cr0.wp
> > Bit 2: EPT executable, ia32 user (so, this implementation will interpret
> > a non-executable EPT mapping, if someone could find a use for it, as a
> > L2 kernel only mapping)
> >....
>
> This is a very good point.
>
> I was under the mistaken (?) impression that the page-table shadowing
> code will just copy these bits as-is from the shadowed table (EPT12) to the
> shadow table (EPT02), without caring what they actually mean. 

No, for two reasons.  First, the shadow bits are the result of
multiplexing guest and host permissions, for example either the guest of
host may write-protect a page.  Second, the shadow and guest ptes may be
in different formats (ept vs ia32).

> I knew we had
> a problem when building, not copying, PTEs, and hence the patch to
> link_shadow_page).

In fact that happens to accidentally work, no?  Intermediate ptes are
always present/write/user, which translates to read/write/execute in EPT.

> Also I realized we sometimes need to actually walk the TDP EPT12+cr3 (e.g.,
> to see if an EPT violation is L1's fault), but I thought this was just the
> normal TDP walk, which already knows how to correctly read the EPT
> table.
>
> > walk_addr() will also write to bits 6/7, which the L1 won't expect.
>
> I didn't notice this :(
>
> Back to the drawing board, I guess. I need to figure out exactly what
> needs to be fixed, and how to do this with the least obtrusive changes to
> the existing use case (normal shadow page tables, and nested EPT).

Don't optimize for least changes, optimize for best result afterwards.

We need a third variant of walk_addr_generic that parses EPT format
PTEs.  Whether that's best done by writing paging_ept.h or modifying
paging_tmpl.h, I don't know.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-12 10:39           ` Avi Kivity
@ 2011-11-12 21:37             ` Nadav Har'El
  2011-11-13  9:10               ` Avi Kivity
  2011-11-13 11:30               ` Orit Wasserman
  0 siblings, 2 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-11-12 21:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On Sat, Nov 12, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> host may write-protect a page.  Second, the shadow and guest ptes may be
> in different formats (ept vs ia32).

I'm afraid I've lost you here... The shadow table and the to-be-shadowed 
table are both ia32 (this is the normal shadow table code), or both ept
(the nested tdp code). When are they supposed to be in different
formats (ept vs ia32)?

I'm also puzzled in what situation will the host will write-protect an EPT02
(shadow EPT) page?

> In fact that happens to accidentally work, no?  Intermediate ptes are
> always present/write/user, which translates to read/write/execute in EPT.

It didn't work because it also used to set the "accessed" bit, bit 5,
which on EPT is reserved and caused EPT misconfiguration. So I had to
fix link_shadow_page, or nested EPT would not work at all.

> Don't optimize for least changes, optimize for best result afterwards.

As I'm sure you remember, two years ago, in September 6 2009, you wrote in
your blog about the newly contributed nested VMX patch set, and in
particular its nested EPT (which predated the nested NPT contribution).

Nested EPT was, for some workloads, a huge performance improvement, but
you (if I understand correctly) did not want that code in KVM because
it, basically, optimized for getting the job done, in the most correct
and most efficient manner - but without regard of how cleanly this fit with
other types of shadowing (normal shadow page tables, and nested NPT),
or how much of the code was being duplicated or circumvented.

So this time around, I couldn't really "not optimize for least changes".
This time, the nested EPT had to fit (like a square peg in a round hole
;-)), into the preexisting MMU and NPT shadowing. I couldn't really just write
the most correct and most efficient code (which Orit Wasserman already
did, two years earlier). This time I needed to figure out the least obtrusive
way of changing the existing code. The hardest thing about doing this
was trying to understand all the complexities and subtleties of the existing
MMU code in KVM, which already does 101 different cases in one
overloaded piece of code, which is not commented or documented.
And of course, add to that all the complexities (some might even say "cruft")
which the underlying x86 architecture itself has acrued over the years.
So it's not surprising I've missed some of the important subtleties which
didn't have any effect in the typical case I've tried. Like I said, in my
tests nested EPT *did* work. And even getting to that point was hard enough :-)

> We need a third variant of walk_addr_generic that parses EPT format
> PTEs.  Whether that's best done by writing paging_ept.h or modifying
> paging_tmpl.h, I don't know.

Thanks. I'll think about everything you've said in this thread (I'm still
not convinced I understood all your points, so just understanding them
will be the first step). I'll see what I can do to improve the patch.

But I have to be honest - I'm not sure how quickly I can finish this.
I really appreciate all your comments about nested VMX in the last two
years - most of them have been spot-on, 100% correct, and really helpful
for making me understand things which I had previously misunderstood.
However, since you are (of course) extremely familiar with every nook and
cranny of KVM, what normally happens is that every comment which took you
5 minutes to figure out, takes me 5 days to fully understand, and to actually
write, debug and test the fixed code. Every review that takes you two days
to go through (and is very much appreciated!) takes me several months to fix
each and every thing you asked for.

Don't get me wrong, I *am* planning to continue working (part-time) on nested
VMX, and nested EPT in particular. But if you want it to pick up the pace,
I could use some help with actual coding from people who have much more
intimate knowledge of the non-nested-VMX parts of KVM than I have.

In the meantime, if anybody wants to experiment with a much faster
Nested VMX than we had before, you can try my current patch. It may not
be perfect, but in many ways it is better than the old shadow-on-ept code.
And in simple (64 bit, 4k page) kvm-over-kvm configurations like I tried, it
works well.

Nadav.

-- 
Nadav Har'El                        |                  Saturday, Nov 12 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |What's tiny, yellow and very dangerous? A
http://nadav.harel.org.il           |canary with the super-user password.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX
  2011-11-10 12:26 ` [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Avi Kivity
@ 2011-11-13  8:52   ` Nadav Har'El
  2011-11-13  9:21     ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-13  8:52 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

Hi,

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX":
> This patchset is missing a fairly hairy patch that makes reading L2
> virtual addresses work.

This was supposed to be part of the nested TDP code that is already in
the code. To read an L2 virtual address, the code is supposed, if I
understand correctly, to walk the "walk" mmu (EPT01 and guest_cr3)
and then use the EPT table - just like the normal EPT case which uses
the EPT table and the guest_cr3.

I even believed that this inner "walk mmu" will work fine without any
rewrite needed for ia32/ept differences, because it works (or so I believed)
just like normal EPT, with the first table being an EPT table, and the second
table being a normal page table.

I also believed that the fault injection part was also correct: I
thought that the code already knows when to handle the fault in L2 (when
the address is missing in cr3), in L1 (when the translation is missing
in EPT12) or else, in L0.

So what is the "hairy" missing part?

> The standard example is L1 passing a bit of
> hardware (emulated in L0) to a L2; when L2 accesses it, the instruction
> will fault and need to be handled in L0, transparently to L1.  The
> emulation can cause a fault to be injected to L2, or and EPT violation
> or misconfiguration injected to L1.

I don't understand the example. You are refering to nested device
assignment from L1 to L2 (so L1 stops caring about the device)? Since we
don't emulate an IOMMU for L1, how can that be done?

Thanks,
Nadav.

-- 
Nadav Har'El                        |                    Sunday, Nov 13 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |An error? Impossible! My modem is error
http://nadav.harel.org.il           |correcting.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-12 21:37             ` Nadav Har'El
@ 2011-11-13  9:10               ` Avi Kivity
  2011-11-13 11:30               ` Orit Wasserman
  1 sibling, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-11-13  9:10 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/12/2011 11:37 PM, Nadav Har'El wrote:
> On Sat, Nov 12, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > host may write-protect a page.  Second, the shadow and guest ptes may be
> > in different formats (ept vs ia32).
>
> I'm afraid I've lost you here... The shadow table and the to-be-shadowed 
> table are both ia32 (this is the normal shadow table code), or both ept
> (the nested tdp code). When are they supposed to be in different
> formats (ept vs ia32)?

Er, the ia32/ept combo only happens when the host ignores the ia32 ptes
(non-nested on ept), so it's not very interesting.

> I'm also puzzled in what situation will the host will write-protect an EPT02
> (shadow EPT) page?

We only protect guest pages, but the permissions for them are in the
shadow pages.

> > In fact that happens to accidentally work, no?  Intermediate ptes are
> > always present/write/user, which translates to read/write/execute in EPT.
>
> It didn't work because it also used to set the "accessed" bit, bit 5,
> which on EPT is reserved and caused EPT misconfiguration. So I had to
> fix link_shadow_page, or nested EPT would not work at all.

Look at how __direct_map() does it.

> > Don't optimize for least changes, optimize for best result afterwards.
>
> As I'm sure you remember, two years ago, in September 6 2009, you wrote in
> your blog about the newly contributed nested VMX patch set, and in
> particular its nested EPT (which predated the nested NPT contribution).
>
> Nested EPT was, for some workloads, a huge performance improvement, but
> you (if I understand correctly) did not want that code in KVM because
> it, basically, optimized for getting the job done, in the most correct
> and most efficient manner - but without regard of how cleanly this fit with
> other types of shadowing (normal shadow page tables, and nested NPT),
> or how much of the code was being duplicated or circumvented.

I mean "best result" in terms of maintainability - how the code will
look, not performance results.  Don't optimize the size of the patch,
optimize the size of the patched code (and don't take me literally -
small code size doesn't correlate with maintainable code).

> So this time around, I couldn't really "not optimize for least changes".
> This time, the nested EPT had to fit (like a square peg in a round hole
> ;-)), into the preexisting MMU and NPT shadowing. I couldn't really just write
> the most correct and most efficient code (which Orit Wasserman already
> did, two years earlier). 

What is not correct (apart from what we identified) or not efficient in
the current code?

> This time I needed to figure out the least obtrusive
> way of changing the existing code. The hardest thing about doing this
> was trying to understand all the complexities and subtleties of the existing
> MMU code in KVM, which already does 101 different cases in one
> overloaded piece of code, which is not commented or documented.
> And of course, add to that all the complexities (some might even say "cruft")
> which the underlying x86 architecture itself has acrued over the years.
> So it's not surprising I've missed some of the important subtleties which
> didn't have any effect in the typical case I've tried. Like I said, in my
> tests nested EPT *did* work. And even getting to that point was hard enough :-)
>
> > We need a third variant of walk_addr_generic that parses EPT format
> > PTEs.  Whether that's best done by writing paging_ept.h or modifying
> > paging_tmpl.h, I don't know.
>
> Thanks. I'll think about everything you've said in this thread (I'm still
> not convinced I understood all your points, so just understanding them
> will be the first step). I'll see what I can do to improve the patch.
>
> But I have to be honest - I'm not sure how quickly I can finish this.
> I really appreciate all your comments about nested VMX in the last two
> years - most of them have been spot-on, 100% correct, and really helpful
> for making me understand things which I had previously misunderstood.
> However, since you are (of course) extremely familiar with every nook and
> cranny of KVM, what normally happens is that every comment which took you
> 5 minutes to figure out, takes me 5 days to fully understand, and to actually
> write, debug and test the fixed code. Every review that takes you two days
> to go through (and is very much appreciated!) takes me several months to fix
> each and every thing you asked for.

Feel free to ask around, on the mailing list and on IRC, post questions
or pseudo code for review.  Some problems can be caught early.

> Don't get me wrong, I *am* planning to continue working (part-time) on nested
> VMX, and nested EPT in particular. But if you want it to pick up the pace,
> I could use some help with actual coding from people who have much more
> intimate knowledge of the non-nested-VMX parts of KVM than I have.

I do plan to write some code, but it will actually make your job
somewhat harder - I'd like to write a test framework for nvmx, which
will test all sorts of odd combinations.

> In the meantime, if anybody wants to experiment with a much faster
> Nested VMX than we had before, you can try my current patch. It may not
> be perfect, but in many ways it is better than the old shadow-on-ept code.
> And in simple (64 bit, 4k page) kvm-over-kvm configurations like I tried, it
> works well.

Easiest if you post a git URL.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX
  2011-11-13  8:52   ` Nadav Har'El
@ 2011-11-13  9:21     ` Avi Kivity
  2011-12-12 11:37       ` Nadav Har'El
  0 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-13  9:21 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 11/13/2011 10:52 AM, Nadav Har'El wrote:
> Hi,
>
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX":
> > This patchset is missing a fairly hairy patch that makes reading L2
> > virtual addresses work.
>
> This was supposed to be part of the nested TDP code that is already in
> the code. To read an L2 virtual address, the code is supposed, if I
> understand correctly, to walk the "walk" mmu (EPT01 and guest_cr3)
> and then use the EPT table - just like the normal EPT case which uses
> the EPT table and the guest_cr3.
>
> I even believed that this inner "walk mmu" will work fine without any
> rewrite needed for ia32/ept differences, because it works (or so I believed)
> just like normal EPT, with the first table being an EPT table, and the second
> table being a normal page table.

The code that walks the guest page table (walk_addr_generic) is not able
to parse EPT PTEs these days.

> I also believed that the fault injection part was also correct: I
> thought that the code already knows when to handle the fault in L2 (when
> the address is missing in cr3), in L1 (when the translation is missing
> in EPT12) or else, in L0.

It does, but it needs to propagate the fault code correctly.  The exit
reason (ept violation vs ept misconfiguration) is meaningless, since we
don't encode anything about it from ept12 into ept02.  In particular an
ept violation could lead to

- no fault, ept02 updated, instruction retried
- no fault, instruction emulated
- L2 fault
- ept violation, need to compute ept12 permissions for exit qualification
- ept misconfiguration

(the second and third cases occur when it is impossible to create an
ept02 mapping - when L0 emulates a gpa that L1 assigns to L2 via ept12).

> So what is the "hairy" missing part?

The EPT parser, and the code for figuring out the type of L1 fault.

>
> > The standard example is L1 passing a bit of
> > hardware (emulated in L0) to a L2; when L2 accesses it, the instruction
> > will fault and need to be handled in L0, transparently to L1.  The
> > emulation can cause a fault to be injected to L2, or and EPT violation
> > or misconfiguration injected to L1.
>
> I don't understand the example. You are refering to nested device
> assignment from L1 to L2 (so L1 stops caring about the device)? Since we
> don't emulate an IOMMU for L1, how can that be done?

You can have device assignment without an IOMMU.  Say, L1 assigns an
HPET block to L2.

A simple test case is L1 assigning a gpa to an L2 that doesn't exist in L0.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-12 21:37             ` Nadav Har'El
  2011-11-13  9:10               ` Avi Kivity
@ 2011-11-13 11:30               ` Orit Wasserman
  2011-11-13 14:32                 ` Avi Kivity
                                   ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Orit Wasserman @ 2011-11-13 11:30 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, Roedel, Joerg, abelg

[-- Attachment #1: Type: text/plain, Size: 4692 bytes --]

On 11/12/2011 11:37 PM, Nadav Har'El wrote:
> On Sat, Nov 12, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
>> host may write-protect a page.  Second, the shadow and guest ptes may be
>> in different formats (ept vs ia32).
> 
> I'm afraid I've lost you here... The shadow table and the to-be-shadowed 
> table are both ia32 (this is the normal shadow table code), or both ept
> (the nested tdp code). When are they supposed to be in different
> formats (ept vs ia32)?
> 
> I'm also puzzled in what situation will the host will write-protect an EPT02
> (shadow EPT) page?
> 
>> In fact that happens to accidentally work, no?  Intermediate ptes are
>> always present/write/user, which translates to read/write/execute in EPT.
> 
> It didn't work because it also used to set the "accessed" bit, bit 5,
> which on EPT is reserved and caused EPT misconfiguration. So I had to
> fix link_shadow_page, or nested EPT would not work at all.
> 
>> Don't optimize for least changes, optimize for best result afterwards.
> 
> As I'm sure you remember, two years ago, in September 6 2009, you wrote in
> your blog about the newly contributed nested VMX patch set, and in
> particular its nested EPT (which predated the nested NPT contribution).
> 
> Nested EPT was, for some workloads, a huge performance improvement, but
> you (if I understand correctly) did not want that code in KVM because
> it, basically, optimized for getting the job done, in the most correct
> and most efficient manner - but without regard of how cleanly this fit with
> other types of shadowing (normal shadow page tables, and nested NPT),
> or how much of the code was being duplicated or circumvented.
> 
> So this time around, I couldn't really "not optimize for least changes".
> This time, the nested EPT had to fit (like a square peg in a round hole
> ;-)), into the preexisting MMU and NPT shadowing. I couldn't really just write
> the most correct and most efficient code (which Orit Wasserman already
> did, two years earlier). This time I needed to figure out the least obtrusive
> way of changing the existing code. The hardest thing about doing this
> was trying to understand all the complexities and subtleties of the existing
> MMU code in KVM, which already does 101 different cases in one
> overloaded piece of code, which is not commented or documented.
> And of course, add to that all the complexities (some might even say "cruft")
> which the underlying x86 architecture itself has acrued over the years.
> So it's not surprising I've missed some of the important subtleties which
> didn't have any effect in the typical case I've tried. Like I said, in my
> tests nested EPT *did* work. And even getting to that point was hard enough :-)
> 
>> We need a third variant of walk_addr_generic that parses EPT format
>> PTEs.  Whether that's best done by writing paging_ept.h or modifying
>> paging_tmpl.h, I don't know.
> 
> Thanks. I'll think about everything you've said in this thread (I'm still
> not convinced I understood all your points, so just understanding them
> will be the first step). I'll see what I can do to improve the patch.
> 
> But I have to be honest - I'm not sure how quickly I can finish this.
> I really appreciate all your comments about nested VMX in the last two
> years - most of them have been spot-on, 100% correct, and really helpful
> for making me understand things which I had previously misunderstood.
> However, since you are (of course) extremely familiar with every nook and
> cranny of KVM, what normally happens is that every comment which took you
> 5 minutes to figure out, takes me 5 days to fully understand, and to actually
> write, debug and test the fixed code. Every review that takes you two days
> to go through (and is very much appreciated!) takes me several months to fix
> each and every thing you asked for.
> 
> Don't get me wrong, I *am* planning to continue working (part-time) on nested
> VMX, and nested EPT in particular. But if you want it to pick up the pace,
> I could use some help with actual coding from people who have much more
> intimate knowledge of the non-nested-VMX parts of KVM than I have.
> 
> In the meantime, if anybody wants to experiment with a much faster
> Nested VMX than we had before, you can try my current patch. It may not
> be perfect, but in many ways it is better than the old shadow-on-ept code.
> And in simple (64 bit, 4k page) kvm-over-kvm configurations like I tried, it
> works well.
> 
> Nadav.
> 

Maybe this patch can help, this is roughly what Avi wants (I hope) done very quickly.
I'm sorry I don't have setup to run nested VMX at the moment so i can't test it.

Orit

[-- Attachment #2: 0001-Add-EPT_walk_addr_genric.patch --]
[-- Type: text/plain, Size: 5158 bytes --]

>From 032ba0221ea690f61ecc4f6d946090d18d6f4dfb Mon Sep 17 00:00:00 2001
From: Orit Wasserman <owasserm@redhard.com>
Date: Sun, 13 Nov 2011 12:57:17 +0200
Subject: [PATCH] Add EPT_walk_addr_genric

---
 arch/x86/kvm/mmu.c         |   13 +++++++++++++
 arch/x86/kvm/mmu.h         |    5 +++++
 arch/x86/kvm/paging_tmpl.h |   44 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9335e1b..bbe212f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3180,6 +3180,10 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
 #include "paging_tmpl.h"
 #undef PTTYPE
 
+#define PTTYPE EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 32
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -3381,6 +3385,15 @@ static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 	return r;
 }
 
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
+
+	return r;
+}
+
 static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e374db9..545e2ff 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -48,6 +48,11 @@
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
+#define EPT_WRITABLE_MASK 2
+#define EPT_EXEC_MASK 4
+
+#define EPT 18
+
 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 507e2b8..70d4cfd 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -39,6 +39,21 @@
 	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 2
 	#endif
+#elif PTTYPE == EPT
+	#define pt_element_t u64
+	#define FNAME(name) EPT_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#ifdef CONFIG_X86_64
+	#define PT_MAX_FULL_LEVELS 4
+	#define CMPXCHG cmpxchg
+	#else
+	#define CMPXCHG cmpxchg64
+	#define PT_MAX_FULL_LEVELS 2
+	#endif
 #elif PTTYPE == 32
 	#define pt_element_t u32
 	#define guest_walker guest_walker32
@@ -106,14 +121,19 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
 {
 	unsigned access;
 
+#if PTTYPE == EPT       
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+#else
+	access = (gpte & EPT_WRITABLE_MASK) | EPT_EXEC_MASK;
 	if (last && !is_dirty_gpte(gpte))
 		access &= ~ACC_WRITE_MASK;
+#endif
 
 #if PTTYPE == 64
 	if (vcpu->arch.mmu.nx)
 		access &= ~(gpte >> PT64_NX_SHIFT);
-#endif
+#endif 
+
 	return access;
 }
 
@@ -149,9 +169,15 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	gpa_t pte_gpa;
 	bool eperm;
 	int offset;
+#if PTTYPE == EPT
+	const int write_fault = access & EPT_WRITABLE_MASK;
+	const int user_fault  = 0;
+	const int fetch_fault = 0;
+#else
 	const int write_fault = access & PFERR_WRITE_MASK;
 	const int user_fault  = access & PFERR_USER_MASK;
 	const int fetch_fault = access & PFERR_FETCH_MASK;
+#endif
 	u16 errcode = 0;
 
 	trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
@@ -174,6 +200,9 @@ retry_walk:
 	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
 
 	pt_access = ACC_ALL;
+#if PTTYPE == EPT 
+	pt_access =  PT_PRESENT_MASK | EPT_WRITABLE_MASK | EPT_EXEC_MASK;
+#endif
 
 	for (;;) {
 		gfn_t real_gfn;
@@ -186,9 +215,14 @@ retry_walk:
 		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
 		walker->table_gfn[walker->level - 1] = table_gfn;
 		walker->pte_gpa[walker->level - 1] = pte_gpa;
-
+#if PTTYPE == EPT 
+		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
+					      EPT_WRITABLE_MASK);
+#else
 		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
 					      PFERR_USER_MASK|PFERR_WRITE_MASK);
+#endif
+
 		if (unlikely(real_gfn == UNMAPPED_GVA))
 			goto error;
 		real_gfn = gpa_to_gfn(real_gfn);
@@ -221,6 +255,7 @@ retry_walk:
 			eperm = true;
 #endif
 
+#if PTTYPE != EPT
 		if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
 			int ret;
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index,
@@ -235,7 +270,7 @@ retry_walk:
 			mark_page_dirty(vcpu->kvm, table_gfn);
 			pte |= PT_ACCESSED_MASK;
 		}
-
+#endif
 		walker->ptes[walker->level - 1] = pte;
 
 		if (FNAME(is_last_gpte)(walker, vcpu, mmu, pte)) {
@@ -244,12 +279,13 @@ retry_walk:
 			gfn_t gfn;
 			u32 ac;
 
+#if PTTYPE != EPT
 			/* check if the kernel is fetching from user page */
 			if (unlikely(pte_access & PT_USER_MASK) &&
 			    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
 				if (fetch_fault && !user_fault)
 					eperm = true;
-
+#endif
 			gfn = gpte_to_gfn_lvl(pte, lvl);
 			gfn += (addr & PT_LVL_OFFSET_MASK(lvl)) >> PAGE_SHIFT;
 
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-13 11:30               ` Orit Wasserman
@ 2011-11-13 14:32                 ` Avi Kivity
  2011-11-13 18:26                   ` Orit Wasserman
  2011-12-06 12:40                   ` Nadav Har'El
  2011-11-23 15:06                 ` Nadav Har'El
  2011-12-07  9:06                 ` Nadav Har'El
  2 siblings, 2 replies; 49+ messages in thread
From: Avi Kivity @ 2011-11-13 14:32 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: Nadav Har'El, kvm, Roedel, Joerg, abelg

On 11/13/2011 01:30 PM, Orit Wasserman wrote:
> Maybe this patch can help, this is roughly what Avi wants (I hope) done very quickly.
> I'm sorry I don't have setup to run nested VMX at the moment so i can't test it.
>
> Orit
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9335e1b..bbe212f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3180,6 +3180,10 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
>  #include "paging_tmpl.h"
>  #undef PTTYPE
>  
> +#define PTTYPE EPT
> +#include "paging_tmpl.h"
> +#undef PTTYPE
> +

Yes, that's the key.


>  int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
>  int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 507e2b8..70d4cfd 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -39,6 +39,21 @@
>  	#define CMPXCHG cmpxchg64
>  	#define PT_MAX_FULL_LEVELS 2
>  	#endif
> +#elif PTTYPE == EPT
> +	#define pt_element_t u64
> +	#define FNAME(name) EPT_##name
> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> +	#ifdef CONFIG_X86_64
> +	#define PT_MAX_FULL_LEVELS 4
> +	#define CMPXCHG cmpxchg
> +	#else
> +	#define CMPXCHG cmpxchg64
> +	#define PT_MAX_FULL_LEVELS 2
> +	#endif

The various masks should be defined here, to avoid lots of #ifdefs later.

>  #elif PTTYPE == 32
>  	#define pt_element_t u32
>  	#define guest_walker guest_walker32
> @@ -106,14 +121,19 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
>  {
>  	unsigned access;
>  
> +#if PTTYPE == EPT       
>  	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> +#else
> +	access = (gpte & EPT_WRITABLE_MASK) | EPT_EXEC_MASK;
>  	if (last && !is_dirty_gpte(gpte))
>  		access &= ~ACC_WRITE_MASK;
> +#endif

Like here, you could make is_dirty_gpte() local to paging_tmpl()
returning true for EPT and the dirty bit otherwise.


>  
>  #if PTTYPE == 64
>  	if (vcpu->arch.mmu.nx)
>  		access &= ~(gpte >> PT64_NX_SHIFT);

The ept X bit is lost.

Could do something like

   access &= (gpte >> PT_X_NX_SHIFT) ^ PT_X_NX_SENSE;


> +#if PTTYPE == EPT
> +	const int write_fault = access & EPT_WRITABLE_MASK;
> +	const int user_fault  = 0;
> +	const int fetch_fault = 0;
> +#else

EPT has fetch permissions (but not user permissions); anyway
translate_nested_gpa() already does this.

>  	const int write_fault = access & PFERR_WRITE_MASK;
>  	const int user_fault  = access & PFERR_USER_MASK;
>  	const int fetch_fault = access & PFERR_FETCH_MASK;
> +#endif
>  	u16 errcode = 0;
>  
>  	trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
> @@ -174,6 +200,9 @@ retry_walk:
>  	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
>  
>  	pt_access = ACC_ALL;
> +#if PTTYPE == EPT 
> +	pt_access =  PT_PRESENT_MASK | EPT_WRITABLE_MASK | EPT_EXEC_MASK;
> +#endif

pt_access is not in EPT or ia32 format - it's our own format (xwu).  So
this doesn't need changing.  Updating gpte_access() is sufficient.

>  
>  	for (;;) {
>  		gfn_t real_gfn;
> @@ -186,9 +215,14 @@ retry_walk:
>  		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
>  		walker->table_gfn[walker->level - 1] = table_gfn;
>  		walker->pte_gpa[walker->level - 1] = pte_gpa;
> -
> +#if PTTYPE == EPT 
> +		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> +					      EPT_WRITABLE_MASK);
> +#else
>  		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
>  					      PFERR_USER_MASK|PFERR_WRITE_MASK);
> +#endif
> +

Unneeded, I think.

>  		if (unlikely(real_gfn == UNMAPPED_GVA))
>  			goto error;
>  		real_gfn = gpa_to_gfn(real_gfn);
> @@ -221,6 +255,7 @@ retry_walk:
>  			eperm = true;
>  #endif
>  
> +#if PTTYPE != EPT
>  		if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
>  			int ret;
>  			trace_kvm_mmu_set_accessed_bit(table_gfn, index,
> @@ -235,7 +270,7 @@ retry_walk:
>  			mark_page_dirty(vcpu->kvm, table_gfn);
>  			pte |= PT_ACCESSED_MASK;
>  		}
> -
> +#endif

If PT_ACCESSED_MASK is 0 for EPT, this goes away without #ifdef.

> +#if PTTYPE != EPT
>  			/* check if the kernel is fetching from user page */
>  			if (unlikely(pte_access & PT_USER_MASK) &&
>  			    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
>  				if (fetch_fault && !user_fault)
>  					eperm = true;
> -
> +#endif

Same here.



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-13 14:32                 ` Avi Kivity
@ 2011-11-13 18:26                   ` Orit Wasserman
  2011-11-14  8:25                     ` Avi Kivity
  2011-12-06 12:40                   ` Nadav Har'El
  1 sibling, 1 reply; 49+ messages in thread
From: Orit Wasserman @ 2011-11-13 18:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm, Roedel, Joerg, abelg

On 11/13/2011 04:32 PM, Avi Kivity wrote:
> On 11/13/2011 01:30 PM, Orit Wasserman wrote:
>> Maybe this patch can help, this is roughly what Avi wants (I hope) done very quickly.
>> I'm sorry I don't have setup to run nested VMX at the moment so i can't test it.
>>
>> Orit
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 9335e1b..bbe212f 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -3180,6 +3180,10 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
>>  #include "paging_tmpl.h"
>>  #undef PTTYPE
>>  
>> +#define PTTYPE EPT
>> +#include "paging_tmpl.h"
>> +#undef PTTYPE
>> +
> 
> Yes, that's the key.
> 
> 
>>  int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
>>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
>>  int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 507e2b8..70d4cfd 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -39,6 +39,21 @@
>>  	#define CMPXCHG cmpxchg64
>>  	#define PT_MAX_FULL_LEVELS 2
>>  	#endif
>> +#elif PTTYPE == EPT
>> +	#define pt_element_t u64
>> +	#define FNAME(name) EPT_##name
>> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
>> +	#ifdef CONFIG_X86_64
>> +	#define PT_MAX_FULL_LEVELS 4
>> +	#define CMPXCHG cmpxchg
>> +	#else
>> +	#define CMPXCHG cmpxchg64
>> +	#define PT_MAX_FULL_LEVELS 2
>> +	#endif
> 
> The various masks should be defined here, to avoid lots of #ifdefs later.
> 

That what I did first but than I was afraid that the MASK will be changed for mmu.c too.
so I decided on ifdefs.
The more I think about it I think we need rapper function for mask checking (at least for this file).
What do you think ?


>>  #elif PTTYPE == 32
>>  	#define pt_element_t u32
>>  	#define guest_walker guest_walker32
>> @@ -106,14 +121,19 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
>>  {
>>  	unsigned access;
>>  
>> +#if PTTYPE == EPT       
>>  	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
>> +#else
>> +	access = (gpte & EPT_WRITABLE_MASK) | EPT_EXEC_MASK;
>>  	if (last && !is_dirty_gpte(gpte))
>>  		access &= ~ACC_WRITE_MASK;
>> +#endif
> 
> Like here, you could make is_dirty_gpte() local to paging_tmpl()
> returning true for EPT and the dirty bit otherwise.
> 
> 
>>  
>>  #if PTTYPE == 64
>>  	if (vcpu->arch.mmu.nx)
>>  		access &= ~(gpte >> PT64_NX_SHIFT);
> 
> The ept X bit is lost.
> 
> Could do something like
> 
>    access &= (gpte >> PT_X_NX_SHIFT) ^ PT_X_NX_SENSE;
> 
> 
>> +#if PTTYPE == EPT
>> +	const int write_fault = access & EPT_WRITABLE_MASK;
>> +	const int user_fault  = 0;
>> +	const int fetch_fault = 0;
>> +#else
> 
> EPT has fetch permissions (but not user permissions); anyway
> translate_nested_gpa() already does this.
> 
>>  	const int write_fault = access & PFERR_WRITE_MASK;
>>  	const int user_fault  = access & PFERR_USER_MASK;
>>  	const int fetch_fault = access & PFERR_FETCH_MASK;
>> +#endif
>>  	u16 errcode = 0;
>>  
>>  	trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
>> @@ -174,6 +200,9 @@ retry_walk:
>>  	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
>>  
>>  	pt_access = ACC_ALL;
>> +#if PTTYPE == EPT 
>> +	pt_access =  PT_PRESENT_MASK | EPT_WRITABLE_MASK | EPT_EXEC_MASK;
>> +#endif
> 
> pt_access is not in EPT or ia32 format - it's our own format (xwu).  So
> this doesn't need changing.  Updating gpte_access() is sufficient.
> 
>>  
>>  	for (;;) {
>>  		gfn_t real_gfn;
>> @@ -186,9 +215,14 @@ retry_walk:
>>  		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
>>  		walker->table_gfn[walker->level - 1] = table_gfn;
>>  		walker->pte_gpa[walker->level - 1] = pte_gpa;
>> -
>> +#if PTTYPE == EPT 
>> +		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
>> +					      EPT_WRITABLE_MASK);
>> +#else
>>  		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
>>  					      PFERR_USER_MASK|PFERR_WRITE_MASK);
>> +#endif
>> +
> 
> Unneeded, I think.

Is it because translate_nested_gpa always set USER_MASK ? 

> 
>>  		if (unlikely(real_gfn == UNMAPPED_GVA))
>>  			goto error;
>>  		real_gfn = gpa_to_gfn(real_gfn);
>> @@ -221,6 +255,7 @@ retry_walk:
>>  			eperm = true;
>>  #endif
>>  
>> +#if PTTYPE != EPT
>>  		if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
>>  			int ret;
>>  			trace_kvm_mmu_set_accessed_bit(table_gfn, index,
>> @@ -235,7 +270,7 @@ retry_walk:
>>  			mark_page_dirty(vcpu->kvm, table_gfn);
>>  			pte |= PT_ACCESSED_MASK;
>>  		}
>> -
>> +#endif
> 
> If PT_ACCESSED_MASK is 0 for EPT, this goes away without #ifdef.
> 

true 

>> +#if PTTYPE != EPT
>>  			/* check if the kernel is fetching from user page */
>>  			if (unlikely(pte_access & PT_USER_MASK) &&
>>  			    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
>>  				if (fetch_fault && !user_fault)
>>  					eperm = true;
>> -
>> +#endif
> 
> Same here.
> 
> 
> 

Orit


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-13 18:26                   ` Orit Wasserman
@ 2011-11-14  8:25                     ` Avi Kivity
  2011-12-08 15:21                       ` Nadav Har'El
  0 siblings, 1 reply; 49+ messages in thread
From: Avi Kivity @ 2011-11-14  8:25 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: Nadav Har'El, kvm, Roedel, Joerg, abelg

On 11/13/2011 08:26 PM, Orit Wasserman wrote:
> > 
> >>  int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
> >>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
> >>  int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
> >> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >> index 507e2b8..70d4cfd 100644
> >> --- a/arch/x86/kvm/paging_tmpl.h
> >> +++ b/arch/x86/kvm/paging_tmpl.h
> >> @@ -39,6 +39,21 @@
> >>  	#define CMPXCHG cmpxchg64
> >>  	#define PT_MAX_FULL_LEVELS 2
> >>  	#endif
> >> +#elif PTTYPE == EPT
> >> +	#define pt_element_t u64
> >> +	#define FNAME(name) EPT_##name
> >> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> >> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> >> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> >> +	#ifdef CONFIG_X86_64
> >> +	#define PT_MAX_FULL_LEVELS 4
> >> +	#define CMPXCHG cmpxchg
> >> +	#else
> >> +	#define CMPXCHG cmpxchg64
> >> +	#define PT_MAX_FULL_LEVELS 2
> >> +	#endif
> > 
> > The various masks should be defined here, to avoid lots of #ifdefs later.
> > 
>
> That what I did first but than I was afraid that the MASK will be changed for mmu.c too.
> so I decided on ifdefs.
> The more I think about it I think we need rapper function for mask checking (at least for this file).
> What do you think ?

Either should work, as long as the main logic is clean.

> >>  	for (;;) {
> >>  		gfn_t real_gfn;
> >> @@ -186,9 +215,14 @@ retry_walk:
> >>  		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
> >>  		walker->table_gfn[walker->level - 1] = table_gfn;
> >>  		walker->pte_gpa[walker->level - 1] = pte_gpa;
> >> -
> >> +#if PTTYPE == EPT 
> >> +		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> >> +					      EPT_WRITABLE_MASK);
> >> +#else
> >>  		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> >>  					      PFERR_USER_MASK|PFERR_WRITE_MASK);
> >> +#endif
> >> +
> > 
> > Unneeded, I think.
>
> Is it because translate_nested_gpa always set USER_MASK ? 

Yes... maybe that function needs to do something like

   access |= mmu->default_access;



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-13 11:30               ` Orit Wasserman
  2011-11-13 14:32                 ` Avi Kivity
@ 2011-11-23 15:06                 ` Nadav Har'El
  2011-11-23 15:44                   ` Nadav Har'El
  2011-12-07  9:06                 ` Nadav Har'El
  2 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-23 15:06 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: Avi Kivity, kvm, Roedel, Joerg, abelg

On Sun, Nov 13, 2011, Orit Wasserman wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> Maybe this patch can help, this is roughly what Avi wants (I hope) done very quickly.
> I'm sorry I don't have setup to run nested VMX at the moment so i can't test it.

Hi Orit, thanks for the code - I'm now working on incorporating
something based on this into my patch. However, I do have a question:

> +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> +{
> +	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> +
> +	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
> +
> +	return r;
> +}

I didn't see you actually call this function anywhere - how is it
supposed to work?

The way I understand the current code, kvm_mmu_reset_context() calls
init_kvm_mmu() which (in our case) calls init_kvm_nested_mmu().
I think the above gva_to_gpa setting should be there - right?
It seems we need a fifth case in that function.
But at that point in mmu.c, how will I be able to check if this is the
nested EPT case? Do you have any suggestion?

Thanks,
Nadav.

-- 
Nadav Har'El                        |                 Wednesday, Nov 23 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |This message contains 100% recycled
http://nadav.harel.org.il           |characters.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-23 15:06                 ` Nadav Har'El
@ 2011-11-23 15:44                   ` Nadav Har'El
  2011-11-24 13:36                     ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-11-23 15:44 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: Avi Kivity, kvm, Roedel, Joerg, abelg

On Wed, Nov 23, 2011, Nadav Har'El wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > +{
> > +	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> > +
> > +	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
> > +
> > +	return r;
> > +}
>..
> I didn't see you actually call this function anywhere - how is it
> supposed to work?
>..
> It seems we need a fifth case in that function.
>..

On second thought, why is this modifying nested_mmu.gva_to_gpa, and not
mmu.gva_to_gpa? Isn't the nested_mmu the L2 CR3, which is *not* in EPT
format, and what we really want to change is the outer mmu, which is
EPT12 and is indeed in EPT format?
Or am I missing something?

Thanks,
Nadav.

-- 
Nadav Har'El                        |                 Wednesday, Nov 23 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |My password is my dog's name. His name is
http://nadav.harel.org.il           |a#j!4@h, but I change it every month.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-23 15:44                   ` Nadav Har'El
@ 2011-11-24 13:36                     ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-11-24 13:36 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Orit Wasserman, kvm, Roedel, Joerg, abelg

On 11/23/2011 05:44 PM, Nadav Har'El wrote:
> On Wed, Nov 23, 2011, Nadav Har'El wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > > +{
> > > +	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> > > +
> > > +	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
> > > +
> > > +	return r;
> > > +}
> >..
> > I didn't see you actually call this function anywhere - how is it
> > supposed to work?
> >..
> > It seems we need a fifth case in that function.
> >..
>
> On second thought, why is this modifying nested_mmu.gva_to_gpa, and not
> mmu.gva_to_gpa? Isn't the nested_mmu the L2 CR3, which is *not* in EPT
> format, and what we really want to change is the outer mmu, which is
> EPT12 and is indeed in EPT format?
> Or am I missing something?

I think you're right.  The key is to look at what ->walk_mmu points at.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-13 14:32                 ` Avi Kivity
  2011-11-13 18:26                   ` Orit Wasserman
@ 2011-12-06 12:40                   ` Nadav Har'El
  2011-12-06 13:07                     ` Avi Kivity
  1 sibling, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-12-06 12:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Orit Wasserman, kvm, Roedel, Joerg, abelg

On Sun, Nov 13, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> On 11/13/2011 01:30 PM, Orit Wasserman wrote:
> > Maybe this patch can help, this is roughly what Avi wants (I hope) done very quickly.
> > I'm sorry I don't have setup to run nested VMX at the moment so i can't test it.
>...
> > +#define PTTYPE EPT
> > +#include "paging_tmpl.h"
> > +#undef PTTYPE
> 
> Yes, that's the key.

I'm now preparing a patch based on such ideas.

One downside of this approach is that mmu.c (and therefore the x86
module) will now include EPT-specific functions that are of no use or
relevance to the SVM code. It's not a terrible disaster, but it's
"unclean". I'll try to think if there's a cleaner way.

Nadav.

-- 
Nadav Har'El                        |                    Tuesday, Dec 6 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Writing software is like sex: One mistake
http://nadav.harel.org.il           |and you have to support it forever.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-12-06 12:40                   ` Nadav Har'El
@ 2011-12-06 13:07                     ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-12-06 13:07 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Orit Wasserman, kvm, Roedel, Joerg, abelg

On 12/06/2011 02:40 PM, Nadav Har'El wrote:
> On Sun, Nov 13, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > On 11/13/2011 01:30 PM, Orit Wasserman wrote:
> > > Maybe this patch can help, this is roughly what Avi wants (I hope) done very quickly.
> > > I'm sorry I don't have setup to run nested VMX at the moment so i can't test it.
> >...
> > > +#define PTTYPE EPT
> > > +#include "paging_tmpl.h"
> > > +#undef PTTYPE
> > 
> > Yes, that's the key.
>
> I'm now preparing a patch based on such ideas.
>
> One downside of this approach is that mmu.c (and therefore the x86
> module) will now include EPT-specific functions that are of no use or
> relevance to the SVM code. It's not a terrible disaster, but it's
> "unclean". I'll try to think if there's a cleaner way.

I'm perfectly willing to live with this.

In general vmx.c and svm.c only deal with host-side differences between
Intel and AMD.  EPT support in paging.h is guest-side, so it doesn't
belong there.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-13 11:30               ` Orit Wasserman
  2011-11-13 14:32                 ` Avi Kivity
  2011-11-23 15:06                 ` Nadav Har'El
@ 2011-12-07  9:06                 ` Nadav Har'El
  2011-12-07 10:10                   ` Avi Kivity
  2 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-12-07  9:06 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: Avi Kivity, kvm, Roedel, Joerg, abelg

On Sun, Nov 13, 2011, Orit Wasserman wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> +++ b/arch/x86/kvm/mmu.h
> @@ -48,6 +48,11 @@
>  #define PFERR_RSVD_MASK (1U << 3)
>  #define PFERR_FETCH_MASK (1U << 4)
>  
> +#define EPT_WRITABLE_MASK 2
> +#define EPT_EXEC_MASK 4

This is another example of the "unclean" movement of VMX-specific things into
x86 :( We already have VMX_EPT_WRITABLE_MASK and friends in vmx.h. I'll
need to think what is less ugly: to move them to mmu.h, or to include vmx.h
in mmu.c, or perhaps even create a new include file, ept.h. Avi, do you have
a preference?
The last thing I want to do is to repeat the same definitions in two places.

-- 
Nadav Har'El                        |                  Wednesday, Dec 7 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |"A witty saying proves nothing." --
http://nadav.harel.org.il           |Voltaire

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-12-07  9:06                 ` Nadav Har'El
@ 2011-12-07 10:10                   ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-12-07 10:10 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Orit Wasserman, kvm, Roedel, Joerg, abelg

On 12/07/2011 11:06 AM, Nadav Har'El wrote:
> On Sun, Nov 13, 2011, Orit Wasserman wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -48,6 +48,11 @@
> >  #define PFERR_RSVD_MASK (1U << 3)
> >  #define PFERR_FETCH_MASK (1U << 4)
> >  
> > +#define EPT_WRITABLE_MASK 2
> > +#define EPT_EXEC_MASK 4
>
> This is another example of the "unclean" movement of VMX-specific things into
> x86 :( We already have VMX_EPT_WRITABLE_MASK and friends in vmx.h. I'll
> need to think what is less ugly: to move them to mmu.h, or to include vmx.h
> in mmu.c, or perhaps even create a new include file, ept.h. Avi, do you have
> a preference?

Include vmx.h in mmu.c.  vmx.h is neutral wrt guestiness/hostiness, so
it can be included from mmu.c and vmx.c without issues.

> The last thing I want to do is to repeat the same definitions in two places.

Right.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/10] nEPT: MMU context for nested EPT
  2011-11-14  8:25                     ` Avi Kivity
@ 2011-12-08 15:21                       ` Nadav Har'El
  0 siblings, 0 replies; 49+ messages in thread
From: Nadav Har'El @ 2011-12-08 15:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Orit Wasserman, kvm, Roedel, Joerg, abelg

On Mon, Nov 14, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > >> +#if PTTYPE == EPT 
> > >> +		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> > >> +					      EPT_WRITABLE_MASK);
> > >> +#else
> > >>  		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> > >>  					      PFERR_USER_MASK|PFERR_WRITE_MASK);
> > >> +#endif
> > >> +
> > > 
> > > Unneeded, I think.
> >
> > Is it because translate_nested_gpa always set USER_MASK ? 
> 
> Yes... maybe that function needs to do something like
> 
>    access |= mmu->default_access;

Unless I'm misunderstanding something, translate_nested_gpa, and
gva_to_gpa, take as their "access" parameter a bitmask of PFERR_*,
so it's fine for PFERR_USER_MASK to be enabled in translate_nested_gpa;
It just shouldn't cause PT_USER_MASK to be used. The only additional
problem I can find is in walk_addr_generic which

does

		if (!check_write_user_access(vcpu, write_fault, user_fault,
					  pte))
			eperm = true;

and that checks pte & PT_USER_MASK, which it shouldn't if
PTTYPE==PTTYPE_EPT.

It's really confusing that we now have in mmu.c no less than 4 (!)
access bit schemes, similar in many ways but different in many others:

	1. page fault error codes (PFERR_*_MASK)
	2. x86 page tables acess bits (PT_*_MASK)
	3. KVM private access bits (ACC_*_MASK)
	4. EPT access bits (VMX_EPT_*_MASK).

I just have to try hard not to confuse them.

-- 
Nadav Har'El                        |                   Thursday, Dec 8 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Sorry, but my karma just ran over your
http://nadav.harel.org.il           |dogma.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/10] nEPT: Nested INVEPT
  2011-11-10 12:17   ` Avi Kivity
@ 2011-12-11 14:24     ` Nadav Har'El
  2011-12-11 14:37       ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-12-11 14:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 08/10] nEPT: Nested INVEPT":
> On 11/10/2011 12:01 PM, Nadav Har'El wrote:
> > If we let L1 use EPT, we should probably also support the INVEPT instruction.
>..
> > +			if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
> > +			    (vmcs12->ept_pointer == operand.eptp) &&
> > +			    vmx->nested.last_eptp02)
> > +				ept_sync_context(vmx->nested.last_eptp02);
> > +			else
> > +				ept_sync_global();
> 
> Are either of these needed?  Won't a write to a shadowed EPT table cause
> them anyway?

This is very good point... You're right that as it stands, any changes
to the guest EPT table (EPT12) will cause changes to the shadow EPT
table (EPT02), and these already cause KVM to do an INVEPT, so no point
to do this again when the guest asks.  So basically, I can have INVEPT
emulated by doing absolutely nothing (after checking all the checks), right?

I wonder if I am missing any reason why a hypervisor might want to do
INVEPT without changing the EPT12 table first.

-- 
Nadav Har'El                        |                    Sunday, Dec 11 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Why do programmers mix up Christmas and
http://nadav.harel.org.il           |Halloween? Because DEC 25 = OCT 31

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/10] nEPT: Nested INVEPT
  2011-12-11 14:24     ` Nadav Har'El
@ 2011-12-11 14:37       ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-12-11 14:37 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 12/11/2011 04:24 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 08/10] nEPT: Nested INVEPT":
> > On 11/10/2011 12:01 PM, Nadav Har'El wrote:
> > > If we let L1 use EPT, we should probably also support the INVEPT instruction.
> >..
> > > +			if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
> > > +			    (vmcs12->ept_pointer == operand.eptp) &&
> > > +			    vmx->nested.last_eptp02)
> > > +				ept_sync_context(vmx->nested.last_eptp02);
> > > +			else
> > > +				ept_sync_global();
> > 
> > Are either of these needed?  Won't a write to a shadowed EPT table cause
> > them anyway?
>
> This is very good point... You're right that as it stands, any changes
> to the guest EPT table (EPT12) will cause changes to the shadow EPT
> table (EPT02), and these already cause KVM to do an INVEPT, so no point
> to do this again when the guest asks.  So basically, I can have INVEPT
> emulated by doing absolutely nothing (after checking all the checks), right?

Right.  This was the case for INVLPG before we added out-of-sync pages;
we didn't even intercept the instruction.

> I wonder if I am missing any reason why a hypervisor might want to do
> INVEPT without changing the EPT12 table first.

Shouldn't happen, but why do you care?  If EPT12 has not changed, any
access through EPT02 or its TLB entry is valid.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX
  2011-11-13  9:21     ` Avi Kivity
@ 2011-12-12 11:37       ` Nadav Har'El
  2011-12-12 13:04         ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Nadav Har'El @ 2011-12-12 11:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On Sun, Nov 13, 2011, Avi Kivity wrote about "Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX":
> > I also believed that the fault injection part was also correct: I
> > thought that the code already knows when to handle the fault in L2 (when
> > the address is missing in cr3), in L1 (when the translation is missing
> > in EPT12) or else, in L0.
> 
> It does, but it needs to propagate the fault code correctly.  The exit
> reason (ept violation vs ept misconfiguration) is meaningless, since we
> don't encode anything about it from ept12 into ept02.  In particular an
> ept violation could lead to
> 
> - no fault, ept02 updated, instruction retried
> - no fault, instruction emulated
> - L2 fault
> - ept violation, need to compute ept12 permissions for exit qualification
> - ept misconfiguration
> 
> (the second and third cases occur when it is impossible to create an
> ept02 mapping - when L0 emulates a gpa that L1 assigns to L2 via ept12).

I'm now trying to figure out this part, and I think I am beginning to
understand the mess you are referring to:

In nested_ept_inject_page_fault I now assume the exit reason is always EPT
VIOLATION and have

	vmcs12->exit_qualification = fault->error_code;

But fault->error_code is not in the exit qualification format but in
the PFERR_* format, which has different meanings for the bits...
Moreover, PFERR_RSVD_MASK should cause an EPT MISCONFIG, not EPT
VIOLATION. Is this what you meant above?

I didn't quite understand what you meant in the 4th case about needing
to compute ept12 permissions. I'm assuming that if the EPT violation
was caused because L0 decreased permissions from what L1 thought, then L0
will solve the problem itself and not inject it to L1. So if we are injecting
the fault to L1, don't we already know the correct fault reason and don't
need to compute it?

There's another complication: when the fault comes from an EPT violation
in L2, handle_ept_violation() calls mmu.page_fault() with an error_code of
exit_qualification & 0x3. This means that the error_code in this case is
*not* in the expected PFERR_* format, and we need to know that in
nested_ept_inject_page_fault. Moreover, in the original EPT visolation's
exit qualification, there were various other bits which we lose (and don't
have a direct parallel in PFERR_* anyway), so when we reinject the fault,
L1 doesn't get them.

What a mess :(

-- 
Nadav Har'El                        |                    Monday, Dec 12 2011, 
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Hardware, n.: The parts of a computer
http://nadav.harel.org.il           |system that can be kicked.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX
  2011-12-12 11:37       ` Nadav Har'El
@ 2011-12-12 13:04         ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2011-12-12 13:04 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, owasserm, abelg

On 12/12/2011 01:37 PM, Nadav Har'El wrote:
> On Sun, Nov 13, 2011, Avi Kivity wrote about "Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX":
> > > I also believed that the fault injection part was also correct: I
> > > thought that the code already knows when to handle the fault in L2 (when
> > > the address is missing in cr3), in L1 (when the translation is missing
> > > in EPT12) or else, in L0.
> > 
> > It does, but it needs to propagate the fault code correctly.  The exit
> > reason (ept violation vs ept misconfiguration) is meaningless, since we
> > don't encode anything about it from ept12 into ept02.  In particular an
> > ept violation could lead to
> > 
> > - no fault, ept02 updated, instruction retried
> > - no fault, instruction emulated
> > - L2 fault
> > - ept violation, need to compute ept12 permissions for exit qualification
> > - ept misconfiguration
> > 
> > (the second and third cases occur when it is impossible to create an
> > ept02 mapping - when L0 emulates a gpa that L1 assigns to L2 via ept12).
>
> I'm now trying to figure out this part, and I think I am beginning to
> understand the mess you are referring to:
>
> In nested_ept_inject_page_fault I now assume the exit reason is always EPT
> VIOLATION and have
>
> 	vmcs12->exit_qualification = fault->error_code;
>
> But fault->error_code is not in the exit qualification format but in
> the PFERR_* format, which has different meanings for the bits...
> Moreover, PFERR_RSVD_MASK should cause an EPT MISCONFIG, not EPT
> VIOLATION. Is this what you meant above?

In spirit yes.  In practice rather than translating from PFERR format to
EPT VIOLATION EXIT_QUALIFICATION format, walk_addr() should directly
compute the exit qualification (and an additional bit: whether it's an
EPT VIOLATION or EPT MISCONFIGURATION.

> I didn't quite understand what you meant in the 4th case about needing
> to compute ept12 permissions. I'm assuming that if the EPT violation
> was caused because L0 decreased permissions from what L1 thought, then L0
> will solve the problem itself and not inject it to L1. So if we are injecting
> the fault to L1, don't we already know the correct fault reason and don't
> need to compute it?

If we're injecting an EPT VIOLATION to L1 (because we weren't able to
resolve it; say L1 write-protected the page), then we need to compute
EXIT_QUALIFICATION.  Bits 3-5 of EXIT_QUALIFICATION are computed from
EPT12 paging structure entries (easy to derive them from
pt_access/pte_access).

>
> There's another complication: when the fault comes from an EPT violation
> in L2, handle_ept_violation() calls mmu.page_fault() with an error_code of
> exit_qualification & 0x3. This means that the error_code in this case is
> *not* in the expected PFERR_* format, and we need to know that in
> nested_ept_inject_page_fault. Moreover, in the original EPT visolation's
> exit qualification, there were various other bits which we lose (and don't
> have a direct parallel in PFERR_* anyway), so when we reinject the fault,
> L1 doesn't get them.

struct x86_exception already has 'bool nested', which indicates whether
it's an L1 or L2 fault.  You need to extend that, perhaps by adding
another bool, to distinguish between EPT VIOLATION and
EPT_QUALIFICATION.  The error_code field should be extended to 64 bits
for EXIT_QUALIFICATION (though only bits 0-12 are defined).  You need
another field for the guest linear address.

EXIT_QUALIFICATION has to be calculated, it cannot be derived from the
original exit.

Look at kvm_propagate_fault().

> What a mess :(

If you have a splitting headache, you're on the right track.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2011-12-12 13:05 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-10  9:57 [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Nadav Har'El
2011-11-10  9:58 ` [PATCH 01/10] nEPT: Module option Nadav Har'El
2011-11-10 12:23   ` Avi Kivity
2011-11-10 14:21     ` Nadav Har'El
2011-11-10 14:38       ` Avi Kivity
2011-11-10 15:14         ` Nadav Har'El
2011-11-10 15:21           ` Avi Kivity
2011-11-10  9:58 ` [PATCH 02/10] nEPT: MMU context for nested EPT Nadav Har'El
2011-11-10 10:31   ` Avi Kivity
2011-11-10 12:49   ` Avi Kivity
2011-11-10 14:40     ` Nadav Har'El
2011-11-10 15:19       ` Avi Kivity
2011-11-10 20:05         ` Nadav Har'El
2011-11-12 10:39           ` Avi Kivity
2011-11-12 21:37             ` Nadav Har'El
2011-11-13  9:10               ` Avi Kivity
2011-11-13 11:30               ` Orit Wasserman
2011-11-13 14:32                 ` Avi Kivity
2011-11-13 18:26                   ` Orit Wasserman
2011-11-14  8:25                     ` Avi Kivity
2011-12-08 15:21                       ` Nadav Har'El
2011-12-06 12:40                   ` Nadav Har'El
2011-12-06 13:07                     ` Avi Kivity
2011-11-23 15:06                 ` Nadav Har'El
2011-11-23 15:44                   ` Nadav Har'El
2011-11-24 13:36                     ` Avi Kivity
2011-12-07  9:06                 ` Nadav Har'El
2011-12-07 10:10                   ` Avi Kivity
2011-11-10  9:59 ` [PATCH 03/10] nEPT: Fix cr3 handling in nested exit and entry Nadav Har'El
2011-11-10  9:59 ` [PATCH 04/10] nEPT: Fix page table format in nested EPT Nadav Har'El
2011-11-10 10:37   ` Avi Kivity
2011-11-10 11:03     ` Nadav Har'El
2011-11-10 12:21       ` Avi Kivity
2011-11-10 12:50         ` Avi Kivity
2011-11-10 13:07   ` Orit Wasserman
2011-11-10 10:00 ` [PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3 Nadav Har'El
2011-11-10 10:00 ` [PATCH 06/10] nEPT: Some additional comments Nadav Har'El
2011-11-10 10:01 ` [PATCH 07/10] nEPT: Advertise EPT to L1 Nadav Har'El
2011-11-10 10:01 ` [PATCH 08/10] nEPT: Nested INVEPT Nadav Har'El
2011-11-10 12:17   ` Avi Kivity
2011-12-11 14:24     ` Nadav Har'El
2011-12-11 14:37       ` Avi Kivity
2011-11-10 10:02 ` [PATCH 09/10] nEPT: Documentation Nadav Har'El
2011-11-10 10:02 ` [PATCH 10/10] nEPT: Miscelleneous cleanups Nadav Har'El
2011-11-10 12:26 ` [PATCH 0/10] nEPT: Nested EPT support for Nested VMX Avi Kivity
2011-11-13  8:52   ` Nadav Har'El
2011-11-13  9:21     ` Avi Kivity
2011-12-12 11:37       ` Nadav Har'El
2011-12-12 13:04         ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).