public inbox for kvm@vger.kernel.org
* [PATCH 0/28] nVMX: Nested VMX, v7
@ 2010-12-08 16:59 Nadav Har'El
  2010-12-08 17:00 ` [PATCH 01/28] nVMX: Add "nested" module option to vmx.c Nadav Har'El
                   ` (28 more replies)
  0 siblings, 29 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 16:59 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Hi,

This is the seventh iteration of the nested VMX patch set. It fixes a bunch
of bugs in the previous iteration, and in particular it now works correctly
with EPT in the L0 hypervisor, so "ept=0" no longer needs to be specified.

This new set of patches should apply to the current KVM trunk (I checked with
66fc6be8d2b04153b753182610f919faf9c705bc). In particular it uses the recently
added is_guest_mode() function (common to both nested svm and vmx) instead of
inventing our own flag.

About nested VMX:
-----------------

The following 28 patches implement nested VMX support. This feature enables a
guest to use the VMX APIs in order to run its own nested guests. In other
words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
"The Turtles Project: Design and Implementation of Nested Virtualization",
and was awarded "Jay Lepreau Best Paper". The paper is available online, at:

	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (shadow page tables are
used in L1, while L0 can use shadow page tables or EPT). It is also missing
some features required to run VMware Server as a guest. These missing features
will be sent as follow-on patches.

Running nested VMX:
-------------------

The current patches have a number of requirements, which will be relaxed in
follow-on patches:

1. This version was only tested with KVM (64-bit) as a guest hypervisor, and
   Linux as a nested guest.

2. SMP is supported in the code, but is unfortunately buggy in this version
   and often leads to hangs. Use the "nosmp" option in the L0 (topmost)
   kernel to avoid this bug (and to reduce your performance ;-)).

3. No modifications are required to user space (qemu). However, qemu does not
   currently list "VMX" as a CPU feature in its emulated CPUs (even when they
   are named after CPUs that do normally have VMX). Therefore, the "-cpu host"
   option should be given to qemu, to tell it to support CPU features which
   exist in the host - and in particular VMX.
   This requirement can be made unnecessary by a trivial patch to qemu (which
   I will submit in the future).

4. The nested VMX feature is currently disabled by default. It must be
   explicitly enabled with the "nested=1" option to the kvm-intel module.

5. Nested VPID is not properly supported in this version. You must give the
   "vpid=0" module option to kvm-intel to turn this feature off.
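Putting the requirements above together, a session on the L0 host might look
like the following sketch. The module and option names come from the list
above; the disk image name is only a placeholder:

```shell
# The L0 kernel command line should include "nosmp" to avoid the SMP bug
# mentioned in item 2.

# Reload kvm-intel with nested VMX enabled and VPID off (items 4 and 5):
modprobe -r kvm_intel
modprobe kvm_intel nested=1 vpid=0

# Start the L1 (guest hypervisor) image; "-cpu host" exposes the host's CPU
# features, including VMX, to the guest (item 3):
qemu-system-x86_64 -enable-kvm -cpu host -m 2048 -hda l1-hypervisor.img
```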


Patch statistics:
-----------------

 Documentation/kvm/nested-vmx.txt |  237 ++
 arch/x86/include/asm/kvm_host.h  |    2 
 arch/x86/include/asm/vmx.h       |   31 
 arch/x86/kvm/svm.c               |    6 
 arch/x86/kvm/vmx.c               | 2416 ++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c               |   16 
 arch/x86/kvm/x86.h               |    6 
 7 files changed, 2676 insertions(+), 38 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 01/28] nVMX: Add "nested" module option to vmx.c
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
@ 2010-12-08 17:00 ` Nadav Har'El
  2010-12-08 17:00 ` [PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features Nadav Har'El
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:00 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds a module option "nested" to vmx.c, which controls whether
the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. Once nested VMX matures, the default
should probably be flipped to enabled - just as nested SVM is currently
enabled by default.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:48.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:48.000000000 +0200
@@ -69,6 +69,14 @@ module_param(emulate_invalid_guest_state
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+/*
+ * If nested=1, nested virtualization is supported, i.e., the guest may use
+ * VMX and be a hypervisor for its own guests. If nested=0, the guest may not
+ * use VMX instructions.
+ */
+static int nested = 0;
+module_param(nested, int, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\


* [PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
  2010-12-08 17:00 ` [PATCH 01/28] nVMX: Add "nested" module option to vmx.c Nadav Har'El
@ 2010-12-08 17:00 ` Nadav Har'El
  2010-12-09 11:38   ` Joerg Roedel
  2010-12-08 17:01 ` [PATCH 03/28] nVMX: Implement VMXON and VMXOFF Nadav Har'El
                   ` (26 subsequent siblings)
  28 siblings, 1 reply; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:00 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

If the "nested" module option is enabled, add the "VMX" CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.
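The intersection described above can be sketched in a few lines of Python.
The bit position is real (VMX is ECX bit 5 in CPUID leaf 1), but the function
names are invented for illustration:

```python
# Sketch of the CPUID feature intersection: KVM advertises a mask of
# supported bits via KVM_GET_SUPPORTED_CPUID; qemu ANDs it with the bits
# its "-cpu" model requests, and only the common bits reach the guest.
X86_FEATURE_VMX = 1 << 5   # CPUID leaf 1, ECX bit 5

def kvm_supported_ecx(nested):
    """ECX bits KVM reports for CPUID leaf 1 (only the VMX bit sketched)."""
    return X86_FEATURE_VMX if nested else 0

def qemu_guest_ecx(requested_ecx, kvm_ecx):
    """Only features both qemu and KVM agree on are presented to the guest."""
    return requested_ecx & kvm_ecx

assert qemu_guest_ecx(X86_FEATURE_VMX, kvm_supported_ecx(True)) == X86_FEATURE_VMX
assert qemu_guest_ecx(X86_FEATURE_VMX, kvm_supported_ecx(False)) == 0
```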

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 ++
 1 file changed, 2 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:48.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:48.000000000 +0200
@@ -4284,6 +4284,8 @@ static void vmx_cpuid_update(struct kvm_
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+	if (func == 1 && nested)
+		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {


* [PATCH 03/28] nVMX: Implement VMXON and VMXOFF
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
  2010-12-08 17:00 ` [PATCH 01/28] nVMX: Add "nested" module option to vmx.c Nadav Har'El
  2010-12-08 17:00 ` [PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features Nadav Har'El
@ 2010-12-08 17:01 ` Nadav Har'El
  2010-12-08 17:02 ` [PATCH 04/28] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:01 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  102 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:48.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:48.000000000 +0200
@@ -127,6 +127,17 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */
+struct nested_vmx {
+	/* Has the level1 guest done vmxon? */
+	bool vmxon;
+};
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	struct list_head      local_vcpus_link;
@@ -174,6 +185,9 @@ struct vcpu_vmx {
 	u32 exit_reason;
 
 	bool rdtscp_enabled;
+
+	/* Support for a guest hypervisor (nested VMX) */
+	struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -3653,6 +3667,90 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called "VMXON pointer") because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* The Intel VMX Instruction Reference lists a bunch of bits that
+	 * are prerequisite to running VMXON, most notably CR4.VMXE must be
+	 * set to 1. Otherwise, we should fail with #UD. We test these now:
+	 */
+	if (!nested ||
+	    !kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if (is_long_mode(vcpu) && !cs.l) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
+	vmx->nested.vmxon = true;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.vmxon) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    (is_long_mode(vcpu) && !cs.l)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	to_vmx(vcpu)->nested.vmxon = false;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3680,8 +3778,8 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
-	[EXIT_REASON_VMOFF]                   = handle_vmx_insn,
-	[EXIT_REASON_VMON]                    = handle_vmx_insn,
+	[EXIT_REASON_VMOFF]                   = handle_vmoff,
+	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,


* [PATCH 04/28] nVMX: Allow setting the VMXE bit in CR4
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (2 preceding siblings ...)
  2010-12-08 17:01 ` [PATCH 03/28] nVMX: Implement VMXON and VMXOFF Nadav Har'El
@ 2010-12-08 17:02 ` Nadav Har'El
  2010-12-08 17:02 ` [PATCH 05/28] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:02 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so the check has moved into kvm_x86_ops->set_cr4(). This function
now returns an int: if kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, causing kvm_set_cr4() to inject a #GP.

Turning on the VMXE bit is allowed only when the "nested" module option is on,
and turning it off is forbidden after a vmxon.
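The gating rule described above can be sketched as a small predicate. This is
only an illustrative re-expression of the vmx_set_cr4() hunk in the patch;
the function name is invented:

```python
# Sketch of the CR4.VMXE gating: setting VMXE requires the "nested" module
# option, and clearing it is refused while the guest is in VMX operation
# (i.e., after VMXON). Returns True when the CR4 write must fail with #GP.
def cr4_vmxe_faults(cr4_vmxe_set, nested, vmxon):
    if cr4_vmxe_set:
        return not nested          # VMXE may be set only with nested=1
    return nested and vmxon        # VMXE may not be cleared after VMXON

assert cr4_vmxe_faults(True, nested=False, vmxon=False)       # VMXE w/o nested
assert not cr4_vmxe_faults(True, nested=True, vmxon=False)    # allowed
assert cr4_vmxe_faults(False, nested=True, vmxon=True)        # clear after vmxon
assert not cr4_vmxe_faults(False, nested=True, vmxon=False)   # clear before vmxon
```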

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/svm.c              |    6 +++++-
 arch/x86/kvm/vmx.c              |   13 +++++++++++--
 arch/x86/kvm/x86.c              |    4 +---
 4 files changed, 18 insertions(+), 7 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/include/asm/kvm_host.h	2010-12-08 18:56:49.000000000 +0200
@@ -535,7 +535,7 @@ struct kvm_x86_ops {
 	void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
 	void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
-	void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
+	int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
 	void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer);
 	void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
--- .before/arch/x86/kvm/svm.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/svm.c	2010-12-08 18:56:49.000000000 +0200
@@ -1370,11 +1370,14 @@ static void svm_set_cr0(struct kvm_vcpu 
 	update_cr0_intercept(svm);
 }
 
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
 	unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
+	if (cr4 & X86_CR4_VMXE)
+		return 1;
+
 	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
 		force_new_asid(vcpu);
 
@@ -1383,6 +1386,7 @@ static void svm_set_cr4(struct kvm_vcpu 
 		cr4 |= X86_CR4_PAE;
 	cr4 |= host_cr4_mce;
 	to_svm(vcpu)->vmcb->save.cr4 = cr4;
+	return 0;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2010-12-08 18:56:49.000000000 +0200
@@ -610,11 +610,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 		   && !load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3))
 		return 1;
 
-	if (cr4 & X86_CR4_VMXE)
+	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
-	kvm_x86_ops->set_cr4(vcpu, cr4);
-
 	if ((cr4 ^ old_cr4) & pdptr_bits)
 		kvm_mmu_reset_context(vcpu);
 
--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
@@ -1876,7 +1876,7 @@ static void ept_save_pdptrs(struct kvm_v
 		  (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
 					unsigned long cr0,
@@ -1971,11 +1971,19 @@ static void vmx_set_cr3(struct kvm_vcpu 
 	vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
 		    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
+	if (cr4 & X86_CR4_VMXE) {
+		if (!nested)
+			return 1;
+	} else {
+		if (nested && to_vmx(vcpu)->nested.vmxon)
+			return 1;
+	}
+
 	vcpu->arch.cr4 = cr4;
 	if (enable_ept) {
 		if (!is_paging(vcpu)) {
@@ -1988,6 +1996,7 @@ static void vmx_set_cr4(struct kvm_vcpu 
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
 	vmcs_writel(GUEST_CR4, hw_cr4);
+	return 0;
 }
 
 static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)


* [PATCH 05/28] nVMX: Introduce vmcs12: a VMCS structure for L1
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (3 preceding siblings ...)
  2010-12-08 17:02 ` [PATCH 04/28] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
@ 2010-12-08 17:02 ` Nadav Har'El
  2010-12-08 17:03 ` [PATCH 06/28] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:02 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (who can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guests. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   64 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
@@ -128,6 +128,34 @@ struct shared_msr_entry {
 };
 
 /*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
+ * stored in guest memory specified by VMPTRLD, but is opaque to the guest,
+ * which must access it using VMREAD/VMWRITE/VMCLEAR instructions. More
+ * than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
+ * underlying hardware which will be used to run L2.
+ * This structure is packed in order to preserve the binary content after live
+ * migration. If there are changes in the content or layout, VMCS12_REVISION
+ * must be changed.
+ */
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+};
+
+/*
+ * VMCS12_REVISION is an arbitrary id that should be changed if the content or
+ * layout of struct vmcs12 is changed. MSR_IA32_VMX_BASIC returns this id, and
+ * VMPTRLD verifies that the VMCS region that L1 is loading contains this id.
+ */
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -136,6 +164,12 @@ struct shared_msr_entry {
 struct nested_vmx {
 	/* Has the level1 guest done vmxon? */
 	bool vmxon;
+
+	/* The guest-physical address of the current VMCS L1 keeps for L2 */
+	gpa_t current_vmptr;
+	/* The host-usable pointer to the above */
+	struct page *current_vmcs12_page;
+	struct vmcs12 *current_vmcs12;
 };
 
 struct vcpu_vmx {
@@ -195,6 +229,28 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		return NULL;
+	}
+	return page;
+}
+
+static void nested_release_page(struct page *page)
+{
+	kunmap(page);
+	kvm_release_page_dirty(page);
+}
+
+static void nested_release_page_clean(struct page *page)
+{
+	kunmap(page);
+	kvm_release_page_clean(page);
+}
+
 static int init_rmode(struct kvm *kvm);
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
@@ -3755,6 +3811,9 @@ static int handle_vmoff(struct kvm_vcpu 
 
 	to_vmx(vcpu)->nested.vmxon = false;
 
+	if (to_vmx(vcpu)->nested.current_vmptr != -1ull)
+		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
+
 	skip_emulated_instruction(vcpu);
 	return 1;
 }
@@ -4183,6 +4242,8 @@ static void vmx_free_vcpu(struct kvm_vcp
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
+	if (vmx->nested.vmxon && to_vmx(vcpu)->nested.current_vmptr != -1ull)
+		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
@@ -4249,6 +4310,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
 			goto free_vmcs;
 	}
 
+	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.current_vmcs12 = NULL;
+
 	return &vmx->vcpu;
 
 free_vmcs:


* [PATCH 06/28] nVMX: Implement reading and writing of VMX MSRs
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (4 preceding siblings ...)
  2010-12-08 17:02 ` [PATCH 05/28] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
@ 2010-12-08 17:03 ` Nadav Har'El
  2010-12-09 11:04   ` Avi Kivity
  2010-12-08 17:03 ` [PATCH 07/28] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:03 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When the guest can use VMX instructions (i.e., when the "nested" module
option is on), it should also be able to read and write VMX MSRs, e.g., to
query VMX capabilities. This patch adds that support.
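As a sketch of the MSR_IA32_VMX_BASIC layout the patch relies on: the VMCS
revision identifier sits in bits 30:0 and the VMCS region size in bits 44:32
(per the Intel SDM). The size value below assumes the two-field packed
vmcs12 defined earlier in the series:

```python
# Compose MSR_IA32_VMX_BASIC the way the patch does, then decode it back.
VMCS12_REVISION = 0x11e57ed0   # arbitrary revision id from patch 05
VMCS12_SIZE = 8                # sizeof(struct vmcs12) so far: two u32 fields

basic = VMCS12_REVISION | (VMCS12_SIZE << 32)

revision_id = basic & 0x7fffffff        # bits 30:0 (bit 31 is always 0)
region_size = (basic >> 32) & 0x1fff    # bits 44:32

assert revision_id == VMCS12_REVISION
assert region_size == VMCS12_SIZE
```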

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  117 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c |    6 +-
 2 files changed, 122 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2010-12-08 18:56:49.000000000 +0200
@@ -796,7 +796,11 @@ static u32 msrs_to_save[] = {
 #ifdef CONFIG_X86_64
 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
 #endif
-	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
+	MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
+	MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
+	MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
+	MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
 };
 
 static unsigned num_msrs_to_save;
--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
@@ -1211,6 +1211,119 @@ static void vmx_adjust_tsc_offset(struct
 }
 
 /*
+ * If we allow our guest to use VMX instructions (i.e., nested VMX), we should
+ * also let it use VMX-specific MSRs.
+ * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 0 when we handled a
+ * VMX-specific MSR, or 1 when we haven't (and the caller should handle it
+ * like all other MSRs).
+ */
+static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+	u64 vmx_msr = 0;
+	u32 vmx_msr_high, vmx_msr_low;
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		/*
+		 * This MSR reports some information about VMX support of the
+		 * processor. We should return information about the VMX we
+		 * emulate for the guest, and the VMCS structure we give it -
+		 * not about the VMX support of the underlying hardware.
+		 * However, some capabilities of the underlying hardware are
+		 * used directly by our emulation (e.g., the physical address
+		 * width), so these are copied from what the hardware reports.
+		 */
+		*pdata = VMCS12_REVISION | (((u64)sizeof(struct vmcs12)) << 32);
+		rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+#define VMX_BASIC_64		0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE	0x003c000000000000LLU
+#define VMX_BASIC_INOUT		0x0040000000000000LLU
+		*pdata |= vmx_msr &
+			(VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);
+		break;
+#define CORE2_PINBASED_CTLS_MUST_BE_ONE	0x00000016
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS	0x48d
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
+		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
+				PIN_BASED_EXT_INTR_MASK |
+				PIN_BASED_NMI_EXITING |
+				PIN_BASED_VIRTUAL_NMIS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+		/* This MSR determines which vm-execution controls the L1
+		 * hypervisor may ask, or may not ask, to enable. Normally we
+		 * can only allow enabling features which the hardware can
+		 * support, but we limit ourselves to allowing only known
+		 * features that were tested nested. We allow disabling any
+		 * feature (even if the hardware can't disable it).
+		 */
+		rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+
+		vmx_msr_low = 0; /* allow disabling any feature */
+		vmx_msr_high &= /* do not expose new untested features */
+			CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+			CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
+			CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
+			CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
+			CPU_BASED_INVLPG_EXITING | CPU_BASED_TPR_SHADOW |
+			CPU_BASED_USE_MSR_BITMAPS |
+#ifdef CONFIG_X86_64
+			CPU_BASED_CR8_LOAD_EXITING |
+			CPU_BASED_CR8_STORE_EXITING |
+#endif
+			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_EXIT_CTLS:
+		*pdata = 0;
+#ifdef CONFIG_X86_64
+		*pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
+#endif
+		break;
+	case MSR_IA32_VMX_ENTRY_CTLS:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+		*pdata = 0;
+		if (vm_need_virtualize_apic_accesses(vcpu->kvm))
+			*pdata |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		break;
+	case MSR_IA32_VMX_EPT_VPID_CAP:
+		*pdata = 0;
+		break;
+	default:
+		return 1;
+	}
+
+	return 0;
+}
+
+static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+{
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+	case MSR_IA32_VMX_BASIC:
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+	case MSR_IA32_VMX_EXIT_CTLS:
+	case MSR_IA32_VMX_ENTRY_CTLS:
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+	case MSR_IA32_VMX_EPT_VPID_CAP:
+		pr_unimpl(vcpu, "unimplemented VMX MSR write: 0x%x data %llx\n",
+			  msr_index, data);
+		return 0;
+	default:
+		return 1;
+	}
+}
+/*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
@@ -1258,6 +1371,8 @@ static int vmx_get_msr(struct kvm_vcpu *
 		/* Otherwise falls through */
 	default:
 		vmx_load_host_state(to_vmx(vcpu));
+		if (nested && !vmx_get_vmx_msr(vcpu, msr_index, &data))
+			break;
 		msr = find_msr_entry(to_vmx(vcpu), msr_index);
 		if (msr) {
 			vmx_load_host_state(to_vmx(vcpu));
@@ -1327,6 +1442,8 @@ static int vmx_set_msr(struct kvm_vcpu *
 			return 1;
 		/* Otherwise falls through */
 	default:
+		if (nested && !vmx_set_vmx_msr(vcpu, msr_index, data))
+			break;
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			vmx_load_host_state(vmx);


* [PATCH 07/28] nVMX: Decoding memory operands of VMX instructions
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (5 preceding siblings ...)
  2010-12-08 17:03 ` [PATCH 06/28] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
@ 2010-12-08 17:03 ` Nadav Har'El
  2010-12-09 11:08   ` Avi Kivity
  2010-12-08 17:04 ` [PATCH 08/28] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:03 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds a utility function for decoding the memory-address operands
of VMX instructions issued by L1 (a guest hypervisor).
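The bitfield layout used by the decode can be sketched as follows. The shifts
and masks mirror the C code in the patch (and the Intel SDM's "VM-Exit
Instruction Information" table); the Python function is only an illustrative
re-expression, and the example operand values are made up:

```python
# Unpack the addressing components that vmx_instruction_info carries on a
# VM exit; the displacement travels separately in exit_qualification.
def decode_vmx_instruction_info(info):
    return {
        "scaling":        info & 3,
        "addr_size":      (info >> 7) & 7,   # 0=16-bit, 1=32-bit, 2=64-bit
        "is_reg":         bool(info & (1 << 10)),
        "seg_reg":        (info >> 15) & 7,
        "index_reg":      (info >> 18) & 0xf,
        "index_is_valid": not (info & (1 << 22)),
        "base_reg":       (info >> 23) & 0xf,
        "base_is_valid":  not (info & (1 << 27)),
    }

# Example: memory operand, 64-bit address size, base register 3 (RBX),
# index register marked invalid.
info = (2 << 7) | (3 << 23) | (1 << 22)
d = decode_vmx_instruction_info(info)
assert d["addr_size"] == 2 and d["base_reg"] == 3
assert d["base_is_valid"] and not d["index_is_valid"] and not d["is_reg"]
```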

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   59 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c |    3 +-
 arch/x86/kvm/x86.h |    3 ++
 3 files changed, 64 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2010-12-08 18:56:49.000000000 +0200
@@ -3688,7 +3688,7 @@ static int kvm_fetch_guest_virt(gva_t ad
 					  exception);
 }
 
-static int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
 			       struct kvm_vcpu *vcpu,
 			       struct x86_exception *exception)
 {
@@ -3696,6 +3696,7 @@ static int kvm_read_guest_virt(gva_t add
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  exception);
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
 
 static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
 				      struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.h	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/x86.h	2010-12-08 18:56:49.000000000 +0200
@@ -74,6 +74,9 @@ void kvm_before_handle_nmi(struct kvm_vc
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq);
 
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+		struct kvm_vcpu *vcpu, struct x86_exception *exception);
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
 #endif
--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
@@ -3936,6 +3936,65 @@ static int handle_vmoff(struct kvm_vcpu 
 }
 
 /*
+ * Decode the memory-address operand of a vmx instruction, as recorded on an
+ * exit caused by such an instruction (run by a guest hypervisor).
+ * On success, returns 0. When the operand is invalid, returns 1 and throws
+ * #UD or #GP.
+ */
+static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
+				 unsigned long exit_qualification,
+				 u32 vmx_instruction_info, gva_t *ret)
+{
+	/*
+	 * According to Vol. 3B, "Information for VM Exits Due to Instruction
+	 * Execution", on an exit, vmx_instruction_info holds most of the
+	 * addressing components of the operand. Only the displacement part
+	 * is put in exit_qualification (see 3B, "Basic VM-Exit Information").
+	 * For how an actual address is calculated from all these components,
+	 * refer to Vol. 1, "Operand Addressing".
+	 */
+	int  scaling = vmx_instruction_info & 3;
+	int  addr_size = (vmx_instruction_info >> 7) & 7;
+	bool is_reg = vmx_instruction_info & (1u << 10);
+	int  seg_reg = (vmx_instruction_info >> 15) & 7;
+	int  index_reg = (vmx_instruction_info >> 18) & 0xf;
+	bool index_is_valid = !(vmx_instruction_info & (1u << 22));
+	int  base_reg       = (vmx_instruction_info >> 23) & 0xf;
+	bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
+
+	if (is_reg) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	switch (addr_size) {
+	case 1: /* 32 bit. high bits are undefined according to the spec: */
+		exit_qualification &= 0xffffffff;
+		break;
+	case 2: /* 64 bit */
+		break;
+	default: /* 16 bit */
+		return 1;
+	}
+
+	/* Addr = segment_base + offset */
+	/* offset = base + [index * scale] + displacement */
+	*ret = vmx_get_segment_base(vcpu, seg_reg);
+	if (base_is_valid)
+		*ret += kvm_register_read(vcpu, base_reg);
+	if (index_is_valid)
+		*ret += kvm_register_read(vcpu, index_reg)<<scaling;
+	*ret += exit_qualification; /* holds the displacement */
+	/*
+	 * TODO: throw #GP (and return 1) in various cases that the VM*
+	 * instructions require it - e.g., offset beyond segment limit,
+	 * unusable or unreadable/unwritable segment, non-canonical 64-bit
+	 * address, and so on. Currently these are not checked.
+	 */
+	return 0;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.

^ permalink raw reply	[flat|nested] 40+ messages in thread
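The bit layout that get_vmx_mem_address() decodes can be exercised outside the kernel. Below is a minimal userspace sketch: the field positions match the extraction in the patch above, but the struct, function names and the sample encoding are illustrative only, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode the operand-addressing components of a VM-exit
 * instruction-information field, mirroring get_vmx_mem_address().
 */
struct vmx_mem_op {
	int  scaling;        /* bits 1:0  - shift count applied to index */
	int  addr_size;      /* bits 9:7  - 0=16-bit, 1=32-bit, 2=64-bit */
	bool is_reg;         /* bit  10   - register operand, not memory */
	int  seg_reg;        /* bits 17:15 - segment register            */
	int  index_reg;      /* bits 21:18 - index register number       */
	bool index_is_valid; /* bit  22 clear => index is valid          */
	int  base_reg;       /* bits 26:23 - base register number        */
	bool base_is_valid;  /* bit  27 clear => base is valid           */
};

static struct vmx_mem_op decode(uint32_t info)
{
	struct vmx_mem_op op = {
		.scaling        = info & 3,
		.addr_size      = (info >> 7) & 7,
		.is_reg         = info & (1u << 10),
		.seg_reg        = (info >> 15) & 7,
		.index_reg      = (info >> 18) & 0xf,
		.index_is_valid = !(info & (1u << 22)),
		.base_reg       = (info >> 23) & 0xf,
		.base_is_valid  = !(info & (1u << 27)),
	};
	return op;
}
```

For example, an encoding of `(2u << 7) | (1u << 22)` describes a 64-bit memory operand with a valid base register 0 and no index.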

* [PATCH 08/28] nVMX: Hold a vmcs02 for each vmcs12
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (6 preceding siblings ...)
  2010-12-08 17:03 ` [PATCH 07/28] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
@ 2010-12-08 17:04 ` Nadav Har'El
  2010-12-09 12:41   ` Avi Kivity
  2010-12-08 17:04 ` [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
                   ` (20 subsequent siblings)
  28 siblings, 1 reply; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:04 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In this patch we add a list of hardware (L0) VMCSs, which we use to hold a
hardware VMCS for each active vmcs12 (i.e., for each L2 guest).

We call each of these L0 VMCSs a "vmcs02", as it is the VMCS that L0 uses
to run its nested guest L2.
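The vmcs12-address-to-vmcs02 mapping can be sketched in plain C. This is a simplified userspace model: the node type, allocators and function names below are made up for illustration; the real code in the diff chains struct vmcs_list nodes on a kernel list_head and allocates with alloc_vmcs().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t gpa_t;

/*
 * One node per active vmcs12: maps its guest-physical address to the
 * hardware VMCS (vmcs02) allocated for it.
 */
struct vmcs02_node {
	gpa_t vmcs12_addr;
	void *vmcs02;                    /* stands in for struct vmcs * */
	struct vmcs02_node *next;
};

static struct vmcs02_node *vmcs02_list;
static int vmcs02_num;
#define NESTED_MAX_VMCS 256  /* bound L0 memory allocated on behalf of L1 */

static void *nested_get_vmcs(gpa_t vmptr)
{
	struct vmcs02_node *n;

	for (n = vmcs02_list; n; n = n->next)
		if (n->vmcs12_addr == vmptr)
			return n->vmcs02;
	return NULL;
}

/* Get-or-create, as in nested_create_current_vmcs() */
static int nested_create_vmcs(gpa_t vmptr)
{
	struct vmcs02_node *n;

	if (nested_get_vmcs(vmptr))
		return 0;               /* already have a vmcs02 */
	if (vmcs02_num >= NESTED_MAX_VMCS)
		return -1;              /* L1 should VMCLEAR unused vmcs12s */
	n = malloc(sizeof(*n));
	if (!n)
		return -1;
	n->vmcs02 = malloc(4096);       /* stands in for alloc_vmcs() */
	if (!n->vmcs02) {
		free(n);
		return -1;
	}
	n->vmcs12_addr = vmptr;
	n->next = vmcs02_list;
	vmcs02_list = n;
	vmcs02_num++;
	return 0;
}
```

Creating a vmcs02 for the same vmcs12 address twice is idempotent, and the cap keeps a misbehaving L1 from exhausting L0 memory.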

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   96 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
@@ -155,6 +155,12 @@ struct __packed vmcs12 {
  */
 #define VMCS12_REVISION 0x11e57ed0
 
+struct vmcs_list {
+	struct list_head list;
+	gpa_t vmcs12_addr;
+	struct vmcs *vmcs02;
+};
+
 /*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
@@ -170,6 +176,10 @@ struct nested_vmx {
 	/* The host-usable pointer to the above */
 	struct page *current_vmcs12_page;
 	struct vmcs12 *current_vmcs12;
+
+	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
+	struct list_head vmcs02_list; /* a vmcs_list */
+	int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -1736,6 +1746,85 @@ static void free_vmcs(struct vmcs *vmcs)
 	free_pages((unsigned long)vmcs, vmcs_config.order);
 }
 
+static struct vmcs *nested_get_current_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n, &vmx->nested.vmcs02_list, list)
+		if (list_item->vmcs12_addr == vmx->nested.current_vmptr)
+			return list_item->vmcs02;
+
+	return NULL;
+}
+
+/*
+ * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
+ * does not already exist. The allocation is done in L0 memory, so to avoid
+ * denial-of-service attack by guests, we limit the number of concurrently-
+ * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
+ * trigger this limit.
+ */
+static const int NESTED_MAX_VMCS = 256;
+static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_list *new_l2_guest;
+	struct vmcs *vmcs02;
+
+	if (nested_get_current_vmcs(vcpu))
+		return 0; /* nothing to do - we already have a VMCS */
+
+	if (to_vmx(vcpu)->nested.vmcs02_num >= NESTED_MAX_VMCS)
+		return -ENOMEM;
+
+	new_l2_guest = (struct vmcs_list *)
+		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
+	if (!new_l2_guest)
+		return -ENOMEM;
+
+	vmcs02 = alloc_vmcs();
+	if (!vmcs02) {
+		kfree(new_l2_guest);
+		return -ENOMEM;
+	}
+
+	new_l2_guest->vmcs12_addr = to_vmx(vcpu)->nested.current_vmptr;
+	new_l2_guest->vmcs02 = vmcs02;
+	list_add(&(new_l2_guest->list), &(to_vmx(vcpu)->nested.vmcs02_list));
+	to_vmx(vcpu)->nested.vmcs02_num++;
+	return 0;
+}
+
+/* Free a vmcs12's associated vmcs02, and remove it from vmcs02_list */
+static void nested_free_vmcs(struct kvm_vcpu *vcpu, gpa_t vmptr)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n, &vmx->nested.vmcs02_list, list)
+		if (list_item->vmcs12_addr == vmptr) {
+			free_vmcs(list_item->vmcs02);
+			list_del(&(list_item->list));
+			kfree(list_item);
+			vmx->nested.vmcs02_num--;
+			return;
+		}
+}
+
+static void free_l1_state(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n,
+			&vmx->nested.vmcs02_list, list) {
+		free_vmcs(list_item->vmcs02);
+		list_del(&(list_item->list));
+		kfree(list_item);
+	}
+	vmx->nested.vmcs02_num = 0;
+}
+
 static void free_kvm_area(void)
 {
 	int cpu;
@@ -3884,6 +3973,9 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+	INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
+	vmx->nested.vmcs02_num = 0;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -3931,6 +4023,8 @@ static int handle_vmoff(struct kvm_vcpu 
 	if (to_vmx(vcpu)->nested.current_vmptr != -1ull)
 		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
 
+	free_l1_state(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	return 1;
 }
@@ -4420,6 +4514,8 @@ static void vmx_free_vcpu(struct kvm_vcp
 	free_vpid(vmx);
 	if (vmx->nested.vmxon && to_vmx(vcpu)->nested.current_vmptr != -1ull)
 		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
+	if (vmx->nested.vmxon)
+		free_l1_state(vcpu);
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);


* [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (7 preceding siblings ...)
  2010-12-08 17:04 ` [PATCH 08/28] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
@ 2010-12-08 17:04 ` Nadav Har'El
  2010-12-09 12:43   ` Avi Kivity
  2010-12-08 17:05 ` [PATCH 10/28] nVMX: Success/failure of VMX instructions Nadav Har'El
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:04 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields. These fields are encapsulated in a struct vmcs_fields.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during VMLAUNCH/VMRESUME to prepare vmcs02,
the hardware VMCS used to run L2.
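The idea — serving reads and writes of an emulated VMCS from a packed struct via an offsetof() table, as the diff in this patch does for the full field set — can be sketched with a cut-down, purely illustrative field set:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Cut-down sketch of the vmcs12 emulation: a packed struct of fields
 * plus an offset table lets field accesses be served by plain memory
 * accesses at the looked-up offset.  The field encodings and struct
 * below are abbreviated stand-ins, not the real VMCS layout.
 */
struct vmcs_fields {
	uint16_t guest_es_selector;
	uint32_t exception_bitmap;
	uint64_t tsc_offset;
} __attribute__((packed));

enum { GUEST_ES_SELECTOR, EXCEPTION_BITMAP, TSC_OFFSET, NR_FIELDS };

static const unsigned short field_to_offset[NR_FIELDS] = {
	[GUEST_ES_SELECTOR] = offsetof(struct vmcs_fields, guest_es_selector),
	[EXCEPTION_BITMAP]  = offsetof(struct vmcs_fields, exception_bitmap),
	[TSC_OFFSET]        = offsetof(struct vmcs_fields, tsc_offset),
};

/* Emulated write of a 32-bit field; memcpy handles the packed layout */
static void vmcs12_write32(struct vmcs_fields *f, int field, uint32_t val)
{
	memcpy((char *)f + field_to_offset[field], &val, sizeof(val));
}

static uint32_t vmcs12_read32(const struct vmcs_fields *f, int field)
{
	uint32_t val;

	memcpy(&val, (const char *)f + field_to_offset[field], sizeof(val));
	return val;
}
```

A value written through the table-driven accessor is visible through the struct member, and vice versa, which is what lets later patches back VMREAD/VMWRITE with this one structure.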

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  295 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 295 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
@@ -128,6 +128,137 @@ struct shared_msr_entry {
 };
 
 /*
+ * vmcs_fields is a structure used in nested VMX for holding a copy of all
+ * standard VMCS fields. It is used for emulating a VMCS for L1 (see struct
+ * vmcs12), and also for easier access to VMCS data (see vmcs01_fields).
+ */
+struct __packed vmcs_fields {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
+/*
  * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
  * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
  * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
@@ -146,6 +277,8 @@ struct __packed vmcs12 {
 	 */
 	u32 revision_id;
 	u32 abort;
+
+	struct vmcs_fields fields;
 };
 
 /*
@@ -239,6 +372,168 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+#define OFFSET(x) offsetof(struct vmcs_fields, x)
+
+static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
+	[VIRTUAL_PROCESSOR_ID] = OFFSET(virtual_processor_id),
+	[GUEST_ES_SELECTOR] = OFFSET(guest_es_selector),
+	[GUEST_CS_SELECTOR] = OFFSET(guest_cs_selector),
+	[GUEST_SS_SELECTOR] = OFFSET(guest_ss_selector),
+	[GUEST_DS_SELECTOR] = OFFSET(guest_ds_selector),
+	[GUEST_FS_SELECTOR] = OFFSET(guest_fs_selector),
+	[GUEST_GS_SELECTOR] = OFFSET(guest_gs_selector),
+	[GUEST_LDTR_SELECTOR] = OFFSET(guest_ldtr_selector),
+	[GUEST_TR_SELECTOR] = OFFSET(guest_tr_selector),
+	[HOST_ES_SELECTOR] = OFFSET(host_es_selector),
+	[HOST_CS_SELECTOR] = OFFSET(host_cs_selector),
+	[HOST_SS_SELECTOR] = OFFSET(host_ss_selector),
+	[HOST_DS_SELECTOR] = OFFSET(host_ds_selector),
+	[HOST_FS_SELECTOR] = OFFSET(host_fs_selector),
+	[HOST_GS_SELECTOR] = OFFSET(host_gs_selector),
+	[HOST_TR_SELECTOR] = OFFSET(host_tr_selector),
+	[IO_BITMAP_A] = OFFSET(io_bitmap_a),
+	[IO_BITMAP_A_HIGH] = OFFSET(io_bitmap_a)+4,
+	[IO_BITMAP_B] = OFFSET(io_bitmap_b),
+	[IO_BITMAP_B_HIGH] = OFFSET(io_bitmap_b)+4,
+	[MSR_BITMAP] = OFFSET(msr_bitmap),
+	[MSR_BITMAP_HIGH] = OFFSET(msr_bitmap)+4,
+	[VM_EXIT_MSR_STORE_ADDR] = OFFSET(vm_exit_msr_store_addr),
+	[VM_EXIT_MSR_STORE_ADDR_HIGH] = OFFSET(vm_exit_msr_store_addr)+4,
+	[VM_EXIT_MSR_LOAD_ADDR] = OFFSET(vm_exit_msr_load_addr),
+	[VM_EXIT_MSR_LOAD_ADDR_HIGH] = OFFSET(vm_exit_msr_load_addr)+4,
+	[VM_ENTRY_MSR_LOAD_ADDR] = OFFSET(vm_entry_msr_load_addr),
+	[VM_ENTRY_MSR_LOAD_ADDR_HIGH] = OFFSET(vm_entry_msr_load_addr)+4,
+	[TSC_OFFSET] = OFFSET(tsc_offset),
+	[TSC_OFFSET_HIGH] = OFFSET(tsc_offset)+4,
+	[VIRTUAL_APIC_PAGE_ADDR] = OFFSET(virtual_apic_page_addr),
+	[VIRTUAL_APIC_PAGE_ADDR_HIGH] = OFFSET(virtual_apic_page_addr)+4,
+	[APIC_ACCESS_ADDR] = OFFSET(apic_access_addr),
+	[APIC_ACCESS_ADDR_HIGH] = OFFSET(apic_access_addr)+4,
+	[EPT_POINTER] = OFFSET(ept_pointer),
+	[EPT_POINTER_HIGH] = OFFSET(ept_pointer)+4,
+	[GUEST_PHYSICAL_ADDRESS] = OFFSET(guest_physical_address),
+	[GUEST_PHYSICAL_ADDRESS_HIGH] = OFFSET(guest_physical_address)+4,
+	[VMCS_LINK_POINTER] = OFFSET(vmcs_link_pointer),
+	[VMCS_LINK_POINTER_HIGH] = OFFSET(vmcs_link_pointer)+4,
+	[GUEST_IA32_DEBUGCTL] = OFFSET(guest_ia32_debugctl),
+	[GUEST_IA32_DEBUGCTL_HIGH] = OFFSET(guest_ia32_debugctl)+4,
+	[GUEST_IA32_PAT] = OFFSET(guest_ia32_pat),
+	[GUEST_IA32_PAT_HIGH] = OFFSET(guest_ia32_pat)+4,
+	[GUEST_PDPTR0] = OFFSET(guest_pdptr0),
+	[GUEST_PDPTR0_HIGH] = OFFSET(guest_pdptr0)+4,
+	[GUEST_PDPTR1] = OFFSET(guest_pdptr1),
+	[GUEST_PDPTR1_HIGH] = OFFSET(guest_pdptr1)+4,
+	[GUEST_PDPTR2] = OFFSET(guest_pdptr2),
+	[GUEST_PDPTR2_HIGH] = OFFSET(guest_pdptr2)+4,
+	[GUEST_PDPTR3] = OFFSET(guest_pdptr3),
+	[GUEST_PDPTR3_HIGH] = OFFSET(guest_pdptr3)+4,
+	[HOST_IA32_PAT] = OFFSET(host_ia32_pat),
+	[HOST_IA32_PAT_HIGH] = OFFSET(host_ia32_pat)+4,
+	[PIN_BASED_VM_EXEC_CONTROL] = OFFSET(pin_based_vm_exec_control),
+	[CPU_BASED_VM_EXEC_CONTROL] = OFFSET(cpu_based_vm_exec_control),
+	[EXCEPTION_BITMAP] = OFFSET(exception_bitmap),
+	[PAGE_FAULT_ERROR_CODE_MASK] = OFFSET(page_fault_error_code_mask),
+	[PAGE_FAULT_ERROR_CODE_MATCH] = OFFSET(page_fault_error_code_match),
+	[CR3_TARGET_COUNT] = OFFSET(cr3_target_count),
+	[VM_EXIT_CONTROLS] = OFFSET(vm_exit_controls),
+	[VM_EXIT_MSR_STORE_COUNT] = OFFSET(vm_exit_msr_store_count),
+	[VM_EXIT_MSR_LOAD_COUNT] = OFFSET(vm_exit_msr_load_count),
+	[VM_ENTRY_CONTROLS] = OFFSET(vm_entry_controls),
+	[VM_ENTRY_MSR_LOAD_COUNT] = OFFSET(vm_entry_msr_load_count),
+	[VM_ENTRY_INTR_INFO_FIELD] = OFFSET(vm_entry_intr_info_field),
+	[VM_ENTRY_EXCEPTION_ERROR_CODE] = OFFSET(vm_entry_exception_error_code),
+	[VM_ENTRY_INSTRUCTION_LEN] = OFFSET(vm_entry_instruction_len),
+	[TPR_THRESHOLD] = OFFSET(tpr_threshold),
+	[SECONDARY_VM_EXEC_CONTROL] = OFFSET(secondary_vm_exec_control),
+	[VM_INSTRUCTION_ERROR] = OFFSET(vm_instruction_error),
+	[VM_EXIT_REASON] = OFFSET(vm_exit_reason),
+	[VM_EXIT_INTR_INFO] = OFFSET(vm_exit_intr_info),
+	[VM_EXIT_INTR_ERROR_CODE] = OFFSET(vm_exit_intr_error_code),
+	[IDT_VECTORING_INFO_FIELD] = OFFSET(idt_vectoring_info_field),
+	[IDT_VECTORING_ERROR_CODE] = OFFSET(idt_vectoring_error_code),
+	[VM_EXIT_INSTRUCTION_LEN] = OFFSET(vm_exit_instruction_len),
+	[VMX_INSTRUCTION_INFO] = OFFSET(vmx_instruction_info),
+	[GUEST_ES_LIMIT] = OFFSET(guest_es_limit),
+	[GUEST_CS_LIMIT] = OFFSET(guest_cs_limit),
+	[GUEST_SS_LIMIT] = OFFSET(guest_ss_limit),
+	[GUEST_DS_LIMIT] = OFFSET(guest_ds_limit),
+	[GUEST_FS_LIMIT] = OFFSET(guest_fs_limit),
+	[GUEST_GS_LIMIT] = OFFSET(guest_gs_limit),
+	[GUEST_LDTR_LIMIT] = OFFSET(guest_ldtr_limit),
+	[GUEST_TR_LIMIT] = OFFSET(guest_tr_limit),
+	[GUEST_GDTR_LIMIT] = OFFSET(guest_gdtr_limit),
+	[GUEST_IDTR_LIMIT] = OFFSET(guest_idtr_limit),
+	[GUEST_ES_AR_BYTES] = OFFSET(guest_es_ar_bytes),
+	[GUEST_CS_AR_BYTES] = OFFSET(guest_cs_ar_bytes),
+	[GUEST_SS_AR_BYTES] = OFFSET(guest_ss_ar_bytes),
+	[GUEST_DS_AR_BYTES] = OFFSET(guest_ds_ar_bytes),
+	[GUEST_FS_AR_BYTES] = OFFSET(guest_fs_ar_bytes),
+	[GUEST_GS_AR_BYTES] = OFFSET(guest_gs_ar_bytes),
+	[GUEST_LDTR_AR_BYTES] = OFFSET(guest_ldtr_ar_bytes),
+	[GUEST_TR_AR_BYTES] = OFFSET(guest_tr_ar_bytes),
+	[GUEST_INTERRUPTIBILITY_INFO] = OFFSET(guest_interruptibility_info),
+	[GUEST_ACTIVITY_STATE] = OFFSET(guest_activity_state),
+	[GUEST_SYSENTER_CS] = OFFSET(guest_sysenter_cs),
+	[HOST_IA32_SYSENTER_CS] = OFFSET(host_ia32_sysenter_cs),
+	[CR0_GUEST_HOST_MASK] = OFFSET(cr0_guest_host_mask),
+	[CR4_GUEST_HOST_MASK] = OFFSET(cr4_guest_host_mask),
+	[CR0_READ_SHADOW] = OFFSET(cr0_read_shadow),
+	[CR4_READ_SHADOW] = OFFSET(cr4_read_shadow),
+	[CR3_TARGET_VALUE0] = OFFSET(cr3_target_value0),
+	[CR3_TARGET_VALUE1] = OFFSET(cr3_target_value1),
+	[CR3_TARGET_VALUE2] = OFFSET(cr3_target_value2),
+	[CR3_TARGET_VALUE3] = OFFSET(cr3_target_value3),
+	[EXIT_QUALIFICATION] = OFFSET(exit_qualification),
+	[GUEST_LINEAR_ADDRESS] = OFFSET(guest_linear_address),
+	[GUEST_CR0] = OFFSET(guest_cr0),
+	[GUEST_CR3] = OFFSET(guest_cr3),
+	[GUEST_CR4] = OFFSET(guest_cr4),
+	[GUEST_ES_BASE] = OFFSET(guest_es_base),
+	[GUEST_CS_BASE] = OFFSET(guest_cs_base),
+	[GUEST_SS_BASE] = OFFSET(guest_ss_base),
+	[GUEST_DS_BASE] = OFFSET(guest_ds_base),
+	[GUEST_FS_BASE] = OFFSET(guest_fs_base),
+	[GUEST_GS_BASE] = OFFSET(guest_gs_base),
+	[GUEST_LDTR_BASE] = OFFSET(guest_ldtr_base),
+	[GUEST_TR_BASE] = OFFSET(guest_tr_base),
+	[GUEST_GDTR_BASE] = OFFSET(guest_gdtr_base),
+	[GUEST_IDTR_BASE] = OFFSET(guest_idtr_base),
+	[GUEST_DR7] = OFFSET(guest_dr7),
+	[GUEST_RSP] = OFFSET(guest_rsp),
+	[GUEST_RIP] = OFFSET(guest_rip),
+	[GUEST_RFLAGS] = OFFSET(guest_rflags),
+	[GUEST_PENDING_DBG_EXCEPTIONS] = OFFSET(guest_pending_dbg_exceptions),
+	[GUEST_SYSENTER_ESP] = OFFSET(guest_sysenter_esp),
+	[GUEST_SYSENTER_EIP] = OFFSET(guest_sysenter_eip),
+	[HOST_CR0] = OFFSET(host_cr0),
+	[HOST_CR3] = OFFSET(host_cr3),
+	[HOST_CR4] = OFFSET(host_cr4),
+	[HOST_FS_BASE] = OFFSET(host_fs_base),
+	[HOST_GS_BASE] = OFFSET(host_gs_base),
+	[HOST_TR_BASE] = OFFSET(host_tr_base),
+	[HOST_GDTR_BASE] = OFFSET(host_gdtr_base),
+	[HOST_IDTR_BASE] = OFFSET(host_idtr_base),
+	[HOST_IA32_SYSENTER_ESP] = OFFSET(host_ia32_sysenter_esp),
+	[HOST_IA32_SYSENTER_EIP] = OFFSET(host_ia32_sysenter_eip),
+	[HOST_RSP] = OFFSET(host_rsp),
+	[HOST_RIP] = OFFSET(host_rip),
+};
+
+static inline short vmcs_field_to_offset(unsigned long field)
+{
+
+	if (field > HOST_RIP || vmcs_field_to_offset_table[field] == 0) {
+		printk(KERN_ERR "invalid vmcs field 0x%lx\n", field);
+		return -1;
+	}
+	return vmcs_field_to_offset_table[field];
+}
+
+static inline struct vmcs_fields *get_vmcs12_fields(struct kvm_vcpu *vcpu)
+{
+	return &(to_vmx(vcpu)->nested.current_vmcs12->fields);
+}
+
 static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
 {
 	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);


* [PATCH 10/28] nVMX: Success/failure of VMX instructions.
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (8 preceding siblings ...)
  2010-12-08 17:04 ` [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2010-12-08 17:05 ` Nadav Har'El
  2010-12-08 17:05 ` [PATCH 11/28] nVMX: Implement VMCLEAR Nadav Har'El
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:05 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

VMX instructions report success or failure by setting certain RFLAGS bits.
This patch adds common functions for doing this; they will be used by the
following patches, which emulate the various VMX instructions.
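The RFLAGS convention these helpers implement can be sketched as pure functions on an rflags value (the flag masks match arch/x86; the function names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define X86_EFLAGS_CF 0x0001
#define X86_EFLAGS_PF 0x0004
#define X86_EFLAGS_AF 0x0010
#define X86_EFLAGS_ZF 0x0040
#define X86_EFLAGS_SF 0x0080
#define X86_EFLAGS_OF 0x0800

#define VMX_ARITH_FLAGS (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | \
			 X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)

/* VMsucceed: all six arithmetic flags cleared */
static uint64_t vmx_succeed(uint64_t rflags)
{
	return rflags & ~VMX_ARITH_FLAGS;
}

/* VMfailInvalid (no current VMCS): CF=1, the rest cleared */
static uint64_t vmx_fail_invalid(uint64_t rflags)
{
	return (rflags & ~VMX_ARITH_FLAGS) | X86_EFLAGS_CF;
}

/*
 * VMfailValid: ZF=1, the rest cleared; the error code additionally
 * goes into the current VMCS's VM-instruction error field.
 */
static uint64_t vmx_fail_valid(uint64_t rflags)
{
	return (rflags & ~VMX_ARITH_FLAGS) | X86_EFLAGS_ZF;
}
```

Note that each outcome fully overwrites all six flags, so stale CF/ZF bits from a previous instruction can never leak through.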

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/vmx.h |   31 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c         |   30 ++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
@@ -4384,6 +4384,36 @@ static int get_vmx_mem_address(struct kv
 }
 
 /*
+ * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
+ * set the success or error code of an emulated VMX instruction, as specified
+ * by Vol 2B, VMX Instruction Reference, "Conventions".
+ */
+static void nested_vmx_succeed(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF));
+}
+
+static void nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_CF);
+}
+
+static void nested_vmx_failValid(struct kvm_vcpu *vcpu,
+					u32 vm_instruction_error)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_ZF);
+	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
--- .before/arch/x86/include/asm/vmx.h	2010-12-08 18:56:49.000000000 +0200
+++ .after/arch/x86/include/asm/vmx.h	2010-12-08 18:56:49.000000000 +0200
@@ -412,4 +412,35 @@ struct vmx_msr_entry {
 	u64 value;
 } __aligned(16);
 
+/*
+ * VM-instruction error numbers
+ */
+enum vm_instruction_error_number {
+	VMXERR_VMCALL_IN_VMX_ROOT_OPERATION = 1,
+	VMXERR_VMCLEAR_INVALID_ADDRESS = 2,
+	VMXERR_VMCLEAR_VMXON_POINTER = 3,
+	VMXERR_VMLAUNCH_NONCLEAR_VMCS = 4,
+	VMXERR_VMRESUME_NONLAUNCHED_VMCS = 5,
+	VMXERR_VMRESUME_CORRUPTED_VMCS = 6,
+	VMXERR_ENTRY_INVALID_CONTROL_FIELD = 7,
+	VMXERR_ENTRY_INVALID_HOST_STATE_FIELD = 8,
+	VMXERR_VMPTRLD_INVALID_ADDRESS = 9,
+	VMXERR_VMPTRLD_VMXON_POINTER = 10,
+	VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID = 11,
+	VMXERR_UNSUPPORTED_VMCS_COMPONENT = 12,
+	VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT = 13,
+	VMXERR_VMXON_IN_VMX_ROOT_OPERATION = 15,
+	VMXERR_ENTRY_INVALID_EXECUTIVE_VMCS_POINTER = 16,
+	VMXERR_ENTRY_NONLAUNCHED_EXECUTIVE_VMCS = 17,
+	VMXERR_ENTRY_EXECUTIVE_VMCS_POINTER_NOT_VMXON_POINTER = 18,
+	VMXERR_VMCALL_NONCLEAR_VMCS = 19,
+	VMXERR_VMCALL_INVALID_VM_EXIT_CONTROL_FIELDS = 20,
+	VMXERR_VMCALL_INCORRECT_MSEG_REVISION_ID = 22,
+	VMXERR_VMXOFF_UNDER_DUAL_MONITOR_TREATMENT_OF_SMIS_AND_SMM = 23,
+	VMXERR_VMCALL_INVALID_SMM_MONITOR_FEATURES = 24,
+	VMXERR_ENTRY_INVALID_VM_EXECUTION_CONTROL_FIELDS_IN_EXECUTIVE_VMCS = 25,
+	VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS = 26,
+	VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
+};
+
 #endif


* [PATCH 11/28] nVMX: Implement VMCLEAR
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (9 preceding siblings ...)
  2010-12-08 17:05 ` [PATCH 10/28] nVMX: Success/failure of VMX instructions Nadav Har'El
@ 2010-12-08 17:05 ` Nadav Har'El
  2010-12-08 17:06 ` [PATCH 12/28] nVMX: Implement VMPTRLD Nadav Har'El
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:05 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   60 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
@@ -279,6 +279,8 @@ struct __packed vmcs12 {
 	u32 abort;
 
 	struct vmcs_fields fields;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
 };
 
 /*
@@ -4413,6 +4415,62 @@ static void nested_vmx_failValid(struct 
 	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
 }
 
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+	struct vmcs12 *vmcs12;
+	struct page *page;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+				vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmcs12_addr == vmx->nested.current_vmptr) {
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+	}
+
+	page = nested_get_page(vcpu, vmcs12_addr);
+	if (page == NULL) {
+		/*
+		 * For accurate processor emulation, VMCLEAR beyond available
+		 * physical memory should do nothing at all. However, it is
+		 * possible that a nested vmx bug, not a guest hypervisor bug,
+		 * resulted in this case, so let's shut down before doing any
+		 * more damage:
+		 */
+		set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+		return 1;
+	}
+	vmcs12 = kmap(page);
+	vmcs12->launch_state = 0;
+	nested_release_page(page);
+
+	nested_free_vmcs(vcpu, vmcs12_addr);
+
+	skip_emulated_instruction(vcpu);
+	nested_vmx_succeed(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4434,7 +4492,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVD]		      = handle_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
-	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
+	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,


* [PATCH 12/28] nVMX: Implement VMPTRLD
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (10 preceding siblings ...)
  2010-12-08 17:05 ` [PATCH 11/28] nVMX: Implement VMCLEAR Nadav Har'El
@ 2010-12-08 17:06 ` Nadav Har'El
  2010-12-08 17:06 ` [PATCH 13/28] nVMX: Implement VMPTRST Nadav Har'El
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:06 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   61 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
@@ -4471,6 +4471,65 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+				vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmx->nested.current_vmptr != vmcs12_addr) {
+		struct vmcs12 *new_vmcs12;
+		struct page *page;
+		page = nested_get_page(vcpu, vmcs12_addr);
+		if (page == NULL) {
+			nested_vmx_failInvalid(vcpu);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		new_vmcs12 = kmap(page);
+		if (new_vmcs12->revision_id != VMCS12_REVISION) {
+			nested_release_page_clean(page);
+			nested_vmx_failValid(vcpu,
+				VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		if (vmx->nested.current_vmptr != -1ull)
+			nested_release_page(vmx->nested.current_vmcs12_page);
+
+		vmx->nested.current_vmptr = vmcs12_addr;
+		vmx->nested.current_vmcs12 = new_vmcs12;
+		vmx->nested.current_vmcs12_page = page;
+
+		if (nested_create_current_vmcs(vcpu))
+			return -ENOMEM;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4494,7 +4553,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
-	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,


* [PATCH 13/28] nVMX: Implement VMPTRST
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (11 preceding siblings ...)
  2010-12-08 17:06 ` [PATCH 12/28] nVMX: Implement VMPTRLD Nadav Har'El
@ 2010-12-08 17:06 ` Nadav Har'El
  2010-12-08 17:07 ` [PATCH 14/28] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:06 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRST instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   27 ++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c |    3 ++-
 arch/x86/kvm/x86.h |    3 +++
 3 files changed, 31 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/x86.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2010-12-08 18:56:50.000000000 +0200
@@ -3705,7 +3705,7 @@ static int kvm_read_guest_virt_system(gv
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception);
 }
 
-static int kvm_write_guest_virt_system(gva_t addr, void *val,
+int kvm_write_guest_virt_system(gva_t addr, void *val,
 				       unsigned int bytes,
 				       struct kvm_vcpu *vcpu,
 				       struct x86_exception *exception)
@@ -3736,6 +3736,7 @@ static int kvm_write_guest_virt_system(g
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
 static int emulator_read_emulated(unsigned long addr,
 				  void *val,
--- .before/arch/x86/kvm/x86.h	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/x86.h	2010-12-08 18:56:50.000000000 +0200
@@ -77,6 +77,9 @@ int kvm_inject_realmode_interrupt(struct
 int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
 		struct kvm_vcpu *vcpu, struct x86_exception *exception);
 
+int kvm_write_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
+		struct kvm_vcpu *vcpu, struct x86_exception *exception);
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
 #endif
--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
@@ -4530,6 +4530,31 @@ static int handle_vmptrld(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t vmcs_gva;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, exit_qualification,
+			vmx_instruction_info, &vmcs_gva))
+		return 1;
+	/* ok to use *_system, because nested_vmx_check_permission verified cpl=0 */
+	if (kvm_write_guest_virt_system(vmcs_gva,
+				 (void *)&to_vmx(vcpu)->nested.current_vmptr,
+				 sizeof(u64), vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4554,7 +4579,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
-	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 14/28] nVMX: Implement VMREAD and VMWRITE
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (12 preceding siblings ...)
  2010-12-08 17:06 ` [PATCH 13/28] nVMX: Implement VMPTRST Nadav Har'El
@ 2010-12-08 17:07 ` Nadav Har'El
  2010-12-08 17:07 ` [PATCH 15/28] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:07 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read and write to the VMCS it is holding. The values are read or written
to the fields of the vmcs_fields structure introduced in a previous patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  171 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 169 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
@@ -4471,6 +4471,173 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+enum vmcs_field_type {
+	VMCS_FIELD_TYPE_U16 = 0,
+	VMCS_FIELD_TYPE_U64 = 1,
+	VMCS_FIELD_TYPE_U32 = 2,
+	VMCS_FIELD_TYPE_ULONG = 3
+};
+
+static inline int vmcs_field_type(unsigned long field)
+{
+	if (0x1 & field)	/* one of the *_HIGH fields, all are 32 bit */
+		return VMCS_FIELD_TYPE_U32;
+	return (field >> 13) & 0x3;
+}
+
+static inline int vmcs_field_readonly(unsigned long field)
+{
+	return (((field >> 10) & 0x3) == 1);
+}
+
+static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
+					unsigned long field, u64 *ret)
+{
+	short offset = vmcs_field_to_offset(field);
+	char *p;
+
+	if (offset < 0)
+		return 0;
+
+	p = ((char *)(get_vmcs12_fields(vcpu))) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_ULONG:
+		*ret = *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U16:
+		*ret = (u16) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U32:
+		*ret = (u32) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U64:
+		*ret = *((u64 *)p);
+		return 1;
+	default:
+		return 0; /* can never happen. */
+	}
+}
+
+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t gva = 0;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	/* decode instruction info and find the field to read */
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+	if (!vmcs12_read_any(vcpu, field, &field_value)) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	/*
+	 * Now check whether the request is to put the value in a register
+	 * or in memory. Note that the number of bits actually written is
+	 * 32 or 64 depending on the mode, not on the given field's length.
+	 */
+	if (vmx_instruction_info & (1u << 10)) {
+		kvm_register_write(vcpu, (((vmx_instruction_info) >> 3) & 0xf),
+			field_value);
+	} else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		/* ok to use *_system; nested_vmx_check_permission verified cpl=0 */
+		kvm_write_guest_virt_system(gva, &field_value,
+			     (is_long_mode(vcpu) ? 8 : 4), vcpu, NULL);
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+
+static int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value = 0;
+	gva_t gva;
+	int field_type;
+	unsigned long exit_qualification   = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	char *p;
+	short offset;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (vmx_instruction_info & (1u << 10))
+		field_value = kvm_register_read(vcpu,
+			(((vmx_instruction_info) >> 3) & 0xf));
+	else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		if (kvm_read_guest_virt(gva, &field_value,
+				(is_long_mode(vcpu) ? 8 : 4), vcpu, NULL)) {
+			kvm_queue_exception(vcpu, PF_VECTOR);
+			return 1;
+		}
+	}
+
+
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+
+	if (vmcs_field_readonly(field)) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	field_type = vmcs_field_type(field);
+
+	offset = vmcs_field_to_offset(field);
+	if (offset < 0) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	p = ((char *) get_vmcs12_fields(vcpu)) + offset;
+
+	switch (field_type) {
+	case VMCS_FIELD_TYPE_U16:
+		*(u16 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U32:
+		*(u32 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U64:
+#ifdef CONFIG_X86_64
+		*(unsigned long *)p = field_value;
+#else
+		*(unsigned long *)p = field_value;
+		*(((unsigned long *)p)+1) = field_value >> 32;
+#endif
+		break;
+	case VMCS_FIELD_TYPE_ULONG:
+		*(unsigned long *)p = field_value;
+		break;
+	default:
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /* Emulate the VMPTRLD instruction */
 static int handle_vmptrld(struct kvm_vcpu *vcpu)
 {
@@ -4580,9 +4747,9 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
-	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
+	[EXIT_REASON_VMREAD]                  = handle_vmread,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
-	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
+	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,


* [PATCH 15/28] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (13 preceding siblings ...)
  2010-12-08 17:07 ` [PATCH 14/28] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
@ 2010-12-08 17:07 ` Nadav Har'El
  2010-12-08 17:08 ` [PATCH 16/28] nVMX: Move register-syncing to a function Nadav Har'El
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:07 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (the vmcs that we
built for L1).

VMREAD/VMWRITE can only access one VMCS at a time (the "current" VMCS), which
makes it difficult for us to read from vmcs01 while writing to vmcs02. This
is why we first make a copy of vmcs01 in memory (vmcs01_fields) and then
read that memory copy while writing to vmcs02.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  409 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 409 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
@@ -805,6 +805,28 @@ static inline bool report_flexpriority(v
 	return flexpriority_enabled;
 }
 
+static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_vmx_tpr_shadow() &&
+		get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_TPR_SHADOW;
+}
+
+static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_secondary_exec_ctrls() &&
+		get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+}
+
+static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
+							   *vcpu)
+{
+	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+		(get_vmcs12_fields(vcpu)->secondary_vm_exec_control &
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -1253,6 +1275,37 @@ static void vmx_load_host_state(struct v
 	preempt_enable();
 }
 
+int load_vmcs_host_state(struct vmcs_fields *src)
+{
+	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
+	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
+	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
+	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
+	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
+	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
+	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
+
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+		vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
+
+	vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
+
+	vmcs_writel(HOST_CR0, src->host_cr0);
+	vmcs_writel(HOST_CR3, src->host_cr3);
+	vmcs_writel(HOST_CR4, src->host_cr4);
+	vmcs_writel(HOST_FS_BASE, src->host_fs_base);
+	vmcs_writel(HOST_GS_BASE, src->host_gs_base);
+	vmcs_writel(HOST_TR_BASE, src->host_tr_base);
+	vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);
+	vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);
+	vmcs_writel(HOST_RSP, src->host_rsp);
+	vmcs_writel(HOST_RIP, src->host_rip);
+	vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
+	vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);
+
+	return 0;
+}
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -5365,6 +5418,362 @@ static void vmx_set_supported_cpuid(u32 
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/*
+ * Make a copy of the current VMCS to ordinary memory. This is needed
+ * because VMX cannot access two VMCSs at the same time, so when we want
+ * to do this (in prepare_vmcs02, which needs to read from vmcs01 while
+ * preparing vmcs02), we need to first save a copy of one VMCS's fields
+ * in memory, and then use that copy.
+ */
+void save_vmcs(struct vmcs_fields *dst)
+{
+	dst->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	dst->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	dst->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	dst->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	dst->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	dst->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	dst->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	dst->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	dst->host_es_selector = vmcs_read16(HOST_ES_SELECTOR);
+	dst->host_cs_selector = vmcs_read16(HOST_CS_SELECTOR);
+	dst->host_ss_selector = vmcs_read16(HOST_SS_SELECTOR);
+	dst->host_ds_selector = vmcs_read16(HOST_DS_SELECTOR);
+	dst->host_fs_selector = vmcs_read16(HOST_FS_SELECTOR);
+	dst->host_gs_selector = vmcs_read16(HOST_GS_SELECTOR);
+	dst->host_tr_selector = vmcs_read16(HOST_TR_SELECTOR);
+	dst->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	dst->io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+	if (cpu_has_vmx_msr_bitmap())
+		dst->msr_bitmap = vmcs_read64(MSR_BITMAP);
+	dst->tsc_offset = vmcs_read64(TSC_OFFSET);
+	dst->virtual_apic_page_addr = vmcs_read64(VIRTUAL_APIC_PAGE_ADDR);
+	dst->apic_access_addr = vmcs_read64(APIC_ACCESS_ADDR);
+	dst->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+	dst->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+	dst->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		dst->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	if (enable_ept) {
+		/* shadow page tables on EPT */
+		dst->ept_pointer = vmcs_read64(EPT_POINTER);
+		dst->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		dst->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		dst->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		dst->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+	dst->pin_based_vm_exec_control = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
+	dst->cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	dst->exception_bitmap = vmcs_read32(EXCEPTION_BITMAP);
+	dst->page_fault_error_code_mask =
+		vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK);
+	dst->page_fault_error_code_match =
+		vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH);
+	dst->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
+	dst->vm_exit_controls = vmcs_read32(VM_EXIT_CONTROLS);
+	dst->vm_entry_controls = vmcs_read32(VM_ENTRY_CONTROLS);
+	dst->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	dst->vm_entry_exception_error_code =
+		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
+	dst->vm_entry_instruction_len = vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
+	dst->tpr_threshold = vmcs_read32(TPR_THRESHOLD);
+	dst->secondary_vm_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	if (enable_vpid && dst->secondary_vm_exec_control &
+	    SECONDARY_EXEC_ENABLE_VPID)
+		dst->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
+	dst->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
+	dst->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	dst->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	dst->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	dst->idt_vectoring_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	dst->idt_vectoring_error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	dst->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	dst->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	dst->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	dst->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	dst->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	dst->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	dst->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	dst->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	dst->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	dst->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	dst->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	dst->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	dst->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	dst->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	dst->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	dst->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	dst->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	dst->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	dst->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	dst->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	dst->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	dst->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	dst->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	dst->host_ia32_sysenter_cs = vmcs_read32(HOST_IA32_SYSENTER_CS);
+	dst->cr0_guest_host_mask = vmcs_readl(CR0_GUEST_HOST_MASK);
+	dst->cr4_guest_host_mask = vmcs_readl(CR4_GUEST_HOST_MASK);
+	dst->cr0_read_shadow = vmcs_readl(CR0_READ_SHADOW);
+	dst->cr4_read_shadow = vmcs_readl(CR4_READ_SHADOW);
+	dst->cr3_target_value0 = vmcs_readl(CR3_TARGET_VALUE0);
+	dst->cr3_target_value1 = vmcs_readl(CR3_TARGET_VALUE1);
+	dst->cr3_target_value2 = vmcs_readl(CR3_TARGET_VALUE2);
+	dst->cr3_target_value3 = vmcs_readl(CR3_TARGET_VALUE3);
+	dst->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	dst->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
+	dst->guest_cr0 = vmcs_readl(GUEST_CR0);
+	dst->guest_cr3 = vmcs_readl(GUEST_CR3);
+	dst->guest_cr4 = vmcs_readl(GUEST_CR4);
+	dst->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	dst->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	dst->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	dst->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	dst->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	dst->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	dst->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	dst->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	dst->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	dst->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+	dst->guest_dr7 = vmcs_readl(GUEST_DR7);
+	dst->guest_rsp = vmcs_readl(GUEST_RSP);
+	dst->guest_rip = vmcs_readl(GUEST_RIP);
+	dst->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+	dst->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	dst->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	dst->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+	dst->host_cr0 = vmcs_readl(HOST_CR0);
+	dst->host_cr3 = vmcs_readl(HOST_CR3);
+	dst->host_cr4 = vmcs_readl(HOST_CR4);
+	dst->host_fs_base = vmcs_readl(HOST_FS_BASE);
+	dst->host_gs_base = vmcs_readl(HOST_GS_BASE);
+	dst->host_tr_base = vmcs_readl(HOST_TR_BASE);
+	dst->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+	dst->host_idtr_base = vmcs_readl(HOST_IDTR_BASE);
+	dst->host_ia32_sysenter_esp = vmcs_readl(HOST_IA32_SYSENTER_ESP);
+	dst->host_ia32_sysenter_eip = vmcs_readl(HOST_IA32_SYSENTER_EIP);
+	dst->host_rsp = vmcs_readl(HOST_RSP);
+	dst->host_rip = vmcs_readl(HOST_RIP);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+		dst->host_ia32_pat = vmcs_read64(HOST_IA32_PAT);
+}
+
+/*
+ * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
+ * with L0's wishes for its guest (vmcs01), so we can run the L2 guest in
+ * a way that is appropriate both to L1's requests and to our own needs.
+ */
+int prepare_vmcs02(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12, struct vmcs_fields *vmcs01)
+{
+	u32 exec_control;
+
+	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
+	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
+	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
+	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
+	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
+	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
+	vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
+	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
+
+	vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
+
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat);
+
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		     vmcs12->vm_entry_intr_info_field);
+	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+		     vmcs12->vm_entry_exception_error_code);
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		     vmcs12->vm_entry_instruction_len);
+
+	vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
+	vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit);
+	vmcs_write32(GUEST_SS_LIMIT, vmcs12->guest_ss_limit);
+	vmcs_write32(GUEST_DS_LIMIT, vmcs12->guest_ds_limit);
+	vmcs_write32(GUEST_FS_LIMIT, vmcs12->guest_fs_limit);
+	vmcs_write32(GUEST_GS_LIMIT, vmcs12->guest_gs_limit);
+	vmcs_write32(GUEST_LDTR_LIMIT, vmcs12->guest_ldtr_limit);
+	vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit);
+	vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit);
+	vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit);
+	vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes);
+	vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
+	vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
+	vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes);
+	vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes);
+	vmcs_write32(GUEST_GS_AR_BYTES, vmcs12->guest_gs_ar_bytes);
+	vmcs_write32(GUEST_LDTR_AR_BYTES, vmcs12->guest_ldtr_ar_bytes);
+	vmcs_write32(GUEST_TR_AR_BYTES, vmcs12->guest_tr_ar_bytes);
+	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+		     vmcs12->guest_interruptibility_info);
+	vmcs_write32(GUEST_ACTIVITY_STATE, vmcs12->guest_activity_state);
+	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs);
+
+	vmcs_writel(GUEST_ES_BASE, vmcs12->guest_es_base);
+	vmcs_writel(GUEST_CS_BASE, vmcs12->guest_cs_base);
+	vmcs_writel(GUEST_SS_BASE, vmcs12->guest_ss_base);
+	vmcs_writel(GUEST_DS_BASE, vmcs12->guest_ds_base);
+	vmcs_writel(GUEST_FS_BASE, vmcs12->guest_fs_base);
+	vmcs_writel(GUEST_GS_BASE, vmcs12->guest_gs_base);
+	vmcs_writel(GUEST_LDTR_BASE, vmcs12->guest_ldtr_base);
+	vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
+	vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
+	vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
+	vmcs_writel(GUEST_DR7, vmcs12->guest_dr7);
+	vmcs_writel(GUEST_RSP, vmcs12->guest_rsp);
+	vmcs_writel(GUEST_RIP, vmcs12->guest_rip);
+	vmcs_writel(GUEST_RFLAGS, vmcs12->guest_rflags);
+	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+		    vmcs12->guest_pending_dbg_exceptions);
+	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp);
+	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip);
+
+	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
+	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
+	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);
+	if (cpu_has_vmx_msr_bitmap())
+		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);
+
+	if (vmcs12->vm_entry_msr_load_count > 0 ||
+			vmcs12->vm_exit_msr_load_count > 0 ||
+			vmcs12->vm_exit_msr_store_count > 0) {
+		printk(KERN_WARNING
+			"%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
+	}
+
+	if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}
+
+	if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->apic_access_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}
+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
+		     (vmcs01->pin_based_vm_exec_control |
+		      vmcs12->pin_based_vm_exec_control));
+
+
+	/*
+	 * Whether page-faults are trapped is determined by a combination of
+	 * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
+	 * If enable_ept, L0 doesn't care about page faults and we should
+	 * set all of these to L1's desires. However, if !enable_ept, L0 does
+	 * care about (at least some) page faults, and because it is not easy
+	 * (if at all possible?) to merge L0 and L1's desires, we simply ask
+	 * to exit on each and every L2 page fault. This is done by setting
+	 * MASK=MATCH=0 and (see below) EB.PF=1.
+	 * Note that below we don't need special code to set EB.PF beyond the
+	 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
+	 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
+	 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
+	 *
+	 * A problem with this approach (when !enable_ept) is that L1 may be
+	 * injected with more page faults than it asked for. This could have
+	 * caused problems, but in practice existing hypervisors don't care.
+	 * To fix this, we will need to emulate the PFEC checking (on the L1
+	 * page tables), using walk_addr(), when injecting PFs to L1.
+	 */
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+		enable_ept ? vmcs12->page_fault_error_code_mask : 0);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+		enable_ept ? vmcs12->page_fault_error_code_match : 0);
+
+	if (cpu_has_secondary_exec_ctrls()) {
+		u32 exec_control = vmcs01->secondary_vm_exec_control;
+		if (nested_cpu_has_secondary_exec_ctrls(vcpu)) {
+			exec_control |= vmcs12->secondary_vm_exec_control;
+			if (!vm_need_virtualize_apic_accesses(vcpu->kvm) ||
+			    !nested_vm_need_virtualize_apic_accesses(vcpu))
+				exec_control &=
+				~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		}
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+	}
+
+	load_vmcs_host_state(vmcs01);
+
+	if (vm_need_tpr_shadow(vcpu->kvm) &&
+	    nested_cpu_has_vmx_tpr_shadow(vcpu))
+		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
+
+	exec_control = vmcs01->cpu_based_vm_exec_control;
+	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+	exec_control &= ~CPU_BASED_TPR_SHADOW;
+	exec_control |= vmcs12->cpu_based_vm_exec_control;
+	if (!vm_need_tpr_shadow(vcpu->kvm) ||
+	    vmcs12->virtual_apic_page_addr == 0) {
+		exec_control &= ~CPU_BASED_TPR_SHADOW;
+#ifdef CONFIG_X86_64
+		exec_control |= CPU_BASED_CR8_STORE_EXITING |
+			CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	} else if (exec_control & CPU_BASED_TPR_SHADOW) {
+#ifdef CONFIG_X86_64
+		exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
+		exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	}
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+
+	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
+	 * bitwise-or of what L1 wants to trap for L2, and what we want to
+	 * trap. However, vmx_fpu_activate/deactivate may have happened after
+	 * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
+	 * and need to base them again on fpu_active. Note that CR0.TS also
+	 * needs updating - we do this after this function returns (in
+	 * nested_vmx_run).
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		     ((vmcs01->exception_bitmap&~(1u<<NM_VECTOR)) |
+		      (vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)) |
+		      vmcs12->exception_bitmap));
+	vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
+			(vcpu->fpu_active ? 0 : X86_CR0_TS));
+	vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
+			(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	vmcs_write32(VM_EXIT_CONTROLS,
+		     (vmcs01->vm_exit_controls &
+			(~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
+		       | vmcs12->vm_exit_controls);
+
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		     (vmcs01->vm_entry_controls &
+			(~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
+		      | vmcs12->vm_entry_controls);
+
+	vmcs_writel(CR4_GUEST_HOST_MASK,
+		    (vmcs01->cr4_guest_host_mask |
+		     vmcs12->cr4_guest_host_mask));
+	vcpu->arch.cr4_guest_owned_bits = ~(vmcs01->cr4_guest_host_mask |
+		vmcs12->cr4_guest_host_mask);
+
+	vmcs_write64(TSC_OFFSET, vmcs01->tsc_offset + vmcs12->tsc_offset);
+
+	if (enable_ept) {
+		/* shadow page tables on EPT */
+		vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
+	}
+	return 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,


* [PATCH 16/28] nVMX: Move register-syncing to a function
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (14 preceding siblings ...)
  2010-12-08 17:07 ` [PATCH 15/28] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2010-12-08 17:08 ` Nadav Har'El
  2010-12-08 17:08 ` [PATCH 17/28] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:08 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Move code that syncs dirty RSP and RIP registers back to the VMCS, into a
function. We will need to call this function from additional places in the
next patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
@@ -5033,6 +5033,15 @@ static void vmx_cancel_injection(struct 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
+{
+	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
+	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	vcpu->arch.regs_dirty = 0;
+}
+
 #ifdef CONFIG_X86_64
 #define R "r"
 #define Q "q"
@@ -5054,10 +5063,7 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return;
 
-	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
-	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	sync_cached_regs_to_vmcs(vcpu);
 
 	/* When single-stepping over STI and MOV SS, we must clear the
 	 * corresponding interruptibility bits in the guest state. Otherwise
@@ -5165,7 +5171,6 @@ static void vmx_vcpu_run(struct kvm_vcpu
 
 	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
 				  | (1 << VCPU_EXREG_PDPTR));
-	vcpu->arch.regs_dirty = 0;
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 


* [PATCH 17/28] nVMX: Implement VMLAUNCH and VMRESUME
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (15 preceding siblings ...)
  2010-12-08 17:08 ` [PATCH 16/28] nVMX: Move register-syncing to a function Nadav Har'El
@ 2010-12-08 17:08 ` Nadav Har'El
  2010-12-08 17:09 ` [PATCH 18/28] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:08 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  235 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 232 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:50.000000000 +0200
@@ -281,6 +281,9 @@ struct __packed vmcs12 {
 	struct vmcs_fields fields;
 
 	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
 };
 
 /*
@@ -315,6 +318,21 @@ struct nested_vmx {
 	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
 	struct list_head vmcs02_list; /* a vmcs_list */
 	int vmcs02_num;
+
+	/* Level 1 state for switching to level 2 and back */
+	struct  {
+		u64 efer;
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		int cpu;
+		int launched;
+	} l1_state;
+	/* Saving the VMCS that we used for running L1 */
+	struct vmcs *vmcs01;
+	struct vmcs_fields *vmcs01_fields;
+	/* Saving some vcpu->arch.* data we had for L1, while running L2 */
+	unsigned long l1_arch_cr3;
 };
 
 struct vcpu_vmx {
@@ -1344,6 +1362,16 @@ static void vmx_vcpu_load(struct kvm_vcp
 
 		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
+
+		if (vmx->nested.vmcs01_fields != NULL) {
+			struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
+			vmcs01->host_tr_base = vmcs_readl(HOST_TR_BASE);
+			vmcs01->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+			vmcs01->host_ia32_sysenter_esp =
+				vmcs_readl(HOST_IA32_SYSENTER_ESP);
+			if (is_guest_mode(vcpu))
+				load_vmcs_host_state(vmcs01);
+		}
 	}
 }
 
@@ -2173,6 +2201,9 @@ static void free_l1_state(struct kvm_vcp
 		kfree(list_item);
 	}
 	vmx->nested.vmcs02_num = 0;
+
+	kfree(vmx->nested.vmcs01_fields);
+	vmx->nested.vmcs01_fields = NULL;
 }
 
 static void free_kvm_area(void)
@@ -4326,6 +4357,10 @@ static int handle_vmon(struct kvm_vcpu *
 	INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
 	vmx->nested.vmcs02_num = 0;
 
+	vmx->nested.vmcs01_fields = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!vmx->nested.vmcs01_fields)
+		return -ENOMEM;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4524,6 +4559,50 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	/* yet another strange prerequisite listed in the VMX spec */
+	if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
+			GUEST_INTR_STATE_MOV_SS) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (to_vmx(vcpu)->nested.current_vmcs12->launch_state == launch) {
+		/* Must use VMLAUNCH for the first time, VMRESUME later */
+		nested_vmx_failValid(vcpu,
+			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS :
+				 VMXERR_VMRESUME_NONLAUNCHED_VMCS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	skip_emulated_instruction(vcpu);
+
+	nested_vmx_run(vcpu);
+	return 1;
+}
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	return handle_launch_or_resume(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+	return handle_launch_or_resume(vcpu, false);
+}
+
 enum vmcs_field_type {
 	VMCS_FIELD_TYPE_U16 = 0,
 	VMCS_FIELD_TYPE_U64 = 1,
@@ -4797,11 +4876,11 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
-	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
+	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmread,
-	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
+	[EXIT_REASON_VMRESUME]                = handle_vmresume,
 	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
@@ -4870,7 +4949,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);
 
-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (!is_guest_mode(vcpu) &&
+	    unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
@@ -5779,6 +5859,155 @@ int prepare_vmcs02(struct kvm_vcpu *vcpu
 	return 0;
 }
 
+
+
+/*
+ * Return the cr0 value that a guest would read. This is a combination of
+ * the real cr0 used to run the guest (guest_cr0), and the bits shadowed by
+ * the hypervisor (cr0_read_shadow).
+ */
+static inline unsigned long guest_readable_cr0(struct vmcs_fields *fields)
+{
+	return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
+		(fields->cr0_read_shadow & fields->cr0_guest_host_mask);
+}
+static inline unsigned long guest_readable_cr4(struct vmcs_fields *fields)
+{
+	return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
+		(fields->cr4_read_shadow & fields->cr4_guest_host_mask);
+}
+static inline void set_cr3_and_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
+{
+	vcpu->arch.cr3 = cr3;
+	vmcs_writel(GUEST_CR3, cr3);
+	load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3);
+	vmcs_write64(GUEST_PDPTR0, vcpu->arch.mmu.pdptrs[0]);
+	vmcs_write64(GUEST_PDPTR1, vcpu->arch.mmu.pdptrs[1]);
+	vmcs_write64(GUEST_PDPTR2, vcpu->arch.mmu.pdptrs[2]);
+	vmcs_write64(GUEST_PDPTR3, vcpu->arch.mmu.pdptrs[3]);
+}
+
+static int nested_vmx_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	enter_guest_mode(vcpu);
+	sync_cached_regs_to_vmcs(vcpu);
+	save_vmcs(vmx->nested.vmcs01_fields);
+
+	vmx->nested.l1_state.efer = vcpu->arch.efer;
+	/* arch.cr3 - the guest's original page table (not the shadow)
+	   needs to be saved. */
+	vmx->nested.l1_arch_cr3 = vcpu->arch.cr3;
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
+	else
+		vmx->nested.l1_state.msr_bitmap = 0;
+
+	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+	vmx->nested.vmcs01 = vmx->vmcs;
+	vmx->nested.l1_state.cpu = vcpu->cpu;
+	vmx->nested.l1_state.launched = vmx->launched;
+
+	vmx->vmcs = nested_get_current_vmcs(vcpu);
+	if (!vmx->vmcs) {
+		printk(KERN_ERR "Missing VMCS\n");
+		nested_vmx_failValid(vcpu, VMXERR_VMRESUME_CORRUPTED_VMCS);
+		return 1;
+	}
+
+	vcpu->cpu = vmx->nested.current_vmcs12->cpu;
+	vmx->launched = vmx->nested.current_vmcs12->launched;
+
+	if (!vmx->nested.current_vmcs12->launch_state || !vmx->launched) {
+		vmcs_clear(vmx->vmcs);
+		vmx->launched = 0;
+		vmx->nested.current_vmcs12->launch_state = 1;
+	}
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	prepare_vmcs02(vcpu,
+		get_vmcs12_fields(vcpu), vmx->nested.vmcs01_fields);
+
+	if (get_vmcs12_fields(vcpu)->vm_entry_controls &
+	    VM_ENTRY_IA32E_MODE) {
+		if (!((vcpu->arch.efer & EFER_LMA) &&
+		      (vcpu->arch.efer & EFER_LME)))
+			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	} else {
+		if ((vcpu->arch.efer & EFER_LMA) ||
+		    (vcpu->arch.efer & EFER_LME))
+			vcpu->arch.efer = 0;
+	}
+
+	vmx->rmode.vm86_active =
+		!(get_vmcs12_fields(vcpu)->cr0_read_shadow & X86_CR0_PE);
+
+	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
+	 * dictated, and takes appropriate actions for special cr0 bits (like
+	 * real mode, etc.).
+	 */
+	vmx_set_cr0(vcpu, guest_readable_cr0(get_vmcs12_fields(vcpu)));
+
+	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
+	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
+	 * the latter with TS added if !fpu_active. We need to take the
+	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
+	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
+	 * restoration of fpu registers until the FPU is really used).
+	 */
+	vmcs_writel(GUEST_CR0, get_vmcs12_fields(vcpu)->guest_cr0 |
+		(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	/* we have to set the X86_CR0_PG bit of the cached cr0, because
+	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
+	 * CR0 (we need the paging so that KVM treats this guest as a paging
+	 * guest, so we can easily forward page faults to L1.)
+	 */
+	vcpu->arch.cr0 |= X86_CR0_PG;
+
+	if (enable_ept) {
+		/* shadow page tables on EPT */
+		vcpu->arch.cr4 = guest_readable_cr4(get_vmcs12_fields(vcpu));
+		vmcs_writel(CR4_READ_SHADOW, vcpu->arch.cr4);
+		vmcs_writel(GUEST_CR4, get_vmcs12_fields(vcpu)->guest_cr4);
+		set_cr3_and_pdptrs(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
+	} else {
+		/* shadow page tables on shadow page tables */
+		vmx_set_cr4(vcpu, get_vmcs12_fields(vcpu)->guest_cr4);
+		vmcs_writel(CR4_READ_SHADOW,
+			    get_vmcs12_fields(vcpu)->cr4_read_shadow);
+		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
+		kvm_mmu_reset_context(vcpu);
+
+		if (unlikely(kvm_mmu_load(vcpu))) {
+			nested_vmx_failValid(vcpu,
+				VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
+			/* switch back to L1 */
+			leave_guest_mode(vcpu);
+			vmx->vmcs = vmx->nested.vmcs01;
+			vcpu->cpu = vmx->nested.l1_state.cpu;
+			vmx->launched = vmx->nested.l1_state.launched;
+
+			vmx_vcpu_load(vcpu, get_cpu());
+			put_cpu();
+
+			return 1;
+		}
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+			   get_vmcs12_fields(vcpu)->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+			   get_vmcs12_fields(vcpu)->guest_rip);
+
+	return 1;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,


* [PATCH 18/28] nVMX: No need for handle_vmx_insn function any more
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (16 preceding siblings ...)
  2010-12-08 17:08 ` [PATCH 17/28] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2010-12-08 17:09 ` Nadav Har'El
  2010-12-08 17:09 ` [PATCH 19/28] nVMX: Exiting from L2 to L1 Nadav Har'El
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:09 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume, vmwrite,
vmon, vmoff) was handle_vmx_insn(). This handler simply threw a #UD
exception. Now that all these exit reasons are properly handled (emulating
the relevant VMX instruction), nothing calls this dummy handler and it can
be removed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    6 ------
 1 file changed, 6 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -4025,12 +4025,6 @@ static int handle_vmcall(struct kvm_vcpu
 	return 1;
 }
 
-static int handle_vmx_insn(struct kvm_vcpu *vcpu)
-{
-	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
-}
-
 static int handle_invd(struct kvm_vcpu *vcpu)
 {
 	return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;


* [PATCH 19/28] nVMX: Exiting from L2 to L1
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (17 preceding siblings ...)
  2010-12-08 17:09 ` [PATCH 18/28] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
@ 2010-12-08 17:09 ` Nadav Har'El
  2010-12-09 12:55   ` Avi Kivity
  2010-12-08 17:10 ` [PATCH 20/28] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
                   ` (9 subsequent siblings)
  28 siblings, 1 reply; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:09 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; in that
case, L0 handles the exit and resumes L2, without running L1 and without
calling nested_vmx_vmexit(). The logic for deciding whether a particular exit
should be handled in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
appears in the next patch.
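
One subtle piece of this exit path is recomputing vmcs12.guest_cr0. A user-space sketch (hypothetical helper, mirroring the vmcs12_guest_cr0() formula in the patch below) of how the combined mask decides which bits are read from the live GUEST_CR0 and which from the read shadow:

```c
#include <assert.h>

/*
 * Sketch of the guest_cr0 merge done on a nested exit. The combined
 * mask (L0's cr0_guest_owned_bits | L1's cr0_guest_host_mask, as in
 * the patch) selects which bits come from the live GUEST_CR0; all
 * remaining bits are taken from CR0_READ_SHADOW, where L1's view of
 * them is kept.
 */
static unsigned long merge_guest_cr0(unsigned long guest_cr0,
				     unsigned long cr0_read_shadow,
				     unsigned long cr0_guest_owned_bits,
				     unsigned long cr0_guest_host_mask)
{
	unsigned long guest_cr0_bits =
		cr0_guest_owned_bits | cr0_guest_host_mask;

	return (guest_cr0 & guest_cr0_bits) |
	       (cr0_read_shadow & ~guest_cr0_bits);
}
```

The same pattern, with the CR4 fields, gives vmcs12.guest_cr4.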

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  233 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 233 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -5092,6 +5092,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (is_guest_mode(&vmx->vcpu))
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -6002,6 +6004,237 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
+ * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr0_bits =
+		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+		(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr4_bits =
+		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+		(vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the changes to the guest state while L2 was
+ * running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* update guest state fields: */
+	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+	vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
+	vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
+	vmcs12->guest_rip = vmcs_readl(GUEST_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	/* TODO: These cannot have changed unless we have MSR bitmaps and
+	 * the relevant bit asks not to trap the change */
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+
+	/* update exit information fields: */
+
+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
+	/* clear vm-entry fields which are to be cleared on exit */
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+}
+
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int efer_offset;
+	struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
+
+	if (!is_guest_mode(vcpu)) {
+		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
+		       __func__);
+		return 0;
+	}
+
+	sync_cached_regs_to_vmcs(vcpu);
+
+	prepare_vmcs12(vcpu);
+
+	if (is_interrupt)
+		get_vmcs12_fields(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+
+	vmx->nested.current_vmcs12->launched = vmx->launched;
+	vmx->nested.current_vmcs12->cpu = vcpu->cpu;
+
+	vmx->vmcs = vmx->nested.vmcs01;
+	vcpu->cpu = vmx->nested.l1_state.cpu;
+	vmx->launched = vmx->nested.l1_state.launched;
+
+	leave_guest_mode(vcpu);
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	vcpu->arch.efer = vmx->nested.l1_state.efer;
+	if ((vcpu->arch.efer & EFER_LMA) &&
+	    !(vcpu->arch.efer & EFER_SCE))
+		vcpu->arch.efer |= EFER_SCE;
+
+	efer_offset = __find_msr_index(vmx, MSR_EFER);
+	if (update_transition_efer(vmx, efer_offset))
+		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);
+
+	/*
+	 * L2 perhaps switched to real mode and set vmx->rmode, but we're back
+	 * in L1 and as it is running VMX, it can't be in real mode.
+	 */
+	vmx->rmode.vm86_active = 0;
+
+	/*
+	 * If L1 set the HOST_* fields in the VMCS, when exiting from L2 to L1
+	 * we need to return those, not L1's old values.
+	 */
+	vmcs_writel(GUEST_RIP, get_vmcs12_fields(vcpu)->host_rip);
+	vmcs_writel(GUEST_RSP, get_vmcs12_fields(vcpu)->host_rsp);
+	vmcs01->cr0_read_shadow = get_vmcs12_fields(vcpu)->host_cr0;
+
+	/*
+	 * We're running a regular L1 guest again, so we do the regular KVM
+	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has.
+	 * vmx_set_cr0 might use slightly different bits on the new guest_cr0
+	 * it sets, e.g., add TS when !fpu_active.
+	 * Note that vmx_set_cr0 refers to rmode and efer set above.
+	 */
+	vmx_set_cr0(vcpu, guest_readable_cr0(vmcs01));
+	/*
+	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need to
+	 * apply the same changes to L1's vmcs. We just set cr0 correctly, but
+	 * now we need to also update cr0_guest_host_mask and exception_bitmap.
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		(vmcs01->exception_bitmap & ~(1u<<NM_VECTOR)) |
+			(vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)));
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	vmx_set_cr4(vcpu, guest_readable_cr4(vmcs01));
+	vcpu->arch.cr4_guest_owned_bits = ~vmcs01->cr4_guest_host_mask;
+
+	if (enable_ept) {
+		/* shadow page tables on EPT: */
+		set_cr3_and_pdptrs(vcpu, get_vmcs12_fields(vcpu)->host_cr3);
+	} else {
+		/* shadow page tables on shadow page tables: */
+		kvm_set_cr3(vcpu, vmx->nested.l1_arch_cr3);
+		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_load(vcpu);
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs01->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs01->guest_rip);
+
+	if (unlikely(vmx->fail)) {
+		/*
+		 * When L1 launches L2 and then we (L0) fail to launch L2,
+		 * we nested_vmx_vmexit back to L1, but now should let it know
+		 * that the VMLAUNCH failed - with the same error that we
+		 * got when launching L2.
+		 */
+		vmx->fail = 0;
+		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
+	} else
+		nested_vmx_succeed(vcpu);
+
+	return 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,


* [PATCH 20/28] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (18 preceding siblings ...)
  2010-12-08 17:09 ` [PATCH 19/28] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2010-12-08 17:10 ` Nadav Har'El
  2010-12-08 17:10 ` [PATCH 21/28] nVMX: Correct handling of interrupt injection Nadav Har'El
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:10 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle an exit only if it actually asked to
trap that sort of event. For example, when L2 exits on a change to CR0, we
check L1's CR0_GUEST_HOST_MASK to see whether L1 expressed interest in any
bit which changed; if it did, we exit to L1. If it didn't, then it was we
(L0) who wished to trap this event, so we handle it ourselves.
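
For the CR0 case, that test reduces to a single mask-and-xor; a standalone sketch (hypothetical helper, not the actual KVM function) of the condition used in the patch below:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * L1 should see a mov-to-cr0 exit only if a bit it trapped (set in its
 * cr0_guest_host_mask) actually differs between the value L2 tried to
 * write and the read shadow L1 last established.
 */
static bool l1_intercepts_cr0_write(unsigned long cr0_guest_host_mask,
				    unsigned long cr0_read_shadow,
				    unsigned long new_val)
{
	return (cr0_guest_host_mask & (new_val ^ cr0_read_shadow)) != 0;
}
```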

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
and therefore expects L2 to be run (and perhaps to be injected with an event it
specified, etc.). nested_run_pending is especially intended to avoid switching
to L1 at the injection decision point described above.
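
Schematically (hypothetical helper, not the full exit-reason switch in the patch), nested_run_pending overrides the reflection decision like this:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether an L2 exit is reflected to L1. nested_run_pending
 * wins unconditionally: right after L1's VMLAUNCH/VMRESUME we must run
 * L2 at least once, regardless of what L1 asked to intercept.
 */
static bool reflect_exit_to_l1(bool nested_run_pending, bool l1_intercepts)
{
	if (nested_run_pending)
		return false;	/* must resume L2 next */
	return l1_intercepts;
}
```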

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  217 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 217 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -333,6 +333,8 @@ struct nested_vmx {
 	struct vmcs_fields *vmcs01_fields;
 	/* Saving some vcpu->arch.* data we had for L1, while running L2 */
 	unsigned long l1_arch_cr3;
+	/* L2 must run next, and mustn't decide to exit to L1. */
+	bool nested_run_pending;
 };
 
 struct vcpu_vmx {
@@ -845,6 +847,20 @@ static inline bool nested_vm_need_virtua
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 }
 
+static inline bool nested_cpu_has_vmx_msr_bitmap(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_USE_MSR_BITMAPS;
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt);
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -4894,6 +4910,195 @@ static int (*kvm_vmx_exit_handlers[])(st
 static const int kvm_vmx_max_exit_handlers =
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an MSR access,
+ * rather than handle it ourselves in L0. I.e., check L1's MSR bitmap whether
+ * it expressed interest in the current event (read or write a specific MSR).
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12, u32 exit_reason)
+{
+	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	struct page *msr_bitmap_page;
+	void *va;
+	bool ret;
+
+	if (!cpu_has_vmx_msr_bitmap() || !nested_cpu_has_vmx_msr_bitmap(vcpu))
+		return 1;
+
+	msr_bitmap_page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!msr_bitmap_page) {
+		printk(KERN_INFO "%s error in nested_get_page\n", __func__);
+		return 0;
+	}
+
+	va = kmap_atomic(msr_bitmap_page, KM_USER1);
+	if (exit_reason == EXIT_REASON_MSR_WRITE)
+		va += 0x800;
+	if (msr_index >= 0xc0000000) {
+		msr_index -= 0xc0000000;
+		va += 0x400;
+	}
+	if (msr_index > 0x1fff) {
+		kunmap_atomic(va, KM_USER1);
+		return 0;
+	}
+	ret = test_bit(msr_index, va);
+	kunmap_atomic(va, KM_USER1);
+	return ret;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (vmcs12->cr0_guest_host_mask &
+			    (val ^ vmcs12->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if ((vmcs12->cr3_target_count >= 1 &&
+					vmcs12->cr3_target_value0 == val) ||
+				(vmcs12->cr3_target_count >= 2 &&
+					vmcs12->cr3_target_value1 == val) ||
+				(vmcs12->cr3_target_count >= 3 &&
+					vmcs12->cr3_target_value2 == val) ||
+				(vmcs12->cr3_target_count >= 4 &&
+					vmcs12->cr3_target_value3 == val))
+				return 0;
+			if (nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+				(vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_CR3_LOAD_EXITING)){
+				return 1;
+			}
+			break;
+		case 4:
+			if (vmcs12->cr4_guest_host_mask &
+			    (vmcs12->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+				(vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_CR8_LOAD_EXITING))
+				return 1;
+			/*
+			 * TODO: missing else if control & CPU_BASED_TPR_SHADOW
+			 * then set tpr shadow and if below tpr_threshold, exit.
+			 */
+			break;
+		}
+		break;
+	case 2: /* clts */
+		if (vmcs12->cr0_guest_host_mask & X86_CR0_TS)
+			return 1;
+		break;
+	case 1: /* mov from cr */
+		switch (cr) {
+		case 0:
+			return 1;
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_STORE_EXITING)
+				return 1;
+			break;
+		case 4:
+			return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_STORE_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 3: /* lmsw */
+		/*
+		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
+		 * cr0. Other attempted changes are ignored, with no exit.
+		 */
+		if (vmcs12->cr0_guest_host_mask & 0xe &
+		    (val ^ vmcs12->cr0_read_shadow))
+			return 1;
+		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
+		    !(vmcs12->cr0_read_shadow & 0x1) &&
+		    (val & 0x1))
+			return 1;
+		break;
+	}
+	return 0;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0 (and then continue L2). Only call this
+ * when in is_guest_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
+{
+	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	switch (exit_reason) {
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return 0;
+	case EXIT_REASON_EXCEPTION_NMI:
+		if (!is_exception(intr_info))
+			return 0;
+		else if (is_page_fault(intr_info))
+			return enable_ept;
+		return vmcs12->exception_bitmap &
+				(1u << (intr_info & INTR_INFO_VECTOR_MASK));
+	case EXIT_REASON_EPT_VIOLATION:
+		return 0;
+	case EXIT_REASON_INVLPG:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_INVLPG_EXITING;
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
+	case EXIT_REASON_CR_ACCESS:
+		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
+	case EXIT_REASON_DR_ACCESS:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_MOV_DR_EXITING;
+	default:
+		/*
+		 * One particularly interesting case that is covered here is an
+		 * exit caused by L2 running a VMX instruction. L2 is guest
+		 * mode in L1's world, and according to the VMX spec running a
+		 * VMX instruction in guest mode should cause an exit to root
+		 * mode, i.e., to L1. This is why we need to return r=1 for
+		 * those exit reasons too. This enables further nesting: Like
+		 * L0 emulates VMX for L1, we now allow L1 to emulate VMX for
+		 * L2, who will then be able to run L3.
+		 */
+		return 1;
+	}
+}
+
 static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
 {
 	*info1 = vmcs_readl(EXIT_QUALIFICATION);
@@ -4921,6 +5126,17 @@ static int vmx_handle_exit(struct kvm_vc
 	if (enable_ept && is_paging(vcpu))
 		vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
 
+	if (exit_reason == EXIT_REASON_VMLAUNCH ||
+	    exit_reason == EXIT_REASON_VMRESUME)
+		vmx->nested.nested_run_pending = 1;
+	else
+		vmx->nested.nested_run_pending = 0;
+
+	if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) {
+		nested_vmx_vmexit(vcpu, false);
+		return 1;
+	}
+
 	if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
 		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 		vcpu->run->fail_entry.hardware_entry_failure_reason
@@ -5981,6 +6197,7 @@ static int nested_vmx_run(struct kvm_vcp
 		kvm_mmu_reset_context(vcpu);
 
 		if (unlikely(kvm_mmu_load(vcpu))) {
+			nested_vmx_vmexit(vcpu, false);
 			nested_vmx_failValid(vcpu,
 				VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
 			/* switch back to L1 */

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 21/28] nVMX: Correct handling of interrupt injection
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (19 preceding siblings ...)
  2010-12-08 17:10 ` [PATCH 20/28] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2010-12-08 17:10 ` Nadav Har'El
  2010-12-08 17:11 ` [PATCH 22/28] nVMX: Correct handling of exception injection Nadav Har'El
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:10 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, injecting once it no longer
does, using the "interrupt window" VMX mechanism), and setting up the
appropriate VMCS fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, namely the one we were injecting. Only when L1 asked not to be
notified of interrupts should we inject the interrupt directly into the
running L2 guest (i.e., take the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to make at this stage.
Instead we do something simpler and less efficient: we modify
vmx_interrupt_allowed(), which KVM calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing with the normal code. The normal
KVM code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -3462,9 +3462,25 @@ out:
 	return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12_fields(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3578,6 +3594,13 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu, true);
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5126,6 +5149,14 @@ static int vmx_handle_exit(struct kvm_vc
 	if (enable_ept && is_paging(vcpu))
 		vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
 
+	/*
+	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+	 * we did not inject a still-pending event to L1 now because of
+	 * nested_run_pending, we need to re-enable this bit.
+	 */
+	if (vmx->nested.nested_run_pending)
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	if (exit_reason == EXIT_REASON_VMLAUNCH ||
 	    exit_reason == EXIT_REASON_VMRESUME)
 		vmx->nested.nested_run_pending = 1;
@@ -5317,6 +5348,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu))
+		return;
 	__vmx_complete_interrupts(to_vmx(vcpu),
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,


* [PATCH 22/28] nVMX: Correct handling of exception injection
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (20 preceding siblings ...)
  2010-12-08 17:10 ` [PATCH 21/28] nVMX: Correct handling of interrupt injection Nadav Har'El
@ 2010-12-08 17:11 ` Nadav Har'El
  2010-12-08 17:11 ` [PATCH 23/28] nVMX: Correct handling of idt vectoring info Nadav Har'El
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:11 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -1491,6 +1491,25 @@ static void skip_emulated_instruction(st
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+/*
+ * KVM wants to re-inject into the guest page faults that it intercepted.
+ * This function checks whether, in a nested guest, we need to inject them
+ * into L1 or L2. It assumes it is called with the exit reason in vmcs02
+ * being a #PF exception (this is the only case in which KVM injects a #PF
+ * when L2 is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+	if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
+		return 0;
+
+	nested_vmx_vmexit(vcpu, false);
+	return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1498,6 +1517,10 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nr == PF_VECTOR && is_guest_mode(vcpu) &&
+		nested_pf_handled(vcpu))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3533,6 +3556,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu))
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon


* [PATCH 23/28] nVMX: Correct handling of idt vectoring info
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (21 preceding siblings ...)
  2010-12-08 17:11 ` [PATCH 22/28] nVMX: Correct handling of exception injection Nadav Har'El
@ 2010-12-08 17:11 ` Nadav Har'El
  2010-12-08 17:12 ` [PATCH 24/28] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:11 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while handling an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info is examined after the
exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case the decision whether to return
to L2 or to L1 is also made during the injection phase (see the previous
patches), so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., at the beginning of
vmx_vcpu_run, which is the first time we know for sure whether we're staying
in L2 (i.e., is_guest_mode(vcpu) is true).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -335,6 +335,10 @@ struct nested_vmx {
 	unsigned long l1_arch_cr3;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
+	/* true if last exit was of L2, and had a valid idt_vectoring_info */
+	bool valid_idt_vectoring_info;
+	/* These are saved if valid_idt_vectoring_info */
+	u32 vm_exit_instruction_len, idt_vectoring_error_code;
 };
 
 struct vcpu_vmx {
@@ -5384,6 +5388,22 @@ static void vmx_cancel_injection(struct 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
+{
+	int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+	int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+	int err_code_valid = vmx->idt_vectoring_info &
+		VECTORING_INFO_DELIVER_CODE_MASK;
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		irq | type | INTR_INFO_VALID_MASK | err_code_valid);
+
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		vmx->nested.vm_exit_instruction_len);
+	if (err_code_valid)
+		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+			vmx->nested.idt_vectoring_error_code);
+}
+
 static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
 {
 	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
@@ -5405,6 +5425,9 @@ static void vmx_vcpu_run(struct kvm_vcpu
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu) && vmx->nested.valid_idt_vectoring_info)
+		nested_handle_valid_idt_vectoring_info(vmx);
+
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
 		vmx->entry_time = ktime_get();
@@ -5525,6 +5548,15 @@ static void vmx_vcpu_run(struct kvm_vcpu
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
+	vmx->nested.valid_idt_vectoring_info = is_guest_mode(vcpu) &&
+		(vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
+	if (vmx->nested.valid_idt_vectoring_info) {
+		vmx->nested.vm_exit_instruction_len =
+			vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+		vmx->nested.idt_vectoring_error_code =
+			vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	}
+
 	asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 	vmx->launched = 1;
 


* [PATCH 24/28] nVMX: Handling of CR0 and CR4 modifying instructions
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (22 preceding siblings ...)
  2010-12-08 17:11 ` [PATCH 23/28] nVMX: Correct handling of idt vectoring info Nadav Har'El
@ 2010-12-08 17:12 ` Nadav Har'El
  2010-12-09 13:19   ` Avi Kivity
  2010-12-08 17:12 ` [PATCH 25/28] nVMX: Further fixes for lazy FPU loading Nadav Har'El
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:12 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
previous patch).
When L2 modifies bits that L1 doesn't care about, we let it think (via
CR[04]_READ_SHADOW) that it did these modifications, while only changing
(in GUEST_CR[04]) the bits that L0 doesn't shadow.

This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may
want to leave TS on, while pretending to allow the guest to change it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   54 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 51 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
@@ -3877,6 +3877,54 @@ static void complete_insn_gp(struct kvm_
 		skip_emulated_instruction(vcpu);
 }
 
+/* called to set cr0 as approriate for a mov-to-cr0 exit. */
+static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * We get here when L2 changed cr0 in a way that did not change
+		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
+		 * but did change L0 shadowed bits. This can currently happen
+		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
+		 * loading) while pretending to allow the guest to change it.
+		 */
+		vmcs_writel(GUEST_CR0,
+		   (val & vcpu->arch.cr0_guest_owned_bits) |
+		   (vmcs_readl(GUEST_CR0) & ~vcpu->arch.cr0_guest_owned_bits));
+		vmcs_writel(CR0_READ_SHADOW, val);
+		vcpu->arch.cr0 = val;
+		return 0;
+	} else
+		return kvm_set_cr0(vcpu, val);
+}
+
+static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_guest_mode(vcpu)) {
+		vmcs_writel(GUEST_CR4,
+		  (val & vcpu->arch.cr4_guest_owned_bits) |
+		  (vmcs_readl(GUEST_CR4) & ~vcpu->arch.cr4_guest_owned_bits));
+		vmcs_writel(CR4_READ_SHADOW, val);
+		vcpu->arch.cr4 = val;
+		return 0;
+	} else
+		return kvm_set_cr4(vcpu, val);
+}
+
+
+/* called to set cr0 as approriate for clts instruction exit. */
+static void handle_clts(struct kvm_vcpu *vcpu)
+{
+	if (is_guest_mode(vcpu)) {
+		/* As in handle_set_cr0(), we can't call vmx_set_cr0 here */
+		vmcs_writel(GUEST_CR0, vmcs_readl(GUEST_CR0) & ~X86_CR0_TS);
+		vmcs_writel(CR0_READ_SHADOW,
+			vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
+		vcpu->arch.cr0 &= ~X86_CR0_TS;
+	} else
+		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification, val;
@@ -3893,7 +3941,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		trace_kvm_cr_write(cr, val);
 		switch (cr) {
 		case 0:
-			err = kvm_set_cr0(vcpu, val);
+			err = handle_set_cr0(vcpu, val);
 			complete_insn_gp(vcpu, err);
 			return 1;
 		case 3:
@@ -3901,7 +3949,7 @@ static int handle_cr(struct kvm_vcpu *vc
 			complete_insn_gp(vcpu, err);
 			return 1;
 		case 4:
-			err = kvm_set_cr4(vcpu, val);
+			err = handle_set_cr4(vcpu, val);
 			complete_insn_gp(vcpu, err);
 			return 1;
 		case 8: {
@@ -3919,7 +3967,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		};
 		break;
 	case 2: /* clts */
-		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+		handle_clts(vcpu);
 		trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
 		skip_emulated_instruction(vcpu);
 		vmx_fpu_activate(vcpu);


* [PATCH 25/28] nVMX: Further fixes for lazy FPU loading
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (23 preceding siblings ...)
  2010-12-08 17:12 ` [PATCH 24/28] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
@ 2010-12-08 17:12 ` Nadav Har'El
  2010-12-09 13:05   ` Avi Kivity
  2010-12-08 17:13 ` [PATCH 26/28] nVMX: Additional TSC-offset handling Nadav Har'El
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:12 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

KVM's "lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
#NM exceptions, even if we have a guest hypervisor (L1) that didn't want
these traps. And conversely: if L1 wanted to trap these events, we must let
it, even if L0 is not interested in them.

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
and L1's needs. Note that handle_cr() was already fixed in the previous patch,
and that new code introduced in earlier patches already handles CR0
correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:52.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:52.000000000 +0200
@@ -1098,6 +1098,15 @@ static void update_exception_bitmap(stru
 		eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
 	if (vcpu->fpu_active)
 		eb &= ~(1u << NM_VECTOR);
+
+	/* When we are running a nested L2 guest and L1 specified for it a
+	 * certain exception bitmap, we must trap the same exceptions and pass
+	 * them to L1. When running L2, we will only handle the exceptions
+	 * specified above if L1 did not want them.
+	 */
+	if (is_guest_mode(vcpu))
+		eb |= get_vmcs12_fields(vcpu)->exception_bitmap;
+
 	vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -1415,8 +1424,19 @@ static void vmx_fpu_activate(struct kvm_
 	cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
 	cr0 |= kvm_read_cr0_bits(vcpu, X86_CR0_TS | X86_CR0_MP);
 	vmcs_writel(GUEST_CR0, cr0);
-	update_exception_bitmap(vcpu);
 	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+	if (is_guest_mode(vcpu)) {
+		/* While we (L0) no longer care about NM exceptions or cr0.TS
+		 * changes, our guest hypervisor (L1) might care in which case
+		 * we must trap them for it.
+		 */
+		u32 eb = vmcs_read32(EXCEPTION_BITMAP) & ~(1u << NM_VECTOR);
+		struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+		eb |= vmcs12->exception_bitmap;
+		vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
+		vmcs_write32(EXCEPTION_BITMAP, eb);
+	} else
+		update_exception_bitmap(vcpu);
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 }
 
@@ -1424,12 +1444,24 @@ static void vmx_decache_cr0_guest_bits(s
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
+	/* Note that there is no vcpu->fpu_active = 0 here. The caller must
+	 * set this *before* calling this function.
+	 */
 	vmx_decache_cr0_guest_bits(vcpu);
 	vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
-	update_exception_bitmap(vcpu);
+	vmcs_write32(EXCEPTION_BITMAP,
+		vmcs_read32(EXCEPTION_BITMAP) | (1u << NM_VECTOR));
 	vcpu->arch.cr0_guest_owned_bits = 0;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
-	vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
+	if (is_guest_mode(vcpu))
+		/* Unfortunately in nested mode we play with arch.cr0's PG
+		 * bit, so we mustn't copy it all, just the relevant TS bit
+		 */
+		vmcs_writel(CR0_READ_SHADOW,
+			(vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS) |
+			(vcpu->arch.cr0 & X86_CR0_TS));
+	else
+		vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)


* [PATCH 26/28] nVMX: Additional TSC-offset handling
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (24 preceding siblings ...)
  2010-12-08 17:12 ` [PATCH 25/28] nVMX: Further fixes for lazy FPU loading Nadav Har'El
@ 2010-12-08 17:13 ` Nadav Har'El
  2010-12-08 17:13 ` [PATCH 27/28] nVMX: Miscellaneous small corrections Nadav Har'El
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:13 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In the unlikely case that L1 does not trap writes to MSR_IA32_TSC, L0 needs
to emulate such a write by L2 by modifying vmcs02.tsc_offset.
We also need to set vmcs12.tsc_offset, for this change to survive the next
nested entry (see prepare_vmcs02()).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:52.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:52.000000000 +0200
@@ -1665,12 +1665,23 @@ static u64 guest_read_tsc(void)
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
+	if (is_guest_mode(vcpu))
+		/*
+		 * We are only changing TSC_OFFSET when L2 is running if for
+		 * some reason L1 chose not to trap the TSC MSR. Since
+		 * prepare_vmcs12() does not copy tsc_offset, we need to also
+		 * set the vmcs12 field here.
+		 */
+		get_vmcs12_fields(vcpu)->tsc_offset = offset -
+			to_vmx(vcpu)->nested.vmcs01_fields->tsc_offset;
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 {
 	u64 offset = vmcs_read64(TSC_OFFSET);
 	vmcs_write64(TSC_OFFSET, offset + adjustment);
+	if (is_guest_mode(vcpu))
+		get_vmcs12_fields(vcpu)->tsc_offset += adjustment;
 }
 
 /*


* [PATCH 27/28] nVMX: Miscellaneous small corrections
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (25 preceding siblings ...)
  2010-12-08 17:13 ` [PATCH 26/28] nVMX: Additional TSC-offset handling Nadav Har'El
@ 2010-12-08 17:13 ` Nadav Har'El
  2010-12-08 17:14 ` [PATCH 28/28] nVMX: Documentation Nadav Har'El
  2010-12-09 12:44 ` [PATCH 0/28] nVMX: Nested VMX, v7 Avi Kivity
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:13 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:52.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:52.000000000 +0200
@@ -933,7 +933,7 @@ static void vmcs_load(struct vmcs *vmcs)
 			: "=g"(error) : "a"(&phys_addr), "m"(phys_addr)
 			: "cc", "memory");
 	if (error)
-		printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+		printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
 		       vmcs, phys_addr);
 }
 


* [PATCH 28/28] nVMX: Documentation
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (26 preceding siblings ...)
  2010-12-08 17:13 ` [PATCH 27/28] nVMX: Miscellaneous small corrections Nadav Har'El
@ 2010-12-08 17:14 ` Nadav Har'El
  2010-12-09 12:44 ` [PATCH 0/28] nVMX: Nested VMX, v7 Avi Kivity
  28 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-08 17:14 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 Documentation/kvm/nested-vmx.txt |  237 +++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)

--- .before/Documentation/kvm/nested-vmx.txt	2010-12-08 18:56:52.000000000 +0200
+++ .after/Documentation/kvm/nested-vmx.txt	2010-12-08 18:56:52.000000000 +0200
@@ -0,0 +1,237 @@
+Nested VMX
+==========
+
+Overview
+---------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and the nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux under a guest KVM using shadow
+page tables. It supports multiple guest hypervisors, each of which can
+run multiple guests. Only 64-bit guest hypervisors are supported. SMP is
+supported, but is known to be buggy in this release.
+Additional patches for running Windows under guest KVM, for running Linux
+under guest VMware Server, and for nested EPT support are currently being
+tested in the lab, and will be sent as follow-on patch sets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested in the
+internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this
+structure change, live migration across KVM versions can break. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+
+	struct shadow_vmcs shadow_vmcs;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
+};
+
+struct __packed shadow_vmcs {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
+
+Authors
+-------
+
+These patches were written by:
+     Abel Gordon, abelg <at> il.ibm.com
+     Nadav Har'El, nyh <at> il.ibm.com
+     Orit Wasserman, oritw <at> il.ibm.com
+     Ben-Ami Yassour, benami <at> il.ibm.com
+     Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+     Anthony Liguori, aliguori <at> us.ibm.com
+     Mike Day, mdday <at> us.ibm.com
+     Michael Factor, factor <at> il.ibm.com
+     Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+     Avi Kivity, avi <at> redhat.com
+     Gleb Natapov, gleb <at> redhat.com
+     and others.


* Re: [PATCH 06/28] nVMX: Implement reading and writing of VMX MSRs
  2010-12-08 17:03 ` [PATCH 06/28] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
@ 2010-12-09 11:04   ` Avi Kivity
  0 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2010-12-09 11:04 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 12/08/2010 07:03 PM, Nadav Har'El wrote:
> When the guest can use VMX instructions (when the "nested" module option is
> on), it should also be able to read and write VMX MSRs, e.g., to query about
> VMX capabilities. This patch adds this support.
>
> +
> +static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
> +{
> +	switch (msr_index) {
> +	case MSR_IA32_FEATURE_CONTROL:
> +	case MSR_IA32_VMX_BASIC:
> +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> +	case MSR_IA32_VMX_PINBASED_CTLS:
> +	case MSR_IA32_VMX_PROCBASED_CTLS:
> +	case MSR_IA32_VMX_EXIT_CTLS:
> +	case MSR_IA32_VMX_ENTRY_CTLS:
> +	case MSR_IA32_VMX_PROCBASED_CTLS2:
> +	case MSR_IA32_VMX_EPT_VPID_CAP:
> +		pr_unimpl(vcpu, "unimplemented VMX MSR write: 0x%x data %llx\n",
> +			  msr_index, data);
> +		return 0;
> +	default:
> +		return 1;
> +	}
> +}

These msrs are read-only IIRC, so they should #GP without any message.  
Nor should they be part of the save/restore msr set.

We do need a way for userspace to set them though.
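The behaviour asked for here can be illustrated with a small userspace sketch (the MSR index range and function signature are hypothetical stand-ins for the kernel's, not the actual KVM code): a write to any VMX capability MSR simply fails, silently, and the caller injects #GP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical MSR indices standing in for the SDM-defined VMX
 * capability MSR range (the real constants live in msr-index.h). */
#define MSR_IA32_VMX_BASIC		0x480
#define MSR_IA32_VMX_EPT_VPID_CAP	0x48c
#define MSR_EFER			0xc0000080

/* Returns 0 when the write is handled, 1 when the caller should inject
 * #GP.  Per the review: no log message, the MSRs are simply read-only. */
static int vmx_set_vmx_msr(uint32_t msr_index, uint64_t data)
{
	(void)data;
	if (msr_index >= MSR_IA32_VMX_BASIC &&
	    msr_index <= MSR_IA32_VMX_EPT_VPID_CAP)
		return 1;	/* read-only VMX capability MSR: #GP */
	return 0;		/* not a VMX MSR: handled elsewhere */
}
```

Reads of the same MSRs would still succeed through a separate get path; only the write (and save/restore) handling treats them as read-only.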

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 07/28] nVMX: Decoding memory operands of VMX instructions
  2010-12-08 17:03 ` [PATCH 07/28] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
@ 2010-12-09 11:08   ` Avi Kivity
  0 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2010-12-09 11:08 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 12/08/2010 07:03 PM, Nadav Har'El wrote:
> This patch includes a utility function for decoding pointer operands of VMX
> instructions issued by L1 (a guest hypervisor)
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |   59 +++++++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/x86.c |    3 +-
>   arch/x86/kvm/x86.h |    3 ++
>   3 files changed, 64 insertions(+), 1 deletion(-)
>
> --- .before/arch/x86/kvm/x86.c	2010-12-08 18:56:49.000000000 +0200
> +++ .after/arch/x86/kvm/x86.c	2010-12-08 18:56:49.000000000 +0200
> @@ -3688,7 +3688,7 @@ static int kvm_fetch_guest_virt(gva_t ad
>   					  exception);
>   }
>
> -static int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
> +int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
>   			       struct kvm_vcpu *vcpu,
>   			       struct x86_exception *exception)
>   {
> @@ -3696,6 +3696,7 @@ static int kvm_read_guest_virt(gva_t add
>   	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
>   					  exception);
>   }
> +EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
>
>   static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
>   				      struct kvm_vcpu *vcpu,
> --- .before/arch/x86/kvm/x86.h	2010-12-08 18:56:49.000000000 +0200
> +++ .after/arch/x86/kvm/x86.h	2010-12-08 18:56:49.000000000 +0200
> @@ -74,6 +74,9 @@ void kvm_before_handle_nmi(struct kvm_vc
>   void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
>   int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq);
>
> +int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
> +		struct kvm_vcpu *vcpu, struct x86_exception *exception);
> +
>   void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
>
>   #endif
> --- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
> @@ -3936,6 +3936,65 @@ static int handle_vmoff(struct kvm_vcpu
>   }
>
>   /*
> + * Decode the memory-address operand of a vmx instruction, as recorded on an
> + * exit caused by such an instruction (run by a guest hypervisor).
> + * On success, returns 0. When the operand is invalid, returns 1 and injects
> + * a #UD or #GP exception.
> + */
> +static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
> +				 unsigned long exit_qualification,
> +				 u32 vmx_instruction_info, gva_t *ret)
> +{
> +	/*
> +	 * According to Vol. 3B, "Information for VM Exits Due to Instruction
> +	 * Execution", on an exit, vmx_instruction_info holds most of the
> +	 * addressing components of the operand. Only the displacement part
> +	 * is put in exit_qualification (see 3B, "Basic VM-Exit Information").
> +	 * For how an actual address is calculated from all these components,
> +	 * refer to Vol. 1, "Operand Addressing".
> +	 */
> +	int  scaling = vmx_instruction_info & 3;
> +	int  addr_size = (vmx_instruction_info >> 7) & 7;
> +	bool is_reg = vmx_instruction_info & (1u << 10);
> +	int  seg_reg = (vmx_instruction_info >> 15) & 7;
> +	int  index_reg = (vmx_instruction_info >> 18) & 0xf;
> +	bool index_is_valid = !(vmx_instruction_info & (1u << 22));
> +	int  base_reg       = (vmx_instruction_info >> 23) & 0xf;
> +	bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
> +
> +	if (is_reg) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	switch (addr_size) {
> +	case 1: /* 32 bit. high bits are undefined according to the spec: */
> +		exit_qualification &= 0xffffffff;

Best to do this at the end, on *ret, so that segment base + offset is 
subject to truncation.

> +		break;
> +	case 2: /* 64 bit */
> +		break;
> +	default: /* 16 bit */
> +		return 1;
> +	}
> +
> +	/* Addr = segment_base + offset */
> +	/* offset = base + [index * scale] + displacement */
> +	*ret = vmx_get_segment_base(vcpu, seg_reg);
> +	if (base_is_valid)
> +		*ret += kvm_register_read(vcpu, base_reg);
> +	if (index_is_valid)
> +		*ret += kvm_register_read(vcpu, index_reg) << scaling;
> +	*ret += exit_qualification; /* holds the displacement */
> +	/*
> +	 * TODO: throw #GP (and return 1) in various cases that the VM*
> +	 * instructions require it - e.g., offset beyond segment limit,
> +	 * unusable or unreadable/unwritable segment, non-canonical 64-bit
> +	 * address, and so on. Currently these are not checked.
> +	 */
> +	return 0;
> +}
> +
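The truncation order suggested above can be checked with a standalone model of the decoder (bit layout per the SDM's VM-exit instruction-information field; the register array and parameters are hypothetical stand-ins for kvm_register_read() and vmx_get_segment_base()): the 32-bit mask is applied last, on the final address, so segment base plus offset is also subject to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standalone sketch of decoding a VMX instruction's memory operand.
 * Returns 0 on success with the address in *ret, 1 for invalid
 * operands (register form, or unsupported 16-bit addressing). */
static int decode_vmx_mem_operand(uint32_t info, uint64_t displacement,
				  uint64_t seg_base, const uint64_t *regs,
				  uint64_t *ret)
{
	int  scaling     = info & 3;
	int  addr_size   = (info >> 7) & 7;
	bool is_reg      = info & (1u << 10);
	int  index_reg   = (info >> 18) & 0xf;
	bool index_valid = !(info & (1u << 22));
	int  base_reg    = (info >> 23) & 0xf;
	bool base_valid  = !(info & (1u << 27));
	uint64_t addr;

	if (is_reg || addr_size == 0)	/* register operand or 16-bit */
		return 1;

	addr = seg_base + displacement;	/* displacement from exit_qualification */
	if (base_valid)
		addr += regs[base_reg];
	if (index_valid)
		addr += regs[index_reg] << scaling;
	if (addr_size == 1)		/* 32-bit: truncate the *final* address */
		addr &= 0xffffffffu;
	*ret = addr;
	return 0;
}
```

With the mask applied early, a 32-bit offset near the 4 GB boundary plus a nonzero segment base would escape truncation; applying it last wraps the whole sum.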

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features
  2010-12-08 17:00 ` [PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features Nadav Har'El
@ 2010-12-09 11:38   ` Joerg Roedel
  2010-12-15 13:25     ` Nadav Har'El
  0 siblings, 1 reply; 40+ messages in thread
From: Joerg Roedel @ 2010-12-09 11:38 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

On Wed, Dec 08, 2010 at 07:00:59PM +0200, Nadav Har'El wrote:
> If the "nested" module option is enabled, add the "VMX" CPU feature to the
> list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.
> 
> Qemu uses this ioctl, and intersects KVM's list with its own list of desired
> cpu features (depending on the -cpu option given to qemu) to determine the
> final list of features presented to the guest.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:48.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:48.000000000 +0200
> @@ -4284,6 +4284,8 @@ static void vmx_cpuid_update(struct kvm_
>  
>  static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
>  {
> +	if (func == 1 && nested)
> +		entry->ecx |= bit(X86_FEATURE_VMX);
>  }
>  
>  static struct kvm_x86_ops vmx_x86_ops = {

This patch should be the last one in your series because VMX should be
fully supported before it is reported to userspace.

	Joerg



* Re: [PATCH 08/28] nVMX: Hold a vmcs02 for each vmcs12
  2010-12-08 17:04 ` [PATCH 08/28] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
@ 2010-12-09 12:41   ` Avi Kivity
  0 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2010-12-09 12:41 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 12/08/2010 07:04 PM, Nadav Har'El wrote:
> In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a
> hardware VMCS for each active vmcs12 (i.e., for each L2 guest).
>
> We call each of these L0 VMCSs a "vmcs02", as it is the VMCS that L0 uses
> to run its nested guest L2.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |   96 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 96 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
> @@ -155,6 +155,12 @@ struct __packed vmcs12 {
>    */
>   #define VMCS12_REVISION 0x11e57ed0
>
> +struct vmcs_list {
> +	struct list_head list;
> +	gpa_t vmcs12_addr;
> +	struct vmcs *vmcs02;
> +};
> +
>   /*
>    * The nested_vmx structure is part of vcpu_vmx, and holds information we need
>    * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
> @@ -170,6 +176,10 @@ struct nested_vmx {
>   	/* The host-usable pointer to the above */
>   	struct page *current_vmcs12_page;
>   	struct vmcs12 *current_vmcs12;
> +
> +	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
> +	struct list_head vmcs02_list; /* a vmcs_list */
> +	int vmcs02_num;
>   };
>
>   struct vcpu_vmx {
> @@ -1736,6 +1746,85 @@ static void free_vmcs(struct vmcs *vmcs)
>   	free_pages((unsigned long)vmcs, vmcs_config.order);
>   }
>
> +static struct vmcs *nested_get_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmcs_list *list_item, *n;
> +
> +	list_for_each_entry_safe(list_item, n, &vmx->nested.vmcs02_list, list)
> +		if (list_item->vmcs12_addr == vmx->nested.current_vmptr)
> +			return list_item->vmcs02;
> +
> +	return NULL;
> +}
> +
> +/*
> + * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> + * does not already exist. The allocation is done in L0 memory, so to avoid
> + * denial-of-service attack by guests, we limit the number of concurrently-
> + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> + * trigger this limit.
> + */
> +static const int NESTED_MAX_VMCS = 256;
> +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs_list *new_l2_guest;
> +	struct vmcs *vmcs02;
> +
> +	if (nested_get_current_vmcs(vcpu))
> +		return 0; /* nothing to do - we already have a VMCS */
> +
> +	if (to_vmx(vcpu)->nested.vmcs02_num >= NESTED_MAX_VMCS)
> +		return -ENOMEM;

I asked for this to be fixed (say by freeing one vmcs02 from the list).  
The guest can easily crash by running a lot of nested guests.

Actually you don't have to free it, simply reuse it for the new vmcs12.

> +
> +	new_l2_guest = (struct vmcs_list *)
> +		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
> +	if (!new_l2_guest)
> +		return -ENOMEM;
> +
> +	vmcs02 = alloc_vmcs();
> +	if (!vmcs02) {
> +		kfree(new_l2_guest);
> +		return -ENOMEM;
> +	}
> +
> +	new_l2_guest->vmcs12_addr = to_vmx(vcpu)->nested.current_vmptr;
> +	new_l2_guest->vmcs02 = vmcs02;
> +	list_add(&(new_l2_guest->list), &(to_vmx(vcpu)->nested.vmcs02_list));
> +	to_vmx(vcpu)->nested.vmcs02_num++;
> +	return 0;
> +}
> +
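The reuse-instead-of-fail suggestion can be modeled as a standalone sketch (not the kernel code; the pool is keyed by the vmcs12 guest-physical address and recycles an existing slot round-robin instead of returning -ENOMEM):

```c
#include <assert.h>
#include <stdint.h>

#define NESTED_MAX_VMCS 4	/* small for illustration; the patch uses 256 */

struct vmcs02_pool {
	uint64_t vmcs12_addr[NESTED_MAX_VMCS];	/* key: gpa of vmcs12 */
	int used;
	int next_victim;	/* round-robin recycling cursor */
};

/* Return the slot for 'gpa', allocating a free slot or recycling an
 * existing one when the pool is full - never failing. */
static int vmcs02_get(struct vmcs02_pool *pool, uint64_t gpa)
{
	int i;

	for (i = 0; i < pool->used; i++)
		if (pool->vmcs12_addr[i] == gpa)
			return i;		/* already cached */

	if (pool->used < NESTED_MAX_VMCS) {
		i = pool->used++;		/* free slot available */
	} else {
		i = pool->next_victim;		/* full: recycle, don't fail */
		pool->next_victim = (pool->next_victim + 1) % NESTED_MAX_VMCS;
	}
	pool->vmcs12_addr[i] = gpa;
	return i;
}
```

A recycled slot would of course also need its vmcs02 cleared before reuse; only the allocation policy is shown here.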

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12
  2010-12-08 17:04 ` [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2010-12-09 12:43   ` Avi Kivity
  2010-12-10 12:10     ` Nadav Har'El
  0 siblings, 1 reply; 40+ messages in thread
From: Avi Kivity @ 2010-12-09 12:43 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 12/08/2010 07:04 PM, Nadav Har'El wrote:
> In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
> standard VMCS fields. These fields are encapsulated in a struct vmcs_fields.
>
> Later patches will enable L1 to read and write these fields using VMREAD/
> VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
> a hardware VMCS for running L2.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |  295 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 295 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:49.000000000 +0200
> @@ -128,6 +128,137 @@ struct shared_msr_entry {
>   };
>
>   /*
> + * vmcs_fields is a structure used in nested VMX for holding a copy of all
> + * standard VMCS fields. It is used for emulating a VMCS for L1 (see struct
> + * vmcs12), and also for easier access to VMCS data (see vmcs01_fields).
> + */
> +struct __packed vmcs_fields {
> +	u16 virtual_processor_id;
> +	u16 guest_es_selector;
> +	u16 guest_cs_selector;
> +	u16 guest_ss_selector;
> +	u16 guest_ds_selector;
> +	u16 guest_fs_selector;
> +	u16 guest_gs_selector;
> +	u16 guest_ldtr_selector;
> +	u16 guest_tr_selector;
> +	u16 host_es_selector;
> +	u16 host_cs_selector;
> +	u16 host_ss_selector;
> +	u16 host_ds_selector;
> +	u16 host_fs_selector;
> +	u16 host_gs_selector;
> +	u16 host_tr_selector;
> +	u64 io_bitmap_a;
> +	u64 io_bitmap_b;
> +	u64 msr_bitmap;
> +	u64 vm_exit_msr_store_addr;
> +	u64 vm_exit_msr_load_addr;
> +	u64 vm_entry_msr_load_addr;
> +	u64 tsc_offset;
> +	u64 virtual_apic_page_addr;
> +	u64 apic_access_addr;
> +	u64 ept_pointer;
> +	u64 guest_physical_address;
> +	u64 vmcs_link_pointer;
> +	u64 guest_ia32_debugctl;
> +	u64 guest_ia32_pat;
> +	u64 guest_pdptr0;
> +	u64 guest_pdptr1;
> +	u64 guest_pdptr2;
> +	u64 guest_pdptr3;
> +	u64 host_ia32_pat;
> +	u32 pin_based_vm_exec_control;
> +	u32 cpu_based_vm_exec_control;
> +	u32 exception_bitmap;
> +	u32 page_fault_error_code_mask;
> +	u32 page_fault_error_code_match;
> +	u32 cr3_target_count;
> +	u32 vm_exit_controls;
> +	u32 vm_exit_msr_store_count;
> +	u32 vm_exit_msr_load_count;
> +	u32 vm_entry_controls;
> +	u32 vm_entry_msr_load_count;
> +	u32 vm_entry_intr_info_field;
> +	u32 vm_entry_exception_error_code;
> +	u32 vm_entry_instruction_len;
> +	u32 tpr_threshold;
> +	u32 secondary_vm_exec_control;
> +	u32 vm_instruction_error;
> +	u32 vm_exit_reason;
> +	u32 vm_exit_intr_info;
> +	u32 vm_exit_intr_error_code;
> +	u32 idt_vectoring_info_field;
> +	u32 idt_vectoring_error_code;
> +	u32 vm_exit_instruction_len;
> +	u32 vmx_instruction_info;
> +	u32 guest_es_limit;
> +	u32 guest_cs_limit;
> +	u32 guest_ss_limit;
> +	u32 guest_ds_limit;
> +	u32 guest_fs_limit;
> +	u32 guest_gs_limit;
> +	u32 guest_ldtr_limit;
> +	u32 guest_tr_limit;
> +	u32 guest_gdtr_limit;
> +	u32 guest_idtr_limit;
> +	u32 guest_es_ar_bytes;
> +	u32 guest_cs_ar_bytes;
> +	u32 guest_ss_ar_bytes;
> +	u32 guest_ds_ar_bytes;
> +	u32 guest_fs_ar_bytes;
> +	u32 guest_gs_ar_bytes;
> +	u32 guest_ldtr_ar_bytes;
> +	u32 guest_tr_ar_bytes;
> +	u32 guest_interruptibility_info;
> +	u32 guest_activity_state;
> +	u32 guest_sysenter_cs;
> +	u32 host_ia32_sysenter_cs;
> +	unsigned long cr0_guest_host_mask;
> +	unsigned long cr4_guest_host_mask;
> +	unsigned long cr0_read_shadow;
> +	unsigned long cr4_read_shadow;
> +	unsigned long cr3_target_value0;
> +	unsigned long cr3_target_value1;
> +	unsigned long cr3_target_value2;
> +	unsigned long cr3_target_value3;
> +	unsigned long exit_qualification;
> +	unsigned long guest_linear_address;
> +	unsigned long guest_cr0;
> +	unsigned long guest_cr3;
> +	unsigned long guest_cr4;
> +	unsigned long guest_es_base;
> +	unsigned long guest_cs_base;
> +	unsigned long guest_ss_base;
> +	unsigned long guest_ds_base;
> +	unsigned long guest_fs_base;
> +	unsigned long guest_gs_base;
> +	unsigned long guest_ldtr_base;
> +	unsigned long guest_tr_base;
> +	unsigned long guest_gdtr_base;
> +	unsigned long guest_idtr_base;
> +	unsigned long guest_dr7;
> +	unsigned long guest_rsp;
> +	unsigned long guest_rip;
> +	unsigned long guest_rflags;
> +	unsigned long guest_pending_dbg_exceptions;
> +	unsigned long guest_sysenter_esp;
> +	unsigned long guest_sysenter_eip;
> +	unsigned long host_cr0;
> +	unsigned long host_cr3;
> +	unsigned long host_cr4;
> +	unsigned long host_fs_base;
> +	unsigned long host_gs_base;
> +	unsigned long host_tr_base;
> +	unsigned long host_gdtr_base;
> +	unsigned long host_idtr_base;
> +	unsigned long host_ia32_sysenter_esp;
> +	unsigned long host_ia32_sysenter_eip;
> +	unsigned long host_rsp;
> +	unsigned long host_rip;
> +};
> +

Those ulongs aren't portable.  Please use u64.

And please address all my earlier comments, there's no point in me 
reviewing the same thing again and again.
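The ulong point is worth spelling out: vmcs12 lives in guest memory, so its layout must not depend on whether the host kernel is 32- or 64-bit. A minimal illustration (hypothetical two-field fragment, not the real struct):

```c
#include <assert.h>
#include <stdint.h>

/* 'unsigned long' is 4 bytes on a 32-bit build and 8 on a 64-bit one,
 * so this layout changes with the host kernel: */
struct vmcs12_frag_ulong {
	unsigned long guest_cr0;
	unsigned long guest_cr3;
};

/* Fixed-width fields keep the guest-visible layout stable everywhere: */
struct vmcs12_frag_u64 {
	uint64_t guest_cr0;
	uint64_t guest_cr3;
};
```

On a 64-bit build the two fragments happen to coincide; only the u64 variant keeps the same 16-byte layout on a 32-bit build.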

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/28] nVMX: Nested VMX, v7
  2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
                   ` (27 preceding siblings ...)
  2010-12-08 17:14 ` [PATCH 28/28] nVMX: Documentation Nadav Har'El
@ 2010-12-09 12:44 ` Avi Kivity
  28 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2010-12-09 12:44 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 12/08/2010 06:59 PM, Nadav Har'El wrote:
> Hi,
>
> This is the seventh iteration of the nested VMX patch set. It fixes a bunch
> of bugs in the previous iteration, and in particular it now works correctly
> with EPT in the L0 hypervisor, so "ept=0" no longer needs to be specified.
>
> This new set of patches should apply to the current KVM trunk (I checked with
> 66fc6be8d2b04153b753182610f919faf9c705bc). In particular it uses the recently
> added is_guest_mode() function (common to both nested svm and vmx) instead of
> inventing our own flag.
>
> About nested VMX:
> -----------------
>
> The following 28 patches implement nested VMX support. This feature enables a
> guest to use the VMX APIs in order to run its own nested guests. In other
> words, it allows running hypervisors (that use VMX) under KVM.
> Multiple guest hypervisors can be run concurrently, and each of those can
> in turn host multiple guests.
>
> The theory behind this work, our implementation, and its performance
> characteristics were presented in OSDI 2010 (the USENIX Symposium on
> Operating Systems Design and Implementation). Our paper was titled
> "The Turtles Project: Design and Implementation of Nested Virtualization",
> and was awarded "Jay Lepreau Best Paper". The paper is available online, at:
>
> 	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
>
> This patch set does not include all the features described in the paper.
> In particular, this patch set is missing nested EPT (shadow page tables are
> used in L1, while L0 can use shadow page tables or EPT). It is also missing
> some features required to run VMware Server as a guest. These missing features
> will be sent as follow-on patches.
>
> Running nested VMX:
> ------------------
>
> The current patches have a number of requirements, which will be relaxed in
> follow-on patches:
>
> 1. This version was only tested with KVM (64-bit) as a guest hypervisor, and
>     Linux as a nested guest.
>
> 2. SMP is supported in the code, but is unfortunately buggy in this version
>     and often leads to hangs. Use the "nosmp" option in the L0 (topmost)
>     kernel to avoid this bug (and to reduce your performance ;-)).

Any idea as to the cause?  There should be little interaction between 
host or guest smp and nvmx.

> 5. Nested VPID is not properly supported in this version. You must give the
>     "vpid=0" module options to kvm-intel to turn this feature off.

Do you mean host vpid here?  Likely you're not flushing the tlb when 
switching between guest and host.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 19/28] nVMX: Exiting from L2 to L1
  2010-12-08 17:09 ` [PATCH 19/28] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2010-12-09 12:55   ` Avi Kivity
  0 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2010-12-09 12:55 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 12/08/2010 07:09 PM, Nadav Har'El wrote:
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
>
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in the next patch.
>
>
> +void prepare_vmcs12(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
> +
> +	/* update guest state fields: */
> +	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
> +	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
> +
> +	vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
> +	vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
> +	vmcs12->guest_rip = vmcs_readl(GUEST_RIP);
> +	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);

kvm_register_read() etc.

> +
> +static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int efer_offset;
> +	struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
> +
> +	if (!is_guest_mode(vcpu)) {
> +		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
> +		       __func__);
> +		return 0;
> +	}
> +
> +	sync_cached_regs_to_vmcs(vcpu);
> +
> +	prepare_vmcs12(vcpu);
> +
> +	if (is_interrupt)
> +		get_vmcs12_fields(vcpu)->vm_exit_reason =
> +			EXIT_REASON_EXTERNAL_INTERRUPT;
> +
> +	vmx->nested.current_vmcs12->launched = vmx->launched;
> +	vmx->nested.current_vmcs12->cpu = vcpu->cpu;
> +
> +	vmx->vmcs = vmx->nested.vmcs01;
> +	vcpu->cpu = vmx->nested.l1_state.cpu;
> +	vmx->launched = vmx->nested.l1_state.launched;
> +
> +	leave_guest_mode(vcpu);
> +
> +	vmx_vcpu_load(vcpu, get_cpu());
> +	put_cpu();
> +
> +	vcpu->arch.efer = vmx->nested.l1_state.efer;
> +	if ((vcpu->arch.efer & EFER_LMA) &&
> +	    !(vcpu->arch.efer & EFER_SCE))
> +		vcpu->arch.efer |= EFER_SCE;

set_efer() in x86.c for the side effects.

> +
> +	efer_offset = __find_msr_index(vmx, MSR_EFER);
> +	if (update_transition_efer(vmx, efer_offset))
> +		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);

Including this.

> +
> +	/*
> +	 * L2 perhaps switched to real mode and set vmx->rmode, but we're back
> +	 * in L1 and as it is running VMX, it can't be in real mode.
> +	 */
> +	vmx->rmode.vm86_active = 0;

L2 cannot be in real mode since vmx does not support it (except for 
unrestricted guest, in which case rmode.vm86_active would be clear).

> +
> +	/*
> +	 * If L1 set the HOST_* fields in the VMCS, when exiting from L2 to L1
> +	 * we need to return those, not L1's old values.
> +	 */
> +	vmcs_writel(GUEST_RIP, get_vmcs12_fields(vcpu)->host_rip);
> +	vmcs_writel(GUEST_RSP, get_vmcs12_fields(vcpu)->host_rsp);

kvm_register_write() etc.

> +	vmcs01->cr0_read_shadow = get_vmcs12_fields(vcpu)->host_cr0;
> +
> +	/*
> +	 * We're running a regular L1 guest again, so we do the regular KVM
> +	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has.
> +	 * vmx_set_cr0 might use slightly different bits on the new guest_cr0
> +	 * it sets, e.g., add TS when !fpu_active.
> +	 * Note that vmx_set_cr0 refers to rmode and efer set above.
> +	 */
> +	vmx_set_cr0(vcpu, guest_readable_cr0(vmcs01));

kvm_set_cr0() takes care of some extra stuff.  Why guest_readable_cr0?  
You want vmcs12->host_cr0 here.

> +	/*
> +	 * If we did fpu_activate()/fpu_deactive() during l2's run, we need to
> +	 * apply the same changes to l1's vmcs. We just set cr0 correctly, but
> +	 * now we need to also update cr0_guest_host_mask and exception_bitmap.
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		(vmcs01->exception_bitmap & ~(1u << NM_VECTOR)) |
> +			(vcpu->fpu_active ? 0 : (1u << NM_VECTOR)));
> +	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
> +	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);

Should be a side effect of kvm_set_cr0().

> +
> +	vmx_set_cr4(vcpu, guest_readable_cr4(vmcs01));
> +	vcpu->arch.cr4_guest_owned_bits = ~vmcs01->cr4_guest_host_mask;

kvm_set_cr4(vmcs12->host_cr4)

> +
> +	if (enable_ept) {
> +		/* shadow page tables on EPT: */
> +		set_cr3_and_pdptrs(vcpu, get_vmcs12_fields(vcpu)->host_cr3);
> +	} else {
> +		/* shadow page tables on shadow page tables: */
> +		kvm_set_cr3(vcpu, vmx->nested.l1_arch_cr3);
> +		kvm_mmu_reset_context(vcpu);
> +		kvm_mmu_load(vcpu);
> +	}

kvm_set_cr3() should suffice in both cases.  
kvm_mmu_reset_context()/kvm_mmu_load() is probably unneeded.

> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs01->guest_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs01->guest_rip);

vmcs12->host_rip

> +
> +	if (unlikely(vmx->fail)) {
> +		/*
> +		 * When L1 launches L2 and then we (L0) fail to launch L2,
> +		 * we nested_vmx_vmexit back to L1, but now should let it know
> +		 * that the VMLAUNCH failed - with the same error that we
> +		 * got when launching L2.
> +		 */
> +		vmx->fail = 0;
> +		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
> +	} else
> +		nested_vmx_succeed(vcpu);
> +
> +	return 0;
> +}
> +
>   static struct kvm_x86_ops vmx_x86_ops = {
>   	.cpu_has_kvm_support = cpu_has_kvm_support,
>   	.disabled_by_bios = vmx_disabled_by_bios,


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 25/28] nVMX: Further fixes for lazy FPU loading
  2010-12-08 17:12 ` [PATCH 25/28] nVMX: Further fixes for lazy FPU loading Nadav Har'El
@ 2010-12-09 13:05   ` Avi Kivity
  0 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2010-12-09 13:05 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 12/08/2010 07:12 PM, Nadav Har'El wrote:
> KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
> if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
> NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
> traps. And of course, conversely: If L1 wanted to trap these events, we
> must let it, even if L0 is not interested in them.
>
> This patch fixes some existing KVM code (in update_exception_bitmap(),
> vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
> and L1's needs. Note that handle_cr() was already fixed in the above patch,
> and that new code in introduced in previous patches already handles CR0
> correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).
>
>
> @@ -1415,8 +1424,19 @@ static void vmx_fpu_activate(struct kvm_
>   	cr0&= ~(X86_CR0_TS | X86_CR0_MP);
>   	cr0 |= kvm_read_cr0_bits(vcpu, X86_CR0_TS | X86_CR0_MP);
>   	vmcs_writel(GUEST_CR0, cr0);
> -	update_exception_bitmap(vcpu);
>   	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> +	if (is_guest_mode(vcpu)) {
> +		/* While we (L0) no longer care about NM exceptions or cr0.TS
> +		 * changes, our guest hypervisor (L1) might care in which case
> +		 * we must trap them for it.
> +		 */
> +		u32 eb = vmcs_read32(EXCEPTION_BITMAP) & ~(1u << NM_VECTOR);
> +		struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
> +		eb |= vmcs12->exception_bitmap;
> +		vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
> +		vmcs_write32(EXCEPTION_BITMAP, eb);
> +	} else
> +		update_exception_bitmap(vcpu);

Isn't update_exception_bitmap() sufficient for both cases?

>   	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
>   }
>
> @@ -1424,12 +1444,24 @@ static void vmx_decache_cr0_guest_bits(s
>
>   static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
>   {
> +	/* Note that there is no vcpu->fpu_active = 0 here. The caller must
> +	 * set this *before* calling this function.
> +	 */
>   	vmx_decache_cr0_guest_bits(vcpu);
>   	vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
> -	update_exception_bitmap(vcpu);
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		vmcs_read32(EXCEPTION_BITMAP) | (1u << NM_VECTOR));

Why not fold the logic into update_exception_bitmap()?

>   	vcpu->arch.cr0_guest_owned_bits = 0;
>   	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
> -	vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
> +	if (is_guest_mode(vcpu))
> +		/* Unfortunately in nested mode we play with arch.cr0's PG
> +		 * bit, so we mustn't copy it all, just the relevant TS bit
> +		 */
> +		vmcs_writel(CR0_READ_SHADOW,
> +			(vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS) |
> +			(vcpu->arch.cr0 & X86_CR0_TS));
> +	else
> +		vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);

Didn't you have a nice guest_readable_cr0() function that did this?

>   }
>
>   static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
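The helper referred to above computes the CR0 value the guest observes: guest-owned bits (mask clear) come from the hardware GUEST_CR0, shadowed bits (mask set) from CR0_READ_SHADOW. A standalone sketch (the real function takes a vmcs_fields pointer; this flattened form is an illustrative simplification):

```c
#include <assert.h>

#define X86_CR0_TS	(1ul << 3)

/* mask bit set => bit is shadowed; the guest reads the shadow value */
static unsigned long guest_readable_cr0(unsigned long guest_cr0,
					unsigned long read_shadow,
					unsigned long guest_host_mask)
{
	return (guest_cr0 & ~guest_host_mask) |
	       (read_shadow & guest_host_mask);
}
```

Using one helper for both the full-CR0 case and the TS-only case would avoid open-coding the TS merge in vmx_fpu_deactivate().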


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 24/28] nVMX: Handling of CR0 and CR4 modifying instructions
  2010-12-08 17:12 ` [PATCH 24/28] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
@ 2010-12-09 13:19   ` Avi Kivity
  0 siblings, 0 replies; 40+ messages in thread
From: Avi Kivity @ 2010-12-09 13:19 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 12/08/2010 07:12 PM, Nadav Har'El wrote:
> When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
> which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
> thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
> previous patch).
> When L2 modifies bits that L1 doesn't care about, we let it think (via
> CR[04]_READ_SHADOW) that it did these modifications, while only changing
> (in GUEST_CR[04]) the bits that L0 doesn't shadow.
>
> This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may
> want to leave TS on, while pretending to allow the guest to change it.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |   54 ++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 51 insertions(+), 3 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-12-08 18:56:51.000000000 +0200
> @@ -3877,6 +3877,54 @@ static void complete_insn_gp(struct kvm_
>   		skip_emulated_instruction(vcpu);
>   }
>
> +/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
> +static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
> +{
> +	if (is_guest_mode(vcpu)) {
> +		/*
> +		 * We get here when L2 changed cr0 in a way that did not change
> +		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> +		 * but did change L0 shadowed bits. This can currently happen
> +		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> +		 * loading) while pretending to allow the guest to change it.
> +		 */
> +		vmcs_writel(GUEST_CR0,
> +		   (val & vcpu->arch.cr0_guest_owned_bits) |
> +		   (vmcs_readl(GUEST_CR0) & ~vcpu->arch.cr0_guest_owned_bits));
> +		vmcs_writel(CR0_READ_SHADOW, val);
> +		vcpu->arch.cr0 = val;
> +		return 0;
> +	} else
> +		return kvm_set_cr0(vcpu, val);
> +}

Easier way: update val to reflect the change, and call 
kvm_set_cr0(val).  This allows any side effects of kvm_set_cr0() to take 
place (for example the guest may allow the nested guest to change 
cr0.pg, and we need kvm_set_cr0() to make note of that).

> +
> +static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
> +{
> +	if (is_guest_mode(vcpu)) {
> +		vmcs_writel(GUEST_CR4,
> +		  (val & vcpu->arch.cr4_guest_owned_bits) |
> +		  (vmcs_readl(GUEST_CR4) & ~vcpu->arch.cr4_guest_owned_bits));
> +		vmcs_writel(CR4_READ_SHADOW, val);
> +		vcpu->arch.cr4 = val;
> +		return 0;
> +	} else
> +		return kvm_set_cr4(vcpu, val);
> +}

Ditto.

> +
> +
> +/* called to set cr0 as appropriate for clts instruction exit. */
> +static void handle_clts(struct kvm_vcpu *vcpu)
> +{
> +	if (is_guest_mode(vcpu)) {
> +		/* As in handle_set_cr0(), we can't call vmx_set_cr0 here */
> +		vmcs_writel(GUEST_CR0, vmcs_readl(GUEST_CR0) & ~X86_CR0_TS);
> +		vmcs_writel(CR0_READ_SHADOW,
> +			vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
> +		vcpu->arch.cr0 &= ~X86_CR0_TS;
> +	} else
> +		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
> +}

Here, too.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12
  2010-12-09 12:43   ` Avi Kivity
@ 2010-12-10 12:10     ` Nadav Har'El
  0 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-10 12:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Thu, Dec 09, 2010, Avi Kivity wrote about "Re: [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12":
> And please address all my earlier comments, there's no point in me 
> reviewing the same thing again and again.

Hi,

Agreed. Like I said previously, I am keeping a detailed list of things you
already asked me to change, and once in a while addressing one issue and
replying about it. None of your previous comments have been forgotten, or
ignored.

Nadav.

-- 
Nadav Har'El                        |        Friday, Dec 10 2010, 3 Tevet 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Long periods of drought are always
http://nadav.harel.org.il           |followed by rain.

* Re: [PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features
  2010-12-09 11:38   ` Joerg Roedel
@ 2010-12-15 13:25     ` Nadav Har'El
  0 siblings, 0 replies; 40+ messages in thread
From: Nadav Har'El @ 2010-12-15 13:25 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: kvm, gleb, avi

On Thu, Dec 09, 2010, Joerg Roedel wrote about "Re: [PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features":
> This patch should be the last one in your series because VMX should be
> fully supported before it is reported to userspace.
> 
> 	Joerg

Thanks, good idea - especially for bisection (where we don't want a guest
to see half the nested VMX feature). 

I also removed the silly reference to SVM in the title of this patch ;-)

Nadav.

-- 
Nadav Har'El                        |     Wednesday, Dec 15 2010, 8 Tevet 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Bore, n.: A person who talks when you
http://nadav.harel.org.il           |wish him to listen.

end of thread, other threads:[~2010-12-15 13:25 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-08 16:59 [PATCH 0/28] nVMX: Nested VMX, v7 Nadav Har'El
2010-12-08 17:00 ` [PATCH 01/28] nVMX: Add "nested" module option to vmx.c Nadav Har'El
2010-12-08 17:00 ` [PATCH 02/28] nVMX: Add VMX and SVM to list of supported cpuid features Nadav Har'El
2010-12-09 11:38   ` Joerg Roedel
2010-12-15 13:25     ` Nadav Har'El
2010-12-08 17:01 ` [PATCH 03/28] nVMX: Implement VMXON and VMXOFF Nadav Har'El
2010-12-08 17:02 ` [PATCH 04/28] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
2010-12-08 17:02 ` [PATCH 05/28] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
2010-12-08 17:03 ` [PATCH 06/28] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
2010-12-09 11:04   ` Avi Kivity
2010-12-08 17:03 ` [PATCH 07/28] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
2010-12-09 11:08   ` Avi Kivity
2010-12-08 17:04 ` [PATCH 08/28] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
2010-12-09 12:41   ` Avi Kivity
2010-12-08 17:04 ` [PATCH 09/28] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
2010-12-09 12:43   ` Avi Kivity
2010-12-10 12:10     ` Nadav Har'El
2010-12-08 17:05 ` [PATCH 10/28] nVMX: Success/failure of VMX instructions Nadav Har'El
2010-12-08 17:05 ` [PATCH 11/28] nVMX: Implement VMCLEAR Nadav Har'El
2010-12-08 17:06 ` [PATCH 12/28] nVMX: Implement VMPTRLD Nadav Har'El
2010-12-08 17:06 ` [PATCH 13/28] nVMX: Implement VMPTRST Nadav Har'El
2010-12-08 17:07 ` [PATCH 14/28] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
2010-12-08 17:07 ` [PATCH 15/28] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
2010-12-08 17:08 ` [PATCH 16/28] nVMX: Move register-syncing to a function Nadav Har'El
2010-12-08 17:08 ` [PATCH 17/28] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
2010-12-08 17:09 ` [PATCH 18/28] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
2010-12-08 17:09 ` [PATCH 19/28] nVMX: Exiting from L2 to L1 Nadav Har'El
2010-12-09 12:55   ` Avi Kivity
2010-12-08 17:10 ` [PATCH 20/28] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
2010-12-08 17:10 ` [PATCH 21/28] nVMX: Correct handling of interrupt injection Nadav Har'El
2010-12-08 17:11 ` [PATCH 22/28] nVMX: Correct handling of exception injection Nadav Har'El
2010-12-08 17:11 ` [PATCH 23/28] nVMX: Correct handling of idt vectoring info Nadav Har'El
2010-12-08 17:12 ` [PATCH 24/28] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
2010-12-09 13:19   ` Avi Kivity
2010-12-08 17:12 ` [PATCH 25/28] nVMX: Further fixes for lazy FPU loading Nadav Har'El
2010-12-09 13:05   ` Avi Kivity
2010-12-08 17:13 ` [PATCH 26/28] nVMX: Additional TSC-offset handling Nadav Har'El
2010-12-08 17:13 ` [PATCH 27/28] nVMX: Miscellenous small corrections Nadav Har'El
2010-12-08 17:14 ` [PATCH 28/28] nVMX: Documentation Nadav Har'El
2010-12-09 12:44 ` [PATCH 0/28] nVMX: Nested VMX, v7 Avi Kivity
