From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Nadav Har'El" <nyh@math.technion.ac.il>
Subject: Re: Nested VMX - L1 hangs on running L2
Date: Thu, 28 Jul 2011 14:11:21 +0300
Message-ID: <20110728111121.GA7182@fermat.math.technion.ac.il>
References: <20110708184053.GA23958@stratus.com> <20110718182628.GB5324@amt.cnet> <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com> <OF526F82DC.9E752594-ONC22578D3.00690341-C22578D3.006DB7B3@il.ibm.com> <CAKiCmT28X2F59p_OByJcWTMSo09S=VdWGa0CckFB6fHGMCpcRA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-8-i
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Bandan Das <bandan.das@stratus.com>,
	KVM Mailing List <kvm@vger.kernel.org>,
	Marcelo Tosatti <mtosatti@redhat.com>
To: Zachary Amsden <zamsden@gmail.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mailgw12.technion.ac.il ([132.68.225.12]:51732 "EHLO
	mailgw12.technion.ac.il" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753248Ab1G1LL3 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Thu, 28 Jul 2011 07:11:29 -0400
Content-Disposition: inline
In-Reply-To: <CAKiCmT28X2F59p_OByJcWTMSo09S=VdWGa0CckFB6fHGMCpcRA@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 h=
angs on running L2":
> > > No, both patches are wrong.
> >
> > > The correct fix is to make kvm_get_msr() return the L1 guest TSC =
at all
> > times.
> > > =A0We are serving the L1 guest in this hypervisor, not the L2 gue=
st, and so
> >
> > > should never read its TSC for any reason.

I've been thinking about how to correctly solve this problem, and about=
 this
statement that you made, that get_read_tsc() should *always* return L1'=
s TSC,
even while running L2.

We have two quite-different uses for the TSC-related functions:

 1. They are used in vmx.c to emulate *architectural* instructions, nam=
ely
    RDTSC, RDMSR of TSC, and WRMSR of TSC.

 2. They are used in x86.c for internal KVM uses, adjusting the TSC for
    various reasons (most of which, I have to admit, I don't understand=
).

You said above that kvm_get_msr(), calling guest_read_tsc(), should alw=
ays
return L1's TSC, even when running L2. However, I believe that that for
use #1 - the *architecturally correct* emulation of RDTSC and RDMSR whe=
n
L2 is running, the current nested-agnostic code - which ends up returni=
ng L2's
TSC when L2 is running - is the correct thing to do.

Look at what happens when:

 When we get an exit to L0 when L2 run RDMSR TSC:
   This may exit to L1, if L1 asked in the MSR Bitmap. Done correctly.
   If hasn't exited, the spec says we need to return L1's TSC plus the
   TSC_OFFSET that L1 set for L2 (if L1 asked for TSC_OFFSETTING).
   In other words, we should return L0's TSC offsetted by the sum of vm=
cs01
   and vmcs12 tsc_offsets. This sum is exactly what we have in the curr=
ent
   (vmcs02) TSC_OFFSET. So the existing guest_read_tsc, which returns
   rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing - also f=
or
   nested.

 When we get an exit to L0 when L2 runs RDTSC:
   This may exit to L1 if L1 asked for TSC_EXITING. Done correctly.
   There is no "else" here - we would get an EXIT_REASON_RDTSC only if =
L1
   asked for it (because KVM itself doesn't use TSC_EXITING).

Regarding emulation of WRMSR TSC, I actually had an error in the code w=
hich
I correct in a patch below: According to the Intel spec, it turns out t=
hat
if WRMSR of the TSC does not cause an exit, then it should change the h=
ost
TSC itself, ignoring TSC_OFFSET. This is very strange, because it means=
 that
if the host doesn't trap MSR read and write of the TSC, then if a guest
writes TSC and then immediately reads it, it won't read what is just wr=
ote,
but rather what it wrote pluss the TSC_OFFSET (because the TSC_OFFSET i=
s
added on read, but not subtracted on write). I don't know why the Intel=
 spec
specifies this, but it does - and it's not what I emulated to I fixed t=
hat
now (in the patch below). But I doubt this is the problem that Bandan a=
nd
others saw?

So now, if I'm right, after this patch, the *architectural* uses of
guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSR and RD/WRM=
SR,
are done correctly.

But the problem is that these functions are also used in KVM for other =
things.
Ideally, whomever calls kvm_read_msr()/kvm_write_msr() should realize t=
hat it
gets the value for the current running guest, which might be L1 or L2, =
just
like if that guest would read or write the MSR. Ideally, KVM might read=
 the
MSR, change it and write it back, and everything would be correct as bo=
th
read and write code are done correctly. The question is, is some of the=
 code -
perhaps in kvm_guest_time_update(), kvm_arch_vcpu_load(), kvm_arch_vcpu=
_put()
or vcpu_enter_guest() (these are functions which touch the TSC), which =
only
work correctly when reading/writing L1 TSC and can't work correctly if =
L2
TSC is read and written? I'm afraid I don't know enough about what thes=
e
functions are doing, to understand what, if anything, they are doing wr=
ong.

Bandan and others who can reproduce this bug - I attach below a patch w=
hich
should correct some architectural emulation issues I discovered while r=
eading
the code again now - especially of TSC MSR write. Can you please try if
this solves anything? I'm not too optimistic, but since the code was wr=
ong
in this regard, it's at least worth a try.

Thanks,
Nadav.

Subject: [PATCH] nVMX: Fix nested TSC handling

This patch fixes several areas where nested VMX did not correctly emula=
ted
TSC handling.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
@@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
  */
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
-	vmcs_write64(TSC_OFFSET, offset);
-	if (is_guest_mode(vcpu))
+	if (is_guest_mode(vcpu)) {
 		/*
-		 * We're here if L1 chose not to trap the TSC MSR. Since
-		 * prepare_vmcs12() does not copy tsc_offset, we need to also
-		 * set the vmcs12 field here.
+		 * We're here if L1 chose not to trap WRMSR to TSC. According
+		 * to the spec, this should set L1's TSC; The offset that L1
+		 * set for L2 remains unchanged, and still needs to be added
+		 * to the newly set TSC to get L2's TSC.
 		 */
-		get_vmcs12(vcpu)->tsc_offset =3D offset -
-			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
+		struct vmcs12 *vmcs12;
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset =3D offset;
+		/* recalculate vmcs02.TSC_OFFSET: */
+		vmcs12 =3D get_vmcs12(vcpu);
+		vmcs_write64(TSC_OFFSET, offset +
+			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+			 vmcs12->tsc_offset : 0));
+	} else {
+		vmcs_write64(TSC_OFFSET, offset);
+	}
 }
=20
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustmen=
t)
@@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
=20
 	set_cr4_guest_host_mask(vmx);
=20
-	vmcs_write64(TSC_OFFSET,
-		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
+		vmcs_write64(TSC_OFFSET,
+			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	else
+		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
=20
 	if (enable_vpid) {
 		/*
@@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
=20
 	load_vmcs12_host_state(vcpu, vmcs12);
=20
-	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran=
 */
+	/* Update TSC_OFFSET if TSC was changed while L2 ran */
 	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
=20
 	/* This is needed for same reason as it was needed in prepare_vmcs02 =
*/
--=20
Nadav Har'El                        |    Thursday, Jul 28 2011, 26 Tamm=
uz 5771
nyh@math.technion.ac.il             |----------------------------------=
-------
Phone +972-523-790466, ICQ 13349191 |Just remember that if the world di=
dn't
http://nadav.harel.org.il           |suck, we would all fall off.