From mboxrd@z Thu Jan  1 00:00:00 1970
From: Zachary Amsden <zamsden@redhat.com>
Subject: Re: TSC in nested SVM and VMX
Date: Thu, 30 Sep 2010 18:38:40 -1000
Message-ID: <4CA565D0.9010202@redhat.com>
References: <20100930225058.GA10452@fermat.math.technion.ac.il>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org, joro@8bytes.org, Avi Kivity <avi@redhat.com>
To: "Nadav Har'El" <nyh@math.technion.ac.il>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:51595 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750906Ab0JAEiw (ORCPT <rfc822;kvm@vger.kernel.org>);
	Fri, 1 Oct 2010 00:38:52 -0400
In-Reply-To: <20100930225058.GA10452@fermat.math.technion.ac.il>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 09/30/2010 12:50 PM, Nadav Har'El wrote:
> Hi,
>
> I noticed that the TSC handling code has recently changed, and since it
> wasn't done correctly in the nested VMX patch, I wanted to take the opportunity
> to fix it.
>
> I looked at what nested SVM does about TSC, and most of it I think I
> understand, but a couple of other things I don't understand.
>
> The basic point is that when L1 starts L2 with some vmcs12.tsc_offset
> (nested_vmcb->control.tsc_offset in SVM nomenclature), the TSC that L1
> actually thinks it is offsetting from is already the hardware TSC plus
> the vmcs01.tsc_offset. So when L0 runs L2, it needs to use the offset
> vmcs01.tsc_offset + vmcs12.tsc_offset.
>
> This explains the line
>          svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;
> in nested_svm_vmrun().
>
> In svm_adjust_tsc_offset(), when the hardware TSC changes (e.g., when moving
> cores?) and the TSC offset is changed to keep the guest TSC unchanged,
> when in nested mode we need to update both the currently running L2's tsc
> offset, but also also L1's when we eventually return to it, which explains
> the code in that function.
>
> But there are two things I don't understand:
>
> 1. in svm_get_msr(), MSR_IA32_TSC, there is a special is_nested() case which
>     basically ignores the above mentioned addition and uses just the L0->L1
>     tsc offset. why? Why isn't svm->vmcb->control.tsc_offset + native_read_tsc()
>     the correct thing in both cases?
>
> 2. In svm_write_tsc_offset(), when in a nested guest, we don't write the offset
>     given to us, but (if I understand correctly) set this offset for the *L1*
>     guest (and set the L2's tsc offset accordingly, adding to it vmcs12's
>     tsc offset). Why was this done? Why was the simple code
> 	  svm->vmcb->control.tsc_offset = offset;
>     and that's it, not the right thing to do in this function?
>    

The offset for the L1 guest is copied out to a temporary structure, 
nested.hsave; this hold all L1 state information while the L2 guest is 
running.  The L2 guest state is copied directly into the L1 control 
block, and runs IN PLACE in that structure.  Upon nested vmexit, notice 
that the tsc_offset is NOT subtracted, rather, the original offset from 
nested.hsave is copied back into place.

Now, this answers both of your questions.

1)  When reading an MSR, we are not emulating the L2 guest; we are 
DIRECTLY reading the MSR for the L1 emulation.  Any emulation of the L2 
guest is actually done by the code running /inside/ the L1 emulation, so 
MSR reads for the L2 guest are handed by L1, and MSR reads for the L1 
guest are handled by L0, which is this code.

So if we are currently running nested, the L1 tsc_offset is stored in 
the nested.hsave field; the vmcb which is active is polluted by the L2 
guest offset, which would be incorrect to return to the L1 emulation.

2) When writing an MSR, we are also writing directly for the L1 
emulation.  If we are currently nested, we want the new tsc_offset to be 
copied back into place from nested.hsave when the L2 guest triggers a 
vmexit.  So if nested, we must write the offset for L1 emulation into 
nested.hsave.  We must also apply the delta which was created by 
applying this new L1 offset to L2, as we presume, the offset is being 
used to correct for hardware variations.  This is why we compute 
g_tsc_offset and add it to the active L2 offset (which is running live, 
directly in the original vmcb).

Now, when we exit and re-enter L2, observe:

delta = L2 offset - L1 offset = svm->vmcb->control.tsc_offset - 
svm->nested.hsave->control.tsc_offset

now emulation changes the underlying L1 offset to L1':

L1 offset = L1'
active L2 offset += L1' - L1  (*)

Note vmcb is unaffected - vmexit does NOT preserve tsc_offset for the L2 
guest.
However, next time vmrun for L2 is attempted, we compute:

active L2 offset = L1' + L2 offset = (L1 + L2 offset) + (L1' - L1) = (*)

so equivalence is preserved.


It took me many hours of staring and scratching my head to figure out 
that code.  I was convinced many times that is was wrong or buggy, until 
I realized how it actually works.

It gets more complicated when you realize the L1 guest can also be 
manipulating TSC offset for the L2 guest, inside of the vmcb that we are 
running in place.  However, that is beyond the scope of our work; bugs 
in the L1 guest stay in the L1 guest, and it will do its own MSR read / 
write and tsc_offset emulation beyond our visibility.

We really should write this down in a document somewhere.  Avi, shall I 
add this to the timekeeping docs?  (And I hope I got all the L1 and L2s 
correct in the above - apologies if I made it even more confusing by 
misstating something).

Zach