From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Nadav Har'El" Subject: Re: Nested VMX - L1 hangs on running L2 Date: Thu, 28 Jul 2011 14:11:21 +0300 Message-ID: <20110728111121.GA7182@fermat.math.technion.ac.il> References: <20110708184053.GA23958@stratus.com> <20110718182628.GB5324@amt.cnet> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-8-i Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Bandan Das , KVM Mailing List , Marcelo Tosatti To: Zachary Amsden Return-path: Received: from mailgw12.technion.ac.il ([132.68.225.12]:51732 "EHLO mailgw12.technion.ac.il" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753248Ab1G1LL3 (ORCPT ); Thu, 28 Jul 2011 07:11:29 -0400 Content-Disposition: inline In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 h= angs on running L2": > > > No, both patches are wrong. > > > > > The correct fix is to make kvm_get_msr() return the L1 guest TSC = at all > > times. > > > =A0We are serving the L1 guest in this hypervisor, not the L2 gue= st, and so > > > > > should never read its TSC for any reason. I've been thinking about how to correctly solve this problem, and about= this statement that you made, that get_read_tsc() should *always* return L1'= s TSC, even while running L2. We have two quite-different uses for the TSC-related functions: 1. They are used in vmx.c to emulate *architectural* instructions, nam= ely RDTSC, RDMSR of TSC, and WRMSR of TSC. 2. They are used in x86.c for internal KVM uses, adjusting the TSC for various reasons (most of which, I have to admit, I don't understand= ). You said above that kvm_get_msr(), calling guest_read_tsc(), should alw= ays return L1's TSC, even when running L2. However, I believe that that for use #1 - the *architecturally correct* emulation of RDTSC and RDMSR whe= n L2 is running, the current nested-agnostic code - which ends up returni= ng L2's TSC when L2 is running - is the correct thing to do. Look at what happens when: When we get an exit to L0 when L2 run RDMSR TSC: This may exit to L1, if L1 asked in the MSR Bitmap. Done correctly. If hasn't exited, the spec says we need to return L1's TSC plus the TSC_OFFSET that L1 set for L2 (if L1 asked for TSC_OFFSETTING). In other words, we should return L0's TSC offsetted by the sum of vm= cs01 and vmcs12 tsc_offsets. This sum is exactly what we have in the curr= ent (vmcs02) TSC_OFFSET. So the existing guest_read_tsc, which returns rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing - also f= or nested. When we get an exit to L0 when L2 runs RDTSC: This may exit to L1 if L1 asked for TSC_EXITING. Done correctly. There is no "else" here - we would get an EXIT_REASON_RDTSC only if = L1 asked for it (because KVM itself doesn't use TSC_EXITING). Regarding emulation of WRMSR TSC, I actually had an error in the code w= hich I correct in a patch below: According to the Intel spec, it turns out t= hat if WRMSR of the TSC does not cause an exit, then it should change the h= ost TSC itself, ignoring TSC_OFFSET. This is very strange, because it means= that if the host doesn't trap MSR read and write of the TSC, then if a guest writes TSC and then immediately reads it, it won't read what is just wr= ote, but rather what it wrote pluss the TSC_OFFSET (because the TSC_OFFSET i= s added on read, but not subtracted on write). I don't know why the Intel= spec specifies this, but it does - and it's not what I emulated to I fixed t= hat now (in the patch below). But I doubt this is the problem that Bandan a= nd others saw? So now, if I'm right, after this patch, the *architectural* uses of guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSR and RD/WRM= SR, are done correctly. But the problem is that these functions are also used in KVM for other = things. Ideally, whomever calls kvm_read_msr()/kvm_write_msr() should realize t= hat it gets the value for the current running guest, which might be L1 or L2, = just like if that guest would read or write the MSR. Ideally, KVM might read= the MSR, change it and write it back, and everything would be correct as bo= th read and write code are done correctly. The question is, is some of the= code - perhaps in kvm_guest_time_update(), kvm_arch_vcpu_load(), kvm_arch_vcpu= _put() or vcpu_enter_guest() (these are functions which touch the TSC), which = only work correctly when reading/writing L1 TSC and can't work correctly if = L2 TSC is read and written? I'm afraid I don't know enough about what thes= e functions are doing, to understand what, if anything, they are doing wr= ong. Bandan and others who can reproduce this bug - I attach below a patch w= hich should correct some architectural emulation issues I discovered while r= eading the code again now - especially of TSC MSR write. Can you please try if this solves anything? I'm not too optimistic, but since the code was wr= ong in this regard, it's at least worth a try. Thanks, Nadav. Subject: [PATCH] nVMX: Fix nested TSC handling This patch fixes several areas where nested VMX did not correctly emula= ted TSC handling. Signed-off-by: Nadav Har'El --- arch/x86/kvm/vmx.c | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) --- .before/arch/x86/kvm/vmx.c 2011-07-28 14:02:52.000000000 +0300 +++ .after/arch/x86/kvm/vmx.c 2011-07-28 14:02:52.000000000 +0300 @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v */ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) { - vmcs_write64(TSC_OFFSET, offset); - if (is_guest_mode(vcpu)) + if (is_guest_mode(vcpu)) { /* - * We're here if L1 chose not to trap the TSC MSR. Since - * prepare_vmcs12() does not copy tsc_offset, we need to also - * set the vmcs12 field here. + * We're here if L1 chose not to trap WRMSR to TSC. According + * to the spec, this should set L1's TSC; The offset that L1 + * set for L2 remains unchanged, and still needs to be added + * to the newly set TSC to get L2's TSC. */ - get_vmcs12(vcpu)->tsc_offset =3D offset - - to_vmx(vcpu)->nested.vmcs01_tsc_offset; + struct vmcs12 *vmcs12; + to_vmx(vcpu)->nested.vmcs01_tsc_offset =3D offset; + /* recalculate vmcs02.TSC_OFFSET: */ + vmcs12 =3D get_vmcs12(vcpu); + vmcs_write64(TSC_OFFSET, offset + + (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ? + vmcs12->tsc_offset : 0)); + } else { + vmcs_write64(TSC_OFFSET, offset); + } } =20 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustmen= t) @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc =20 set_cr4_guest_host_mask(vmx); =20 - vmcs_write64(TSC_OFFSET, - vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); + if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING) + vmcs_write64(TSC_OFFSET, + vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); + else + vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); =20 if (enable_vpid) { /* @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm =20 load_vmcs12_host_state(vcpu, vmcs12); =20 - /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran= */ + /* Update TSC_OFFSET if TSC was changed while L2 ran */ vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); =20 /* This is needed for same reason as it was needed in prepare_vmcs02 = */ --=20 Nadav Har'El | Thursday, Jul 28 2011, 26 Tamm= uz 5771 nyh@math.technion.ac.il |----------------------------------= ------- Phone +972-523-790466, ICQ 13349191 |Just remember that if the world di= dn't http://nadav.harel.org.il |suck, we would all fall off.