Nested VMX - L1 hangs on running L2

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* Nested VMX - L1 hangs on running L2
@ 2011-07-08 18:40 Bandan Das
  2011-07-18 18:26 ` Marcelo Tosatti
  0 siblings, 1 reply; 18+ messages in thread
From: Bandan Das @ 2011-07-08 18:40 UTC (permalink / raw)
  To: KVM Mailing List; +Cc: Nadav Har'El

I have already discussed this a bit with Nadav but hoping someone 
else has any other ideas/clues/suggestions/comments. With recent versions of the 
kernel (The last I tried is 3.0-rc5 with nVMX patches already merged), my L1 guest 
always hangs when I start L2. 

My setup : The host, L1 and L2 all are FC15 with the host running 3.0-rc5. When L1 is up 
and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. Although, if
if I run tracing on the host, I see :

...
qemu-kvm-19756 [013] 153774.856178: kvm_exit: reason APIC_ACCESS rip 0xffffffff81025098 info 1380 0
qemu-kvm-19756 [013] 153774.856189: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
qemu-kvm-19756 [013] 153774.856191: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
...

My point being that I only see kvm_exit messages but no kvm_entry. Does this mean that the VCPUs
are somehow stuck in L2 ?

Anyway, since this setup was running fine for me on older kernels, and I couldn't
identify any significant changes in nVMX, I sifted through the other KVM changes and found this :

--
commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d
Author: Nikola Ciprich <extmaillist@linuxbox.cz>
Date:   Wed Mar 9 23:36:51 2011 +0100

    KVM: fix kvmclock regression due to missing clock update

    commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
    breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
    to proper place.

    Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
    Acked-by: Zachary Amsden <zamsden@redhat.com>
    Signed-off-by: Avi Kivity <avi@redhat.com>

index 01f08a6..f1e4025 100644 (file)
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
                if (check_tsc_unstable()) {
                        kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
                        vcpu->arch.tsc_catchup = 1;
-                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
                }
+               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
                if (vcpu->cpu != cpu)
                        kvm_migrate_timers(vcpu);
                vcpu->cpu = cpu;
--

If I revert this change, my L1/L2 guests run fine. This ofcourse, just hides the bug
because on my machine, check_tsc_unstable() returns false.

I found out from Nadav that when KVM decides to run L2, it will write 
vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET which seems right.
But I verified that, if instead, I just write 
vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I don't see the bug anymore.

Not sure where to go from here. I would appreciate if any one has any ideas.

Bandan

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-08 18:40 Nested VMX - L1 hangs on running L2 Bandan Das
@ 2011-07-18 18:26 ` Marcelo Tosatti
  2011-07-19  2:41   ` Bandan Das
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2011-07-18 18:26 UTC (permalink / raw)
  To: Bandan Das; +Cc: KVM Mailing List, Nadav Har'El, Zachary Amsden


On Fri, Jul 08, 2011 at 02:40:53PM -0400, Bandan Das wrote:
> I have already discussed this a bit with Nadav but hoping someone 
> else has any other ideas/clues/suggestions/comments. With recent versions of the 
> kernel (The last I tried is 3.0-rc5 with nVMX patches already merged), my L1 guest 
> always hangs when I start L2. 
> 
> My setup : The host, L1 and L2 all are FC15 with the host running 3.0-rc5. When L1 is up 
> and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. Although, if
> if I run tracing on the host, I see :
> 
> ...
> qemu-kvm-19756 [013] 153774.856178: kvm_exit: reason APIC_ACCESS rip 0xffffffff81025098 info 1380 0
> qemu-kvm-19756 [013] 153774.856189: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
> qemu-kvm-19756 [013] 153774.856191: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
> ...
> 
> My point being that I only see kvm_exit messages but no kvm_entry. Does this mean that the VCPUs
> are somehow stuck in L2 ?
> 
> Anyway, since this setup was running fine for me on older kernels, and I couldn't
> identify any significant changes in nVMX, I sifted through the other KVM changes and found this :
> 
> --
> commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d
> Author: Nikola Ciprich <extmaillist@linuxbox.cz>
> Date:   Wed Mar 9 23:36:51 2011 +0100
> 
>     KVM: fix kvmclock regression due to missing clock update
>     
>     commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
>     breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
>     to proper place.
>     
>     Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
>     Acked-by: Zachary Amsden <zamsden@redhat.com>
>     Signed-off-by: Avi Kivity <avi@redhat.com>
> 
> index 01f08a6..f1e4025 100644 (file)
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>                 if (check_tsc_unstable()) {
>                         kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>                         vcpu->arch.tsc_catchup = 1;
> -                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>                 }
> +               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>                 if (vcpu->cpu != cpu)
>                         kvm_migrate_timers(vcpu);
>                 vcpu->cpu = cpu;
> --
> 
> If I revert this change, my L1/L2 guests run fine. This ofcourse, just hides the bug
> because on my machine, check_tsc_unstable() returns false.
> 
> I found out from Nadav that when KVM decides to run L2, it will write 
> vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET which seems right.
> But I verified that, if instead, I just write 
> vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I don't see the bug anymore.
> 
> Not sure where to go from here. I would appreciate if any one has any ideas.
> 
> 
> Bandan

Using guests TSC value when performing TSC adjustments is wrong. Can
you please try the following patch, which skips TSC adjustments if
vcpu is in guest mode.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2b76ae3..44c90d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1096,6 +1096,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;
 
+	if (is_guest_mode(v))
+		return 0;
+
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
 	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
@@ -2214,6 +2217,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
 			     tsc - vcpu->arch.last_guest_tsc;
 
+		if (is_guest_mode(vcpu))
+			tsc_delta = 0;
+
 		if (tsc_delta < 0)
 			mark_tsc_unstable("KVM discovered backwards TSC");
 		if (check_tsc_unstable()) {
@@ -2234,7 +2240,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	if (!is_guest_mode(vcpu))
+		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
 }
 
 static int is_efer_nx(void)
@@ -5717,7 +5724,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	if (!is_guest_mode(vcpu))
+		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-18 18:26 ` Marcelo Tosatti
@ 2011-07-19  2:41   ` Bandan Das
  2011-07-20  7:58   ` Jan Kiszka
       [not found]   ` <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com>
  2 siblings, 0 replies; 18+ messages in thread
From: Bandan Das @ 2011-07-19  2:41 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: KVM Mailing List, Nadav Har'El, Zachary Amsden

On  0, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> On Fri, Jul 08, 2011 at 02:40:53PM -0400, Bandan Das wrote:
> > I have already discussed this a bit with Nadav but hoping someone 
> > else has any other ideas/clues/suggestions/comments. With recent versions of the 
> > kernel (The last I tried is 3.0-rc5 with nVMX patches already merged), my L1 guest 
> > always hangs when I start L2. 
> > 
> > My setup : The host, L1 and L2 all are FC15 with the host running 3.0-rc5. When L1 is up 
> > and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. Although, if
> > if I run tracing on the host, I see :
> > 
>
... 
> Using guests TSC value when performing TSC adjustments is wrong. Can
> you please try the following patch, which skips TSC adjustments if
> vcpu is in guest mode.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2b76ae3..44c90d1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1096,6 +1096,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	s64 kernel_ns, max_kernel_ns;
>  	u64 tsc_timestamp;
>  
> +	if (is_guest_mode(v))
> +		return 0;
> +
>  	/* Keep irq disabled to prevent changes to the clock */
>  	local_irq_save(flags);
>  	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> @@ -2214,6 +2217,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>  			     tsc - vcpu->arch.last_guest_tsc;
>  
> +		if (is_guest_mode(vcpu))
> +			tsc_delta = 0;
> +
>  		if (tsc_delta < 0)
>  			mark_tsc_unstable("KVM discovered backwards TSC");
>  		if (check_tsc_unstable()) {
> @@ -2234,7 +2240,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
>  	kvm_x86_ops->vcpu_put(vcpu);
>  	kvm_put_guest_fpu(vcpu);
> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +	if (!is_guest_mode(vcpu))
> +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>  }
>  
>  static int is_efer_nx(void)
> @@ -5717,7 +5724,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	if (hw_breakpoint_active())
>  		hw_breakpoint_restore();
>  
> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +	if (!is_guest_mode(vcpu))
> +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>  
>  	vcpu->mode = OUTSIDE_GUEST_MODE;
>  	smp_wmb();

Hi Marcelo,

Thanks for looking at this and for the explanation of the cause. Your patch does solve
my problem. I have been running my L2 guest for a few hours now without any hang, while
usually, I would get a hang within minutes of booting L2. 

Thanks again!
Bandan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-18 18:26 ` Marcelo Tosatti
  2011-07-19  2:41   ` Bandan Das
@ 2011-07-20  7:58   ` Jan Kiszka
  2011-07-20 16:12     ` Marcelo Tosatti
       [not found]   ` <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com>
  2 siblings, 1 reply; 18+ messages in thread
From: Jan Kiszka @ 2011-07-20  7:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Bandan Das, KVM Mailing List, Nadav Har'El, Zachary Amsden

[-- Attachment #1: Type: text/plain, Size: 5087 bytes --]

On 2011-07-18 20:26, Marcelo Tosatti wrote:
> 
> On Fri, Jul 08, 2011 at 02:40:53PM -0400, Bandan Das wrote:
>> I have already discussed this a bit with Nadav but hoping someone 
>> else has any other ideas/clues/suggestions/comments. With recent versions of the 
>> kernel (The last I tried is 3.0-rc5 with nVMX patches already merged), my L1 guest 
>> always hangs when I start L2. 
>>
>> My setup : The host, L1 and L2 all are FC15 with the host running 3.0-rc5. When L1 is up 
>> and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. Although, if
>> if I run tracing on the host, I see :
>>
>> ...
>> qemu-kvm-19756 [013] 153774.856178: kvm_exit: reason APIC_ACCESS rip 0xffffffff81025098 info 1380 0
>> qemu-kvm-19756 [013] 153774.856189: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
>> qemu-kvm-19756 [013] 153774.856191: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
>> ...
>>
>> My point being that I only see kvm_exit messages but no kvm_entry. Does this mean that the VCPUs
>> are somehow stuck in L2 ?
>>
>> Anyway, since this setup was running fine for me on older kernels, and I couldn't
>> identify any significant changes in nVMX, I sifted through the other KVM changes and found this :
>>
>> --
>> commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d
>> Author: Nikola Ciprich <extmaillist@linuxbox.cz>
>> Date:   Wed Mar 9 23:36:51 2011 +0100
>>
>>     KVM: fix kvmclock regression due to missing clock update
>>     
>>     commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
>>     breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
>>     to proper place.
>>     
>>     Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
>>     Acked-by: Zachary Amsden <zamsden@redhat.com>
>>     Signed-off-by: Avi Kivity <avi@redhat.com>
>>
>> index 01f08a6..f1e4025 100644 (file)
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>                 if (check_tsc_unstable()) {
>>                         kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>>                         vcpu->arch.tsc_catchup = 1;
>> -                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>                 }
>> +               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>                 if (vcpu->cpu != cpu)
>>                         kvm_migrate_timers(vcpu);
>>                 vcpu->cpu = cpu;
>> --
>>
>> If I revert this change, my L1/L2 guests run fine. This ofcourse, just hides the bug
>> because on my machine, check_tsc_unstable() returns false.
>>
>> I found out from Nadav that when KVM decides to run L2, it will write 
>> vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET which seems right.
>> But I verified that, if instead, I just write 
>> vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I don't see the bug anymore.
>>
>> Not sure where to go from here. I would appreciate if any one has any ideas.
>>
>>
>> Bandan
> 
> Using guests TSC value when performing TSC adjustments is wrong. Can
> you please try the following patch, which skips TSC adjustments if
> vcpu is in guest mode.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2b76ae3..44c90d1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1096,6 +1096,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	s64 kernel_ns, max_kernel_ns;
>  	u64 tsc_timestamp;
>  
> +	if (is_guest_mode(v))
> +		return 0;
> +
>  	/* Keep irq disabled to prevent changes to the clock */
>  	local_irq_save(flags);
>  	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> @@ -2214,6 +2217,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>  			     tsc - vcpu->arch.last_guest_tsc;
>  
> +		if (is_guest_mode(vcpu))
> +			tsc_delta = 0;
> +
>  		if (tsc_delta < 0)
>  			mark_tsc_unstable("KVM discovered backwards TSC");
>  		if (check_tsc_unstable()) {
> @@ -2234,7 +2240,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
>  	kvm_x86_ops->vcpu_put(vcpu);
>  	kvm_put_guest_fpu(vcpu);
> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +	if (!is_guest_mode(vcpu))
> +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>  }
>  
>  static int is_efer_nx(void)
> @@ -5717,7 +5724,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	if (hw_breakpoint_active())
>  		hw_breakpoint_restore();
>  
> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +	if (!is_guest_mode(vcpu))
> +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>  
>  	vcpu->mode = OUTSIDE_GUEST_MODE;
>  	smp_wmb();

That unfortunately does not fix the L1 lockups I get here - unless I
confine L1 to a single CPU. It looks like (don't have all symbols for
the guest kernel ATM) that we are stuck in processing a timer IRQ.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-20  7:58   ` Jan Kiszka
@ 2011-07-20 16:12     ` Marcelo Tosatti
  2011-07-20 16:19       ` Jan Kiszka
  0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2011-07-20 16:12 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Bandan Das, KVM Mailing List, Nadav Har'El, Zachary Amsden

On Wed, Jul 20, 2011 at 09:58:47AM +0200, Jan Kiszka wrote:
> On 2011-07-18 20:26, Marcelo Tosatti wrote:
> > 
> > On Fri, Jul 08, 2011 at 02:40:53PM -0400, Bandan Das wrote:
> >> I have already discussed this a bit with Nadav but hoping someone 
> >> else has any other ideas/clues/suggestions/comments. With recent versions of the 
> >> kernel (The last I tried is 3.0-rc5 with nVMX patches already merged), my L1 guest 
> >> always hangs when I start L2. 
> >>
> >> My setup : The host, L1 and L2 all are FC15 with the host running 3.0-rc5. When L1 is up 
> >> and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. Although, if
> >> if I run tracing on the host, I see :
> >>
> >> ...
> >> qemu-kvm-19756 [013] 153774.856178: kvm_exit: reason APIC_ACCESS rip 0xffffffff81025098 info 1380 0
> >> qemu-kvm-19756 [013] 153774.856189: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
> >> qemu-kvm-19756 [013] 153774.856191: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
> >> ...
> >>
> >> My point being that I only see kvm_exit messages but no kvm_entry. Does this mean that the VCPUs
> >> are somehow stuck in L2 ?
> >>
> >> Anyway, since this setup was running fine for me on older kernels, and I couldn't
> >> identify any significant changes in nVMX, I sifted through the other KVM changes and found this :
> >>
> >> --
> >> commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d
> >> Author: Nikola Ciprich <extmaillist@linuxbox.cz>
> >> Date:   Wed Mar 9 23:36:51 2011 +0100
> >>
> >>     KVM: fix kvmclock regression due to missing clock update
> >>     
> >>     commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
> >>     breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
> >>     to proper place.
> >>     
> >>     Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
> >>     Acked-by: Zachary Amsden <zamsden@redhat.com>
> >>     Signed-off-by: Avi Kivity <avi@redhat.com>
> >>
> >> index 01f08a6..f1e4025 100644 (file)
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >>                 if (check_tsc_unstable()) {
> >>                         kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> >>                         vcpu->arch.tsc_catchup = 1;
> >> -                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >>                 }
> >> +               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >>                 if (vcpu->cpu != cpu)
> >>                         kvm_migrate_timers(vcpu);
> >>                 vcpu->cpu = cpu;
> >> --
> >>
> >> If I revert this change, my L1/L2 guests run fine. This ofcourse, just hides the bug
> >> because on my machine, check_tsc_unstable() returns false.
> >>
> >> I found out from Nadav that when KVM decides to run L2, it will write 
> >> vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET which seems right.
> >> But I verified that, if instead, I just write 
> >> vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I don't see the bug anymore.
> >>
> >> Not sure where to go from here. I would appreciate if any one has any ideas.
> >>
> >>
> >> Bandan
> > 
> > Using guests TSC value when performing TSC adjustments is wrong. Can
> > you please try the following patch, which skips TSC adjustments if
> > vcpu is in guest mode.
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 2b76ae3..44c90d1 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1096,6 +1096,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >  	s64 kernel_ns, max_kernel_ns;
> >  	u64 tsc_timestamp;
> >  
> > +	if (is_guest_mode(v))
> > +		return 0;
> > +
> >  	/* Keep irq disabled to prevent changes to the clock */
> >  	local_irq_save(flags);
> >  	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> > @@ -2214,6 +2217,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >  		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
> >  			     tsc - vcpu->arch.last_guest_tsc;
> >  
> > +		if (is_guest_mode(vcpu))
> > +			tsc_delta = 0;
> > +
> >  		if (tsc_delta < 0)
> >  			mark_tsc_unstable("KVM discovered backwards TSC");
> >  		if (check_tsc_unstable()) {
> > @@ -2234,7 +2240,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >  {
> >  	kvm_x86_ops->vcpu_put(vcpu);
> >  	kvm_put_guest_fpu(vcpu);
> > -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> > +	if (!is_guest_mode(vcpu))
> > +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> >  }
> >  
> >  static int is_efer_nx(void)
> > @@ -5717,7 +5724,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  	if (hw_breakpoint_active())
> >  		hw_breakpoint_restore();
> >  
> > -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> > +	if (!is_guest_mode(vcpu))
> > +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> >  
> >  	vcpu->mode = OUTSIDE_GUEST_MODE;
> >  	smp_wmb();
> 
> That unfortunately does not fix the L1 lockups I get here - unless I
> confine L1 to a single CPU. It looks like (don't have all symbols for
> the guest kernel ATM) that we are stuck in processing a timer IRQ.
> 
> Jan

Is L1 using kvmclock? 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-20 16:12     ` Marcelo Tosatti
@ 2011-07-20 16:19       ` Jan Kiszka
  2011-07-20 16:35         ` Marcelo Tosatti
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kiszka @ 2011-07-20 16:19 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Bandan Das, KVM Mailing List, Nadav Har'El, Zachary Amsden

[-- Attachment #1: Type: text/plain, Size: 5566 bytes --]

On 2011-07-20 18:12, Marcelo Tosatti wrote:
> On Wed, Jul 20, 2011 at 09:58:47AM +0200, Jan Kiszka wrote:
>> On 2011-07-18 20:26, Marcelo Tosatti wrote:
>>>
>>> On Fri, Jul 08, 2011 at 02:40:53PM -0400, Bandan Das wrote:
>>>> I have already discussed this a bit with Nadav but hoping someone 
>>>> else has any other ideas/clues/suggestions/comments. With recent versions of the 
>>>> kernel (The last I tried is 3.0-rc5 with nVMX patches already merged), my L1 guest 
>>>> always hangs when I start L2. 
>>>>
>>>> My setup : The host, L1 and L2 all are FC15 with the host running 3.0-rc5. When L1 is up 
>>>> and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. Although, if
>>>> if I run tracing on the host, I see :
>>>>
>>>> ...
>>>> qemu-kvm-19756 [013] 153774.856178: kvm_exit: reason APIC_ACCESS rip 0xffffffff81025098 info 1380 0
>>>> qemu-kvm-19756 [013] 153774.856189: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
>>>> qemu-kvm-19756 [013] 153774.856191: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
>>>> ...
>>>>
>>>> My point being that I only see kvm_exit messages but no kvm_entry. Does this mean that the VCPUs
>>>> are somehow stuck in L2 ?
>>>>
>>>> Anyway, since this setup was running fine for me on older kernels, and I couldn't
>>>> identify any significant changes in nVMX, I sifted through the other KVM changes and found this :
>>>>
>>>> --
>>>> commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d
>>>> Author: Nikola Ciprich <extmaillist@linuxbox.cz>
>>>> Date:   Wed Mar 9 23:36:51 2011 +0100
>>>>
>>>>     KVM: fix kvmclock regression due to missing clock update
>>>>     
>>>>     commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
>>>>     breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
>>>>     to proper place.
>>>>     
>>>>     Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
>>>>     Acked-by: Zachary Amsden <zamsden@redhat.com>
>>>>     Signed-off-by: Avi Kivity <avi@redhat.com>
>>>>
>>>> index 01f08a6..f1e4025 100644 (file)
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>                 if (check_tsc_unstable()) {
>>>>                         kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>>>>                         vcpu->arch.tsc_catchup = 1;
>>>> -                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>>>                 }
>>>> +               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>>>                 if (vcpu->cpu != cpu)
>>>>                         kvm_migrate_timers(vcpu);
>>>>                 vcpu->cpu = cpu;
>>>> --
>>>>
>>>> If I revert this change, my L1/L2 guests run fine. This ofcourse, just hides the bug
>>>> because on my machine, check_tsc_unstable() returns false.
>>>>
>>>> I found out from Nadav that when KVM decides to run L2, it will write 
>>>> vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET which seems right.
>>>> But I verified that, if instead, I just write 
>>>> vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I don't see the bug anymore.
>>>>
>>>> Not sure where to go from here. I would appreciate if any one has any ideas.
>>>>
>>>>
>>>> Bandan
>>>
>>> Using guests TSC value when performing TSC adjustments is wrong. Can
>>> you please try the following patch, which skips TSC adjustments if
>>> vcpu is in guest mode.
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 2b76ae3..44c90d1 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -1096,6 +1096,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>>>  	s64 kernel_ns, max_kernel_ns;
>>>  	u64 tsc_timestamp;
>>>  
>>> +	if (is_guest_mode(v))
>>> +		return 0;
>>> +
>>>  	/* Keep irq disabled to prevent changes to the clock */
>>>  	local_irq_save(flags);
>>>  	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
>>> @@ -2214,6 +2217,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>  		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>>>  			     tsc - vcpu->arch.last_guest_tsc;
>>>  
>>> +		if (is_guest_mode(vcpu))
>>> +			tsc_delta = 0;
>>> +
>>>  		if (tsc_delta < 0)
>>>  			mark_tsc_unstable("KVM discovered backwards TSC");
>>>  		if (check_tsc_unstable()) {
>>> @@ -2234,7 +2240,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>  {
>>>  	kvm_x86_ops->vcpu_put(vcpu);
>>>  	kvm_put_guest_fpu(vcpu);
>>> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>>> +	if (!is_guest_mode(vcpu))
>>> +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>>>  }
>>>  
>>>  static int is_efer_nx(void)
>>> @@ -5717,7 +5724,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>>  	if (hw_breakpoint_active())
>>>  		hw_breakpoint_restore();
>>>  
>>> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>>> +	if (!is_guest_mode(vcpu))
>>> +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>>>  
>>>  	vcpu->mode = OUTSIDE_GUEST_MODE;
>>>  	smp_wmb();
>>
>> That unfortunately does not fix the L1 lockups I get here - unless I
>> confine L1 to a single CPU. It looks like (don't have all symbols for
>> the guest kernel ATM) that we are stuck in processing a timer IRQ.
>>
>> Jan
> 
> Is L1 using kvmclock? 

Yes, it's a standard 3.0-rc7 SUSE kernel. Disabling it seems to help on
first glance.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-20 16:19       ` Jan Kiszka
@ 2011-07-20 16:35         ` Marcelo Tosatti
  0 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2011-07-20 16:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Bandan Das, KVM Mailing List, Nadav Har'El, Zachary Amsden

On Wed, Jul 20, 2011 at 06:19:19PM +0200, Jan Kiszka wrote:
> On 2011-07-20 18:12, Marcelo Tosatti wrote:
> > On Wed, Jul 20, 2011 at 09:58:47AM +0200, Jan Kiszka wrote:
> >> On 2011-07-18 20:26, Marcelo Tosatti wrote:
> >>>
> >>> On Fri, Jul 08, 2011 at 02:40:53PM -0400, Bandan Das wrote:
> >>>> I have already discussed this a bit with Nadav but hoping someone 
> >>>> else has any other ideas/clues/suggestions/comments. With recent versions of the 
> >>>> kernel (The last I tried is 3.0-rc5 with nVMX patches already merged), my L1 guest 
> >>>> always hangs when I start L2. 
> >>>>
> >>>> My setup : The host, L1 and L2 all are FC15 with the host running 3.0-rc5. When L1 is up 
> >>>> and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. Although, if
> >>>> if I run tracing on the host, I see :
> >>>>
> >>>> ...
> >>>> qemu-kvm-19756 [013] 153774.856178: kvm_exit: reason APIC_ACCESS rip 0xffffffff81025098 info 1380 0
> >>>> qemu-kvm-19756 [013] 153774.856189: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
> >>>> qemu-kvm-19756 [013] 153774.856191: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
> >>>> ...
> >>>>
> >>>> My point being that I only see kvm_exit messages but no kvm_entry. Does this mean that the VCPUs
> >>>> are somehow stuck in L2 ?
> >>>>
> >>>> Anyway, since this setup was running fine for me on older kernels, and I couldn't
> >>>> identify any significant changes in nVMX, I sifted through the other KVM changes and found this :
> >>>>
> >>>> --
> >>>> commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d
> >>>> Author: Nikola Ciprich <extmaillist@linuxbox.cz>
> >>>> Date:   Wed Mar 9 23:36:51 2011 +0100
> >>>>
> >>>>     KVM: fix kvmclock regression due to missing clock update
> >>>>     
> >>>>     commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
> >>>>     breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
> >>>>     to proper place.
> >>>>     
> >>>>     Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
> >>>>     Acked-by: Zachary Amsden <zamsden@redhat.com>
> >>>>     Signed-off-by: Avi Kivity <avi@redhat.com>
> >>>>
> >>>> index 01f08a6..f1e4025 100644 (file)
> >>>> --- a/arch/x86/kvm/x86.c
> >>>> +++ b/arch/x86/kvm/x86.c
> >>>> @@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >>>>                 if (check_tsc_unstable()) {
> >>>>                         kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
> >>>>                         vcpu->arch.tsc_catchup = 1;
> >>>> -                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >>>>                 }
> >>>> +               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> >>>>                 if (vcpu->cpu != cpu)
> >>>>                         kvm_migrate_timers(vcpu);
> >>>>                 vcpu->cpu = cpu;
> >>>> --
> >>>>
> >>>> If I revert this change, my L1/L2 guests run fine. This ofcourse, just hides the bug
> >>>> because on my machine, check_tsc_unstable() returns false.
> >>>>
> >>>> I found out from Nadav that when KVM decides to run L2, it will write 
> >>>> vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET which seems right.
> >>>> But I verified that, if instead, I just write 
> >>>> vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I don't see the bug anymore.
> >>>>
> >>>> Not sure where to go from here. I would appreciate if any one has any ideas.
> >>>>
> >>>>
> >>>> Bandan
> >>>
> >>> Using guests TSC value when performing TSC adjustments is wrong. Can
> >>> you please try the following patch, which skips TSC adjustments if
> >>> vcpu is in guest mode.
> >>>
> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>> index 2b76ae3..44c90d1 100644
> >>> --- a/arch/x86/kvm/x86.c
> >>> +++ b/arch/x86/kvm/x86.c
> >>> @@ -1096,6 +1096,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >>>  	s64 kernel_ns, max_kernel_ns;
> >>>  	u64 tsc_timestamp;
> >>>  
> >>> +	if (is_guest_mode(v))
> >>> +		return 0;
> >>> +
> >>>  	/* Keep irq disabled to prevent changes to the clock */
> >>>  	local_irq_save(flags);
> >>>  	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> >>> @@ -2214,6 +2217,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >>>  		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
> >>>  			     tsc - vcpu->arch.last_guest_tsc;
> >>>  
> >>> +		if (is_guest_mode(vcpu))
> >>> +			tsc_delta = 0;
> >>> +
> >>>  		if (tsc_delta < 0)
> >>>  			mark_tsc_unstable("KVM discovered backwards TSC");
> >>>  		if (check_tsc_unstable()) {
> >>> @@ -2234,7 +2240,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >>>  {
> >>>  	kvm_x86_ops->vcpu_put(vcpu);
> >>>  	kvm_put_guest_fpu(vcpu);
> >>> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> >>> +	if (!is_guest_mode(vcpu))
> >>> +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> >>>  }
> >>>  
> >>>  static int is_efer_nx(void)
> >>> @@ -5717,7 +5724,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >>>  	if (hw_breakpoint_active())
> >>>  		hw_breakpoint_restore();
> >>>  
> >>> -	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> >>> +	if (!is_guest_mode(vcpu))
> >>> +		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> >>>  
> >>>  	vcpu->mode = OUTSIDE_GUEST_MODE;
> >>>  	smp_wmb();
> >>
> >> That unfortunately does not fix the L1 lockups I get here - unless I
> >> confine L1 to a single CPU. It looks like (don't have all symbols for
> >> the guest kernel ATM) that we are stuck in processing a timer IRQ.
> >>
> >> Jan
> > 
> > Is L1 using kvmclock? 
> 
> Yes, it's a standard 3.0-rc7 SUSE kernel. Disabling it seems to help on
> first glance.
> 
> Jan

The patch i posted is broken, thinking on a proper fix.


^ permalink raw reply	[flat|nested] 18+ messages in thread

[parent not found: <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com>]

* Re: Nested VMX - L1 hangs on running L2
       [not found]   ` <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com>
@ 2011-07-20 19:52     ` Nadav Har'El
  2011-07-20 20:42       ` Bandan Das
  2011-07-21  2:49       ` Zachary Amsden
  0 siblings, 2 replies; 18+ messages in thread
From: Nadav Har'El @ 2011-07-20 19:52 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti

> No, both patches are wrong.

Guys, thanks for looking into this bug. I'm afraid I'm still at a loss at
why a TSC bug would even cause a guest lockup :(

When Avi Kivity saw my nested TSC handling code he remarked "this is
probably wrong". When I asked him where it was wrong, he basically said
that he didn't know where, but TSC handling code is always wrong ;-)
And it turns out he was right.

> The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
times.
>  We are serving the L1 guest in this hypervisor, not the L2 guest, and so

> should never read its TSC for any reason.
...
> allows the L2 guest to overwrite the L1 guest TSC, which at first seems
wrong,
> but is in fact the correct virtualization of a security hole in the L1
guest.

I think I'm beginning to see the error in my ways...

When L1 lets L2 (using the MSR bitmap) direct read/write access to the TSC,
it doesn't want L0 to be "clever" and give L2 its own separate TSC (like
I do now), but rather gives it full control over L1's TSC - so reading or
writing it should actually return L1's TSC, and the TSC_OFFSET in vmcs12
is to be ignored.

So basically, if I understand correctly, what I need to change is
in prepare_vmcs02(), if the MSR_IA32_TSC is on the MSR bitmap (read?
write?), instead of doing
        vmcs_write64(TSC_OFFSET,
                vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
I just need to do
        vmcs_write64(TSC_OFFSET,
                vmx->nested.vmcs01_tsc_offset);
thereby giving L2 exactly the same TSC that L1 had.
Brandan, if I remember correctly you once tried this sort of fix and
it actually worked?

Then, guest_read_tsc() will return (without need to change this code)
the correct L1 TSC.

And vmx_write_tsc_offset() should do in the is_guest_mode() not what
it does now (vmcs12->tsc_offset is of no important when the TSC MSR
is passed through) but rather set vmcs01_tsc_offset (which will be
applied on the next exit to L1).

Is my analysis correct? Or perhaps completely wrong? ;-)
Am I missing anything else that should be change?

In any case, I don't understand why on my machine I never encountered
these problems, and nothing broke even if I replaced the TSC nesting
code with randomly broken code. Are the people who are seeing this
brakage actually passed the MSR from L1 to L2 - using the MSR bitmap -
like I guessed above? Or am I missing something completely different?

Sorry, but I'm really becoming confused by these TSC issues...

Thanks,
Nadav.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-20 19:52     ` Nadav Har'El
@ 2011-07-20 20:42       ` Bandan Das
  2011-07-21  2:49       ` Zachary Amsden
  1 sibling, 0 replies; 18+ messages in thread
From: Bandan Das @ 2011-07-20 20:42 UTC (permalink / raw)
  To: Nadav Har'El
  Cc: Zachary Amsden, Bandan Das, KVM Mailing List, Marcelo Tosatti

On  0, Nadav Har'El <NYH@il.ibm.com> wrote:
> > No, both patches are wrong.
> 
> Guys, thanks for looking into this bug. I'm afraid I'm still at a loss at
> why a TSC bug would even cause a guest lockup :(
> 
> When Avi Kivity saw my nested TSC handling code he remarked "this is
> probably wrong". When I asked him where it was wrong, he basically said
> that he didn't know where, but TSC handling code is always wrong ;-)
> And it turns out he was right.
> 
> > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
> times.
> >  We are serving the L1 guest in this hypervisor, not the L2 guest, and so
> 
> > should never read its TSC for any reason.
> ...
> > allows the L2 guest to overwrite the L1 guest TSC, which at first seems
> wrong,
> > but is in fact the correct virtualization of a security hole in the L1
> guest.
> 
> I think I'm beginning to see the error in my ways...
> 
> When L1 lets L2 (using the MSR bitmap) direct read/write access to the TSC,
> it doesn't want L0 to be "clever" and give L2 its own separate TSC (like
> I do now), but rather gives it full control over L1's TSC - so reading or
> writing it should actually return L1's TSC, and the TSC_OFFSET in vmcs12
> is to be ignored.
> 
> So basically, if I understand correctly, what I need to change is
> in prepare_vmcs02(), if the MSR_IA32_TSC is on the MSR bitmap (read?
> write?), instead of doing
>         vmcs_write64(TSC_OFFSET,
>                 vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> I just need to do
>         vmcs_write64(TSC_OFFSET,
>                 vmx->nested.vmcs01_tsc_offset);
> thereby giving L2 exactly the same TSC that L1 had.
> Brandan, if I remember correctly you once tried this sort of fix and
> it actually worked?
That is correct. That is still my "workaround fix" that I have been using
on my systems. But as you have mentioned above (and below), I am still struggling
with two questions :

1. Why does L1 hang even if the TSC has wrong values.
2. I see this on a Dell  R610 and I don't know why you and some others don't see this.
I assumed from the symptoms that this should be fairly easy to reproduce on any system.

Bandan 
> Then, guest_read_tsc() will return (without need to change this code)
> the correct L1 TSC.
> 
> And vmx_write_tsc_offset() should do in the is_guest_mode() not what
> it does now (vmcs12->tsc_offset is of no important when the TSC MSR
> is passed through) but rather set vmcs01_tsc_offset (which will be
> applied on the next exit to L1).
> 
> Is my analysis correct? Or perhaps completely wrong? ;-)
> Am I missing anything else that should be change?
> 
> In any case, I don't understand why on my machine I never encountered
> these problems, and nothing broke even if I replaced the TSC nesting
> code with randomly broken code. Are the people who are seeing this
> brakage actually passed the MSR from L1 to L2 - using the MSR bitmap -
> like I guessed above? Or am I missing something completely different?
> 
> Sorry, but I'm really becoming confused by these TSC issues...
> 
> Thanks,
> Nadav.
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-20 19:52     ` Nadav Har'El
  2011-07-20 20:42       ` Bandan Das
@ 2011-07-21  2:49       ` Zachary Amsden
  2011-07-27 11:51         ` Nadav Har'El
  2011-07-28 11:11         ` Nadav Har'El
  1 sibling, 2 replies; 18+ messages in thread
From: Zachary Amsden @ 2011-07-21  2:49 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti

On Wed, Jul 20, 2011 at 12:52 PM, Nadav Har'El <NYH@il.ibm.com> wrote:
>
> > No, both patches are wrong.
>
> Guys, thanks for looking into this bug. I'm afraid I'm still at a loss at
> why a TSC bug would even cause a guest lockup :(
>
> When Avi Kivity saw my nested TSC handling code he remarked "this is
> probably wrong". When I asked him where it was wrong, he basically said
> that he didn't know where, but TSC handling code is always wrong ;-)
> And it turns out he was right.
>
> > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
> times.
> >  We are serving the L1 guest in this hypervisor, not the L2 guest, and so
>
> > should never read its TSC for any reason.
> ...
> > allows the L2 guest to overwrite the L1 guest TSC, which at first seems
> wrong,
> > but is in fact the correct virtualization of a security hole in the L1
> guest.
>
> I think I'm beginning to see the error in my ways...
>
> When L1 lets L2 (using the MSR bitmap) direct read/write access to the TSC,
> it doesn't want L0 to be "clever" and give L2 its own separate TSC (like
> I do now), but rather gives it full control over L1's TSC - so reading or
> writing it should actually return L1's TSC, and the TSC_OFFSET in vmcs12
> is to be ignored.

>
> So basically, if I understand correctly, what I need to change is
> in prepare_vmcs02(), if the MSR_IA32_TSC is on the MSR bitmap (read?
> write?), instead of doing
>        vmcs_write64(TSC_OFFSET,
>                vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> I just need to do
>        vmcs_write64(TSC_OFFSET,
>                vmx->nested.vmcs01_tsc_offset);
> thereby giving L2 exactly the same TSC that L1 had.
> Brandan, if I remember correctly you once tried this sort of fix and
> it actually worked?
>
> Then, guest_read_tsc() will return (without need to change this code)
> the correct L1 TSC.
>

Note quite.

kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC,
regardless of the setting of any MSR bitmap. The reason why is that it
is being called by the L0 hypervisor kernel, which handles only
interactions with the L1 MSRs.

If L1 wants to configure L2 to have its own private TSC (99.99% of use
cases), it should mask off access to the TSC MSR in the permission
bitmap. Then
when L2 reads or writes the MSR (not the TSC, which is protected by
the RDTSC instruction, and a different trap), an exit will be
generated to L0, which will read the permission bitmaps, determine the
access is failed, and forward the exit to L1, which will deal with
reading or writing the L2 MSR. L0 NEVER deals directly with reading or
writing the L2 MSR in this scenario, how that is handled is supposed
to be determined by L1.

If L1 wants to configure L2 to be able to read and write the L1 TSC
instead, it should enable access to the TSC MSR in the MSR permission
bitmap.  Then when L2 reads or write the MSR, an exit will be
generated to L0, which will read the permission bitmaps, determine the
access is a success, and directly read or write the L1 TSC.  This
requires changing both the offset of the running L2 (in the active L0
VMCS) and the saved offset of the inactive L1.

However, when computing the L2 tsc offset field to use for a running
guest, BOTH the L1 and L2 offsets should be added.  This is exactly
what nested hardware would do.  This implies that to be correct in the
second scenario, you need to forget the change you made to the offset
of the running L2 (in the active L0 VMCS) when the L2 guest stops
running.  You want it to be live on hardware, but not stored in any
saved TSC_OFFSET field for the L2 guest - it seems illogical, but it
is not.  The reason for that is that the L2 guest was updating ONLY
the L1 TSC, not the L2 TSC_OFFSET (which is still controlled by L1).

Note the following invariants: For N>1, L0 will never under any
circumstance set or adjust the L{N} TSC virtual VMCS field.  That is
only for the L{N-1} guest to do, and any attempt to do so by any other
than L{N-1} is not a correct virtualization.  L0 will only set the
hardware VMCS field to the offset required for virtualizing each level
L1-L{N}.

Also, the invariant for nested hardware:  the observed TSC of an L{N}
guest is equal to the hardware TSC, plus the TSC offset field for all
L1-L{N}.  If the hardware does not support nested virtualization, this
invariant must be guaranteed by the L0 hypervisor, which is simulating
nested hardware by adding up all of the nested offset fields.

Neither of these facts involves checking the status of permission bits
in the MSR map.  That is only done for determining success or failure
of an MSR read or write, which may be forwarded to some L{N} as a
virtual exit or #GP fault.

Yes, it is tricky, but once you start thinking of it in simple
hardware terms, you will realize there is no need to be clever, just
precise.  If it seems counterintuitive to think about the second case,
consider the scenario when L2 and L1 are cooperative, and L1 sets the
L2 TSC_OFFSET to zero :)  Then L2 simply manages the TSC for L1.  I
find that makes reasoning about the correct place at which to store
writes and adjustments much more intuitive.

Zach

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-21  2:49       ` Zachary Amsden
@ 2011-07-27 11:51         ` Nadav Har'El
  2011-07-29  9:01           ` Zachary Amsden
  2011-07-28 11:11         ` Nadav Har'El
  1 sibling, 1 reply; 18+ messages in thread
From: Nadav Har'El @ 2011-07-27 11:51 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=iso-8859-8-i, Size: 2155 bytes --]

On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
> > > No, both patches are wrong.
> >
> 
> kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC,
> regardless of the setting of any MSR bitmap. The reason why is that it
> is being called by the L0 hypervisor kernel, which handles only
> interactions with the L1 MSRs.

guest_read_tsc() (called by the above get_msr) currently does this:

	static u64 guest_read_tsc(void)
	{
		u64 host_tsc, tsc_offset;

		rdtscll(host_tsc);
		tsc_offset = vmcs_read64(TSC_OFFSET);
		return host_tsc + tsc_offset;
	}

I guess you'd want this to change to something like:

		tsc_offset = is_guest_mode(vcpu) ? 
			vmx->nested.vmcs01_tsc_offset :
			vmcs_read64(TSC_OFFSET);

But I still am not convinced that that would be right....

E.g., imagine the case where L1 uses TSC_OFFSETING and but doesn't
trap TSC MSR read. The SDM says (if I understand it correctly) that this TSC
MSR read will not exit (because L1 doesn't trap it) but *will* do the extra
offsetting. In this case, the original code (using vmcs02's TSC_OFFSET which
is the sum of that of vmcs01 and vmcs12), is correct, and the new code will
be incorrect. Or am I misunderstanding the SDM?

Can you tell me in which case the original code would return incorrect
results to a guest (L1 or L2) doing anything MSR-related?

I'm assuming that some code in KVM also uses kvm_read_msr and assumes it
gets the TSC value for L1, not for the guest currently running (L2 or L1).
I don't understand why it needs to assume that... Why would it be wrong to
return L2's TSC, and just remember that *changing* the L2 TSC really means
changing the L1 TSC offset (vmcs01_tsc_offset), not vmcs12.tsc_offset which
we can't touch?

Thanks,
Nadav.

-- 
Nadav Har'El                        |   Wednesday, Jul 27 2011, 25 Tammuz 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |You may only be one person to the world,
http://nadav.harel.org.il           |but may also be the world to one person.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-27 11:51         ` Nadav Har'El
@ 2011-07-29  9:01           ` Zachary Amsden
  2011-07-29 10:21             ` Roedel, Joerg
  2011-07-31 13:48             ` Nadav Har'El
  0 siblings, 2 replies; 18+ messages in thread
From: Zachary Amsden @ 2011-07-29  9:01 UTC (permalink / raw)
  To: Nadav Har'El
  Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel

2011/7/27 Nadav Har'El <nyh@math.technion.ac.il>:
> On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>> > > No, both patches are wrong.
>> >
>>
>> kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC,
>> regardless of the setting of any MSR bitmap. The reason why is that it
>> is being called by the L0 hypervisor kernel, which handles only
>> interactions with the L1 MSRs.
>
> guest_read_tsc() (called by the above get_msr) currently does this:
>
>        static u64 guest_read_tsc(void)
>        {
>                u64 host_tsc, tsc_offset;
>
>                rdtscll(host_tsc);
>                tsc_offset = vmcs_read64(TSC_OFFSET);
>                return host_tsc + tsc_offset;
>        }

That's wrong.  You should NEVER believe the offset written into the
hardware VMCS to be the current, real L1 TSC offset, as that is not an
invariant.

Instead, you should ALWAYS return the host TSC + the L1 TSC offset.
Sometimes, this may be the hardware value.

> I guess you'd want this to change to something like:
>
>                tsc_offset = is_guest_mode(vcpu) ?
>                        vmx->nested.vmcs01_tsc_offset :
>                        vmcs_read64(TSC_OFFSET);
>
> But I still am not convinced that that would be right....

I believe this is correct.  But may it be cheaper to read from the
in-memory structure than the actual hardware VMCS?

> E.g., imagine the case where L1 uses TSC_OFFSETING and but doesn't
> trap TSC MSR read. The SDM says (if I understand it correctly) that this TSC
> MSR read will not exit (because L1 doesn't trap it) but *will* do the extra
> offsetting. In this case, the original code (using vmcs02's TSC_OFFSET which
> is the sum of that of vmcs01 and vmcs12), is correct, and the new code will
> be incorrect. Or am I misunderstanding the SDM?

In that case, you need to distinguish between reads of the TSC MSR by
the guest and reads by the host (as done internally to track drift and
compensation).  The code that needs to change isn't guest_read_tsc(),
that code must respect the invariant of only returning the L1 guest
TSC (in fact, that may be a good name change for the function).  What
needs to change is the actual code involved in the MSR read.  If it
determines that something other than the L1 guest is running, it needs
to ignore the hardware TSC offset and return the TSC as if read by the
L1 guest.

Unfortunately, the layering currently doesn't seem to allow for this,
and it looks like both vendor specific variants currently get this
wrong.

The call stack:

kvm_get_msr()
kvm_x86_ops->get_msr()
vendor_get_msr()
vendor_guest_read_tsc()

offers no insight as to the intention of the caller.  Is it trying to
get the guest TSC to return to the guest, or is it trying to get the
guest TSC to calibrate / measure and compensate for TSC effects?

So you are right, this is still wrong for the case in which L1 does
not trap TSC MSR reads.  Note however, the RDTSC instruction is still
virtualized properly, it is only the relatively rare actual TSC MSR
read via RDMSR which is mis-virtualized (this bug exists today in the
SVM implementation if I am reading it correctly - cc'd Joerg to notify
him of that).  That, combined with the relative importance of
supporting a guest which does not trap on these MSR reads suggest this
is a low priority design issue however (RDTSC still works even if the
MSR is trapped, correct?)

If you want to go the extra mile to support such guests, the only
fully correct approach then is to do one of the following:

1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
transform current uses of the code which does TSC compensation to use
this new API.  *Bonus* - no need to do double indirection through the
generic MSR code.

or, alternatively:

2) Do not trap MSR reads of the TSC if the current L1 guest is not
trapping MSR reads of the TSC.  This is not possible if you cannot
enforce specific read vs. write permission in hardware - it may be
possible however, if you can trap all MSR writes regardless of the
permission bitmap.

> Can you tell me in which case the original code would returen incorrect
> results to a guest (L1 or L2) doing anything MSR-related?

It never returns incorrect values to a guest.  It does however return
incorrect values to the L0 hypervisor, which is expecting to do
arithmetic based on the L1 TSC value, and this fails catastrophically
when it receives values for other nested guests.

> I'm assuming that some code in KVM also uses kvm_read_msr and assumes it
> gets the TSC value for L1, not for the guest currently running (L2 or L1).

Exactly.

> I don't understand why it needs to assume that... Why would it be wrong to
> return L2's TSC, and just remember that *changing* the L2 TSC really means
> changing the L1 TSC offset (vmcs01_tsc_offset), not vmcs12.tsc_offset which
> we can't touch?

L0 measures the L1 TSC at various points to be sure the L1 TSC never
regresses by going backwards, and also to restore the value if there
is a non-terminal hardware power interruption (hardware sleep which
resets hardware TSC to zero, but leaves the system image intact - i.e
S1 sleep).  If the measurement returns something other than the L1
TSC, this fails.

So the API there is today is broken, the way the code is today, it
needs to return the L1 TSC only - but the incorrect virtualization is
limited to quite an obscure case which doesn't appear to be a problem
in practice.  And as I pointed out, you can either fix the API, or if
you have cooperative hardware, ignore the issue and simply do not trap
on TSC MSR reads from L2 guests.

Zach

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-29  9:01           ` Zachary Amsden
@ 2011-07-29 10:21             ` Roedel, Joerg
  2011-07-31 13:48             ` Nadav Har'El
  1 sibling, 0 replies; 18+ messages in thread
From: Roedel, Joerg @ 2011-07-29 10:21 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Nadav Har'El, Bandan Das, KVM Mailing List, Marcelo Tosatti

On Fri, Jul 29, 2011 at 05:01:16AM -0400, Zachary Amsden wrote:
> So you are right, this is still wrong for the case in which L1 does
> not trap TSC MSR reads.  Note however, the RDTSC instruction is still
> virtualized properly, it is only the relatively rare actual TSC MSR
> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify
> him of that).  That, combined with the relative importance of
> supporting a guest which does not trap on these MSR reads suggest this
> is a low priority design issue however (RDTSC still works even if the
> MSR is trapped, correct?)

Actually, the documentation is not entirely clear about this. But I tend
to agree that direct _reads_ of MSR 0x10 in guest-mode should return the
tsc with tsc_offset applied.
But on the other side, there is even SVM hardware which does this wrong.
For some K8s there is an erratum that the tsc_offset is not applied when
the MSR is read directly in guest mode.
But yes, to be architecturaly correct the msr read should always return
the tsc of the currently running guest level. In reality this shouldn't
be am issue, though, and rdtsc[p] is still working correctly.

Regards,

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-29  9:01           ` Zachary Amsden
  2011-07-29 10:21             ` Roedel, Joerg
@ 2011-07-31 13:48             ` Nadav Har'El
  2011-07-31 18:55               ` Zachary Amsden
  1 sibling, 1 reply; 18+ messages in thread
From: Nadav Har'El @ 2011-07-31 13:48 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel

On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>...
> In that case, you need to distinguish between reads of the TSC MSR by
> the guest and reads by the host (as done internally to track drift and
>...
> Unfortunately, the layering currently doesn't seem to allow for this,
> and it looks like both vendor specific variants currently get this
> wrong.
>...
> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
> transform current uses of the code which does TSC compensation to use
> this new API.  *Bonus* - no need to do double indirection through the
> generic MSR code.

Thank you so much for this analysis!

I propose the following two patches. The first one just fixes two small
bugs in the correct emulation in the VMX case, and the second patch solves
the originally-reported bug as you outlined above:

The code in x86.c which used to call *_get_msr() to get the TSC no longer
does it - because *_get_msr() should only be called with the intention of
reading the msr of the current guest (L1 or L2) and returning it to that
guest.

Instead, I added a new x86_op read_l1_tsc(vcpu), whose intention is to
always return L1's notion of the current TSC - even if the current guest
is L2. All calls in x86.c now use this new read_l1_tsc() function instead
of reading the MSR - does this look correct to you?

> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify

I believe that you're right. I created (in the patch below) svm.c's
read_l1_tsc() with the same code you had in svm_get_msr(), but I think
that in svm_get_msr() the code should be different in the nested case -
I think you should use svm->vmcb, not get_host_vmcb()? I didn't add this svm.c
fix to the patch, because I'm not sure at all that my understanding here is
correct.

Zachary, Marcelo, do these patches look right to you?
Bandan, and others who case reproduce this bug - do these patches solve the
bug?

> So you are right, this is still wrong for the case in which L1 does
> not trap TSC MSR reads.  Note however, the RDTSC instruction is still
> virtualized properly, it is only the relatively rare actual TSC MSR
> read via RDMSR which is mis-virtualized (this bug exists today in the
> SVM implementation if I am reading it correctly - cc'd Joerg to notify
> him of that).  That, combined with the relative importance of
> supporting a guest which does not trap on these MSR reads suggest this
> is a low priority design issue however (RDTSC still works even if the
> MSR is trapped, correct?)

Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these
are indeed obscure corner cases, but I think that it's still really confusing
that a function called kvm_get_msr deliberately works incorrectly (returning
always the L1 TSC) just so that it can fit some other uses. The SVM code worked
in the common case, but was still confusing why svm_get_msr() deliberately
goes to great lengths to use L1's TSC_OFFSET even when L2 is running - when
this is not how this MSR is really supposed to behave. Not to mention that one
day, somebody *will* want to use these obscure settings and be surprise that
they don't work ;-)

Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation

This patch fixes two corner cases in nested (L2) handling of TSC-related
exits:

1. Somewhat suprisingly, according to the Intel spec, if L1 allows WRMSR to
the TSC MSR without an exit, then this should set L1's TSC value itself - not
offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).

2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
the vmcs12.TSC_OFFSET.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
@@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
  */
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
-	vmcs_write64(TSC_OFFSET, offset);
-	if (is_guest_mode(vcpu))
+	if (is_guest_mode(vcpu)) {
 		/*
-		 * We're here if L1 chose not to trap the TSC MSR. Since
-		 * prepare_vmcs12() does not copy tsc_offset, we need to also
-		 * set the vmcs12 field here.
+		 * We're here if L1 chose not to trap WRMSR to TSC. According
+		 * to the spec, this should set L1's TSC; The offset that L1
+		 * set for L2 remains unchanged, and still needs to be added
+		 * to the newly set TSC to get L2's TSC.
 		 */
-		get_vmcs12(vcpu)->tsc_offset = offset -
-			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
+		struct vmcs12 *vmcs12;
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
+		/* recalculate vmcs02.TSC_OFFSET: */
+		vmcs12 = get_vmcs12(vcpu);
+		vmcs_write64(TSC_OFFSET, offset +
+			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+			 vmcs12->tsc_offset : 0));
+	} else {
+		vmcs_write64(TSC_OFFSET, offset);
+	}
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
 
 	set_cr4_guest_host_mask(vmx);
 
-	vmcs_write64(TSC_OFFSET,
-		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
+		vmcs_write64(TSC_OFFSET,
+			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	else
+		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	if (enable_vpid) {
 		/*
@@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
 
 	load_vmcs12_host_state(vcpu, vmcs12);
 
-	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	/* Update TSC_OFFSET if TSC was changed while L2 ran */
 	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	/* This is needed for same reason as it was needed in prepare_vmcs02 */

Subject: [PATCH 2/2] nVMX: L1 TSC handling

KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/kvm/svm.c              |    9 +++++++++
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
 arch/x86/kvm/x86.c              |    8 ++++----
 4 files changed, 32 insertions(+), 4 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
@@ -636,6 +636,8 @@ struct kvm_x86_ops {
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage);
 
+	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+
 	const struct trace_print_flags *exit_reasons_str;
 };
 
--- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
@@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
 }
 
 /*
+ * Like guest_read_tsc, but always returns L1's notion of the timestamp
+ * counter, even if a nested guest (L2) is currently running.
+ */
+u64 vmx_read_l1_tsc(kvm_vcpu *vcpu)
+{
+	u64 host_tsc, tsc_offset;
+
+	rdtscll(host_tsc);
+	tsc_offset = is_guest_mode(vcpu) ?
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
+		vmcs_read64(TSC_OFFSET);
+	return host_tsc + tsc_offset;
+}
+
+/*
  * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
  * ioctl. In this case the call-back should update internal vmx state to make
  * the changes effective.
@@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops = 
 	.set_tdp_cr3 = vmx_set_cr3,
 
 	.check_intercept = vmx_check_intercept,
+
+	.read_l1_tsc = vmx_read_l1_tsc(kvm_vcpu *vcpu),
 };
 
 static int __init vmx_init(void)
--- .before/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
@@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
 	return 0;
 }
 
+u64 svm_read_l1_tsc(kvm_vcpu *vcpu)
+{
+	return vmcb->control.tsc_offset +
+			svm_scale_tsc(vcpu, native_read_tsc());
+}
+
+
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops = 
 	.set_tdp_cr3 = set_tdp_cr3,
 
 	.check_intercept = svm_check_intercept,
+
+	.read_l1_tsc = vmx_read_l1_tsc(kvm_vcpu *vcpu),
 };
 
 static int __init svm_init(void)
--- .before/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
@@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct 
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(vcpu);
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = vcpu_tsc_khz(v);
 	if (unlikely(this_tsc_khz == 0)) {
@@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 		s64 tsc_delta;
 		u64 tsc;
 
-		kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
+		tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
 			     tsc - vcpu->arch.last_guest_tsc;
 
@@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
 {
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 }
 
 static int is_efer_nx(void)
@@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();

-- 
Nadav Har'El                        |      Sunday, Jul 31 2011, 29 Tammuz 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Support bacteria - they're the only
http://nadav.harel.org.il           |culture some people have!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-31 13:48             ` Nadav Har'El
@ 2011-07-31 18:55               ` Zachary Amsden
  2011-07-31 20:34                 ` Nadav Har'El
  0 siblings, 1 reply; 18+ messages in thread
From: Zachary Amsden @ 2011-07-31 18:55 UTC (permalink / raw)
  To: Nadav Har'El
  Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel

This patch looks good, with one comment noted inline below.  Are there
no other call sites for kvm_get_msr() or which alias some other
function to kvm_get_msr(MSR_IA32_TSC) ?

Did I miss the first patch?

Zach

On Sun, Jul 31, 2011 at 6:48 AM, Nadav Har'El <nyh@math.technion.ac.il> wrote:
> On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>>...
>> In that case, you need to distinguish between reads of the TSC MSR by
>> the guest and reads by the host (as done internally to track drift and
>>...
>> Unfortunately, the layering currently doesn't seem to allow for this,
>> and it looks like both vendor specific variants currently get this
>> wrong.
>>...
>> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
>> transform current uses of the code which does TSC compensation to use
>> this new API.  *Bonus* - no need to do double indirection through the
>> generic MSR code.
>
> Thank you so much for this analysis!
>
> I propose the following two patches. The first one just fixes two small
> bugs in the correct emulation in the VMX case, and the second patch solves
> the originally-reported bug as you outlined above:
>
> The code in x86.c which used to call *_get_msr() to get the TSC no longer
> does it - because *_get_msr() should only be called with the intention of
> reading the msr of the current guest (L1 or L2) and returning it to that
> guest.
>
> Instead, I added a new x86_op read_l1_tsc(vcpu), whose intention is to
> always return L1's notion of the current TSC - even if the current guest
> is L2. All calls in x86.c now use this new read_l1_tsc() function instead
> of reading the MSR - does this look correct to you?
>
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>
> I believe that you're right. I created (in the patch below) svm.c's
> read_l1_tsc() with the same code you had in svm_get_msr(), but I think
> that in svm_get_msr() the code should be different in the nested case -
> I think you should use svm->vmcb, not get_host_vmcb()? I didn't add this svm.c
> fix to the patch, because I'm not sure at all that my understanding here is
> correct.
>
> Zachary, Marcelo, do these patches look right to you?
> Bandan, and others who case reproduce this bug - do these patches solve the
> bug?
>
>> So you are right, this is still wrong for the case in which L1 does
>> not trap TSC MSR reads.  Note however, the RDTSC instruction is still
>> virtualized properly, it is only the relatively rare actual TSC MSR
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>> him of that).  That, combined with the relative importance of
>> supporting a guest which does not trap on these MSR reads suggest this
>> is a low priority design issue however (RDTSC still works even if the
>> MSR is trapped, correct?)
>
> Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these
> are indeed obscure corner cases, but I think that it's still really confusing
> that a function called kvm_get_msr deliberately works incorrectly (returning
> always the L1 TSC) just so that it can fit some other uses. The SVM code worked
> in the common case, but was still confusing why svm_get_msr() deliberately
> goes to great lengths to use L1's TSC_OFFSET even when L2 is running - when
> this is not how this MSR is really supposed to behave. Not to mention that one
> day, somebody *will* want to use these obscure settings and be surprise that
> they don't work ;-)
>
> Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation
>
> This patch fixes two corner cases in nested (L2) handling of TSC-related
> exits:
>
> 1. Somewhat suprisingly, according to the Intel spec, if L1 allows WRMSR to
> the TSC MSR without an exit, then this should set L1's TSC value itself - not
> offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).
>
> 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
> the vmcs12.TSC_OFFSET.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>  */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> -       vmcs_write64(TSC_OFFSET, offset);
> -       if (is_guest_mode(vcpu))
> +       if (is_guest_mode(vcpu)) {
>                /*
> -                * We're here if L1 chose not to trap the TSC MSR. Since
> -                * prepare_vmcs12() does not copy tsc_offset, we need to also
> -                * set the vmcs12 field here.
> +                * We're here if L1 chose not to trap WRMSR to TSC. According
> +                * to the spec, this should set L1's TSC; The offset that L1
> +                * set for L2 remains unchanged, and still needs to be added
> +                * to the newly set TSC to get L2's TSC.
>                 */
> -               get_vmcs12(vcpu)->tsc_offset = offset -
> -                       to_vmx(vcpu)->nested.vmcs01_tsc_offset;
> +               struct vmcs12 *vmcs12;
> +               to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
> +               /* recalculate vmcs02.TSC_OFFSET: */
> +               vmcs12 = get_vmcs12(vcpu);
> +               vmcs_write64(TSC_OFFSET, offset +
> +                       (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +                        vmcs12->tsc_offset : 0));
> +       } else {
> +               vmcs_write64(TSC_OFFSET, offset);
> +       }
>  }
>
>  static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
>
>        set_cr4_guest_host_mask(vmx);
>
> -       vmcs_write64(TSC_OFFSET,
> -               vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +       if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> +               vmcs_write64(TSC_OFFSET,
> +                       vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +       else
> +               vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>        if (enable_vpid) {
>                /*
> @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
>
>        load_vmcs12_host_state(vcpu, vmcs12);
>
> -       /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
> +       /* Update TSC_OFFSET if TSC was changed while L2 ran */
>        vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>        /* This is needed for same reason as it was needed in prepare_vmcs02 */
>
> Subject: [PATCH 2/2] nVMX: L1 TSC handling
>
> KVM assumed in several places that reading the TSC MSR returns the value for
> L1. This is incorrect, because when L2 is running, the correct TSC read exit
> emulation is to return L2's value.
>
> We therefore add a new x86_ops function, read_l1_tsc, to use in places that
> specifically need to read the L1 TSC, NOT the TSC of the current level of
> guest.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 ++
>  arch/x86/kvm/svm.c              |    9 +++++++++
>  arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
>  arch/x86/kvm/x86.c              |    8 ++++----
>  4 files changed, 32 insertions(+), 4 deletions(-)
>
> --- .before/arch/x86/include/asm/kvm_host.h     2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/include/asm/kvm_host.h      2011-07-31 16:17:30.000000000 +0300
> @@ -636,6 +636,8 @@ struct kvm_x86_ops {
>                               struct x86_instruction_info *info,
>                               enum x86_intercept_stage stage);
>
> +       u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
> +
>        const struct trace_print_flags *exit_reasons_str;
>  };
>
> --- .before/arch/x86/kvm/vmx.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
>  }
>
>  /*
> + * Like guest_read_tsc, but always returns L1's notion of the timestamp
> + * counter, even if a nested guest (L2) is currently running.
> + */
> +u64 vmx_read_l1_tsc(kvm_vcpu *vcpu)
> +{
> +       u64 host_tsc, tsc_offset;
> +
> +       rdtscll(host_tsc);
> +       tsc_offset = is_guest_mode(vcpu) ?
> +               to_vmx(vcpu)->nested.vmcs01_tsc_offset :
> +               vmcs_read64(TSC_OFFSET);
> +       return host_tsc + tsc_offset;
> +}
> +
> +/*
>  * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
>  * ioctl. In this case the call-back should update internal vmx state to make
>  * the changes effective.
> @@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops =
>        .set_tdp_cr3 = vmx_set_cr3,
>
>        .check_intercept = vmx_check_intercept,
> +
> +       .read_l1_tsc = vmx_read_l1_tsc(kvm_vcpu *vcpu),
>  };
>
>  static int __init vmx_init(void)
> --- .before/arch/x86/kvm/svm.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c   2011-07-31 16:17:30.000000000 +0300
> @@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
>        return 0;
>  }
>
> +u64 svm_read_l1_tsc(kvm_vcpu *vcpu)
> +{
e> +       return vmcb->control.tsc_offset +
> +                       svm_scale_tsc(vcpu, native_read_tsc());
> +}
> +
> +
>  static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
>  {
>        struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops =
>        .set_tdp_cr3 = set_tdp_cr3,
>
>        .check_intercept = svm_check_intercept,
> +
> +       .read_l1_tsc = vmx_read_l1_tsc(kvm_vcpu *vcpu),
>  };
>
>  static int __init svm_init(void)
> --- .before/arch/x86/kvm/x86.c  2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c   2011-07-31 16:17:30.000000000 +0300
> @@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct
>
>        /* Keep irq disabled to prevent changes to the clock */
>        local_irq_save(flags);
> -       kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> +       tsc_timestamp = kvm_x86_ops->read_l1_tsc(vcpu);
>        kernel_ns = get_kernel_ns();
>        this_tsc_khz = vcpu_tsc_khz(v);
>        if (unlikely(this_tsc_khz == 0)) {
> @@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>                s64 tsc_delta;
>                u64 tsc;
>
> -               kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
> +               tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>                tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>                             tsc - vcpu->arch.last_guest_tsc;

Technically, this call site should no longer exist; this should be
re-factored in terms of hardware TSC, not guest TSC.  I sent a patch
series which did that, but it has not yet been applied.

> @@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
>  {
>        kvm_x86_ops->vcpu_put(vcpu);
>        kvm_put_guest_fpu(vcpu);
> -       kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +       vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>  }
>
>  static int is_efer_nx(void)
> @@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
>        if (hw_breakpoint_active())
>                hw_breakpoint_restore();
>
> -       kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +       vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>
>        vcpu->mode = OUTSIDE_GUEST_MODE;
>        smp_wmb();
>
> --
> Nadav Har'El                        |      Sunday, Jul 31 2011, 29 Tammuz 5771
> nyh@math.technion.ac.il             |-----------------------------------------
> Phone +972-523-790466, ICQ 13349191 |Support bacteria - they're the only
> http://nadav.harel.org.il           |culture some people have!
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-31 18:55               ` Zachary Amsden
@ 2011-07-31 20:34                 ` Nadav Har'El
  0 siblings, 0 replies; 18+ messages in thread
From: Nadav Har'El @ 2011-07-31 20:34 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel

On Sun, Jul 31, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
> This patch looks good, with one comment noted inline below.  Are there
> no other call sites for kvm_get_msr() or which alias some other
> function to kvm_get_msr(MSR_IA32_TSC) ?

This is all I found, but I'll try to look again.

> Did I miss the first patch?

Sorry, I sent both patches in one mail, one after another. I think you saw
them both, because your comment was on the second.
Next time I'll send them in separate mails.

Thanks,
Nadav.

-- 
Nadav Har'El                        |           Sunday, Jul 31 2011, 1 Av 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Willpower: The ability to eat only one
http://nadav.harel.org.il           |salted peanut.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-21  2:49       ` Zachary Amsden
  2011-07-27 11:51         ` Nadav Har'El
@ 2011-07-28 11:11         ` Nadav Har'El
  2011-07-29  2:06           ` Matt McGill
  1 sibling, 1 reply; 18+ messages in thread
From: Nadav Har'El @ 2011-07-28 11:11 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=iso-8859-8-i, Size: 7021 bytes --]

On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
> > > No, both patches are wrong.
> >
> > > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
> > times.
> > >  We are serving the L1 guest in this hypervisor, not the L2 guest, and so
> >
> > > should never read its TSC for any reason.

I've been thinking about how to correctly solve this problem, and about this
statement that you made, that get_read_tsc() should *always* return L1's TSC,
even while running L2.

We have two quite-different uses for the TSC-related functions:

 1. They are used in vmx.c to emulate *architectural* instructions, namely
    RDTSC, RDMSR of TSC, and WRMSR of TSC.

 2. They are used in x86.c for internal KVM uses, adjusting the TSC for
    various reasons (most of which, I have to admit, I don't understand).

You said above that kvm_get_msr(), calling guest_read_tsc(), should always
return L1's TSC, even when running L2. However, I believe that that for
use #1 - the *architecturally correct* emulation of RDTSC and RDMSR when
L2 is running, the current nested-agnostic code - which ends up returning L2's
TSC when L2 is running - is the correct thing to do.

Look at what happens when:

 When we get an exit to L0 when L2 run RDMSR TSC:
   This may exit to L1, if L1 asked in the MSR Bitmap. Done correctly.
   If hasn't exited, the spec says we need to return L1's TSC plus the
   TSC_OFFSET that L1 set for L2 (if L1 asked for TSC_OFFSETTING).
   In other words, we should return L0's TSC offsetted by the sum of vmcs01
   and vmcs12 tsc_offsets. This sum is exactly what we have in the current
   (vmcs02) TSC_OFFSET. So the existing guest_read_tsc, which returns
   rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing - also for
   nested.

 When we get an exit to L0 when L2 runs RDTSC:
   This may exit to L1 if L1 asked for TSC_EXITING. Done correctly.
   There is no "else" here - we would get an EXIT_REASON_RDTSC only if L1
   asked for it (because KVM itself doesn't use TSC_EXITING).

Regarding emulation of WRMSR TSC, I actually had an error in the code which
I correct in a patch below: According to the Intel spec, it turns out that
if WRMSR of the TSC does not cause an exit, then it should change the host
TSC itself, ignoring TSC_OFFSET. This is very strange, because it means that
if the host doesn't trap MSR read and write of the TSC, then if a guest
writes TSC and then immediately reads it, it won't read what is just wrote,
but rather what it wrote pluss the TSC_OFFSET (because the TSC_OFFSET is
added on read, but not subtracted on write). I don't know why the Intel spec
specifies this, but it does - and it's not what I emulated to I fixed that
now (in the patch below). But I doubt this is the problem that Bandan and
others saw?

So now, if I'm right, after this patch, the *architectural* uses of
guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSR and RD/WRMSR,
are done correctly.

But the problem is that these functions are also used in KVM for other things.
Ideally, whomever calls kvm_read_msr()/kvm_write_msr() should realize that it
gets the value for the current running guest, which might be L1 or L2, just
like if that guest would read or write the MSR. Ideally, KVM might read the
MSR, change it and write it back, and everything would be correct as both
read and write code are done correctly. The question is, is some of the code -
perhaps in kvm_guest_time_update(), kvm_arch_vcpu_load(), kvm_arch_vcpu_put()
or vcpu_enter_guest() (these are functions which touch the TSC), which only
work correctly when reading/writing L1 TSC and can't work correctly if L2
TSC is read and written? I'm afraid I don't know enough about what these
functions are doing, to understand what, if anything, they are doing wrong.

Bandan and others who can reproduce this bug - I attach below a patch which
should correct some architectural emulation issues I discovered while reading
the code again now - especially of TSC MSR write. Can you please try if
this solves anything? I'm not too optimistic, but since the code was wrong
in this regard, it's at least worth a try.

Thanks,
Nadav.

Subject: [PATCH] nVMX: Fix nested TSC handling

This patch fixes several areas where nested VMX did not correctly emulated
TSC handling.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
@@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
  */
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
-	vmcs_write64(TSC_OFFSET, offset);
-	if (is_guest_mode(vcpu))
+	if (is_guest_mode(vcpu)) {
 		/*
-		 * We're here if L1 chose not to trap the TSC MSR. Since
-		 * prepare_vmcs12() does not copy tsc_offset, we need to also
-		 * set the vmcs12 field here.
+		 * We're here if L1 chose not to trap WRMSR to TSC. According
+		 * to the spec, this should set L1's TSC; The offset that L1
+		 * set for L2 remains unchanged, and still needs to be added
+		 * to the newly set TSC to get L2's TSC.
 		 */
-		get_vmcs12(vcpu)->tsc_offset = offset -
-			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
+		struct vmcs12 *vmcs12;
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
+		/* recalculate vmcs02.TSC_OFFSET: */
+		vmcs12 = get_vmcs12(vcpu);
+		vmcs_write64(TSC_OFFSET, offset +
+			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+			 vmcs12->tsc_offset : 0));
+	} else {
+		vmcs_write64(TSC_OFFSET, offset);
+	}
 }

 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc

 	set_cr4_guest_host_mask(vmx);

-	vmcs_write64(TSC_OFFSET,
-		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
+		vmcs_write64(TSC_OFFSET,
+			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	else
+		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);

 	if (enable_vpid) {
 		/*
@@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm

 	load_vmcs12_host_state(vcpu, vmcs12);

-	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	/* Update TSC_OFFSET if TSC was changed while L2 ran */
 	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);

 	/* This is needed for same reason as it was needed in prepare_vmcs02 */
-- 
Nadav Har'El                        |    Thursday, Jul 28 2011, 26 Tammuz 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Just remember that if the world didn't
http://nadav.harel.org.il           |suck, we would all fall off.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-28 11:11         ` Nadav Har'El
@ 2011-07-29  2:06           ` Matt McGill
  0 siblings, 0 replies; 18+ messages in thread
From: Matt McGill @ 2011-07-29  2:06 UTC (permalink / raw)
  To: Nadav Har'El
  Cc: Zachary Amsden, Bandan Das, KVM Mailing List, Marcelo Tosatti

Hello all,

I've been poking at this bug for awhile and following the discussion, so I thought I'd bring all the information together in one place.

First, I've been able to reliably reproduce this bug. Here is (what I believe to be) the relevant information:
* Host setup (8 CPUs):
Ubuntu Maverick with the kvm HEAD kernel, on an Intel i7.

* L1 guest (1 virtual CPU):
Ubuntu Maverick, standard distro kernel, started like so:
qemu-system-x86_64 -enable-kvm -cpu qemu64,+vmx -hda mav-mini.img \
  -nographic -vnc :1 -net nic -net tap,script="${tapup}"  -serial file:/tmp/mav-mini-console

* L2 guest (1 virtual CPU)
The 'linux-0.2.img' image from the QEMU web site [http://wiki.qemu.org/download/linux-0.2.img.bz2], started like so:
qemu-system-x86_64 -enable-kvm -hda linux-0.2.img -nographic -vnc 172.17.1.5:1 -net none

Steps to reproduce:
1. Start L1 guest.
2. In L1 guest, ensure that clocksource is 'kvm-clock'.
	echo "kvm-clock" | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
3. In L1 guest, start L2 guest.
4. In L2 guest, run the nbench benchmark.
5. On the host, use taskset to forcibly kick the qemu-system-x86_64 process to different host CPUs until L1 hangs.

Next, two work-arounds have been mentioned. For me, either one by itself is sufficient to prevent hangs.

Workaround #1: Use the 'jiffies' clocksource in L1. I suspect any other clocksource should also work, but haven't tested them.
Workaround #2: Using taskset, pin the qemu-system-x86_64 process managing L1 to a single host CPU.

The previous messages to the list have also already pointed at the problem, but not in full detail. For clarity, here's the sequence that appears to trigger the hang:
1. The qemu-system-x86_64 process migrates to a new host CPU.

2. On the next call to kvm_arch_vcpu_load() in kvm/x86.c, we enter this branch:
     if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
        ....
        kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
        ....
    }

3. The KVM_REQ_CLOCK_UPDATE request gets handled in vcpu_enter_guest() [kvm/x86.c].
        if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
            r = kvm_guest_time_update(vcpu);
            ....
        }    

4. In kvm_guest_time_update() [kvm/x86.c], we call kvm_get_msr to read MSR_IA32_TSC assuming we'll get the TSC value of L1:

  static int kvm_guest_time_update(struct kvm_vcpu *v)
  {
      ...
      kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
      ...

This is the heart of the problem. If we're entering L2 rather than L1, tsc_timestamp gets L2's TLC.

We then do some computation based on wall clock/system time, and we write the (possibly adjusted) TSC value directly into a data structure used by the L1 kvm-clock (see also arch/x86/kernel/kvmclock.c and include/asm/kvm_host.h):
    /* With all the info we got, fill in the values */
    vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
    vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
    vcpu->last_kernel_ns = kernel_ns;
    vcpu->last_guest_tsc = tsc_timestamp;
    vcpu->hv_clock.flags = 0;
Note that vcpu->hv_clock is actually mapped to a guest page (vcpu->time_page), and shared directly with the L1 kvm-clock paravirtualized clocksource.

5. If we read L2's TSC in the previous step, the L1 kvm-clock now has an arbitrary value. Presumably the scheduler becomes horribly confused, and nothing gets scheduled. This is illustrated by the output I get from the L1 soft lockup detector (stuck for 169 years, eh?):
	[5750304228.280373] BUG: soft lockup - CPU#0 stuck for 5355388067s! [qemu-system-x86:731]
As further evidence, if I patch guest_read_tsc() so that it always returns the L1 TSC, I can no longer reproduce the hang.

As for the correct way to fix it, I don't yet understand the VMX spec or the possible cases well enough to know what the correct behavior should be.

-Matt McGill

On Jul 28, 2011, at 7:11 AM, Nadav Har'El wrote:

> On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>>>> No, both patches are wrong.
>>> 
>>>> The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
>>> times.
>>>>  We are serving the L1 guest in this hypervisor, not the L2 guest, and so
>>> 
>>>> should never read its TSC for any reason.
> 
> I've been thinking about how to correctly solve this problem, and about this
> statement that you made, that get_read_tsc() should *always* return L1's TSC,
> even while running L2.
> 
> We have two quite-different uses for the TSC-related functions:
> 
> 1. They are used in vmx.c to emulate *architectural* instructions, namely
>    RDTSC, RDMSR of TSC, and WRMSR of TSC.
> 
> 2. They are used in x86.c for internal KVM uses, adjusting the TSC for
>    various reasons (most of which, I have to admit, I don't understand).
> 
> You said above that kvm_get_msr(), calling guest_read_tsc(), should always
> return L1's TSC, even when running L2. However, I believe that that for
> use #1 - the *architecturally correct* emulation of RDTSC and RDMSR when
> L2 is running, the current nested-agnostic code - which ends up returning L2's
> TSC when L2 is running - is the correct thing to do.
> 
> Look at what happens when:
> 
> When we get an exit to L0 when L2 run RDMSR TSC:
>   This may exit to L1, if L1 asked in the MSR Bitmap. Done correctly.
>   If hasn't exited, the spec says we need to return L1's TSC plus the
>   TSC_OFFSET that L1 set for L2 (if L1 asked for TSC_OFFSETTING).
>   In other words, we should return L0's TSC offsetted by the sum of vmcs01
>   and vmcs12 tsc_offsets. This sum is exactly what we have in the current
>   (vmcs02) TSC_OFFSET. So the existing guest_read_tsc, which returns
>   rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing - also for
>   nested.
> 
> When we get an exit to L0 when L2 runs RDTSC:
>   This may exit to L1 if L1 asked for TSC_EXITING. Done correctly.
>   There is no "else" here - we would get an EXIT_REASON_RDTSC only if L1
>   asked for it (because KVM itself doesn't use TSC_EXITING).
> 
> Regarding emulation of WRMSR TSC, I actually had an error in the code which
> I correct in a patch below: According to the Intel spec, it turns out that
> if WRMSR of the TSC does not cause an exit, then it should change the host
> TSC itself, ignoring TSC_OFFSET. This is very strange, because it means that
> if the host doesn't trap MSR read and write of the TSC, then if a guest
> writes TSC and then immediately reads it, it won't read what is just wrote,
> but rather what it wrote pluss the TSC_OFFSET (because the TSC_OFFSET is
> added on read, but not subtracted on write). I don't know why the Intel spec
> specifies this, but it does - and it's not what I emulated to I fixed that
> now (in the patch below). But I doubt this is the problem that Bandan and
> others saw?
> 
> So now, if I'm right, after this patch, the *architectural* uses of
> guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSR and RD/WRMSR,
> are done correctly.
> 
> But the problem is that these functions are also used in KVM for other things.
> Ideally, whomever calls kvm_read_msr()/kvm_write_msr() should realize that it
> gets the value for the current running guest, which might be L1 or L2, just
> like if that guest would read or write the MSR. Ideally, KVM might read the
> MSR, change it and write it back, and everything would be correct as both
> read and write code are done correctly. The question is, is some of the code -
> perhaps in kvm_guest_time_update(), kvm_arch_vcpu_load(), kvm_arch_vcpu_put()
> or vcpu_enter_guest() (these are functions which touch the TSC), which only
> work correctly when reading/writing L1 TSC and can't work correctly if L2
> TSC is read and written? I'm afraid I don't know enough about what these
> functions are doing, to understand what, if anything, they are doing wrong.
> 
> Bandan and others who can reproduce this bug - I attach below a patch which
> should correct some architectural emulation issues I discovered while reading
> the code again now - especially of TSC MSR write. Can you please try if
> this solves anything? I'm not too optimistic, but since the code was wrong
> in this regard, it's at least worth a try.
> 
> Thanks,
> Nadav.
> 
> Subject: [PATCH] nVMX: Fix nested TSC handling
> 
> This patch fixes several areas where nested VMX did not correctly emulated
> TSC handling.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
> 1 file changed, 21 insertions(+), 10 deletions(-)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-07-28 14:02:52.000000000 +0300
> @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>  */
> static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> {
> -	vmcs_write64(TSC_OFFSET, offset);
> -	if (is_guest_mode(vcpu))
> +	if (is_guest_mode(vcpu)) {
> 		/*
> -		 * We're here if L1 chose not to trap the TSC MSR. Since
> -		 * prepare_vmcs12() does not copy tsc_offset, we need to also
> -		 * set the vmcs12 field here.
> +		 * We're here if L1 chose not to trap WRMSR to TSC. According
> +		 * to the spec, this should set L1's TSC; The offset that L1
> +		 * set for L2 remains unchanged, and still needs to be added
> +		 * to the newly set TSC to get L2's TSC.
> 		 */
> -		get_vmcs12(vcpu)->tsc_offset = offset -
> -			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
> +		struct vmcs12 *vmcs12;
> +		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
> +		/* recalculate vmcs02.TSC_OFFSET: */
> +		vmcs12 = get_vmcs12(vcpu);
> +		vmcs_write64(TSC_OFFSET, offset +
> +			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +			 vmcs12->tsc_offset : 0));
> +	} else {
> +		vmcs_write64(TSC_OFFSET, offset);
> +	}
> }
> 
> static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc
> 
> 	set_cr4_guest_host_mask(vmx);
> 
> -	vmcs_write64(TSC_OFFSET,
> -		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> +		vmcs_write64(TSC_OFFSET,
> +			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +	else
> +		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
> 
> 	if (enable_vpid) {
> 		/*
> @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm
> 
> 	load_vmcs12_host_state(vcpu, vmcs12);
> 
> -	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
> +	/* Update TSC_OFFSET if TSC was changed while L2 ran */
> 	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
> 
> 	/* This is needed for same reason as it was needed in prepare_vmcs02 */
> -- 
> Nadav Har'El                        |    Thursday, Jul 28 2011, 26 Tammuz 5771
> nyh@math.technion.ac.il             |-----------------------------------------
> Phone +972-523-790466, ICQ 13349191 |Just remember that if the world didn't
> http://nadav.harel.org.il           |suck, we would all fall off.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2011-07-31 20:34 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-07-08 18:40 Nested VMX - L1 hangs on running L2 Bandan Das
2011-07-18 18:26 ` Marcelo Tosatti
2011-07-19  2:41   ` Bandan Das
2011-07-20  7:58   ` Jan Kiszka
2011-07-20 16:12     ` Marcelo Tosatti
2011-07-20 16:19       ` Jan Kiszka
2011-07-20 16:35         ` Marcelo Tosatti
     [not found]   ` <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com>
2011-07-20 19:52     ` Nadav Har'El
2011-07-20 20:42       ` Bandan Das
2011-07-21  2:49       ` Zachary Amsden
2011-07-27 11:51         ` Nadav Har'El
2011-07-29  9:01           ` Zachary Amsden
2011-07-29 10:21             ` Roedel, Joerg
2011-07-31 13:48             ` Nadav Har'El
2011-07-31 18:55               ` Zachary Amsden
2011-07-31 20:34                 ` Nadav Har'El
2011-07-28 11:11         ` Nadav Har'El
2011-07-29  2:06           ` Matt McGill

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox