* Nested VMX - L1 hangs on running L2
@ 2011-07-08 18:40 Bandan Das
2011-07-18 18:26 ` Marcelo Tosatti
0 siblings, 1 reply; 18+ messages in thread
From: Bandan Das @ 2011-07-08 18:40 UTC (permalink / raw)
To: KVM Mailing List; +Cc: Nadav Har'El
I have already discussed this a bit with Nadav, but I am hoping someone
else has other ideas/clues/suggestions/comments. With recent versions of the
kernel (the last I tried is 3.0-rc5 with the nVMX patches already merged), my L1 guest
always hangs when I start L2.
My setup: the host, L1, and L2 are all FC15, with the host running 3.0-rc5. When L1 is up
and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. However,
if I run tracing on the host, I see:
...
qemu-kvm-19756 [013] 153774.856178: kvm_exit: reason APIC_ACCESS rip 0xffffffff81025098 info 1380 0
qemu-kvm-19756 [013] 153774.856189: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
qemu-kvm-19756 [013] 153774.856191: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0
...
My point is that I only see kvm_exit events but no kvm_entry. Does this mean that the VCPUs
are somehow stuck in L2?
Anyway, since this setup was running fine for me on older kernels, and I couldn't
identify any significant changes in nVMX, I sifted through the other KVM changes and found this:
--
commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d
Author: Nikola Ciprich <extmaillist@linuxbox.cz>
Date: Wed Mar 9 23:36:51 2011 +0100
KVM: fix kvmclock regression due to missing clock update
commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
to proper place.
Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
index 01f08a6..f1e4025 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		if (check_tsc_unstable()) {
 			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
 			vcpu->arch.tsc_catchup = 1;
-			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		}
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;
--
If I revert this change, my L1/L2 guests run fine. This, of course, just hides the bug,
because on my machine check_tsc_unstable() returns false.
I found out from Nadav that when KVM decides to run L2, it writes
vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET, which seems right.
But I verified that if, instead, I write just
vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I no longer see the bug.
Not sure where to go from here. I would appreciate it if anyone has any ideas.
Bandan
^ permalink raw reply related [flat|nested] 18+ messages in thread

* Re: Nested VMX - L1 hangs on running L2
  2011-07-18 18:26 ` Marcelo Tosatti
From: Marcelo Tosatti @ 2011-07-18 18:26 UTC (permalink / raw)
To: Bandan Das; +Cc: KVM Mailing List, Nadav Har'El, Zachary Amsden

On Fri, Jul 08, 2011 at 02:40:53PM -0400, Bandan Das wrote:
> I have already discussed this a bit with Nadav but hoping someone
> else has any other ideas/clues/suggestions/comments. [...]
>
> I found out from Nadav that when KVM decides to run L2, it will write
> vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET which seems right.
> But I verified that, if instead, I just write
> vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I don't see the bug anymore.
> [...]

Using the guest's TSC value when performing TSC adjustments is wrong. Can
you please try the following patch, which skips TSC adjustments if the
vcpu is in guest mode.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2b76ae3..44c90d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1096,6 +1096,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	s64 kernel_ns, max_kernel_ns;
 	u64 tsc_timestamp;

+	if (is_guest_mode(v))
+		return 0;
+
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
 	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
@@ -2214,6 +2217,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
 			    tsc - vcpu->arch.last_guest_tsc;

+		if (is_guest_mode(vcpu))
+			tsc_delta = 0;
+
 		if (tsc_delta < 0)
 			mark_tsc_unstable("KVM discovered backwards TSC");
 		if (check_tsc_unstable()) {
@@ -2234,7 +2240,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	if (!is_guest_mode(vcpu))
+		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
 }

 static int is_efer_nx(void)
@@ -5717,7 +5724,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();

-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	if (!is_guest_mode(vcpu))
+		kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);

 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();

^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2
  2011-07-19  2:41 ` Bandan Das
From: Bandan Das @ 2011-07-19 2:41 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: KVM Mailing List, Nadav Har'El, Zachary Amsden

On 0, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> Using the guest's TSC value when performing TSC adjustments is wrong. Can
> you please try the following patch, which skips TSC adjustments if the
> vcpu is in guest mode.
> [...]

Hi Marcelo,

Thanks for looking at this and for the explanation of the cause. Your
patch does solve my problem. I have been running my L2 guest for a few
hours now without any hang, while usually I would get a hang within
minutes of booting L2.

Thanks again!
Bandan

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2
  2011-07-20  7:58 ` Jan Kiszka
From: Jan Kiszka @ 2011-07-20 7:58 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Bandan Das, KVM Mailing List, Nadav Har'El, Zachary Amsden

On 2011-07-18 20:26, Marcelo Tosatti wrote:
> Using the guest's TSC value when performing TSC adjustments is wrong. Can
> you please try the following patch, which skips TSC adjustments if the
> vcpu is in guest mode.
> [...]

That unfortunately does not fix the L1 lockups I get here - unless I
confine L1 to a single CPU. It looks like (I don't have all symbols for
the guest kernel at the moment) we are stuck in processing a timer IRQ.

Jan

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2
  2011-07-20 16:12 ` Marcelo Tosatti
From: Marcelo Tosatti @ 2011-07-20 16:12 UTC (permalink / raw)
To: Jan Kiszka; +Cc: Bandan Das, KVM Mailing List, Nadav Har'El, Zachary Amsden

On Wed, Jul 20, 2011 at 09:58:47AM +0200, Jan Kiszka wrote:
> That unfortunately does not fix the L1 lockups I get here - unless I
> confine L1 to a single CPU. It looks like (don't have all symbols for
> the guest kernel ATM) that we are stuck in processing a timer IRQ.

Is L1 using kvmclock?

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2
  2011-07-20 16:19 ` Jan Kiszka
From: Jan Kiszka @ 2011-07-20 16:19 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Bandan Das, KVM Mailing List, Nadav Har'El, Zachary Amsden

On 2011-07-20 18:12, Marcelo Tosatti wrote:
> Is L1 using kvmclock?

Yes, it's a standard 3.0-rc7 SUSE kernel. Disabling it seems to help at
first glance.

Jan

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2 2011-07-20 16:19 ` Jan Kiszka @ 2011-07-20 16:35 ` Marcelo Tosatti 0 siblings, 0 replies; 18+ messages in thread From: Marcelo Tosatti @ 2011-07-20 16:35 UTC (permalink / raw) To: Jan Kiszka; +Cc: Bandan Das, KVM Mailing List, Nadav Har'El, Zachary Amsden On Wed, Jul 20, 2011 at 06:19:19PM +0200, Jan Kiszka wrote: > On 2011-07-20 18:12, Marcelo Tosatti wrote: > > On Wed, Jul 20, 2011 at 09:58:47AM +0200, Jan Kiszka wrote: > >> On 2011-07-18 20:26, Marcelo Tosatti wrote: > >>> > >>> On Fri, Jul 08, 2011 at 02:40:53PM -0400, Bandan Das wrote: > >>>> I have already discussed this a bit with Nadav but hoping someone > >>>> else has any other ideas/clues/suggestions/comments. With recent versions of the > >>>> kernel (The last I tried is 3.0-rc5 with nVMX patches already merged), my L1 guest > >>>> always hangs when I start L2. > >>>> > >>>> My setup : The host, L1 and L2 all are FC15 with the host running 3.0-rc5. When L1 is up > >>>> and running, I start L2 from L1. Within a minute or two, both L1 and L2 hang. Although, if > >>>> if I run tracing on the host, I see : > >>>> > >>>> ... > >>>> qemu-kvm-19756 [013] 153774.856178: kvm_exit: reason APIC_ACCESS rip 0xffffffff81025098 info 1380 0 > >>>> qemu-kvm-19756 [013] 153774.856189: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0 > >>>> qemu-kvm-19756 [013] 153774.856191: kvm_exit: reason VMREAD rip 0xffffffffa00d5127 info 0 0 > >>>> ... > >>>> > >>>> My point being that I only see kvm_exit messages but no kvm_entry. Does this mean that the VCPUs > >>>> are somehow stuck in L2 ? 
> >>>> > >>>> Anyway, since this setup was running fine for me on older kernels, and I couldn't > >>>> identify any significant changes in nVMX, I sifted through the other KVM changes and found this : > >>>> > >>>> -- > >>>> commit 1aa8ceef0312a6aae7dd863a120a55f1637b361d > >>>> Author: Nikola Ciprich <extmaillist@linuxbox.cz> > >>>> Date: Wed Mar 9 23:36:51 2011 +0100 > >>>> > >>>> KVM: fix kvmclock regression due to missing clock update > >>>> > >>>> commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu), > >>>> breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function > >>>> to proper place. > >>>> > >>>> Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz> > >>>> Acked-by: Zachary Amsden <zamsden@redhat.com> > >>>> Signed-off-by: Avi Kivity <avi@redhat.com> > >>>> > >>>> index 01f08a6..f1e4025 100644 (file) > >>>> --- a/arch/x86/kvm/x86.c > >>>> +++ b/arch/x86/kvm/x86.c > >>>> @@ -2127,8 +2127,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) > >>>> if (check_tsc_unstable()) { > >>>> kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta); > >>>> vcpu->arch.tsc_catchup = 1; > >>>> - kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); > >>>> } > >>>> + kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); > >>>> if (vcpu->cpu != cpu) > >>>> kvm_migrate_timers(vcpu); > >>>> vcpu->cpu = cpu; > >>>> -- > >>>> > >>>> If I revert this change, my L1/L2 guests run fine. This ofcourse, just hides the bug > >>>> because on my machine, check_tsc_unstable() returns false. > >>>> > >>>> I found out from Nadav that when KVM decides to run L2, it will write > >>>> vmcs01->tsc_offset + vmcs12->tsc_offset to the active TSC_OFFSET which seems right. > >>>> But I verified that, if instead, I just write > >>>> vmcs01->tsc_offset to TSC_OFFSET in prepare_vmcs02(), I don't see the bug anymore. > >>>> > >>>> Not sure where to go from here. I would appreciate if any one has any ideas. 
> >>>> > >>>> > >>>> Bandan > >>> > >>> Using guests TSC value when performing TSC adjustments is wrong. Can > >>> you please try the following patch, which skips TSC adjustments if > >>> vcpu is in guest mode. > >>> > >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > >>> index 2b76ae3..44c90d1 100644 > >>> --- a/arch/x86/kvm/x86.c > >>> +++ b/arch/x86/kvm/x86.c > >>> @@ -1096,6 +1096,9 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) > >>> s64 kernel_ns, max_kernel_ns; > >>> u64 tsc_timestamp; > >>> > >>> + if (is_guest_mode(v)) > >>> + return 0; > >>> + > >>> /* Keep irq disabled to prevent changes to the clock */ > >>> local_irq_save(flags); > >>> kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp); > >>> @@ -2214,6 +2217,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) > >>> tsc_delta = !vcpu->arch.last_guest_tsc ? 0 : > >>> tsc - vcpu->arch.last_guest_tsc; > >>> > >>> + if (is_guest_mode(vcpu)) > >>> + tsc_delta = 0; > >>> + > >>> if (tsc_delta < 0) > >>> mark_tsc_unstable("KVM discovered backwards TSC"); > >>> if (check_tsc_unstable()) { > >>> @@ -2234,7 +2240,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) > >>> { > >>> kvm_x86_ops->vcpu_put(vcpu); > >>> kvm_put_guest_fpu(vcpu); > >>> - kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); > >>> + if (!is_guest_mode(vcpu)) > >>> + kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); > >>> } > >>> > >>> static int is_efer_nx(void) > >>> @@ -5717,7 +5724,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > >>> if (hw_breakpoint_active()) > >>> hw_breakpoint_restore(); > >>> > >>> - kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); > >>> + if (!is_guest_mode(vcpu)) > >>> + kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); > >>> > >>> vcpu->mode = OUTSIDE_GUEST_MODE; > >>> smp_wmb(); > >> > >> That unfortunately does not fix the L1 lockups I get here - unless I > >> confine L1 to a single CPU. 
It looks like (don't have all symbols for > >> the guest kernel ATM) that we are stuck in processing a timer IRQ. > >> > >> Jan > > > > Is L1 using kvmclock? > > Yes, it's a standard 3.0-rc7 SUSE kernel. Disabling it seems to help on > first glance. > > Jan The patch I posted is broken; thinking on a proper fix. ^ permalink raw reply [flat|nested] 18+ messages in thread
[parent not found: <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com>]
* Re: Nested VMX - L1 hangs on running L2 [not found] ` <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com> @ 2011-07-20 19:52 ` Nadav Har'El 2011-07-20 20:42 ` Bandan Das 2011-07-21 2:49 ` Zachary Amsden 0 siblings, 2 replies; 18+ messages in thread From: Nadav Har'El @ 2011-07-20 19:52 UTC (permalink / raw) To: Zachary Amsden; +Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti > No, both patches are wrong. Guys, thanks for looking into this bug. I'm afraid I'm still at a loss as to why a TSC bug would even cause a guest lockup :( When Avi Kivity saw my nested TSC handling code he remarked "this is probably wrong". When I asked him where it was wrong, he basically said that he didn't know where, but TSC handling code is always wrong ;-) And it turns out he was right. > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all times. > We are serving the L1 guest in this hypervisor, not the L2 guest, and so > should never read its TSC for any reason. ... > allows the L2 guest to overwrite the L1 guest TSC, which at first seems wrong, > but is in fact the correct virtualization of a security hole in the L1 guest. I think I'm beginning to see the error in my ways... When L1 gives L2 (using the MSR bitmap) direct read/write access to the TSC, it doesn't want L0 to be "clever" and give L2 its own separate TSC (like I do now), but rather hands it full control over L1's TSC - so reading or writing it should actually return L1's TSC, and the TSC_OFFSET in vmcs12 is to be ignored. So basically, if I understand correctly, what I need to change is in prepare_vmcs02(): if the MSR_IA32_TSC is on the MSR bitmap (read? write?), instead of doing vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); I just need to do vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); thereby giving L2 exactly the same TSC that L1 had. Bandan, if I remember correctly you once tried this sort of fix and it actually worked?
Then, guest_read_tsc() will return (without needing to change this code) the correct L1 TSC. And vmx_write_tsc_offset() should, in the is_guest_mode() case, not do what it does now (vmcs12->tsc_offset is of no importance when the TSC MSR is passed through) but rather set vmcs01_tsc_offset (which will be applied on the next exit to L1). Is my analysis correct? Or perhaps completely wrong? ;-) Am I missing anything else that should be changed? In any case, I don't understand why on my machine I never encountered these problems, and nothing broke even if I replaced the TSC nesting code with randomly broken code. Are the people who are seeing this breakage actually passing the MSR from L1 to L2 - using the MSR bitmap - like I guessed above? Or am I missing something completely different? Sorry, but I'm really becoming confused by these TSC issues... Thanks, Nadav. ^ permalink raw reply [flat|nested] 18+ messages in thread
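For readers following along, the proposal Nadav sketches here can be stated as a tiny model. This is only an illustration, not KVM code: the parameter names echo the thread's `vmcs01_tsc_offset`/`vmcs12->tsc_offset`, and the boolean stands in for the "is MSR_IA32_TSC passed through in the MSR bitmap?" check (Zachary's reply later in the thread refines when each branch is the right one):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Toy model of the TSC_OFFSET value prepare_vmcs02() would load for L2.
 * NOT KVM code -- just the arithmetic being debated in the thread.
 */
uint64_t vmcs02_tsc_offset(uint64_t vmcs01_tsc_offset,
                           uint64_t vmcs12_tsc_offset,
                           bool l1_passes_tsc_msr_through)
{
    /*
     * Pass-through case: L2 operates directly on L1's TSC, so the
     * offset L1 configured for L2 in vmcs12 is ignored.
     */
    if (l1_passes_tsc_msr_through)
        return vmcs01_tsc_offset;

    /* Trapped case: L2 observes host TSC + L1's offset + L2's offset. */
    return vmcs01_tsc_offset + vmcs12_tsc_offset;
}
```

With offsets 100 (L1) and 50 (L2), the trapped case yields 150 and the pass-through case 100, which is exactly the behavioral difference Bandan's workaround exposed.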
* Re: Nested VMX - L1 hangs on running L2 2011-07-20 19:52 ` Nadav Har'El @ 2011-07-20 20:42 ` Bandan Das 2011-07-21 2:49 ` Zachary Amsden 1 sibling, 0 replies; 18+ messages in thread From: Bandan Das @ 2011-07-20 20:42 UTC (permalink / raw) To: Nadav Har'El Cc: Zachary Amsden, Bandan Das, KVM Mailing List, Marcelo Tosatti On 0, Nadav Har'El <NYH@il.ibm.com> wrote: > > No, both patches are wrong. > > Guys, thanks for looking into this bug. I'm afraid I'm still at a loss at > why a TSC bug would even cause a guest lockup :( > > When Avi Kivity saw my nested TSC handling code he remarked "this is > probably wrong". When I asked him where it was wrong, he basically said > that he didn't know where, but TSC handling code is always wrong ;-) > And it turns out he was right. > > > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all > times. > > We are serving the L1 guest in this hypervisor, not the L2 guest, and so > > > should never read its TSC for any reason. > ... > > allows the L2 guest to overwrite the L1 guest TSC, which at first seems > wrong, > > but is in fact the correct virtualization of a security hole in the L1 > guest. > > I think I'm beginning to see the error in my ways... > > When L1 lets L2 (using the MSR bitmap) direct read/write access to the TSC, > it doesn't want L0 to be "clever" and give L2 its own separate TSC (like > I do now), but rather gives it full control over L1's TSC - so reading or > writing it should actually return L1's TSC, and the TSC_OFFSET in vmcs12 > is to be ignored. > > So basically, if I understand correctly, what I need to change is > in prepare_vmcs02(), if the MSR_IA32_TSC is on the MSR bitmap (read? > write?), instead of doing > vmcs_write64(TSC_OFFSET, > vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); > I just need to do > vmcs_write64(TSC_OFFSET, > vmx->nested.vmcs01_tsc_offset); > thereby giving L2 exactly the same TSC that L1 had. 
> Brandan, if I remember correctly you once tried this sort of fix and > it actually worked? That is correct. That is still my "workaround fix" that I have been using on my systems. But as you have mentioned above (and below), I am still struggling with two questions : 1. Why does L1 hang even if the TSC has wrong values. 2. I see this on a Dell R610 and I don't know why you and some others don't see this. I assumed from the symptoms that this should be fairly easy to reproduce on any system. Bandan > Then, guest_read_tsc() will return (without need to change this code) > the correct L1 TSC. > > And vmx_write_tsc_offset() should do in the is_guest_mode() not what > it does now (vmcs12->tsc_offset is of no important when the TSC MSR > is passed through) but rather set vmcs01_tsc_offset (which will be > applied on the next exit to L1). > > Is my analysis correct? Or perhaps completely wrong? ;-) > Am I missing anything else that should be change? > > In any case, I don't understand why on my machine I never encountered > these problems, and nothing broke even if I replaced the TSC nesting > code with randomly broken code. Are the people who are seeing this > brakage actually passed the MSR from L1 to L2 - using the MSR bitmap - > like I guessed above? Or am I missing something completely different? > > Sorry, but I'm really becoming confused by these TSC issues... > > Thanks, > Nadav. > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2 2011-07-20 19:52 ` Nadav Har'El 2011-07-20 20:42 ` Bandan Das @ 2011-07-21 2:49 ` Zachary Amsden 2011-07-27 11:51 ` Nadav Har'El 2011-07-28 11:11 ` Nadav Har'El 1 sibling, 2 replies; 18+ messages in thread From: Zachary Amsden @ 2011-07-21 2:49 UTC (permalink / raw) To: Nadav Har'El; +Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti On Wed, Jul 20, 2011 at 12:52 PM, Nadav Har'El <NYH@il.ibm.com> wrote: > > > No, both patches are wrong. > > Guys, thanks for looking into this bug. I'm afraid I'm still at a loss at > why a TSC bug would even cause a guest lockup :( > > When Avi Kivity saw my nested TSC handling code he remarked "this is > probably wrong". When I asked him where it was wrong, he basically said > that he didn't know where, but TSC handling code is always wrong ;-) > And it turns out he was right. > > > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all > times. > > We are serving the L1 guest in this hypervisor, not the L2 guest, and so > > > should never read its TSC for any reason. > ... > > allows the L2 guest to overwrite the L1 guest TSC, which at first seems > wrong, > > but is in fact the correct virtualization of a security hole in the L1 > guest. > > I think I'm beginning to see the error in my ways... > > When L1 lets L2 (using the MSR bitmap) direct read/write access to the TSC, > it doesn't want L0 to be "clever" and give L2 its own separate TSC (like > I do now), but rather gives it full control over L1's TSC - so reading or > writing it should actually return L1's TSC, and the TSC_OFFSET in vmcs12 > is to be ignored. > > So basically, if I understand correctly, what I need to change is > in prepare_vmcs02(), if the MSR_IA32_TSC is on the MSR bitmap (read? 
> write?), instead of doing > vmcs_write64(TSC_OFFSET, > vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); > I just need to do > vmcs_write64(TSC_OFFSET, > vmx->nested.vmcs01_tsc_offset); > thereby giving L2 exactly the same TSC that L1 had. > Brandan, if I remember correctly you once tried this sort of fix and > it actually worked? > > Then, guest_read_tsc() will return (without need to change this code) > the correct L1 TSC. > Not quite. kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC, regardless of the setting of any MSR bitmap. The reason is that it is being called by the L0 hypervisor kernel, which handles only interactions with the L1 MSRs. If L1 wants to configure L2 to have its own private TSC (99.99% of use cases), it should mask off access to the TSC MSR in the permission bitmap. Then when L2 reads or writes the MSR (not the TSC, which is protected by the RDTSC instruction, and a different trap), an exit will be generated to L0, which will read the permission bitmaps, determine that the access is denied, and forward the exit to L1, which will deal with reading or writing the L2 MSR. L0 NEVER deals directly with reading or writing the L2 MSR in this scenario; how that is handled is supposed to be determined by L1. If L1 wants to configure L2 to be able to read and write the L1 TSC instead, it should enable access to the TSC MSR in the MSR permission bitmap. Then when L2 reads or writes the MSR, an exit will be generated to L0, which will read the permission bitmaps, determine that the access is allowed, and directly read or write the L1 TSC. This requires changing both the offset of the running L2 (in the active L0 VMCS) and the saved offset of the inactive L1. However, when computing the L2 tsc offset field to use for a running guest, BOTH the L1 and L2 offsets should be added. This is exactly what nested hardware would do.
This implies that to be correct in the second scenario, you need to forget the change you made to the offset of the running L2 (in the active L0 VMCS) when the L2 guest stops running. You want it to be live on hardware, but not stored in any saved TSC_OFFSET field for the L2 guest - it seems illogical, but it is not. The reason for that is that the L2 guest was updating ONLY the L1 TSC, not the L2 TSC_OFFSET (which is still controlled by L1). Note the following invariants: For N>1, L0 will never under any circumstance set or adjust the L{N} TSC virtual VMCS field. That is only for the L{N-1} guest to do, and any attempt to do so by any other than L{N-1} is not a correct virtualization. L0 will only set the hardware VMCS field to the offset required for virtualizing each level L1-L{N}. Also, the invariant for nested hardware: the observed TSC of an L{N} guest is equal to the hardware TSC, plus the TSC offset field for all L1-L{N}. If the hardware does not support nested virtualization, this invariant must be guaranteed by the L0 hypervisor, which is simulating nested hardware by adding up all of the nested offset fields. Neither of these facts involves checking the status of permission bits in the MSR map. That is only done for determining success or failure of an MSR read or write, which may be forwarded to some L{N} as a virtual exit or #GP fault. Yes, it is tricky, but once you start thinking of it in simple hardware terms, you will realize there is no need to be clever, just precise. If it seems counterintuitive to think about the second case, consider the scenario when L2 and L1 are cooperative, and L1 sets the L2 TSC_OFFSET to zero :) Then L2 simply manages the TSC for L1. I find that makes reasoning about the correct place at which to store writes and adjustments much more intuitive. Zach ^ permalink raw reply [flat|nested] 18+ messages in thread
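Zachary's second invariant can be captured in a few lines of plain C. This is a hedged sketch, not a KVM function -- `observed_tsc()` is a hypothetical helper that just expresses "the TSC seen by an L{n} guest is the hardware TSC plus the TSC_OFFSET of every level from L1 through L{n}", the sum L0 must compute by hand when simulating nested hardware:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical helper illustrating the nesting invariant: walk the
 * chain of per-level TSC offsets and accumulate them onto the raw
 * hardware TSC.  offsets[0] is L1's offset, offsets[1] is L2's, etc.
 * level == 0 means "bare hardware" (no offsets applied).
 */
uint64_t observed_tsc(uint64_t hw_tsc, const uint64_t *offsets, size_t level)
{
    uint64_t tsc = hw_tsc;
    size_t i;

    for (i = 0; i < level; i++)
        tsc += offsets[i];
    return tsc;
}
```

Note how this makes the earlier point concrete: the single offset field that L0 loads into the real VMCS for a running L2 must equal `offsets[0] + offsets[1]`, even though each offset is owned by a different level.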
* Re: Nested VMX - L1 hangs on running L2 2011-07-21 2:49 ` Zachary Amsden @ 2011-07-27 11:51 ` Nadav Har'El 2011-07-29 9:01 ` Zachary Amsden 2011-07-28 11:11 ` Nadav Har'El 1 sibling, 1 reply; 18+ messages in thread From: Nadav Har'El @ 2011-07-27 11:51 UTC (permalink / raw) To: Zachary Amsden; +Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2": > > > No, both patches are wrong. > > > > kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC, > regardless of the setting of any MSR bitmap. The reason why is that it > is being called by the L0 hypervisor kernel, which handles only > interactions with the L1 MSRs. guest_read_tsc() (called by the above get_msr) currently does this: static u64 guest_read_tsc(void) { u64 host_tsc, tsc_offset; rdtscll(host_tsc); tsc_offset = vmcs_read64(TSC_OFFSET); return host_tsc + tsc_offset; } I guess you'd want this to change to something like: tsc_offset = is_guest_mode(vcpu) ? vmx->nested.vmcs01_tsc_offset : vmcs_read64(TSC_OFFSET); But I still am not convinced that that would be right.... E.g., imagine the case where L1 uses TSC_OFFSETING but doesn't trap TSC MSR reads. The SDM says (if I understand it correctly) that this TSC MSR read will not exit (because L1 doesn't trap it) but *will* do the extra offsetting. In this case, the original code (using vmcs02's TSC_OFFSET, which is the sum of that of vmcs01 and vmcs12) is correct, and the new code will be incorrect. Or am I misunderstanding the SDM? Can you tell me in which case the original code would return incorrect results to a guest (L1 or L2) doing anything MSR-related? I'm assuming that some code in KVM also uses kvm_get_msr and assumes it gets the TSC value for L1, not for the guest currently running (L2 or L1).
I don't understand why it needs to assume that... Why would it be wrong to return L2's TSC, and just remember that *changing* the L2 TSC really means changing the L1 TSC offset (vmcs01_tsc_offset), not vmcs12.tsc_offset which we can't touch? Thanks, Nadav. -- Nadav Har'El | Wednesday, Jul 27 2011, 25 Tammuz 5771 nyh@math.technion.ac.il |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |You may only be one person to the world, http://nadav.harel.org.il |but may also be the world to one person. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2 2011-07-27 11:51 ` Nadav Har'El @ 2011-07-29 9:01 ` Zachary Amsden 2011-07-29 10:21 ` Roedel, Joerg 2011-07-31 13:48 ` Nadav Har'El 0 siblings, 2 replies; 18+ messages in thread From: Zachary Amsden @ 2011-07-29 9:01 UTC (permalink / raw) To: Nadav Har'El Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel 2011/7/27 Nadav Har'El <nyh@math.technion.ac.il>: > On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2": >> > > No, both patches are wrong. >> > >> >> kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC, >> regardless of the setting of any MSR bitmap. The reason why is that it >> is being called by the L0 hypervisor kernel, which handles only >> interactions with the L1 MSRs. > > guest_read_tsc() (called by the above get_msr) currently does this: > > static u64 guest_read_tsc(void) > { > u64 host_tsc, tsc_offset; > > rdtscll(host_tsc); > tsc_offset = vmcs_read64(TSC_OFFSET); > return host_tsc + tsc_offset; > } That's wrong. You should NEVER believe the offset written into the hardware VMCS to be the current, real L1 TSC offset, as that is not an invariant. Instead, you should ALWAYS return the host TSC + the L1 TSC offset. Sometimes, this may be the hardware value. > I guess you'd want this to change to something like: > > tsc_offset = is_guest_mode(vcpu) ? > vmx->nested.vmcs01_tsc_offset : > vmcs_read64(TSC_OFFSET); > > But I still am not convinced that that would be right.... I believe this is correct. But may it be cheaper to read from the in-memory structure than the actual hardware VMCS? > E.g., imagine the case where L1 uses TSC_OFFSETING and but doesn't > trap TSC MSR read. The SDM says (if I understand it correctly) that this TSC > MSR read will not exit (because L1 doesn't trap it) but *will* do the extra > offsetting. 
In this case, the original code (using vmcs02's TSC_OFFSET which > is the sum of that of vmcs01 and vmcs12), is correct, and the new code will > be incorrect. Or am I misunderstanding the SDM? In that case, you need to distinguish between reads of the TSC MSR by the guest and reads by the host (as done internally to track drift and compensation). The code that needs to change isn't guest_read_tsc(), that code must respect the invariant of only returning the L1 guest TSC (in fact, that may be a good name change for the function). What needs to change is the actual code involved in the MSR read. If it determines that something other than the L1 guest is running, it needs to ignore the hardware TSC offset and return the TSC as if read by the L1 guest. Unfortunately, the layering currently doesn't seem to allow for this, and it looks like both vendor specific variants currently get this wrong. The call stack: kvm_get_msr() kvm_x86_ops->get_msr() vendor_get_msr() vendor_guest_read_tsc() offers no insight as to the intention of the caller. Is it trying to get the guest TSC to return to the guest, or is it trying to get the guest TSC to calibrate / measure and compensate for TSC effects? So you are right, this is still wrong for the case in which L1 does not trap TSC MSR reads. Note however, the RDTSC instruction is still virtualized properly, it is only the relatively rare actual TSC MSR read via RDMSR which is mis-virtualized (this bug exists today in the SVM implementation if I am reading it correctly - cc'd Joerg to notify him of that). That, combined with the relative importance of supporting a guest which does not trap on these MSR reads suggest this is a low priority design issue however (RDTSC still works even if the MSR is trapped, correct?) 
If you want to go the extra mile to support such guests, the only fully correct approach then is to do one of the following: 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and transform current uses of the code which does TSC compensation to use this new API. *Bonus* - no need to do double indirection through the generic MSR code. or, alternatively: 2) Do not trap MSR reads of the TSC if the current L1 guest is not trapping MSR reads of the TSC. This is not possible if you cannot enforce specific read vs. write permission in hardware - it may be possible however, if you can trap all MSR writes regardless of the permission bitmap. > Can you tell me in which case the original code would returen incorrect > results to a guest (L1 or L2) doing anything MSR-related? It never returns incorrect values to a guest. It does however return incorrect values to the L0 hypervisor, which is expecting to do arithmetic based on the L1 TSC value, and this fails catastrophically when it receives values for other nested guests. > I'm assuming that some code in KVM also uses kvm_read_msr and assumes it > gets the TSC value for L1, not for the guest currently running (L2 or L1). Exactly. > I don't understand why it needs to assume that... Why would it be wrong to > return L2's TSC, and just remember that *changing* the L2 TSC really means > changing the L1 TSC offset (vmcs01_tsc_offset), not vmcs12.tsc_offset which > we can't touch? L0 measures the L1 TSC at various points to be sure the L1 TSC never regresses by going backwards, and also to restore the value if there is a non-terminal hardware power interruption (hardware sleep which resets hardware TSC to zero, but leaves the system image intact - i.e S1 sleep). If the measurement returns something other than the L1 TSC, this fails. 
So the API as it stands today is broken: the way the code is today, it needs to return the L1 TSC only - but the incorrect virtualization is limited to quite an obscure case which doesn't appear to be a problem in practice. And as I pointed out, you can either fix the API, or, if you have cooperative hardware, ignore the issue and simply not trap on TSC MSR reads from L2 guests. Zach ^ permalink raw reply [flat|nested] 18+ messages in thread
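Zachary's option 1 -- a dedicated "L1 TSC" accessor living alongside the guest-visible read path -- can be sketched as follows. The `struct vcpu_model` and its fields are hypothetical stand-ins (KVM's real `vcpu_vmx`/nested state differ); the point is only the split between the two callers:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical vcpu state -- not KVM's real structures. */
struct vcpu_model {
    bool guest_mode;            /* true while an L2 guest is running */
    uint64_t vmcs01_tsc_offset; /* L1's offset, saved by L0 */
    uint64_t active_tsc_offset; /* offset currently in the hardware VMCS */
};

/*
 * Option 1's accessor: what L0 should use internally for drift tracking
 * and compensation -- always L1's view, even while L2 runs.
 */
uint64_t read_l1_tsc(const struct vcpu_model *v, uint64_t host_tsc)
{
    return host_tsc + (v->guest_mode ? v->vmcs01_tsc_offset
                                     : v->active_tsc_offset);
}

/*
 * The guest-visible read path: an in-guest TSC read observes whatever
 * offset is live on hardware for the currently running level.
 */
uint64_t guest_read_tsc(const struct vcpu_model *v, uint64_t host_tsc)
{
    return host_tsc + v->active_tsc_offset;
}
```

The two functions agree whenever L1 itself is running; they diverge exactly in guest mode, which is the divergence that was confusing kvm_get_msr()'s callers in x86.c.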
* Re: Nested VMX - L1 hangs on running L2 2011-07-29 9:01 ` Zachary Amsden @ 2011-07-29 10:21 ` Roedel, Joerg 2011-07-31 13:48 ` Nadav Har'El 0 siblings, 0 replies; 18+ messages in thread From: Roedel, Joerg @ 2011-07-29 10:21 UTC (permalink / raw) To: Zachary Amsden Cc: Nadav Har'El, Bandan Das, KVM Mailing List, Marcelo Tosatti On Fri, Jul 29, 2011 at 05:01:16AM -0400, Zachary Amsden wrote: > So you are right, this is still wrong for the case in which L1 does > not trap TSC MSR reads. Note however, the RDTSC instruction is still > virtualized properly, it is only the relatively rare actual TSC MSR > read via RDMSR which is mis-virtualized (this bug exists today in the > SVM implementation if I am reading it correctly - cc'd Joerg to notify > him of that). That, combined with the relative importance of > supporting a guest which does not trap on these MSR reads suggest this > is a low priority design issue however (RDTSC still works even if the > MSR is trapped, correct?) Actually, the documentation is not entirely clear about this. But I tend to agree that direct _reads_ of MSR 0x10 in guest-mode should return the tsc with tsc_offset applied. On the other hand, there is even SVM hardware which does this wrong. For some K8s there is an erratum that the tsc_offset is not applied when the MSR is read directly in guest mode. But yes, to be architecturally correct the msr read should always return the tsc of the currently running guest level. In reality this shouldn't be an issue, though, and rdtsc[p] is still working correctly. Regards, Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2 2011-07-29 9:01 ` Zachary Amsden 2011-07-29 10:21 ` Roedel, Joerg @ 2011-07-31 13:48 ` Nadav Har'El 2011-07-31 18:55 ` Zachary Amsden 1 sibling, 1 reply; 18+ messages in thread From: Nadav Har'El @ 2011-07-31 13:48 UTC (permalink / raw) To: Zachary Amsden Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2": >... > In that case, you need to distinguish between reads of the TSC MSR by > the guest and reads by the host (as done internally to track drift and >... > Unfortunately, the layering currently doesn't seem to allow for this, > and it looks like both vendor specific variants currently get this > wrong. >... > 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and > transform current uses of the code which does TSC compensation to use > this new API. *Bonus* - no need to do double indirection through the > generic MSR code. Thank you so much for this analysis! I propose the following two patches. The first one just fixes two small bugs in the correct emulation in the VMX case, and the second patch solves the originally-reported bug as you outlined above: The code in x86.c which used to call *_get_msr() to get the TSC no longer does it - because *_get_msr() should only be called with the intention of reading the msr of the current guest (L1 or L2) and returning it to that guest. Instead, I added a new x86_op read_l1_tsc(vcpu), whose intention is to always return L1's notion of the current TSC - even if the current guest is L2. All calls in x86.c now use this new read_l1_tsc() function instead of reading the MSR - does this look correct to you? > read via RDMSR which is mis-virtualized (this bug exists today in the > SVM implementation if I am reading it correctly - cc'd Joerg to notify I believe that you're right. 
I created (in the patch below) svm.c's read_l1_tsc() with the same code you had in svm_get_msr(), but I think that in svm_get_msr() the code should be different in the nested case - I think you should use svm->vmcb, not get_host_vmcb()? I didn't add this svm.c fix to the patch, because I'm not sure at all that my understanding here is correct. Zachary, Marcelo, do these patches look right to you? Bandan, and others who can reproduce this bug - do these patches solve the bug? > So you are right, this is still wrong for the case in which L1 does > not trap TSC MSR reads. Note however, the RDTSC instruction is still > virtualized properly, it is only the relatively rare actual TSC MSR > read via RDMSR which is mis-virtualized (this bug exists today in the > SVM implementation if I am reading it correctly - cc'd Joerg to notify > him of that). That, combined with the relative importance of > supporting a guest which does not trap on these MSR reads suggest this > is a low priority design issue however (RDTSC still works even if the > MSR is trapped, correct?) Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these are indeed obscure corner cases, but I think that it's still really confusing that a function called kvm_get_msr deliberately works incorrectly (returning always the L1 TSC) just so that it can fit some other uses. The SVM code worked in the common case, but it was still confusing why svm_get_msr() deliberately goes to great lengths to use L1's TSC_OFFSET even when L2 is running - when this is not how this MSR is really supposed to behave. Not to mention that one day, somebody *will* want to use these obscure settings and be surprised that they don't work ;-) Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation This patch fixes two corner cases in nested (L2) handling of TSC-related exits: 1.
Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to the TSC MSR without an exit, then this should set L1's TSC value itself - not offset by vmcs12.TSC_OFFSET (as was wrongly done in the previous code). 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore the vmcs12.TSC_OFFSET. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> --- arch/x86/kvm/vmx.c | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) --- .before/arch/x86/kvm/vmx.c 2011-07-31 16:17:30.000000000 +0300 +++ .after/arch/x86/kvm/vmx.c 2011-07-31 16:17:30.000000000 +0300 @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v */ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) { - vmcs_write64(TSC_OFFSET, offset); - if (is_guest_mode(vcpu)) + if (is_guest_mode(vcpu)) { /* - * We're here if L1 chose not to trap the TSC MSR. Since - * prepare_vmcs12() does not copy tsc_offset, we need to also - * set the vmcs12 field here. + * We're here if L1 chose not to trap WRMSR to TSC. According + * to the spec, this should set L1's TSC; the offset that L1 + * set for L2 remains unchanged, and still needs to be added + * to the newly set TSC to get L2's TSC. */ - get_vmcs12(vcpu)->tsc_offset = offset - - to_vmx(vcpu)->nested.vmcs01_tsc_offset; + struct vmcs12 *vmcs12; + to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset; + /* recalculate vmcs02.TSC_OFFSET: */ + vmcs12 = get_vmcs12(vcpu); + vmcs_write64(TSC_OFFSET, offset + + (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+ vmcs12->tsc_offset : 0)); + } else { + vmcs_write64(TSC_OFFSET, offset); + } } static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment) @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc set_cr4_guest_host_mask(vmx); - vmcs_write64(TSC_OFFSET, - vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); + if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING) + vmcs_write64(TSC_OFFSET, + vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); + else + vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); if (enable_vpid) { /* @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm load_vmcs12_host_state(vcpu, vmcs12); - /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */ + /* Update TSC_OFFSET if TSC was changed while L2 ran */ vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); /* This is needed for same reason as it was needed in prepare_vmcs02 */ Subject: [PATCH 2/2] nVMX: L1 TSC handling KVM assumed in several places that reading the TSC MSR returns the value for L1. This is incorrect, because when L2 is running, the correct TSC read exit emulation is to return L2's value. We therefore add a new x86_ops function, read_l1_tsc, to use in places that specifically need to read the L1 TSC, NOT the TSC of the current level of guest. 
Signed-off-by: Nadav Har'El <nyh@il.ibm.com> --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/svm.c | 9 +++++++++ arch/x86/kvm/vmx.c | 17 +++++++++++++++++ arch/x86/kvm/x86.c | 8 ++++---- 4 files changed, 32 insertions(+), 4 deletions(-) --- .before/arch/x86/include/asm/kvm_host.h 2011-07-31 16:17:30.000000000 +0300 +++ .after/arch/x86/include/asm/kvm_host.h 2011-07-31 16:17:30.000000000 +0300 @@ -636,6 +636,8 @@ struct kvm_x86_ops { struct x86_instruction_info *info, enum x86_intercept_stage stage); + u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu); + const struct trace_print_flags *exit_reasons_str; }; --- .before/arch/x86/kvm/vmx.c 2011-07-31 16:17:30.000000000 +0300 +++ .after/arch/x86/kvm/vmx.c 2011-07-31 16:17:30.000000000 +0300 @@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void) } /* + * Like guest_read_tsc, but always returns L1's notion of the timestamp + * counter, even if a nested guest (L2) is currently running. + */ +u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu) +{ + u64 host_tsc, tsc_offset; + + rdtscll(host_tsc); + tsc_offset = is_guest_mode(vcpu) ? + to_vmx(vcpu)->nested.vmcs01_tsc_offset : + vmcs_read64(TSC_OFFSET); + return host_tsc + tsc_offset; +} + +/* * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ * ioctl. In this case the call-back should update internal vmx state to make * the changes effective.
@@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops = .set_tdp_cr3 = vmx_set_cr3, .check_intercept = vmx_check_intercept, + + .read_l1_tsc = vmx_read_l1_tsc, }; static int __init vmx_init(void) --- .before/arch/x86/kvm/svm.c 2011-07-31 16:17:30.000000000 +0300 +++ .after/arch/x86/kvm/svm.c 2011-07-31 16:17:30.000000000 +0300 @@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct return 0; } +u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu) +{ + return to_svm(vcpu)->vmcb->control.tsc_offset + + svm_scale_tsc(vcpu, native_read_tsc()); +} + + static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data) { struct vcpu_svm *svm = to_svm(vcpu); @@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops = .set_tdp_cr3 = set_tdp_cr3, .check_intercept = svm_check_intercept, + + .read_l1_tsc = svm_read_l1_tsc, }; static int __init svm_init(void) --- .before/arch/x86/kvm/x86.c 2011-07-31 16:17:30.000000000 +0300 +++ .after/arch/x86/kvm/x86.c 2011-07-31 16:17:30.000000000 +0300 @@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct /* Keep irq disabled to prevent changes to the clock */ local_irq_save(flags); - kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp); + tsc_timestamp = kvm_x86_ops->read_l1_tsc(v); kernel_ns = get_kernel_ns(); this_tsc_khz = vcpu_tsc_khz(v); if (unlikely(this_tsc_khz == 0)) { @@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu s64 tsc_delta; u64 tsc; - kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc); + tsc = kvm_x86_ops->read_l1_tsc(vcpu); tsc_delta = !vcpu->arch.last_guest_tsc ?
0 : tsc - vcpu->arch.last_guest_tsc; @@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu * { kvm_x86_ops->vcpu_put(vcpu); kvm_put_guest_fpu(vcpu); - kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); + vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu); } static int is_efer_nx(void) @@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v if (hw_breakpoint_active()) hw_breakpoint_restore(); - kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); + vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu); vcpu->mode = OUTSIDE_GUEST_MODE; smp_wmb(); -- Nadav Har'El | Sunday, Jul 31 2011, 29 Tammuz 5771 nyh@math.technion.ac.il |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |Support bacteria - they're the only http://nadav.harel.org.il |culture some people have! ^ permalink raw reply [flat|nested] 18+ messages in thread
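[Editor's note: the offset bookkeeping that read_l1_tsc() relies on can be summarized in a small standalone model. This is an illustration with made-up values, not kernel code: while L2 runs, the active VMCS (vmcs02) carries the *sum* of the offset L0 gave L1 (vmcs01) and the offset L1 gave L2 (vmcs12), so the architectural TSC read yields L2's view; L1's view needs the vmcs01 offset alone.]

```c
#include <assert.h>
#include <stdint.h>

/* TSC observed by whichever guest is currently running: in guest mode
 * (L2 active) the effective offset is vmcs01 + vmcs12. */
static uint64_t current_guest_tsc(uint64_t host_tsc, uint64_t vmcs01_off,
                                  uint64_t vmcs12_off, int in_guest_mode)
{
        return host_tsc + vmcs01_off + (in_guest_mode ? vmcs12_off : 0);
}

/* What the new read_l1_tsc() hook returns: L1's view of the TSC,
 * regardless of whether L1 or L2 is currently running. */
static uint64_t l1_view_tsc(uint64_t host_tsc, uint64_t vmcs01_off)
{
        return host_tsc + vmcs01_off;
}
```

The model makes the bug's shape obvious: while L2 runs, the two functions differ by exactly vmcs12's offset, which is the error that leaked into L1's kvm-clock when the generic code read the TSC MSR.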
* Re: Nested VMX - L1 hangs on running L2 2011-07-31 13:48 ` Nadav Har'El @ 2011-07-31 18:55 ` Zachary Amsden 2011-07-31 20:34 ` Nadav Har'El 0 siblings, 1 reply; 18+ messages in thread From: Zachary Amsden @ 2011-07-31 18:55 UTC (permalink / raw) To: Nadav Har'El Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel This patch looks good, with one comment noted inline below. Are there no other call sites for kvm_get_msr() or which alias some other function to kvm_get_msr(MSR_IA32_TSC) ? Did I miss the first patch? Zach On Sun, Jul 31, 2011 at 6:48 AM, Nadav Har'El <nyh@math.technion.ac.il> wrote: > On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2": >>... >> In that case, you need to distinguish between reads of the TSC MSR by >> the guest and reads by the host (as done internally to track drift and >>... >> Unfortunately, the layering currently doesn't seem to allow for this, >> and it looks like both vendor specific variants currently get this >> wrong. >>... >> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and >> transform current uses of the code which does TSC compensation to use >> this new API. *Bonus* - no need to do double indirection through the >> generic MSR code. > > Thank you so much for this analysis! > > I propose the following two patches. The first one just fixes two small > bugs in the correct emulation in the VMX case, and the second patch solves > the originally-reported bug as you outlined above: > > The code in x86.c which used to call *_get_msr() to get the TSC no longer > does it - because *_get_msr() should only be called with the intention of > reading the msr of the current guest (L1 or L2) and returning it to that > guest. > > Instead, I added a new x86_op read_l1_tsc(vcpu), whose intention is to > always return L1's notion of the current TSC - even if the current guest > is L2. 
All calls in x86.c now use this new read_l1_tsc() function instead > of reading the MSR - does this look correct to you? > >> read via RDMSR which is mis-virtualized (this bug exists today in the >> SVM implementation if I am reading it correctly - cc'd Joerg to notify > > I believe that you're right. I created (in the patch below) svm.c's > read_l1_tsc() with the same code you had in svm_get_msr(), but I think > that in svm_get_msr() the code should be different in the nested case - > I think you should use svm->vmcb, not get_host_vmcb()? I didn't add this svm.c > fix to the patch, because I'm not sure at all that my understanding here is > correct. > > Zachary, Marcelo, do these patches look right to you? > Bandan, and others who can reproduce this bug - do these patches solve the > bug? > >> So you are right, this is still wrong for the case in which L1 does >> not trap TSC MSR reads. Note however, the RDTSC instruction is still >> virtualized properly, it is only the relatively rare actual TSC MSR >> read via RDMSR which is mis-virtualized (this bug exists today in the >> SVM implementation if I am reading it correctly - cc'd Joerg to notify >> him of that). That, combined with the relative importance of >> supporting a guest which does not trap on these MSR reads, suggests this >> is a low priority design issue however (RDTSC still works even if the >> MSR is trapped, correct?) > > Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these > are indeed obscure corner cases, but I think that it's still really confusing > that a function called kvm_get_msr deliberately works incorrectly (returning > always the L1 TSC) just so that it can fit some other uses. The SVM code worked > in the common case, but it was still confusing why svm_get_msr() deliberately > goes to great lengths to use L1's TSC_OFFSET even when L2 is running - when > this is not how this MSR is really supposed to behave. 
Not to mention that one > day, somebody *will* want to use these obscure settings and be surprised that > they don't work ;-) > > Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation > > This patch fixes two corner cases in nested (L2) handling of TSC-related > exits: > > 1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to > the TSC MSR without an exit, then this should set L1's TSC value itself - not > offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code). > > 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore > the vmcs12.TSC_OFFSET. > > Signed-off-by: Nadav Har'El <nyh@il.ibm.com> > --- > arch/x86/kvm/vmx.c | 31 +++++++++++++++++++++---------- > 1 file changed, 21 insertions(+), 10 deletions(-) > > --- .before/arch/x86/kvm/vmx.c 2011-07-31 16:17:30.000000000 +0300 > +++ .after/arch/x86/kvm/vmx.c 2011-07-31 16:17:30.000000000 +0300 > @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v > */ > static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) > { > - vmcs_write64(TSC_OFFSET, offset); > - if (is_guest_mode(vcpu)) > + if (is_guest_mode(vcpu)) { > /* > - * We're here if L1 chose not to trap the TSC MSR. Since > - * prepare_vmcs12() does not copy tsc_offset, we need to also > - * set the vmcs12 field here. > + * We're here if L1 chose not to trap WRMSR to TSC. According > + * to the spec, this should set L1's TSC; The offset that L1 > + * set for L2 remains unchanged, and still needs to be added > + * to the newly set TSC to get L2's TSC. > */ > - get_vmcs12(vcpu)->tsc_offset = offset - > - to_vmx(vcpu)->nested.vmcs01_tsc_offset; > + struct vmcs12 *vmcs12; > + to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset; > + /* recalculate vmcs02.TSC_OFFSET: */ > + vmcs12 = get_vmcs12(vcpu); > + vmcs_write64(TSC_OFFSET, offset + > + (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ? 
> + vmcs12->tsc_offset : 0)); > + } else { > + vmcs_write64(TSC_OFFSET, offset); > + } > } > > static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment) > @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc > > set_cr4_guest_host_mask(vmx); > > - vmcs_write64(TSC_OFFSET, > - vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); > + if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING) > + vmcs_write64(TSC_OFFSET, > + vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); > + else > + vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); > > if (enable_vpid) { > /* > @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm > > load_vmcs12_host_state(vcpu, vmcs12); > > - /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */ > + /* Update TSC_OFFSET if TSC was changed while L2 ran */ > vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); > > /* This is needed for same reason as it was needed in prepare_vmcs02 */ > > Subject: [PATCH 2/2] nVMX: L1 TSC handling > > KVM assumed in several places that reading the TSC MSR returns the value for > L1. This is incorrect, because when L2 is running, the correct TSC read exit > emulation is to return L2's value. > > We therefore add a new x86_ops function, read_l1_tsc, to use in places that > specifically need to read the L1 TSC, NOT the TSC of the current level of > guest. 
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 ++
>  arch/x86/kvm/svm.c              |    9 +++++++++
>  arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
>  arch/x86/kvm/x86.c              |    8 ++++----
>  4 files changed, 32 insertions(+), 4 deletions(-)
> 
> --- .before/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/include/asm/kvm_host.h	2011-07-31 16:17:30.000000000 +0300
> @@ -636,6 +636,8 @@ struct kvm_x86_ops {
>  			       struct x86_instruction_info *info,
>  			       enum x86_intercept_stage stage);
>  
> +	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
> +
>  	const struct trace_print_flags *exit_reasons_str;
>  };
> 
> --- .before/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-07-31 16:17:30.000000000 +0300
> @@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
>  }
>  
>  /*
> + * Like guest_read_tsc, but always returns L1's notion of the timestamp
> + * counter, even if a nested guest (L2) is currently running.
> + */
> +u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +	u64 host_tsc, tsc_offset;
> +
> +	rdtscll(host_tsc);
> +	tsc_offset = is_guest_mode(vcpu) ?
> +		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
> +		vmcs_read64(TSC_OFFSET);
> +	return host_tsc + tsc_offset;
> +}
> +
> +/*
>   * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
>   * ioctl. In this case the call-back should update internal vmx state to make
>   * the changes effective.
> @@ -7070,6 +7085,8 @@ static struct kvm_x86_ops vmx_x86_ops =
>  	.set_tdp_cr3 = vmx_set_cr3,
>  
>  	.check_intercept = vmx_check_intercept,
> +
> +	.read_l1_tsc = vmx_read_l1_tsc,
>  };
>  
>  static int __init vmx_init(void)
> --- .before/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c	2011-07-31 16:17:30.000000000 +0300
> @@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
>  	return 0;
>  }
>  
> +u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +	return to_svm(vcpu)->vmcb->control.tsc_offset +
> +		svm_scale_tsc(vcpu, native_read_tsc());
> +}
> +
> +
>  static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops =
>  	.set_tdp_cr3 = set_tdp_cr3,
>  
>  	.check_intercept = svm_check_intercept,
> +
> +	.read_l1_tsc = svm_read_l1_tsc,
>  };
>  
>  static int __init svm_init(void)
> --- .before/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2011-07-31 16:17:30.000000000 +0300
> @@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct
>  
>  	/* Keep irq disabled to prevent changes to the clock */
>  	local_irq_save(flags);
> -	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> +	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
>  	kernel_ns = get_kernel_ns();
>  	this_tsc_khz = vcpu_tsc_khz(v);
>  	if (unlikely(this_tsc_khz == 0)) {
> @@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>  		s64 tsc_delta;
>  		u64 tsc;
>  
> -		kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
> +		tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>  		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>  			     tsc - vcpu->arch.last_guest_tsc;

Technically, this call site should no longer exist; this should be
re-factored in terms of hardware TSC, not guest TSC. I sent a patch
series which did that, but it has not yet been applied.
> @@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu * > { > kvm_x86_ops->vcpu_put(vcpu); > kvm_put_guest_fpu(vcpu); > - kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); > + vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu); > } > > static int is_efer_nx(void) > @@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v > if (hw_breakpoint_active()) > hw_breakpoint_restore(); > > - kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc); > + vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu); > > vcpu->mode = OUTSIDE_GUEST_MODE; > smp_wmb(); > > -- > Nadav Har'El | Sunday, Jul 31 2011, 29 Tammuz 5771 > nyh@math.technion.ac.il |----------------------------------------- > Phone +972-523-790466, ICQ 13349191 |Support bacteria - they're the only > http://nadav.harel.org.il |culture some people have! > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2 2011-07-31 18:55 ` Zachary Amsden @ 2011-07-31 20:34 ` Nadav Har'El 0 siblings, 0 replies; 18+ messages in thread From: Nadav Har'El @ 2011-07-31 20:34 UTC (permalink / raw) To: Zachary Amsden Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti, joerg.roedel On Sun, Jul 31, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2": > This patch looks good, with one comment noted inline below. Are there > no other call sites for kvm_get_msr() or which alias some other > function to kvm_get_msr(MSR_IA32_TSC) ? This is all I found, but I'll try to look again. > Did I miss the first patch? Sorry, I sent both patches in one mail, one after another. I think you saw them both, because your comment was on the second. Next time I'll send them in separate mails. Thanks, Nadav. -- Nadav Har'El | Sunday, Jul 31 2011, 1 Av 5771 nyh@math.technion.ac.il |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |Willpower: The ability to eat only one http://nadav.harel.org.il |salted peanut. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Nested VMX - L1 hangs on running L2 2011-07-21 2:49 ` Zachary Amsden 2011-07-27 11:51 ` Nadav Har'El @ 2011-07-28 11:11 ` Nadav Har'El 2011-07-29 2:06 ` Matt McGill 1 sibling, 1 reply; 18+ messages in thread From: Nadav Har'El @ 2011-07-28 11:11 UTC (permalink / raw) To: Zachary Amsden; +Cc: Bandan Das, KVM Mailing List, Marcelo Tosatti [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset=iso-8859-8-i, Size: 7021 bytes --] On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2": > > > No, both patches are wrong. > > > > > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all > > times. > > > We are serving the L1 guest in this hypervisor, not the L2 guest, and so > > > > > should never read its TSC for any reason. I've been thinking about how to correctly solve this problem, and about this statement that you made, that guest_read_tsc() should *always* return L1's TSC, even while running L2. We have two quite-different uses for the TSC-related functions: 1. They are used in vmx.c to emulate *architectural* instructions, namely RDTSC, RDMSR of TSC, and WRMSR of TSC. 2. They are used in x86.c for internal KVM uses, adjusting the TSC for various reasons (most of which, I have to admit, I don't understand). You said above that kvm_get_msr(), calling guest_read_tsc(), should always return L1's TSC, even when running L2. However, I believe that for use #1 - the *architecturally correct* emulation of RDTSC and RDMSR when L2 is running - the current nested-agnostic code, which ends up returning L2's TSC when L2 is running, is the correct thing to do. Look at what happens: When we get an exit to L0 when L2 runs RDMSR TSC: This may exit to L1, if L1 asked in the MSR Bitmap. Done correctly. If it hasn't exited, the spec says we need to return L1's TSC plus the TSC_OFFSET that L1 set for L2 (if L1 asked for TSC_OFFSETTING). 
In other words, we should return L0's TSC offsetted by the sum of vmcs01 and vmcs12 tsc_offsets. This sum is exactly what we have in the current (vmcs02) TSC_OFFSET. So the existing guest_read_tsc, which returns rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing - also for nested. When we get an exit to L0 when L2 runs RDTSC: This may exit to L1 if L1 asked for TSC_EXITING. Done correctly. There is no "else" here - we would get an EXIT_REASON_RDTSC only if L1 asked for it (because KVM itself doesn't use TSC_EXITING). Regarding emulation of WRMSR TSC, I actually had an error in the code, which I correct in a patch below: According to the Intel spec, it turns out that if WRMSR of the TSC does not cause an exit, then it should change the host TSC itself, ignoring TSC_OFFSET. This is very strange, because it means that if the host doesn't trap MSR read and write of the TSC, then if a guest writes the TSC and then immediately reads it, it won't read what it just wrote, but rather what it wrote plus the TSC_OFFSET (because the TSC_OFFSET is added on read, but not subtracted on write). I don't know why the Intel spec specifies this, but it does - and it's not what I emulated, so I fixed that now (in the patch below). But I doubt this is the problem that Bandan and others saw? So now, if I'm right, after this patch, the *architectural* uses of guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSC and RD/WRMSR are done correctly. But the problem is that these functions are also used in KVM for other things. Ideally, whoever calls kvm_read_msr()/kvm_write_msr() should realize that it gets the value for the currently running guest, which might be L1 or L2, just like if that guest would read or write the MSR. Ideally, KVM might read the MSR, change it and write it back, and everything would be correct, as both read and write code are done correctly. 
The question is: does some of the code - perhaps in kvm_guest_time_update(), kvm_arch_vcpu_load(), kvm_arch_vcpu_put() or vcpu_enter_guest() (these are functions which touch the TSC) - only work correctly when reading/writing the L1 TSC, and break if the L2 TSC is read and written? I'm afraid I don't know enough about what these functions are doing to understand what, if anything, they are doing wrong. Bandan and others who can reproduce this bug - I attach below a patch which should correct some architectural emulation issues I discovered while reading the code again now - especially of TSC MSR write. Can you please try whether this solves anything? I'm not too optimistic, but since the code was wrong in this regard, it's at least worth a try. Thanks, Nadav. Subject: [PATCH] nVMX: Fix nested TSC handling This patch fixes several areas where nested VMX did not correctly emulate TSC handling. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> --- arch/x86/kvm/vmx.c | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) --- .before/arch/x86/kvm/vmx.c 2011-07-28 14:02:52.000000000 +0300 +++ .after/arch/x86/kvm/vmx.c 2011-07-28 14:02:52.000000000 +0300 @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v */ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) { - vmcs_write64(TSC_OFFSET, offset); - if (is_guest_mode(vcpu)) + if (is_guest_mode(vcpu)) { /* - * We're here if L1 chose not to trap the TSC MSR. Since - * prepare_vmcs12() does not copy tsc_offset, we need to also - * set the vmcs12 field here. + * We're here if L1 chose not to trap WRMSR to TSC. According + * to the spec, this should set L1's TSC; The offset that L1 + * set for L2 remains unchanged, and still needs to be added + * to the newly set TSC to get L2's TSC. 
*/ - get_vmcs12(vcpu)->tsc_offset = offset - - to_vmx(vcpu)->nested.vmcs01_tsc_offset; + struct vmcs12 *vmcs12; + to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset; + /* recalculate vmcs02.TSC_OFFSET: */ + vmcs12 = get_vmcs12(vcpu); + vmcs_write64(TSC_OFFSET, offset + + (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ? + vmcs12->tsc_offset : 0)); + } else { + vmcs_write64(TSC_OFFSET, offset); + } } static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment) @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc set_cr4_guest_host_mask(vmx); - vmcs_write64(TSC_OFFSET, - vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); + if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING) + vmcs_write64(TSC_OFFSET, + vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); + else + vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); if (enable_vpid) { /* @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm load_vmcs12_host_state(vcpu, vmcs12); - /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */ + /* Update TSC_OFFSET if TSC was changed while L2 ran */ vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); /* This is needed for same reason as it was needed in prepare_vmcs02 */ -- Nadav Har'El | Thursday, Jul 28 2011, 26 Tammuz 5771 nyh@math.technion.ac.il |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |Just remember that if the world didn't http://nadav.harel.org.il |suck, we would all fall off. ^ permalink raw reply [flat|nested] 18+ messages in thread
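[Editor's note: the WRMSR/RDMSR asymmetry described in the message above can be captured in a toy model. This is a paraphrase of the behavior as discussed in the thread, with invented state, not kernel or SDM code: with TSC offsetting enabled and MSR exits disabled, a write to the TSC MSR sets the underlying (L1-level) TSC directly, while a read still adds TSC_OFFSET.]

```c
#include <assert.h>
#include <stdint.h>

struct toy_tsc {
        uint64_t tsc;        /* underlying (L1-level) TSC value */
        uint64_t tsc_offset; /* active TSC_OFFSET for the running guest */
};

/* Untrapped WRMSR: sets the underlying TSC; the offset is NOT subtracted. */
static void toy_wrmsr_tsc(struct toy_tsc *t, uint64_t val)
{
        t->tsc = val;
}

/* RDMSR (or RDTSC): the offset IS added on read. */
static uint64_t toy_rdmsr_tsc(const struct toy_tsc *t)
{
        return t->tsc + t->tsc_offset;
}
```

A guest that writes X and immediately reads back gets X plus the offset - the asymmetry Nadav found surprising, and the behavior the corrected vmx_write_tsc_offset() above now emulates.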
* Re: Nested VMX - L1 hangs on running L2 2011-07-28 11:11 ` Nadav Har'El @ 2011-07-29 2:06 ` Matt McGill 0 siblings, 0 replies; 18+ messages in thread From: Matt McGill @ 2011-07-29 2:06 UTC (permalink / raw) To: Nadav Har'El Cc: Zachary Amsden, Bandan Das, KVM Mailing List, Marcelo Tosatti Hello all, I've been poking at this bug for awhile and following the discussion, so I thought I'd bring all the information together in one place. First, I've been able to reliably reproduce this bug. Here is (what I believe to be) the relevant information: * Host setup (8 CPUs): Ubuntu Maverick with the kvm HEAD kernel, on an Intel i7. * L1 guest (1 virtual CPU): Ubuntu Maverick, standard distro kernel, started like so: qemu-system-x86_64 -enable-kvm -cpu qemu64,+vmx -hda mav-mini.img \ -nographic -vnc :1 -net nic -net tap,script="${tapup}" -serial file:/tmp/mav-mini-console * L2 guest (1 virtual CPU) The 'linux-0.2.img' image from the QEMU web site [http://wiki.qemu.org/download/linux-0.2.img.bz2], started like so: qemu-system-x86_64 -enable-kvm -hda linux-0.2.img -nographic -vnc 172.17.1.5:1 -net none Steps to reproduce: 1. Start L1 guest. 2. In L1 guest, ensure that clocksource is 'kvm-clock'. echo "kvm-clock" | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource 3. In L1 guest, start L2 guest. 4. In L2 guest, run the nbench benchmark. 5. On the host, use taskset to forcibly kick the qemu-system-x86_64 process to different host CPUs until L1 hangs. Next, two work-arounds have been mentioned. For me, either one by itself is sufficient to prevent hangs. Workaround #1: Use the 'jiffies' clocksource in L1. I suspect any other clocksource should also work, but haven't tested them. Workaround #2: Using taskset, pin the qemu-system-x86_64 process managing L1 to a single host CPU. The previous messages to the list have also already pointed at the problem, but not in full detail. For clarity, here's the sequence that appears to trigger the hang: 1. 
The qemu-system-x86_64 process migrates to a new host CPU. 2. On the next call to kvm_arch_vcpu_load() in kvm/x86.c, we enter this branch: if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) { .... kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); .... } 3. The KVM_REQ_CLOCK_UPDATE request gets handled in vcpu_enter_guest() [kvm/x86.c]. if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) { r = kvm_guest_time_update(vcpu); .... } 4. In kvm_guest_time_update() [kvm/x86.c], we call kvm_get_msr() to read MSR_IA32_TSC, assuming we'll get the TSC value of L1: static int kvm_guest_time_update(struct kvm_vcpu *v) { ... kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp); ... This is the heart of the problem. If we're entering L2 rather than L1, tsc_timestamp gets L2's TSC. We then do some computation based on wall clock/system time, and we write the (possibly adjusted) TSC value directly into a data structure used by the L1 kvm-clock (see also arch/x86/kernel/kvmclock.c and include/asm/kvm_host.h): /* With all the info we got, fill in the values */ vcpu->hv_clock.tsc_timestamp = tsc_timestamp; vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset; vcpu->last_kernel_ns = kernel_ns; vcpu->last_guest_tsc = tsc_timestamp; vcpu->hv_clock.flags = 0; Note that vcpu->hv_clock is actually mapped to a guest page (vcpu->time_page), and shared directly with the L1 kvm-clock paravirtualized clocksource. 5. If we read L2's TSC in the previous step, the L1 kvm-clock now has an arbitrary value. Presumably the scheduler becomes horribly confused, and nothing gets scheduled. This is illustrated by the output I get from the L1 soft lockup detector (stuck for 169 years, eh?): [5750304228.280373] BUG: soft lockup - CPU#0 stuck for 5355388067s! [qemu-system-x86:731] As further evidence, if I patch guest_read_tsc() so that it always returns the L1 TSC, I can no longer reproduce the hang. 
As for the correct way to fix it, I don't yet understand the VMX spec or the possible cases well enough to know what the correct behavior should be. -Matt McGill On Jul 28, 2011, at 7:11 AM, Nadav Har'El wrote: > On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2": >>>> No, both patches are wrong. >>> >>>> The correct fix is to make kvm_get_msr() return the L1 guest TSC at all >>> times. >>>> We are serving the L1 guest in this hypervisor, not the L2 guest, and so >>> >>>> should never read its TSC for any reason. > > I've been thinking about how to correctly solve this problem, and about this > statement that you made, that get_read_tsc() should *always* return L1's TSC, > even while running L2. > > We have two quite-different uses for the TSC-related functions: > > 1. They are used in vmx.c to emulate *architectural* instructions, namely > RDTSC, RDMSR of TSC, and WRMSR of TSC. > > 2. They are used in x86.c for internal KVM uses, adjusting the TSC for > various reasons (most of which, I have to admit, I don't understand). > > You said above that kvm_get_msr(), calling guest_read_tsc(), should always > return L1's TSC, even when running L2. However, I believe that that for > use #1 - the *architecturally correct* emulation of RDTSC and RDMSR when > L2 is running, the current nested-agnostic code - which ends up returning L2's > TSC when L2 is running - is the correct thing to do. > > Look at what happens when: > > When we get an exit to L0 when L2 run RDMSR TSC: > This may exit to L1, if L1 asked in the MSR Bitmap. Done correctly. > If hasn't exited, the spec says we need to return L1's TSC plus the > TSC_OFFSET that L1 set for L2 (if L1 asked for TSC_OFFSETTING). > In other words, we should return L0's TSC offsetted by the sum of vmcs01 > and vmcs12 tsc_offsets. This sum is exactly what we have in the current > (vmcs02) TSC_OFFSET. 
So the existing guest_read_tsc, which returns > rdtscll() + vmcs_read64(TSC_OFFSET), does the correct thing - also for > nested. > > When we get an exit to L0 when L2 runs RDTSC: > This may exit to L1 if L1 asked for TSC_EXITING. Done correctly. > There is no "else" here - we would get an EXIT_REASON_RDTSC only if L1 > asked for it (because KVM itself doesn't use TSC_EXITING). > > Regarding emulation of WRMSR TSC, I actually had an error in the code which > I correct in a patch below: According to the Intel spec, it turns out that > if WRMSR of the TSC does not cause an exit, then it should change the host > TSC itself, ignoring TSC_OFFSET. This is very strange, because it means that > if the host doesn't trap MSR read and write of the TSC, then if a guest > writes TSC and then immediately reads it, it won't read what is just wrote, > but rather what it wrote pluss the TSC_OFFSET (because the TSC_OFFSET is > added on read, but not subtracted on write). I don't know why the Intel spec > specifies this, but it does - and it's not what I emulated to I fixed that > now (in the patch below). But I doubt this is the problem that Bandan and > others saw? > > So now, if I'm right, after this patch, the *architectural* uses of > guest_read_tsc() and vmx_write_tsc_offset() to emulate RDTSR and RD/WRMSR, > are done correctly. > > But the problem is that these functions are also used in KVM for other things. > Ideally, whomever calls kvm_read_msr()/kvm_write_msr() should realize that it > gets the value for the current running guest, which might be L1 or L2, just > like if that guest would read or write the MSR. Ideally, KVM might read the > MSR, change it and write it back, and everything would be correct as both > read and write code are done correctly. 
The question is, is some of the code - > perhaps in kvm_guest_time_update(), kvm_arch_vcpu_load(), kvm_arch_vcpu_put() > or vcpu_enter_guest() (these are functions which touch the TSC), which only > work correctly when reading/writing L1 TSC and can't work correctly if L2 > TSC is read and written? I'm afraid I don't know enough about what these > functions are doing, to understand what, if anything, they are doing wrong. > > Bandan and others who can reproduce this bug - I attach below a patch which > should correct some architectural emulation issues I discovered while reading > the code again now - especially of TSC MSR write. Can you please try if > this solves anything? I'm not too optimistic, but since the code was wrong > in this regard, it's at least worth a try. > > Thanks, > Nadav. > > Subject: [PATCH] nVMX: Fix nested TSC handling > > This patch fixes several areas where nested VMX did not correctly emulated > TSC handling. > > Signed-off-by: Nadav Har'El <nyh@il.ibm.com> > --- > arch/x86/kvm/vmx.c | 31 +++++++++++++++++++++---------- > 1 file changed, 21 insertions(+), 10 deletions(-) > > --- .before/arch/x86/kvm/vmx.c 2011-07-28 14:02:52.000000000 +0300 > +++ .after/arch/x86/kvm/vmx.c 2011-07-28 14:02:52.000000000 +0300 > @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v > */ > static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) > { > - vmcs_write64(TSC_OFFSET, offset); > - if (is_guest_mode(vcpu)) > + if (is_guest_mode(vcpu)) { > /* > - * We're here if L1 chose not to trap the TSC MSR. Since > - * prepare_vmcs12() does not copy tsc_offset, we need to also > - * set the vmcs12 field here. > + * We're here if L1 chose not to trap WRMSR to TSC. According > + * to the spec, this should set L1's TSC; The offset that L1 > + * set for L2 remains unchanged, and still needs to be added > + * to the newly set TSC to get L2's TSC. 
> */ > - get_vmcs12(vcpu)->tsc_offset = offset - > - to_vmx(vcpu)->nested.vmcs01_tsc_offset; > + struct vmcs12 *vmcs12; > + to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset; > + /* recalculate vmcs02.TSC_OFFSET: */ > + vmcs12 = get_vmcs12(vcpu); > + vmcs_write64(TSC_OFFSET, offset + > + (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ? > + vmcs12->tsc_offset : 0)); > + } else { > + vmcs_write64(TSC_OFFSET, offset); > + } > } > > static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment) > @@ -6514,8 +6522,11 @@ static void prepare_vmcs02(struct kvm_vc > > set_cr4_guest_host_mask(vmx); > > - vmcs_write64(TSC_OFFSET, > - vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); > + if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING) > + vmcs_write64(TSC_OFFSET, > + vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset); > + else > + vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); > > if (enable_vpid) { > /* > @@ -6922,7 +6933,7 @@ static void nested_vmx_vmexit(struct kvm > > load_vmcs12_host_state(vcpu, vmcs12); > > - /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */ > + /* Update TSC_OFFSET if TSC was changed while L2 ran */ > vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset); > > /* This is needed for same reason as it was needed in prepare_vmcs02 */ > -- > Nadav Har'El | Thursday, Jul 28 2011, 26 Tammuz 5771 > nyh@math.technion.ac.il |----------------------------------------- > Phone +972-523-790466, ICQ 13349191 |Just remember that if the world didn't > http://nadav.harel.org.il |suck, we would all fall off. > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
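[Editor's note: Matt's failure sequence condenses to a toy pvclock calculation. The numbers are invented and the formula is deliberately simplified, ignoring pvclock's shift/multiplier encoding: if the kvm-clock reference pair is refreshed while L2 runs, tsc_timestamp records L2's TSC, and L1's next clock read, taken against its own smaller TSC, yields a wildly wrong time.]

```c
#include <assert.h>
#include <stdint.h>

/* Simplified kvm-clock read: system_time recorded at the last update,
 * plus TSC cycles elapsed since then, scaled to nanoseconds. */
static int64_t toy_kvmclock_ns(uint64_t guest_tsc_now, uint64_t tsc_timestamp,
                               int64_t system_time_ns, int64_t tsc_khz)
{
        /* signed: a reference timestamp *ahead* of the guest's TSC
         * (L2's value, inflated by vmcs12's offset) goes negative */
        int64_t cycles = (int64_t)(guest_tsc_now - tsc_timestamp);

        return system_time_ns + cycles * 1000000 / tsc_khz;
}
```

With a reference taken from L1's own TSC the clock advances normally; with a reference taken from L2's TSC, L1 reads a time far in the past, which is consistent with the absurd soft-lockup durations in Matt's log.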
end of thread, other threads:[~2011-07-31 20:34 UTC | newest]
Thread overview: 18+ messages
2011-07-08 18:40 Nested VMX - L1 hangs on running L2 Bandan Das
2011-07-18 18:26 ` Marcelo Tosatti
2011-07-19 2:41 ` Bandan Das
2011-07-20 7:58 ` Jan Kiszka
2011-07-20 16:12 ` Marcelo Tosatti
2011-07-20 16:19 ` Jan Kiszka
2011-07-20 16:35 ` Marcelo Tosatti
[not found] ` <CAKiCmT00vyR5vRBDWFYK2Z8sgmjLBPwbYU5W8q2wAUTrxS1_tA@mail.gmail.com>
2011-07-20 19:52 ` Nadav Har'El
2011-07-20 20:42 ` Bandan Das
2011-07-21 2:49 ` Zachary Amsden
2011-07-27 11:51 ` Nadav Har'El
2011-07-29 9:01 ` Zachary Amsden
2011-07-29 10:21 ` Roedel, Joerg
2011-07-31 13:48 ` Nadav Har'El
2011-07-31 18:55 ` Zachary Amsden
2011-07-31 20:34 ` Nadav Har'El
2011-07-28 11:11 ` Nadav Har'El
2011-07-29 2:06 ` Matt McGill