From mboxrd@z Thu Jan 1 00:00:00 1970
From: Marcelo Tosatti
Subject: Re: [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
Date: Tue, 22 Sep 2015 16:02:13 -0300
Message-ID: <20150922190213.GC23748@amt.cnet>
References: <1442939626-16935-1-git-send-email-rkrcmar@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini, Luiz Capitulino, stable@vger.kernel.org
To: Radim Krčmář
Return-path:
Content-Disposition: inline
In-Reply-To: <1442939626-16935-1-git-send-email-rkrcmar@redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On Tue, Sep 22, 2015 at 06:33:46PM +0200, Radim Krčmář wrote:
> PVCLOCK_COUNTS_FROM_ZERO broke ABI and (at least) three things with it.
> All problems stem from repeated writes to MSR_KVM_SYSTEM_TIME(_NEW).
> The reverted patch treated the MSR write as a one-shot initializer:
> any write from VCPU 0 would reset system_time.
> 
> And this is what broke for Linux guests:
>  * Onlining/hotplugging of VCPU 0
> 
>    VCPU has to assign an address to KVM clock before use, which is done
>    with MSR_KVM_SYSTEM_TIME_NEW.  Linux has an idea that time should not
>    jump backward, so any `sleep` won't return before |system_time at the
>    point of offline| elapses since the online.  Be sure to run ntp.
> 
>  * S3 and S4 resume
> 
>    If you don't have PVCLOCK_TSC_STABLE_BIT in Linux, resume will freeze
>    for |system_time at the point of suspend|, because pvclock ensures
>    monotonicity and kvmclock did not think about it.
> 
>    If you have stable clock, execution will resume immediately, but
>    restoring KVM clock writes to the MSR and dmesg starts to count from
>    zero.  It's better than the onlining, but not what we want either.
> 
>  * Boot of SLES 10 guest
> 
>    SLES 10 has a custom implementation of kvm clock that writes
>    MSR_KVM_SYSTEM_TIME before every read to enhance precision ...
>    Two things are happening at the same time:
>    1) The guest periodically receives an interrupt that is handled by
>       main_timer_handler():
>       a) get time using the kvm clock:
>          1) write the address to MSR_KVM_SYSTEM_TIME
>          2) read tsc and pvclock (tsc_offset, system_time)
>          3) time = tsc - tsc_offset + system_time
>       b) compute time since the last main_timer_handler()
>       c) bump jiffies if enough time has elapsed
>    2) the guest wants to calibrate loops per jiffy [1]:
>       a) read tsc
>       b) loop till jiffies increase
>       c) compute lpj
> 
>    Because (1a1) always resets the system_time to 0, we read the same
>    value over and over so the condition for (1c) is never true and
>    jiffies remain constant.  A hang happens in (2b) as it is the first
>    place that depends on jiffies.
> 
> 
> We could add a hypervisor workaround for this, but that is just asking
> for more trouble.  Luckily, reverting does not introduce new breakage
> for guests that learned about PVCLOCK_COUNTS_FROM_ZERO.
> Only 4.2+ guests with NOHZ_FULL wanted PVCLOCK_COUNTS_FROM_ZERO, so
> losing it is a good trade-off for not regressing.
> 
> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Radim Krčmář
> ---
>  v1: Extended commit message based on a discussion with Marcelo
> 
>  arch/x86/include/asm/pvclock-abi.h | 1 +
>  arch/x86/kvm/x86.c                 | 4 ----
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
> index 655e07a48f6c..67f08230103a 100644
> --- a/arch/x86/include/asm/pvclock-abi.h
> +++ b/arch/x86/include/asm/pvclock-abi.h
> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
>  
>  #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
>  #define PVCLOCK_GUEST_STOPPED	(1 << 1)
> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bca39f0fdb3..71731994d897 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1711,8 +1711,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  		vcpu->pvclock_set_guest_stopped_request = false;
>  	}
>  
> -	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> -
>  	/* If the host uses TSC clocksource, then it is stable */
>  	if (use_master_clock)
>  		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> @@ -2010,8 +2008,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  					&vcpu->requests);
>  
>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
> -
> -			ka->kvmclock_offset = -get_kernel_ns();
>  		}
>  
>  		vcpu->arch.time = data;
> -- 
> 2.5.3

NACK, please use original patchset.
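
For illustration only, the SLES 10 hang described in the quoted commit
message (steps 1a-1c) can be modelled in a few lines of userspace C.
This is a rough sketch, not kernel or guest code: every toy_* name, the
4 ms tick and the 10 ns read delay are assumptions made up for the
example, and the simulated TSC is taken to advance at 1 tick per ns.
The only things borrowed from the text above are the formula
time = tsc - tsc_offset + system_time and the behaviour that a write to
the MSR resets system_time to 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TICK_NS    4000000ULL /* hypothetical timer period (4 ms) */
#define TOY_READ_DELAY 10ULL      /* hypothetical ns between MSR write and TSC read */

/* Toy stand-in for the two pvclock fields named in the commit message. */
struct toy_pvclock {
	uint64_t tsc_offset;  /* TSC value when the host filled the page */
	uint64_t system_time; /* guest time in ns at that TSC value */
};

/* Toy host state; the simulated TSC runs at 1 tick per ns. */
struct toy_host {
	uint64_t tsc;
	uint64_t clock_base;  /* TSC value at which guest time is 0 */
	bool reset_on_write;  /* true models the reverted patch */
};

/* Step (1a1): the guest writes the pvclock address to the MSR. */
static void toy_write_msr(struct toy_host *h, struct toy_pvclock *pv)
{
	if (h->reset_on_write)
		h->clock_base = h->tsc; /* system_time restarts from zero */
	pv->tsc_offset = h->tsc;
	pv->system_time = h->tsc - h->clock_base;
}

/* Steps (1a1)-(1a3): write the MSR, then read TSC and pvclock. */
static uint64_t toy_read_clock(struct toy_host *h, struct toy_pvclock *pv)
{
	toy_write_msr(h, pv);
	h->tsc += TOY_READ_DELAY;
	return h->tsc - pv->tsc_offset + pv->system_time;
}

/* Run the timer handler `interrupts` times and return the jiffies count. */
unsigned long toy_run_ticks(bool reset_on_write, unsigned int interrupts)
{
	struct toy_host h = { .tsc = 0, .clock_base = 0,
			      .reset_on_write = reset_on_write };
	struct toy_pvclock pv = { 0, 0 };
	uint64_t last = 0;
	unsigned long jiffies = 0;

	for (unsigned int i = 0; i < interrupts; i++) {
		h.tsc += TOY_TICK_NS;                    /* wait for next interrupt */
		uint64_t now = toy_read_clock(&h, &pv);  /* (1a) */
		assert(now >= last);                     /* clock must not go back */
		uint64_t ticks = (now - last) / TOY_TICK_NS; /* (1b) */
		if (ticks > 0) {                         /* (1c) */
			jiffies += ticks;
			last += ticks * TOY_TICK_NS;
		}
	}
	return jiffies;
}
```

In this model toy_run_ticks(true, 100) leaves jiffies at 0, because
every read sees only the 10 ns read delay, so a loop waiting for jiffies
to increase (2b) would never finish. toy_run_ticks(false, 100) advances
jiffies once per interrupt.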