[PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
@ 2015-09-22 16:33 Radim Krčmář
  2015-09-22 19:02 ` Marcelo Tosatti
  0 siblings, 1 reply; 3+ messages in thread
From: Radim Krčmář @ 2015-09-22 16:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, Paolo Bonzini, Marcelo Tosatti, Luiz Capitulino, stable

PVCLOCK_COUNTS_FROM_ZERO broke ABI and (at least) three things with it.
All problems stem from repeated writes to MSR_KVM_SYSTEM_TIME(_NEW).
The reverted patch treated the MSR write as a one-shot initializer:
any write from VCPU 0 would reset system_time.

And this is what broke for Linux guests:
* Onlining/hotplugging of VCPU 0

  VCPU has to assign an address to KVM clock before use, which is done
  with MSR_KVM_SYSTEM_TIME_NEW.  Linux has an idea that time should not
  jump backward, so any `sleep` won't return before |system_time at the
  point of offline| elapses since the online.  Be sure to run ntp.

* S3 and S4 resume

  If you don't have PVCLOCK_TSC_STABLE_BIT in Linux, resume will freeze
  for |system_time at the point of suspend|, because pvclock ensures
  monoticity and kvmclock did not think about it.

  If you have stable clock, execution will resume immediately, but
  restoring KVM clock writes to the MSR and dmesg starts to count from
  zero.  It's better than the onlining, but not what we want either.

* Boot of SLES 10 guest

  SLES 10 has a custom implementation of kvm clock that calls
  MSR_KVM_SYSTEM_TIME before every read to enhance precision ...
  Two things are happening at the same time:
  1) The guest periodically receives an interrupt that is handled by
     main_timer_handler():
     a) get time using the kvm clock:
        1) write the address to MSR_KVM_SYSTEM_TIME
        2) read tsc and pvclock (tsc_offset, system_time)
        3) time = tsc - tsc_offset + system_time
     b) compute time since the last main_timer_handler()
     c) bump jiffies if enough time has elapsed
  2) the guest wants to calibrate loops per jiffy [1]:
     a) read tsc
     b) loop till jiffies increase
     c) compute lpj

  Because (1a1) always resets the system_time to 0, we read the same
  value over and over so the condition for (1c) is never true and
  jiffies remain constant.  A hang happens in (2b) as it is the first
  place that depends on jiffies.

We could make hypervisor workaround for this, but that is just asking
for more trouble.  Luckily, reverting does not break to guests that
learned about PVCLOCK_COUNTS_FROM_ZERO, in new ways.
Only 4.2+ guests with NOHZ_FULL wanted PVCLOCK_COUNTS_FROM_ZERO, which
is a good trade-off for not regressing.

This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.

Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 v1: Extended commit message based on a discussion with Marcelo

 arch/x86/include/asm/pvclock-abi.h | 1 +
 arch/x86/kvm/x86.c                 | 4 ----
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index 655e07a48f6c..67f08230103a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -41,6 +41,7 @@ struct pvclock_wall_clock {

 #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
 #define PVCLOCK_GUEST_STOPPED	(1 << 1)
+/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
 #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4bca39f0fdb3..71731994d897 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1711,8 +1711,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		vcpu->pvclock_set_guest_stopped_request = false;
 	}

-	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
-
 	/* If the host uses TSC clocksource, then it is stable */
 	if (use_master_clock)
 		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
@@ -2010,8 +2008,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 					&vcpu->requests);

 			ka->boot_vcpu_runs_old_kvmclock = tmp;
-
-			ka->kvmclock_offset = -get_kernel_ns();
 		}

 		vcpu->arch.time = data;
-- 
2.5.3

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
  2015-09-22 16:33 [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR" Radim Krčmář
@ 2015-09-22 19:02 ` Marcelo Tosatti
  2015-09-22 19:51   ` Paolo Bonzini
  0 siblings, 1 reply; 3+ messages in thread
From: Marcelo Tosatti @ 2015-09-22 19:02 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, kvm, Paolo Bonzini, Luiz Capitulino, stable

On Tue, Sep 22, 2015 at 06:33:46PM +0200, Radim Krčmář wrote:
> PVCLOCK_COUNTS_FROM_ZERO broke ABI and (at least) three things with it.
> All problems stem from repeated writes to MSR_KVM_SYSTEM_TIME(_NEW).
> The reverted patch treated the MSR write as a one-shot initializer:
> any write from VCPU 0 would reset system_time.
> 
> And this is what broke for Linux guests:
> * Onlining/hotplugging of VCPU 0
> 
>   VCPU has to assign an address to KVM clock before use, which is done
>   with MSR_KVM_SYSTEM_TIME_NEW.  Linux has an idea that time should not
>   jump backward, so any `sleep` won't return before |system_time at the
>   point of offline| elapses since the online.  Be sure to run ntp.
> 
> * S3 and S4 resume
> 
>   If you don't have PVCLOCK_TSC_STABLE_BIT in Linux, resume will freeze
>   for |system_time at the point of suspend|, because pvclock ensures
>   monoticity and kvmclock did not think about it.
> 
>   If you have stable clock, execution will resume immediately, but
>   restoring KVM clock writes to the MSR and dmesg starts to count from
>   zero.  It's better than the onlining, but not what we want either.
> 
> * Boot of SLES 10 guest
> 
>   SLES 10 has a custom implementation of kvm clock that calls
>   MSR_KVM_SYSTEM_TIME before every read to enhance precision ...
>   Two things are happening at the same time:
>   1) The guest periodically receives an interrupt that is handled by
>      main_timer_handler():
>      a) get time using the kvm clock:
>         1) write the address to MSR_KVM_SYSTEM_TIME
>         2) read tsc and pvclock (tsc_offset, system_time)
>         3) time = tsc - tsc_offset + system_time
>      b) compute time since the last main_timer_handler()
>      c) bump jiffies if enough time has elapsed
>   2) the guest wants to calibrate loops per jiffy [1]:
>      a) read tsc
>      b) loop till jiffies increase
>      c) compute lpj
> 
>   Because (1a1) always resets the system_time to 0, we read the same
>   value over and over so the condition for (1c) is never true and
>   jiffies remain constant.  A hang happens in (2b) as it is the first
>   place that depends on jiffies.
> 
> 
> We could make hypervisor workaround for this, but that is just asking
> for more trouble.  Luckily, reverting does not break to guests that
> learned about PVCLOCK_COUNTS_FROM_ZERO, in new ways.
> Only 4.2+ guests with NOHZ_FULL wanted PVCLOCK_COUNTS_FROM_ZERO, which
> is a good trade-off for not regressing.
> 
> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>  v1: Extended commit message based on a discussion with Marcelo
> 
>  arch/x86/include/asm/pvclock-abi.h | 1 +
>  arch/x86/kvm/x86.c                 | 4 ----
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
> index 655e07a48f6c..67f08230103a 100644
> --- a/arch/x86/include/asm/pvclock-abi.h
> +++ b/arch/x86/include/asm/pvclock-abi.h
> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
>  
>  #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
>  #define PVCLOCK_GUEST_STOPPED	(1 << 1)
> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bca39f0fdb3..71731994d897 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1711,8 +1711,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  		vcpu->pvclock_set_guest_stopped_request = false;
>  	}
>  
> -	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> -
>  	/* If the host uses TSC clocksource, then it is stable */
>  	if (use_master_clock)
>  		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> @@ -2010,8 +2008,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  					&vcpu->requests);
>  
>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
> -
> -			ka->kvmclock_offset = -get_kernel_ns();
>  		}
>  
>  		vcpu->arch.time = data;
> -- 
> 2.5.3

NACK, please use original patchset.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
  2015-09-22 19:02 ` Marcelo Tosatti
@ 2015-09-22 19:51   ` Paolo Bonzini
  0 siblings, 0 replies; 3+ messages in thread
From: Paolo Bonzini @ 2015-09-22 19:51 UTC (permalink / raw)
  To: Marcelo Tosatti, Radim Krčmář
  Cc: linux-kernel, kvm, Luiz Capitulino, stable



On 22/09/2015 21:02, Marcelo Tosatti wrote:
> NACK, please use original patchset.

What do you mean?

Paolo

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-09-22 19:51 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-22 16:33 [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR" Radim Krčmář
2015-09-22 19:02 ` Marcelo Tosatti
2015-09-22 19:51   ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).