From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paolo Bonzini
Subject: Re: KVM: x86: fix tsc catchup issue with tsc scaling
Date: Wed, 15 Jan 2014 12:43:02 +0100
Message-ID: <52D67446.3020403@redhat.com>
References: <20140106141859.GA27260@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: kvm-devel
To: Marcelo Tosatti
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:31330 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751010AbaAOLnG (ORCPT ); Wed, 15 Jan 2014 06:43:06 -0500
Received: from int-mx11.intmail.prod.int.phx2.redhat.com
	(int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s0FBh57O025455
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
	for ; Wed, 15 Jan 2014 06:43:06 -0500
In-Reply-To: <20140106141859.GA27260@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 06/01/2014 15:18, Marcelo Tosatti wrote:
>
> To fix a problem related to the different resolution of the TSC and the
> system clock, the offset in TSC units is approximated by
>
> 	delta = vcpu->hv_clock.tsc_timestamp - vcpu->last_guest_tsc
>
> 	        (Guest TSC value at                (Guest TSC value at last VM-exit)
> 	         the last kvm_guest_time_update
> 	         call)
>
> The delta is then later scaled using the mult,shift pair found in the
> hv_clock structure (which is correct against the tsc_timestamp in that
> structure).
>
> However, if a frequency change is performed between these two points,
> the delta is measured using different TSC frequencies, but scaled using
> the mult,shift pair for one frequency only.
>
> The end result is an incorrect delta.
>
> The bug which this code works around is not the only cause of
> clock-backwards events. The global accumulator is still necessary, so
> remove the max_kernel_ns fix and rely on the global accumulator to
> prevent clock-backwards events.
This is basically reverting commit 1d5f066 (KVM: x86: Fix a possible
backwards warp of kvmclock, 2010-08-19).  Your commit message basically
says that the guest-side commit 489fb49 (x86, paravirt: Add a global
synchronization point for pvclock, 2010-05-11) is the real solution to
the problem that the host-side commit 1d5f066 was trying to fix.  Right?

This patch makes vcpu->hv_clock.tsc_timestamp write-only.  Please provide
a follow-up that drops the field entirely, then I'll apply both.  In the
meanwhile:

Reviewed-by: Paolo Bonzini

Thanks,

Paolo

> Signed-off-by: Marcelo Tosatti
>
> Index: linux-2.6.git/arch/x86/kvm/x86.c
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/kvm/x86.c
> +++ linux-2.6.git/arch/x86/kvm/x86.c
> @@ -1484,7 +1484,7 @@ static int kvm_guest_time_update(struct
>  	unsigned long flags, this_tsc_khz;
>  	struct kvm_vcpu_arch *vcpu = &v->arch;
>  	struct kvm_arch *ka = &v->kvm->arch;
> -	s64 kernel_ns, max_kernel_ns;
> +	s64 kernel_ns;
>  	u64 tsc_timestamp, host_tsc;
>  	struct pvclock_vcpu_time_info guest_hv_clock;
>  	u8 pvclock_flags;
> @@ -1543,37 +1543,6 @@ static int kvm_guest_time_update(struct
>  	if (!vcpu->pv_time_enabled)
>  		return 0;
>
> -	/*
> -	 * Time as measured by the TSC may go backwards when resetting the base
> -	 * tsc_timestamp.  The reason for this is that the TSC resolution is
> -	 * higher than the resolution of the other clock scales.  Thus, many
> -	 * possible measurments of the TSC correspond to one measurement of any
> -	 * other clock, and so a spread of values is possible.  This is not a
> -	 * problem for the computation of the nanosecond clock; with TSC rates
> -	 * around 1GHZ, there can only be a few cycles which correspond to one
> -	 * nanosecond value, and any path through this code will inevitably
> -	 * take longer than that.  However, with the kernel_ns value itself,
> -	 * the precision may be much lower, down to HZ granularity.  If the
> -	 * first sampling of TSC against kernel_ns ends in the low part of the
> -	 * range, and the second in the high end of the range, we can get:
> -	 *
> -	 * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
> -	 *
> -	 * As the sampling errors potentially range in the thousands of cycles,
> -	 * it is possible such a time value has already been observed by the
> -	 * guest.  To protect against this, we must compute the system time as
> -	 * observed by the guest and ensure the new system time is greater.
> -	 */
> -	max_kernel_ns = 0;
> -	if (vcpu->hv_clock.tsc_timestamp) {
> -		max_kernel_ns = vcpu->last_guest_tsc -
> -				vcpu->hv_clock.tsc_timestamp;
> -		max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
> -				    vcpu->hv_clock.tsc_to_system_mul,
> -				    vcpu->hv_clock.tsc_shift);
> -		max_kernel_ns += vcpu->last_kernel_ns;
> -	}
> -
>  	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>  		kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
>  				   &vcpu->hv_clock.tsc_shift,
> @@ -1581,14 +1550,6 @@ static int kvm_guest_time_update(struct
>  		vcpu->hw_tsc_khz = this_tsc_khz;
>  	}
>
> -	/* with a master <monotonic time, tsc value> tuple,
> -	 * pvclock clock reads always increase at the (scaled) rate
> -	 * of guest TSC - no need to deal with sampling errors.
> -	 */
> -	if (!use_master_clock) {
> -		if (max_kernel_ns > kernel_ns)
> -			kernel_ns = max_kernel_ns;
> -	}
>  	/* With all the info we got, fill in the values */
>  	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
>  	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
>