* [PATCH v2] KVM: x86: do not go through vcpu in __get_kvmclock_ns
@ 2016-11-14 17:51 Paolo Bonzini
From: Paolo Bonzini @ 2016-11-14 17:51 UTC (permalink / raw)
To: linux-kernel, kvm; +Cc: rkrcmar, mtosatti
Going through the first VCPU is wrong if you follow a KVM_SET_CLOCK with
a KVM_GET_CLOCK immediately after, without letting the VCPU run and
call kvm_guest_time_update.
To fix this, compute the kvmclock value ourselves, using the master
clock (tsc, nsec) pair as the base and the host CPU frequency as
the scale.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/x86.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1ba08278a9a9..c4b07d6f7bf1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1620,6 +1620,11 @@ static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
return do_monotonic_boot(kernel_ns, cycle_now) == VCLOCK_TSC;
}
+#else
+static inline bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
+{
+ return false;
+}
#endif
/*
@@ -1724,18 +1729,18 @@ static void kvm_gen_update_masterclock(struct kvm *kvm)
static u64 __get_kvmclock_ns(struct kvm *kvm)
{
- struct kvm_vcpu *vcpu = kvm_get_vcpu(kvm, 0);
struct kvm_arch *ka = &kvm->arch;
- s64 ns;
+ struct pvclock_vcpu_time_info hv_clock;
- if (vcpu->arch.hv_clock.flags & PVCLOCK_TSC_STABLE_BIT) {
- u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
- ns = __pvclock_read_cycles(&vcpu->arch.hv_clock, tsc);
- } else {
- ns = ktime_get_boot_ns() + ka->kvmclock_offset;
- }
+ if (!ka->use_master_clock)
+ return ktime_get_boot_ns() + ka->kvmclock_offset;
- return ns;
+ hv_clock.tsc_timestamp = ka->master_cycle_now;
+ hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+ kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
+ &hv_clock.tsc_shift,
+ &hv_clock.tsc_to_system_mul);
+ return __pvclock_read_cycles(&hv_clock, rdtsc());
}
u64 get_kvmclock_ns(struct kvm *kvm)
--
1.8.3.1
* Re: [PATCH v2] KVM: x86: do not go through vcpu in __get_kvmclock_ns
From: Radim Krčmář @ 2016-11-16 16:10 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: linux-kernel, kvm, mtosatti
2016-11-14 18:51+0100, Paolo Bonzini:
> Going through the first VCPU is wrong if you follow a KVM_SET_CLOCK with
> a KVM_GET_CLOCK immediately after, without letting the VCPU run and
> call kvm_guest_time_update.
>
> To fix this, compute the kvmclock value ourselves, using the master
> clock (tsc, nsec) pair as the base and the host CPU frequency as
> the scale.
>
> Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -1620,6 +1620,11 @@ static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
>
> return do_monotonic_boot(kernel_ns, cycle_now) == VCLOCK_TSC;
> }
> +#else
> +static inline bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
> +{
> + return false;
> +}
> #endif
A left-over from v1. ;)
> @@ -1724,18 +1729,18 @@ static void kvm_gen_update_masterclock(struct kvm *kvm)
>
> static u64 __get_kvmclock_ns(struct kvm *kvm)
> {
> - struct kvm_vcpu *vcpu = kvm_get_vcpu(kvm, 0);
> struct kvm_arch *ka = &kvm->arch;
> - s64 ns;
> + struct pvclock_vcpu_time_info hv_clock;
>
> - if (vcpu->arch.hv_clock.flags & PVCLOCK_TSC_STABLE_BIT) {
> - u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> - ns = __pvclock_read_cycles(&vcpu->arch.hv_clock, tsc);
> - } else {
> - ns = ktime_get_boot_ns() + ka->kvmclock_offset;
> - }
If we access the "global" master clock, it would be better to prevent it
from changing under our hands with
spin_lock(&ka->pvclock_gtod_sync_lock).
> + if (!ka->use_master_clock)
> + return ktime_get_boot_ns() + ka->kvmclock_offset;
>
> - return ns;
> + hv_clock.tsc_timestamp = ka->master_cycle_now;
> + hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> + kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
> + &hv_clock.tsc_shift,
> + &hv_clock.tsc_to_system_mul);
Doesn't this result in a minor drift with scaled clock, because the
guest can be combining two systems that approximate frequency?
1) tsc_shift and tsc_to_system_mul for kvmclock scaling
2) hardware TSC scaling ratio
If we are on a 7654321 kHz TSC and TSC-ratio scale to 1234567 kHz and
then tsc_shift+tsc_to_system_mul kvmclock-scale to 1000000 kHz, we
should be using multipliers of
0.161290204578564186163606151349022336533834941074459772460... and
0.810000591300431649315104000025920018921613812778083328000...,
to achieve that. Those multipliers cannot be precisely expressed in
what we have (shifts and 64/32 bit multipliers with intermediate values
only up to 128 bits), so performing the scaling will result in slightly
incorrect frequency.
The result of combining two operations that alter the frequency is quite
unlikely to cancel out and produce the same result as an operation that
uses a different shift+multiplier to scale in one step, so I think that
we aren't getting the same time as the guest with TSC-scaling is seeing.
(I'd be happier if we didn't ignore this drift when the whole endeavor
started just to get rid of a drift, but introducing a minor bug is still
improving the situation -- I'm ok with first two changes only.)
> + return __pvclock_read_cycles(&hv_clock, rdtsc());
> }
>
> u64 get_kvmclock_ns(struct kvm *kvm)
> --
> 1.8.3.1
>
* Re: [PATCH v2] KVM: x86: do not go through vcpu in __get_kvmclock_ns
From: Paolo Bonzini @ 2016-11-16 17:27 UTC (permalink / raw)
To: Radim Krčmář; +Cc: linux-kernel, kvm, mtosatti
> > - if (vcpu->arch.hv_clock.flags & PVCLOCK_TSC_STABLE_BIT) {
> > - u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > - ns = __pvclock_read_cycles(&vcpu->arch.hv_clock, tsc);
> > - } else {
> > - ns = ktime_get_boot_ns() + ka->kvmclock_offset;
> > - }
>
> If we access the "global" master clock, it would be better to prevent it
> from changing under our hands with
> spin_lock(&ka->pvclock_gtod_sync_lock).
Yes, good idea.
> > + if (!ka->use_master_clock)
> > + return ktime_get_boot_ns() + ka->kvmclock_offset;
> >
> > - return ns;
> > + hv_clock.tsc_timestamp = ka->master_cycle_now;
> > + hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> > + kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
> > + &hv_clock.tsc_shift,
> > + &hv_clock.tsc_to_system_mul);
>
> Doesn't this result in a minor drift with scaled clock, because the
> guest can be combining two systems that approximate frequency?
You mean instead of doing read_l1_tsc?
> 1) tsc_shift and tsc_to_system_mul for kvmclock scaling
> 2) hardware TSC scaling ratio
>
> If we are on a 7654321 kHz TSC and TSC-ratio scale to 1234567 kHz and
> then tsc_shift+tsc_to_system_mul kvmclock-scale to 1000000 kHz, we
> should be using multipliers of
> 0.161290204578564186163606151349022336533834941074459772460... and
> 0.810000591300431649315104000025920018921613812778083328000...,
> to achieve that. Those multipliers cannot be precisely expressed in
> what we have (shifts and 64/32 bit multipliers with intermediate values
> only up to 128 bits), so performing the scaling will result in slightly
> incorrect frequency.
>
> The result of combining two operations that alter the frequency is quite
> unlikely to cancel out and produce the same result as an operation that
> uses a different shift+multiplier to scale in one step, so I think that
> we aren't getting the same time as the guest with TSC-scaling is seeing.
I think you get pretty good precision, since 30 fractional bits are more
or less equivalent to nanosecond precision. For example, cutting the two
ratios above to 30 fractional bits I get respectively 173184038/2^30
and 869731512/2^30. Multiplying them gives 140279173/2^30 which matches
exactly the fixed point representation of 1000000/7654321.
Since the TSC scaling frequency has a larger precision (32 or 48 bits),
you should get at most 1 ulp error, which is not bad.
Paolo
> (I'd be happier if we didn't ignore this drift when the whole endeavor
> started just to get rid of a drift, but introducing a minor bug is still
> improving the situation -- I'm ok with first two changes only.)
>
> > + return __pvclock_read_cycles(&hv_clock, rdtsc());
> > }
> >
> > u64 get_kvmclock_ns(struct kvm *kvm)
> > --
> > 1.8.3.1
> >
>
* Re: [PATCH v2] KVM: x86: do not go through vcpu in __get_kvmclock_ns
From: Radim Krčmář @ 2016-11-16 18:10 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: linux-kernel, kvm, mtosatti
2016-11-16 12:27-0500, Paolo Bonzini:
>> > + if (!ka->use_master_clock)
>> > + return ktime_get_boot_ns() + ka->kvmclock_offset;
>> >
>> > - return ns;
>> > + hv_clock.tsc_timestamp = ka->master_cycle_now;
>> > + hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>> > + kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
>> > + &hv_clock.tsc_shift,
>> > + &hv_clock.tsc_to_system_mul);
>>
>> Doesn't this result in a minor drift with scaled clock, because the
>> guest can be combining two systems that approximate frequency?
>
> You mean instead of doing read_l1_tsc?
Yes, when we avoid the scaling and use host TSC directly.
>> 1) tsc_shift and tsc_to_system_mul for kvmclock scaling
>> 2) hardware TSC scaling ratio
>>
>> If we are on a 7654321 kHz TSC and TSC-ratio scale to 1234567 kHz and
>> then tsc_shift+tsc_to_system_mul kvmclock-scale to 1000000 kHz, we
>> should be using multipliers of
>> 0.161290204578564186163606151349022336533834941074459772460... and
>> 0.810000591300431649315104000025920018921613812778083328000...,
>> to achieve that. Those multipliers cannot be precisely expressed in
>> what we have (shifts and 64/32 bit multipliers with intermediate values
>> only up to 128 bits), so performing the scaling will result in slightly
>> incorrect frequency.
>>
>> The result of combining two operations that alter the frequency is quite
>> unlikely to cancel out and produce the same result as an operation that
>> uses a different shift+multiplier to scale in one step, so I think that
>> we aren't getting the same time as the guest with TSC-scaling is seeing.
>
> I think you get pretty good precision, since 30 fractional bits are more
> or less equivalent to nanosecond precision. For example, cutting the two
> ratios above to 30 fractional bits I get respectively 173184038/2^30
> and 869731512/2^30. Multiplying them gives 140279173/2^30 which matches
> exactly the fixed point representation of 1000000/7654321.
>
> Since the TSC scaling frequency has a larger precision (32 or 48 bits),
> you should get at most 1 ulp error, which is not bad.
True, it would take many years to accumulate into a second.
Thanks.