Re: [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
	 Jonathan Corbet <corbet@lwn.net>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,  Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org,  "H. Peter Anvin" <hpa@zytor.com>,
	Paul Durrant <paul@xen.org>,
	Peter Zijlstra <peterz@infradead.org>,
	 Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	 Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	 Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	 Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Shuah Khan <shuah@kernel.org>,
	 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	 jalliste@amazon.co.uk, sveith@amazon.de, zide.chen@intel.com,
	 Dongli Zhang <dongli.zhang@oracle.com>,
	Chenyi Qiang <chenyi.qiang@intel.com>
Subject: Re: [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
Date: Tue, 13 Aug 2024 21:57:29 -0700	[thread overview]
Message-ID: <Zrw5ORlemXZPrIWl@google.com> (raw)
In-Reply-To: <20240522001817.619072-11-dwmw2@infradead.org>

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> There was some confusion in kvm_update_guest_time() when software needs
> to advance the guest TSC.
> 
> In master clock mode, there are two points of time which need to be taken
> into account. First there is the master clock reference point, stored in
> kvm->arch.master_kernel_ns (and associated host TSC ->master_cycle_now).
> Secondly, there is the time *now*, at the point kvm_update_guest_time()
> is being called.
> 
> With software TSC upscaling, the guest TSC is getting further and further
> ahead of the host TSC as time elapses. So at time "now", the guest TSC
> should be further ahead of the host, than it was at master_kernel_ns.
> 
> The adjustment in kvm_update_guest_time() was not taking that into
> account, and was only advancing the guest TSC by the appropriate amount
> for master_kernel_ns, *not* the current time.
> 
> Fix it to calculate them both correctly.
> 
> Since the KVM clock reference point in master_kernel_ns might actually
> be *earlier* than the reference point used for the guest TSC
> (vcpu->last_tsc_nsec), this might lead to a negative delta. Fix the
> compute_guest_tsc() function to cope with negative numbers, which
> then means there is no need to force a master clock update when the
> guest TSC is written.

Please do this in a separate patch.  There's no need to squeeze it in here, and
this change is complex/subtle enough as it is.

> @@ -3300,8 +3306,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  		kernel_ns = get_kvmclock_base_ns();
>  	}
>  
> -	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
> -
>  	/*
>  	 * We may have to catch up the TSC to match elapsed wall clock
>  	 * time for two reasons, even if kvmclock is used.
> @@ -3313,11 +3317,46 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	 *	very slowly.
>  	 */
>  	if (vcpu->tsc_catchup) {
> -		u64 tsc = compute_guest_tsc(v, kernel_ns);

Random side topic, kernel_ns is a s64, shouldn't it be a u64?

> -		if (tsc > tsc_timestamp) {
> -			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
> -			tsc_timestamp = tsc;
> +		uint64_t now_host_tsc, now_guest_tsc;
> +		int64_t adjustment;
> +
> +		/*
> +		 * First, calculate what the guest TSC should be at the
> +		 * time (kernel_ns) which will be placed in the hvclock.
> +		 * This may be the *current* time, or it may be the time
> +		 * of the master clock reference. This is 'tsc_timestamp'.
> +		 */
> +		tsc_timestamp = compute_guest_tsc(v, kernel_ns);
> +
> +		now_guest_tsc = tsc_timestamp;
> +		now_host_tsc = host_tsc;
> +
> +#ifdef CONFIG_X86_64
> +		/*
> +		 * If the master clock was used, calculate what the guest
> +		 * TSC should be *now* in order to advance to that.
> +		 */
> +		if (use_master_clock) {
> +			int64_t now_kernel_ns;
> +
> +			if (!kvm_get_time_and_clockread(&now_kernel_ns,

Doesn't this need to be called under protection of the seqcount?

Ahh, but with that change, then get_cpu_tsc_khz() isn't guaranteed to be from
the same CPU.

Oof, disabling IRQs to protect against migration is complete overkill, and at
this point dumb luck as much as anything.  Saving IRQs was added by commit
commit 18068523d3a0 ("KVM: paravirtualized clocksource: host part") before there
was any coordination with timekeeping.  And then after the coordination and
locking was added, commit c09664bb4418 ("KVM: x86: fix deadlock in clock-in-progress
request handling") moved the locking/coordination out of IRQ protection, and thus
made disabling IRQs completely pointless, except for protecting get_cpu_tsc_khz()
and now kvm_get_time_and_clockread().

Ha!  And if we slowly unwind that mess, this all ends up being _excrutiatingly_
close to the same code as get_kvmclock().  Sadly, I don't think it's close enough
to be reusable, unless we want to play macro games.

> +							&now_host_tsc)) {
> +				now_kernel_ns = get_kvmclock_base_ns();
> +				now_host_tsc = rdtsc();
> +			}
> +			now_guest_tsc = compute_guest_tsc(v, now_kernel_ns);

I find the mixed state of kernel_ns and host_tsc to be terribly confusing.  It's
hard to see and remember that kernel_ns/host_tsc aren't "now" when use_master_clock
is true.

For TSC upscaling, I think we can have kernel_ns/host_tsc always be "now", we just
need to snapshot the master clock tsc+ns, and then shove those into kernel_ns and
host_tsc after doing the software upscaling.  That simplifies the TSC upscaling
code considerably, and IMO makes it more obvious how tsc_timestamp is computed,
and what its role is.

When all is said and done, I think we can get to this?

	/*
	 * If the host uses TSC clock, then passthrough TSC as stable
	 * to the guest.
	 */
	do {
		seq = read_seqcount_begin(&ka->pvclock_sc);

		use_master_clock = ka->use_master_clock;

		/*
		 * The TSC read and the call to get_cpu_tsc_khz() must happen
		 * on the same CPU.
		 */
		get_cpu();

		tgt_tsc_hz = get_cpu_tsc_khz();

		if (use_master_clock &&
		    !kvm_get_time_and_clockread(&kernel_ns, &host_tsc) &&
		    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
			use_master_clock = false;

		put_cpu();

		if (!use_master_clock)
			break;

		master_host_tsc = ka->master_cycle_now;
		master_kernel_ns = ka->master_kernel_ns;
	while (read_seqcount_retry(&ka->pvclock_sc, seq))

	if (unlikely(!tgt_tsc_hz)) {
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
		return 1;
	}
	if (!use_master_clock) {
		host_tsc = rdtsc();
		kernel_ns = get_kvmclock_base_ns();
	}

	/*
	 * We may have to catch up the TSC to match elapsed wall clock
	 * time for two reasons, even if kvmclock is used.
	 *   1) CPU could have been running below the maximum TSC rate
	 *   2) Broken TSC compensation resets the base at each VCPU
	 *      entry to avoid unknown leaps of TSC even when running
	 *      again on the same CPU.  This may cause apparent elapsed
	 *      time to disappear, and the guest to stand still or run
	 *	very slowly.
	 */
	if (vcpu->tsc_catchup) {
		int64_t adjustment;

		/*
		 * Calculate the delta between what the guest TSC *should* be,
		 * and what it actually is according to kvm_read_l1_tsc().
		 */
		adjustment = compute_guest_tsc(v, kernel_ns) -
			     kvm_read_l1_tsc(v, host_tsc);
		if (adjustment > 0)
			adjust_tsc_offset_guest(v, adjustment);
	}

	/*
	 * Now that TSC upscaling is out of the way, the remaining calculations
	 * are all relative to the reference time that's placed in hv_clock.
	 * If the master clock is NOT in use, the reference time is "now".  If
	 * master clock is in use, the reference time comes from there.
	 */
	if (use_master_clock) {
		host_tsc = master_host_tsc;
		kernel_ns = master_kernel_ns;
	}
	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);

next prev parent reply	other threads:[~2024-08-14  4:57 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
2024-05-22  0:16 ` [RFC PATCH v3 01/21] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
2024-05-22  0:16 ` [RFC PATCH v3 02/21] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
2024-08-13 17:50   ` Sean Christopherson
2024-05-22  0:16 ` [RFC PATCH v3 03/21] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
2024-05-22  0:16 ` [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
2024-05-24 13:14   ` Paul Durrant
2024-08-13 18:07   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 05/21] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
2024-08-13 18:55   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 06/21] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
2024-05-22  0:17 ` [RFC PATCH v3 07/21] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
2024-08-14  1:52   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 08/21] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
2024-08-14  1:57   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
2024-05-24 13:20   ` Paul Durrant
2024-08-14  2:58   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
2024-05-24 13:26   ` Paul Durrant
2024-08-14  4:57   ` Sean Christopherson [this message]
2024-05-22  0:17 ` [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
2024-05-24 13:53   ` Paul Durrant
2024-08-15 15:46   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
2024-05-24 13:56   ` Paul Durrant
2024-08-15 15:52   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
2024-05-24 14:03   ` Paul Durrant
2024-08-15 16:00   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields David Woodhouse
2024-05-24 14:05   ` Paul Durrant
2024-08-15 16:30   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
2024-05-24 14:10   ` Paul Durrant
2024-08-16  2:38   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
2024-05-24 14:13   ` Paul Durrant
2024-08-15 17:12   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR David Woodhouse
2024-05-24 14:14   ` Paul Durrant
2024-08-16  4:28   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load() David Woodhouse
2024-05-24 14:16   ` Paul Durrant
2024-08-15 17:31   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode David Woodhouse
2024-05-24 14:18   ` Paul Durrant
2024-08-16  4:33   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
2024-05-24 14:21   ` Paul Durrant
2024-08-16  4:39   ` Sean Christopherson
2024-08-20 10:22     ` David Woodhouse
2024-08-20 15:08       ` Steven Rostedt
2024-08-20 15:42         ` David Woodhouse
2024-05-22  0:17 ` [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative David Woodhouse
2024-05-24 14:25   ` Paul Durrant
2024-07-02 14:09   ` Shrikanth Hegde
2024-08-16  4:35   ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zrw5ORlemXZPrIWl@google.com \
    --to=seanjc@google.com \
    --cc=bp@alien8.de \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=chenyi.qiang@intel.com \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=dongli.zhang@oracle.com \
    --cc=dwmw2@infradead.org \
    --cc=hpa@zytor.com \
    --cc=jalliste@amazon.co.uk \
    --cc=juri.lelli@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=paul@xen.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=shuah@kernel.org \
    --cc=sveith@amazon.de \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=x86@kernel.org \
    --cc=zide.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).