From: Dongli Zhang <dongli.zhang@oracle.com>
To: David Woodhouse <dwmw2@infradead.org>,
	kvm@vger.kernel.org, x86@kernel.org,
	linux-kselftest@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, vkuznets@redhat.com,
	tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, shuah@kernel.org,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, kprateek.nayak@amd.com, jgross@suse.com,
	joe.jin@oracle.com, "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in steal time
Date: Mon, 11 May 2026 14:46:47 -0700
Message-ID: <87ac4d82-a121-4666-bfa1-8a279133c683@oracle.com>
In-Reply-To: <87691b4969fa963b332b0be209361d6f206f493f.camel@infradead.org>



On 5/10/26 1:13 PM, David Woodhouse wrote:
> On Sun, 2026-05-10 at 12:11 -0700, H. Peter Anvin wrote:
>> On May 10, 2026 11:54:38 AM PDT, David Woodhouse <dwmw2@infradead.org> wrote:
>>> On Mon, 2026-05-04 at 17:30 -0700, Dongli Zhang wrote:
>>>> The KVM_CLOCK_REALTIME has been introduced to help track the downtime of
>>>> live migration. KVM uses that realtime value to advance guest clock, but
>>>> the same blackout is not reflected in KVM steal time.
>>>>
>>>> Account that same delta in steal time directly in kvm_vm_ioctl_set_clock(),
>>>> only when KVM_CLOCK_REALTIME is used. This keeps the KVM-only solution
>>>> self-contained and avoids adding a new KVM ioctl or requiring additional
>>>> userspace changes (i.e. QEMU).
>>>>
>>>> Record the per-VM downtime delta when KVM_SET_CLOCK receives
>>>> KVM_CLOCK_REALTIME, and fold it into the existing x86 steal accounting
>>>> path. Initialize each vCPU's local cursor
>>>> (vcpu->arch.st.last_downtime_steal) when the guest enables
>>>> MSR_KVM_STEAL_TIME so previously accumulated blackout is not charged.
>>>>
>>>> Note that this means a vCPU may observe additional steal time after
>>>> blackout even if the host side contribution from current->sched_info
>>>> did not increase during that interval.
>>>>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>
>>> I really don't want to see KVM_CLOCK_REALTIME used for anything more
>>> than it already is. Or, indeed, even for that.
>>>
>>> There is precisely *one* place where it's OK to use 'real time' as a
>>> comparator, and that's when setting the guest's TSC. And even then it
>>> should be using TAI not UTC unless you like your guests' clocks jumping
>>> around by a second if you migrate at the wrong time. KVM_CLOCK_REALTIME
>>> was never the right thing to use, for anything.
>>>
>>> The KVM clock is a function of the guest's TSC (see
>>> KVM_SET_CLOCK_GUEST), and steal time is a function of that (as it's
>>> measured in nanoseconds).
>>>
>>> Don't bring UTC into it *anywhere*.
>>>
>>>
>>
>> Unfortunately TAI is often unavailable. One can hope that the
>> proposal of abolishing leap seconds by 2035, fixing the TAI-UTC
>> offset permanently, actually happens.
> 
> I was hoping for the opposite; it's just pandering to stupid bugs.
> Yes, leap seconds are fairly rare; instead maybe we should *always*
> have a leap second in one direction or the other at the end of the
> year. Otherwise it's just building up to be a bigger problem later.
> 
>>  The difference between atomic and solar time is better handled with
>> the already-existing "time zones" mechanism, which tends to change
>> far more frequently for entirely different reasons than the TAI-UT1
>> difference slowly accumulates.
> 
> I have absolutely no faith in a 'time zones with second precision'
> model ever actually working either. Although maybe if we ditched UTC
> completely (and the pointless 37-second offset frozen in time for ever
> like the GPS offset), and our second-precision time zones were based on
> *TAI* we could exercise them from day one?
> 
> Either way, as long as it isn't the awful abomination of *smearing*
> leap seconds and screwing up time precision, nobody actually needs to
> be nailed to anything.
> 
> And none of it matters here for *steal time*, since the *only* thing in
> a migration that should be based on any kind of real time is the guest
> TSC, and everything else should be based purely on that (perhaps via
> the kvmclock).
> 
> And even then it's only for live *migration* to a different host, as
> live update on the same host across kexec should be purely based on
> offsets from the host's TSC which remains unperturbed.


Thank you very much!

Based on my understanding, you have two main points:

1. KVM_CLOCK_REALTIME is not preferred for live migration or live update.
Essentially, the only acceptable use of KVM_CLOCK_REALTIME is to adjust
guest_TSC. After that, everything should rely on kvm-clock (especially after
KVM_SET_CLOCK_GUEST).

2. Regardless of whether TAI or KVM_CLOCK_REALTIME is used to adjust guest_TSC,
the calculation of the steal-time delta should always and exclusively be based
on kvm-clock values before and after the adjustment.
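
To make point 2 concrete, here is a minimal sketch of how I read it. The
helper name and its two parameters are hypothetical and are not taken from
the patch; the sketch only shows that the delta would come from two kvm-clock
readings (in nanoseconds), one before and one after the adjustment, with no
wall-clock (UTC or TAI) input at all.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: the amount of blackout to
 * charge as steal time, derived purely from kvm-clock readings taken
 * before and after the guest TSC / kvm-clock adjustment.
 */
static inline uint64_t kvmclock_steal_delta_ns(uint64_t kvmclock_before_ns,
					       uint64_t kvmclock_after_ns)
{
	/* Assumption: if kvm-clock did not move forward, charge nothing. */
	if (kvmclock_after_ns <= kvmclock_before_ns)
		return 0;

	return kvmclock_after_ns - kvmclock_before_ns;
}
```

With readings of 1000 ns before and 4500 ns after, this would charge 3500 ns;
a reading that stays the same or goes backwards would charge 0.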

Would you mind confirming whether my understanding of your points is correct?

Dongli Zhang

