From: Dongli Zhang <dongli.zhang@oracle.com>
To: David Woodhouse <dwmw2@infradead.org>,
    kvm@vger.kernel.org, x86@kernel.org,
    linux-kselftest@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, vkuznets@redhat.com,
    tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
    dave.hansen@linux.intel.com, shuah@kernel.org,
    peterz@infradead.org, juri.lelli@redhat.com,
    vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
    rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
    vschneid@redhat.com, kprateek.nayak@amd.com, jgross@suse.com,
    joe.jin@oracle.com, "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in steal time
Date: Mon, 11 May 2026 14:46:47 -0700
Message-ID: <87ac4d82-a121-4666-bfa1-8a279133c683@oracle.com>
In-Reply-To: <87691b4969fa963b332b0be209361d6f206f493f.camel@infradead.org>

On 5/10/26 1:13 PM, David Woodhouse wrote:
> On Sun, 2026-05-10 at 12:11 -0700, H. Peter Anvin wrote:
>> On May 10, 2026 11:54:38 AM PDT, David Woodhouse <dwmw2@infradead.org> wrote:
>>> On Mon, 2026-05-04 at 17:30 -0700, Dongli Zhang wrote:
>>>> The KVM_CLOCK_REALTIME has been introduced to help track the downtime of
>>>> live migration. KVM uses that realtime value to advance guest clock, but
>>>> the same blackout is not reflected in KVM steal time.
>>>>
>>>> Account that same delta in steal time directly in kvm_vm_ioctl_set_clock(),
>>>> only when KVM_CLOCK_REALTIME is used. This keeps the KVM-only solution
>>>> self-contained and avoids adding a new KVM ioctl or requiring additional
>>>> userspace changes (i.e. QEMU).
>>>>
>>>> Record the per-VM downtime delta when KVM_SET_CLOCK receives
>>>> KVM_CLOCK_REALTIME, and fold it into the existing x86 steal accounting
>>>> path. Initialize each vCPU's local cursor
>>>> (vcpu->arch.st.last_downtime_steal) when the guest enables
>>>> MSR_KVM_STEAL_TIME so previously accumulated blackout is not charged.
>>>>
>>>> Note that this means a vCPU may observe additional steal time after
>>>> blackout even if the host side contribution from current->sched_info
>>>> did not increase during that interval.
>>>>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>
>>> I really don't want to see KVM_CLOCK_REALTIME used for anything more
>>> than it already is. Or, indeed, even for that.
>>>
>>> There is precisely *one* place where it's OK to use 'real time' as a
>>> comparator, and that's when setting the guest's TSC. And even then it
>>> should be using TAI not UTC unless you like your guests' clocks jumping
>>> around by a second if you migrate at the wrong time. KVM_CLOCK_REALTIME
>>> was never the right thing to use, for anything.
>>>
>>> The KVM clock is a function of the guest's TSC (see
>>> KVM_SET_CLOCK_GUEST), and steal time is a function of that (as it's
>>> measured in nanoseconds).
>>>
>>> Don't bring UTC into it *anywhere*.
>>>
>>>
>>
>> Unfortunately TAI is often unavailable. One can hope that the
>> proposal of abolishing leap seconds by 2035, fixing the TAI-UTC
>> offset permanently, actually happens.
>
> I was hoping for the opposite; it's just pandering to stupid bugs.
> Yes, leap seconds are fairly rare; instead maybe we should *always*
> have a leap second in one direction or the other at the end of the
> year. Otherwise it's just building up to be a bigger problem later.
>
>> The difference between atomic and solar time is better handled with
>> the already-existing "time zones" mechanism, which tends to change
>> far more frequently for entirely different reasons than the TAI-UT1
>> difference slowly accumulates.
>
> I have absolutely no faith in a 'time zones with second precision'
> model ever actually working either. Although maybe if we ditched UTC
> completely (and the pointless 37-second offset frozen in time for ever
> like the GPS offset), and our second-precision time zones were based on
> *TAI* we could exercise them from day one?
>
> Either way, as long as it isn't the awful abomination of *smearing*
> leap seconds and screwing up time precision, nobody actually needs to
> be nailed to anything.
>
> And none of it matters here for *steal time*, since the *only* thing in
> a migration that should be based on any kind of real time is the guest
> TSC, and everything else should be based purely on that (perhaps via
> the kvmclock).
>
> And even then it's only for live *migration* to a different host, as
> live update on the same host across kexec should be purely based on
> offsets from the host's TSC which remains unperturbed.

Thank you very much!

Based on my understanding, you have two main points:

1. KVM_CLOCK_REALTIME is not preferred for live migration or live update.
   Essentially, the only acceptable use of KVM_CLOCK_REALTIME is to adjust
   guest_TSC. After that, everything should rely on kvm-clock (especially after
   KVM_SET_CLOCK_GUEST).

2. Regardless of whether TAI or KVM_CLOCK_REALTIME is used to adjust guest_TSC,
   the calculation of the steal-time delta should always and exclusively be based
   on kvm-clock values before and after the adjustment.
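
To make point 2 concrete, here is a minimal sketch of how I read it. The
function names and parameters are hypothetical and for illustration only;
this is not existing KVM code and not what the patch currently does. The
only inputs are two kvm-clock readings, with no UTC or TAI value involved:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not existing KVM code: the blackout is the forward
 * jump in kvm-clock across the guest TSC adjustment, in nanoseconds.
 * Both arguments are kvm-clock readings; no wall-clock value is used.
 */
static uint64_t blackout_steal_ns(uint64_t kvmclock_before_ns,
				  uint64_t kvmclock_after_ns)
{
	/* A clock that did not move forward contributes no steal time. */
	if (kvmclock_after_ns <= kvmclock_before_ns)
		return 0;
	return kvmclock_after_ns - kvmclock_before_ns;
}

/* Hypothetical: fold the blackout into a running steal counter (ns). */
static uint64_t account_blackout(uint64_t steal_ns,
				 uint64_t kvmclock_before_ns,
				 uint64_t kvmclock_after_ns)
{
	return steal_ns + blackout_steal_ns(kvmclock_before_ns,
					    kvmclock_after_ns);
}
```

For example, kvm-clock readings of 5.00 s before and 5.25 s after the
adjustment would add 250 ms to steal time under this reading.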

Would you mind confirming whether my understanding of your points is correct?

Dongli Zhang