From: Sean Christopherson <seanjc@google.com>
To: Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Juergen Gross <jgross@suse.com>,
"K. Y. Srinivasan" <kys@microsoft.com>,
Haiyang Zhang <haiyangz@microsoft.com>,
Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
Ajay Kaher <ajay.kaher@broadcom.com>,
Jan Kiszka <jan.kiszka@siemens.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Andy Lutomirski <luto@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
virtualization@lists.linux.dev, linux-hyperv@vger.kernel.org,
kvm@vger.kernel.org, xen-devel@lists.xenproject.org,
Nikunj A Dadhania <nikunj@amd.com>,
Tom Lendacky <thomas.lendacky@amd.com>
Subject: Re: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
Date: Fri, 7 Feb 2025 09:23:24 -0800
Message-ID: <Z6ZBjNdoULymGgxz@google.com>
In-Reply-To: <20250201021718.699411-17-seanjc@google.com>
Dropping a few people/lists whose emails are bouncing.
On Fri, Jan 31, 2025, Sean Christopherson wrote:
> @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> #ifdef CONFIG_X86_LOCAL_APIC
> x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> #endif
> + /*
> + * Save/restore "sched" clock state even if kvmclock isn't being used
> + * for sched_clock, as kvmclock is still used for wallclock and relies
> + * on these hooks to re-enable kvmclock after suspend+resume.
This is wrong; wallclock is a different MSR entirely.
> + */
> x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
used as sched_clock, because when TSC is reset on suspend/hibernation, not doing
tsc_{save,restore}_sched_clock_state() results in time going haywire.
Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
under kvmclock leads to the same problem.
The whole PV sched_clock scheme is a disaster.
Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
sched_clock. And the code is all kinds of funky, because it tries to keep the
x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
means that Hyper-V overrides the sched_clock save/restore hooks even when PARAVIRT=n,
i.e. when HV clock can't possibly be used as sched_clock.
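
As a rough user-space model of the "override the hooks, but also run the old
TSC callbacks" pattern described above (not the actual kernel code; the struct,
the function names, the counters, and the call order inside the hooks are all
hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the x86_platform save/restore hooks. */
struct platform_ops {
	void (*save_sched_clock_state)(void);
	void (*restore_sched_clock_state)(void);
};

/* Counters so the effect of each hook is observable from user space. */
static int tsc_saves, tsc_restores, pv_saves, pv_restores;

static void tsc_save(void)    { tsc_saves++; }
static void tsc_restore(void) { tsc_restores++; }

/* The defaults model the TSC callbacks. */
static struct platform_ops platform = {
	.save_sched_clock_state    = tsc_save,
	.restore_sched_clock_state = tsc_restore,
};

/* Previous hooks, captured when the PV clock installs its own. */
static void (*old_save)(void);
static void (*old_restore)(void);

static void pv_save(void)
{
	if (old_save)
		old_save();	/* still run the TSC save */
	pv_saves++;		/* then save the PV clock state */
}

static void pv_restore(void)
{
	pv_restores++;		/* restore the PV clock state */
	if (old_restore)
		old_restore();	/* and still run the TSC restore */
}

/* Install the PV hooks unconditionally, chaining to whatever was there. */
static void pv_clock_install_hooks(void)
{
	/* Installing twice would make the hooks chain to themselves. */
	assert(platform.save_sched_clock_state != pv_save);

	old_save = platform.save_sched_clock_state;
	old_restore = platform.restore_sched_clock_state;
	platform.save_sched_clock_state = pv_save;
	platform.restore_sched_clock_state = pv_restore;
}
```

After pv_clock_install_hooks(), one call through each platform hook bumps both
the PV and the TSC counters, i.e. the TSC state is handled regardless of which
clock actually ends up being sched_clock.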
VMware appears to be buggy and doesn't do offset adjustments, and also lets
the TSC callbacks run.
I can't tell if Xen is broken, or if it's the sanest of the bunch. Xen does
save/restore things a la kvmclock, but only in the Xen PV suspend path. So if
the "normal" suspend/hibernate paths are unreachable, Xen is sane. If not, Xen
is quite broken.
To make matters worse, kvmclock is a mess and has existing bugs. The BSP's clock
is disabled during syscore_suspend() (via kvm_suspend()), but only re-enabled in
the sched_clock callback. So if suspend is aborted due to a late wakeup, the BSP
will run without its clock enabled, which "works" only because KVM-the-hypervisor
is kind enough to not clobber the shared memory when the clock is disabled. But
over time, I would expect time on the BSP to drift from APs.
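
The aborted-suspend case above boils down to an asymmetric enable/disable
pairing. A toy model, assuming (as described above) that the abort path skips
the sched_clock restore hook; all names are hypothetical and none of this is
the real kvmclock code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of whether the BSP's clock is enabled. */
static bool bsp_clock_enabled = true;

/* Stand-in for the syscore suspend hook, which disables the BSP's clock. */
static void model_syscore_suspend(void)
{
	bsp_clock_enabled = false;
}

/* Stand-in for the sched_clock restore hook, the only re-enable point. */
static void model_restore_sched_clock_state(void)
{
	bsp_clock_enabled = true;
}

/*
 * Run one suspend attempt and report whether the BSP's clock is enabled
 * afterwards.  On a late wakeup the low-level suspend/resume is skipped,
 * and with it (in this model) the sched_clock restore hook.
 */
static bool model_suspend_cycle(bool aborted_by_late_wakeup)
{
	bsp_clock_enabled = true;

	model_syscore_suspend();
	assert(!bsp_clock_enabled);

	if (!aborted_by_late_wakeup)
		model_restore_sched_clock_state();

	return bsp_clock_enabled;
}
```

model_suspend_cycle(false) leaves the clock enabled, whereas
model_suspend_cycle(true) leaves it disabled, which is the state the BSP would
then keep running in.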
And then there's this crud:
#ifdef CONFIG_X86_LOCAL_APIC
	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
which (a) should be guarded by CONFIG_SMP, not X86_LOCAL_APIC, and (b) is only
actually needed when kvmclock is sched_clock, because timekeeping doesn't actually
need to start that early. But of course kvmclock's craptastic handling of suspend
and resume makes untangling that more difficult than it needs to be.
The icing on the cake is that after cleaning up all the hacks, and having
kvmclock hook clocksource.suspend/resume like it should, suspend/resume under
kvmclock corrupts wall clock time because timekeeping_resume() reads the persistent
clock before resuming clocksource clocks, and the stupid kvmclock wall clock subtly
consumes the clocksource/system clock. *sigh*
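
To illustrate why the ordering matters, here is a toy model in which the wall
clock is computed as a boot-time base plus the system clock, and the system
clock only advances in the guest's view once the clocksource has been resumed.
The struct, its fields, and the "stale value" behavior are assumptions made for
illustration; the real failure mode may differ in detail:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a PV clock whose wall clock depends on system time. */
struct model_pvclock {
	bool     enabled;	/* false between suspend and clocksource resume */
	uint64_t host_now_ns;	/* what the host would report right now */
	uint64_t guest_view_ns;	/* last system time visible to the guest */
	uint64_t boot_wall_ns;	/* wall-clock time at boot */
};

/* The guest's view of system time is only refreshed while enabled. */
static uint64_t model_system_clock(struct model_pvclock *c)
{
	if (c->enabled)
		c->guest_view_ns = c->host_now_ns;
	return c->guest_view_ns;
}

/* The wall clock consumes the system clock, as described above. */
static uint64_t model_read_wallclock(struct model_pvclock *c)
{
	return c->boot_wall_ns + model_system_clock(c);
}

static void model_clocksource_resume(struct model_pvclock *c)
{
	c->enabled = true;
}

/* Return the wall-clock value seen on resume for the given ordering. */
static uint64_t model_resume(struct model_pvclock *c,
			     bool resume_clocksource_first)
{
	uint64_t wall;

	assert(c);

	if (resume_clocksource_first)
		model_clocksource_resume(c);

	wall = model_read_wallclock(c);

	if (!resume_clocksource_first)
		model_clocksource_resume(c);

	return wall;
}
```

With the clock disabled and a stale guest view, reading the wall clock before
resuming the clocksource returns the pre-suspend time, while resuming the
clocksource first returns the current time.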
I have yet more patches to clean all of this up. The series is rather unwieldy,
as it's now sitting at 38 patches (ugh), but I don't see a way to chunk it up in
a meaningful way, because everything is so intertwined. :-/