From: Sean Christopherson <seanjc@google.com>
To: "Nikunj A. Dadhania" <nikunj@amd.com>
Cc: Borislav Petkov <bp@alien8.de>,
linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
x86@kernel.org, kvm@vger.kernel.org, mingo@redhat.com,
tglx@linutronix.de, dave.hansen@linux.intel.com,
pgonda@google.com, pbonzini@redhat.com,
francescolavra.fl@gmail.com,
Alexey Makhalov <alexey.makhalov@broadcom.com>,
Juergen Gross <jgross@suse.com>,
Boris Ostrovsky <boris.ostrovsky@oracle.com>
Subject: Re: [PATCH v16 12/13] x86/tsc: Switch to native sched clock
Date: Wed, 15 Jan 2025 13:37:25 -0800
Message-ID: <Z4gqlbumOFPF_rxd@google.com>
In-Reply-To: <4ab9dc76-4556-4a96-be0d-2c8ee942b113@amd.com>
On Thu, Jan 09, 2025, Nikunj A. Dadhania wrote:
> On 1/8/2025 10:32 PM, Sean Christopherson wrote:
> > On Wed, Jan 08, 2025, Borislav Petkov wrote:
> >> On Wed, Jan 08, 2025 at 06:00:59AM -0800, Sean Christopherson wrote:
>
> > For TDX guests, the TSC is _always_ "secure". So similar to singling out kvmclock,
> > handling SNP's STSC but not the TDX case again leaves the kernel in an inconsistent
> > state. Which is why I originally suggested[*] fixing the sched_clock mess
> > generically; doing so would avoid the need to special case SNP or TDX in
> > code that doesn't/shouldn't care about SNP or TDX.
>
> That is what I have attempted in this patch[1]: irrespective of SNP/TDX, whenever
> the TSC is picked as the clocksource, sched_clock starts using the TSC instead of
> any PV sched clock. This does not have any special case for STSC/SNP/TDX.
*sigh*
This is a complete and utter trainwreck.
Paolo had an idea to handle this without a deferred callback by having kvmclock
detect if kvmclock is selected as the clocksource, OR if the TSC is unreliable,
i.e. if the kernel may switch to kvmclock due to the TSC being marked unreliable.
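
Roughly, the idea would have amounted to something like this in kvmclock_init()
(a sketch only; clocksource_is_selected() is hypothetical, no such helper exists
today):

	/*
	 * Hypothetical sketch of Paolo's idea: use kvmclock for sched_clock
	 * only if kvmclock won, or may yet win, the clocksource selection.
	 */
	if (clocksource_is_selected(&kvm_clock) || check_tsc_unstable()) {
		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
	}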
Unfortunately, that idea doesn't work because the ordering is all wrong. The
*early* TSC initialization in setup_arch() happens after kvmclock_init() (via
init_hypervisor_platform()):

	init_hypervisor_platform();

	tsc_early_init();
and even if we mucked with that, __clocksource_select() very deliberately doesn't
change the clocksource until the kernel is "finished" booting, in order to avoid
thrashing the clocksource.
static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
{
	struct clocksource *cs;

	if (!finished_booting || list_empty(&clocksource_list))    <===
		return NULL;

	...
}
/*
 * clocksource_done_booting - Called near the end of core bootup
 *
 * Hack to avoid lots of clocksource churn at boot time.
 * We use fs_initcall because we want this to start before
 * device_initcall but after subsys_initcall.
 */
static int __init clocksource_done_booting(void)
{
	mutex_lock(&clocksource_mutex);
	curr_clocksource = clocksource_default_clock();
	finished_booting = 1;

	/*
	 * Run the watchdog first to eliminate unstable clock sources
	 */
	__clocksource_watchdog_kthread();
	clocksource_select();
	mutex_unlock(&clocksource_mutex);
	return 0;
}
fs_initcall(clocksource_done_booting);
I fiddled with a variety of ideas to try and let kvmclock tell sched_clock that
it prefers the TSC, e.g. if TSC_RELIABLE is set, but after seeing this comment
in native_sched_clock():
	/*
	 * Fall back to jiffies if there's no TSC available:
	 * ( But note that we still use it if the TSC is marked
	 *   unstable. We do this because unlike Time Of Day,
	 *   the scheduler clock tolerates small errors and it's
	 *   very important for it to be as fast as the platform
	 *   can achieve it. )
	 */
My strong vote is to prefer the TSC over kvmclock for sched_clock if the TSC is
constant, nonstop, and not marked unstable via the command line, i.e. to use the
same criteria as tweaking the clocksource rating. As above, sched_clock is more
tolerant of slop than the clocksource, so it's a bit ridiculous to care whether
the TSC or kvmclock (or something else entirely) is used for the clocksource.
If we wanted to go with a more conservative approach, e.g. to minimize the risk
of breaking existing setups, we could also condition the change on the TSC being
reliable and having a known frequency. I.e. require SNP's Secure TSC, or require
the hypervisor to enumerate the TSC frequency via CPUID. I don't see a ton of
value in that approach though, and long-term it would lead to some truly weird
code due to holding sched_clock to a higher standard than clocksource.
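
For the record, that more conservative variant would only mean two extra feature
checks, something like this sketch (not what the diff below does):

	/* Conservative variant: also require a known, reliable frequency. */
	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
	    boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ) &&
	    boot_cpu_has(X86_FEATURE_TSC_RELIABLE) &&
	    !check_tsc_unstable())
		kvm_clock.rating = 299;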
But wait, there's more! Because TDX doesn't override .calibrate_tsc() or
.calibrate_cpu(), even though TDX provides a trusted TSC *and* enumerates the
frequency of the TSC, unless I'm missing something, tsc_early_init() will compute
the TSC frequency using the information provided by KVM, i.e. the untrusted host.
The "obvious" solution is to leave the calibration functions as-is if the TSC has
a known, reliable frequency, but even _that_ is riddled with problems, because
as-is, the kernel sets TSC_KNOWN_FREQ and TSC_RELIABLE in tsc_early_init(), which
runs *after* init_hypervisor_platform(). SNP Secure TSC fudges around this by
overriding the calibration routines, but that's a bit gross and easy to fix if we
also fix TDX. And fixing TDX by running native_calibrate_tsc() would give the
same love to setups where the hypervisor provides CPUID 0x15 and/or 0x16.
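
For reference, the CPUID 0x15 leg of native_calibrate_tsc() boils down to
roughly this (simplified sketch, fallbacks omitted; the real function also
derives the crystal frequency from CPUID 0x16 when ECX reads as zero):

	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;

	/* CPUID 0x15: TSC/crystal-clock ratio, plus crystal frequency in Hz. */
	cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
	if (!eax_denominator || !ebx_numerator)
		return 0;

	/* TSC kHz = crystal kHz * numerator / denominator. */
	crystal_khz = ecx_hz / 1000;
	return crystal_khz * ebx_numerator / eax_denominator;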
All in all, I'm thinking something like this (across multiple patches):
---
 arch/x86/include/asm/tsc.h |  1 +
 arch/x86/kernel/kvmclock.c | 55 ++++++++++++++++++++++++++------------
 arch/x86/kernel/setup.c    |  7 +++++
 arch/x86/kernel/tsc.c      | 32 +++++++++++++++++-----
 4 files changed, 72 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94408a784c8e..e13d6c3f2298 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -28,6 +28,7 @@ static inline cycles_t get_cycles(void)
 }
 #define get_cycles get_cycles
 
+extern void tsc_early_detect(void);
 extern void tsc_early_init(void);
 extern void tsc_init(void);
 extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..fa6bf71cc511 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -317,33 +317,54 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
-	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	/*
+	 * If the TSC counts at a constant frequency across P/T states, counts
+	 * in deep C-states, and the TSC hasn't been marked unstable, prefer
+	 * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
+	 * that TSC is chosen as the clocksource.  Note, the TSC unstable check
+	 * exists purely to honor a TSC marked unstable via the command line;
+	 * any runtime detection of an unstable TSC will happen after this.
+	 */
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+	    !check_tsc_unstable()) {
+		kvm_clock.rating = 299;
+		pr_warn("kvm-clock: Using native sched_clock\n");
+	} else {
+		flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+		kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+	}
+
+	/*
+	 * If the TSC frequency is already known, e.g. via CPUID, rely on that
+	 * information.  For "normal" VMs, the hypervisor controls kvmclock and
+	 * CPUID, i.e. the frequency is coming from the same place.  For CoCo
+	 * VMs, the TSC frequency may be provided by trusted firmware, in which
+	 * case it's highly desirable to use that information, not kvmclock's.
+	 * Note, TSC_KNOWN_FREQ must be read before presetting loops-per-jiffy
+	 * (see kvm_get_tsc_khz()).
+	 */
+	if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ) ||
+	    !boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
+		pr_warn("kvm-clock: Using kvmclock for TSC calibration\n");
+		x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+		x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+	}
 
-	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC
 	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
 #endif
+	/*
+	 * Save/restore "sched" clock state even if kvmclock isn't being used
+	 * for sched_clock, as kvmclock is still used for wallclock and relies
+	 * on these hooks to re-enable kvmclock after suspend+resume.
+	 */
 	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
 	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
 
 	kvm_get_preset_lpj();
 
-	/*
-	 * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
-	 * with P/T states and does not stop in deep C-states.
-	 *
-	 * Invariant TSC exposed by host means kvmclock is not necessary:
-	 * can use TSC as clocksource.
-	 *
-	 */
-	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
-	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
-	    !check_tsc_unstable())
-		kvm_clock.rating = 299;
-
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 	pv_info.name = "KVM";
 }
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f1fea506e20f..2b6800426349 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -907,6 +907,13 @@ void __init setup_arch(char **cmdline_p)
 	reserve_ibft_region();
 	x86_init.resources.dmi_setup();
 
+	/*
+	 * Detect, but do not fully initialize, TSC info before initializing
+	 * the hypervisor platform, so that hypervisor code can make informed
+	 * decisions about using a paravirt clock vs. TSC.
+	 */
+	tsc_early_detect();
+
 	/*
 	 * VMware detection requires dmi to be available, so this
 	 * needs to be done after dmi_setup(), for the boot CPU.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 0864b314c26a..9baffb425386 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -663,7 +663,12 @@ unsigned long native_calibrate_tsc(void)
 	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
 	unsigned int crystal_khz;
 
-	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+	/*
+	 * Ignore the vendor when running as a VM; if the hypervisor provides
+	 * garbage CPUID information, then the vendor is also suspect.
+	 */
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
+	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return 0;
 
 	if (boot_cpu_data.cpuid_level < 0x15)
@@ -713,10 +718,13 @@ unsigned long native_calibrate_tsc(void)
 		return 0;
 
 	/*
-	 * For Atom SoCs TSC is the only reliable clocksource.
-	 * Mark TSC reliable so no watchdog on it.
+	 * For Atom SoCs TSC is the only reliable clocksource.  Similarly, in a
+	 * VM, any watchdog is going to be less reliable than the TSC as the
+	 * watchdog source will be emulated in software.  In both cases, mark
+	 * the TSC reliable so that no watchdog runs on it.
 	 */
-	if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
+	    boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -1509,6 +1517,20 @@ static void __init tsc_enable_sched_clock(void)
 	static_branch_enable(&__use_tsc);
 }
 
+void __init tsc_early_detect(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TSC))
+		return;
+
+	snp_secure_tsc_init();
+
+	/*
+	 * Run through native TSC calibration to set TSC_KNOWN_FREQ and/or
+	 * TSC_RELIABLE as appropriate.
+	 */
+	native_calibrate_tsc();
+}
+
 void __init tsc_early_init(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_TSC))
@@ -1517,8 +1539,6 @@ void __init tsc_early_init(void)
 	if (is_early_uv_system())
 		return;
 
-	snp_secure_tsc_init();
-
 	if (!determine_cpu_tsc_frequencies(true))
 		return;
 
 	tsc_enable_sched_clock();
base-commit: 0563ee35ae2c9cfb0c6a7b2c0ddf7d9372bb8a98
--