* [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
@ 2025-08-16 10:09 David Woodhouse
2025-08-16 10:10 ` [PATCH v2 1/3] KVM: x86: Restore caching of KVM CPUID base David Woodhouse
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: David Woodhouse @ 2025-08-16 10:09 UTC
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Vitaly Kuznetsov, kvm, linux-kernel, graf, Ajay Kaher,
Alexey Makhalov, Colin Percival
In https://lkml.org/lkml/2008/10/1/246 VMware proposed a generic standard
for harmonising CPUID between hypervisors. It was mostly shot down in
flames, but the generic timing leaf at 0x4000_0010 didn't quite die.
Mostly the hypervisor leaves at 0x4000_0xxx are very hypervisor-specific,
but XNU and FreeBSD as guests will look for 0x4000_0010 unconditionally,
under any hypervisor. The EC2 Nitro hypervisor has also exposed TSC
frequency information in this leaf, since 2020.
As things stand, KVM guests have to reverse-calculate the TSC frequency
from the mul/shift information given to them in the KVM clock to convert
ticks into nanoseconds, with a corresponding loss of precision.
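To illustrate the precision loss, here is a minimal sketch of the round trip. It is not the kernel's code: khz_to_mul_shift() is a hypothetical, simplified stand-in for how a host might derive the pvclock multiplier and shift, and mul_shift_to_khz() is intended to follow the kind of reverse calculation a guest performs on those two values. The 2900015 kHz input is the figure reported in the patch 3 changelog.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Host side (hypothetical helper): pick mul/shift so that
 *   ns = ((ticks << shift) * mul) >> 32
 * where a negative shift means a right shift of the tick count.
 */
static void khz_to_mul_shift(uint32_t tsc_khz, uint32_t *mul, int8_t *shift)
{
    uint64_t hz = (uint64_t)tsc_khz * 1000;
    int s = 0;

    assert(tsc_khz != 0);

    /* Normalise the frequency into (1 GHz, 2 GHz] so mul fits in 32 bits. */
    while (hz > 2000000000ULL) {
        hz >>= 1;
        s--;
    }
    while (hz <= 1000000000ULL) {
        hz <<= 1;
        s++;
    }

    /* The integer division truncates; this is where precision is lost. */
    *mul = (uint32_t)((1000000000ULL << 32) / hz);
    *shift = (int8_t)s;
}

/* Guest side: recover kHz from mul/shift alone. */
static uint32_t mul_shift_to_khz(uint32_t mul, int8_t shift)
{
    uint64_t khz = (1000000ULL << 32) / mul;

    if (shift < 0)
        khz <<= -shift;
    else
        khz >>= shift;
    return (uint32_t)khz;
}

/* Convenience wrapper: what the guest sees for a given host frequency. */
static uint32_t roundtrip_khz(uint32_t host_khz)
{
    uint32_t mul;
    int8_t shift;

    khz_to_mul_shift(host_khz, &mul, &shift);
    return mul_shift_to_khz(mul, shift);
}
```

Under these assumptions, roundtrip_khz(2900015) returns 2900014: the two truncating divisions cost 1 kHz, which is consistent with the "Before" and "After" boot messages quoted in patch 3.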
There's certainly no way we can sanely use 0x4000_0010 for anything *else*
at this point. Just adopt it, as both guest and host. We already have the
infrastructure for keeping the TSC frequency information up to date for
the Xen CPUID leaf anyway, so do precisely the same for this one.
v2:
• Fix inadvertent C++ism pointed out by syzbot:
https://ci.syzbot.org/series/a9510b1a-8024-41ce-9775-675f5c165e20
David Woodhouse (3):
KVM: x86: Restore caching of KVM CPUID base
KVM: x86: Provide TSC frequency in "generic" timing information CPUID leaf
x86/kvm: Obtain TSC frequency from CPUID if present
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/kvm_para.h | 1 +
arch/x86/include/uapi/asm/kvm_para.h | 11 +++++++++++
arch/x86/kernel/kvm.c | 10 ++++++++++
arch/x86/kernel/kvmclock.c | 7 ++++++-
arch/x86/kvm/cpuid.c | 23 ++++++++++++++++++-----
6 files changed, 47 insertions(+), 6 deletions(-)
* [PATCH v2 1/3] KVM: x86: Restore caching of KVM CPUID base
2025-08-16 10:09 [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host David Woodhouse
@ 2025-08-16 10:10 ` David Woodhouse
2025-08-16 10:10 ` [PATCH v2 2/3] KVM: x86: Provide TSC frequency in "generic" timing information CPUID leaf David Woodhouse
` (2 subsequent siblings)
3 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2025-08-16 10:10 UTC
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Vitaly Kuznetsov, kvm, linux-kernel, graf, Ajay Kaher,
Alexey Makhalov, Colin Percival
From: David Woodhouse <dwmw@amazon.co.uk>
This mostly reverts commit a5b32718081e ("KVM: x86: Remove unnecessary
caching of KVM's PV CPUID base").
Sure, caching state which might change has certain risks, but KVM
already does cache the CPUID contents, and the whole point of calling
kvm_apply_cpuid_pv_features_quirk() from kvm_vcpu_after_set_cpuid() is
to cache the contents of that leaf too, so that guest_pv_has() can
access them quickly.
An upcoming commit is going to want to use vcpu->arch.kvm_cpuid from
kvm_cpuid() at runtime too, so put it back.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 16 +++++++++++-----
2 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f19a76d3ca0e..50febd333f5f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -897,6 +897,7 @@ struct kvm_vcpu_arch {
int cpuid_nent;
struct kvm_cpuid_entry2 *cpuid_entries;
+ struct kvm_hypervisor_cpuid kvm_cpuid;
bool cpuid_dynamic_bits_dirty;
bool is_amd_compatible;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index e2836a255b16..bcce3a75c3f2 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -178,7 +178,12 @@ static int kvm_cpuid_check_equal(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2
/*
* Apply runtime CPUID updates to the incoming CPUID entries to avoid
- * false positives due mismatches on KVM-owned feature flags.
+ * false positives due mismatches on KVM-owned feature flags. Note,
+ * runtime CPUID updates may consume other CPUID-driven vCPU state,
+ * e.g. KVM or Xen CPUID bases. Updating runtime state before full
+ * CPUID processing is functionally correct only because any change in
+ * CPUID is disallowed, i.e. using stale data is ok because the below
+ * checks will reject the change.
*
* Note! @e2 and @nent track the _old_ CPUID entries!
*/
@@ -231,14 +236,14 @@ static struct kvm_hypervisor_cpuid kvm_get_hypervisor_cpuid(struct kvm_vcpu *vcp
static u32 kvm_apply_cpuid_pv_features_quirk(struct kvm_vcpu *vcpu)
{
- struct kvm_hypervisor_cpuid kvm_cpuid;
struct kvm_cpuid_entry2 *best;
+ u32 features_leaf = vcpu->arch.kvm_cpuid.base | KVM_CPUID_FEATURES;
- kvm_cpuid = kvm_get_hypervisor_cpuid(vcpu, KVM_SIGNATURE);
- if (!kvm_cpuid.base)
+ if (!vcpu->arch.kvm_cpuid.base ||
+ vcpu->arch.kvm_cpuid.limit < features_leaf)
return 0;
- best = kvm_find_cpuid_entry(vcpu, kvm_cpuid.base | KVM_CPUID_FEATURES);
+ best = kvm_find_cpuid_entry(vcpu, features_leaf);
if (!best)
return 0;
@@ -541,6 +546,7 @@ static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
if (r)
goto err;
+ vcpu->arch.kvm_cpuid = kvm_get_hypervisor_cpuid(vcpu, KVM_SIGNATURE);
#ifdef CONFIG_KVM_XEN
vcpu->arch.xen.cpuid = kvm_get_hypervisor_cpuid(vcpu, XEN_SIGNATURE);
#endif
--
2.49.0
* [PATCH v2 2/3] KVM: x86: Provide TSC frequency in "generic" timing information CPUID leaf
2025-08-16 10:09 [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host David Woodhouse
2025-08-16 10:10 ` [PATCH v2 1/3] KVM: x86: Restore caching of KVM CPUID base David Woodhouse
@ 2025-08-16 10:10 ` David Woodhouse
2025-08-16 10:10 ` [PATCH v2 3/3] x86/kvm: Obtain TSC frequency from CPUID if present David Woodhouse
2025-08-21 16:26 ` [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host Sean Christopherson
3 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2025-08-16 10:10 UTC
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Vitaly Kuznetsov, kvm, linux-kernel, graf, Ajay Kaher,
Alexey Makhalov, Colin Percival
From: David Woodhouse <dwmw@amazon.co.uk>
In https://lkml.org/lkml/2008/10/1/246 a proposal was made for generic
CPUID leaves, of which only 0x40000010 was defined, to contain the TSC
and local APIC frequencies. The proposal from VMware was mostly shot
down in flames, *but* XNU does unconditionally assume that this leaf
contains the frequency information, if it's present on any hypervisor:
https://github.com/apple/darwin-xnu/blob/main/osfmk/i386/cpuid.c
So does FreeBSD: https://github.com/freebsd/freebsd-src/commit/4a432614f68
So at this point it would be daft for a hypervisor to expose 0x40000010
for any *other* content. KVM might as well adopt it, and fill in the
accurate TSC frequency just as it does for the Xen TSC leaf.
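The condition under which the host answers the timing leaf can be sketched in isolation. This is only an illustration of the check described in this patch: struct hv_cpuid_range and is_timing_leaf() are hypothetical names, and TIMING_INFO_LEAF simply reuses the value the patch defines as KVM_CPUID_TIMING_INFO.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as KVM_CPUID_TIMING_INFO in this patch. */
#define TIMING_INFO_LEAF 0x40000010u

/* Hypothetical stand-in for the cached KVM CPUID base and limit. */
struct hv_cpuid_range {
    uint32_t base;   /* leaf carrying the KVM signature, 0 if absent */
    uint32_t limit;  /* highest leaf advertised in that range */
};

/*
 * The timing leaf is answered only if a KVM range exists, the requested
 * function lies within the advertised limit, and it is the timing leaf
 * relative to the (possibly relocated) base.
 */
static bool is_timing_leaf(const struct hv_cpuid_range *r, uint32_t function)
{
    return r->base &&
           function <= r->limit &&
           function == (r->base | TIMING_INFO_LEAF);
}
```

Because the leaf number is ORed with the base, a range whose signature userspace places at 0x40000100 would answer at 0x40000110 rather than 0x40000010 in this sketch.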
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/include/uapi/asm/kvm_para.h | 11 +++++++++++
arch/x86/kvm/cpuid.c | 7 +++++++
2 files changed, 18 insertions(+)
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..1597c4a2a24a 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -44,6 +44,17 @@
*/
#define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24
+
+/*
+ * Proposed by VMware in https://lkml.org/lkml/2008/10/1/246 the timing
+ * information leaf provides the TSC and local APIC timer frequencies:
+ *
+ * # EAX: (Virtual) TSC frequency in kHz.
+ * # EBX: (Virtual) Bus (local apic timer) frequency in kHz.
+ * # ECX, EDX: RESERVED (reserved fields are set to zero).
+ */
+#define KVM_CPUID_TIMING_INFO 0x40000010
+
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bcce3a75c3f2..1bd69d9c86b7 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -2029,6 +2029,13 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
} else if (index == 2) {
*eax = vcpu->arch.hw_tsc_khz;
}
+ } else if (vcpu->arch.kvm_cpuid.base &&
+ function <= vcpu->arch.kvm_cpuid.limit &&
+ function == (vcpu->arch.kvm_cpuid.base | KVM_CPUID_TIMING_INFO)) {
+ if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu))
+ kvm_guest_time_update(vcpu);
+
+ *eax = vcpu->arch.hw_tsc_khz;
}
} else {
*eax = *ebx = *ecx = *edx = 0;
--
2.49.0
* [PATCH v2 3/3] x86/kvm: Obtain TSC frequency from CPUID if present
2025-08-16 10:09 [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host David Woodhouse
2025-08-16 10:10 ` [PATCH v2 1/3] KVM: x86: Restore caching of KVM CPUID base David Woodhouse
2025-08-16 10:10 ` [PATCH v2 2/3] KVM: x86: Provide TSC frequency in "generic" timing information CPUID leaf David Woodhouse
@ 2025-08-16 10:10 ` David Woodhouse
2025-08-21 16:26 ` [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host Sean Christopherson
3 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2025-08-16 10:10 UTC
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Vitaly Kuznetsov, kvm, linux-kernel, graf, Ajay Kaher,
Alexey Makhalov, Colin Percival
From: David Woodhouse <dwmw@amazon.co.uk>
In https://lkml.org/lkml/2008/10/1/246 a proposal was made for generic
CPUID conventions across hypervisors. It was mostly shot down in flames,
but the leaf at 0x40000010 containing timing information didn't die.
It's used by XNU and FreeBSD guests under all hypervisors¹² to determine
the TSC frequency, and also exposed by the EC2 Nitro hypervisor (as
well as, presumably, VMware). FreeBSD's Bhyve is probably just about
to start exposing it too.
Use it under KVM to obtain the TSC frequency more accurately, instead
of reverse-calculating the frequency from the mul/shift values in the
KVM clock.
Before:
[ 0.000020] tsc: Detected 2900.014 MHz processor
After:
[ 0.000020] tsc: Detected 2900.015 MHz processor
$ cpuid -1 -l 0x40000010
CPU:
hypervisor generic timing information (0x40000010):
TSC frequency (Hz) = 2900015
bus frequency (Hz) = 1000000
¹ https://github.com/apple/darwin-xnu/blob/main/osfmk/i386/cpuid.c
² https://github.com/freebsd/freebsd-src/commit/4a432614f68
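The guest-side lookup described above can be sketched so that it runs anywhere, with the CPUID instruction replaced by a callback. All names here are hypothetical: cpuid_eax_fn stands in for a real CPUID read, fake_cpuid_eax() returns canned values (the 2900015 kHz figure comes from the output above), and tsc_khz_from_cpuid() adds an explicit base check and a zero check that are assumptions of this sketch rather than part of the patch.

```c
#include <stdint.h>

#define TIMING_INFO_LEAF 0x40000010u

/* Hypothetical callback returning EAX for a given CPUID leaf. */
typedef uint32_t (*cpuid_eax_fn)(uint32_t leaf);

/* Canned stand-in for the CPUID instruction, for illustration only. */
static uint32_t fake_cpuid_eax(uint32_t leaf)
{
    switch (leaf) {
    case 0x40000000u:
        return 0x40000010u;  /* highest hypervisor leaf advertised */
    case 0x40000010u:
        return 2900015u;     /* (virtual) TSC frequency in kHz */
    default:
        return 0;
    }
}

/*
 * Prefer the frequency the hypervisor states directly; fall back to the
 * caller's estimate (e.g. one derived from the pvclock) if the leaf is
 * not advertised or reads as zero.
 */
static uint32_t tsc_khz_from_cpuid(cpuid_eax_fn cpuid_eax, uint32_t base,
                                   uint32_t fallback_khz)
{
    uint32_t leaf = base | TIMING_INFO_LEAF;

    if (base && cpuid_eax(base) >= leaf) {
        uint32_t khz = cpuid_eax(leaf);

        if (khz)
            return khz;
    }
    return fallback_khz;
}
```

Checking the advertised limit before reading the leaf matters because a CPUID leaf beyond the advertised range is not guaranteed to return meaningful data.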
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/include/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 10 ++++++++++
arch/x86/kernel/kvmclock.c | 7 ++++++-
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 57bc74e112f2..d53927103cab 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -121,6 +121,7 @@ static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
void kvmclock_init(void);
void kvmclock_disable(void);
bool kvm_para_available(void);
+unsigned int kvm_para_tsc_khz(void);
unsigned int kvm_arch_para_features(void);
unsigned int kvm_arch_para_hints(void);
void kvm_async_pf_task_wait_schedule(u32 token);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8ae750cde0c6..44040e37c9a7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -896,6 +896,16 @@ bool kvm_para_available(void)
}
EXPORT_SYMBOL_GPL(kvm_para_available);
+unsigned int kvm_para_tsc_khz(void)
+{
+ u32 base = kvm_cpuid_base();
+
+ if (cpuid_eax(base) >= (base | KVM_CPUID_TIMING_INFO))
+ return cpuid_eax(base | KVM_CPUID_TIMING_INFO);
+
+ return 0;
+}
+
unsigned int kvm_arch_para_features(void)
{
return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ca0a49eeac4a..0908450ebac9 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -117,7 +117,12 @@ static inline void kvm_sched_clock_init(bool stable)
static unsigned long kvm_get_tsc_khz(void)
{
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
- return pvclock_tsc_khz(this_cpu_pvti());
+
+ /*
+ * If KVM advertises the frequency directly in CPUID, use that
+ * instead of reverse-calculating it from the KVM clock data.
+ */
+ return kvm_para_tsc_khz() ? : pvclock_tsc_khz(this_cpu_pvti());
}
static void __init kvm_get_preset_lpj(void)
--
2.49.0
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-16 10:09 [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host David Woodhouse
` (2 preceding siblings ...)
2025-08-16 10:10 ` [PATCH v2 3/3] x86/kvm: Obtain TSC frequency from CPUID if present David Woodhouse
@ 2025-08-21 16:26 ` Sean Christopherson
2025-08-21 17:37 ` David Woodhouse
3 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-08-21 16:26 UTC
To: David Woodhouse
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Vitaly Kuznetsov, kvm,
linux-kernel, graf, Ajay Kaher, Alexey Makhalov, Colin Percival
On Sat, Aug 16, 2025, David Woodhouse wrote:
> In https://lkml.org/lkml/2008/10/1/246 VMware proposed a generic standard
> for harmonising CPUID between hypervisors. It was mostly shot down in
> flames, but the generic timing leaf at 0x4000_0010 didn't quite die.
>
> Mostly the hypervisor leaves at 0x4000_0xxx are very hypervisor-specific,
> but XNU and FreeBSD as guests will look for 0x4000_0010 unconditionally,
> under any hypervisor. The EC2 Nitro hypervisor has also exposed TSC
> frequency information in this leaf, since 2020.
>
> As things stand, KVM guests have to reverse-calculate the TSC frequency
> from the mul/shift information given to them in the KVM clock to convert
> ticks into nanoseconds, with a corresponding loss of precision.
I would rather have the VMM use the Intel-defined CPUID.0x15 to enumerate the
TSC frequency. I would also love, love, love reviews on that series.
https://lore.kernel.org/all/20250227021855.3257188-36-seanjc@google.com
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-21 16:26 ` [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host Sean Christopherson
@ 2025-08-21 17:37 ` David Woodhouse
2025-08-21 19:27 ` Sean Christopherson
0 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2025-08-21 17:37 UTC
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Vitaly Kuznetsov, kvm,
linux-kernel, graf, Ajay Kaher, Alexey Makhalov, Colin Percival
On Thu, 2025-08-21 at 09:26 -0700, Sean Christopherson wrote:
> On Sat, Aug 16, 2025, David Woodhouse wrote:
> > In https://lkml.org/lkml/2008/10/1/246 VMware proposed a generic standard
> > for harmonising CPUID between hypervisors. It was mostly shot down in
> > flames, but the generic timing leaf at 0x4000_0010 didn't quite die.
> >
> > Mostly the hypervisor leaves at 0x4000_0xxx are very hypervisor-specific,
> > but XNU and FreeBSD as guests will look for 0x4000_0010 unconditionally,
> > under any hypervisor. The EC2 Nitro hypervisor has also exposed TSC
> > frequency information in this leaf, since 2020.
> >
> > As things stand, KVM guests have to reverse-calculate the TSC frequency
> > from the mul/shift information given to them in the KVM clock to convert
> > ticks into nanoseconds, with a corresponding loss of precision.
>
> I would rather have the VMM use the Intel-defined CPUID.0x15 to enumerate the
> TSC frequency.
The problem with that is that it's been quite unreliable. The kernel
doesn't trust it even on chips as recent (hah) as Skylake. I'd be
happier to trust what the hypervisor explicitly gives us. But yes, it
should be *one* of the sources of information before we reverse-
calculate it from the pvclock.
> I would also love, love, love reviews on that series.
>
> https://lore.kernel.org/all/20250227021855.3257188-36-seanjc@google.com
The carousel has come back round to me frowning at clocks, and
hopefully I can spend some time looking over that, and bringing in some
of the other fixes I had which are still needed, quite soon...
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-21 17:37 ` David Woodhouse
@ 2025-08-21 19:27 ` Sean Christopherson
2025-08-21 20:42 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-08-21 19:27 UTC
To: David Woodhouse
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Vitaly Kuznetsov, kvm,
linux-kernel, graf, Ajay Kaher, Alexey Makhalov, Colin Percival
On Thu, Aug 21, 2025, David Woodhouse wrote:
> On Thu, 2025-08-21 at 09:26 -0700, Sean Christopherson wrote:
> > On Sat, Aug 16, 2025, David Woodhouse wrote:
> > > In https://lkml.org/lkml/2008/10/1/246 VMware proposed a generic standard
> > > for harmonising CPUID between hypervisors. It was mostly shot down in
> > > flames, but the generic timing leaf at 0x4000_0010 didn't quite die.
> > >
> > > Mostly the hypervisor leaves at 0x4000_0xxx are very hypervisor-specific,
> > > but XNU and FreeBSD as guests will look for 0x4000_0010 unconditionally,
> > > under any hypervisor. The EC2 Nitro hypervisor has also exposed TSC
> > > frequency information in this leaf, since 2020.
> > >
> > > As things stand, KVM guests have to reverse-calculate the TSC frequency
> > > from the mul/shift information given to them in the KVM clock to convert
> > > ticks into nanoseconds, with a corresponding loss of precision.
> >
> > I would rather have the VMM use the Intel-defined CPUID.0x15 to enumerate the
> > TSC frequency.
>
> The problem with that is that it's been quite unreliable. The kernel
> doesn't trust it even on chips as recent (hah) as Skylake. I'd be
> happier to trust what the hypervisor explicitly gives us. But yes, it
> should be *one* of the sources of information before we reverse-
> calculate it from the pvclock.
Sorry, by "the VMM use" I mean have the host, e.g. QEMU, explicitly define TSC
frequency in CPUID.0x15 and CPU frequency in CPUID.0x16. And then on the
KVM-as-a-guest side of things, trust those leaves when they're available.
So same idea as having the VMM fill 0x4000_0010, but piggyback the Intel-defined
leaves instead of the VMware-defined leaf. One of the reasons I'd like to go
that route is to avoid having to choose one or the other when running under TDX,
where CPUID.{0x15,0x16} are provided by the "trusted" TDX-Module, but any PV
leaf is not.
Dunno how feasible it is to get non-Linux guests on board though...
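For reference, the way a guest derives a TSC frequency from CPUID.0x15 can be sketched as follows. The register roles (EAX as the ratio denominator, EBX as the numerator, ECX as the nominal core crystal clock in Hz) follow the Intel SDM's description of the leaf; tsc_hz_from_leaf_0x15() is a hypothetical helper name, and the example values in the note below are made up for illustration.

```c
#include <stdint.h>

/*
 * TSC frequency in Hz from the CPUID.0x15 registers, or 0 if the leaf
 * does not enumerate enough information to compute it.
 *
 *   TSC Hz = crystal Hz * numerator / denominator
 */
static uint64_t tsc_hz_from_leaf_0x15(uint32_t eax_denominator,
                                      uint32_t ebx_numerator,
                                      uint32_t ecx_crystal_hz)
{
    if (!eax_denominator || !ebx_numerator || !ecx_crystal_hz)
        return 0;

    /* Widen before multiplying so the product cannot overflow 32 bits. */
    return (uint64_t)ecx_crystal_hz * ebx_numerator / eax_denominator;
}
```

With a hypothetical 24 MHz crystal and a ratio of 242/2, this gives 2904000000 Hz. A zero in any of the three registers means the guest has to find the frequency some other way, which is the case the rest of this thread is arguing about.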
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-21 19:27 ` Sean Christopherson
@ 2025-08-21 20:42 ` David Woodhouse
2025-08-21 20:48 ` Sean Christopherson
0 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2025-08-21 20:42 UTC
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Vitaly Kuznetsov, kvm,
linux-kernel, graf, Ajay Kaher, Alexey Makhalov, Colin Percival
On Thu, 2025-08-21 at 12:27 -0700, Sean Christopherson wrote:
>
> > The problem with that is that it's been quite unreliable. The kernel
> > doesn't trust it even on chips as recent (hah) as Skylake. I'd be
> > happier to trust what the hypervisor explicitly gives us. But yes, it
> > should be *one* of the sources of information before we reverse-
> > calculate it from the pvclock.
>
> Sorry, by "the VMM use" I mean have the host, e.g. QEMU, explicitly define TSC
> frequency in CPUID.0x15 and CPU frequency in CPUID.0x16. And then on the
> KVM-as-a-guest side of things, trust those leaves when they're available.
Those leaves are untrustworthy on hardware. Are you suggesting that the
kernel should trust them when it detects that it's running under KVM,
on the assumption that KVM will have corrected them? And that KVM will
be fabricating them even on CPU models which didn't naturally have
those leaves? And that in the presence of TSC scaling, those leaves
will show the right values for the guest even on hypervisors running
today?
I'll be surprised if that works out well.
I think I'm a lot happier with the explicit CPUID leaf exposed by the
hypervisor.
> So same idea as having the VMM fill 0x4000_0010, but piggyback the Intel-defined
> leaves instead of the VMware-defined leaf. One of the reasons I'd like to go
> that route is to avoid having to choose one or the other when running under TDX,
> where CPUID.{0x15,0x16} are provided by the "trusted" TDX-Module, but any PV
> leaf is not.
>
> Dunno how feasible it is to get non-Linux guests on board though...
FreeBSD as a guest already uses 0x4000_0010, and QEMU already supports
exposing it with the vmware-cpuid-freq option.
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-21 20:42 ` David Woodhouse
@ 2025-08-21 20:48 ` Sean Christopherson
2025-08-21 21:10 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-08-21 20:48 UTC
To: David Woodhouse
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Vitaly Kuznetsov, kvm,
linux-kernel, graf, Ajay Kaher, Alexey Makhalov, Colin Percival
On Thu, Aug 21, 2025, David Woodhouse wrote:
> On Thu, 2025-08-21 at 12:27 -0700, Sean Christopherson wrote:
> >
> > > The problem with that is that it's been quite unreliable. The kernel
> > > doesn't trust it even on chips as recent (hah) as Skylake. I'd be
> > > happier to trust what the hypervisor explicitly gives us. But yes, it
> > > should be *one* of the sources of information before we reverse-
> > > calculate it from the pvclock.
> >
> > Sorry, by "the VMM use" I mean have the host, e.g. QEMU, explicitly define TSC
> > frequency in CPUID.0x15 and CPU frequency in CPUID.0x16. And then on the
> > KVM-as-a-guest side of things, trust those leaves when they're available.
>
> Those leaves are untrustworthy on hardware. Are you suggesting that the
> kernel should trust them when it detects that it's running under KVM,
> on the assumption that KVM will have corrected them? And that KVM will
> be fabricating them even on CPU models which didn't naturally have
> those leaves? And that in the presence of TSC scaling, those leaves
> will show the right values for the guest even on hypervisors running
> today?
>
> I'll be surprised if that works out well.
>
> I think I'm a lot happier with the explicit CPUID leaf exposed by the
> hypervisor.
Why? If the hypervisor is ultimately the one defining the state, why does it
matter which CPUID leaf it's in?
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-21 20:48 ` Sean Christopherson
@ 2025-08-21 21:10 ` David Woodhouse
2025-08-22 1:57 ` Colin Percival
0 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2025-08-21 21:10 UTC
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Vitaly Kuznetsov, kvm,
linux-kernel, graf, Ajay Kaher, Alexey Makhalov, Colin Percival
On Thu, 2025-08-21 at 13:48 -0700, Sean Christopherson wrote:
>
> > I think I'm a lot happier with the explicit CPUID leaf exposed by the
> > hypervisor.
>
> Why? If the hypervisor is ultimately the one defining the state, why does it
> matter which CPUID leaf it's in?
It matters to the guest. If there's any hypervisor anywhere which
allows the bogus Skylake CPUID contents to show through to a guest, or
which allows the native hardware contents of the 0x15/0x16 leaves to
show even when TSC scaling is in force, then the guest cannot trust
those leaves.
If you tell me that 0x15 is *never* wrong when seen by a KVM guest, and
that it's OK to extend the hardware CPUID support up to 0x15 even on
older CPUs and there'll never be any adverse consequences from weird
assumptions in guest operating systems if we do the latter... well, for
a start, I won't believe you. And even if I do, I won't think it's
worth the risk. Just use a hypervisor leaf :)
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-21 21:10 ` David Woodhouse
@ 2025-08-22 1:57 ` Colin Percival
2025-08-26 19:30 ` Sean Christopherson
0 siblings, 1 reply; 21+ messages in thread
From: Colin Percival @ 2025-08-22 1:57 UTC
To: David Woodhouse, Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Vitaly Kuznetsov, kvm,
linux-kernel, graf, Ajay Kaher, Alexey Makhalov
On 8/21/25 14:10, David Woodhouse wrote:
> On Thu, 2025-08-21 at 13:48 -0700, Sean Christopherson wrote:
>>> I think I'm a lot happier with the explicit CPUID leaf exposed by the
>>> hypervisor.
>>
>> Why? If the hypervisor is ultimately the one defining the state, why does it
>> matter which CPUID leaf it's in?
> [...]
>
> If you tell me that 0x15 is *never* wrong when seen by a KVM guest, and
> that it's OK to extend the hardware CPUID support up to 0x15 even on
> older CPUs and there'll never be any adverse consequences from weird
> assumptions in guest operating systems if we do the latter... well, for
> a start, I won't believe you. And even if I do, I won't think it's
> worth the risk. Just use a hypervisor leaf :)
FreeBSD developer here. I'm with David on this, we'll consult the 0x15/0x16
CPUID leaves if we don't have anything better, but I'm not going to trust
those nearly as much as the 0x40000010 leaf.
Also, the 0x40000010 leaf provides the lapic frequency, which AFAIK is not
exposed in any other way.
--
Colin Percival
FreeBSD Release Engineering Lead & EC2 platform maintainer
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-22 1:57 ` Colin Percival
@ 2025-08-26 19:30 ` Sean Christopherson
2025-08-27 9:30 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-08-26 19:30 UTC
To: Colin Percival
Cc: David Woodhouse, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Vitaly Kuznetsov, kvm, linux-kernel, graf, Ajay Kaher,
Alexey Makhalov
On Fri, Aug 22, 2025, Colin Percival wrote:
> On 8/21/25 14:10, David Woodhouse wrote:
> > On Thu, 2025-08-21 at 13:48 -0700, Sean Christopherson wrote:
> > > > I think I'm a lot happier with the explicit CPUID leaf exposed by the
> > > > hypervisor.
> > >
> > > Why? If the hypervisor is ultimately the one defining the state, why does it
> > > matter which CPUID leaf it's in?
> > [...]
> >
> > If you tell me that 0x15 is *never* wrong when seen by a KVM guest, and
> > that it's OK to extend the hardware CPUID support up to 0x15 even on
> > older CPUs and there'll never be any adverse consequences from weird
> > assumptions in guest operating systems if we do the latter... well, for
> > a start, I won't believe you. And even if I do, I won't think it's
> > worth the risk. Just use a hypervisor leaf :)
But for CoCo VMs (TDX in particular), using a hypervisor leaf is objectively worse,
because the hypervisor leaf is emulated by the untrusted world, whereas CPUID.0x15
is emulated by the trusted world (TDX-Module).
If the issue is one of trust, what if we carve out a KVM_FEATURE_xxx bit that
userspace can set to pinky swear it isn't broken?
> FreeBSD developer here. I'm with David on this, we'll consult the 0x15/0x16
> CPUID leaves if we don't have anything better, but I'm not going to trust
> those nearly as much as the 0x40000010 leaf.
>
> Also, the 0x40000010 leaf provides the lapic frequency, which AFAIK is not
> exposed in any other way.
On Intel CPUs, CPUID.0x15 defines the APIC timer frequency:
The APIC timer frequency will be the processor’s bus clock or core crystal clock
frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15)
divided by the value specified in the divide configuration register.
Thanks to TDX (again), that is also now KVM's ABI.
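The relationship quoted from the SDM above can be sketched as a small calculation. apic_divide_value() decodes the divide configuration register using the encoding in the Intel SDM (bits 3, 1 and 0 select the divisor, with the all-ones pattern meaning divide by 1); apic_timer_hz() is a hypothetical helper name, and the crystal frequency used in the note below is an illustrative value.

```c
#include <stdint.h>

/*
 * Decode the APIC timer divide configuration register.
 * Bits 3, 1 and 0 form a 3-bit selector:
 *   000 -> /2, 001 -> /4, ... 110 -> /128, 111 -> /1
 */
static uint32_t apic_divide_value(uint32_t dcr)
{
    uint32_t sel = ((dcr >> 1) & 0x4) | (dcr & 0x3);

    return sel == 7 ? 1 : 2u << sel;
}

/* APIC timer frequency: core crystal clock divided by the configured value. */
static uint32_t apic_timer_hz(uint32_t crystal_hz, uint32_t dcr)
{
    return crystal_hz / apic_divide_value(dcr);
}
```

For a hypothetical 24 MHz crystal, a divide configuration of 0xB (divide by 1) leaves the timer at 24 MHz, while 0x3 (divide by 16) gives 1.5 MHz.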
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-26 19:30 ` Sean Christopherson
@ 2025-08-27 9:30 ` David Woodhouse
2025-08-28 23:40 ` Sean Christopherson
0 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2025-08-27 9:30 UTC
To: Sean Christopherson, Colin Percival
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Vitaly Kuznetsov, kvm,
linux-kernel, graf, Ajay Kaher, Alexey Makhalov
On Tue, 2025-08-26 at 12:30 -0700, Sean Christopherson wrote:
> On Fri, Aug 22, 2025, Colin Percival wrote:
> > On 8/21/25 14:10, David Woodhouse wrote:
> > > On Thu, 2025-08-21 at 13:48 -0700, Sean Christopherson wrote:
> > > > > I think I'm a lot happier with the explicit CPUID leaf exposed by the
> > > > > hypervisor.
> > > >
> > > > Why? If the hypervisor is ultimately the one defining the state, why does it
> > > > matter which CPUID leaf it's in?
> > > [...]
> > >
> > > If you tell me that 0x15 is *never* wrong when seen by a KVM guest, and
> > > that it's OK to extend the hardware CPUID support up to 0x15 even on
> > > older CPUs and there'll never be any adverse consequences from weird
> > > assumptions in guest operating systems if we do the latter... well, for
> > > a start, I won't believe you. And even if I do, I won't think it's
> > > worth the risk. Just use a hypervisor leaf :)
>
> But for CoCo VMs (TDX in particular), using a hypervisor leaf is objectively worse,
> because the hypervisor leaf is emulated by the untrusted world, whereas CPUID.0x15
> is emulated by the trusted world (TDX-Module).
>
> If the issue is one of trust, what if we carve out a KVM_FEATURE_xxx bit that
> userspace can set to pinky swear it isn't broken?
>
> > FreeBSD developer here. I'm with David on this, we'll consult the 0x15/0x16
> > CPUID leaves if we don't have anything better, but I'm not going to trust
> > those nearly as much as the 0x40000010 leaf.
> >
> > Also, the 0x40000010 leaf provides the lapic frequency, which AFAIK is not
> > exposed in any other way.
>
> On Intel CPUs, CPUID.0x15 defines the APIC timer frequency:
>
> The APIC timer frequency will be the processor’s bus clock or core crystal clock
> frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15)
> divided by the value specified in the divide configuration register.
>
> Thanks to TDX (again), that is also now KVM's ABI.
And AMD's Secure TSC provides it in a GUEST_TSC_FREQ MSR, I believe.
For the non-CoCo cases, I do think we'd need at least that 'I pinky
swear that CPUID 0x15 is telling the truth' bit — because right now, on
today's hypervisors, I believe it might not be correct. So a guest
can't trust it without that bit.
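For reference, a guest that does decide to trust CPUID 0x15 derives the TSC frequency as the core crystal clock (ECX, in Hz) multiplied by the ratio EBX/EAX. A minimal sketch of that arithmetic follows; the function name is made up for illustration and is not a kernel API, and the caller is assumed to have already read the raw register values.

```c
#include <stdint.h>

/*
 * Derive the TSC frequency in Hz from raw CPUID.0x15 register values.
 * EAX = ratio denominator, EBX = ratio numerator, ECX = crystal clock in Hz.
 * Returns 0 when any field is not enumerated, in which case the guest
 * has to fall back to some other source. Hypothetical helper name.
 */
static uint64_t tsc_hz_from_leaf15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	if (eax == 0 || ebx == 0 || ecx == 0)
		return 0;

	/* 32 x 32 bit product fits in 64 bits, so this cannot overflow. */
	return (uint64_t)ecx * ebx / eax;
}
```

The zero return is the case that matters for this discussion: a hypervisor can populate the leaf with values that look plausible but do not match what the guest really gets, and nothing in the leaf itself lets the guest tell the difference.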
But I'm also concerned about the side-effects of advertising to guests
that everything up to 0x15 is present, on older and AMD CPUs. And I
just don't see the point in that 'pinky swear' bit, when there's an
*existing* hypervisor leaf which just gives the information directly,
which is implemented in QEMU and EC2, as well as various guests.
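By contrast, consuming the hypervisor leaf needs no arithmetic. A rough sketch of the guest side, assuming the layout from the original VMware proposal (EAX = TSC frequency in kHz, EBX = bus/local APIC timer frequency in kHz) and assuming the hypervisor range starts at 0x40000000; the struct and function names are illustrative only:

```c
#include <stdbool.h>
#include <stdint.h>

#define HV_CPUID_TIMING_INFO 0x40000010u

/* Illustrative container for the two values carried by the leaf. */
struct hv_timing_info {
	uint32_t tsc_khz;      /* EAX: TSC frequency in kHz */
	uint32_t apic_bus_khz; /* EBX: bus (local APIC timer) frequency in kHz */
};

/*
 * Decode the timing leaf from raw register values.
 * max_hv_leaf is EAX of the hypervisor base leaf; the timing leaf is only
 * meaningful if the hypervisor advertises leaves at least that far.
 */
static bool hv_decode_timing_leaf(uint32_t max_hv_leaf, uint32_t eax,
				  uint32_t ebx, struct hv_timing_info *out)
{
	if (max_hv_leaf < HV_CPUID_TIMING_INFO || eax == 0)
		return false;

	out->tsc_khz = eax;
	out->apic_bus_khz = ebx;
	return true;
}
```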
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-27 9:30 ` David Woodhouse
@ 2025-08-28 23:40 ` Sean Christopherson
2025-08-29 9:50 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-08-28 23:40 UTC (permalink / raw)
To: David Woodhouse
Cc: Colin Percival, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Vitaly Kuznetsov, kvm, linux-kernel, graf, Ajay Kaher,
Alexey Makhalov
On Wed, Aug 27, 2025, David Woodhouse wrote:
> On Tue, 2025-08-26 at 12:30 -0700, Sean Christopherson wrote:
> > On Fri, Aug 22, 2025, Colin Percival wrote:
> > > On 8/21/25 14:10, David Woodhouse wrote:
> > > > On Thu, 2025-08-21 at 13:48 -0700, Sean Christopherson wrote:
> > > > > > I think I'm a lot happier with the explicit CPUID leaf exposed by the
> > > > > > hypervisor.
> > > > >
> > > > > Why? If the hypervisor is ultimately the one defining the state, why does it
> > > > > matter which CPUID leaf it's in?
> > > > [...]
> > > >
> > > > If you tell me that 0x15 is *never* wrong when seen by a KVM guest, and
> > > > that it's OK to extend the hardware CPUID support up to 0x15 even on
> > > > older CPUs and there'll never be any adverse consequences from weird
> > > > assumptions in guest operating systems if we do the latter... well, for
> > > > a start, I won't believe you. And even if I do, I won't think it's
> > > > worth the risk. Just use a hypervisor leaf :)
> >
> > But for CoCo VMs (TDX in particular), using a hypervisor leaf is objectively worse,
> > because the hypervisor leaf is emulated by the untrusted world, whereas CPUID.0x15
> > is emulated by the trusted world (TDX-Module).
> >
> > If the issue is one of trust, what if we carve out a KVM_FEATURE_xxx bit that
> > userspace can set to pinky swear it isn't broken?
> >
> > > FreeBSD developer here. I'm with David on this, we'll consult the 0x15/0x16
> > > CPUID leaves if we don't have anything better, but I'm not going to trust
> > > those nearly as much as the 0x40000010 leaf.
> > >
> > > Also, the 0x40000010 leaf provides the lapic frequency, which AFAIK is not
> > > exposed in any other way.
> >
> > On Intel CPUs, CPUID.0x15 defines the APIC timer frequency:
> >
> > The APIC timer frequency will be the processor’s bus clock or core crystal clock
> > frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15)
> > divided by the value specified in the divide configuration register.
> >
> > Thanks to TDX (again), that is also now KVM's ABI.
>
> And AMD's Secure TSC provides it in a GUEST_TSC_FREQ MSR, I believe.
>
> For the non-CoCo cases, I do think we'd need at least that 'I pinky
> swear that CPUID 0x15 is telling the truth' bit — because right now, on
> today's hypervisors, I believe it might not be correct. So a guest
> can't trust it without that bit.
>
> But I'm also concerned about the side-effects of advertising to guests
> that everything up to 0x15 is present, on older and AMD CPUs.
Ah, you want to bolt this onto older vCPU models. That makes sense.
> And I just don't see the point in that 'pinky swear' bit,
Yeah, I can see poorly written guest software freaking out over CPUID.0x15 being
unexpectedly valid, e.g. on AMD hardware, in which case pinky swearing it's ok
won't help.
> when there's an *existing* hypervisor leaf which just gives the information
> directly, which is implemented in QEMU and EC2, as well as various guests.
Can we just have the VMM do the work then? I.e. carve out the bit and the
leaf in KVM's ABI, but leave it to the VMM to fill in? I'd strongly prefer not
to hook kvm_cpuid(), as I don't like overriding userspace's CPUID entries, and
I especially don't like that hooking kvm_cpuid() means the value can change
throughout the lifetime of the VM, at least in theory, but in practice will only
ever be checked by the guest during early boot.
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-28 23:40 ` Sean Christopherson
@ 2025-08-29 9:50 ` David Woodhouse
2025-08-29 11:08 ` Durrant, Paul
0 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2025-08-29 9:50 UTC (permalink / raw)
To: Sean Christopherson, Paul Durrant, Griffoul, Fred
Cc: Colin Percival, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Vitaly Kuznetsov, kvm, linux-kernel, graf, Ajay Kaher,
Alexey Makhalov
On Thu, 2025-08-28 at 16:40 -0700, Sean Christopherson wrote:
> On Wed, Aug 27, 2025, David Woodhouse wrote:
> > when there's an *existing* hypervisor leaf which just gives the information
> > directly, which is implemented in QEMU and EC2, as well as various guests.
>
> Can we just have the VMM do the work then? I.e. carve out the bit and the
> leaf in KVM's ABI, but leave it to the VMM to fill in? I'd strongly prefer not
> to hook kvm_cpuid(), as I don't like overriding userspace's CPUID entries, and
> I especially don't like that hooking kvm_cpuid() means the value can change
> throughout the lifetime of the VM, at least in theory, but in practice will only
> ever be checked by the guest during early boot.
The problem is that the VMM doesn't know what TSC frequency the guest
actually gets. The VMM only knows what it *asked* for, not what KVM
actually ended up configuring — which depends on the capabilities of
the hardware and the host's idea of what its actual TSC frequency is.
Hence https://git.kernel.org/torvalds/c/f422f853af036 in which we
allowed KVM to populate the value in the Xen TSC info CPUID leaves. I
was just following that precedent.
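To illustrate why the requested and effective frequencies can differ, here is a simplified model, not KVM's actual code. It assumes two effects: a request close enough to the host rate is left unscaled, and any other request goes through a fixed-point multiplier that is truncated to a limited number of fractional bits. The function name, the tolerance and the fractional-bit count are all parameters chosen for illustration.

```c
#include <stdint.h>

/*
 * Simplified model of the frequency a guest ends up with.
 * host_hz must be non-zero and tolerance_ppm below 1000000.
 * Uses the GCC/Clang unsigned __int128 extension for the wide multiply.
 */
static uint64_t effective_tsc_hz(uint64_t host_hz, uint64_t requested_hz,
				 unsigned int frac_bits, uint32_t tolerance_ppm)
{
	uint64_t slack = host_hz / 1000000 * tolerance_ppm;
	unsigned __int128 ratio;

	/* Close enough to the host rate: run unscaled at the host frequency. */
	if (requested_hz >= host_hz - slack && requested_hz <= host_hz + slack)
		return host_hz;

	/* Fixed-point multiplier, truncated to frac_bits fractional bits. */
	ratio = ((unsigned __int128)requested_hz << frac_bits) / host_hz;

	/* Frequency the guest observes after the truncated multiply. */
	return (uint64_t)(((unsigned __int128)host_hz * ratio) >> frac_bits);
}
```

In this model a VMM asking for 3000.5 MHz on a 3000 MHz host with a 250 ppm tolerance simply gets 3000 MHz, and a scaled request comes out at or slightly below what was asked for. Either way only the kernel knows the answer.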
I am not *entirely* averse to ripping that out, and doing things
differently. We would have to:
• Declare that exposing the TSC frequency to guests via CPUID is
nonsense on crappy old hardware where it actually varies at runtime
anyway. Partly because the guest will only check it at boot, and
partly because that TSC has to be advertised as unreliable anyway.
• Add a new API for the VMM to extract the actual effective frequency,
only on 'sane' hosts.
• Declare that we don't care that it's strictly an ABI change, and
VMMs which used to just populate the leaf and let KVM fill it in
for Xen guests now *have* to use the new API.
I'm actually OK with that, even the last one, because I've just noticed
that KVM is updating the *wrong* Xen leaf. 0x40000x03/2 EAX is supposed
to be the *host* TSC frequency, and the guest frequency is supposed to
be in 0x40000x03/0 ECX. And Linux as a Xen guest doesn't even use it
anyway, AFAICT.
Paul, it was your code originally; are you happy with removing it?
As we look at a new API for exposing the precise TSC scaling, I'd like
to make sure it works for VMClock (for which I am still working on
writing up proper documentation but in the meantime
https://gitlab.com/qemu-project/qemu/-/commit/3634039b93cc5 serves as a
decent reference). In short, VMClock allows the hypervisor to provide a
pvclock-style clock with microsecond accuracy to its guests, solving
the problems of
• All guests using external precision clocks to repeat the *same* work
of calibrating the *same* underlying oscillator
• ...badly, experiencing imprecision due to steal time as they do so.
• Live migration completely disrupting the clock and causing actual
data corruption, where precision timestamps are required for e.g.
distributed database coherency.
In its initial implementation, the VMClock in QEMU (and EC2) only
resolves the last issue, by advertising a 'disruption' on live
migration so that the guest can know that its clock is hosed until it
manages to resync.
Now I'm trying to plumb in the actual clock information from the host,
so that migrated guests can have precision time from the moment they
arrive on the new host. There are two major use cases to consider...
1. Dedicated hosting setups will calibrate the host TSC *directly*
against the external clock, and maybe feed it into the host kernel's
adjtimex() almost as an afterthought. So userspace will be able to
produce a system-wide VMClock data structure which can then be
advertised to each guest with the appropriate TSC offset and scaling
factor.
For this I think we want the *actual* scaling factor to be exposed
by KVM to userspace, not just the resulting estimated frequency.
Unless we allow userspace just to provide the host's view and let
KVM apply the offset/scale. Which maybe doesn't make as much sense
in *this* setup but we might end up wanting that anyway for...
2. More traditional hosts just running Chrony/ntpd to feed the host's
CLOCK_REALTIME with adjtimex(). For this case, there is probably
more of an argument for letting the kernel generate the vmclock
data — KVM already has the gtod notifier which is invoked every time
the apparent frequency changes, and userspace has none of what it
needs.
So... if we need KVM to be able to apply the per-VM scaling/offset
because we're going to do it all in-kernel in that second case, then we
might as well let KVM apply the per-VM scaling/offset even in the
dedicated hosting case. And then the API we use for the original CPUID
problem only needs to expose the actual effective frequency.
But if we want userspace to do more for itself, we'd need to expose the
scaling factors directly. I think...
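For context on "scaling factor" versus "estimated frequency", this is a sketch of pvclock-style arithmetic: the clock is described by a multiplier and shift, and a frequency can only be recovered from that pair by an integer division that discards low bits. The helper names are illustrative, and the code follows the usual pvclock convention (pre-shift the delta, multiply by a 32-bit fraction, shift right by 32) as an assumption rather than quoting any particular implementation.

```c
#include <stdint.h>

/* Convert a TSC delta to nanoseconds using a pvclock-style mul/shift pair. */
static uint64_t pvclock_delta_to_ns(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* 64 x 32 bit multiply, keeping the bits above the 32-bit fraction. */
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/*
 * Recover an approximate TSC frequency in kHz from the same pair.
 * mul must be non-zero. The division truncates, which is where the
 * precision is lost.
 */
static uint64_t pvclock_mul_shift_to_khz(uint32_t mul, int8_t shift)
{
	uint64_t khz = (1000000ULL << 32) / mul;

	if (shift < 0)
		khz <<= -shift;
	else
		khz >>= shift;
	return khz;
}
```

With mul = 0x80000000 and shift = 0 the pair describes a 2 GHz TSC exactly; for frequencies that are not such round ratios the recovered kHz value is only an estimate, which is the argument for exposing the scaling factors themselves if userspace is to build the VMClock data.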
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-29 9:50 ` David Woodhouse
@ 2025-08-29 11:08 ` Durrant, Paul
2025-08-29 11:19 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Durrant, Paul @ 2025-08-29 11:08 UTC (permalink / raw)
To: David Woodhouse, Sean Christopherson, Durrant, Paul,
Griffoul, Fred
Cc: Colin Percival, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86@kernel.org, H. Peter Anvin,
Vitaly Kuznetsov, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Graf (AWS), Alexander, Ajay Kaher,
Alexey Makhalov
On 29/08/2025, 10:51, "David Woodhouse" <dwmw2@infradead.org> wrote:
[snip]
> • Declare that we don't care that it's strictly an ABI change, and
> VMMs which used to just populate the leaf and let KVM fill it in
> for Xen guests now *have* to use the new API.
>
>
> I'm actually OK with that, even the last one, because I've just noticed
> that KVM is updating the *wrong* Xen leaf. 0x40000x03/2 EAX is supposed
> to be the *host* TSC frequency, and the guest frequency is supposed to
> be in 0x40000x03/0 ECX. And Linux as a Xen guest doesn't even use it
> anyway, AFAICT
>
> Paul, it was your code originally; are you happy with removing it?
Yes, if it is incorrect then please fix it. I must have become confused whilst reading the original Xen code.
Cheers,
Paul
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-29 11:08 ` Durrant, Paul
@ 2025-08-29 11:19 ` David Woodhouse
2025-08-29 20:36 ` Sean Christopherson
0 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2025-08-29 11:19 UTC (permalink / raw)
To: Durrant, Paul, Sean Christopherson, Griffoul, Fred
Cc: Colin Percival, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86@kernel.org, H. Peter Anvin,
Vitaly Kuznetsov, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Graf (AWS), Alexander, Ajay Kaher,
Alexey Makhalov
On Fri, 2025-08-29 at 11:08 +0000, Durrant, Paul wrote:
> On 29/08/2025, 10:51, "David Woodhouse" <dwmw2@infradead.org> wrote:
> [snip]
> > • Declare that we don't care that it's strictly an ABI change, and
> > VMMs which used to just populate the leaf and let KVM fill it in
> > for Xen guests now *have* to use the new API.
> >
> >
> > I'm actually OK with that, even the last one, because I've just noticed
> > that KVM is updating the *wrong* Xen leaf. 0x40000x03/2 EAX is supposed
> > to be the *host* TSC frequency, and the guest frequency is supposed to
> > be in 0x40000x03/0 ECX. And Linux as a Xen guest doesn't even use it
> > anyway, AFAICT
> >
> > Paul, it was your code originally; are you happy with removing it?
>
> Yes, if it is incorrect then please fix it. I must have become
> confused whilst reading the original Xen code.
The proposal is not to *fix* it but just to rip it out entirely and
provide userspace with some way of knowing the effective TSC frequency.
This does mean userspace would have to set the vCPU's TSC frequency and
then query the kernel before setting up its CPUID. And in the absence
of scaling, this KVM API would report the hardware TSC frequency. I
guess the API would have to return -EHARDWARETOOSTUPID if the TSC
frequency *isn't* the same across all CPUs and all power states, etc.
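The VMM-side ordering described here (set the frequency, query the effective value, then build CPUID) could be sketched roughly as below. Everything in it is hypothetical: the query API does not exist yet, so `effective_khz` simply stands for whatever it would return, with 0 standing in for the error case, and `struct cpuid_entry` is a minimal stand-in rather than the real KVM UAPI structure.

```c
#include <stddef.h>
#include <stdint.h>

#define TIMING_LEAF 0x40000010u

/* Minimal stand-in for a CPUID entry; field names are illustrative. */
struct cpuid_entry {
	uint32_t function, index;
	uint32_t eax, ebx, ecx, edx;
};

/*
 * Fill in the timing leaf once the effective frequency is known.
 * effective_khz is the result of the hypothetical query; 0 means the
 * kernel could not promise a constant guest TSC frequency.
 * Returns 0 on success, -1 if the leaf is absent or no frequency is known.
 */
static int fill_timing_leaf(struct cpuid_entry *ents, size_t n,
			    uint32_t effective_khz)
{
	if (effective_khz == 0)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (ents[i].function == TIMING_LEAF) {
			ents[i].eax = effective_khz; /* TSC frequency in kHz */
			return 0;
		}
	}
	return -1;
}
```

The point of the sketch is the dependency: the CPUID table cannot be finalised until after the TSC frequency has been set and read back, which changes the order in which existing VMMs do their vCPU setup.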
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-29 11:19 ` David Woodhouse
@ 2025-08-29 20:36 ` Sean Christopherson
2025-09-02 8:31 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-08-29 20:36 UTC (permalink / raw)
To: David Woodhouse
Cc: Paul Durrant, Fred Griffoul, Colin Percival, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H. Peter Anvin, Vitaly Kuznetsov,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Graf (AWS), Alexander, Ajay Kaher, Alexey Makhalov
On Fri, Aug 29, 2025, David Woodhouse wrote:
> On Fri, 2025-08-29 at 11:08 +0000, Durrant, Paul wrote:
> > On 29/08/2025, 10:51, "David Woodhouse" <dwmw2@infradead.org> wrote:
> > [snip]
> > > • Declare that we don't care that it's strictly an ABI change, and
> > > VMMs which used to just populate the leaf and let KVM fill it in
> > > for Xen guests now *have* to use the new API.
> > >
> > >
> > > I'm actually OK with that, even the last one, because I've just noticed
> > > that KVM is updating the *wrong* Xen leaf. 0x40000x03/2 EAX is supposed
> > > to be the *host* TSC frequency, and the guest frequency is supposed to
> > > be in 0x40000x03/0 ECX. And Linux as a Xen guest doesn't even use it
> > > anyway, AFAICT
> > >
> > > Paul, it was your code originally; are you happy with removing it?
> >
> > Yes, if it is incorrect then please fix it. I must have become
> > confused whilst reading the original Xen code.
>
> The proposal is not to *fix* it but just to rip it out entirely and
> provide userspace with some way of knowing the effective TSC frequency.
>
> This does mean userspace would have to set the vCPU's TSC frequency and
> then query the kernel before setting up its CPUID. And in the absence
> of scaling, this KVM API would report the hardware TSC frequency.
Reporting the hardware TSC frequency on CPUs without scaling seems all kinds of
wrong (which is another reason I don't like KVM shoving in the state). Of course,
reporting the frequency KVM is trying to provide isn't great either, as the guest
will definitely observe something in between those two.
> I guess the API would have to return -EHARDWARETOOSTUPID if the TSC frequency
> *isn't* the same across all CPUs and all power states, etc.
What if KVM advertises the flag in KVM_GET_SUPPORTED_CPUID if and only if the
TSC will be constant from the guest's perspective? TSC scaling has been supported
by AMD and Intel for ~10 years, it doesn't seem at all unreasonable to restrict
the feature to somewhat modern hardware. And if userspace or the admin knows
better than KVM, then userspace can always ignore KVM and report the frequency
anyways.
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-08-29 20:36 ` Sean Christopherson
@ 2025-09-02 8:31 ` David Woodhouse
2025-09-02 17:49 ` Sean Christopherson
0 siblings, 1 reply; 21+ messages in thread
From: David Woodhouse @ 2025-09-02 8:31 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paul Durrant, Fred Griffoul, Colin Percival, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H. Peter Anvin, Vitaly Kuznetsov,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Graf (AWS), Alexander, Ajay Kaher, Alexey Makhalov
On Fri, 2025-08-29 at 13:36 -0700, Sean Christopherson wrote:
>
> > This does mean userspace would have to set the vCPU's TSC frequency and
> > then query the kernel before setting up its CPUID. And in the absence
> > of scaling, this KVM API would report the hardware TSC frequency.
>
> Reporting the hardware TSC frequency on CPUs without scaling seems all kinds of
> wrong (which is another reason I don't like KVM shoving in the state). Of course,
> reporting the frequency KVM is trying to provide isn't great either, as the guest
> will definitely observe something in between those two.
Yes, on CPUs that don't support TSC scaling, we should not attempt to
advertise a frequency.
Where I said 'in the absence of scaling' I meant modern CPUs but where
the VMM just didn't ask for TSC scaling.
> > I guess the API would have to return -EHARDWARETOOSTUPID if the TSC frequency
> > *isn't* the same across all CPUs and all power states, etc.
>
> What if KVM advertises the flag in KVM_GET_SUPPORTED_CPUID if and only if the
> TSC will be constant from the guest's perspective? TSC scaling has been supported
> by AMD and Intel for ~10 years, it doesn't seem at all unreasonable to restrict
> the feature to somewhat modern hardware. And if userspace or the admin knows
> better than KVM, then userspace can always ignore KVM and report the frequency
> anyways.
I hadn't put it in KVM_GET_SUPPORTED_CPUID; I was following the lead of
the existing Xen leaf support, where *if* userspace provides that leaf,
KVM will dynamically correct the values in it.
The problem is that KVM_GET_SUPPORTED_CPUID is a *system* ioctl on the
bare /dev/kvm device, isn't it? So even if a VMM has set the TSC
frequency VM-wide with KVM_SET_TSC_KHZ instead of doing it the old
per-vCPU way, how can it get the results for a specific VM?
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-09-02 8:31 ` David Woodhouse
@ 2025-09-02 17:49 ` Sean Christopherson
2025-09-02 18:23 ` David Woodhouse
0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-09-02 17:49 UTC (permalink / raw)
To: David Woodhouse
Cc: Paul Durrant, Fred Griffoul, Colin Percival, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H. Peter Anvin, Vitaly Kuznetsov,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Graf (AWS), Alexander, Ajay Kaher, Alexey Makhalov
On Tue, Sep 02, 2025, David Woodhouse wrote:
> On Fri, 2025-08-29 at 13:36 -0700, Sean Christopherson wrote:
> >
> > > This does mean userspace would have to set the vCPU's TSC frequency and
> > > then query the kernel before setting up its CPUID. And in the absence
> > > of scaling, this KVM API would report the hardware TSC frequency.
> >
> > Reporting the hardware TSC frequency on CPUs without scaling seems all kinds of
> > wrong (which is another reason I don't like KVM shoving in the state). Of course,
> > reporting the frequency KVM is trying to provide isn't great either, as the guest
> > will definitely observe something in between those two.
>
> Yes, on CPUs that don't support TSC scaling, we should not attempt to
> advertise a frequency.
>
> Where I said 'in the absence of scaling' I meant modern CPUs but where
> the VMM just didn't ask for TSC scaling.
>
> > > I guess the API would have to return -EHARDWARETOOSTUPID if the TSC frequency
> > > *isn't* the same across all CPUs and all power states, etc.
> >
> > What if KVM advertises the flag in KVM_GET_SUPPORTED_CPUID if and only if the
> > TSC will be constant from the guest's perspective? TSC scaling has been supported
> > by AMD and Intel for ~10 years, it doesn't seem at all unreasonable to restrict
> > the feature to somewhat modern hardware. And if userspace or the admin knows
> > better than KVM, then userspace can always ignore KVM and report the frequency
> > anyways.
>
> I hadn't put it in KVM_GET_SUPPORTED_CPUID; I was following the lead of
> the existing Xen leaf support, where *if* userspace provides that leaf,
> KVM will dynamically correct the values in it.
>
> The problem is that KVM_GET_SUPPORTED_CPUID is a *system* ioctl on the
> bare /dev/kvm device, isn't it?
Yep.
> So even if a VMM has set the TSC frequency VM-wide with KVM_SET_TSC_KHZ
> instead of doing it the old per-vCPU way, how can it get the results for a
> specific VM?
I don't see any need for userspace to query per-VM support. What I'm proposing
is that KVM advertise the feature if the bare metal TSC is constant and the CPU
supports TSC scaling. Beyond that, _KVM_ doesn't need to do anything to ensure
the guest sees a constant frequency, it's userspace's responsibility to provide
a sane configuration.
And strictly speaking, CPUID is per-CPU, i.e. it's architecturally legal to set
per-vCPU frequencies and then advertise a different frequency in CPUID for each
vCPU. That's all but guaranteed to break guests as most/all kernels assume that
TSC operates at the same frequency on all CPUs, but as above, that's userspace's
responsibility to not screw up.
* Re: [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host
2025-09-02 17:49 ` Sean Christopherson
@ 2025-09-02 18:23 ` David Woodhouse
0 siblings, 0 replies; 21+ messages in thread
From: David Woodhouse @ 2025-09-02 18:23 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paul Durrant, Fred Griffoul, Colin Percival, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H. Peter Anvin, Vitaly Kuznetsov,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Graf (AWS), Alexander, Ajay Kaher, Alexey Makhalov
On Tue, 2025-09-02 at 10:49 -0700, Sean Christopherson wrote:
>
> > So even if a VMM has set the TSC frequency VM-wide with KVM_SET_TSC_KHZ
> > instead of doing it the old per-vCPU way, how can it get the results for a
> > specific VM?
>
> I don't see any need for userspace to query per-VM support. What I'm proposing
> is that KVM advertise the feature if the bare metal TSC is constant and the CPU
> supports TSC scaling. Beyond that, _KVM_ doesn't need to do anything to ensure
> the guest sees a constant frequency, it's userspace's responsibility to provide
> a sane configuration.
>
> And strictly speaking, CPUID is per-CPU, i.e. it's architecturally legal to set
> per-vCPU frequencies and then advertise a different frequency in CPUID for each
> vCPU. That's all but guaranteed to break guests as most/all kernels assume that
> TSC operates at the same frequency on all CPUs, but as above, that's userspace's
> responsibility to not screw up.
Sure, but doesn't that make this whole thing orthogonal to the original
problem being solved? Because userspace still doesn't *know* the actual
effective TSC frequency, whether it's scaled or not.
Or are you suggesting that we add the leaf (with unscaled values) in
KVM_GET_SUPPORTED_CPUID and *also* 'correct' the values if userspace
does pass that leaf to its guests, as I had originally proposed?
Thread overview: 21+ messages
2025-08-16 10:09 [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host David Woodhouse
2025-08-16 10:10 ` [PATCH v2 1/3] KVM: x86: Restore caching of KVM CPUID base David Woodhouse
2025-08-16 10:10 ` [PATCH v2 2/3] KVM: x86: Provide TSC frequency in "generic" timing infomation CPUID leaf David Woodhouse
2025-08-16 10:10 ` [PATCH v2 3/3] x86/kvm: Obtain TSC frequency from CPUID if present David Woodhouse
2025-08-21 16:26 ` [PATCH v2 0/3] Support "generic" CPUID timing leaf as KVM guest and host Sean Christopherson
2025-08-21 17:37 ` David Woodhouse
2025-08-21 19:27 ` Sean Christopherson
2025-08-21 20:42 ` David Woodhouse
2025-08-21 20:48 ` Sean Christopherson
2025-08-21 21:10 ` David Woodhouse
2025-08-22 1:57 ` Colin Percival
2025-08-26 19:30 ` Sean Christopherson
2025-08-27 9:30 ` David Woodhouse
2025-08-28 23:40 ` Sean Christopherson
2025-08-29 9:50 ` David Woodhouse
2025-08-29 11:08 ` Durrant, Paul
2025-08-29 11:19 ` David Woodhouse
2025-08-29 20:36 ` Sean Christopherson
2025-09-02 8:31 ` David Woodhouse
2025-09-02 17:49 ` Sean Christopherson
2025-09-02 18:23 ` David Woodhouse