public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression
@ 2024-12-11  1:32 Sean Christopherson
  2024-12-11  1:32 ` [PATCH 1/5] KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init Sean Christopherson
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Sean Christopherson @ 2024-12-11  1:32 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson

Fix a hilarious/revolting performance regression (relative to older CPU
generations) in xstate_required_size() that pops up due to CPUID _in the
host_ taking 3x-4x longer on Emerald Rapids than Skylake.

The issue rears its head on nested virtualization transitions, as KVM
(unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
multiple times per transition.  And calculating XSAVE sizes, especially
for vCPUs with a decent number of supported XSAVE features and compacted
format support, can add up to thousands of cycles.

To fix the immediate issue, cache the CPUID output at kvm.ko load.  The
information is static for a given CPU, i.e. doesn't need to be re-read
from hardware every time.  That's patch 1, and eliminates pretty much all
of the meaningful overhead.

Patch 2 is a minor cleanup to try and make the code easier to read.

Patch 3 fixes a wart in CPUID emulation where KVM does a moderately
expensive (though cheap compared to CPUID, lol) MSR lookup that is likely
unnecessary for the vast majority of VMs.

Patches 4 and 5 address the problem of KVM doing runtime CPUID updates
multiple times for each nested VM-Enter and VM-Exit, at least half of
which are completely unnecessary (CPUID is a mandatory intercept on both
Intel and AMD, so ensuring dynamic CPUID bits are up-to-date while running
L2 is pointless).  The idea is fairly simple: lazily do the CPUID updates
by deferring them until something might actually consume the relevant
guest bits.

This applies on the cpu_caps overhaul[*], as patches 3-5 would otherwise
conflict, and I didn't want to think about how safe patch 5 is without
the rework.

That said, patch 1, which is the most important and tagged for stable,
applies cleanly on 6.1, 6.6, and 6.12 (and the backport for 5.15 and
earlier shouldn't be too horrific).

Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
or ucode bug.  For a number of leaves, KVM can emulate CPUID faster than
the CPU can execute the instruction.  I.e. the entire VM-Exit => emulate
=> VM-Enter sequence takes less time than executing CPUID on bare metal.
Which seems absolutely insane.  But, it shouldn't impact guest performance,
so that's someone else's problem, at least for now.

[*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@google.com

Sean Christopherson (5):
  KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
  KVM: x86: Use for-loop to iterate over XSTATE size entries
  KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM
    or HLE
  KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature
    bit
  KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID
    emulation

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/cpuid.c            | 63 ++++++++++++++++++++++++---------
 arch/x86/kvm/cpuid.h            | 10 +++++-
 arch/x86/kvm/lapic.c            |  2 +-
 arch/x86/kvm/smm.c              |  2 +-
 arch/x86/kvm/svm/sev.c          |  2 +-
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  2 +-
 arch/x86/kvm/x86.c              | 22 +++++++++---
 9 files changed, 78 insertions(+), 28 deletions(-)


base-commit: 06a4919e729761be751366716c00fb8c3f51d37e
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
@ 2024-12-11  1:32 ` Sean Christopherson
  2024-12-11  1:32 ` [PATCH 2/5] KVM: x86: Use for-loop to iterate over XSTATE size entries Sean Christopherson
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2024-12-11  1:32 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson

Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initialization to
avoid the overhead of CPUID during runtime.  The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves do not depend on XCR0 or XSS values, i.e.
are constant for a given CPU, and thus can be cached during module load.

On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values.  The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.

As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:

  SKX 11650
  ICX 22350
  EMR 28850

Using cached values, the latency drops to:

  SKX 6850
  ICX 9000
  EMR 7900

The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR.  The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leaves is significantly slower).

 SKX:
  CPUID.0xD.2  = 348 cycles
  CPUID.0xD.3  = 400 cycles
  CPUID.0xD.4  = 276 cycles
  CPUID.0xD.5  = 236 cycles
  <other sub-leaves are similar>

 EMR:
  CPUID.0xD.2  = 1138 cycles
  CPUID.0xD.3  = 1362 cycles
  CPUID.0xD.4  = 1068 cycles
  CPUID.0xD.5  = 910 cycles
  CPUID.0xD.6  = 914 cycles
  CPUID.0xD.7  = 1350 cycles
  CPUID.0xD.8  = 734 cycles
  CPUID.0xD.9  = 766 cycles
  CPUID.0xD.10 = 732 cycles
  CPUID.0xD.11 = 718 cycles
  CPUID.0xD.12 = 734 cycles
  CPUID.0xD.13 = 1700 cycles
  CPUID.0xD.14 = 1126 cycles
  CPUID.0xD.15 = 898 cycles
  CPUID.0xD.16 = 716 cycles
  CPUID.0xD.17 = 748 cycles
  CPUID.0xD.18 = 776 cycles

Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandatory
intercept on both Intel and AMD.  E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2.  That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.

Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/cpuid.c | 31 ++++++++++++++++++++++++++-----
 arch/x86/kvm/cpuid.h |  1 +
 arch/x86/kvm/x86.c   |  2 ++
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 572dfa7e206e..edef30359c19 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -36,6 +36,26 @@
 u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
 EXPORT_SYMBOL_GPL(kvm_cpu_caps);
 
+struct cpuid_xstate_sizes {
+	u32 eax;
+	u32 ebx;
+	u32 ecx;
+};
+
+static struct cpuid_xstate_sizes xstate_sizes[XFEATURE_MAX] __ro_after_init;
+
+void __init kvm_init_xstate_sizes(void)
+{
+	u32 ign;
+	int i;
+
+	for (i = XFEATURE_YMM; i < ARRAY_SIZE(xstate_sizes); i++) {
+		struct cpuid_xstate_sizes *xs = &xstate_sizes[i];
+
+		cpuid_count(0xD, i, &xs->eax, &xs->ebx, &xs->ecx, &ign);
+	}
+}
+
 u32 xstate_required_size(u64 xstate_bv, bool compacted)
 {
 	int feature_bit = 0;
@@ -44,14 +64,15 @@ u32 xstate_required_size(u64 xstate_bv, bool compacted)
 	xstate_bv &= XFEATURE_MASK_EXTEND;
 	while (xstate_bv) {
 		if (xstate_bv & 0x1) {
-		        u32 eax, ebx, ecx, edx, offset;
-		        cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
+			struct cpuid_xstate_sizes *xs = &xstate_sizes[feature_bit];
+			u32 offset;
+
 			/* ECX[1]: 64B alignment in compacted form */
 			if (compacted)
-				offset = (ecx & 0x2) ? ALIGN(ret, 64) : ret;
+				offset = (xs->ecx & 0x2) ? ALIGN(ret, 64) : ret;
 			else
-				offset = ebx;
-			ret = max(ret, offset + eax);
+				offset = xs->ebx;
+			ret = max(ret, offset + xs->eax);
 		}
 
 		xstate_bv >>= 1;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 3d69a0ef8268..67d80aa72d50 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -31,6 +31,7 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
 bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
 	       u32 *ecx, u32 *edx, bool exact_only);
 
+void __init kvm_init_xstate_sizes(void);
 u32 xstate_required_size(u64 xstate_bv, bool compacted);
 
 int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc4563fb07d1..320764e5f798 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13982,6 +13982,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
 
 static int __init kvm_x86_init(void)
 {
+	kvm_init_xstate_sizes();
+
 	kvm_mmu_x86_module_init();
 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
 	return 0;
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/5] KVM: x86: Use for-loop to iterate over XSTATE size entries
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
  2024-12-11  1:32 ` [PATCH 1/5] KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init Sean Christopherson
@ 2024-12-11  1:32 ` Sean Christopherson
  2024-12-11  1:33 ` [PATCH 3/5] KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE Sean Christopherson
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2024-12-11  1:32 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson

Rework xstate_required_size() to use a for-loop and continue, to make it
more obvious that the xstate_sizes[] lookups are indeed correctly bounded,
and to make it (hopefully) easier to understand that the loop is iterating
over supported XSAVE features.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/cpuid.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index edef30359c19..f73af4a98c35 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -58,25 +58,24 @@ void __init kvm_init_xstate_sizes(void)
 
 u32 xstate_required_size(u64 xstate_bv, bool compacted)
 {
-	int feature_bit = 0;
 	u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;
+	int i;
 
 	xstate_bv &= XFEATURE_MASK_EXTEND;
-	while (xstate_bv) {
-		if (xstate_bv & 0x1) {
-			struct cpuid_xstate_sizes *xs = &xstate_sizes[feature_bit];
-			u32 offset;
+	for (i = XFEATURE_YMM; i < ARRAY_SIZE(xstate_sizes) && xstate_bv; i++) {
+		struct cpuid_xstate_sizes *xs = &xstate_sizes[i];
+		u32 offset;
 
-			/* ECX[1]: 64B alignment in compacted form */
-			if (compacted)
-				offset = (xs->ecx & 0x2) ? ALIGN(ret, 64) : ret;
-			else
-				offset = xs->ebx;
-			ret = max(ret, offset + xs->eax);
-		}
+		if (!(xstate_bv & BIT_ULL(i)))
+			continue;
 
-		xstate_bv >>= 1;
-		feature_bit++;
+		/* ECX[1]: 64B alignment in compacted form */
+		if (compacted)
+			offset = (xs->ecx & 0x2) ? ALIGN(ret, 64) : ret;
+		else
+			offset = xs->ebx;
+		ret = max(ret, offset + xs->eax);
+		xstate_bv &= ~BIT_ULL(i);
 	}
 
 	return ret;
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/5] KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
  2024-12-11  1:32 ` [PATCH 1/5] KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init Sean Christopherson
  2024-12-11  1:32 ` [PATCH 2/5] KVM: x86: Use for-loop to iterate over XSTATE size entries Sean Christopherson
@ 2024-12-11  1:33 ` Sean Christopherson
  2024-12-13 14:39   ` Jim Mattson
  2024-12-11  1:33 ` [PATCH 4/5] KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit Sean Christopherson
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2024-12-11  1:33 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson

When emulating CPUID, retrieve MSR_IA32_TSX_CTRL.TSX_CTRL_CPUID_CLEAR if
and only if RTM and/or HLE feature bits need to be cleared.  Getting the
MSR value is unnecessary if neither bit is set, and avoiding the lookup
saves ~80 cycles for vCPUs without RTM or HLE.

Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/cpuid.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f73af4a98c35..7f5fa6665969 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1998,7 +1998,8 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
 		*edx = entry->edx;
 		if (function == 7 && index == 0) {
 			u64 data;
-		        if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
+			if ((*ebx & (feature_bit(RTM) | feature_bit(HLE))) &&
+			    !__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
 			    (data & TSX_CTRL_CPUID_CLEAR))
 				*ebx &= ~(feature_bit(RTM) | feature_bit(HLE));
 		} else if (function == 0x80000007) {
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/5] KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
                   ` (2 preceding siblings ...)
  2024-12-11  1:33 ` [PATCH 3/5] KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE Sean Christopherson
@ 2024-12-11  1:33 ` Sean Christopherson
  2024-12-11  1:33 ` [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation Sean Christopherson
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2024-12-11  1:33 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson

Rework MONITOR/MWAIT emulation to query X86_FEATURE_MWAIT if and only if
the MISC_ENABLE_NO_MWAIT quirk is enabled, in which case MWAIT is not a
dynamic, KVM-controlled CPUID feature.  KVM's funky ABI for that quirk is
to emulate MONITOR/MWAIT as nops if userspace sets MWAIT in guest CPUID.

For the case where KVM owns the MWAIT feature bit, check MISC_ENABLES
itself, i.e. check the actual control, not its reflection in guest CPUID.

Avoiding consumption of dynamic CPUID features will allow KVM to defer
runtime CPUID updates until kvm_emulate_cpuid(), i.e. until the updates
become visible to the guest.  Alternatively, KVM could play other games
with runtime CPUID updates, e.g. by precisely specifying which feature
bits to update, but doing so adds non-trivial complexity and doesn't solve
the underlying issue of unnecessary updates causing meaningful overhead
for nested virtualization roundtrips.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 320764e5f798..dc8829712edd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2080,10 +2080,20 @@ EXPORT_SYMBOL_GPL(kvm_handle_invalid_op);
 
 static int kvm_emulate_monitor_mwait(struct kvm_vcpu *vcpu, const char *insn)
 {
-	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS) &&
-	    !guest_cpu_cap_has(vcpu, X86_FEATURE_MWAIT))
+	bool enabled;
+
+	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS))
+		goto emulate_as_nop;
+
+	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT))
+		enabled = guest_cpu_cap_has(vcpu, X86_FEATURE_MWAIT);
+	else
+		enabled = vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_MWAIT;
+
+	if (!enabled)
 		return kvm_handle_invalid_op(vcpu);
 
+emulate_as_nop:
 	pr_warn_once("%s instruction emulated as NOP!\n", insn);
 	return kvm_emulate_as_nop(vcpu);
 }
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
                   ` (3 preceding siblings ...)
  2024-12-11  1:33 ` [PATCH 4/5] KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit Sean Christopherson
@ 2024-12-11  1:33 ` Sean Christopherson
  2025-11-28 11:32   ` Igor Mammedov
  2024-12-11 16:42 ` [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2024-12-11  1:33 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson

Defer runtime CPUID updates until the next non-faulting CPUID emulation
or KVM_GET_CPUID2, which are the only paths in KVM that consume the
dynamic entries.  Deferring the updates is especially beneficial to
nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
changes, not to mention the updates don't need to be realized while L2 is
active, as CPUID is a mandatory intercept on both Intel and AMD.

Deferring CPUID updates shaves several hundred cycles from nested VMX
roundtrips, as measured from L2 executing CPUID in a tight loop:

  SKX 6850 => 6450
  ICX 9000 => 8800
  EMR 7900 => 7700

Alternatively, KVM could update only the CPUID leaves that are affected
by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
but that adds non-trivial complexity and doesn't solve the underlying
problem of nested transitions potentially changing both XCR0 and XSS, on
both nested VM-Enter and VM-Exit.

KVM could also skip updates entirely while L2 is active, because again
CPUID is a mandatory intercept.  However, simply skipping updates if L2
is active is *very* subtly dangerous and complex.  Most KVM updates are
triggered by changes to the current vCPU state, which may be L2 state,
whereas performing updates only for L1 would require detecting changes
to L1 state.  KVM would need to either track relevant L1 state, or defer
runtime CPUID updates until the next nested VM-Exit.  The former is ugly
and complex, while the latter comes with similar dangers to deferring all
CPUID updates, and would only address the nested VM-Enter path.

To guard against using stale data, disallow querying dynamic CPUID feature
bits, i.e. features that KVM updates at runtime, via a compile-time
assertion in guest_cpu_cap_has().  Exempt MWAIT from the rule, as the
MISC_ENABLE_NO_MWAIT quirk means that MWAIT is _conditionally_ a dynamic CPUID
feature.

Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
CPUID in kvm_emulate_monitor_mwait(), but there's no obvious advantage to
doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
of worms.  MONITOR/MWAIT can't be virtualized (for a reasonable definition),
and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
wrong for other reasons.

Beyond the aforementioned feature bits, the only other dynamic CPUID
(sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
CPUID entries in KVM is all but guaranteed to be a bug.  The layout for an
actual XSAVE buffer depends on the format (compacted or not) and
potentially the features that are actually enabled.  E.g. see the logic in
fpstate_clear_xstate_component() needed to poke into the guest's effective
XSAVE state to clear MPX state on INIT.  KVM does consume
CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
but not EBX, which is the only dynamic output register in the leaf.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/cpuid.c            | 12 ++++++++++--
 arch/x86/kvm/cpuid.h            |  9 ++++++++-
 arch/x86/kvm/lapic.c            |  2 +-
 arch/x86/kvm/smm.c              |  2 +-
 arch/x86/kvm/svm/sev.c          |  2 +-
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  2 +-
 arch/x86/kvm/x86.c              |  6 +++---
 9 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 81ce8cd5814a..23cc5c10060e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -871,6 +871,7 @@ struct kvm_vcpu_arch {
 
 	int cpuid_nent;
 	struct kvm_cpuid_entry2 *cpuid_entries;
+	bool cpuid_dynamic_bits_dirty;
 	bool is_amd_compatible;
 
 	/*
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7f5fa6665969..54ba1a75b779 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -195,6 +195,7 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu)
 }
 
 static u32 kvm_apply_cpuid_pv_features_quirk(struct kvm_vcpu *vcpu);
+static void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
 
 /* Check whether the supplied CPUID data is equal to what is already set for the vCPU. */
 static int kvm_cpuid_check_equal(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
@@ -299,10 +300,12 @@ static __always_inline void kvm_update_feature_runtime(struct kvm_vcpu *vcpu,
 	guest_cpu_cap_change(vcpu, x86_feature, has_feature);
 }
 
-void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
+static void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
 
+	vcpu->arch.cpuid_dynamic_bits_dirty = false;
+
 	best = kvm_find_cpuid_entry(vcpu, 1);
 	if (best) {
 		kvm_update_feature_runtime(vcpu, best, X86_FEATURE_OSXSAVE,
@@ -332,7 +335,6 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
 		     cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
 		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
 }
-EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
 
 static bool kvm_cpuid_has_hyperv(struct kvm_vcpu *vcpu)
 {
@@ -645,6 +647,9 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
 	if (cpuid->nent < vcpu->arch.cpuid_nent)
 		return -E2BIG;
 
+	if (vcpu->arch.cpuid_dynamic_bits_dirty)
+		kvm_update_cpuid_runtime(vcpu);
+
 	if (copy_to_user(entries, vcpu->arch.cpuid_entries,
 			 vcpu->arch.cpuid_nent * sizeof(struct kvm_cpuid_entry2)))
 		return -EFAULT;
@@ -1983,6 +1988,9 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
 	struct kvm_cpuid_entry2 *entry;
 	bool exact, used_max_basic = false;
 
+	if (vcpu->arch.cpuid_dynamic_bits_dirty)
+		kvm_update_cpuid_runtime(vcpu);
+
 	entry = kvm_find_cpuid_entry_index(vcpu, function, index);
 	exact = !!entry;
 
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 67d80aa72d50..d2884162a46a 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -11,7 +11,6 @@ extern u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
 void kvm_set_cpu_caps(void);
 
 void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
-void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
 struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
 						    u32 function, u32 index);
 struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
@@ -232,6 +231,14 @@ static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
 {
 	unsigned int x86_leaf = __feature_leaf(x86_feature);
 
+	/*
+	 * Except for MWAIT, querying dynamic feature bits is disallowed, so
+	 * that KVM can defer runtime updates until the next CPUID emulation.
+	 */
+	BUILD_BUG_ON(x86_feature == X86_FEATURE_APIC ||
+		     x86_feature == X86_FEATURE_OSXSAVE ||
+		     x86_feature == X86_FEATURE_OSPKE);
+
 	return vcpu->arch.cpu_caps[x86_leaf] & __feature_bit(x86_feature);
 }
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ae81ae27d534..cf74c87b8b3f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2585,7 +2585,7 @@ static void __kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value)
 	vcpu->arch.apic_base = value;
 
 	if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE)
-		kvm_update_cpuid_runtime(vcpu);
+		vcpu->arch.cpuid_dynamic_bits_dirty = true;
 
 	if (!apic)
 		return;
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index e0ab7df27b66..699e551ec93b 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -358,7 +358,7 @@ void enter_smm(struct kvm_vcpu *vcpu)
 			goto error;
 #endif
 
-	kvm_update_cpuid_runtime(vcpu);
+	vcpu->arch.cpuid_dynamic_bits_dirty = true;
 	kvm_mmu_reset_context(vcpu);
 	return;
 error:
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 09be12a44288..5e4581ed0ef1 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3274,7 +3274,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
 
 	if (kvm_ghcb_xcr0_is_valid(svm)) {
 		vcpu->arch.xcr0 = ghcb_get_xcr0(ghcb);
-		kvm_update_cpuid_runtime(vcpu);
+		vcpu->arch.cpuid_dynamic_bits_dirty = true;
 	}
 
 	/* Copy the GHCB exit information into the VMCB fields */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 07911ddf1efe..6a350cee2f6c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1936,7 +1936,7 @@ void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	vmcb_mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
 
 	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
-		kvm_update_cpuid_runtime(vcpu);
+		vcpu->arch.cpuid_dynamic_bits_dirty = true;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cf872d8691b5..b5f3c5628bfd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3516,7 +3516,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	vmcs_writel(GUEST_CR4, hw_cr4);
 
 	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
-		kvm_update_cpuid_runtime(vcpu);
+		vcpu->arch.cpuid_dynamic_bits_dirty = true;
 }
 
 void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc8829712edd..10b7d8c01e4d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1264,7 +1264,7 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 	vcpu->arch.xcr0 = xcr0;
 
 	if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
-		kvm_update_cpuid_runtime(vcpu);
+		vcpu->arch.cpuid_dynamic_bits_dirty = true;
 	return 0;
 }
 
@@ -3899,7 +3899,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (!guest_cpu_cap_has(vcpu, X86_FEATURE_XMM3))
 				return 1;
 			vcpu->arch.ia32_misc_enable_msr = data;
-			kvm_update_cpuid_runtime(vcpu);
+			vcpu->arch.cpuid_dynamic_bits_dirty = true;
 		} else {
 			vcpu->arch.ia32_misc_enable_msr = data;
 		}
@@ -3934,7 +3934,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (data & ~kvm_caps.supported_xss)
 			return 1;
 		vcpu->arch.ia32_xss = data;
-		kvm_update_cpuid_runtime(vcpu);
+		vcpu->arch.cpuid_dynamic_bits_dirty = true;
 		break;
 	case MSR_SMI_COUNT:
 		if (!msr_info->host_initiated)
-- 
2.47.0.338.g60cca15819-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
                   ` (4 preceding siblings ...)
  2024-12-11  1:33 ` [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation Sean Christopherson
@ 2024-12-11 16:42 ` Sean Christopherson
  2024-12-13 18:58 ` Paolo Bonzini
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2024-12-11 16:42 UTC (permalink / raw)
  To: Paolo Bonzini, kvm, linux-kernel, Jim Mattson

On Tue, Dec 10, 2024, Sean Christopherson wrote:
> CPUID is a mandatory intercept on both Intel and AMD

Jim pointed out that CPUID is NOT a mandatory intercept on AMD, and so while the
code is correct, the changelogs are not.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE
  2024-12-11  1:33 ` [PATCH 3/5] KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE Sean Christopherson
@ 2024-12-13 14:39   ` Jim Mattson
  0 siblings, 0 replies; 15+ messages in thread
From: Jim Mattson @ 2024-12-13 14:39 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Dec 10, 2024 at 5:33 PM Sean Christopherson <seanjc@google.com> wrote:
>
> When emulating CPUID, retrieve MSR_IA32_TSX_CTRL.TSX_CTRL_CPUID_CLEAR if
> and only if RTM and/or HLE feature bits need to be cleared.  Getting the
> MSR value is unnecessary if neither bit is set, and avoiding the lookup
> saves ~80 cycles for vCPUs without RTM or HLE.
>
> Cc: Jim Mattson <jmattson@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I'm not sure why you cc'd me :), but...

Reviewed-by: Jim Mattson <jmattson@google.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
                   ` (5 preceding siblings ...)
  2024-12-11 16:42 ` [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
@ 2024-12-13 18:58 ` Paolo Bonzini
  2024-12-16 20:04 ` Jim Mattson
  2025-02-15  0:50 ` Sean Christopherson
  8 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2024-12-13 18:58 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, linux-kernel, Jim Mattson

On 12/11/24 02:32, Sean Christopherson wrote:
> Fix a hilarious/revolting performance regression (relative to older CPU
> generations) in xstate_required_size() that pops up due to CPUID _in the
> host_ taking 3x-4x longer on Emerald Rapids than Skylake.
> 
> The issue rears its head on nested virtualization transitions, as KVM
> (unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
> multiple times per transition.  And calculating XSAVE sizes, especially
> for vCPUs with a decent number of supported XSAVE features and compacted
> format support, can add up to thousands of cycles.
> 
> To fix the immediate issue, cache the CPUID output at kvm.ko load.  The
> information is static for a given CPU, i.e. doesn't need to be re-read
> from hardware every time.  That's patch 1, and eliminates pretty much all
> of the meaningful overhead.

Queued this one, thanks!

Paolo


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
                   ` (6 preceding siblings ...)
  2024-12-13 18:58 ` Paolo Bonzini
@ 2024-12-16 20:04 ` Jim Mattson
  2024-12-16 23:16   ` Sean Christopherson
  2025-02-15  0:50 ` Sean Christopherson
  8 siblings, 1 reply; 15+ messages in thread
From: Jim Mattson @ 2024-12-16 20:04 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Dec 10, 2024 at 5:33 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Fix a hilarious/revolting performance regression (relative to older CPU
> generations) in xstate_required_size() that pops up due to CPUID _in the
> host_ taking 3x-4x longer on Emerald Rapids than Skylake.
>
> The issue rears its head on nested virtualization transitions, as KVM
> (unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
> multiple times per transition.  And calculating XSAVE sizes, especially
> for vCPUs with a decent number of supported XSAVE features and compacted
> format support, can add up to thousands of cycles.
>
> To fix the immediate issue, cache the CPUID output at kvm.ko load.  The
> information is static for a given CPU, i.e. doesn't need to be re-read
> from hardware every time.  That's patch 1, and eliminates pretty much all
> of the meaningful overhead.
>
> Patch 2 is a minor cleanup to try and make the code easier to read.
>
> Patch 3 fixes a wart in CPUID emulation where KVM does a moderately
> expensive (though cheap compared to CPUID, lol) MSR lookup that is likely
> unnecessary for the vast majority of VMs.
>
> Patches 4 and 5 address the problem of KVM doing runtime CPUID updates
> multiple times for each nested VM-Enter and VM-Exit, at least half of
> which are completely unnecessary (CPUID is a mandatory intercept on both
> Intel and AMD, so ensuring dynamic CPUID bits are up-to-date while running
> L2 is pointless).  The idea is fairly simple: lazily do the CPUID updates
> by deferring them until something might actually consume the relevant guest
> bits.
>
> This applies on the cpu_caps overhaul[*], as patches 3-5 would otherwise
> conflict, and I didn't want to think about how safe patch 5 is without
> the rework.
>
> That said, patch 1, which is the most important and tagged for stable,
> applies cleanly on 6.1, 6.6, and 6.12 (and the backport for 5.15 and
> earlier shouldn't be too horrific).
>
> Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
> or ucode bug.  For a number of leaves, KVM can emulate CPUID faster than
> the CPU can execute the instruction.  I.e. the entire VM-Exit => emulate
> => VM-Enter sequence takes less time than executing CPUID on bare metal.
> Which seems absolutely insane.  But, it shouldn't impact guest performance,
> so that's someone else's problem, at least for now.

Virtualization aside, perhaps Linux should set
MSR_FEATURE_ENABLES.CPUID_GP_ON_CPL_GT_0[bit 0] on EMR, and emulate
the CPUID instruction in the kernel?  :)

> [*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@google.com
>
> Sean Christopherson (5):
>   KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
>   KVM: x86: Use for-loop to iterate over XSTATE size entries
>   KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM
>     or HLE
>   KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature
>     bit
>   KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID
>     emulation
>
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c            | 63 ++++++++++++++++++++++++---------
>  arch/x86/kvm/cpuid.h            | 10 +++++-
>  arch/x86/kvm/lapic.c            |  2 +-
>  arch/x86/kvm/smm.c              |  2 +-
>  arch/x86/kvm/svm/sev.c          |  2 +-
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/vmx.c          |  2 +-
>  arch/x86/kvm/x86.c              | 22 +++++++++---
>  9 files changed, 78 insertions(+), 28 deletions(-)
>
>
> base-commit: 06a4919e729761be751366716c00fb8c3f51d37e
> --
> 2.47.0.338.g60cca15819-goog
>


* Re: [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression
  2024-12-16 20:04 ` Jim Mattson
@ 2024-12-16 23:16   ` Sean Christopherson
  0 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2024-12-16 23:16 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Dec 16, 2024, Jim Mattson wrote:
> On Tue, Dec 10, 2024 at 5:33 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Fix a hilarious/revolting performance regression (relative to older CPU
> > generations) in xstate_required_size() that pops up due to CPUID _in the
> > host_ taking 3x-4x longer on Emerald Rapids than Skylake.
> >
> > The issue rears its head on nested virtualization transitions, as KVM
> > (unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
> > multiple times per transition.  And calculating XSAVE sizes, especially
> > for vCPUs with a decent number of supported XSAVE features and compacted
> > format support, can add up to thousands of cycles.
> >
> > To fix the immediate issue, cache the CPUID output at kvm.ko load.  The
> > information is static for a given CPU, i.e. doesn't need to be re-read
> > from hardware every time.  That's patch 1, and eliminates pretty much all
> > of the meaningful overhead.
> >
> > Patch 2 is a minor cleanup to try and make the code easier to read.
> >
> > Patch 3 fixes a wart in CPUID emulation where KVM does a moderately
> > expensive (though cheap compared to CPUID, lol) MSR lookup that is likely
> > unnecessary for the vast majority of VMs.
> >
> > Patches 4 and 5 address the problem of KVM doing runtime CPUID updates
> > multiple times for each nested VM-Enter and VM-Exit, at least half of
> > which are completely unnecessary (CPUID is a mandatory intercept on both
> > Intel and AMD, so ensuring dynamic CPUID bits are up-to-date while running
> > L2 is pointless).  The idea is fairly simple: lazily do the CPUID updates
> > by deferring them until something might actually consume the relevant guest
> > bits.
> >
> > This applies on the cpu_caps overhaul[*], as patches 3-5 would otherwise
> > conflict, and I didn't want to think about how safe patch 5 is without
> > the rework.
> >
> > That said, patch 1, which is the most important and tagged for stable,
> > applies cleanly on 6.1, 6.6, and 6.12 (and the backport for 5.15 and
> > earlier shouldn't be too horrific).
> >
> > Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
> > or ucode bug.  For a number of leaves, KVM can emulate CPUID faster than
> > the CPU can execute the instruction.  I.e. the entire VM-Exit => emulate
> > => VM-Enter sequence takes less time than executing CPUID on bare metal.
> > Which seems absolutely insane.  But, it shouldn't impact guest performance,
> > so that's someone else's problem, at least for now.
> 
> Virtualization aside, perhaps Linux should set
> MSR_FEATURE_ENABLES.CPUID_GP_ON_CPL_GT_0[bit 0] on EMR, and emulate
> the CPUID instruction in the kernel?  :)

Heh, that thought crossed my mind too.


* Re: [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression
  2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
                   ` (7 preceding siblings ...)
  2024-12-16 20:04 ` Jim Mattson
@ 2025-02-15  0:50 ` Sean Christopherson
  8 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-02-15  0:50 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jim Mattson

On Tue, 10 Dec 2024 17:32:57 -0800, Sean Christopherson wrote:
> Fix a hilarious/revolting performance regression (relative to older CPU
> generations) in xstate_required_size() that pops up due to CPUID _in the
> host_ taking 3x-4x longer on Emerald Rapids than Skylake.
> 
> The issue rears its head on nested virtualization transitions, as KVM
> (unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
> multiple times per transition.  And calculating XSAVE sizes, especially
> for vCPUs with a decent number of supported XSAVE features and compacted
> format support, can add up to thousands of cycles.
> 
> [...]

Applied 2-5 to kvm-x86 misc, with a changelog that doesn't incorrectly state
that CPUID is a mandatory intercept on AMD.

[1/5] KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
      (no commit info)
[2/5] KVM: x86: Use for-loop to iterate over XSTATE size entries
      https://github.com/kvm-x86/linux/commit/aa93b6f96f64
[3/5] KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE
      https://github.com/kvm-x86/linux/commit/7e9f735e7ac4
[4/5] KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit
      https://github.com/kvm-x86/linux/commit/a487f6797c88
[5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
      https://github.com/kvm-x86/linux/commit/93da6af3ae56

--
https://github.com/kvm-x86/linux/tree/next


* Re: [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
  2024-12-11  1:33 ` [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation Sean Christopherson
@ 2025-11-28 11:32   ` Igor Mammedov
  2025-12-01 15:55     ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Igor Mammedov @ 2025-11-28 11:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, mlevitsk

On Tue, 10 Dec 2024 17:33:02 -0800
Sean Christopherson <seanjc@google.com> wrote:

Sean,

this patch broke vCPU hotplug (still broken with current master); after
repeated plug/unplug of the same vCPU in a loop, QEMU exits due to an error
in vCPU initialization:

    r = kvm_vcpu_ioctl(cs, KVM_SET_CPUID2, &cpuid_data);                         
    if (r) {                                                                     
        goto fail;                                                               
    }

Reproducer (host in question is Haswell but it's been seen on other hosts as well):
for it to trigger the issue it must be a Q35 machine with UEFI firmware
(the rest doesn't seem to matter)
===

#!/bin/sh

/tmp/qemu_build/qemu-system-x86_64 -M q35,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars \
    -m 16G -cpu host -smp 1,maxcpus=2 -enable-kvm \
    -blockdev node-name=file_ovmf_code,driver=file,filename=edk2-x86_64-secure-code.fd,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_ovmf_code,driver=raw,read-only=on,file=file_ovmf_code \
    -blockdev node-name=file_ovmf_vars,driver=file,filename=myOVMF_VARS.fd,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_ovmf_vars,driver=raw,read-only=off,file=file_ovmf_vars \
    -monitor unix:/tmp/monitor2,server,nowait -snapshot \
    ./rhel10-efi.qcow2 &

sleep 120

i=0
while [ $i -lt 3 ] 
do
	echo "====== The $(($i+1)) iterations ======"
	echo "info cpus" | nc -U /tmp/monitor2
	sleep 2
	echo "device_add host-x86_64-cpu,core-id=1,thread-id=0,socket-id=0,id=core1" | nc -U /tmp/monitor2
	sleep 6 
	echo "info cpus" | nc -U /tmp/monitor2
	sleep 2
	echo "device_del core1" | nc -U /tmp/monitor2
	sleep 6
	echo "info cpus" | nc -U /tmp/monitor2
	sleep 2
	((i++))
done

===


> Defer runtime CPUID updates until the next non-faulting CPUID emulation
> or KVM_GET_CPUID2, which are the only paths in KVM that consume the
> dynamic entries.  Deferring the updates is especially beneficial to
> nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
> changes, not to mention the updates don't need to be realized while L2 is
> active, as CPUID is a mandatory intercept on both Intel and AMD.
> 
> Deferring CPUID updates shaves several hundred cycles from nested VMX
> roundtrips, as measured from L2 executing CPUID in a tight loop:
> 
>   SKX 6850 => 6450
>   ICX 9000 => 8800
>   EMR 7900 => 7700
> 
> Alternatively, KVM could update only the CPUID leaves that are affected
> by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
> but that adds non-trivial complexity and doesn't solve the underlying
> problem of nested transitions potentially changing both XCR0 and XSS, on
> both nested VM-Enter and VM-Exit.
> 
> KVM could also skip updates entirely while L2 is active, because again
> CPUID is a mandatory intercept.  However, simply skipping updates if L2
> is active is *very* subtly dangerous and complex.  Most KVM updates are
> triggered by changes to the current vCPU state, which may be L2 state,
> whereas performing updates only for L1 would require detecting changes
> to L1 state.  KVM would need to either track relevant L1 state, or defer
> runtime CPUID updates until the next nested VM-Exit.  The former is ugly
> and complex, while the latter comes with similar dangers to deferring all
> CPUID updates, and would only address the nested VM-Enter path.
> 
> To guard against using stale data, disallow querying dynamic CPUID feature
> bits, i.e. features that KVM updates at runtime, via a compile-time
> assertion in guest_cpu_cap_has().  Exempt MWAIT from the rule, as the
> MISC_ENABLE_NO_MWAIT quirk means that MWAIT is _conditionally_ a dynamic CPUID
> feature.
> 
> Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
> CPUID in kvm_emulate_monitor_mwait(), but there's no obvious advantage to
> doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
> of worms.  MONITOR/MWAIT can't be virtualized (for a reasonable definition),
> and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
> means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
> wrong for other reasons.
> 
> Beyond the aforementioned feature bits, the only other dynamic CPUID
> (sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
> CPUID entries in KVM is all but guaranteed to be a bug.  The layout for an
> actual XSAVE buffer depends on the format (compacted or not) and
> potentially the features that are actually enabled.  E.g. see the logic in
> fpstate_clear_xstate_component() needed to poke into the guest's effective
> XSAVE state to clear MPX state on INIT.  KVM does consume
> CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
> but not EBX, which is the only dynamic output register in the leaf.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c            | 12 ++++++++++--
>  arch/x86/kvm/cpuid.h            |  9 ++++++++-
>  arch/x86/kvm/lapic.c            |  2 +-
>  arch/x86/kvm/smm.c              |  2 +-
>  arch/x86/kvm/svm/sev.c          |  2 +-
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/vmx.c          |  2 +-
>  arch/x86/kvm/x86.c              |  6 +++---
>  9 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 81ce8cd5814a..23cc5c10060e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -871,6 +871,7 @@ struct kvm_vcpu_arch {
>  
>  	int cpuid_nent;
>  	struct kvm_cpuid_entry2 *cpuid_entries;
> +	bool cpuid_dynamic_bits_dirty;
>  	bool is_amd_compatible;
>  
>  	/*
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 7f5fa6665969..54ba1a75b779 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -195,6 +195,7 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu)
>  }
>  
>  static u32 kvm_apply_cpuid_pv_features_quirk(struct kvm_vcpu *vcpu);
> +static void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
>  
>  /* Check whether the supplied CPUID data is equal to what is already set for the vCPU. */
>  static int kvm_cpuid_check_equal(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
> @@ -299,10 +300,12 @@ static __always_inline void kvm_update_feature_runtime(struct kvm_vcpu *vcpu,
>  	guest_cpu_cap_change(vcpu, x86_feature, has_feature);
>  }
>  
> -void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
> +static void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_cpuid_entry2 *best;
>  
> +	vcpu->arch.cpuid_dynamic_bits_dirty = false;
> +
>  	best = kvm_find_cpuid_entry(vcpu, 1);
>  	if (best) {
>  		kvm_update_feature_runtime(vcpu, best, X86_FEATURE_OSXSAVE,
> @@ -332,7 +335,6 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
>  		     cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
>  		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
>  }
> -EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
>  
>  static bool kvm_cpuid_has_hyperv(struct kvm_vcpu *vcpu)
>  {
> @@ -645,6 +647,9 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
>  	if (cpuid->nent < vcpu->arch.cpuid_nent)
>  		return -E2BIG;
>  
> +	if (vcpu->arch.cpuid_dynamic_bits_dirty)
> +		kvm_update_cpuid_runtime(vcpu);
> +
>  	if (copy_to_user(entries, vcpu->arch.cpuid_entries,
>  			 vcpu->arch.cpuid_nent * sizeof(struct kvm_cpuid_entry2)))
>  		return -EFAULT;
> @@ -1983,6 +1988,9 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
>  	struct kvm_cpuid_entry2 *entry;
>  	bool exact, used_max_basic = false;
>  
> +	if (vcpu->arch.cpuid_dynamic_bits_dirty)
> +		kvm_update_cpuid_runtime(vcpu);
> +
>  	entry = kvm_find_cpuid_entry_index(vcpu, function, index);
>  	exact = !!entry;
>  
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index 67d80aa72d50..d2884162a46a 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -11,7 +11,6 @@ extern u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
>  void kvm_set_cpu_caps(void);
>  
>  void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
> -void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
>  struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
>  						    u32 function, u32 index);
>  struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
> @@ -232,6 +231,14 @@ static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
>  {
>  	unsigned int x86_leaf = __feature_leaf(x86_feature);
>  
> +	/*
> +	 * Except for MWAIT, querying dynamic feature bits is disallowed, so
> +	 * that KVM can defer runtime updates until the next CPUID emulation.
> +	 */
> +	BUILD_BUG_ON(x86_feature == X86_FEATURE_APIC ||
> +		     x86_feature == X86_FEATURE_OSXSAVE ||
> +		     x86_feature == X86_FEATURE_OSPKE);
> +
>  	return vcpu->arch.cpu_caps[x86_leaf] & __feature_bit(x86_feature);
>  }
>  
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index ae81ae27d534..cf74c87b8b3f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2585,7 +2585,7 @@ static void __kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  	vcpu->arch.apic_base = value;
>  
>  	if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE)
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  
>  	if (!apic)
>  		return;
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index e0ab7df27b66..699e551ec93b 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -358,7 +358,7 @@ void enter_smm(struct kvm_vcpu *vcpu)
>  			goto error;
>  #endif
>  
> -	kvm_update_cpuid_runtime(vcpu);
> +	vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  	kvm_mmu_reset_context(vcpu);
>  	return;
>  error:
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 09be12a44288..5e4581ed0ef1 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3274,7 +3274,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
>  
>  	if (kvm_ghcb_xcr0_is_valid(svm)) {
>  		vcpu->arch.xcr0 = ghcb_get_xcr0(ghcb);
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  	}
>  
>  	/* Copy the GHCB exit information into the VMCB fields */
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 07911ddf1efe..6a350cee2f6c 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1936,7 +1936,7 @@ void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  	vmcb_mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
>  
>  	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  }
>  
>  static void svm_set_segment(struct kvm_vcpu *vcpu,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index cf872d8691b5..b5f3c5628bfd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3516,7 +3516,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  	vmcs_writel(GUEST_CR4, hw_cr4);
>  
>  	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  }
>  
>  void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dc8829712edd..10b7d8c01e4d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1264,7 +1264,7 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>  	vcpu->arch.xcr0 = xcr0;
>  
>  	if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  	return 0;
>  }
>  
> @@ -3899,7 +3899,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			if (!guest_cpu_cap_has(vcpu, X86_FEATURE_XMM3))
>  				return 1;
>  			vcpu->arch.ia32_misc_enable_msr = data;
> -			kvm_update_cpuid_runtime(vcpu);
> +			vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  		} else {
>  			vcpu->arch.ia32_misc_enable_msr = data;
>  		}
> @@ -3934,7 +3934,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (data & ~kvm_caps.supported_xss)
>  			return 1;
>  		vcpu->arch.ia32_xss = data;
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  		break;
>  	case MSR_SMI_COUNT:
>  		if (!msr_info->host_initiated)



* Re: [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
  2025-11-28 11:32   ` Igor Mammedov
@ 2025-12-01 15:55     ` Sean Christopherson
  2025-12-01 19:36       ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2025-12-01 15:55 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, mlevitsk

On Fri, Nov 28, 2025, Igor Mammedov wrote:
> On Tue, 10 Dec 2024 17:33:02 -0800
> Sean Christopherson <seanjc@google.com> wrote:
> 
> Sean,
> 
> this patch broke vCPU hotplug (still broken with current master); after
> repeated plug/unplug of the same vCPU in a loop, QEMU exits due to an error
> in vCPU initialization:
> 
>     r = kvm_vcpu_ioctl(cs, KVM_SET_CPUID2, &cpuid_data);                         
>     if (r) {                                                                     
>         goto fail;                                                               
>     }
> 
> Reproducer (host in question is Haswell but it's been seen on other hosts as well):
> for it to trigger the issue it must be a Q35 machine with UEFI firmware
> (the rest doesn't seem to matter)

Gah, sorry.  I managed to handle KVM_GET_CPUID2, so I suspect I thought the update
in kvm_cpuid_check_equal() would take care of things, but that only operates on
the new entries.

Can you test the below?  In the meantime, I'll see if I can enhance the CPUID
selftest to detect the issue and verify the fix.

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index d563a948318b..dd6534419074 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -509,11 +509,18 @@ static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
        u32 vcpu_caps[NR_KVM_CPU_CAPS];
        int r;
 
+       /*
+        * Apply pending runtime CPUID updates to the current CPUID entries to
+        * avoid false positives due to mismatches on KVM-owned feature flags.
+        */
+       if (vcpu->arch.cpuid_dynamic_bits_dirty)
+               kvm_update_cpuid_runtime(vcpu);
+
        /*
         * Swap the existing (old) entries with the incoming (new) entries in
         * order to massage the new entries, e.g. to account for dynamic bits
-        * that KVM controls, without clobbering the current guest CPUID, which
-        * KVM needs to preserve in order to unwind on failure.
+        * that KVM controls, without losing the current guest CPUID, which KVM
+        * needs to preserve in order to unwind on failure.
         *
         * Similarly, save the vCPU's current cpu_caps so that the capabilities
         * can be updated alongside the CPUID entries when performing runtime
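
The deferred-update scheme, and the pitfall this fix addresses, can be
sketched generically: a dirty flag postpones recomputing derived state, so
*every* consumer, including an equality check against incoming data, must
flush pending updates first or it compares against stale values. All names
below are hypothetical, not KVM's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vcpu {
	uint64_t cr4;      /* raw state owned by the guest */
	uint64_t derived;  /* e.g. dynamic CPUID bits computed from cr4 */
	bool dirty;        /* derived state may be stale */
};

static void recompute_derived(struct vcpu *v)
{
	v->derived = v->cr4 & 0xff;  /* stand-in for the real derivation */
	v->dirty = false;
}

void set_cr4(struct vcpu *v, uint64_t cr4)
{
	v->cr4 = cr4;
	v->dirty = true;  /* defer the (expensive) recompute */
}

/* Consumer path: flush before reading, mirroring kvm_cpuid(). */
uint64_t get_derived(struct vcpu *v)
{
	if (v->dirty)
		recompute_derived(v);
	return v->derived;
}

/*
 * The KVM_SET_CPUID2 analogue: comparing incoming data against the *current*
 * state must flush first, else unchanged input spuriously mismatches.
 */
bool set_state_check_equal(struct vcpu *v, uint64_t incoming_derived)
{
	if (v->dirty)
		recompute_derived(v);
	return v->derived == incoming_derived;
}
```

Without the flush in the comparison path, toggling CR4 and then re-setting an
unmodified CPUID would fail exactly as in the hotplug reproducer above.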


* Re: [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
  2025-12-01 15:55     ` Sean Christopherson
@ 2025-12-01 19:36       ` Sean Christopherson
  0 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-12-01 19:36 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, mlevitsk

On Mon, Dec 01, 2025, Sean Christopherson wrote:
> On Fri, Nov 28, 2025, Igor Mammedov wrote:
> > On Tue, 10 Dec 2024 17:33:02 -0800
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > Sean,
> > 
> > this patch broke vCPU hotplug (still broken with current master); after
> > repeated plug/unplug of the same vCPU in a loop, QEMU exits due to an error
> > in vCPU initialization:
> > 
> >     r = kvm_vcpu_ioctl(cs, KVM_SET_CPUID2, &cpuid_data);                         
> >     if (r) {                                                                     
> >         goto fail;                                                               
> >     }
> > 
> > Reproducer (host in question is Haswell but it's been seen on other hosts as well):
> > for it to trigger the issue it must be a Q35 machine with UEFI firmware
> > (the rest doesn't seem to matter)
> 
> Gah, sorry.  I managed to handle KVM_GET_CPUID2, so I suspect I thought the update
> in kvm_cpuid_check_equal() would take care of things, but that only operates on
> the new entries.
> 
> Can you test the below?  In the meantime, I'll see if I can enhance the CPUID
> selftest to detect the issue and verify the fix.
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index d563a948318b..dd6534419074 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -509,11 +509,18 @@ static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
>         u32 vcpu_caps[NR_KVM_CPU_CAPS];
>         int r;
>  
> +       /*
> +        * Apply pending runtime CPUID updates to the current CPUID entries to
> +        * avoid false positives due to mismatches on KVM-owned feature flags.
> +        */
> +       if (vcpu->arch.cpuid_dynamic_bits_dirty)
> +               kvm_update_cpuid_runtime(vcpu);
> +
>         /*
>          * Swap the existing (old) entries with the incoming (new) entries in
>          * order to massage the new entries, e.g. to account for dynamic bits
> -        * that KVM controls, without clobbering the current guest CPUID, which
> -        * KVM needs to preserve in order to unwind on failure.
> +        * that KVM controls, without losing the current guest CPUID, which KVM
> +        * needs to preserve in order to unwind on failure.
>          *
>          * Similarly, save the vCPU's current cpu_caps so that the capabilities
>          * can be updated alongside the CPUID entries when performing runtime

Verified the bug and the fix with the below selftest change.  I'll post proper
patches after full testing, and cross my fingers it fixes the hotplug issue :-)

diff --git a/tools/testing/selftests/kvm/x86/cpuid_test.c b/tools/testing/selftests/kvm/x86/cpuid_test.c
index 7b3fda6842bc..f9ed14996977 100644
--- a/tools/testing/selftests/kvm/x86/cpuid_test.c
+++ b/tools/testing/selftests/kvm/x86/cpuid_test.c
@@ -155,6 +155,7 @@ struct kvm_cpuid2 *vcpu_alloc_cpuid(struct kvm_vm *vm, vm_vaddr_t *p_gva, struct
 static void set_cpuid_after_run(struct kvm_vcpu *vcpu)
 {
        struct kvm_cpuid_entry2 *ent;
+       struct kvm_sregs sregs;
        int rc;
        u32 eax, ebx, x;
 
@@ -162,6 +163,20 @@ static void set_cpuid_after_run(struct kvm_vcpu *vcpu)
        rc = __vcpu_set_cpuid(vcpu);
        TEST_ASSERT(!rc, "Setting unmodified CPUID after KVM_RUN failed: %d", rc);
 
+       /*
+        * Toggle CR4 bits that affect dynamic CPUID feature flags to verify
+        * setting unmodified CPUID succeeds with runtime CPUID updates.
+        */
+       vcpu_sregs_get(vcpu, &sregs);
+       if (kvm_cpu_has(X86_FEATURE_XSAVE))
+               sregs.cr4 ^= X86_CR4_OSXSAVE;
+       if (kvm_cpu_has(X86_FEATURE_PKU))
+               sregs.cr4 ^= X86_CR4_PKE;
+       vcpu_sregs_set(vcpu, &sregs);
+
+       rc = __vcpu_set_cpuid(vcpu);
+       TEST_ASSERT(!rc, "Setting unmodified CPUID after KVM_RUN failed: %d", rc);
+
        /* Changing CPU features is forbidden */
        ent = vcpu_get_cpuid_entry(vcpu, 0x7);
        ebx = ent->ebx;



end of thread  [~2025-12-01 19:36 UTC | newest]

Thread overview: 15+ messages
2024-12-11  1:32 [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
2024-12-11  1:32 ` [PATCH 1/5] KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init Sean Christopherson
2024-12-11  1:32 ` [PATCH 2/5] KVM: x86: Use for-loop to iterate over XSTATE size entries Sean Christopherson
2024-12-11  1:33 ` [PATCH 3/5] KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE Sean Christopherson
2024-12-13 14:39   ` Jim Mattson
2024-12-11  1:33 ` [PATCH 4/5] KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit Sean Christopherson
2024-12-11  1:33 ` [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation Sean Christopherson
2025-11-28 11:32   ` Igor Mammedov
2025-12-01 15:55     ` Sean Christopherson
2025-12-01 19:36       ` Sean Christopherson
2024-12-11 16:42 ` [PATCH 0/5] KVM: x86: Address xstate_required_size() perf regression Sean Christopherson
2024-12-13 18:58 ` Paolo Bonzini
2024-12-16 20:04 ` Jim Mattson
2024-12-16 23:16   ` Sean Christopherson
2025-02-15  0:50 ` Sean Christopherson
