* [PATCH 1/4] KVM: SVM: Handle #MCs in guest outside of fastpath
2025-10-30 22:42 [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Sean Christopherson
@ 2025-10-30 22:42 ` Sean Christopherson
2025-10-30 22:42 ` [PATCH 2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath Sean Christopherson
` (5 subsequent siblings)
6 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-10-30 22:42 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jon Kohler
Handle Machine Checks (#MC) that happen in the guest (by forwarding them
to the host) outside of KVM's fastpath so that as much host state as
possible is re-loaded before invoking the kernel's #MC handler. The only
requirement is that KVM invokes the #MC handler before enabling IRQs (and
even that could _probably_ be relaxed to handling #MCs before enabling
preemption).
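For reference, KVM forwards a guest #MC to the host by synthesizing a call into
the host's #MC handler; kvm_machine_check() in arch/x86/kvm/x86.h looks roughly
like this (illustrative sketch, not a verbatim copy):

static inline void kvm_machine_check(void)
{
#ifdef CONFIG_X86_MCE
	/* Fake a ring 3 #MC, no matter what the guest was actually running. */
	struct pt_regs regs = {
		.cs	= 3,
		.flags	= X86_EFLAGS_IF,
	};

	do_machine_check(&regs);
#endif
}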
Waiting to handle #MCs until "more" host state is loaded hardens KVM
against flaws in the #MC handler, which has historically been quite
brittle. E.g. prior to commit 5567d11c21a1 ("x86/mce: Send #MC singal from
task work"), the #MC code could trigger a schedule() with IRQs and
preemption disabled. That led to a KVM hack-a-fix in commit 1811d979c716
("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
Note, except for #MCs on VM-Enter, VMX already handles #MCs outside of the
fastpath.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/svm.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f14709a511aa..e8b158f73c79 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4335,14 +4335,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
vcpu->arch.regs_avail &= ~SVM_REGS_LAZY_LOAD_SET;
- /*
- * We need to handle MC intercepts here before the vcpu has a chance to
- * change the physical cpu
- */
- if (unlikely(svm->vmcb->control.exit_code ==
- SVM_EXIT_EXCP_BASE + MC_VECTOR))
- svm_handle_mce(vcpu);
-
trace_kvm_exit(vcpu, KVM_ISA_SVM);
svm_complete_interrupts(vcpu);
@@ -4631,8 +4623,16 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
- if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_INTR)
+ switch (to_svm(vcpu)->vmcb->control.exit_code) {
+ case SVM_EXIT_EXCP_BASE + MC_VECTOR:
+ svm_handle_mce(vcpu);
+ break;
+ case SVM_EXIT_INTR:
vcpu->arch.at_instruction_boundary = true;
+ break;
+ default:
+ break;
+ }
}
static void svm_setup_mce(struct kvm_vcpu *vcpu)
--
2.51.1.930.gacf6e81ea2-goog
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
2025-10-30 22:42 [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Sean Christopherson
2025-10-30 22:42 ` [PATCH 1/4] KVM: SVM: Handle #MCs in guest outside of fastpath Sean Christopherson
@ 2025-10-30 22:42 ` Sean Christopherson
2025-11-17 12:38 ` Tony Lindgren
2025-10-30 22:42 ` [PATCH 3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop Sean Christopherson
` (4 subsequent siblings)
6 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-10-30 22:42 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jon Kohler
Handle Machine Checks (#MC) that happen on VM-Enter (VMX or TDX) outside
of KVM's fastpath so that as much host state as possible is re-loaded
before invoking the kernel's #MC handler. The only requirement is that
KVM invokes the #MC handler before enabling IRQs (and even that could
_probably_ be relaxed to handling #MCs before enabling preemption).
Waiting to handle #MCs until "more" host state is loaded hardens KVM
against flaws in the #MC handler, which has historically been quite
brittle. E.g. prior to commit 5567d11c21a1 ("x86/mce: Send #MC singal from
task work"), the #MC code could trigger a schedule() with IRQs and
preemption disabled. That led to a KVM hack-a-fix in commit 1811d979c716
("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/main.c | 13 ++++++++++++-
arch/x86/kvm/vmx/tdx.c | 3 ---
arch/x86/kvm/vmx/vmx.c | 3 ---
3 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 0eb2773b2ae2..1beaec5b9727 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -608,6 +608,17 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
+ kvm_machine_check();
+
+ if (is_td_vcpu(vcpu))
+ return;
+
+ return vmx_handle_exit_irqoff(vcpu);
+}
+
static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
{
if (is_td_vcpu(vcpu))
@@ -969,7 +980,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.load_mmu_pgd = vt_op(load_mmu_pgd),
.check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
+ .handle_exit_irqoff = vt_op(handle_exit_irqoff),
.update_cpu_dirty_logging = vt_op(update_cpu_dirty_logging),
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 326db9b9c567..a2f6ba3268d1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1069,9 +1069,6 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
return EXIT_FASTPATH_NONE;
- if (unlikely(vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
- kvm_machine_check();
-
trace_kvm_exit(vcpu, KVM_ISA_VMX);
if (unlikely(tdx_failed_vmentry(vcpu)))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1021d3b65ea0..123dae8cf46b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7527,9 +7527,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
if (unlikely(vmx->fail))
return EXIT_FASTPATH_NONE;
- if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
- kvm_machine_check();
-
trace_kvm_exit(vcpu, KVM_ISA_VMX);
if (unlikely(vmx_get_exit_reason(vcpu).failed_vmentry))
--
2.51.1.930.gacf6e81ea2-goog
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
2025-10-30 22:42 ` [PATCH 2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath Sean Christopherson
@ 2025-11-17 12:38 ` Tony Lindgren
2025-11-17 15:47 ` Sean Christopherson
2025-11-17 16:30 ` Edgecombe, Rick P
0 siblings, 2 replies; 19+ messages in thread
From: Tony Lindgren @ 2025-11-17 12:38 UTC (permalink / raw)
To: Sean Christopherson, Edgecombe, Rick P
Cc: Paolo Bonzini, kvm, linux-kernel, Jon Kohler
Hi,
On Thu, Oct 30, 2025 at 03:42:44PM -0700, Sean Christopherson wrote:
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -608,6 +608,17 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> }
>
> +static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> +{
> + if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
> + kvm_machine_check();
> +
> + if (is_td_vcpu(vcpu))
> + return;
> +
> + return vmx_handle_exit_irqoff(vcpu);
> +}
I bisected kvm-x86/next down to this change for a TDX guest not booting
and host producing errors like:
watchdog: CPU118: Watchdog detected hard LOCKUP on cpu 118
Dropping the is_td_vcpu(vcpu) check above fixes the issue. Earlier the
call for vmx_handle_exit_irqoff() was unconditional.
Probably the (u16) cast above can be dropped too? It was never used for
TDX looking at the patch.
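For clarity, the variant described above, i.e. the hunk quoted earlier with the
is_td_vcpu() early return dropped, would be roughly (illustrative sketch only):

static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
	/* Sketch: same as the hunk above, minus the is_td_vcpu() early return. */
	if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
		kvm_machine_check();

	vmx_handle_exit_irqoff(vcpu);
}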
Regards,
Tony
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
2025-11-17 12:38 ` Tony Lindgren
@ 2025-11-17 15:47 ` Sean Christopherson
2025-11-18 5:33 ` Tony Lindgren
2025-11-17 16:30 ` Edgecombe, Rick P
1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-11-17 15:47 UTC (permalink / raw)
To: Tony Lindgren
Cc: Rick P Edgecombe, Paolo Bonzini, kvm, linux-kernel, Jon Kohler
On Mon, Nov 17, 2025, Tony Lindgren wrote:
> Hi,
>
> On Thu, Oct 30, 2025 at 03:42:44PM -0700, Sean Christopherson wrote:
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -608,6 +608,17 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> > }
> >
> > +static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> > +{
> > + if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
> > + kvm_machine_check();
> > +
> > + if (is_td_vcpu(vcpu))
> > + return;
> > +
> > + return vmx_handle_exit_irqoff(vcpu);
> > +}
>
> I bisected kvm-x86/next down to this change for a TDX guest not booting
> and host producing errors like:
>
> watchdog: CPU118: Watchdog detected hard LOCKUP on cpu 118
>
> Dropping the is_td_vcpu(vcpu) check above fixes the issue. Earlier the
> call for vmx_handle_exit_irqoff() was unconditional.
Ugh, once you see it, it's obvious. Sorry :-(
I'll drop the entire series and send a v2. There's only one other patch that I
already sent the "thank you" for, so I think it's worth unwinding to avoid
breaking bisection for TDX (and because the diff can be very different).
Lightly tested, but I think this patch can instead be:
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 163f854a39f2..6d41d2fc8043 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1063,9 +1063,6 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
return EXIT_FASTPATH_NONE;
- if (unlikely(vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
- kvm_machine_check();
-
trace_kvm_exit(vcpu, KVM_ISA_VMX);
if (unlikely(tdx_failed_vmentry(vcpu)))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d98107a7bdaa..d1117da5463f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7035,10 +7035,19 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
if (to_vt(vcpu)->emulation_required)
return;
- if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXTERNAL_INTERRUPT)
+ switch (vmx_get_exit_reason(vcpu).basic) {
+ case EXIT_REASON_EXTERNAL_INTERRUPT:
handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
- else if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXCEPTION_NMI)
+ break;
+ case EXIT_REASON_EXCEPTION_NMI:
handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ break;
+ case EXIT_REASON_MCE_DURING_VMENTRY:
+ kvm_machine_check();
+ break;
+ default:
+ break;
+ }
}
/*
@@ -7501,9 +7510,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
if (unlikely(vmx->fail))
return EXIT_FASTPATH_NONE;
- if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
- kvm_machine_check();
-
trace_kvm_exit(vcpu, KVM_ISA_VMX);
if (unlikely(vmx_get_exit_reason(vcpu).failed_vmentry))
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
2025-11-17 15:47 ` Sean Christopherson
@ 2025-11-18 5:33 ` Tony Lindgren
0 siblings, 0 replies; 19+ messages in thread
From: Tony Lindgren @ 2025-11-18 5:33 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, Paolo Bonzini, kvm, linux-kernel, Jon Kohler
On Mon, Nov 17, 2025 at 07:47:49AM -0800, Sean Christopherson wrote:
> On Mon, Nov 17, 2025, Tony Lindgren wrote:
> > Hi,
> >
> > On Thu, Oct 30, 2025 at 03:42:44PM -0700, Sean Christopherson wrote:
> > > --- a/arch/x86/kvm/vmx/main.c
> > > +++ b/arch/x86/kvm/vmx/main.c
> > > @@ -608,6 +608,17 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> > > vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> > > }
> > >
> > > +static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> > > +{
> > > + if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
> > > + kvm_machine_check();
> > > +
> > > + if (is_td_vcpu(vcpu))
> > > + return;
> > > +
> > > + return vmx_handle_exit_irqoff(vcpu);
> > > +}
> >
> > I bisected kvm-x86/next down to this change for a TDX guest not booting
> > and host producing errors like:
> >
> > watchdog: CPU118: Watchdog detected hard LOCKUP on cpu 118
> >
> > Dropping the is_td_vcpu(vcpu) check above fixes the issue. Earlier the
> > call for vmx_handle_exit_irqoff() was unconditional.
>
> Ugh, once you see it, it's obvious. Sorry :-(
>
> I'll drop the entire series and send a v2. There's only one other patch that I
> already sent the "thank you" for, so I think it's worth unwinding to avoid
> breaking bisection for TDX (and because the diff can be very different).
OK thanks.
> Lightly tested, but I think this patch can instead be:
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 163f854a39f2..6d41d2fc8043 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1063,9 +1063,6 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> if (unlikely((tdx->vp_enter_ret & TDX_SW_ERROR) == TDX_SW_ERROR))
> return EXIT_FASTPATH_NONE;
>
> - if (unlikely(vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
> - kvm_machine_check();
> -
> trace_kvm_exit(vcpu, KVM_ISA_VMX);
>
> if (unlikely(tdx_failed_vmentry(vcpu)))
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d98107a7bdaa..d1117da5463f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7035,10 +7035,19 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> if (to_vt(vcpu)->emulation_required)
> return;
>
> - if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXTERNAL_INTERRUPT)
> + switch (vmx_get_exit_reason(vcpu).basic) {
> + case EXIT_REASON_EXTERNAL_INTERRUPT:
> handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
> - else if (vmx_get_exit_reason(vcpu).basic == EXIT_REASON_EXCEPTION_NMI)
> + break;
> + case EXIT_REASON_EXCEPTION_NMI:
> handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
> + break;
> + case EXIT_REASON_MCE_DURING_VMENTRY:
> + kvm_machine_check();
> + break;
> + default:
> + break;
> + }
> }
>
> /*
> @@ -7501,9 +7510,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> if (unlikely(vmx->fail))
> return EXIT_FASTPATH_NONE;
>
> - if (unlikely((u16)vmx_get_exit_reason(vcpu).basic == EXIT_REASON_MCE_DURING_VMENTRY))
> - kvm_machine_check();
> -
> trace_kvm_exit(vcpu, KVM_ISA_VMX);
>
> if (unlikely(vmx_get_exit_reason(vcpu).failed_vmentry))
Looks good to me.
Regards,
Tony
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
2025-11-17 12:38 ` Tony Lindgren
2025-11-17 15:47 ` Sean Christopherson
@ 2025-11-17 16:30 ` Edgecombe, Rick P
1 sibling, 0 replies; 19+ messages in thread
From: Edgecombe, Rick P @ 2025-11-17 16:30 UTC (permalink / raw)
To: tony.lindgren@linux.intel.com, seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Kohler, Jon,
linux-kernel@vger.kernel.org
On Mon, 2025-11-17 at 14:38 +0200, Tony Lindgren wrote:
> I bisected kvm-x86/next down to this change for a TDX guest not
> booting and host producing errors like:
>
> watchdog: CPU118: Watchdog detected hard LOCKUP on cpu 118
>
> Dropping the is_td_vcpu(vcpu) check above fixes the issue. Earlier
> the call for vmx_handle_exit_irqoff() was unconditional.
>
> Probably the (u16) cast above can be dropped too? It was never used
> for TDX looking at the patch.
Ah! Thanks for picking this up. I had almost got there but lost my TDX
machine for a bit.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop
2025-10-30 22:42 [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Sean Christopherson
2025-10-30 22:42 ` [PATCH 1/4] KVM: SVM: Handle #MCs in guest outside of fastpath Sean Christopherson
2025-10-30 22:42 ` [PATCH 2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath Sean Christopherson
@ 2025-10-30 22:42 ` Sean Christopherson
2025-11-05 10:42 ` Binbin Wu
2025-10-30 22:42 ` [PATCH 4/4] KVM: x86: Load guest/host PKRU " Sean Christopherson
` (3 subsequent siblings)
6 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-10-30 22:42 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jon Kohler
Move KVM's swapping of XFEATURE masks, i.e. XCR0 and XSS, out of the
fastpath loop now that the guts of the #MC handler runs in task context,
i.e. won't invoke schedule() with preemption disabled and clobber state
(or crash the kernel) due to trying to context switch XSTATE with a mix
of host and guest state.
For all intents and purposes, this reverts commit 1811d979c716 ("x86/kvm:
move kvm_load/put_guest_xcr0 into atomic context"), which papered over an
egregious bug/flaw in the #MC handler where it would do schedule() even
though IRQs are disabled. E.g. the call stack from the commit:
kvm_load_guest_xcr0
...
kvm_x86_ops->run(vcpu)
vmx_vcpu_run
vmx_complete_atomic_exit
kvm_machine_check
do_machine_check
do_memory_failure
memory_failure
lock_page
Commit 1811d979c716 "fixed" the immediate issue of XRSTORS exploding, but
completely ignored that scheduling out a vCPU task with IRQs and
preemption disabled is wildly broken. Thankfully, commit 5567d11c21a1 ("x86/mce:
Send #MC singal from task work") (somewhat incidentally?) fixed that flaw
by pushing the meat of the work to the user-return path, i.e. to task
context.
KVM has also hardened itself against #MC goofs by moving #MC forwarding to
kvm_x86_ops.handle_exit_irqoff(), i.e. out of the fastpath. While that's
by no means a robust fix, restoring as much state as possible before
handling the #MC will hopefully provide some measure of protection in the
event that #MC handling goes off the rails again.
Note, KVM always intercepts XCR0 writes for vCPUs without protected state,
e.g. there's no risk of consuming a stale XCR0 when determining if a PKRU
update is needed; kvm_load_host_xfeatures() only reads, and never writes,
vcpu->arch.xcr0.
Deferring the XCR0 and XSS loads shaves ~300 cycles off the fastpath for
Intel, and ~500 cycles for AMD. E.g. using INVD in KVM-Unit-Test's
vmexit.c, with an extra hack to enable CR4.OSXSAVE, latency numbers for
AMD Turin go from ~2000 => ~1500, and for Intel Emerald Rapids, go from
~1300 => ~1000.
Cc: Jon Kohler <jon@nutanix.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 39 ++++++++++++++++++++++++++-------------
1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b4b5d2d09634..b5c2879e3330 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1203,13 +1203,12 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_lmsw);
-void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
+static void kvm_load_guest_xfeatures(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.guest_state_protected)
return;
if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
-
if (vcpu->arch.xcr0 != kvm_host.xcr0)
xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
@@ -1217,6 +1216,27 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
vcpu->arch.ia32_xss != kvm_host.xss)
wrmsrq(MSR_IA32_XSS, vcpu->arch.ia32_xss);
}
+}
+
+static void kvm_load_host_xfeatures(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->arch.guest_state_protected)
+ return;
+
+ if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
+ if (vcpu->arch.xcr0 != kvm_host.xcr0)
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) &&
+ vcpu->arch.ia32_xss != kvm_host.xss)
+ wrmsrq(MSR_IA32_XSS, kvm_host.xss);
+ }
+}
+
+void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->arch.guest_state_protected)
+ return;
if (cpu_feature_enabled(X86_FEATURE_PKU) &&
vcpu->arch.pkru != vcpu->arch.host_pkru &&
@@ -1238,17 +1258,6 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
if (vcpu->arch.pkru != vcpu->arch.host_pkru)
wrpkru(vcpu->arch.host_pkru);
}
-
- if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
-
- if (vcpu->arch.xcr0 != kvm_host.xcr0)
- xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
-
- if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) &&
- vcpu->arch.ia32_xss != kvm_host.xss)
- wrmsrq(MSR_IA32_XSS, kvm_host.xss);
- }
-
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_host_xsave_state);
@@ -11292,6 +11301,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_fpu.xfd_err)
wrmsrq(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+ kvm_load_guest_xfeatures(vcpu);
+
if (unlikely(vcpu->arch.switch_db_regs &&
!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))) {
set_debugreg(DR7_FIXED_1, 7);
@@ -11378,6 +11389,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
+ kvm_load_host_xfeatures(vcpu);
+
/*
* Sync xfd before calling handle_exit_irqoff() which may
* rely on the fact that guest_fpu::xfd is up-to-date (e.g.
--
2.51.1.930.gacf6e81ea2-goog
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop
2025-10-30 22:42 ` [PATCH 3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop Sean Christopherson
@ 2025-11-05 10:42 ` Binbin Wu
2025-11-05 14:43 ` Sean Christopherson
0 siblings, 1 reply; 19+ messages in thread
From: Binbin Wu @ 2025-11-05 10:42 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Jon Kohler
On 10/31/2025 6:42 AM, Sean Christopherson wrote:
[...]
>
> -void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> +static void kvm_load_guest_xfeatures(struct kvm_vcpu *vcpu)
> {
> if (vcpu->arch.guest_state_protected)
> return;
>
> if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> -
> if (vcpu->arch.xcr0 != kvm_host.xcr0)
> xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>
> @@ -1217,6 +1216,27 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> vcpu->arch.ia32_xss != kvm_host.xss)
> wrmsrq(MSR_IA32_XSS, vcpu->arch.ia32_xss);
> }
> +}
> +
> +static void kvm_load_host_xfeatures(struct kvm_vcpu *vcpu)
> +{
> + if (vcpu->arch.guest_state_protected)
> + return;
> +
> + if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> + if (vcpu->arch.xcr0 != kvm_host.xcr0)
> + xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
> +
> + if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) &&
> + vcpu->arch.ia32_xss != kvm_host.xss)
> + wrmsrq(MSR_IA32_XSS, kvm_host.xss);
> + }
> +}
kvm_load_guest_xfeatures() and kvm_load_host_xfeatures() are almost the same
except for the guest values VS. host values to set.
I am wondering if it is worth adding a helper to dedup the code, like:
static void kvm_load_xfeatures(struct kvm_vcpu *vcpu, u64 xcr0, u64 xss)
{
	if (vcpu->arch.guest_state_protected)
		return;

	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
		if (vcpu->arch.xcr0 != kvm_host.xcr0)
			xsetbv(XCR_XFEATURE_ENABLED_MASK, xcr0);

		if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) &&
		    vcpu->arch.ia32_xss != kvm_host.xss)
			wrmsrq(MSR_IA32_XSS, xss);
	}
}
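With a helper like that, the two wrappers would presumably collapse to something
along these lines (sketch based on the helper above and the names in the patch):

static void kvm_load_guest_xfeatures(struct kvm_vcpu *vcpu)
{
	kvm_load_xfeatures(vcpu, vcpu->arch.xcr0, vcpu->arch.ia32_xss);
}

static void kvm_load_host_xfeatures(struct kvm_vcpu *vcpu)
{
	kvm_load_xfeatures(vcpu, kvm_host.xcr0, kvm_host.xss);
}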
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop
2025-11-05 10:42 ` Binbin Wu
@ 2025-11-05 14:43 ` Sean Christopherson
2025-11-06 1:55 ` Binbin Wu
0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-11-05 14:43 UTC (permalink / raw)
To: Binbin Wu; +Cc: Paolo Bonzini, kvm, linux-kernel, Jon Kohler
On Wed, Nov 05, 2025, Binbin Wu wrote:
>
>
> On 10/31/2025 6:42 AM, Sean Christopherson wrote:
> [...]
> > -void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> > +static void kvm_load_guest_xfeatures(struct kvm_vcpu *vcpu)
> > {
> > if (vcpu->arch.guest_state_protected)
> > return;
> > if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> > -
> > if (vcpu->arch.xcr0 != kvm_host.xcr0)
> > xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
> > @@ -1217,6 +1216,27 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> > vcpu->arch.ia32_xss != kvm_host.xss)
> > wrmsrq(MSR_IA32_XSS, vcpu->arch.ia32_xss);
> > }
> > +}
> > +
> > +static void kvm_load_host_xfeatures(struct kvm_vcpu *vcpu)
> > +{
> > + if (vcpu->arch.guest_state_protected)
> > + return;
> > +
> > + if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> > + if (vcpu->arch.xcr0 != kvm_host.xcr0)
> > + xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
> > +
> > + if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) &&
> > + vcpu->arch.ia32_xss != kvm_host.xss)
> > + wrmsrq(MSR_IA32_XSS, kvm_host.xss);
> > + }
> > +}
>
> kvm_load_guest_xfeatures() and kvm_load_host_xfeatures() are almost the same
> except for the guest values VS. host values to set.
> I am wondering if it is worth adding a helper to dedup the code, like:
>
> static void kvm_load_xfeatures(struct kvm_vcpu *vcpu, u64 xcr0, u64 xss)
> {
> if (vcpu->arch.guest_state_protected)
> return;
>
> if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
> if (vcpu->arch.xcr0 != kvm_host.xcr0)
> xsetbv(XCR_XFEATURE_ENABLED_MASK, xcr0);
>
> if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) &&
> vcpu->arch.ia32_xss != kvm_host.xss)
> wrmsrq(MSR_IA32_XSS, xss);
> }
> }
Nice! I like it. Want to send a proper patch (relative to this series)? Or
I can turn the above into a patch with a Suggested-by. Either way works for me.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop
2025-11-05 14:43 ` Sean Christopherson
@ 2025-11-06 1:55 ` Binbin Wu
0 siblings, 0 replies; 19+ messages in thread
From: Binbin Wu @ 2025-11-06 1:55 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Jon Kohler
On 11/5/2025 10:43 PM, Sean Christopherson wrote:
> On Wed, Nov 05, 2025, Binbin Wu wrote:
>>
>> On 10/31/2025 6:42 AM, Sean Christopherson wrote:
>> [...]
>>> -void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>> +static void kvm_load_guest_xfeatures(struct kvm_vcpu *vcpu)
>>> {
>>> if (vcpu->arch.guest_state_protected)
>>> return;
>>> if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
>>> -
>>> if (vcpu->arch.xcr0 != kvm_host.xcr0)
>>> xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
>>> @@ -1217,6 +1216,27 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
>>> vcpu->arch.ia32_xss != kvm_host.xss)
>>> wrmsrq(MSR_IA32_XSS, vcpu->arch.ia32_xss);
>>> }
>>> +}
>>> +
>>> +static void kvm_load_host_xfeatures(struct kvm_vcpu *vcpu)
>>> +{
>>> + if (vcpu->arch.guest_state_protected)
>>> + return;
>>> +
>>> + if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
>>> + if (vcpu->arch.xcr0 != kvm_host.xcr0)
>>> + xsetbv(XCR_XFEATURE_ENABLED_MASK, kvm_host.xcr0);
>>> +
>>> + if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) &&
>>> + vcpu->arch.ia32_xss != kvm_host.xss)
>>> + wrmsrq(MSR_IA32_XSS, kvm_host.xss);
>>> + }
>>> +}
>> kvm_load_guest_xfeatures() and kvm_load_host_xfeatures() are almost the same
>> except for the guest values VS. host values to set.
>> I am wondering if it is worth adding a helper to dedup the code, like:
>>
>> static void kvm_load_xfeatures(struct kvm_vcpu *vcpu, u64 xcr0, u64 xss)
>> {
>> if (vcpu->arch.guest_state_protected)
>> return;
>>
>> if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
>> if (vcpu->arch.xcr0 != kvm_host.xcr0)
>> xsetbv(XCR_XFEATURE_ENABLED_MASK, xcr0);
>>
>> if (guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) &&
>> vcpu->arch.ia32_xss != kvm_host.xss)
>> wrmsrq(MSR_IA32_XSS, xss);
>> }
>> }
> Nice! I like it. Want to send a proper patch (relative to this series)? Or
> I can turn the above into a patch with a Suggested-by. Either way works for me.
>
I can send a patch.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 4/4] KVM: x86: Load guest/host PKRU outside of the fastpath run loop
2025-10-30 22:42 [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Sean Christopherson
` (2 preceding siblings ...)
2025-10-30 22:42 ` [PATCH 3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop Sean Christopherson
@ 2025-10-30 22:42 ` Sean Christopherson
2025-10-31 17:58 ` Jon Kohler
2025-10-31 17:58 ` [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Jon Kohler
` (2 subsequent siblings)
6 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-10-30 22:42 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jon Kohler
Move KVM's swapping of PKRU outside of the fastpath loop, as there is no
KVM code anywhere in the fastpath that accesses guest/userspace memory,
i.e. that can consume protection keys.
As documented by commit 1be0e61c1f25 ("KVM, pkeys: save/restore PKRU when
guest/host switches"), KVM just needs to ensure the host's PKRU is loaded
when KVM (or the kernel at-large) may access userspace memory. And at the
time of commit 1be0e61c1f25, KVM didn't have a fastpath, and PKU was
strictly contained to VMX, i.e. there was no reason to swap PKRU outside
of vmx_vcpu_run().
Over time, the "need" to swap PKRU close to VM-Enter was likely falsely
solidified by the association with XFEATUREs in commit 37486135d3a7
("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"),
and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a
KVM hack-a-fix for an #MC handler bug by commit 1811d979c716
("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel,
and ~60 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c,
with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers
for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go
from ~810 => ~770.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/svm/svm.c | 2 --
arch/x86/kvm/vmx/vmx.c | 4 ----
arch/x86/kvm/x86.c | 14 ++++++++++----
arch/x86/kvm/x86.h | 2 --
4 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e8b158f73c79..e1fb853c263c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4260,7 +4260,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
svm_set_dr6(vcpu, DR6_ACTIVE_LOW);
clgi();
- kvm_load_guest_xsave_state(vcpu);
/*
* Hardware only context switches DEBUGCTL if LBR virtualization is
@@ -4303,7 +4302,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl)
update_debugctlmsr(vcpu->arch.host_debugctl);
- kvm_load_host_xsave_state(vcpu);
stgi();
/* Any pending NMI will happen here */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 123dae8cf46b..55d637cea84a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7465,8 +7465,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
vmx_set_interrupt_shadow(vcpu, 0);
- kvm_load_guest_xsave_state(vcpu);
-
pt_guest_enter(vmx);
atomic_switch_perf_msrs(vmx);
@@ -7510,8 +7508,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
pt_guest_exit(vmx);
- kvm_load_host_xsave_state(vcpu);
-
if (is_guest_mode(vcpu)) {
/*
* Track VMLAUNCH/VMRESUME that have made past guest state
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b5c2879e3330..6924006f0796 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1233,7 +1233,7 @@ static void kvm_load_host_xfeatures(struct kvm_vcpu *vcpu)
}
}
-void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
+static void kvm_load_guest_pkru(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.guest_state_protected)
return;
@@ -1244,9 +1244,8 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
wrpkru(vcpu->arch.pkru);
}
-EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_guest_xsave_state);
-void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
+static void kvm_load_host_pkru(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.guest_state_protected)
return;
@@ -1259,7 +1258,6 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
wrpkru(vcpu->arch.host_pkru);
}
}
-EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_host_xsave_state);
#ifdef CONFIG_X86_64
static inline u64 kvm_guest_supported_xfd(struct kvm_vcpu *vcpu)
@@ -11331,6 +11329,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
guest_timing_enter_irqoff();
+ /*
+ * Swap PKRU with hardware breakpoints disabled to minimize the number
+ * of flows where non-KVM code can run with guest state loaded.
+ */
+ kvm_load_guest_pkru(vcpu);
+
for (;;) {
/*
* Assert that vCPU vs. VM APICv state is consistent. An APICv
@@ -11359,6 +11363,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
++vcpu->stat.exits;
}
+ kvm_load_host_pkru(vcpu);
+
/*
* Do this here before restoring debug registers on the host. And
* since we do this before handling the vmexit, a DR access vmexit
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f3dc77f006f9..24c754b0db2e 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -622,8 +622,6 @@ static inline void kvm_machine_check(void)
#endif
}
-void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
-void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
int kvm_spec_ctrl_test_value(u64 value);
int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
struct x86_exception *e);
--
2.51.1.930.gacf6e81ea2-goog
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 4/4] KVM: x86: Load guest/host PKRU outside of the fastpath run loop
2025-10-30 22:42 ` [PATCH 4/4] KVM: x86: Load guest/host PKRU " Sean Christopherson
@ 2025-10-31 17:58 ` Jon Kohler
2025-10-31 20:52 ` Sean Christopherson
0 siblings, 1 reply; 19+ messages in thread
From: Jon Kohler @ 2025-10-31 17:58 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> On Oct 30, 2025, at 6:42 PM, Sean Christopherson <seanjc@google.com> wrote:
>
> Move KVM's swapping of PKRU outside of the fastpath loop, as there is no
> KVM code anywhere in the fastpath that accesses guest/userspace memory,
> i.e. that can consume protection keys.
>
> As documented by commit 1be0e61c1f25 ("KVM, pkeys: save/restore PKRU when
> guest/host switches"), KVM just needs to ensure the host's PKRU is loaded
> when KVM (or the kernel at-large) may access userspace memory. And at the
> time of commit 1be0e61c1f25, KVM didn't have a fastpath, and PKU was
> strictly contained to VMX, i.e. there was no reason to swap PKRU outside
> of vmx_vcpu_run().
>
> Over time, the "need" to swap PKRU close to VM-Enter was likely falsely
> solidified by the association with XFEATUREs in commit 37486135d3a7
> ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"),
> and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a
> KVM hack-a-fix for an #MC handler bug by commit 1811d979c716
> ("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
>
> Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel,
> and ~60 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c,
> with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers
> for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go
> from ~810 => ~770.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/svm/svm.c | 2 --
> arch/x86/kvm/vmx/vmx.c | 4 ----
> arch/x86/kvm/x86.c | 14 ++++++++++----
> arch/x86/kvm/x86.h | 2 --
> 4 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index e8b158f73c79..e1fb853c263c 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4260,7 +4260,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> svm_set_dr6(vcpu, DR6_ACTIVE_LOW);
>
> clgi();
> - kvm_load_guest_xsave_state(vcpu);
>
> /*
> * Hardware only context switches DEBUGCTL if LBR virtualization is
> @@ -4303,7 +4302,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> vcpu->arch.host_debugctl != svm->vmcb->save.dbgctl)
> update_debugctlmsr(vcpu->arch.host_debugctl);
>
> - kvm_load_host_xsave_state(vcpu);
> stgi();
>
> /* Any pending NMI will happen here */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 123dae8cf46b..55d637cea84a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7465,8 +7465,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> vmx_set_interrupt_shadow(vcpu, 0);
>
> - kvm_load_guest_xsave_state(vcpu);
> -
> pt_guest_enter(vmx);
>
> atomic_switch_perf_msrs(vmx);
> @@ -7510,8 +7508,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
>
> pt_guest_exit(vmx);
>
> - kvm_load_host_xsave_state(vcpu);
> -
> if (is_guest_mode(vcpu)) {
> /*
> * Track VMLAUNCH/VMRESUME that have made past guest state
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b5c2879e3330..6924006f0796 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1233,7 +1233,7 @@ static void kvm_load_host_xfeatures(struct kvm_vcpu *vcpu)
> }
> }
>
> -void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> +static void kvm_load_guest_pkru(struct kvm_vcpu *vcpu)
> {
> if (vcpu->arch.guest_state_protected)
> return;
> @@ -1244,9 +1244,8 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
> kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
> wrpkru(vcpu->arch.pkru);
> }
> -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_guest_xsave_state);
>
> -void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
> +static void kvm_load_host_pkru(struct kvm_vcpu *vcpu)
> {
> if (vcpu->arch.guest_state_protected)
> return;
> @@ -1259,7 +1258,6 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
> wrpkru(vcpu->arch.host_pkru);
> }
> }
> -EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_load_host_xsave_state);
>
> #ifdef CONFIG_X86_64
> static inline u64 kvm_guest_supported_xfd(struct kvm_vcpu *vcpu)
> @@ -11331,6 +11329,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>
> guest_timing_enter_irqoff();
>
> + /*
> + * Swap PKRU with hardware breakpoints disabled to minimize the number
> + * of flows where non-KVM code can run with guest state loaded.
> + */
> + kvm_load_guest_pkru(vcpu);
> +
I was mocking this up after PUCK, and went down a similar-ish path, but was
thinking it might be interesting to have an x86 op called something to the effect of
“prepare_switch_to_guest_irqoff” and “prepare_switch_to_host_irqoff”, which
might make for a place to nestle any other sort of “needs to be done in atomic
context but doesn’t need to be done in the fast path” sort of stuff (if any).
One other one that caught my eye was the cr3 stuff that was moved out a while
ago, but then moved back with 1a7158101.
I haven’t gone through absolutely everything else in that tight loop code (and didn’t
get a chance to do the same for SVM code), but figured I’d put the idea out there
to see what you think.
To be clear, I’m totally OK with the series as-is, just thinking about perhaps future
ways to incrementally optimize here?
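Purely as a hypothetical sketch of the idea (neither hook exists in kvm_x86_ops
today; the names are taken from this message, not from any real API), the call
sites in vcpu_enter_guest() might look something like:

	/* Hypothetical hooks, named per the suggestion above. */
	kvm_x86_call(prepare_switch_to_guest_irqoff)(vcpu);

	for (;;) {
		/* ... existing fastpath run loop ... */
	}

	kvm_x86_call(prepare_switch_to_host_irqoff)(vcpu);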
> for (;;) {
> /*
> * Assert that vCPU vs. VM APICv state is consistent. An APICv
> @@ -11359,6 +11363,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> ++vcpu->stat.exits;
> }
>
> + kvm_load_host_pkru(vcpu);
> +
> /*
> * Do this here before restoring debug registers on the host. And
> * since we do this before handling the vmexit, a DR access vmexit
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index f3dc77f006f9..24c754b0db2e 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -622,8 +622,6 @@ static inline void kvm_machine_check(void)
> #endif
> }
>
> -void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
> -void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
> int kvm_spec_ctrl_test_value(u64 value);
> int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
> struct x86_exception *e);
> --
> 2.51.1.930.gacf6e81ea2-goog
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 4/4] KVM: x86: Load guest/host PKRU outside of the fastpath run loop
2025-10-31 17:58 ` Jon Kohler
@ 2025-10-31 20:52 ` Sean Christopherson
2025-11-03 15:32 ` Jon Kohler
0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-10-31 20:52 UTC (permalink / raw)
To: Jon Kohler
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
On Fri, Oct 31, 2025, Jon Kohler wrote:
> > On Oct 30, 2025, at 6:42 PM, Sean Christopherson <seanjc@google.com> wrote:
> > + /*
> > + * Swap PKRU with hardware breakpoints disabled to minimize the number
> > + * of flows where non-KVM code can run with guest state loaded.
> > + */
> > + kvm_load_guest_pkru(vcpu);
> > +
>
> I was mocking this up after PUCK, and went down a similar-ish path, but was
> thinking it might be interesting to have an x86 op called something to the effect of
> “prepare_switch_to_guest_irqoff” and “prepare_switch_to_host_irqoff”, which
> might make for a place to nestle any other sort of “needs to be done in atomic
> context but doesn’t need to be done in the fast path” sort of stuff (if any).
Hmm, I would say I'm flat out opposed to generic hooks of that nature. For
anything that _needs_ to be modified with IRQs disabled, the ordering will matter
greatly. E.g. we already have kvm_x86_ops.sync_pir_to_irr(), and that _must_ run
before kvm_vcpu_exit_request() if it triggers a late request.
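Roughly, the existing ordering in vcpu_enter_guest() is (sketch, details elided):

	/*
	 * Pending posted interrupts must be synced to the IRR before checking
	 * for exit requests, else a late request could be missed.
	 */
	if (kvm_lapic_enabled(vcpu))
		kvm_x86_call(sync_pir_to_irr)(vcpu->arch.apic);

	if (kvm_vcpu_exit_request(vcpu)) {
		r = 1;
		goto cancel_injection;	/* unwind and bail, details elided */
	}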
And I also want to push for as much stuff as possible to be handled in common x86,
i.e. I want to actively encourage landing things like PKU and CET support in
common x86 instead of implementing support in one vendor and then having to churn
a pile of code to later move it to
> One other one that caught my eye was the cr3 stuff that was moved out a while
> ago, but then moved back with 1a7158101.
>
> I haven’t gone through absolutely everything else in that tight loop code
> (and didn’t get a chance to do the same for SVM code), but figured I’d put
> the idea out there to see what you think.
>
> To be clear, I’m totally OK with the series as-is, just thinking about
> perhaps future ways to incrementally optimize here?
To some extent, we're going to hit diminishing returns. E.g. one of the reasons
I did a straight revert in commit 1a71581012dd is that we're talking about a handful
of cycles difference. E.g. as measured from the guest, eliding the CR3+CR4 checks
shaves 3-5 cycles. From the host side it _looks_ like more (~20 cycles), but it's
hard to even measure accurately because just doing RDTSC affects the results.
For SVM, I don't see any obvious candidates. E.g. pre_sev_run() has some code that
only needs to be done on the first iteration, but checking a flag or doing a static
CALL+RET is going to be just as costly as what's already there.
In short, the only flows that will benefit are relatively slow flows and/or flows
that aren't easily predicted by the CPU. E.g. __get_current_cr3_fast() and
cr4_read_shadow() require CALL+RET and might not be super predictable? But even
they are on the cusp of "who cares".
And that needs to be balanced against the probability of introducing bugs. E.g.
this code _could_ be done only on the first iteration:
if (vmx->ple_window_dirty) {
vmx->ple_window_dirty = false;
vmcs_write32(PLE_WINDOW, vmx->ple_window);
}
but (a) checking vmx->ple_window_dirty is going to be super predictable after the
first iteration, (b) handling PLE exits in the fastpath would break things, and
(c) _if_ we want to optimize that code, it can/should be simply moved to
vmx_prepare_switch_to_guest() (but outside of the guest_state_loaded check).
All that said, I'm not totally opposed to shaving cycles. Now that @run_flags
is a thing, it's actually trivially easy to optimize the CR3/CR4 checks (famous
last words):
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 48598d017d6f..5cc1f0168b8a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1709,6 +1709,7 @@ enum kvm_x86_run_flags {
KVM_RUN_FORCE_IMMEDIATE_EXIT = BIT(0),
KVM_RUN_LOAD_GUEST_DR6 = BIT(1),
KVM_RUN_LOAD_DEBUGCTL = BIT(2),
+ KVM_RUN_IS_FIRST_ITERATION = BIT(3),
};
struct kvm_x86_ops {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 55d637cea84a..3deb20b8d0c5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7439,22 +7439,28 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
vmx_reload_guest_debugctl(vcpu);
/*
- * Refresh vmcs.HOST_CR3 if necessary. This must be done immediately
- * prior to VM-Enter, as the kernel may load a new ASID (PCID) any time
- * it switches back to the current->mm, which can occur in KVM context
- * when switching to a temporary mm to patch kernel code, e.g. if KVM
- * toggles a static key while handling a VM-Exit.
+ * Refresh vmcs.HOST_CR3 if necessary. This must be done after IRQs
+ * are disabled, i.e. not when preparing to switch to the guest, as the
+ * the kernel may load a new ASID (PCID) any time it switches back to
+ * the current->mm, which can occur in KVM context when switching to a
+ * temporary mm to patch kernel code, e.g. if KVM toggles a static key
+ * while handling a VM-Exit.
+ *
+ * Refresh host CR3 and CR4 only on the first iteration of the inner
+ * loop, as modifying CR3 or CR4 from NMI context is not allowed.
*/
- cr3 = __get_current_cr3_fast();
- if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
- vmcs_writel(HOST_CR3, cr3);
- vmx->loaded_vmcs->host_state.cr3 = cr3;
- }
+ if (run_flags & KVM_RUN_IS_FIRST_ITERATION) {
+ cr3 = __get_current_cr3_fast();
+ if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
+ vmcs_writel(HOST_CR3, cr3);
+ vmx->loaded_vmcs->host_state.cr3 = cr3;
+ }
- cr4 = cr4_read_shadow();
- if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
- vmcs_writel(HOST_CR4, cr4);
- vmx->loaded_vmcs->host_state.cr4 = cr4;
+ cr4 = cr4_read_shadow();
+ if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
+ vmcs_writel(HOST_CR4, cr4);
+ vmx->loaded_vmcs->host_state.cr4 = cr4;
+ }
}
/* When single-stepping over STI and MOV SS, we must clear the
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6924006f0796..bff08f58c29a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11286,7 +11286,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
goto cancel_injection;
}
- run_flags = 0;
+ run_flags = KVM_RUN_IS_FIRST_ITERATION;
if (req_immediate_exit) {
run_flags |= KVM_RUN_FORCE_IMMEDIATE_EXIT;
kvm_make_request(KVM_REQ_EVENT, vcpu);
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 4/4] KVM: x86: Load guest/host PKRU outside of the fastpath run loop
2025-10-31 20:52 ` Sean Christopherson
@ 2025-11-03 15:32 ` Jon Kohler
0 siblings, 0 replies; 19+ messages in thread
From: Jon Kohler @ 2025-11-03 15:32 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> On Oct 31, 2025, at 4:52 PM, Sean Christopherson <seanjc@google.com> wrote:
>
> Hmm, I would say I'm flat out opposed to generic hooks of that nature. For
> anything that _needs_ to be modified with IRQs disabled, the ordering will matter
> greatly. E.g. we already have kvm_x86_ops.sync_pir_to_irr(), and that _must_ run
> before kvm_vcpu_exit_request() if it triggers a late request.
>
> And I also want to push for as much stuff as possible to be handled in common x86,
> i.e. I want to actively encourage landing things like PKU and CET support in
> common x86 instead of implementing support in one vendor and then having to churn
> a pile of code to later move it to common code.
Fair, agreed having things common-ized helps everyone
> All that said, I'm not totally opposed to shaving cycles. Now that @run_flags
> is a thing, it's actually trivially easy to optimize the CR3/CR4 checks (famous
> last words):
A cycle saved is a cycle earned, perhaps? :)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 48598d017d6f..5cc1f0168b8a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1709,6 +1709,7 @@ enum kvm_x86_run_flags {
> KVM_RUN_FORCE_IMMEDIATE_EXIT = BIT(0),
> KVM_RUN_LOAD_GUEST_DR6 = BIT(1),
> KVM_RUN_LOAD_DEBUGCTL = BIT(2),
> + KVM_RUN_IS_FIRST_ITERATION = BIT(3),
> };
I like this approach, as it makes it easier to grok what we want and when
>
> struct kvm_x86_ops {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 55d637cea84a..3deb20b8d0c5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7439,22 +7439,28 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
> vmx_reload_guest_debugctl(vcpu);
>
> /*
> - * Refresh vmcs.HOST_CR3 if necessary. This must be done immediately
> - * prior to VM-Enter, as the kernel may load a new ASID (PCID) any time
> - * it switches back to the current->mm, which can occur in KVM context
> - * when switching to a temporary mm to patch kernel code, e.g. if KVM
> - * toggles a static key while handling a VM-Exit.
> + * Refresh vmcs.HOST_CR3 if necessary. This must be done after IRQs
> + * are disabled, i.e. not when preparing to switch to the guest, as the
> > + * kernel may load a new ASID (PCID) any time it switches back to
> + * the current->mm, which can occur in KVM context when switching to a
> + * temporary mm to patch kernel code, e.g. if KVM toggles a static key
> + * while handling a VM-Exit.
> + *
> + * Refresh host CR3 and CR4 only on the first iteration of the inner
> + * loop, as modifying CR3 or CR4 from NMI context is not allowed.
> */
> - cr3 = __get_current_cr3_fast();
> - if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
> - vmcs_writel(HOST_CR3, cr3);
> - vmx->loaded_vmcs->host_state.cr3 = cr3;
> - }
> + if (run_flags & KVM_RUN_IS_FIRST_ITERATION) {
> + cr3 = __get_current_cr3_fast();
> + if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
> + vmcs_writel(HOST_CR3, cr3);
> + vmx->loaded_vmcs->host_state.cr3 = cr3;
> + }
>
> - cr4 = cr4_read_shadow();
> - if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
> - vmcs_writel(HOST_CR4, cr4);
> - vmx->loaded_vmcs->host_state.cr4 = cr4;
> + cr4 = cr4_read_shadow();
> + if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
> + vmcs_writel(HOST_CR4, cr4);
> + vmx->loaded_vmcs->host_state.cr4 = cr4;
> + }
> }
>
> /* When single-stepping over STI and MOV SS, we must clear the
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6924006f0796..bff08f58c29a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11286,7 +11286,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> goto cancel_injection;
> }
>
> - run_flags = 0;
> + run_flags = KVM_RUN_IS_FIRST_ITERATION;
> if (req_immediate_exit) {
> run_flags |= KVM_RUN_FORCE_IMMEDIATE_EXIT;
> kvm_make_request(KVM_REQ_EVENT, vcpu);
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling
2025-10-30 22:42 [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Sean Christopherson
` (3 preceding siblings ...)
2025-10-30 22:42 ` [PATCH 4/4] KVM: x86: Load guest/host PKRU " Sean Christopherson
@ 2025-10-31 17:58 ` Jon Kohler
2025-10-31 23:35 ` Edgecombe, Rick P
2025-11-10 15:37 ` Sean Christopherson
6 siblings, 0 replies; 19+ messages in thread
From: Jon Kohler @ 2025-10-31 17:58 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> On Oct 30, 2025, at 6:42 PM, Sean Christopherson <seanjc@google.com> wrote:
>
> This series is the result of the recent PUCK discussion[*] on optimizing the
> XCR0/XSS loads that are currently done on every VM-Enter and VM-Exit. My
> initial thought that swapping XCR0/XSS outside of the fastpath was spot on;
> turns out the only reason they're swapped in the fastpath is because of a
> hack-a-fix that papered over an egregious #MC handling bug where the kernel #MC
> handler would call schedule() from an atomic context. The resulting #GP due to
> trying to swap FPU state with a guest XCR0/XSS was "fixed" by loading the host
> values before handling #MCs from the guest.
>
> Thankfully, the #MC mess has long since been cleaned up, so it's once again
> safe to swap XCR0/XSS outside of the fastpath (but when IRQs are disabled!).
Thank you for doing the diligence on this, I appreciate it!
> As for what may be contributing to the SAP HANA performance improvements when
> enabling PKU, my instincts again appear to be spot on. As predicted, the
> fastpath savings are ~300 cycles on Intel (~500 on AMD). I.e. if the guest
> is literally doing _nothing_ but generating fastpath exits, it will see a
> ~%25 improvement. There's basically zero chance the uplift seen with enabling
> PKU is dues to eliding XCR0 loads; my guess is that the guest actualy uses
> protection keys to optimize something.
Every little bit counts, that's a healthy percentage speedup for fast path stuff,
especially on AMD.
> Why does kvm_load_guest_xsave_state() show up in perf? Probably because it's
> the only visible symbol other than vmx_vmexit() (and vmx_vcpu_run() when not
> hammering the fastpath). E.g. running perf top on a running VM instance yields
> these numbers with various guest workloads (the middle one is running
> mmu_stress_test in the guest, which hammers on mmu_lock in L0). But other than
> doing INVD (handled in the fastpath) in a tight loop, there's no perceived perf
> improvement from the guest.
nit: it’d be nice if these bits were labeled with what they were from (the middle one
you called out above, but what’s the first and third one)
> Overhead Shared Object Symbol
> 15.65% [kernel] [k] vmx_vmexit
> 6.78% [kernel] [k] kvm_vcpu_halt
> 5.15% [kernel] [k] __srcu_read_lock
> 4.73% [kernel] [k] kvm_load_guest_xsave_state
> 4.69% [kernel] [k] __srcu_read_unlock
> 4.65% [kernel] [k] read_tsc
> 4.44% [kernel] [k] vmx_sync_pir_to_irr
> 4.03% [kernel] [k] kvm_apic_has_interrupt
>
>
> 45.52% [kernel] [k] queued_spin_lock_slowpath
> 24.40% [kernel] [k] vmx_vmexit
> 2.84% [kernel] [k] queued_write_lock_slowpath
> 1.92% [kernel] [k] vmx_vcpu_run
> 1.40% [kernel] [k] vcpu_run
> 1.00% [kernel] [k] kvm_load_guest_xsave_state
> 0.84% [kernel] [k] kvm_load_host_xsave_state
> 0.72% [kernel] [k] mmu_try_to_unsync_pages
> 0.68% [kernel] [k] __srcu_read_lock
> 0.65% [kernel] [k] try_get_folio
>
> 17.78% [kernel] [k] vmx_vmexit
> 5.08% [kernel] [k] vmx_vcpu_run
> 4.24% [kernel] [k] vcpu_run
> 4.21% [kernel] [k] _raw_spin_lock_irqsave
> 2.99% [kernel] [k] kvm_load_guest_xsave_state
> 2.51% [kernel] [k] rcu_note_context_switch
> 2.47% [kernel] [k] ktime_get_update_offsets_now
> 2.21% [kernel] [k] kvm_load_host_xsave_state
> 2.16% [kernel] [k] fput
>
> [*] https://drive.google.com/drive/folders/1DCdvqFGudQc7pxXjM7f35vXogTf9uhD4
>
> Sean Christopherson (4):
> KVM: SVM: Handle #MCs in guest outside of fastpath
> KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
> KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run
> loop
> KVM: x86: Load guest/host PKRU outside of the fastpath run loop
>
> arch/x86/kvm/svm/svm.c | 20 ++++++++--------
> arch/x86/kvm/vmx/main.c | 13 ++++++++++-
> arch/x86/kvm/vmx/tdx.c | 3 ---
> arch/x86/kvm/vmx/vmx.c | 7 ------
> arch/x86/kvm/x86.c | 51 ++++++++++++++++++++++++++++-------------
> arch/x86/kvm/x86.h | 2 --
> 6 files changed, 56 insertions(+), 40 deletions(-)
>
>
> base-commit: 4cc167c50eb19d44ac7e204938724e685e3d8057
> --
> 2.51.1.930.gacf6e81ea2-goog
>
Had one conversation starter comment on patch 4, but otherwise, LGTM for
the entire series, thanks again for the help!
Reviewed-By: Jon Kohler <jon@nutanix.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling
2025-10-30 22:42 [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Sean Christopherson
` (4 preceding siblings ...)
2025-10-31 17:58 ` [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Jon Kohler
@ 2025-10-31 23:35 ` Edgecombe, Rick P
2025-11-10 15:37 ` Sean Christopherson
6 siblings, 0 replies; 19+ messages in thread
From: Edgecombe, Rick P @ 2025-10-31 23:35 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com
Cc: kvm@vger.kernel.org, Kohler, Jon, linux-kernel@vger.kernel.org
On Thu, 2025-10-30 at 15:42 -0700, Sean Christopherson wrote:
> Sean Christopherson (4):
> KVM: SVM: Handle #MCs in guest outside of fastpath
> KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
> KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run
> loop
> KVM: x86: Load guest/host PKRU outside of the fastpath run loop
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Interesting analysis.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling
2025-10-30 22:42 [PATCH 0/4] KVM: x86: Cleanup #MC and XCR0/XSS/PKRU handling Sean Christopherson
` (5 preceding siblings ...)
2025-10-31 23:35 ` Edgecombe, Rick P
@ 2025-11-10 15:37 ` Sean Christopherson
2025-11-17 18:35 ` Sean Christopherson
6 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-11-10 15:37 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Jon Kohler
On Thu, 30 Oct 2025 15:42:42 -0700, Sean Christopherson wrote:
> This series is the result of the recent PUCK discussion[*] on optimizing the
> XCR0/XSS loads that are currently done on every VM-Enter and VM-Exit. My
> initial thought that swapping XCR0/XSS outside of the fastpath was spot on;
> turns out the only reason they're swapped in the fastpath is because of a
> hack-a-fix that papered over an egregious #MC handling bug where the kernel #MC
> handler would call schedule() from an atomic context. The resulting #GP due to
> trying to swap FPU state with a guest XCR0/XSS was "fixed" by loading the host
> values before handling #MCs from the guest.
>
> [...]
Applied to kvm-x86 misc, thanks!
[1/4] KVM: SVM: Handle #MCs in guest outside of fastpath
https://github.com/kvm-x86/linux/commit/6e640bb5caab
[2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
https://github.com/kvm-x86/linux/commit/8934c592bcbf
[3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop
https://github.com/kvm-x86/linux/commit/3377a9233d30
[4/4] KVM: x86: Load guest/host PKRU outside of the fastpath run loop
https://github.com/kvm-x86/linux/commit/7df3021b622f
--
https://github.com/kvm-x86/linux/tree/next
^ permalink raw reply [flat|nested] 19+ messages in thread