* [PATCH 0/2] KVM: arm64: nv: Reduce FP/SVE overhead on exception/exception return
@ 2026-05-12 14:07 Marc Zyngier
2026-05-12 14:07 ` [PATCH 1/2] KVM: arm64: nv: Track L2 to L1 exception emulation Marc Zyngier
2026-05-12 14:07 ` [PATCH 2/2] KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception Marc Zyngier
0 siblings, 2 replies; 5+ messages in thread
From: Marc Zyngier @ 2026-05-12 14:07 UTC (permalink / raw)
To: kvmarm, linux-arm-kernel, kvm
Cc: Steffen Eiden, Joey Gouly, Suzuki K Poulose, Oliver Upton,
Zenghui Yu, Mark Rutland, Will Deacon, Fuad Tabba
Staring at NV traces has shown that there is a substantial amount of
overhead being triggered when a guest switches between EL1 and EL2 (or
the reverse). This is caused by the naive put/load mechanism we use to
multiplex EL1 and EL2 onto EL1 only, and the FP handling appears as a
prime candidate for optimisation. More precisely, there are two
distinct sources of overhead here:
- the FP/SVE registers are saved, and potentially the host userspace
state restored when doing put()
- the FP traps are reinstated as part of load(), as the state is now
the host's
These two things mean that we end up with a lot of work during this
switch, and that we are 100% guaranteed to get a FP/SVE trap very
quickly, as the guest keeps using the FP registers. These traps
themselves result in some horrible trap amplification at even moderate
levels of nesting, which we could trivially avoid. A bit of thinking
indicates that it should be entirely valid to elide this stuff in the
context of a nested exception/exception return.
The first patch in this small series just adds a new vcpu state flag
indicating that put() and load() are done in the context of a nested
exception from L2 to L1. This is the exact counterpart of IN_NESTED_ERET,
which tracks an ERET from L1 to L2.
The second patch uses these two flags to simply elide FP/SVE
save/restore when either of them is set, sidestepping the overhead
entirely.
Performance-wise, this is rather impressive. I get a 10%-20%
improvement on running the Debian installer as an L3 on my QC
platform. Combined with the use of the EL2 virtual timer, it almost
makes L3 usable.
But of course, nothing is simple with this stuff, which is why I'm
cc'ing Mark here, as he's done a lot of work tracking funny bugs in
our FP handling. Hopefully I haven't subtly broken anything, but let's
see!
Marc Zyngier (2):
KVM: arm64: nv: Track L2 to L1 exception emulation
KVM: arm64: nv: Don't save/restore FP register during a nested ERET or
exception
arch/arm64/include/asm/kvm_host.h | 3 ++-
arch/arm64/kvm/emulate-nested.c | 4 ++++
arch/arm64/kvm/fpsimd.c | 8 ++++++++
3 files changed, 14 insertions(+), 1 deletion(-)
--
2.47.3
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/2] KVM: arm64: nv: Track L2 to L1 exception emulation
2026-05-12 14:07 [PATCH 0/2] KVM: arm64: nv: Reduce FP/SVE overhead on exception/exception return Marc Zyngier
@ 2026-05-12 14:07 ` Marc Zyngier
2026-05-12 14:07 ` [PATCH 2/2] KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception Marc Zyngier
1 sibling, 0 replies; 5+ messages in thread
From: Marc Zyngier @ 2026-05-12 14:07 UTC (permalink / raw)
To: kvmarm, linux-arm-kernel, kvm
Cc: Steffen Eiden, Joey Gouly, Suzuki K Poulose, Oliver Upton,
Zenghui Yu, Mark Rutland, Will Deacon, Fuad Tabba
While we currently track that we are emulating a nested ERET from
L1 to L2, we don't track the reverse direction (an exception
going from L2 to L1).
Add a new vcpu state flag for this purpose, which will see some
use shortly.
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
arch/arm64/include/asm/kvm_host.h | 3 ++-
arch/arm64/kvm/emulate-nested.c | 4 ++++
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 65eead8362e0b..c79747d5f4dd1 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1112,7 +1112,8 @@ struct kvm_vcpu_arch {
#define IN_NESTED_ERET __vcpu_single_flag(sflags, BIT(7))
/* SError pending for nested guest */
#define NESTED_SERROR_PENDING __vcpu_single_flag(sflags, BIT(8))
-
+/* KVM is currently emulating an L2 to L1 exception */
+#define IN_NESTED_EXCEPTION __vcpu_single_flag(sflags, BIT(9))
/* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
#define vcpu_sve_pffr(vcpu) (kern_hyp_va((vcpu)->arch.sve_state) + \
diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c
index dba7ced74ca5e..15c691a6266d5 100644
--- a/arch/arm64/kvm/emulate-nested.c
+++ b/arch/arm64/kvm/emulate-nested.c
@@ -2862,6 +2862,8 @@ static int kvm_inject_nested(struct kvm_vcpu *vcpu, u64 esr_el2,
preempt_disable();
+ vcpu_set_flag(vcpu, IN_NESTED_EXCEPTION);
+
/*
* We may have an exception or PC update in the EL0/EL1 context.
* Commit it before entering EL2.
@@ -2884,6 +2886,8 @@ static int kvm_inject_nested(struct kvm_vcpu *vcpu, u64 esr_el2,
__kvm_adjust_pc(vcpu);
kvm_arch_vcpu_load(vcpu, smp_processor_id());
+ vcpu_clear_flag(vcpu, IN_NESTED_EXCEPTION);
+
preempt_enable();
if (kvm_vcpu_has_pmu(vcpu))
--
2.47.3
* [PATCH 2/2] KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception
2026-05-12 14:07 [PATCH 0/2] KVM: arm64: nv: Reduce FP/SVE overhead on exception/exception return Marc Zyngier
2026-05-12 14:07 ` [PATCH 1/2] KVM: arm64: nv: Track L2 to L1 exception emulation Marc Zyngier
@ 2026-05-12 14:07 ` Marc Zyngier
2026-05-13 12:28 ` Mark Rutland
1 sibling, 1 reply; 5+ messages in thread
From: Marc Zyngier @ 2026-05-12 14:07 UTC (permalink / raw)
To: kvmarm, linux-arm-kernel, kvm
Cc: Steffen Eiden, Joey Gouly, Suzuki K Poulose, Oliver Upton,
Zenghui Yu, Mark Rutland, Will Deacon, Fuad Tabba
When switching between L1 and L2, we diligently use a non-preemptible
put/load sequence in order to make sure that the old state is saved,
while the new state is brought in. Crucially, this includes the FP
registers.
However, this is a bit silly. The FP registers are completely shared
between the various ELs (just like the GPRs, really), and eagerly
save/restoring those in a non-preemptible section is just overhead.
Not to mention that the next access will end-up trapping, something
that becomes exponentially expensive as we nest deeper.
The temptation is therefore to completely drop this save/restore thing.
Why is it valid to do so? By analogy, the hypervisor doesn't try to
poloce things between EL1 and EL0, or between EL2 and EL0. Why should
it do so between EL2 and EL1 (or EL2 and L2 EL0)?
Once you admit that the FP (and by extension SVE) registers are EL-agnostic,
the things that matter are:
- the trap controls: the effective values are recomputed on each entry
into the guest to take the EL into account and merge the L0 and L1
configuration if in a nested context, or directly use the L0 configuration
in non-nested context (see __activate_traps()).
- the VL settings: the effective values are also recomputed on each
entry into the guest (see fpsimd_lazy_switch_to_guest()).
Since we appear to cover all bases, use the vcpu flags indicating the
handling of a nested ERET or exception delivery to avoid the whole FP
save/restore shenanigans.
For an EL1 L3 guest where L1 and L2 have this optimisation, this
results in at least a 10% wall clock reduction when running an I/O
heavy workload, generating a high rate of nested exceptions.
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
arch/arm64/kvm/fpsimd.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c
index 15e17aca1dec0..73eda0f46b127 100644
--- a/arch/arm64/kvm/fpsimd.c
+++ b/arch/arm64/kvm/fpsimd.c
@@ -28,6 +28,10 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
if (!system_supports_fpsimd())
return;
+ if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
+ vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION))
+ return;
+
/*
* Ensure that any host FPSIMD/SVE/SME state is saved and unbound such
* that the host kernel is responsible for restoring this state upon
@@ -102,6 +106,10 @@ void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu)
{
unsigned long flags;
+ if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
+ vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION))
+ return;
+
local_irq_save(flags);
if (guest_owns_fp_regs()) {
--
2.47.3
* Re: [PATCH 2/2] KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception
2026-05-12 14:07 ` [PATCH 2/2] KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception Marc Zyngier
@ 2026-05-13 12:28 ` Mark Rutland
2026-05-13 12:49 ` Marc Zyngier
0 siblings, 1 reply; 5+ messages in thread
From: Mark Rutland @ 2026-05-13 12:28 UTC (permalink / raw)
To: Marc Zyngier
Cc: kvmarm, linux-arm-kernel, kvm, Steffen Eiden, Joey Gouly,
Suzuki K Poulose, Oliver Upton, Zenghui Yu, Will Deacon,
Fuad Tabba
On Tue, May 12, 2026 at 03:07:55PM +0100, Marc Zyngier wrote:
> When switching between L1 and L2, we diligently use a non-preemptible
> put/load sequence in order to make sure that the old state is saved,
> while the new state is brought in. Crucially, this includes the FP
> registers.
>
> However, this is a bit silly. The FP registers are completely shared
> between the various ELs (just like the GPRs, really), and eagerly
> save/restoring those in a non-preemptible section is just overhead.
> Not to mention that the next access will end-up trapping, something
> that becomes exponentially expensive as we nest deeper.
>
> The temptation is therefore to completely drop this save/restore thing.
> Why is it valid to do so? By analogy, the hypervisor doesn't try to
> poloce things between EL1 and EL0, or between EL2 and EL0. Why should
> it do so between EL2 and EL1 (or EL2 and L2 EL0)?
>
> Once you admit that the FP (and by extension SVE) registers are EL-agnostic,
> the things that matter are:
s/poloce/police/ ?
The above is a bit flowery; it would be nice to remove the rhetorical
questions and just state that (aside from some control registers) the
FPSIMD/SVE/SME state is shared between exception levels and doesn't need
to be saved/restored.
How about:
When switching between L1 and L2, we save the old state using
kvm_arch_vcpu_put(), mutate the state in memory, then load the new
state using kvm_arch_vcpu_load(). Any live FPSIMD/SVE state is saved
and unbound, such that it can be lazily restored on a subsequent trap.
The FPSIMD/SVE state is shared by exception levels, and only a handful
of related control registers need to be changed when transitioning
between L1 and L2. The save/restore of the common state is needless
overhead, especially as trapping becomes exponentially more expensive
with nesting.
Avoid this overhead by leaving the common FPSIMD/SVE state live on the
CPU, and only switching the state that is distinct for L1 and L2:
> - the trap controls: the effective values are recomputed on each entry
> into the guest to take the EL into account and merge the L0 and L1
> configuration if in a nested context, or directly use the L0 configuration
> in non-nested context (see __activate_traps()).
>
> - the VL settings: the effective values are also recomputed on each
> entry into the guest (see fpsimd_lazy_switch_to_guest()).
This is true for FPSIMD+SVE today. For SME, SMCR_ELx also contains other
controls, and will need to be dealt with similarly. It might be worth
noting that (and that ZCR_ELx could gain new controls in future).
> Since we appear to cover all bases, use the vcpu flags indicating the
> handling of a nested ERET or exception delivery to avoid the whole FP
> save/restore shenanigans.
>
> For an EL1 L3 guest where L1 and L2 have this optimisation, this
> results in at least a 10% wall clock reduction when running an I/O
> heavy workload, generating a high rate of nested exceptions.
>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
> arch/arm64/kvm/fpsimd.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c
> index 15e17aca1dec0..73eda0f46b127 100644
> --- a/arch/arm64/kvm/fpsimd.c
> +++ b/arch/arm64/kvm/fpsimd.c
> @@ -28,6 +28,10 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
> if (!system_supports_fpsimd())
> return;
>
> + if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
> + vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION))
> + return;
> +
I think we need a comment as to why this is safe, with some other detail
from the commit message. It would also be good to have asserts here to
catch if something goes wrong.
How about:
/*
* Avoid needless save/restore of the guest's common
* FPSIMD/SVE/SME regs during transitions between L1/L2.
*
* These transitions only happen in a non-preemptible context
* where the host regs have already been saved and unbound. The
* live registers are either free or owned by the guest.
*/
if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION)) {
WARN_ON_ONCE(host_owns_fp_regs());
return;
}
... ?
Note: I didn't add WARN_ON_ONCE(preemptible()), since
kvm_arch_vcpu_load_fp() should *never* be called in a preemptible
context.
> /*
> * Ensure that any host FPSIMD/SVE/SME state is saved and unbound such
> * that the host kernel is responsible for restoring this state upon
> @@ -102,6 +106,10 @@ void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu)
> {
> unsigned long flags;
>
> + if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
> + vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION))
> + return;
Likewise here, but we can reduce the comment, e.g.
/*
* See comment in kvm_arch_vcpu_load_fp().
*/
if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION)) {
WARN_ON_ONCE(host_owns_fp_regs());
return;
}
Thanks,
Mark.
> +
> local_irq_save(flags);
>
> if (guest_owns_fp_regs()) {
> --
> 2.47.3
>
* Re: [PATCH 2/2] KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception
2026-05-13 12:28 ` Mark Rutland
@ 2026-05-13 12:49 ` Marc Zyngier
0 siblings, 0 replies; 5+ messages in thread
From: Marc Zyngier @ 2026-05-13 12:49 UTC (permalink / raw)
To: Mark Rutland
Cc: kvmarm, linux-arm-kernel, kvm, Steffen Eiden, Joey Gouly,
Suzuki K Poulose, Oliver Upton, Zenghui Yu, Will Deacon,
Fuad Tabba
Hi Mark,
Thanks for looking into this.
On Wed, 13 May 2026 13:28:56 +0100,
Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, May 12, 2026 at 03:07:55PM +0100, Marc Zyngier wrote:
> > When switching between L1 and L2, we diligently use a non-preemptible
> > put/load sequence in order to make sure that the old state is saved,
> > while the new state is brought in. Crucially, this includes the FP
> > registers.
> >
> > However, this is a bit silly. The FP registers are completely shared
> > between the various ELs (just like the GPRs, really), and eagerly
> > save/restoring those in a non-preemptible section is just overhead.
> > Not to mention that the next access will end-up trapping, something
> > that becomes exponentially expensive as we nest deeper.
> >
> > The temptation is therefore to completely drop this save/restore thing.
> > Why is it valid to do so? By analogy, the hypervisor doesn't try to
> > poloce things between EL1 and EL0, or between EL2 and EL0. Why should
> > it do so between EL2 and EL1 (or EL2 and L2 EL0)?
> >
> > Once you admit that the FP (and by extension SVE) registers are EL-agnostic,
> > the things that matter are:
>
> s/poloce/police/ ?
That.
>
> The above is a bit flowery; it would be nice to remove the rhetorical
> questions and just state that (aside from some control registers) the
> FPSIMD/SVE/SME state is shared between exception levels and doesn't need
> to be saved/restored.
>
> How about:
>
> When switching between L1 and L2, we save the old state using
> kvm_arch_vcpu_put(), mutate the state in memory, then load the new
> state using kvm_arch_vcpu_load(). Any live FPSIMD/SVE state is saved
> and unbound, such that it can be lazily restored on a subsequent trap.
>
> The FPSIMD/SVE state is shared by exception levels, and only a handful
> of related control registers need to be changed when transitioning
> between L1 and L2. The save/restore of the common state is needless
> overhead, especially as trapping becomes exponentially more expensive
> with nesting.
>
> Avoid this overhead by leaving the common FPSIMD/SVE state live on the
> CPU, and only switching the state that is distinct for L1 and L2:
>
Sold. Do you offer a CMAAS (Commit Message As A Service)? Asking for a
friend... ;-)
> > - the trap controls: the effective values are recomputed on each entry
> > into the guest to take the EL into account and merge the L0 and L1
> > configuration if in a nested context, or directly use the L0 configuration
> > in non-nested context (see __activate_traps()).
> >
> > - the VL settings: the effective values are also recomputed on each
> > entry into the guest (see fpsimd_lazy_switch_to_guest()).
>
> This is true for FPSIMD+SVE today. For SME, SMCR_ELx also contains other
> controls, and will need to be dealt with similarly. It might be worth
> noting that (and that ZCR_ELx could gain new controls in future).
>
Yeah. I tried not to worry too much about SME, but given that it is on
people's radar, I'll drop a comment here.
> > Since we appear to cover all bases, use the vcpu flags indicating the
> > handling of a nested ERET or exception delivery to avoid the whole FP
> > save/restore shenanigans.
> >
> > For an EL1 L3 guest where L1 and L2 have this optimisation, this
> > results in at least a 10% wall clock reduction when running an I/O
> > heavy workload, generating a high rate of nested exceptions.
> >
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > ---
> > arch/arm64/kvm/fpsimd.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c
> > index 15e17aca1dec0..73eda0f46b127 100644
> > --- a/arch/arm64/kvm/fpsimd.c
> > +++ b/arch/arm64/kvm/fpsimd.c
> > @@ -28,6 +28,10 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
> > if (!system_supports_fpsimd())
> > return;
> >
> > + if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
> > + vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION))
> > + return;
> > +
>
> I think we need a comment as to why this is safe, with some other detail
> from the commit message. It would also be good to have asserts here to
> catch if something goes wrong.
>
> How about:
>
> /*
> * Avoid needless save/restore of the guest's common
> * FPSIMD/SVE/SME regs during transitions between L1/L2.
> *
> * These transitions only happen in a non-preemptible context
> * where the host regs have already been saved and unbound. The
> * live registers are either free or owned by the guest.
> */
> if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
> vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION)) {
> WARN_ON_ONCE(host_owns_fp_regs());
> return;
> }
>
> ... ?
>
> Note: I didn't add WARN_ON_ONCE(preemptible()), since
> kvm_arch_vcpu_load_fp() should *never* be called in a preemptible
> context.
>
> > /*
> > * Ensure that any host FPSIMD/SVE/SME state is saved and unbound such
> > * that the host kernel is responsible for restoring this state upon
> > @@ -102,6 +106,10 @@ void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu)
> > {
> > unsigned long flags;
> >
> > + if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
> > + vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION))
> > + return;
>
> Likewise here, but we can reduce the comment, e.g.
>
> /*
> * See comment in kvm_arch_vcpu_load_fp().
> */
> if (vcpu_get_flag(vcpu, IN_NESTED_ERET) ||
> vcpu_get_flag(vcpu, IN_NESTED_EXCEPTION)) {
> WARN_ON_ONCE(host_owns_fp_regs());
> return;
> }
Yup, that all looks good to me. I'll repost that next week with these
changes.
Thanks again,
M.
--
Without deviation from the norm, progress is not possible.