* [PATCH v8 0/2] arm64/sve: Performance improvements with SVE state saving
@ 2026-03-20 15:44 Mark Brown
From: Mark Brown @ 2026-03-20 15:44 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Mark Rutland, Ryan Roberts, linux-arm-kernel, linux-kernel,
Mark Brown
This series aims to improve our handling of SVE access traps and state
clearing. As SVE deployment progresses, both hardware supporting SVE and
software actively using it are becoming more common. A task using SVE
faces additional costs: the floating point state we must track is
larger, and our syscall ABI requires that the extra state be cleared on
every syscall. Users have measured these overheads and raised concerns
about them.
We can avoid these costs by reenabling SVE access traps and falling back
to FPSIMD only mode, but if we do this too often for tasks that are
actively using SVE the cost of the access traps becomes prohibitive.
Currently we attempt to balance these tradeoffs by starting tasks with
SVE disabled, enabling it on first use and then turning it off if we
need to load state from memory while the task is in a syscall. This
means that CPU bound tasks that do not regularly make blocking syscalls
will rarely drop SVE, while tasks that use a lot of SVE but do block in
syscalls (eg, due to network or user interaction) are much more likely
to do so and hence incur SVE access traps.
I did some instrumentation which counted the number of SVE access traps
and the number of times we loaded FPSIMD only register state for each task.
Testing with Debian Bookworm, this showed that during boot the overwhelming
majority of tasks triggered another SVE access trap more than 50% of the
time after loading FPSIMD only state, with a substantial number near 100%,
though some programs had a very small number of SVE accesses, most likely
from the dynamic linker. There were few tasks in the 5-45% range: most
tasks either used SVE frequently or used it only a tiny proportion of
the time. As expected, older distributions which do not have the SVE
performance work available showed no SVE usage in general applications.
For tasks with minimal SVE usage, benchmarking with fp-pidbench on a
system with 128 bit SVE shows an approximately 6% overhead on syscalls
from having used SVE in the task; the overhead should be greater on a
system with 256 bit SVE since the Z registers must be flushed as well as
the P and FFR registers.
The two patches here move to using a time based heuristic to decide when
to reenable the SVE access trap, doing so after a second. This means
that tasks actively using SVE which block in syscalls should see reduced
or similar numbers of access traps, while CPU bound tasks that rarely
use SVE will see the SVE syscall overhead removed after running for
approximately a second, confirmed via fp-pidbench.
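The heuristic can be sketched in plain C. This is an illustrative
userspace model, not kernel code: HZ, the task struct and the function
names here are stand-ins invented for the sketch, but time_after()
mirrors the kernel macro's wraparound-safe comparison.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the series' heuristic: an SVE access trap (re)arms a
 * roughly one second deadline; loading FPSIMD only state past the
 * deadline drops SVE again, so the next SVE use traps and re-arms it.
 */
#define HZ 250	/* illustrative tick rate; real kernels use 100-1000 */

/* Wraparound-safe "is a after b?", as in the kernel's time_after() */
static bool time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct task {
	bool tif_sve;              /* stand-in for TIF_SVE */
	unsigned long sve_timeout; /* jiffies deadline */
};

/* Models do_sve_acc(): first SVE use (re)arms the timeout */
static void sve_access_trap(struct task *t, unsigned long jiffies)
{
	t->sve_timeout = jiffies + HZ;
	t->tif_sve = true;
}

/* Models the FPSIMD only load path: drop SVE once the deadline passes */
static void load_fpsimd_state(struct task *t, unsigned long jiffies)
{
	if (t->tif_sve && time_after(jiffies, t->sve_timeout))
		t->tif_sve = false; /* next SVE use traps again */
}
```

The signed subtraction in time_after() is what keeps the deadline
comparison correct even when the jiffies counter wraps around zero.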
The benchmarking here is all very much microbenchmarking, so there are
obviously some open questions about the system level impact in actual use.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v8:
- Rebase onto v7.0-rc3.
- Add some benchmarking info from physical systems.
- Add second patch that helps processes that stay on the CPU drop
TIF_SVE.
- Link to v7: https://lore.kernel.org/r/20240730-arm64-sve-trap-mitigation-v7-1-755e7e31bdd7@kernel.org
Changes in v7:
- Rebase onto v6.11-rc1.
- Only flush the predicate registers when loading FPSIMD state, Z will
be flushed by loading the V registers.
- Link to v6: https://lore.kernel.org/r/20240529-arm64-sve-trap-mitigation-v6-1-c2037be6aced@kernel.org
Changes in v6:
- Rebase onto v6.10-rc1.
- Link to v5: https://lore.kernel.org/r/20240405-arm64-sve-trap-mitigation-v5-1-126fe2515ef1@kernel.org
Changes in v5:
- Rebase onto v6.9-rc1.
- Use a timeout rather than number of state loads to decide when to
reenable traps.
- Link to v4: https://lore.kernel.org/r/20240122-arm64-sve-trap-mitigation-v4-1-54e0d78a3ae9@kernel.org
Changes in v4:
- Rebase onto v6.8-rc1.
- Link to v3: https://lore.kernel.org/r/20231113-arm64-sve-trap-mitigation-v3-1-4779c9382483@kernel.org
Changes in v3:
- Rebase onto v6.7-rc1.
- Link to v2: https://lore.kernel.org/r/20230913-arm64-sve-trap-mitigation-v2-1-1bdeff382171@kernel.org
Changes in v2:
- Rebase onto v6.6-rc1.
- Link to v1: https://lore.kernel.org/r/20230807-arm64-sve-trap-mitigation-v1-1-d92eed1d2855@kernel.org
---
Mark Brown (2):
arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state
arm64/sve: Disable TIF_SVE on syscall once per second
arch/arm64/include/asm/fpsimd.h | 1 +
arch/arm64/include/asm/processor.h | 1 +
arch/arm64/kernel/entry-common.c | 14 ++++++++++--
arch/arm64/kernel/entry-fpsimd.S | 15 +++++++++++++
arch/arm64/kernel/fpsimd.c | 46 +++++++++++++++++++++++++++++++++-----
5 files changed, 70 insertions(+), 7 deletions(-)
---
base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
change-id: 20230807-arm64-sve-trap-mitigation-2e7e2663c849
Best regards,
--
Mark Brown <broonie@kernel.org>
* [PATCH v8 1/2] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state
From: Mark Brown @ 2026-03-20 15:44 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Mark Rutland, Ryan Roberts, linux-arm-kernel, linux-kernel,
Mark Brown
When we are in a syscall we take the opportunity to discard the SVE state,
saving only the FPSIMD subset of the register state. If we then have to
reload the floating point state from memory we reenable SVE access
traps, stopping tracking SVE until the task uses SVE again, at which
point it will take another SVE access trap. This means that a task
which is actively using SVE and also making many blocking system calls
will incur the additional overhead of SVE access traps.
The use of SVE for applications like memcpy() means that frequent SVE
usage is common with modern distributions, even with tasks that do not
obviously use floating point. I did some instrumentation which counted
the number of SVE access traps and the number of times we loaded FPSIMD
only register state for each task. Testing with Debian Bookworm this
showed that during boot the overwhelming majority of tasks triggered
another SVE access trap more than 50% of the time after loading FPSIMD
only state with a substantial number near 100%, though some programs had
a very small number of SVE accesses, most likely from startup. There were
few tasks in the 5-45% range: most tasks either used SVE frequently or
used it only a tiny proportion of the time. As expected, older distributions
which do not have the SVE performance work available showed no SVE usage
in general applications.
This indicates that there should be some benefit from reducing the
number of SVE access traps for blocking system calls like we did for non
blocking system calls in commit 8c845e273104 ("arm64/sve: Leave SVE
enabled on syscall if we don't context switch"). Let's do this with a
timeout: when we take a SVE access trap, record a jiffies value after
which we'll reenable SVE traps, then check it whenever we load FPSIMD
only floating point state from memory. If the time has passed then we
reenable traps, otherwise we leave traps disabled and flush the
non-shared register state as we would on trap.
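As a toy illustration of why this path only needs to flush the
predicates (per the v7 changelog, Z is flushed by loading the V
registers), here is a userspace model. The register sizes, struct and
function names are invented for the sketch; the architectural fact it
models is that writing a V register zeroes the corresponding Z
register's bits above bit 127.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy model (not kernel code) of the FPSIMD only load path. Sizes
 * assume a 256-bit (32 byte) SVE vector length for illustration.
 */
#define VL_BYTES 32               /* 256-bit Z registers */
#define PL_BYTES (VL_BYTES / 8)   /* predicates: one bit per Z byte */

struct toy_cpu {
	unsigned char z0[VL_BYTES]; /* one Z register for illustration */
	unsigned char p0[PL_BYTES];
	unsigned char ffr[PL_BYTES];
};

/* Loading the shared V register zeroes the non-shared top of Z */
static void load_v0(struct toy_cpu *cpu, const unsigned char v[16])
{
	memcpy(cpu->z0, v, 16);
	memset(cpu->z0 + 16, 0, VL_BYTES - 16);
}

/* What sve_flush_p(true) must still do explicitly: P and FFR */
static void flush_p_and_ffr(struct toy_cpu *cpu)
{
	memset(cpu->p0, 0, PL_BYTES);
	memset(cpu->ffr, 0, PL_BYTES);
}
```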
The timeout is currently set to a second; I pulled this number out of thin
air, so there is doubtless some room for tuning. This means that for a
task which is actively using SVE the number of SVE access traps will be
equivalent or reduced, while applications which use SVE only very
infrequently will avoid the overheads associated with tracking SVE state
after a second. The extra cost from additional tracking of SVE state
only occurs when a task is preempted, so short running tasks should be
minimally affected.
As would be expected, fp-pidbench shows minimal change from this patch;
it does not block, and on a quiet system it is unlikely to see its state
reloaded from memory.
There should be no functional change resulting from this; it is purely a
performance optimisation.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
arch/arm64/include/asm/fpsimd.h | 1 +
arch/arm64/include/asm/processor.h | 1 +
arch/arm64/kernel/entry-fpsimd.S | 15 +++++++++++++
arch/arm64/kernel/fpsimd.c | 46 +++++++++++++++++++++++++++++++++-----
4 files changed, 58 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index 1d2e33559bd5..2d9ab5bbcb22 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -144,6 +144,7 @@ static inline void *thread_zt_state(struct thread_struct *thread)
extern void sve_save_state(void *state, u32 *pfpsr, int save_ffr);
extern void sve_load_state(void const *state, u32 const *pfpsr,
int restore_ffr);
+extern void sve_flush_p(bool flush_ffr);
extern void sve_flush_live(bool flush_ffr, unsigned long vq_minus_1);
extern unsigned int sve_get_vl(void);
extern void sve_set_vq(unsigned long vq_minus_1);
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index e30c4c8e3a7a..a174864eca5f 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -166,6 +166,7 @@ struct thread_struct {
unsigned int fpsimd_cpu;
void *sve_state; /* SVE registers, if any */
void *sme_state; /* ZA and ZT state, if any */
+ unsigned long sve_timeout; /* jiffies to drop TIF_SVE */
unsigned int vl[ARM64_VEC_MAX]; /* vector length */
unsigned int vl_onexec[ARM64_VEC_MAX]; /* vl after next exec */
unsigned long fault_address; /* fault info */
diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S
index 6325db1a2179..617dd70cafd7 100644
--- a/arch/arm64/kernel/entry-fpsimd.S
+++ b/arch/arm64/kernel/entry-fpsimd.S
@@ -85,6 +85,21 @@ SYM_FUNC_START(sve_flush_live)
2: ret
SYM_FUNC_END(sve_flush_live)
+/*
+ * Zero the predicate registers
+ *
+ * VQ must already be configured by caller, any further updates of VQ
+ * will need to ensure that the register state remains valid.
+ *
+ * x0 = include FFR?
+ */
+SYM_FUNC_START(sve_flush_p)
+ sve_flush_p
+ tbz x0, #0, 1f
+ sve_flush_ffr
+1: ret
+SYM_FUNC_END(sve_flush_p)
+
#endif /* CONFIG_ARM64_SVE */
#ifdef CONFIG_ARM64_SME
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 9de1d8a604cb..d46bb370f9a9 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -360,6 +360,7 @@ static void task_fpsimd_load(void)
{
bool restore_sve_regs = false;
bool restore_ffr;
+ unsigned long sve_vq_minus_one;
WARN_ON(!system_supports_fpsimd());
WARN_ON(preemptible());
@@ -368,16 +369,11 @@ static void task_fpsimd_load(void)
if (system_supports_sve() || system_supports_sme()) {
switch (current->thread.fp_type) {
case FP_STATE_FPSIMD:
- /* Stop tracking SVE for this task until next use. */
- clear_thread_flag(TIF_SVE);
break;
case FP_STATE_SVE:
if (!thread_sm_enabled(¤t->thread))
WARN_ON_ONCE(!test_and_set_thread_flag(TIF_SVE));
- if (test_thread_flag(TIF_SVE))
- sve_set_vq(sve_vq_from_vl(task_get_sve_vl(current)) - 1);
-
restore_sve_regs = true;
restore_ffr = true;
break;
@@ -396,6 +392,15 @@ static void task_fpsimd_load(void)
}
}
+ /*
+ * If SVE has been enabled we may keep it enabled even if
+ * loading only FPSIMD state, so always set the VL.
+ */
+ if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
+ sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1;
+ sve_set_vq(sve_vq_minus_one);
+ }
+
/* Restore SME, override SVE register configuration if needed */
if (system_supports_sme()) {
unsigned long sme_vl = task_get_sme_vl(current);
@@ -425,6 +430,30 @@ static void task_fpsimd_load(void)
} else {
WARN_ON_ONCE(current->thread.fp_type != FP_STATE_FPSIMD);
fpsimd_load_state(¤t->thread.uw.fpsimd_state);
+
+ /*
+ * If the task had been using SVE we keep it enabled
+ * when loading FPSIMD only state for a period to
+ * minimise overhead for tasks actively using SVE,
+ * disabling it periodically to ensure that tasks that
+ * use SVE intermittently do eventually avoid the
+ * overhead of carrying SVE state. The timeout is
+ * initialised when we take a SVE trap in do_sve_acc().
+ */
+ if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
+ if (time_after(jiffies, current->thread.sve_timeout)) {
+ clear_thread_flag(TIF_SVE);
+ sve_user_disable();
+ } else {
+ /*
+ * Loading V will have flushed the
+ * rest of the Z register, SVE is
+ * enabled at EL1 and VL was set
+ * above.
+ */
+ sve_flush_p(true);
+ }
+ }
}
}
@@ -1343,6 +1372,13 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
get_cpu_fpsimd_context();
+ /*
+ * We will keep SVE enabled when loading FPSIMD only state for
+ * the next second to minimise traps when userspace is
+ * actively using SVE.
+ */
+ current->thread.sve_timeout = jiffies + HZ;
+
if (test_and_set_thread_flag(TIF_SVE))
WARN_ON(1); /* SVE access shouldn't have trapped */
--
2.47.3
* [PATCH v8 2/2] arm64/sve: Disable TIF_SVE on syscall once per second
From: Mark Brown @ 2026-03-20 15:44 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Mark Rutland, Ryan Roberts, linux-arm-kernel, linux-kernel,
Mark Brown
Our syscall ABI requires that when performing a syscall the portions of
the Z registers not shared with the V registers, along with the P and
FFR registers, are reset to 0. Since we have no way of monitoring EL0
SVE usage, this is implemented by changing the in-register values on
every syscall for tasks which have SVE enabled; for systems with 128 bit
SVE vector lengths this has been benchmarked as a 6% overhead.
We currently support disabling SVE for userspace tasks when loading the
floating point state from memory during a syscall, allowing tasks that
use SVE infrequently to avoid this overhead, but this may not help CPU
bound tasks if they are not fortunate enough to block or be scheduled
during a syscall. This is done whenever the state is loaded from memory
a second or more after the last time the task generated a SVE access trap.
Extend this mechanism to also apply on syscall entry, disabling SVE
instead of flushing the live registers when we perform a syscall a
second or more after the last SVE access trap was taken. This adds an
additional memory access and branch for tasks using SVE and means that
CPU bound tasks actively using SVE will take extra SVE access traps (at
most one per second) but allows CPU bound tasks that infrequently use
SVE to avoid the overhead of flushing the registers on syscall.
On a system with 128 bit SVE vectors fp-pidbench shows a roughly 4.5%
improvement compared to baseline after having used SVE, at the cost of a
roughly 0.4% overhead when SVE is used between each syscall. Obviously
this is very much a microbenchmark.
This is purely a performance optimisation; there should be no functional
change.
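The "at most one per second" behaviour can be sketched with a small
simulation. This is an illustrative model, not kernel code; HZ, the
per-tick workload and the function names are arbitrary assumptions made
for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a CPU bound task that issues one syscall and one SVE
 * instruction per jiffy: each syscall past the deadline drops TIF_SVE,
 * the next SVE instruction traps and re-arms the timeout, so the task
 * takes roughly one access trap per second.
 */
#define HZ 100	/* illustrative tick rate */

/* Wraparound-safe comparison, as in the kernel's time_after() */
static bool time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Count SVE access traps over 'ticks' jiffies of this workload */
static int simulate(unsigned long ticks)
{
	bool tif_sve = false;
	unsigned long timeout = 0;
	int traps = 0;

	for (unsigned long j = 1; j <= ticks; j++) {
		/* SVE use: trap if disabled, re-arming the timeout */
		if (!tif_sve) {
			tif_sve = true;
			timeout = j + HZ;
			traps++;
		}
		/* syscall entry: drop SVE once the deadline passes */
		if (time_after(j, timeout))
			tif_sve = false;
	}
	return traps;
}
```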
Signed-off-by: Mark Brown <broonie@kernel.org>
---
arch/arm64/kernel/entry-common.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 3625797e9ee8..a7b7ec66f084 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -234,8 +234,18 @@ static inline void fpsimd_syscall_enter(void)
if (test_thread_flag(TIF_SVE)) {
unsigned int sve_vq_minus_one;
- sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1;
- sve_flush_live(true, sve_vq_minus_one);
+ /*
+ * Ensure that tasks that don't block in a syscall
+ * also get a chance to drop TIF_SVE.
+ */
+ if (unlikely(time_after(jiffies,
+ current->thread.sve_timeout))) {
+ clear_thread_flag(TIF_SVE);
+ sve_user_disable();
+ } else {
+ sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1;
+ sve_flush_live(true, sve_vq_minus_one);
+ }
}
/*
--
2.47.3
* Re: [PATCH v8 0/2] arm64/sve: Performance improvements with SVE state saving
From: Will Deacon @ 2026-05-12 14:10 UTC (permalink / raw)
To: Mark Brown
Cc: Catalin Marinas, Mark Rutland, Ryan Roberts, linux-arm-kernel,
linux-kernel
Hi Mark,
On Fri, Mar 20, 2026 at 03:44:13PM +0000, Mark Brown wrote:
> This series aims to improve our handling of SVE access traps and state
> clearing. As SVE deployment progresses both hardware and software
> actively using SVE is becoming more common. When a task is using SVE it
> faces additional costs, the floating point state we must track is larger
> and our syscall ABI requires that the extra state is cleared on every
> syscall. Users have measured these overheads and raised concerns about
> them.
>
> We can avoid these costs by reenabling SVE access traps and falling back
> to FPSIMD only mode but if we do this too often for tasks that are
> actively using SVE the cost of the access traps becomes prohibitive.
> Currently we attempt to balance the tradeoffs here by starting tasks
> with SVE disabled, enabling it on first use and then turning it off if
> we need to load state from memory while the task is in a syscall. This
> means that CPU bound tasks that do not regularly do blocking syscalls
> will rarely drop SVE while tasks that use a lot of SVE but do block in
> syscalls (eg, due to network or user interaction) will be much more
> likely to do and hence incur SVE access traps.
>
> I did some instrumentation which counted the number of SVE access traps
> and the number of times we loaded FPSIMD only register state for each task.
> Testing with Debian Bookworm this showed that during boot the overwhelming
> majority of tasks triggered another SVE access trap more than 50% of the
> time after loading FPSIMD only state with a substantial number near 100%,
> though some programs had a very small number of SVE accesses most likely
> from the dynamic linker. There were few tasks in the range 5-45%, most
> tasks either used SVE frequently or used it only a tiny proportion of
> times. As expected older distributions which do not have the SVE
> performance work available showed no SVE usage in general applications.
>
> For tasks with minimal SVE usage benchmarking with fp-pidbench on a
> system with 128 bit SVE shows an approximately 6% overhead on syscalls
> from having used SVE in the task, the overhead should be greater on a
> system with 256 bit SVE since the Z registers must be flushed as well as
> the P and FFR registers.
>
> The two patches here move to using a time based heuristic to decide when
> to reenable the SVE access trap, doing so after a second. This means
> that tasks actively using SVE which block in syscalls should see reduced
> or similar numbers of access traps, while CPU bound tasks that rarely
> use SVE will see the SVE syscall overhead removed after running for
> approximately a second, confirmed via fp-pidbench.
Have you looked at all at applying this heuristic to SME? I wonder if it
would help with the recent DVMSync erratum workaround, where tasks that
use SME once or infrequently end up causing IPIs for TLB invalidation
every time they run on an affected core.
Will