linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v2 0/2] ARM: allow kernel mode NEON in softirq context
@ 2022-12-07 10:39 Ard Biesheuvel
  2022-12-07 10:39 ` [PATCH v2 1/2] ARM: vfp: Manipulate VFP state with softirqs disabled Ard Biesheuvel
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ard Biesheuvel @ 2022-12-07 10:39 UTC (permalink / raw)
  To: linux-arm-kernel, linux
  Cc: linux-crypto, Ard Biesheuvel, Linus Walleij, Arnd Bergmann

Currently on ARM, we only permit kernel mode NEON in task context, and
NEON based processing triggered from softirq context is queued for
asynchronous completion via the crypto API's cryptd layer.

For IPsec packet encryption involving highly performant crypto
implementations, this results in a substantial performance hit, and so
it would be desirable to permit those crypto operations to complete
synchronously even when invoked from softirq context.

For example, on a 1 GHz Cortex-A53 machine (SynQuacer), AES-256-GCM
executes in 7.2 cycles per byte, putting an upper bound of ~140 MB/s
on the achievable throughput of a single CPU.
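
As a rough sanity check of that bound, here is a small userspace sketch (not
part of the series) that divides the clock rate by the cost per byte. The
function names are made up for illustration, and 1 MB is taken as 10^6 bytes.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Upper bound on single-CPU throughput in MB/s (1 MB = 10^6 bytes),
 * given a clock rate in Hz and a cost in cycles per byte.
 * Hypothetical helper, for illustration only.
 */
double max_throughput_mb_per_s(double clock_hz, double cycles_per_byte)
{
	assert(cycles_per_byte > 0.0);
	return clock_hz / cycles_per_byte / 1e6;
}

/* Print the bound for the figures quoted above: 1 GHz at 7.2 cycles/byte. */
void print_bound(void)
{
	printf("%.1f MB/s\n", max_throughput_mb_per_s(1e9, 7.2));
}
```

Calling print_bound() gives roughly 139 MB/s, which is where the ~140 MB/s
figure comes from.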

Without these changes, an IPsec tunnel from a 32-bit VM to the 64-bit
host can achieve a throughput of 9.5 MB/s TX and 11.9 MB/s RX.

When the crypto algorithm is permitted to execute in softirq context,
the throughput increases to 16.5 MB/s TX and 41 MB/s RX.

(This is measured using Debian's iperf3 3.11 with the default options)

So let's reorganize the VFP state handling so that its critical
handling of the FPU registers runs with softirqs disabled. Then, update
the kernel_neon_begin()/end() logic to keep softirq processing disabled
as long as the NEON is being used in kernel mode.

Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>

Ard Biesheuvel (2):
  ARM: vfp: Manipulate VFP state with softirqs disabled
  ARM: permit non-nested kernel mode NEON in softirq context

 arch/arm/include/asm/assembler.h | 19 ++++++++++++-------
 arch/arm/include/asm/simd.h      |  8 ++++++++
 arch/arm/kernel/asm-offsets.c    |  1 +
 arch/arm/vfp/entry.S             |  4 ++--
 arch/arm/vfp/vfphw.S             |  4 ++--
 arch/arm/vfp/vfpmodule.c         | 19 ++++++++++++-------
 6 files changed, 37 insertions(+), 18 deletions(-)
 create mode 100644 arch/arm/include/asm/simd.h

-- 
2.35.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* [PATCH v2 1/2] ARM: vfp: Manipulate VFP state with softirqs disabled
  2022-12-07 10:39 [PATCH v2 0/2] ARM: allow kernel mode NEON in softirq context Ard Biesheuvel
@ 2022-12-07 10:39 ` Ard Biesheuvel
  2022-12-15 10:22   ` Linus Walleij
  2022-12-07 10:39 ` [PATCH v2 2/2] ARM: permit non-nested kernel mode NEON in softirq context Ard Biesheuvel
  2022-12-12 14:37 ` [PATCH v2 0/2] ARM: allow " Martin Willi
  2 siblings, 1 reply; 10+ messages in thread
From: Ard Biesheuvel @ 2022-12-07 10:39 UTC (permalink / raw)
  To: linux-arm-kernel, linux
  Cc: linux-crypto, Ard Biesheuvel, Linus Walleij, Arnd Bergmann

In a subsequent patch, we will relax the kernel mode NEON policy, and
permit kernel mode NEON to be used not only from task context, as is
permitted today, but also from softirq context.

Given that softirqs may trigger over the back of any IRQ unless they are
explicitly disabled, we need to address the resulting races in the VFP
state handling, by disabling softirq processing in two distinct but
related cases:
- kernel mode NEON will leave the FPU disabled after it completes, so
  any kernel code sequence that enables the FPU and subsequently accesses
  its registers needs to disable softirqs until it completes;
- kernel_neon_begin() will preserve the userland VFP state in memory,
  and if it interrupts the ordinary VFP state preserve sequence, the
  latter will resume execution with the VFP registers corrupted, and
  happily save them to memory.

Given that disabling softirqs also disables preemption, we can replace
the existing preempt_disable/enable occurrences in the VFP state
handling asm code with new macros that dis/enable softirqs instead.
In the VFP state handling C code, add local_bh_disable/enable() calls
in those places where the VFP state is preserved.
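
To illustrate why disabling softirqs implies disabling preemption, here is a
minimal userspace model of the counter arithmetic done by the new asm macros.
It is only a sketch: the bit layout is assumed to match
include/linux/preempt.h (bits 0-7 preemption depth, bits 8-15 softirq count),
every model_* name is hypothetical, and IRQ state and pending softirq
processing are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout, mirroring include/linux/preempt.h (not part of this patch). */
#define PREEMPT_MASK		0x000000ffu
#define SOFTIRQ_MASK		0x0000ff00u
#define SOFTIRQ_OFFSET		(1u << 8)
#define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)

/* Hypothetical stand-in for the thread_info field behind TI_PREEMPT. */
struct model_thread_info {
	unsigned int preempt_count;
};

/* Same arithmetic as the local_bh_disable asm macro: ldr, add, str. */
void model_local_bh_disable(struct model_thread_info *ti)
{
	ti->preempt_count += SOFTIRQ_DISABLE_OFFSET;
}

/*
 * Same arithmetic as the local_bh_enable_ti asm macro: ldr, sub, str.
 * Like that macro, this model does not run pending softirqs.
 */
void model_local_bh_enable(struct model_thread_info *ti)
{
	assert((ti->preempt_count & SOFTIRQ_MASK) >= SOFTIRQ_DISABLE_OFFSET);
	ti->preempt_count -= SOFTIRQ_DISABLE_OFFSET;
}

/* Softirqs are held off while the softirq field is non-zero. */
bool model_softirqs_disabled(const struct model_thread_info *ti)
{
	return (ti->preempt_count & SOFTIRQ_MASK) != 0;
}

/* Preemption needs the whole counter to be zero (IRQ state ignored here). */
bool model_preemptible(const struct model_thread_info *ti)
{
	return ti->preempt_count == 0;
}
```

Since both conditions are derived from the same word, any non-zero softirq
count also makes the task non-preemptible, which is why the
inc/dec_preempt_count pairs can simply be replaced.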

One thing to keep in mind is that, once we allow NEON use in softirq
context, the result of any such interruption is that the FPEXC_EN bit in
the FPEXC register will be cleared, and vfp_current_hw_state[cpu] will
be NULL. This means that any sequence that [conditionally] clears
FPEXC_EN and/or sets vfp_current_hw_state[cpu] to NULL does not need to
run with softirqs disabled, as the result will be the same. Furthermore,
the handling of THREAD_NOTIFY_SWITCH is guaranteed to run with IRQs
disabled, and so it does not need protection from softirq interruptions
either.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm/include/asm/assembler.h | 19 ++++++++++++-------
 arch/arm/kernel/asm-offsets.c    |  1 +
 arch/arm/vfp/entry.S             |  4 ++--
 arch/arm/vfp/vfphw.S             |  4 ++--
 arch/arm/vfp/vfpmodule.c         |  8 +++++++-
 5 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h
index 90fbe4a3f9c8472f..df999b75c0e25b01 100644
--- a/arch/arm/include/asm/assembler.h
+++ b/arch/arm/include/asm/assembler.h
@@ -236,21 +236,26 @@ THUMB(	fpreg	.req	r7	)
 	sub	\tmp, \tmp, #1			@ decrement it
 	str	\tmp, [\ti, #TI_PREEMPT]
 	.endm
-
-	.macro	dec_preempt_count_ti, ti, tmp
-	get_thread_info \ti
-	dec_preempt_count \ti, \tmp
-	.endm
 #else
 	.macro	inc_preempt_count, ti, tmp
 	.endm
 
 	.macro	dec_preempt_count, ti, tmp
 	.endm
+#endif
+
+	.macro	local_bh_disable, ti, tmp
+	ldr	\tmp, [\ti, #TI_PREEMPT]
+	add	\tmp, \tmp, #SOFTIRQ_DISABLE_OFFSET
+	str	\tmp, [\ti, #TI_PREEMPT]
+	.endm
 
-	.macro	dec_preempt_count_ti, ti, tmp
+	.macro	local_bh_enable_ti, ti, tmp
+	get_thread_info \ti
+	ldr	\tmp, [\ti, #TI_PREEMPT]
+	sub	\tmp, \tmp, #SOFTIRQ_DISABLE_OFFSET
+	str	\tmp, [\ti, #TI_PREEMPT]
 	.endm
-#endif
 
 #define USERL(l, x...)				\
 9999:	x;					\
diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c
index 2c8d76fd7c66298a..38121c59cbc26cdd 100644
--- a/arch/arm/kernel/asm-offsets.c
+++ b/arch/arm/kernel/asm-offsets.c
@@ -56,6 +56,7 @@ int main(void)
   DEFINE(VFP_CPU,		offsetof(union vfp_state, hard.cpu));
 #endif
 #endif
+  DEFINE(SOFTIRQ_DISABLE_OFFSET,SOFTIRQ_DISABLE_OFFSET);
 #ifdef CONFIG_ARM_THUMBEE
   DEFINE(TI_THUMBEE_STATE,	offsetof(struct thread_info, thumbee_state));
 #endif
diff --git a/arch/arm/vfp/entry.S b/arch/arm/vfp/entry.S
index 27b0a1f27fbdf392..9a89264cdcc0b46e 100644
--- a/arch/arm/vfp/entry.S
+++ b/arch/arm/vfp/entry.S
@@ -22,7 +22,7 @@
 @  IRQs enabled.
 @
 ENTRY(do_vfp)
-	inc_preempt_count r10, r4
+	local_bh_disable r10, r4
  	ldr	r4, .LCvfp
 	ldr	r11, [r10, #TI_CPU]	@ CPU number
 	add	r10, r10, #TI_VFPSTATE	@ r10 = workspace
@@ -30,7 +30,7 @@ ENTRY(do_vfp)
 ENDPROC(do_vfp)
 
 ENTRY(vfp_null_entry)
-	dec_preempt_count_ti r10, r4
+	local_bh_enable_ti r10, r4
 	ret	lr
 ENDPROC(vfp_null_entry)
 
diff --git a/arch/arm/vfp/vfphw.S b/arch/arm/vfp/vfphw.S
index 6f7926c9c1790f66..26c4f61ecfa39638 100644
--- a/arch/arm/vfp/vfphw.S
+++ b/arch/arm/vfp/vfphw.S
@@ -175,7 +175,7 @@ vfp_hw_state_valid:
 					@ else it's one 32-bit instruction, so
 					@ always subtract 4 from the following
 					@ instruction address.
-	dec_preempt_count_ti r10, r4
+	local_bh_enable_ti r10, r4
 	ret	r9			@ we think we have handled things
 
 
@@ -200,7 +200,7 @@ skip:
 	@ not recognised by VFP
 
 	DBGSTR	"not VFP"
-	dec_preempt_count_ti r10, r4
+	local_bh_enable_ti r10, r4
 	ret	lr
 
 process_exception:
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 2cb355c1b5b71694..8f5bc672b4aac04a 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -416,7 +416,7 @@ void VFP_bounce(u32 trigger, u32 fpexc, struct pt_regs *regs)
 	if (exceptions)
 		vfp_raise_exceptions(exceptions, trigger, orig_fpscr, regs);
  exit:
-	preempt_enable();
+	local_bh_enable();
 }
 
 static void vfp_enable(void *unused)
@@ -517,6 +517,8 @@ void vfp_sync_hwstate(struct thread_info *thread)
 {
 	unsigned int cpu = get_cpu();
 
+	local_bh_disable();
+
 	if (vfp_state_in_hw(cpu, thread)) {
 		u32 fpexc = fmrx(FPEXC);
 
@@ -528,6 +530,7 @@ void vfp_sync_hwstate(struct thread_info *thread)
 		fmxr(FPEXC, fpexc);
 	}
 
+	local_bh_enable();
 	put_cpu();
 }
 
@@ -717,6 +720,8 @@ void kernel_neon_begin(void)
 	unsigned int cpu;
 	u32 fpexc;
 
+	local_bh_disable();
+
 	/*
 	 * Kernel mode NEON is only allowed outside of interrupt context
 	 * with preemption disabled. This will make sure that the kernel
@@ -739,6 +744,7 @@ void kernel_neon_begin(void)
 		vfp_save_state(vfp_current_hw_state[cpu], fpexc);
 #endif
 	vfp_current_hw_state[cpu] = NULL;
+	local_bh_enable();
 }
 EXPORT_SYMBOL(kernel_neon_begin);
 
-- 
2.35.1



* [PATCH v2 2/2] ARM: permit non-nested kernel mode NEON in softirq context
  2022-12-07 10:39 [PATCH v2 0/2] ARM: allow kernel mode NEON in softirq context Ard Biesheuvel
  2022-12-07 10:39 ` [PATCH v2 1/2] ARM: vfp: Manipulate VFP state with softirqs disabled Ard Biesheuvel
@ 2022-12-07 10:39 ` Ard Biesheuvel
  2022-12-15 10:26   ` Linus Walleij
  2022-12-12 14:37 ` [PATCH v2 0/2] ARM: allow " Martin Willi
  2 siblings, 1 reply; 10+ messages in thread
From: Ard Biesheuvel @ 2022-12-07 10:39 UTC (permalink / raw)
  To: linux-arm-kernel, linux
  Cc: linux-crypto, Ard Biesheuvel, Linus Walleij, Arnd Bergmann

We currently only permit kernel mode NEON in process context, to avoid
the need to preserve/restore the NEON register file when taking an
exception while running in the kernel.

Like we did on arm64, we can relax this restriction substantially, by
permitting kernel mode NEON from softirq context, while ensuring that
softirq processing is disabled when the NEON is being used in task
context. This guarantees that only NEON context belonging to user space
needs to be preserved and restored, which is already taken care of.

This is especially relevant for network encryption, where incoming
frames are typically handled in softirq context, and deferring software
decryption to a kernel thread or falling back to C code are both
undesirable from a performance PoV.
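
For reference, the calling pattern this enables can be modelled in userspace
roughly as follows. This is a sketch under stated assumptions, not kernel code:
enum model_ctx and all model_* functions are hypothetical stand-ins for the
execution context, may_use_simd() and kernel_neon_begin()/end(), and both
paths do the same scalar XOR so that only the control flow is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical execution contexts, used only by this model. */
enum model_ctx { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ };

/*
 * Models the may_use_simd() policy from this patch: anything except
 * hardirq context (assumes CONFIG_KERNEL_MODE_NEON is enabled).
 */
bool model_may_use_simd(enum model_ctx ctx)
{
	return ctx != CTX_HARDIRQ;
}

/* Tracks whether a model NEON section is open, to catch nesting. */
static int model_neon_depth;

/* Stand-in for kernel_neon_begin(): no hardirq, no nesting. */
void model_neon_begin(enum model_ctx ctx)
{
	assert(ctx != CTX_HARDIRQ);	/* models BUG_ON(in_hardirq()) */
	assert(model_neon_depth == 0);	/* non-nested use only */
	model_neon_depth++;
}

/* Stand-in for kernel_neon_end(). */
void model_neon_end(void)
{
	assert(model_neon_depth == 1);
	model_neon_depth--;
}

/*
 * XOR a buffer with a key byte. Returns true if the "SIMD" path was
 * taken, false if the caller had to fall back to the scalar path.
 */
bool model_xor(enum model_ctx ctx, unsigned char *buf, size_t len,
	       unsigned char key)
{
	bool simd = model_may_use_simd(ctx);
	size_t i;

	if (simd)
		model_neon_begin(ctx);
	for (i = 0; i < len; i++)
		buf[i] ^= key;	/* same math on both paths in this model */
	if (simd)
		model_neon_end();
	return simd;
}
```

In this model, task and softirq callers both take the SIMD path, and only a
hardirq caller is sent down the fallback path.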

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm/include/asm/simd.h |  8 ++++++++
 arch/arm/vfp/vfpmodule.c    | 13 ++++++-------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/arm/include/asm/simd.h b/arch/arm/include/asm/simd.h
new file mode 100644
index 0000000000000000..82191dbd7e78a036
--- /dev/null
+++ b/arch/arm/include/asm/simd.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/hardirq.h>
+
+static __must_check inline bool may_use_simd(void)
+{
+	return IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && !in_hardirq();
+}
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 8f5bc672b4aac04a..4e1a786df76df157 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -723,12 +723,12 @@ void kernel_neon_begin(void)
 	local_bh_disable();
 
 	/*
-	 * Kernel mode NEON is only allowed outside of interrupt context
-	 * with preemption disabled. This will make sure that the kernel
-	 * mode NEON register contents never need to be preserved.
+	 * Kernel mode NEON is only allowed outside of hardirq context with
+	 * preemption and softirq processing disabled. This will make sure that
+	 * the kernel mode NEON register contents never need to be preserved.
 	 */
-	BUG_ON(in_interrupt());
-	cpu = get_cpu();
+	BUG_ON(in_hardirq());
+	cpu = __smp_processor_id();
 
 	fpexc = fmrx(FPEXC) | FPEXC_EN;
 	fmxr(FPEXC, fpexc);
@@ -744,7 +744,6 @@ void kernel_neon_begin(void)
 		vfp_save_state(vfp_current_hw_state[cpu], fpexc);
 #endif
 	vfp_current_hw_state[cpu] = NULL;
-	local_bh_enable();
 }
 EXPORT_SYMBOL(kernel_neon_begin);
 
@@ -752,7 +751,7 @@ void kernel_neon_end(void)
 {
 	/* Disable the NEON/VFP unit. */
 	fmxr(FPEXC, fmrx(FPEXC) & ~FPEXC_EN);
-	put_cpu();
+	local_bh_enable();
 }
 EXPORT_SYMBOL(kernel_neon_end);
 
-- 
2.35.1



* Re: [PATCH v2 0/2] ARM: allow kernel mode NEON in softirq context
  2022-12-07 10:39 [PATCH v2 0/2] ARM: allow kernel mode NEON in softirq context Ard Biesheuvel
  2022-12-07 10:39 ` [PATCH v2 1/2] ARM: vfp: Manipulate VFP state with softirqs disabled Ard Biesheuvel
  2022-12-07 10:39 ` [PATCH v2 2/2] ARM: permit non-nested kernel mode NEON in softirq context Ard Biesheuvel
@ 2022-12-12 14:37 ` Martin Willi
  2022-12-13 16:56   ` Ard Biesheuvel
  2 siblings, 1 reply; 10+ messages in thread
From: Martin Willi @ 2022-12-12 14:37 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-arm-kernel, linux
  Cc: linux-crypto, Linus Walleij, Arnd Bergmann

Hi Ard,

> Currently on ARM, we only permit kernel mode NEON in task context [...]
> For IPsec packet encryption involving highly performant crypto
> implementations, this results in a substantial performance hit [...]

Thanks for your continued work on this.

> Without these changes, an IPsec tunnel from a 32-bit VM to the 64-bit
> host can achieve a throughput of 9.5 MB/s TX and 11.9 MB/s RX.
> 
> When the crypto algorithm is permitted to execute in softirq context,
> the throughput increases to 16.5 MB/s TX and 41 MB/s RX.

In my tests on an Armada 385, I could increase IPsec throughput with
ChaCha20/Poly1305 on RX from ~230 to ~260 MBit/s when using the NEON
code path. So you may add my:

Tested-by: Martin Willi <martin@strongswan.org>

Thanks,
Martin


* Re: [PATCH v2 0/2] ARM: allow kernel mode NEON in softirq context
  2022-12-12 14:37 ` [PATCH v2 0/2] ARM: allow " Martin Willi
@ 2022-12-13 16:56   ` Ard Biesheuvel
  0 siblings, 0 replies; 10+ messages in thread
From: Ard Biesheuvel @ 2022-12-13 16:56 UTC (permalink / raw)
  To: Martin Willi
  Cc: linux-arm-kernel, linux, linux-crypto, Linus Walleij,
	Arnd Bergmann

On Mon, 12 Dec 2022 at 15:38, Martin Willi <martin@strongswan.org> wrote:
>
> Hi Ard,
>
> > Currently on ARM, we only permit kernel mode NEON in task context [...]
> > For IPsec packet encryption involving highly performant crypto
> > implementations, this results in a substantial performance hit [...]
>
> Thanks for your continued work on this.
>
> > Without these changes, an IPsec tunnel from a 32-bit VM to the 64-bit
> > host can achieve a throughput of 9.5 MB/s TX and 11.9 MB/s RX.
> >
> > When the crypto algorithm is permitted to execute in softirq context,
> > the throughput increases to 16.5 MB/s TX and 41 MB/s RX.
>
> In my tests on an Armada 385, I could increase IPsec throughput with
> ChaCha20/Poly1305 on RX from ~230 to ~260 MBit/s when using the NEON
> code path. So you may add my:
>
> Tested-by: Martin Willi <martin@strongswan.org>
>

Thanks!


* Re: [PATCH v2 1/2] ARM: vfp: Manipulate VFP state with softirqs disabled
  2022-12-07 10:39 ` [PATCH v2 1/2] ARM: vfp: Manipulate VFP state with softirqs disabled Ard Biesheuvel
@ 2022-12-15 10:22   ` Linus Walleij
  0 siblings, 0 replies; 10+ messages in thread
From: Linus Walleij @ 2022-12-15 10:22 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: linux-arm-kernel, linux, linux-crypto, Arnd Bergmann

On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:

> In a subsequent patch, we will relax the kernel mode NEON policy, and
> permit kernel mode NEON to be used not only from task context, as is
> permitted today, but also from softirq context.
>
> Given that softirqs may trigger over the back of any IRQ unless they are
> explicitly disabled, we need to address the resulting races in the VFP
> state handling, by disabling softirq processing in two distinct but
> related cases:
> - kernel mode NEON will leave the FPU disabled after it completes, so
>   any kernel code sequence that enables the FPU and subsequently accesses
>   its registers needs to disable softirqs until it completes;
> - kernel_neon_begin() will preserve the userland VFP state in memory,
>   and if it interrupts the ordinary VFP state preserve sequence, the
>   latter will resume execution with the VFP registers corrupted, and
>   happily save them to memory.
>
> Given that disabling softirqs also disables preemption, we can replace
> the existing preempt_disable/enable occurrences in the VFP state
> handling asm code with new macros that dis/enable softirqs instead.
> In the VFP state handling C code, add local_bh_disable/enable() calls
> in those places where the VFP state is preserved.
>
> One thing to keep in mind is that, once we allow NEON use in softirq
> context, the result of any such interruption is that the FPEXC_EN bit in
> the FPEXC register will be cleared, and vfp_current_hw_state[cpu] will
> be NULL. This means that any sequence that [conditionally] clears
> FPEXC_EN and/or sets vfp_current_hw_state[cpu] to NULL does not need to
> run with softirqs disabled, as the result will be the same. Furthermore,
> the handling of THREAD_NOTIFY_SWITCH is guaranteed to run with IRQs
> disabled, and so it does not need protection from softirq interruptions
> either.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

Tricky patch, I had to read it a few times and visualize the concepts,
but I am sufficiently convinced that it does the right thing.
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij


* Re: [PATCH v2 2/2] ARM: permit non-nested kernel mode NEON in softirq context
  2022-12-07 10:39 ` [PATCH v2 2/2] ARM: permit non-nested kernel mode NEON in softirq context Ard Biesheuvel
@ 2022-12-15 10:26   ` Linus Walleij
  2022-12-15 10:43     ` Ard Biesheuvel
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Walleij @ 2022-12-15 10:26 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: linux-arm-kernel, linux, linux-crypto, Arnd Bergmann

On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:

> We currently only permit kernel mode NEON in process context, to avoid
> the need to preserve/restore the NEON register file when taking an
> exception while running in the kernel.
>
> Like we did on arm64, we can relax this restriction substantially, by
> permitting kernel mode NEON from softirq context, while ensuring that
> softirq processing is disabled when the NEON is being used in task
> context. This guarantees that only NEON context belonging to user space
> needs to be preserved and restored, which is already taken care of.
>
> This is especially relevant for network encryption, where incoming
> frames are typically handled in softirq context, and deferring software
> decryption to a kernel thread or falling back to C code are both
> undesirable from a performance PoV.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

So boosting WireGuard as primary SW network encryption user?
This is really neat, BTW:
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij


* Re: [PATCH v2 2/2] ARM: permit non-nested kernel mode NEON in softirq context
  2022-12-15 10:26   ` Linus Walleij
@ 2022-12-15 10:43     ` Ard Biesheuvel
  2022-12-15 10:51       ` Russell King (Oracle)
  0 siblings, 1 reply; 10+ messages in thread
From: Ard Biesheuvel @ 2022-12-15 10:43 UTC (permalink / raw)
  To: Linus Walleij; +Cc: linux-arm-kernel, linux, linux-crypto, Arnd Bergmann

On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > We currently only permit kernel mode NEON in process context, to avoid
> > the need to preserve/restore the NEON register file when taking an
> > exception while running in the kernel.
> >
> > Like we did on arm64, we can relax this restriction substantially, by
> > permitting kernel mode NEON from softirq context, while ensuring that
> > softirq processing is disabled when the NEON is being used in task
> > context. This guarantees that only NEON context belonging to user space
> > needs to be preserved and restored, which is already taken care of.
> >
> > This is especially relevant for network encryption, where incoming
> > frames are typically handled in softirq context, and deferring software
> > decryption to a kernel thread or falling back to C code are both
> > undesirable from a performance PoV.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> So boosting WireGuard as primary SW network encryption user?

Essentially, although the use case that inspired this work is related
to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
~3x faster than WG's chacha20poly1305, which makes the performance
overhead of asynchronous completion even more significant. (Note that
GCM needs the AES and PMULL instructions which are usually only
available when running the 32-bit kernel on a 64-bit core, whereas
chacha20poly1305 uses ordinary NEON instructions.)

But Martin responded with a Tested-by regarding chacha20poly1305 on
IPsec (not WG) where there is also a noticeable speedup, so WG on
ARM32 should definitely benefit from this as well.

> This is really neat, BTW:
> Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
>

Thanks!


* Re: [PATCH v2 2/2] ARM: permit non-nested kernel mode NEON in softirq context
  2022-12-15 10:43     ` Ard Biesheuvel
@ 2022-12-15 10:51       ` Russell King (Oracle)
  2022-12-15 11:48         ` Ard Biesheuvel
  0 siblings, 1 reply; 10+ messages in thread
From: Russell King (Oracle) @ 2022-12-15 10:51 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Linus Walleij, linux-arm-kernel, linux-crypto, Arnd Bergmann

On Thu, Dec 15, 2022 at 11:43:22AM +0100, Ard Biesheuvel wrote:
> On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
> >
> > On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > > We currently only permit kernel mode NEON in process context, to avoid
> > > the need to preserve/restore the NEON register file when taking an
> > > exception while running in the kernel.
> > >
> > > Like we did on arm64, we can relax this restriction substantially, by
> > > permitting kernel mode NEON from softirq context, while ensuring that
> > > softirq processing is disabled when the NEON is being used in task
> > > context. This guarantees that only NEON context belonging to user space
> > > needs to be preserved and restored, which is already taken care of.
> > >
> > > This is especially relevant for network encryption, where incoming
> > > frames are typically handled in softirq context, and deferring software
> > > decryption to a kernel thread or falling back to C code are both
> > > undesirable from a performance PoV.
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >
> > So boosting WireGuard as primary SW network encryption user?
> 
> Essentially, although the use case that inspired this work is related
> to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
> ~3x faster than WG's chacha20poly1305, which makes the performance
> overhead of asynchronous completion even more significant. (Note that
> GCM needs the AES and PMULL instructions which are usually only
> available when running the 32-bit kernel on a 64-bit core, whereas
> chacha20poly1305 uses ordinary NEON instructions.)
> 
> But Martin responded with a Tested-by regarding chacha20poly1305 on
> IPsec (not WG) where there is also a noticeable speedup, so WG on
> ARM32 should definitely benefit from this as well.

It'll be interesting to see whether there is any noticeable difference
with my WG VPN.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


* Re: [PATCH v2 2/2] ARM: permit non-nested kernel mode NEON in softirq context
  2022-12-15 10:51       ` Russell King (Oracle)
@ 2022-12-15 11:48         ` Ard Biesheuvel
  0 siblings, 0 replies; 10+ messages in thread
From: Ard Biesheuvel @ 2022-12-15 11:48 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Linus Walleij, linux-arm-kernel, linux-crypto, Arnd Bergmann

On Thu, 15 Dec 2022 at 11:51, Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Thu, Dec 15, 2022 at 11:43:22AM +0100, Ard Biesheuvel wrote:
> > On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
> > >
> > > On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > > We currently only permit kernel mode NEON in process context, to avoid
> > > > the need to preserve/restore the NEON register file when taking an
> > > > exception while running in the kernel.
> > > >
> > > > Like we did on arm64, we can relax this restriction substantially, by
> > > > permitting kernel mode NEON from softirq context, while ensuring that
> > > > softirq processing is disabled when the NEON is being used in task
> > > > context. This guarantees that only NEON context belonging to user space
> > > > needs to be preserved and restored, which is already taken care of.
> > > >
> > > > This is especially relevant for network encryption, where incoming
> > > > frames are typically handled in softirq context, and deferring software
> > > > decryption to a kernel thread or falling back to C code are both
> > > > undesirable from a performance PoV.
> > > >
> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > So boosting WireGuard as primary SW network encryption user?
> >
> > Essentially, although the use case that inspired this work is related
> > to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
> > ~3x faster than WG's chacha20poly1305, which makes the performance
> > overhead of asynchronous completion even more significant. (Note that
> > GCM needs the AES and PMULL instructions which are usually only
> > available when running the 32-bit kernel on a 64-bit core, whereas
> > chacha20poly1305 uses ordinary NEON instructions.)
> >
> > But Martin responded with a Tested-by regarding chacha20poly1305 on
> > IPsec (not WG) where there is also a noticeable speedup, so WG on
> > ARM32 should definitely benefit from this as well.
>
> It'll be interesting to see whether there is any noticeable difference
> with my WG VPN.
>

Using WireGuard with the same 32-bit KVM guest communicating with its
64-bit host using virtio-net, I get a 44% speedup in the host->guest
direction. The other direction performs exactly the same, which is
unsurprising as it doesn't involve NEON crypto in softirq context at
all.

BEFORE
======

ardb@vm32:~$ iperf3 -c 192.168.11.2
Connecting to host 192.168.11.2, port 5201
[  5] local 192.168.11.1 port 40144 connected to 192.168.11.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  25.8 MBytes   216 Mbits/sec    0    397 KBytes
[  5]   1.00-2.00   sec  25.9 MBytes   217 Mbits/sec    0    397 KBytes
[  5]   2.00-3.00   sec  27.0 MBytes   226 Mbits/sec    0    397 KBytes
[  5]   3.00-4.00   sec  26.5 MBytes   222 Mbits/sec    0    397 KBytes
[  5]   4.00-5.00   sec  26.2 MBytes   220 Mbits/sec    0    397 KBytes
[  5]   5.00-6.00   sec  26.1 MBytes   219 Mbits/sec    0    436 KBytes
[  5]   6.00-7.00   sec  26.2 MBytes   220 Mbits/sec    0    458 KBytes
[  5]   7.00-8.00   sec  26.2 MBytes   220 Mbits/sec    0    458 KBytes
[  5]   8.00-9.00   sec  26.5 MBytes   222 Mbits/sec    0    480 KBytes
[  5]   9.00-10.00  sec  26.9 MBytes   225 Mbits/sec    0    480 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   263 MBytes   221 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   262 MBytes   220 Mbits/sec                  receiver


ardb@sudo:~$ iperf3 -c 192.168.11.1
Connecting to host 192.168.11.1, port 5201
[  5] local 192.168.11.2 port 46340 connected to 192.168.11.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  47.5 MBytes   398 Mbits/sec    0   1.75 MBytes
[  5]   1.00-2.00   sec  45.0 MBytes   377 Mbits/sec   18   1.35 MBytes
[  5]   2.00-3.00   sec  43.8 MBytes   367 Mbits/sec    0   1.47 MBytes
[  5]   3.00-4.00   sec  45.0 MBytes   377 Mbits/sec    0   1.56 MBytes
[  5]   4.00-5.00   sec  45.0 MBytes   377 Mbits/sec    0   1.63 MBytes
[  5]   5.00-6.00   sec  42.5 MBytes   357 Mbits/sec    0   1.68 MBytes
[  5]   6.00-7.00   sec  43.8 MBytes   367 Mbits/sec    0   1.71 MBytes
[  5]   7.00-8.00   sec  43.8 MBytes   367 Mbits/sec    0   1.73 MBytes
[  5]   8.00-9.00   sec  45.0 MBytes   377 Mbits/sec    0   1.74 MBytes
[  5]   9.00-10.00  sec  43.8 MBytes   367 Mbits/sec    0   1.75 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   445 MBytes   373 Mbits/sec   18             sender
[  5]   0.00-10.04  sec   444 MBytes   371 Mbits/sec                  receiver

iperf Done.


AFTER
=====

ardb@vm32:~$ iperf3 -c 192.168.11.2
Connecting to host 192.168.11.2, port 5201
[  5] local 192.168.11.1 port 44004 connected to 192.168.11.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  26.2 MBytes   220 Mbits/sec    0    399 KBytes
[  5]   1.00-2.00   sec  25.9 MBytes   217 Mbits/sec    0    399 KBytes
[  5]   2.00-3.00   sec  26.0 MBytes   218 Mbits/sec    0    444 KBytes
[  5]   3.00-4.00   sec  26.8 MBytes   225 Mbits/sec    0    485 KBytes
[  5]   4.00-5.00   sec  26.4 MBytes   222 Mbits/sec    0    542 KBytes
[  5]   5.00-6.00   sec  26.6 MBytes   223 Mbits/sec    0    568 KBytes
[  5]   6.00-7.00   sec  25.4 MBytes   213 Mbits/sec    0    568 KBytes
[  5]   7.00-8.00   sec  25.9 MBytes   217 Mbits/sec    0    568 KBytes
[  5]   8.00-9.00   sec  26.7 MBytes   224 Mbits/sec    0    568 KBytes
[  5]   9.00-10.00  sec  25.9 MBytes   217 Mbits/sec    0    568 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   262 MBytes   220 Mbits/sec    0             sender
[  5]   0.00-9.99   sec   261 MBytes   219 Mbits/sec                  receiver

iperf Done.

ardb@sudo:~$ iperf3 -c 192.168.11.1
Connecting to host 192.168.11.1, port 5201
[  5] local 192.168.11.2 port 49838 connected to 192.168.11.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  61.2 MBytes   514 Mbits/sec    0   1.59 MBytes
[  5]   1.00-2.00   sec  66.2 MBytes   555 Mbits/sec    0   1.67 MBytes
[  5]   2.00-3.00   sec  65.0 MBytes   545 Mbits/sec   79   1.24 MBytes
[  5]   3.00-4.00   sec  63.8 MBytes   535 Mbits/sec    0   1.36 MBytes
[  5]   4.00-5.00   sec  63.8 MBytes   535 Mbits/sec    0   1.46 MBytes
[  5]   5.00-6.00   sec  63.8 MBytes   535 Mbits/sec    0   1.53 MBytes
[  5]   6.00-7.00   sec  62.5 MBytes   524 Mbits/sec    0   1.59 MBytes
[  5]   7.00-8.00   sec  65.0 MBytes   545 Mbits/sec   99   1.18 MBytes
[  5]   8.00-9.00   sec  65.0 MBytes   545 Mbits/sec    0   1.25 MBytes
[  5]   9.00-10.00  sec  65.0 MBytes   545 Mbits/sec    0   1.30 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   641 MBytes   538 Mbits/sec  178             sender
[  5]   0.00-10.02  sec   638 MBytes   535 Mbits/sec                  receiver

iperf Done.
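
For anyone who wants to check the percentage, a small userspace sketch that
derives it from the receiver-side totals above (the helper names are made up
for illustration):

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from 'before' to 'after', in percent. */
double speedup_percent(double before, double after)
{
	assert(before > 0.0);
	return (after / before - 1.0) * 100.0;
}

/* Receiver-side bitrates in Mbits/sec, taken from the iperf3 summaries above. */
void print_speedups(void)
{
	printf("host->guest: %.1f%%\n", speedup_percent(371.0, 535.0));
	printf("guest->host: %.1f%%\n", speedup_percent(220.0, 219.0));
}
```

The host->guest figure comes out at about 44%, and the guest->host figure is
within a percent of zero, i.e. within the noise of the per-second samples.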

