* [PATCH 1/5] ARM: add support for kernel mode NEON
2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 2/5] ARM: move VFP init to an earlier boot stage Ard Biesheuvel
` (4 subsequent siblings)
5 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
To: linux-arm-kernel
In order to safely support the use of NEON instructions in
kernel mode, some precautions need to be taken:
- the userland context that may be present in the registers (even
if the NEON/VFP is currently disabled) must be stored under the
correct task (which may not be 'current' in the UP case),
- to avoid having to keep track of additional vfpstates for the
kernel side, disallow the use of NEON in interrupt context
and run with preemption disabled,
- after use, re-enable preemption and re-enable the lazy restore
machinery by disabling the NEON/VFP unit.
This patch adds the functions kernel_neon_begin() and
kernel_neon_end() which take care of the above. It also adds
the Kconfig symbol KERNEL_MODE_NEON to enable it.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/Kconfig | 7 +++++++
arch/arm/include/asm/neon.h | 36 ++++++++++++++++++++++++++++++++++++
arch/arm/vfp/vfpmodule.c | 43 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 86 insertions(+)
create mode 100644 arch/arm/include/asm/neon.h
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ef30276..9e61402 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -2355,6 +2355,13 @@ config NEON
Say Y to include support code for NEON, the ARMv7 Advanced SIMD
Extension.
+config KERNEL_MODE_NEON
+ bool "Support for NEON in kernel mode"
+ default n
+ depends on NEON
+ help
+ Say Y to include support for NEON in kernel mode.
+
endmenu
menu "Userspace binary formats"
diff --git a/arch/arm/include/asm/neon.h b/arch/arm/include/asm/neon.h
new file mode 100644
index 0000000..8f730fe
--- /dev/null
+++ b/arch/arm/include/asm/neon.h
@@ -0,0 +1,36 @@
+/*
+ * linux/arch/arm/include/asm/neon.h
+ *
+ * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <asm/hwcap.h>
+
+#define cpu_has_neon() (!!(elf_hwcap & HWCAP_NEON))
+
+#ifdef __ARM_NEON__
+
+/*
+ * If you are affected by the BUILD_BUG below, it probably means that you are
+ * using NEON code /and/ calling the kernel_neon_begin() function from the same
+ * compilation unit. To prevent issues that may arise from GCC reordering or
+ * generating(1) NEON instructions outside of these begin/end functions, the
+ * only supported way of using NEON code in the kernel is by isolating it in a
+ * separate compilation unit, and calling it from another unit from inside a
+ * kernel_neon_begin/kernel_neon_end pair.
+ *
+ * (1) Current GCC (4.7) might generate NEON instructions at O3 level if
+ * -mpfu=neon is set.
+ */
+
+#define kernel_neon_begin() \
+ BUILD_BUG_ON_MSG(1, "kernel_neon_begin() called from NEON code")
+
+#else
+void kernel_neon_begin(void);
+#endif
+void kernel_neon_end(void);
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 5dfbb0b..1cdc13b 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -20,6 +20,7 @@
#include <linux/init.h>
#include <linux/uaccess.h>
#include <linux/user.h>
+#include <linux/export.h>
#include <asm/cp15.h>
#include <asm/cputype.h>
@@ -648,6 +649,48 @@ static int vfp_hotplug(struct notifier_block *b, unsigned long action,
return NOTIFY_OK;
}
+#ifdef CONFIG_KERNEL_MODE_NEON
+
+/*
+ * Kernel-side NEON support functions
+ */
+void kernel_neon_begin(void)
+{
+ struct thread_info *thread = current_thread_info();
+ unsigned int cpu;
+ u32 fpexc;
+
+ /* Avoid using the NEON in interrupt context */
+ might_sleep();
+ cpu = get_cpu();
+
+ fpexc = fmrx(FPEXC) | FPEXC_EN;
+ fmxr(FPEXC, fpexc);
+
+ /*
+ * Save the userland NEON/VFP state. Under UP,
+ * the owner could be a task other than 'current'
+ */
+ if (vfp_state_in_hw(cpu, thread))
+ vfp_save_state(&thread->vfpstate, fpexc);
+#ifndef CONFIG_SMP
+ else if (vfp_current_hw_state[cpu] != NULL)
+ vfp_save_state(vfp_current_hw_state[cpu], fpexc);
+#endif
+ vfp_current_hw_state[cpu] = NULL;
+}
+EXPORT_SYMBOL(kernel_neon_begin);
+
+void kernel_neon_end(void)
+{
+ /* Disable the NEON/VFP unit. */
+ fmxr(FPEXC, fmrx(FPEXC) & ~FPEXC_EN);
+ put_cpu();
+}
+EXPORT_SYMBOL(kernel_neon_end);
+
+#endif /* CONFIG_KERNEL_MODE_NEON */
+
/*
* VFP support code initialisation.
*/
--
1.8.1.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 2/5] ARM: move VFP init to an earlier boot stage
2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 1/5] ARM: add support for kernel mode NEON Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 3/5] ARM: be strict about FP exceptions in kernel mode Ard Biesheuvel
` (3 subsequent siblings)
5 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
To: linux-arm-kernel
In order to use the NEON unit in the kernel, we should
initialize it a bit earlier in the boot process so NEON users
that like to do a quick benchmark at load time (like the
xor_blocks or RAID-6 code) find the NEON/VFP unit already
enabled.
Replaced late_initcall() with core_initcall().
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/vfp/vfpmodule.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 1cdc13b..4c39f91 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -774,4 +774,4 @@ static int __init vfp_init(void)
return 0;
}
-late_initcall(vfp_init);
+core_initcall(vfp_init);
--
1.8.1.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 3/5] ARM: be strict about FP exceptions in kernel mode
2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 1/5] ARM: add support for kernel mode NEON Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 2/5] ARM: move VFP init to an earlier boot stage Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation Ard Biesheuvel
` (2 subsequent siblings)
5 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
To: linux-arm-kernel
The support code in vfp_support_entry does not care whether the
exception that caused it to be invoked occurred in kernel mode or
in user mode. However, neither condition that could trigger this
exception (lazy restore and VFP bounce to support code) is
currently allowable in kernel mode.
In the former case, we can just handle it as an undefined instruction.
In the latter case, we should flag it as a bug, as it implies that
the FP unit has been enabled and an attempt has been made to
execute FP instructions that are dependent on the support code, and
this is not supported in kernel mode.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/vfp/vfphw.S | 5 +++++
arch/arm/vfp/vfpmodule.c | 11 +++++++++++
2 files changed, 16 insertions(+)
diff --git a/arch/arm/vfp/vfphw.S b/arch/arm/vfp/vfphw.S
index 8d10dc8..3e5d311 100644
--- a/arch/arm/vfp/vfphw.S
+++ b/arch/arm/vfp/vfphw.S
@@ -78,6 +78,11 @@
ENTRY(vfp_support_entry)
DBGSTR3 "instr %08x pc %08x state %p", r0, r2, r10
+ ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
+ and r3, r3, #MODE_MASK @ are supported in kernel mode
+ teq r3, #USR_MODE
+ bne vfp_kmode_exception @ Returns through lr
+
VFPFMRX r1, FPEXC @ Is the VFP enabled?
DBGSTR1 "fpexc %08x", r1
tst r1, #FPEXC_EN
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 4c39f91..bd2f7a2 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -27,6 +27,7 @@
#include <asm/system_info.h>
#include <asm/thread_notify.h>
#include <asm/vfp.h>
+#include <asm/bug.h>
#include "vfpinstr.h"
#include "vfp.h"
@@ -691,6 +692,16 @@ EXPORT_SYMBOL(kernel_neon_end);
#endif /* CONFIG_KERNEL_MODE_NEON */
+void vfp_kmode_exception(void)
+{
+ /*
+ * Taking an FP exception in kernel mode is always a bug, because
+ * none of the FP instructions currently supported in kernel mode
+ * (i.e., NEON) should ever be bounced back to the support code.
+ */
+ BUG_ON(fmrx(FPEXC) & FPEXC_EN);
+}
+
/*
* VFP support code initialisation.
*/
--
1.8.1.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation
2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
` (2 preceding siblings ...)
2013-06-06 15:03 ` [PATCH 3/5] ARM: be strict about FP exceptions in kernel mode Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
2013-06-06 15:45 ` Nicolas Pitre
2013-06-06 15:03 ` [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation Ard Biesheuvel
2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
5 siblings, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
To: linux-arm-kernel
Add a source file xor-neon.c (which is really just the reference
C implementation passed through the GCC vectorizer) and hook it
up to the XOR framework.
Output captured from a Cortex-A15 @ 1.7 GHz:
xor: measuring software checksum speed
arm4regs : 2261.600 MB/sec
8regs : 1771.600 MB/sec
32regs : 1441.600 MB/sec
neon : 3619.600 MB/sec
xor: using function: neon (3619.600 MB/sec)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
arch/arm/include/asm/xor.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++
arch/arm/lib/Makefile | 6 ++++
arch/arm/lib/xor-neon.c | 42 ++++++++++++++++++++++++++
3 files changed, 121 insertions(+)
create mode 100644 arch/arm/lib/xor-neon.c
diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
index 7604673..4ffb26d 100644
--- a/arch/arm/include/asm/xor.h
+++ b/arch/arm/include/asm/xor.h
@@ -7,7 +7,10 @@
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*/
+#include <linux/hardirq.h>
#include <asm-generic/xor.h>
+#include <asm/hwcap.h>
+#include <asm/neon.h>
#define __XOR(a1, a2) a1 ^= a2
@@ -138,4 +141,74 @@ static struct xor_block_template xor_block_arm4regs = {
xor_speed(&xor_block_arm4regs); \
xor_speed(&xor_block_8regs); \
xor_speed(&xor_block_32regs); \
+ NEON_TEMPLATES; \
} while (0)
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+
+extern struct xor_block_template const xor_block_neon_inner;
+
+static void
+xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
+{
+ if (in_interrupt()) {
+ xor_arm4regs_2(bytes, p1, p2);
+ } else {
+ kernel_neon_begin();
+ xor_block_neon_inner.do_2(bytes, p1, p2);
+ kernel_neon_end();
+ }
+}
+
+static void
+xor_neon_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+ unsigned long *p3)
+{
+ if (in_interrupt()) {
+ xor_arm4regs_3(bytes, p1, p2, p3);
+ } else {
+ kernel_neon_begin();
+ xor_block_neon_inner.do_3(bytes, p1, p2, p3);
+ kernel_neon_end();
+ }
+}
+
+static void
+xor_neon_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+ unsigned long *p3, unsigned long *p4)
+{
+ if (in_interrupt()) {
+ xor_arm4regs_4(bytes, p1, p2, p3, p4);
+ } else {
+ kernel_neon_begin();
+ xor_block_neon_inner.do_4(bytes, p1, p2, p3, p4);
+ kernel_neon_end();
+ }
+}
+
+static void
+xor_neon_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+ unsigned long *p3, unsigned long *p4, unsigned long *p5)
+{
+ if (in_interrupt()) {
+ xor_arm4regs_5(bytes, p1, p2, p3, p4, p5);
+ } else {
+ kernel_neon_begin();
+ xor_block_neon_inner.do_5(bytes, p1, p2, p3, p4, p5);
+ kernel_neon_end();
+ }
+}
+
+static struct xor_block_template xor_block_neon = {
+ .name = "neon",
+ .do_2 = xor_neon_2,
+ .do_3 = xor_neon_3,
+ .do_4 = xor_neon_4,
+ .do_5 = xor_neon_5
+};
+
+#define NEON_TEMPLATES \
+ do { if (cpu_has_neon()) xor_speed(&xor_block_neon); } while (0)
+#else
+#define NEON_TEMPLATES
+#endif
diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
index af72969..aaf3a87 100644
--- a/arch/arm/lib/Makefile
+++ b/arch/arm/lib/Makefile
@@ -45,3 +45,9 @@ lib-$(CONFIG_ARCH_SHARK) += io-shark.o
$(obj)/csumpartialcopy.o: $(obj)/csumpartialcopygeneric.S
$(obj)/csumpartialcopyuser.o: $(obj)/csumpartialcopygeneric.S
+
+ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
+ NEON_FLAGS := -mfloat-abi=softfp -mfpu=neon
+ CFLAGS_xor-neon.o += $(NEON_FLAGS)
+ lib-$(CONFIG_XOR_BLOCKS) += xor-neon.o
+endif
diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
new file mode 100644
index 0000000..f485e5a
--- /dev/null
+++ b/arch/arm/lib/xor-neon.c
@@ -0,0 +1,42 @@
+/*
+ * linux/arch/arm/lib/xor-neon.c
+ *
+ * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/raid/xor.h>
+
+#ifndef __ARM_NEON__
+#error You should compile this file with '-mfloat-abi=softfp -mfpu=neon'
+#endif
+
+/*
+ * Pull in the reference implementations while instructing GCC (through
+ * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
+ * NEON instructions.
+ */
+#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
+#pragma GCC optimize "tree-vectorize"
+#else
+/*
+ * While older versions of GCC do not generate incorrect code, they fail to
+ * recognize the parallel nature of these functions, and emit plain ARM code,
+ * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
+ */
+#warning This code requires at least version 4.6 of GCC
+#endif
+
+#pragma GCC diagnostic ignored "-Wunused-variable"
+#include <asm-generic/xor.h>
+
+struct xor_block_template const xor_block_neon_inner = {
+ .name = "__inner_neon__",
+ .do_2 = xor_8regs_2,
+ .do_3 = xor_8regs_3,
+ .do_4 = xor_8regs_4,
+ .do_5 = xor_8regs_5,
+};
--
1.8.1.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation
2013-06-06 15:03 ` [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation Ard Biesheuvel
@ 2013-06-06 15:45 ` Nicolas Pitre
0 siblings, 0 replies; 24+ messages in thread
From: Nicolas Pitre @ 2013-06-06 15:45 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 6 Jun 2013, Ard Biesheuvel wrote:
> Add a source file xor-neon.c (which is really just the reference
> C implementation passed through the GCC vectorizer) and hook it
> up to the XOR framework.
>
> Output captured from a Cortex-A15 @ 1.7 GHz:
>
> xor: measuring software checksum speed
> arm4regs : 2261.600 MB/sec
> 8regs : 1771.600 MB/sec
> 32regs : 1441.600 MB/sec
> neon : 3619.600 MB/sec
> xor: using function: neon (3619.600 MB/sec)
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
> ---
> arch/arm/include/asm/xor.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++
> arch/arm/lib/Makefile | 6 ++++
> arch/arm/lib/xor-neon.c | 42 ++++++++++++++++++++++++++
> 3 files changed, 121 insertions(+)
> create mode 100644 arch/arm/lib/xor-neon.c
>
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index 7604673..4ffb26d 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -7,7 +7,10 @@
> * it under the terms of the GNU General Public License version 2 as
> * published by the Free Software Foundation.
> */
> +#include <linux/hardirq.h>
> #include <asm-generic/xor.h>
> +#include <asm/hwcap.h>
> +#include <asm/neon.h>
>
> #define __XOR(a1, a2) a1 ^= a2
>
> @@ -138,4 +141,74 @@ static struct xor_block_template xor_block_arm4regs = {
> xor_speed(&xor_block_arm4regs); \
> xor_speed(&xor_block_8regs); \
> xor_speed(&xor_block_32regs); \
> + NEON_TEMPLATES; \
> } while (0)
> +
> +#ifdef CONFIG_KERNEL_MODE_NEON
> +
> +extern struct xor_block_template const xor_block_neon_inner;
> +
> +static void
> +xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
> +{
> + if (in_interrupt()) {
> + xor_arm4regs_2(bytes, p1, p2);
> + } else {
> + kernel_neon_begin();
> + xor_block_neon_inner.do_2(bytes, p1, p2);
> + kernel_neon_end();
> + }
> +}
> +
> +static void
> +xor_neon_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
> + unsigned long *p3)
> +{
> + if (in_interrupt()) {
> + xor_arm4regs_3(bytes, p1, p2, p3);
> + } else {
> + kernel_neon_begin();
> + xor_block_neon_inner.do_3(bytes, p1, p2, p3);
> + kernel_neon_end();
> + }
> +}
> +
> +static void
> +xor_neon_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
> + unsigned long *p3, unsigned long *p4)
> +{
> + if (in_interrupt()) {
> + xor_arm4regs_4(bytes, p1, p2, p3, p4);
> + } else {
> + kernel_neon_begin();
> + xor_block_neon_inner.do_4(bytes, p1, p2, p3, p4);
> + kernel_neon_end();
> + }
> +}
> +
> +static void
> +xor_neon_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
> + unsigned long *p3, unsigned long *p4, unsigned long *p5)
> +{
> + if (in_interrupt()) {
> + xor_arm4regs_5(bytes, p1, p2, p3, p4, p5);
> + } else {
> + kernel_neon_begin();
> + xor_block_neon_inner.do_5(bytes, p1, p2, p3, p4, p5);
> + kernel_neon_end();
> + }
> +}
> +
> +static struct xor_block_template xor_block_neon = {
> + .name = "neon",
> + .do_2 = xor_neon_2,
> + .do_3 = xor_neon_3,
> + .do_4 = xor_neon_4,
> + .do_5 = xor_neon_5
> +};
> +
> +#define NEON_TEMPLATES \
> + do { if (cpu_has_neon()) xor_speed(&xor_block_neon); } while (0)
> +#else
> +#define NEON_TEMPLATES
> +#endif
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index af72969..aaf3a87 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -45,3 +45,9 @@ lib-$(CONFIG_ARCH_SHARK) += io-shark.o
>
> $(obj)/csumpartialcopy.o: $(obj)/csumpartialcopygeneric.S
> $(obj)/csumpartialcopyuser.o: $(obj)/csumpartialcopygeneric.S
> +
> +ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> + NEON_FLAGS := -mfloat-abi=softfp -mfpu=neon
> + CFLAGS_xor-neon.o += $(NEON_FLAGS)
> + lib-$(CONFIG_XOR_BLOCKS) += xor-neon.o
> +endif
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> new file mode 100644
> index 0000000..f485e5a
> --- /dev/null
> +++ b/arch/arm/lib/xor-neon.c
> @@ -0,0 +1,42 @@
> +/*
> + * linux/arch/arm/lib/xor-neon.c
> + *
> + * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/raid/xor.h>
> +
> +#ifndef __ARM_NEON__
> +#error You should compile this file with '-mfloat-abi=softfp -mfpu=neon'
> +#endif
> +
> +/*
> + * Pull in the reference implementations while instructing GCC (through
> + * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> + * NEON instructions.
> + */
> +#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> +#pragma GCC optimize "tree-vectorize"
> +#else
> +/*
> + * While older versions of GCC do not generate incorrect code, they fail to
> + * recognize the parallel nature of these functions, and emit plain ARM code,
> + * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> + */
> +#warning This code requires at least version 4.6 of GCC
> +#endif
> +
> +#pragma GCC diagnostic ignored "-Wunused-variable"
> +#include <asm-generic/xor.h>
> +
> +struct xor_block_template const xor_block_neon_inner = {
> + .name = "__inner_neon__",
> + .do_2 = xor_8regs_2,
> + .do_3 = xor_8regs_3,
> + .do_4 = xor_8regs_4,
> + .do_5 = xor_8regs_5,
> +};
> --
> 1.8.1.2
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation
2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
` (3 preceding siblings ...)
2013-06-06 15:03 ` [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
2013-06-06 15:55 ` Nicolas Pitre
2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
5 siblings, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
To: linux-arm-kernel
Rebased/reworked a patch contributed by Rob Herring that uses
NEON intrinsics to perform the RAID-6 syndrome calculations.
It uses the existing unroll.awk code to generate several
unrolled versions of which the best performing one is selected
at boot time.
Output captured from an ARM Cortex-A15 @ 1.7 GHz:
raid6: int32x1 200 MB/s
raid6: int32x2 304 MB/s
raid6: int32x4 388 MB/s
raid6: int32x8 423 MB/s
raid6: neonx1 799 MB/s
raid6: neonx2 1364 MB/s
raid6: neonx4 1731 MB/s
raid6: neonx8 1676 MB/s
raid6: using algorithm neonx4 (1731 MB/s)
raid6: using intx1 recovery algorithm
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
include/linux/raid/pq.h | 5 ++++
lib/raid6/.gitignore | 1 +
lib/raid6/Makefile | 31 +++++++++++++++++++
lib/raid6/algos.c | 6 ++++
lib/raid6/neon.c | 58 +++++++++++++++++++++++++++++++++++
lib/raid6/neon.uc | 80 +++++++++++++++++++++++++++++++++++++++++++++++++
lib/raid6/test/Makefile | 19 +++++++++++-
7 files changed, 199 insertions(+), 1 deletion(-)
create mode 100644 lib/raid6/neon.c
create mode 100644 lib/raid6/neon.uc
diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 8dfaa2c..0f42469 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -114,6 +114,11 @@ extern const struct raid6_recov_calls raid6_recov_intx1;
extern const struct raid6_recov_calls raid6_recov_ssse3;
extern const struct raid6_recov_calls raid6_recov_avx2;
+extern const struct raid6_calls raid6_neonx1;
+extern const struct raid6_calls raid6_neonx2;
+extern const struct raid6_calls raid6_neonx4;
+extern const struct raid6_calls raid6_neonx8;
+
/* Algorithm list */
extern const struct raid6_calls * const raid6_algos[];
extern const struct raid6_recov_calls *const raid6_recov_algos[];
diff --git a/lib/raid6/.gitignore b/lib/raid6/.gitignore
index 162beca..0a7e494 100644
--- a/lib/raid6/.gitignore
+++ b/lib/raid6/.gitignore
@@ -2,3 +2,4 @@ mktables
altivec*.c
int*.c
tables.c
+neon?.c
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 9f7c184..6a51f1a 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -5,6 +5,7 @@ raid6_pq-y += algos.o recov.o tables.o int1.o int2.o int4.o \
raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
+raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
hostprogs-y += mktables
@@ -16,6 +17,12 @@ ifeq ($(CONFIG_ALTIVEC),y)
altivec_flags := -maltivec -mabi=altivec
endif
+# The GCC option -ffreestanding is required in order to compile code containing
+# ARM/NEON intrinsics in a non C99-compliant environment (such as the kernel)
+ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
+NEON_FLAGS := -mfloat-abi=softfp -mfpu=neon -ffreestanding
+endif
+
targets += int1.c
$(obj)/int1.c: UNROLL := 1
$(obj)/int1.c: $(src)/int.uc $(src)/unroll.awk FORCE
@@ -70,6 +77,30 @@ $(obj)/altivec8.c: UNROLL := 8
$(obj)/altivec8.c: $(src)/altivec.uc $(src)/unroll.awk FORCE
$(call if_changed,unroll)
+CFLAGS_neon1.o += $(NEON_FLAGS)
+targets += neon1.c
+$(obj)/neon1.c: UNROLL := 1
+$(obj)/neon1.c: $(src)/neon.uc $(src)/unroll.awk FORCE
+ $(call if_changed,unroll)
+
+CFLAGS_neon2.o += $(NEON_FLAGS)
+targets += neon2.c
+$(obj)/neon2.c: UNROLL := 2
+$(obj)/neon2.c: $(src)/neon.uc $(src)/unroll.awk FORCE
+ $(call if_changed,unroll)
+
+CFLAGS_neon4.o += $(NEON_FLAGS)
+targets += neon4.c
+$(obj)/neon4.c: UNROLL := 4
+$(obj)/neon4.c: $(src)/neon.uc $(src)/unroll.awk FORCE
+ $(call if_changed,unroll)
+
+CFLAGS_neon8.o += $(NEON_FLAGS)
+targets += neon8.c
+$(obj)/neon8.c: UNROLL := 8
+$(obj)/neon8.c: $(src)/neon.uc $(src)/unroll.awk FORCE
+ $(call if_changed,unroll)
+
quiet_cmd_mktable = TABLE $@
cmd_mktable = $(obj)/mktables > $@ || ( rm -f $@ && exit 1 )
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 6d7316f..74e6f56 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -70,6 +70,12 @@ const struct raid6_calls * const raid6_algos[] = {
&raid6_intx2,
&raid6_intx4,
&raid6_intx8,
+#ifdef CONFIG_KERNEL_MODE_NEON
+ &raid6_neonx1,
+ &raid6_neonx2,
+ &raid6_neonx4,
+ &raid6_neonx8,
+#endif
NULL
};
diff --git a/lib/raid6/neon.c b/lib/raid6/neon.c
new file mode 100644
index 0000000..dad7102
--- /dev/null
+++ b/lib/raid6/neon.c
@@ -0,0 +1,58 @@
+/*
+ * linux/lib/raid6/neon.c - RAID6 syndrome calculation using ARM NEON intrinsics
+ *
+ * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/raid/pq.h>
+
+#ifdef __KERNEL__
+#include <asm/neon.h>
+#else
+#define kernel_neon_begin()
+#define kernel_neon_end()
+#define cpu_has_neon() (1)
+#endif
+
+/*
+ * There are 2 reasons these wrappers are kept in a separate compilation unit
+ * from the actual implementations in neonN.c (generated from neon.uc by
+ * unroll.awk):
+ * - the actual implementations use NEON intrinsics, and the GCC support header
+ * (arm_neon.h) is not fully compatible (type wise) with the kernel;
+ * - the neonN.c files are compiled with -mfpu=neon and optimization enabled,
+ * and we have to make sure that we never use *any* NEON/VFP instructions
+ * outside a kernel_neon_begin()/kernel_neon_end() pair.
+ */
+
+#define RAID6_NEON_WRAPPER(_n) \
+ static void raid6_neon ## _n ## _gen_syndrome(int disks, \
+ size_t bytes, void **ptrs) \
+ { \
+ void raid6_neon ## _n ## _gen_syndrome_real(int, \
+ unsigned int, void**); \
+ kernel_neon_begin(); \
+ raid6_neon ## _n ## _gen_syndrome_real(disks, \
+ (unsigned int)bytes, ptrs); \
+ kernel_neon_end(); \
+ } \
+ struct raid6_calls const raid6_neonx ## _n = { \
+ raid6_neon ## _n ## _gen_syndrome, \
+ raid6_have_neon, \
+ "neonx" #_n, \
+ 0 \
+ };
+
+static int raid6_have_neon(void)
+{
+ return cpu_has_neon();
+}
+
+RAID6_NEON_WRAPPER(1)
+RAID6_NEON_WRAPPER(2)
+RAID6_NEON_WRAPPER(4)
+RAID6_NEON_WRAPPER(8)
diff --git a/lib/raid6/neon.uc b/lib/raid6/neon.uc
new file mode 100644
index 0000000..f2d7ec0
--- /dev/null
+++ b/lib/raid6/neon.uc
@@ -0,0 +1,80 @@
+/* -----------------------------------------------------------------------
+ *
+ * neon.uc - RAID-6 syndrome calculation using ARM NEON instructions
+ *
+ * Copyright (C) 2012 Rob Herring
+ *
+ * Based on altivec.uc:
+ * Copyright 2002-2004 H. Peter Anvin - All Rights Reserved
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, Inc., 53 Temple Place Ste 330,
+ * Boston MA 02111-1307, USA; either version 2 of the License, or
+ * (at your option) any later version; incorporated herein by reference.
+ *
+ * ----------------------------------------------------------------------- */
+
+/*
+ * neon$#.c
+ *
+ * $#-way unrolled NEON intrinsics math RAID-6 instruction set
+ *
+ * This file is postprocessed using unroll.awk
+ */
+
+#include <arm_neon.h>
+
+typedef uint8x16_t unative_t;
+
+#define NBYTES(x) ((unative_t){x,x,x,x, x,x,x,x, x,x,x,x, x,x,x,x})
+#define NSIZE sizeof(unative_t)
+
+/*
+ * The SHLBYTE() operation shifts each byte left by 1, *not*
+ * rolling over into the next byte
+ */
+static inline unative_t SHLBYTE(unative_t v)
+{
+ return vshlq_n_u8(v, 1);
+}
+
+/*
+ * The MASK() operation returns 0xFF in any byte for which the high
+ * bit is 1, 0x00 for any byte for which the high bit is 0.
+ */
+static inline unative_t MASK(unative_t v)
+{
+ const uint8x16_t temp = NBYTES(0);
+ return (unative_t)vcltq_s8((int8x16_t)v, (int8x16_t)temp);
+}
+
+void raid6_neon$#_gen_syndrome_real(int disks, unsigned int bytes, void **ptrs)
+{
+ uint8_t **dptr = (uint8_t **)ptrs;
+ uint8_t *p, *q;
+ int d, z, z0;
+
+ register unative_t wd$$, wq$$, wp$$, w1$$, w2$$;
+ const unative_t x1d = NBYTES(0x1d);
+
+ z0 = disks - 3; /* Highest data disk */
+ p = dptr[z0+1]; /* XOR parity */
+ q = dptr[z0+2]; /* RS syndrome */
+
+ for ( d = 0 ; d < bytes ; d += NSIZE*$# ) {
+ wq$$ = wp$$ = vld1q_u8(&dptr[z0][d+$$*NSIZE]);
+ for ( z = z0-1 ; z >= 0 ; z-- ) {
+ wd$$ = vld1q_u8(&dptr[z][d+$$*NSIZE]);
+ wp$$ = veorq_u8(wp$$, wd$$);
+ w2$$ = MASK(wq$$);
+ w1$$ = SHLBYTE(wq$$);
+
+ w2$$ = vandq_u8(w2$$, x1d);
+ w1$$ = veorq_u8(w1$$, w2$$);
+ wq$$ = veorq_u8(w1$$, wd$$);
+ }
+ vst1q_u8(&p[d+NSIZE*$$], wp$$);
+ vst1q_u8(&q[d+NSIZE*$$], wq$$);
+ }
+}
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 087332d..bbe6450 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -34,6 +34,11 @@ else
ifeq ($(HAS_ALTIVEC),yes)
OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
endif
+ ifeq ($(ARCH),arm)
+ CFLAGS += -I../../../arch/arm/include -mfpu=neon \
+ -DCONFIG_KERNEL_MODE_NEON=1
+ OBJS += neon.o neon1.o neon2.o neon4.o neon8.o
+ endif
endif
.c.o:
@@ -55,6 +60,18 @@ raid6.a: $(OBJS)
raid6test: test.c raid6.a
$(CC) $(CFLAGS) -o raid6test $^
+neon1.c: neon.uc ../unroll.awk
+ $(AWK) ../unroll.awk -vN=1 < neon.uc > $@
+
+neon2.c: neon.uc ../unroll.awk
+ $(AWK) ../unroll.awk -vN=2 < neon.uc > $@
+
+neon4.c: neon.uc ../unroll.awk
+ $(AWK) ../unroll.awk -vN=4 < neon.uc > $@
+
+neon8.c: neon.uc ../unroll.awk
+ $(AWK) ../unroll.awk -vN=8 < neon.uc > $@
+
altivec1.c: altivec.uc ../unroll.awk
$(AWK) ../unroll.awk -vN=1 < altivec.uc > $@
@@ -89,7 +106,7 @@ tables.c: mktables
./mktables > tables.c
clean:
- rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c tables.c raid6test
+ rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c neon*.c tables.c raid6test
spotless: clean
rm -f *~
--
1.8.1.2
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation
2013-06-06 15:03 ` [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation Ard Biesheuvel
@ 2013-06-06 15:55 ` Nicolas Pitre
0 siblings, 0 replies; 24+ messages in thread
From: Nicolas Pitre @ 2013-06-06 15:55 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 6 Jun 2013, Ard Biesheuvel wrote:
> Rebased/reworked a patch contributed by Rob Herring that uses
> NEON intrinsics to perform the RAID-6 syndrome calculations.
> It uses the existing unroll.awk code to generate several
> unrolled versions of which the best performing one is selected
> at boot time.
>
> Output captured from an ARM Cortex-A15 @ 1.7 GHz:
>
> raid6: int32x1 200 MB/s
> raid6: int32x2 304 MB/s
> raid6: int32x4 388 MB/s
> raid6: int32x8 423 MB/s
> raid6: neonx1 799 MB/s
> raid6: neonx2 1364 MB/s
> raid6: neonx4 1731 MB/s
> raid6: neonx8 1676 MB/s
> raid6: using algorithm neonx4 (1731 MB/s)
> raid6: using intx1 recovery algorithm
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
> ---
> include/linux/raid/pq.h | 5 ++++
> lib/raid6/.gitignore | 1 +
> lib/raid6/Makefile | 31 +++++++++++++++++++
> lib/raid6/algos.c | 6 ++++
> lib/raid6/neon.c | 58 +++++++++++++++++++++++++++++++++++
> lib/raid6/neon.uc | 80 +++++++++++++++++++++++++++++++++++++++++++++++++
> lib/raid6/test/Makefile | 19 +++++++++++-
> 7 files changed, 199 insertions(+), 1 deletion(-)
> create mode 100644 lib/raid6/neon.c
> create mode 100644 lib/raid6/neon.uc
>
> diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> index 8dfaa2c..0f42469 100644
> --- a/include/linux/raid/pq.h
> +++ b/include/linux/raid/pq.h
> @@ -114,6 +114,11 @@ extern const struct raid6_recov_calls raid6_recov_intx1;
> extern const struct raid6_recov_calls raid6_recov_ssse3;
> extern const struct raid6_recov_calls raid6_recov_avx2;
>
> +extern const struct raid6_calls raid6_neonx1;
> +extern const struct raid6_calls raid6_neonx2;
> +extern const struct raid6_calls raid6_neonx4;
> +extern const struct raid6_calls raid6_neonx8;
> +
> /* Algorithm list */
> extern const struct raid6_calls * const raid6_algos[];
> extern const struct raid6_recov_calls *const raid6_recov_algos[];
> diff --git a/lib/raid6/.gitignore b/lib/raid6/.gitignore
> index 162beca..0a7e494 100644
> --- a/lib/raid6/.gitignore
> +++ b/lib/raid6/.gitignore
> @@ -2,3 +2,4 @@ mktables
> altivec*.c
> int*.c
> tables.c
> +neon?.c
> diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> index 9f7c184..6a51f1a 100644
> --- a/lib/raid6/Makefile
> +++ b/lib/raid6/Makefile
> @@ -5,6 +5,7 @@ raid6_pq-y += algos.o recov.o tables.o int1.o int2.o int4.o \
>
> raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
> raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
> +raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
>
> hostprogs-y += mktables
>
> @@ -16,6 +17,12 @@ ifeq ($(CONFIG_ALTIVEC),y)
> altivec_flags := -maltivec -mabi=altivec
> endif
>
> +# The GCC option -ffreestanding is required in order to compile code containing
> +# ARM/NEON intrinsics in a non C99-compliant environment (such as the kernel)
> +ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> +NEON_FLAGS := -mfloat-abi=softfp -mfpu=neon -ffreestanding
> +endif
> +
> targets += int1.c
> $(obj)/int1.c: UNROLL := 1
> $(obj)/int1.c: $(src)/int.uc $(src)/unroll.awk FORCE
> @@ -70,6 +77,30 @@ $(obj)/altivec8.c: UNROLL := 8
> $(obj)/altivec8.c: $(src)/altivec.uc $(src)/unroll.awk FORCE
> $(call if_changed,unroll)
>
> +CFLAGS_neon1.o += $(NEON_FLAGS)
> +targets += neon1.c
> +$(obj)/neon1.c: UNROLL := 1
> +$(obj)/neon1.c: $(src)/neon.uc $(src)/unroll.awk FORCE
> + $(call if_changed,unroll)
> +
> +CFLAGS_neon2.o += $(NEON_FLAGS)
> +targets += neon2.c
> +$(obj)/neon2.c: UNROLL := 2
> +$(obj)/neon2.c: $(src)/neon.uc $(src)/unroll.awk FORCE
> + $(call if_changed,unroll)
> +
> +CFLAGS_neon4.o += $(NEON_FLAGS)
> +targets += neon4.c
> +$(obj)/neon4.c: UNROLL := 4
> +$(obj)/neon4.c: $(src)/neon.uc $(src)/unroll.awk FORCE
> + $(call if_changed,unroll)
> +
> +CFLAGS_neon8.o += $(NEON_FLAGS)
> +targets += neon8.c
> +$(obj)/neon8.c: UNROLL := 8
> +$(obj)/neon8.c: $(src)/neon.uc $(src)/unroll.awk FORCE
> + $(call if_changed,unroll)
> +
> quiet_cmd_mktable = TABLE $@
> cmd_mktable = $(obj)/mktables > $@ || ( rm -f $@ && exit 1 )
>
> diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> index 6d7316f..74e6f56 100644
> --- a/lib/raid6/algos.c
> +++ b/lib/raid6/algos.c
> @@ -70,6 +70,12 @@ const struct raid6_calls * const raid6_algos[] = {
> &raid6_intx2,
> &raid6_intx4,
> &raid6_intx8,
> +#ifdef CONFIG_KERNEL_MODE_NEON
> + &raid6_neonx1,
> + &raid6_neonx2,
> + &raid6_neonx4,
> + &raid6_neonx8,
> +#endif
> NULL
> };
>
> diff --git a/lib/raid6/neon.c b/lib/raid6/neon.c
> new file mode 100644
> index 0000000..dad7102
> --- /dev/null
> +++ b/lib/raid6/neon.c
> @@ -0,0 +1,58 @@
> +/*
> + * linux/lib/raid6/neon.c - RAID6 syndrome calculation using ARM NEON intrinsics
> + *
> + * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/raid/pq.h>
> +
> +#ifdef __KERNEL__
> +#include <asm/neon.h>
> +#else
> +#define kernel_neon_begin()
> +#define kernel_neon_end()
> +#define cpu_has_neon() (1)
> +#endif
> +
> +/*
> + * There are 2 reasons these wrappers are kept in a separate compilation unit
> + * from the actual implementations in neonN.c (generated from neon.uc by
> + * unroll.awk):
> + * - the actual implementations use NEON intrinsics, and the GCC support header
> + * (arm_neon.h) is not fully compatible (type wise) with the kernel;
> + * - the neonN.c files are compiled with -mfpu=neon and optimization enabled,
> + * and we have to make sure that we never use *any* NEON/VFP instructions
> + * outside a kernel_neon_begin()/kernel_neon_end() pair.
> + */
> +
> +#define RAID6_NEON_WRAPPER(_n) \
> + static void raid6_neon ## _n ## _gen_syndrome(int disks, \
> + size_t bytes, void **ptrs) \
> + { \
> + void raid6_neon ## _n ## _gen_syndrome_real(int, \
> + unsigned int, void**); \
> + kernel_neon_begin(); \
> + raid6_neon ## _n ## _gen_syndrome_real(disks, \
> + (unsigned int)bytes, ptrs); \
> + kernel_neon_end(); \
> + } \
> + struct raid6_calls const raid6_neonx ## _n = { \
> + raid6_neon ## _n ## _gen_syndrome, \
> + raid6_have_neon, \
> + "neonx" #_n, \
> + 0 \
> + };
> +
> +static int raid6_have_neon(void)
> +{
> + return cpu_has_neon();
> +}
> +
> +RAID6_NEON_WRAPPER(1)
> +RAID6_NEON_WRAPPER(2)
> +RAID6_NEON_WRAPPER(4)
> +RAID6_NEON_WRAPPER(8)
> diff --git a/lib/raid6/neon.uc b/lib/raid6/neon.uc
> new file mode 100644
> index 0000000..f2d7ec0
> --- /dev/null
> +++ b/lib/raid6/neon.uc
> @@ -0,0 +1,80 @@
> +/* -----------------------------------------------------------------------
> + *
> + * neon.uc - RAID-6 syndrome calculation using ARM NEON instructions
> + *
> + * Copyright (C) 2012 Rob Herring
> + *
> + * Based on altivec.uc:
> + * Copyright 2002-2004 H. Peter Anvin - All Rights Reserved
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, Inc., 53 Temple Place Ste 330,
> + * Boston MA 02111-1307, USA; either version 2 of the License, or
> + * (at your option) any later version; incorporated herein by reference.
> + *
> + * ----------------------------------------------------------------------- */
> +
> +/*
> + * neon$#.c
> + *
> + * $#-way unrolled NEON intrinsics math RAID-6 instruction set
> + *
> + * This file is postprocessed using unroll.awk
> + */
> +
> +#include <arm_neon.h>
> +
> +typedef uint8x16_t unative_t;
> +
> +#define NBYTES(x) ((unative_t){x,x,x,x, x,x,x,x, x,x,x,x, x,x,x,x})
> +#define NSIZE sizeof(unative_t)
> +
> +/*
> + * The SHLBYTE() operation shifts each byte left by 1, *not*
> + * rolling over into the next byte
> + */
> +static inline unative_t SHLBYTE(unative_t v)
> +{
> + return vshlq_n_u8(v, 1);
> +}
> +
> +/*
> + * The MASK() operation returns 0xFF in any byte for which the high
> + * bit is 1, 0x00 for any byte for which the high bit is 0.
> + */
> +static inline unative_t MASK(unative_t v)
> +{
> + const uint8x16_t temp = NBYTES(0);
> + return (unative_t)vcltq_s8((int8x16_t)v, (int8x16_t)temp);
> +}
> +
> +void raid6_neon$#_gen_syndrome_real(int disks, unsigned int bytes, void **ptrs)
> +{
> + uint8_t **dptr = (uint8_t **)ptrs;
> + uint8_t *p, *q;
> + int d, z, z0;
> +
> + register unative_t wd$$, wq$$, wp$$, w1$$, w2$$;
> + const unative_t x1d = NBYTES(0x1d);
> +
> + z0 = disks - 3; /* Highest data disk */
> + p = dptr[z0+1]; /* XOR parity */
> + q = dptr[z0+2]; /* RS syndrome */
> +
> + for ( d = 0 ; d < bytes ; d += NSIZE*$# ) {
> + wq$$ = wp$$ = vld1q_u8(&dptr[z0][d+$$*NSIZE]);
> + for ( z = z0-1 ; z >= 0 ; z-- ) {
> + wd$$ = vld1q_u8(&dptr[z][d+$$*NSIZE]);
> + wp$$ = veorq_u8(wp$$, wd$$);
> + w2$$ = MASK(wq$$);
> + w1$$ = SHLBYTE(wq$$);
> +
> + w2$$ = vandq_u8(w2$$, x1d);
> + w1$$ = veorq_u8(w1$$, w2$$);
> + wq$$ = veorq_u8(w1$$, wd$$);
> + }
> + vst1q_u8(&p[d+NSIZE*$$], wp$$);
> + vst1q_u8(&q[d+NSIZE*$$], wq$$);
> + }
> +}
> diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
> index 087332d..bbe6450 100644
> --- a/lib/raid6/test/Makefile
> +++ b/lib/raid6/test/Makefile
> @@ -34,6 +34,11 @@ else
> ifeq ($(HAS_ALTIVEC),yes)
> OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
> endif
> + ifeq ($(ARCH),arm)
> + CFLAGS += -I../../../arch/arm/include -mfpu=neon \
> + -DCONFIG_KERNEL_MODE_NEON=1
> + OBJS += neon.o neon1.o neon2.o neon4.o neon8.o
> + endif
> endif
>
> .c.o:
> @@ -55,6 +60,18 @@ raid6.a: $(OBJS)
> raid6test: test.c raid6.a
> $(CC) $(CFLAGS) -o raid6test $^
>
> +neon1.c: neon.uc ../unroll.awk
> + $(AWK) ../unroll.awk -vN=1 < neon.uc > $@
> +
> +neon2.c: neon.uc ../unroll.awk
> + $(AWK) ../unroll.awk -vN=2 < neon.uc > $@
> +
> +neon4.c: neon.uc ../unroll.awk
> + $(AWK) ../unroll.awk -vN=4 < neon.uc > $@
> +
> +neon8.c: neon.uc ../unroll.awk
> + $(AWK) ../unroll.awk -vN=8 < neon.uc > $@
> +
> altivec1.c: altivec.uc ../unroll.awk
> $(AWK) ../unroll.awk -vN=1 < altivec.uc > $@
>
> @@ -89,7 +106,7 @@ tables.c: mktables
> ./mktables > tables.c
>
> clean:
> - rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c tables.c raid6test
> + rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c neon*.c tables.c raid6test
>
> spotless: clean
> rm -f *~
> --
> 1.8.1.2
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
` (4 preceding siblings ...)
2013-06-06 15:03 ` [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation Ard Biesheuvel
@ 2013-06-06 15:17 ` Will Deacon
2013-06-06 15:52 ` Ard Biesheuvel
2013-06-06 16:17 ` Nicolas Pitre
5 siblings, 2 replies; 24+ messages in thread
From: Will Deacon @ 2013-06-06 15:17 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
> Hi all,
Hi Ard,
> This is a partial repost of the patches I proposed a couple of weeks ago to add
> support for VFP/NEON in kernel mode.
>
> This time, I have included two use cases that I have been using, XOR and RAID-6
> checksumming. The former gets a 60% performance boost on the NEON, the latter
> over 400%.
Whilst that sounds impressive, can you achieve similar results across all
NEON-capable CPUs? In particular, we need to make sure this doesn't cause
performance regressions on some cores. Furthermore, do you have any power
figures to complement your findings? The increased context-switch overhead
is also worth measuring if you can (i.e. run some userspace NEON-based
benchmarks in parallel with NEON and non-NEON implementations of the
checksumming).
> lib/raid6: add ARM-NEON accelerated syndrome calculation
>
> This is a port of the RAID-6 checksumming code in altivec.uc ported to use NEON
> intrinsics. It is about 4x faster than the sequential code. As this code does
> not live under arch/arm, I will send this patch separately to the appropriate
> list if/when the prerequisite patches from this series have been accepted.
We support building the kernel with older toolchains, so I don't see the
benefit of using intrinsics here. Have you tried writing an implementation
with NEON instructions directly?
Will
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
@ 2013-06-06 15:52 ` Ard Biesheuvel
2013-06-06 16:17 ` Nicolas Pitre
1 sibling, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:52 UTC (permalink / raw)
To: linux-arm-kernel
On 6 June 2013 17:17, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
>> This time, I have included two use cases that I have been using, XOR and RAID-6
>> checksumming. The former gets a 60% performance boost on the NEON, the latter
>> over 400%.
>
> Whilst that sounds impressive, can you achieve similar results across all
> NEON-capable CPUs? In particular, we need to make sure this doesn't cause
> performance regressions on some cores. Furthermore, do you have any power
I don't expect A8 or A9 to be on par. However, the two examples I have
included perform a quick benchmark at boot to decide which one to
pick, so unless the benchmark is a very poor predictor of the
performance at run time, we should be ok here.
> figures to complement your findings? The increased context-switch overhead
> is also worth measuring if you can (i.e. run some userspace NEON-based
> benchmarks in parallel with NEON and non-NEON implementations of the
> checksumming).
>
Good point. I will follow up on that later.
>> lib/raid6: add ARM-NEON accelerated syndrome calculation
>>
>> This is a port of the RAID-6 checksumming code in altivec.uc ported to use NEON
>> intrinsics. It is about 4x faster than the sequential code. As this code does
>> not live under arch/arm, I will send this patch separately to the appropriate
>> list if/when the prerequisite patches from this series have been accepted.
>
> We support building the kernel with older toolchains, so I don't see the
> benefit of using intrinsics here. Have you tried writing an implementation
> with NEON instructions directly?
>
I have tried an alternate version coded in assembly that was
contributed by Vladimir Murzin. But obviously, compiling to an .S file
should also do the trick if this is a concern.
However, there are two reasons I have chosen these particular examples:
- they can be built for both v7 and v8;
- they illustrate the need to be careful about when GCC might generate
NEON instructions.
I am also working on bit sliced AES which is in fact NEON assembly,
and is about 50% faster in CTR mode (on A15)
--
Ard.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
2013-06-06 15:52 ` Ard Biesheuvel
@ 2013-06-06 16:17 ` Nicolas Pitre
2013-06-06 23:08 ` Rob Herring
2013-06-07 17:50 ` Will Deacon
1 sibling, 2 replies; 24+ messages in thread
From: Nicolas Pitre @ 2013-06-06 16:17 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 6 Jun 2013, Will Deacon wrote:
> On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
> > Hi all,
>
> Hi Ard,
>
> > This is a partial repost of the patches I proposed a couple of weeks ago to add
> > support for VFP/NEON in kernel mode.
> >
> > This time, I have included two use cases that I have been using, XOR and RAID-6
> > checksumming. The former gets a 60% performance boost on the NEON, the latter
> > over 400%.
>
> Whilst that sounds impressive, can you achieve similar results across all
> NEON-capable CPUs? In particular, we need to make sure this doesn't cause
> performance regressions on some cores.
Note that the kernel performs runtime benchmarking of all the different
implementations it has available at boot time and selects the best one.
So if this would turn out to make things worse on some cores then the
Neon code would simply not be used.
> Furthermore, do you have any power figures to complement your
> findings?
This is going to be most useful in server type environments where a bit
more power is not such an issue but throughput is ... unless you start
using RAID6 arrays on your phone that is. :-) Otherwise this can be
left configured out for mobile targets.
> The increased context-switch overhead
> is also worth measuring if you can (i.e. run some userspace NEON-based
> benchmarks in parallel with NEON and non-NEON implementations of the
> checksumming).
Do we know the context switch cost of normal task scheduling between
tasks using FP operations? The in-kernel Neon usage should bring about
the same cost. Measuring it would be interesting albeit probably
difficult.
> We support building the kernel with older toolchains, so I don't see the
> benefit of using intrinsics here.
These days the compiler tends to do a better job than humans at properly
scheduling instructions for some code. We shouldn't deprive ourselves
from it when a recent enough gcc is available.
Nicolas
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-06 16:17 ` Nicolas Pitre
@ 2013-06-06 23:08 ` Rob Herring
2013-06-07 17:50 ` Will Deacon
1 sibling, 0 replies; 24+ messages in thread
From: Rob Herring @ 2013-06-06 23:08 UTC (permalink / raw)
To: linux-arm-kernel
On 06/06/2013 11:17 AM, Nicolas Pitre wrote:
> On Thu, 6 Jun 2013, Will Deacon wrote:
>
>> On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
>>> Hi all,
>>
>> Hi Ard,
>>
>>> This is a partial repost of the patches I proposed a couple of weeks ago to add
>>> support for VFP/NEON in kernel mode.
>>>
>>> This time, I have included two use cases that I have been using, XOR and RAID-6
>>> checksumming. The former gets a 60% performance boost on the NEON, the latter
>>> over 400%.
>>
>> Whilst that sounds impressive, can you achieve similar results across all
>> NEON-capable CPUs? In particular, we need to make sure this doesn't cause
>> performance regressions on some cores.
>
> Note that the kernel performs runtime benchmarking of all the different
> implementations it has available at boot time and selects the best one.
> So if this would turn out to make things worse on some cores then the
> Neon code would simply not be used.
>
>> Furthermore, do you have any power figures to complement your
>> findings?
>
> This is going to be most useful in server type environments where a bit
> more power is not such an issue but throughput is ... unless you start
> using RAID6 arrays on your phone that is. :-) Otherwise this can be
> left configured out for mobile targets.
Agreed. Any power difference will be noise for a server.
Rob
>> The increased context-switch overhead
>> is also worth measuring if you can (i.e. run some userspace NEON-based
>> benchmarks in parallel with NEON and non-NEON implementations of the
>> checksumming).
>
> Do we know the context switch cost of normal task scheduling between
> tasks using FP operations? The in-kernel Neon usage should bring about
> the same cost. Measuring it would be interesting albeit probably
> difficult.
>
>> We support building the kernel with older toolchains, so I don't see the
>> benefit of using intrinsics here.
>
> These days the compiler tends to do a better job than humans at properly
> scheduling instructions for some code. We shouldn't deprive ourselves
> from it when a recent enough gcc is available.
>
>
> Nicolas
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-06 16:17 ` Nicolas Pitre
2013-06-06 23:08 ` Rob Herring
@ 2013-06-07 17:50 ` Will Deacon
2013-06-07 19:49 ` Ard Biesheuvel
2013-06-08 3:09 ` Nicolas Pitre
1 sibling, 2 replies; 24+ messages in thread
From: Will Deacon @ 2013-06-07 17:50 UTC (permalink / raw)
To: linux-arm-kernel
Hello Nicolas, Ard,
On Thu, Jun 06, 2013 at 05:17:39PM +0100, Nicolas Pitre wrote:
> On Thu, 6 Jun 2013, Will Deacon wrote:
> > On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
> > > This time, I have included two use cases that I have been using, XOR and RAID-6
> > > checksumming. The former gets a 60% performance boost on the NEON, the latter
> > > over 400%.
> >
> > Whilst that sounds impressive, can you achieve similar results across all
> > NEON-capable CPUs? In particular, we need to make sure this doesn't cause
> > performance regressions on some cores.
>
> Note that the kernel performs runtime benchmarking of all the different
> implementations it has available at boot time and selects the best one.
> So if this would turn out to make things worse on some cores then the
> Neon code would simply not be used.
That will be all sorts of fun if we try to run this on big.LITTLE...
Perhaps we don't care about that either.
> > Furthermore, do you have any power figures to complement your
> > findings?
>
> This is going to be most useful in server type environments where a bit
> more power is not such an issue but throughput is ... unless you start
> using RAID6 arrays on your phone that is. :-) Otherwise this can be
> left configured out for mobile targets.
Agreed, but this patch series also sets a precedent for using NEON in the
kernel. Whilst I'd love to hook up some SCSI arrays to my Nexus 4 (!), much
more likely is that people might start reworking some of the crypto algorithms
to use the NEON/SIMD register file (especially with the crypto extensions in
ARMv8) so it would be good to have *some* feel of the power impact off the
bat.
> > The increased context-switch overhead
> > is also worth measuring if you can (i.e. run some userspace NEON-based
> > benchmarks in parallel with NEON and non-NEON implementations of the
> > checksumming).
>
> Do we know the context switch cost of normal task scheduling between
> tasks using FP operations? The in-kernel Neon usage should bring about
> the same cost. Measuring it would be interesting albeit probably
> difficult.
Sure, this stuff is hard to measure and we don't have a feel for the normal
context-switch penalities. I just think it would be useful to try and get a
feel for the increased overhead of saving/restoring this state if userspace
is trying to use the registers in parallel with the kernel.
> > We support building the kernel with older toolchains, so I don't see the
> > benefit of using intrinsics here.
>
> These days the compiler tends to do a better job than humans at properly
> scheduling instructions for some code. We shouldn't deprive ourselves
> from it when a recent enough gcc is available.
What's the earliest toolchain we claim to support nowadays? If that can't
deal with the intrinsics then we either need to bump the requirement, or
write this using hand-coded asm. In the case of the latter, I don't think
the maintenance overhead of having two implementations is worth it.
Will
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-07 17:50 ` Will Deacon
@ 2013-06-07 19:49 ` Ard Biesheuvel
2013-06-08 3:09 ` Nicolas Pitre
1 sibling, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-07 19:49 UTC (permalink / raw)
To: linux-arm-kernel
Hi Will,
On 7 June 2013 19:50, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Jun 06, 2013 at 05:17:39PM +0100, Nicolas Pitre wrote:
>> On Thu, 6 Jun 2013, Will Deacon wrote:
>> > On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
>> > > This time, I have included two use cases that I have been using, XOR and RAID-6
>> > > checksumming. The former gets a 60% performance boost on the NEON, the latter
>> > > over 400%.
>> >
>> > Whilst that sounds impressive, can you achieve similar results across all
>> > NEON-capable CPUs? In particular, we need to make sure this doesn't cause
>> > performance regressions on some cores.
>>
>> Note that the kernel performs runtime benchmarking of all the different
>> implementations it has available at boot time and selects the best one.
>> So if this would turn out to make things worse on some cores then the
>> Neon code would simply not be used.
>
> That will be all sorts of fun if we try to run this on big.LITTLE...
> Perhaps we don't care about that either.
>
Doesn't that apply equally with and without NEON? I mean, there are
several non-NEON flavors of the RAID6 and XOR algorithms already, and
the benchmark at boot time decides which one gets used until the next
reboot.
>> > Furthermore, do you have any power figures to complement your
>> > findings?
>>
>> This is going to be most useful in server type environments where a bit
>> more power is not such an issue but throughput is ... unless you start
>> using RAID6 arrays on your phone that is. :-) Otherwise this can be
>> left configured out for mobile targets.
>
> Agreed, but this patch series also sets a precedent for using NEON in the
> kernel. Whilst I'd love to hook up some SCSI arrays to my Nexus 4 (!), much
> more likely is that people might start reworking some of the crypto algorithms
> to use the NEON/SIMD register file (especially with the crypto extensions in
> ARMv8) so it would be good to have *some* feel of the power impact off the
> bat.
>
Why would the kernel be any different from userland in this respect?
>> > The increased context-switch overhead
>> > is also worth measuring if you can (i.e. run some userspace NEON-based
>> > benchmarks in parallel with NEON and non-NEON implementations of the
>> > checksumming).
>>
>> Do we know the context switch cost of normal task scheduling between
>> tasks using FP operations? The in-kernel Neon usage should bring about
>> the same cost. Measuring it would be interesting albeit probably
>> difficult.
>
> Sure, this stuff is hard to measure and we don't have a feel for the normal
> context-switch penalities. I just think it would be useful to try and get a
> feel for the increased overhead of saving/restoring this state if userspace
> is trying to use the registers in parallel with the kernel.
>
With NEON only supported outside interrupt context, and no preemption
(as the patch proposes), I don't expect the context switch overhead to
be substantially worse than with as many userland processes competing
for the NEON(s). Perhaps the increased latency is more of a concern
here.
>> > We support building the kernel with older toolchains, so I don't see the
>> > benefit of using intrinsics here.
>>
>> These days the compiler tends to do a better job than humans at properly
>> scheduling instructions for some code. We shouldn't deprive ourselves
>> from it when a recent enough gcc is available.
>
> What's the earliest toolchain we claim to support nowadays? If that can't
> deal with the intrinsics then we either need to bump the requirement, or
> write this using hand-coded asm. In the case of the latter, I don't think
> the maintenance overhead of having two implementations is worth it.
>
I agree that maintaining an intrinsics version side by side with an
assembly version makes no sense. The same applies to the XOR patch, it
uses -ftree-vectorize, and requires 4.6 (and issues a #warning if your
gcc is older), so if we feel that is not appropriate, I will happily
replace it with a plain assembly version.
However, the main point of this discussion is whether
a) allowing NEON in kernel mode is a good idea in the first place
b) whether the way I propose to do it is an acceptable one.
Any comments/questions on that part?
Regards,
Ard.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-07 17:50 ` Will Deacon
2013-06-07 19:49 ` Ard Biesheuvel
@ 2013-06-08 3:09 ` Nicolas Pitre
2013-06-21 9:33 ` Will Deacon
1 sibling, 1 reply; 24+ messages in thread
From: Nicolas Pitre @ 2013-06-08 3:09 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 7 Jun 2013, Will Deacon wrote:
> Hello Nicolas, Ard,
>
> On Thu, Jun 06, 2013 at 05:17:39PM +0100, Nicolas Pitre wrote:
> > On Thu, 6 Jun 2013, Will Deacon wrote:
> > > On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
> > > > This time, I have included two use cases that I have been using, XOR and RAID-6
> > > > checksumming. The former gets a 60% performance boost on the NEON, the latter
> > > > over 400%.
> > >
> > > Whilst that sounds impressive, can you achieve similar results across all
> > > NEON-capable CPUs? In particular, we need to make sure this doesn't cause
> > > performance regressions on some cores.
> >
> > Note that the kernel performs runtime benchmarking of all the different
> > implementations it has available at boot time and selects the best one.
> > So if this would turn out to make things worse on some cores then the
> > Neon code would simply not be used.
>
> That will be all sorts of fun if we try to run this on big.LITTLE...
> Perhaps we don't care about that either.
Probably not at the present time.
> > > Furthermore, do you have any power figures to complement your
> > > findings?
> >
> > This is going to be most useful in server type environments where a bit
> > more power is not such an issue but throughput is ... unless you start
> > using RAID6 arrays on your phone that is. :-) Otherwise this can be
> > left configured out for mobile targets.
>
> Agreed, but this patch series also sets a precedent for using NEON in the
> kernel. Whilst I'd love to hook up some SCSI arrays to my Nexus 4 (!), much
> more likely is that people might start reworking some of the crypto algorithms
> to use the NEON/SIMD register file (especially with the crypto extensions in
> ARMv8) so it would be good to have *some* feel of the power impact off the
> bat.
Well... Neon is there to be used, otherwise it is just a waste of gates.
So of course it is going to use more power. but as long as the power
used by Neon is less than the power consumed by the same task performed
by the main processor then we're happy. Ard provided numbers where Neon
performs 4 times better while it surely doesn't use 4 times the power
(or so I hope).
That shouldn't matter much if that power is used in user or kernel
space. OTOH the kernel does use crypto algorithms so it does need Neon
if we want 4x the throughput.
In the end what I want to say is that the power profile is for system
integrator to assess and decide. We cannot tell if the Neon power usage
is good or bad without the overall application use case. All we should
do is to provide the mechanism and make it configurable. Same argument
applies to the context switch overhead.
> > > We support building the kernel with older toolchains, so I don't see the
> > > benefit of using intrinsics here.
> >
> > These days the compiler tends to do a better job than humans at properly
> > scheduling instructions for some code. We shouldn't deprive ourselves
> > from it when a recent enough gcc is available.
>
> What's the earliest toolchain we claim to support nowadays? If that can't
> deal with the intrinsics then we either need to bump the requirement, or
> write this using hand-coded asm. In the case of the latter, I don't think
> the maintenance overhead of having two implementations is worth it.
We have many different minimum toolchain version requirements attached
to different features being enabled already, ftrace being one of them if
I remember correctly. For these Neon optimizations the minimum gcc
version is v4.6.
Given that this is going to be interesting mostly to server systems, and
given that ARM server deployments are rather new, I don't see the point
of compiling a new server environment using an older gcc version.
I don't think we have to bump the gcc requirement for anyone wishing to
compile the kernel with their existing set of features either. That
would be rather silly. It is not because Neon intrinsics are used in
some kernel code that everyone should be forced to upgrade their
compiler, especially if they don't intend to use that in-kernel Neon
code. However, in order to benefit from optional new features that
didn't exist before, I think it is perfectly reasonable to require later
gcc versions for them if need be.
I agree that having two different implementations of the same thing is
not the way to go. So if the choice between a pure assembly vs a C
version with intrinsics has to be made, then I'd vote for the C version
unless the assembly one is much faster. C code is always preferable to
assembly code as it is much easier to review and modify, and
improvements to the compiler may still increase performance of the
unchanged code while the assembly version is static and will always be
tuned to some particular core implementations only.
If someone eventually comes along with some pure assembly version that
blows the current C version out of the water then we simply replace it,
period. We don't have to commit ourselves to a particular
implementation either.
But for that to happen we need to merge this code and let people
experiment with it. That's how improvements will come about.
Nicolas
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-08 3:09 ` Nicolas Pitre
@ 2013-06-21 9:33 ` Will Deacon
2013-06-21 10:08 ` Ard Biesheuvel
0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2013-06-21 9:33 UTC (permalink / raw)
To: linux-arm-kernel
Hi guys,
On Sat, Jun 08, 2013 at 04:09:56AM +0100, Nicolas Pitre wrote:
> On Fri, 7 Jun 2013, Will Deacon wrote:
> > > Note that the kernel performs runtime benchmarking of all the different
> > > implementations it has available at boot time and selects the best one.
> > > So if this would turn out to make things worse on some cores then the
> > > Neon code would simply not be used.
> >
> > That will be all sorts of fun if we try to run this on big.LITTLE...
> > Perhaps we don't care about that either.
>
> Probably not at the present time.
[...]
> > What's the earliest toolchain we claim to support nowadays? If that can't
> > deal with the intrinsics then we either need to bump the requirement, or
> > write this using hand-coded asm. In the case of the latter, I don't think
> > the maintenance overhead of having two implementations is worth it.
>
> We have many different minimum toolchain version requirements attached
> to different features being enabled already, ftrace being one of them if
> I remember correctly. For these Neon optimizations the minimum gcc
> version is v4.6.
>
> Given that this is going to be interesting mostly to server systems, and
> given that ARM server deployments are rather new, I don't see the point
> of compiling a new server environment using an older gcc version.
I've mulled over this, had some discussions with our toolchain guys and
have concluded the following:
- The intrinsics are actually ok. I was sceptical at first, but I've been
assured that they should do a reasonable job (echoing your performance
figures).
- The current approach is targetting servers and isn't (yet) suitable for
mobile.
So, given that the patches do the right thing wrt GCC version, the only
remaining point is that we need to keep an eye out for people trying to
re-use this stuff for mobile (likely crypto, as I mentioned earlier). When
that happens, we should consider revisiting the benchmark/power figures.
Will
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-21 9:33 ` Will Deacon
@ 2013-06-21 10:08 ` Ard Biesheuvel
2013-06-21 14:58 ` Christopher Covington
0 siblings, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-21 10:08 UTC (permalink / raw)
To: linux-arm-kernel
On 21 June 2013 11:33, Will Deacon <will.deacon@arm.com> wrote:
> On Sat, Jun 08, 2013 at 04:09:56AM +0100, Nicolas Pitre wrote:
>> On Fri, 7 Jun 2013, Will Deacon wrote:
>> > What's the earliest toolchain we claim to support nowadays? If that can't
>> > deal with the intrinsics then we either need to bump the requirement, or
>> > write this using hand-coded asm. In the case of the latter, I don't think
>> > the maintenance overhead of having two implementations is worth it.
>>
>> We have many different minimum toolchain version requirements attached
>> to different features being enabled already, ftrace being one of them if
>> I remember correctly. For these Neon optimizations the minimum gcc
>> version is v4.6.
>>
>> Given that this is going to be interesting mostly to server systems, and
>> given that ARM server deployments are rather new, I don't see the point
>> of compiling a new server environment using an older gcc version.
>
> I've mulled over this, had some discussions with our toolchain guys and
> have concluded the following:
>
> - The intrinsics are actually ok. I was sceptical at first, but I've been
> assured that they should do a reasonable job (echoing your performance
> figures).
>
> - The current approach is targetting servers and isn't (yet) suitable for
> mobile.
>
> So, given that the patches do the right thing wrt GCC version, the only
> remaining point is that we need to keep an eye out for people trying to
> re-use this stuff for mobile (likely crypto, as I mentioned earlier). When
> that happens, we should consider revisiting the benchmark/power figures.
>
OK, so a number of points have been raised in this discussion, let me
address them one by one:
Should we allow NEON to be used in the kernel?
The consensus is not to allow floating point. However, NEON is
different, as the performance gains are considerable and there is no
dependency on support code, which makes it not as hairy as
conventional (pre-v3) VFP. Also, managing the vfpstates is easily
doable if NEON is only used outside interrupt context and with
preemption disabled.
Does my series implement it correctly?
I have addressed Russell's first round of comments. Happy to take
another round if necessary.
Should we allow NEON intrinsics in the kernel?
Should we allow GCC-generated NEON in the kernel?
Only if the implementation is clear on which minimum version of GCC it
requires. We could use my examples to set a precedent on what is a
suitable way to use NEON intrinsics or the vectorizer in kernel code
(which includes coding it such that it can be reused for arm64 with no
modifications)
Is kernel mode NEON suitable for mobile?
To me, it is unclear why kernel and userland are so different in this
respect. However, kernel mode NEON is separately configurable from
Kconfig so it can be disabled at will.
Is there a point to doing a boot time benchmark to select the optimal
implementation of an algorithm?
Perhaps not but unrelated to kernel mode NEON.
Code is here
http://git.linaro.org/gitweb?p=people/ardbiesheuvel/linux-arm.git;a=shortlog;h=refs/heads/for-rmk
Regards,
Ard.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-21 10:08 ` Ard Biesheuvel
@ 2013-06-21 14:58 ` Christopher Covington
2013-06-24 8:08 ` Ard Biesheuvel
2013-06-25 13:56 ` Dave Martin
0 siblings, 2 replies; 24+ messages in thread
From: Christopher Covington @ 2013-06-21 14:58 UTC (permalink / raw)
To: linux-arm-kernel
Hi Ard,
On 06/21/2013 06:08 AM, Ard Biesheuvel wrote:
> On 21 June 2013 11:33, Will Deacon <will.deacon@arm.com> wrote:
>> On Sat, Jun 08, 2013 at 04:09:56AM +0100, Nicolas Pitre wrote:
>>> On Fri, 7 Jun 2013, Will Deacon wrote:
>>>> What's the earliest toolchain we claim to support nowadays? If that can't
>>>> deal with the intrinsics then we either need to bump the requirement, or
>>>> write this using hand-coded asm. In the case of the latter, I don't think
>>>> the maintenance overhead of having two implementations is worth it.
>>>
>>> We have many different minimum toolchain version requirements attached
>>> to different features being enabled already, ftrace being one of them if
>>> I remember correctly. For these Neon optimizations the minimum gcc
>>> version is v4.6.
>>>
>>> Given that this is going to be interesting mostly to server systems, and
>>> given that ARM server deployments are rather new, I don't see the point
>>> of compiling a new server environment using an older gcc version.
>>
>> I've mulled over this, had some discussions with our toolchain guys and
>> have concluded the following:
>>
>> - The intrinsics are actually ok. I was sceptical at first, but I've been
>> assured that they should do a reasonable job (echoing your performance
>> figures).
>>
>> - The current approach is targetting servers and isn't (yet) suitable for
>> mobile.
>>
>> So, given that the patches do the right thing wrt GCC version, the only
>> remaining point is that we need to keep an eye out for people trying to
>> re-use this stuff for mobile (likely crypto, as I mentioned earlier). When
>> that happens, we should consider revisiting the benchmark/power figures.
>>
>
> OK, so a number of points have been raised in this discussion, let me
> address them one by one:
>
> Should we allow NEON to be used in the kernel?
>
> The consensus is not to allow floating point. However, NEON is
> different, as the performance gains are considerable and there is no
> dependency on support code, which makes it not as hairy as
> conventional (pre-v3) VFP. Also, managing the vfpstates is easily
> doable if NEON is only used outside interrupt context and with
> preemption disabled.
>
>
> Does my series implement it correctly?
>
> I have addressed Russell's first round of comments. Happy to take
> another round if necessary.
>
>
> Should we allow NEON intrinsics in the kernel?
> Should we allow GCC-generated NEON in the kernel?
>
> Only if the implementation is clear on which minimum version of GCC it
> requires. We could use my examples to set a precedent on what is a
> suitable way to use NEON intrinsics or the vectorizer in kernel code
> (which includes coding it such that it can be reused for arm64 with no
> modifications)
>
>
> Is kernel mode NEON suitable for mobile?
>
> To me, it is unclear why kernel and userland are so different in this
> respect. However, kernel mode NEON is separately configurable from
> Kconfig so it can be disabled at will.
>
>
> Is there a point to doing a boot time benchmark to select the optimal
> implementation of an algorithm?
>
> Perhaps not but unrelated to kernel mode NEON.
If this is indeed the consensus (I don't disagree with any of it myself),
perhaps committing the main points, guidelines, and examples to
Documentation/arm/* would be useful.
Christopher
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-21 14:58 ` Christopher Covington
@ 2013-06-24 8:08 ` Ard Biesheuvel
2013-06-24 8:54 ` Russell King - ARM Linux
2013-06-25 13:56 ` Dave Martin
1 sibling, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-24 8:08 UTC (permalink / raw)
To: linux-arm-kernel
On 21 June 2013 16:58, Christopher Covington <cov@codeaurora.org> wrote:
> Hi Ard,
>
> If this is indeed the consensus (I don't disagree with any of it myself),
> perhaps committing the main points, guidelines, and examples to
> Documentation/arm/* would be useful.
>
Hello Chris,
I agree that it makes sense to document any guidelines we set
regarding the use of NEON in kernel mode, and I will happily do so.
However, let's first try to wrap up this discussion, haven't heard
back from Russell yet on his current position in this matter.
Regards,
Ard.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-24 8:08 ` Ard Biesheuvel
@ 2013-06-24 8:54 ` Russell King - ARM Linux
2013-06-24 9:10 ` Ard Biesheuvel
0 siblings, 1 reply; 24+ messages in thread
From: Russell King - ARM Linux @ 2013-06-24 8:54 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jun 24, 2013 at 10:08:10AM +0200, Ard Biesheuvel wrote:
> On 21 June 2013 16:58, Christopher Covington <cov@codeaurora.org> wrote:
> > Hi Ard,
> >
> > If this is indeed the consensus (I don't disagree with any of it myself),
> > perhaps committing the main points, guidelines, and examples to
> > Documentation/arm/* would be useful.
> >
>
> Hello Chris,
>
> I agree that it makes sense to document any guidelines we set
> regarding the use of NEON in kernel mode, and I will happily do so.
> However, let's first try to wrap up this discussion, haven't heard
> back from Russell yet on his current position in this matter.
I think it's probably fine now - you have ample justification, and you
seem to be ensuring that things are done safely.
The only thing I'm left wondering about is whether the bits outside
arch/arm should go to anyone else, but MAINTAINERS is being unhelpful
on that - so I'm tempted to say that if someone should've been copied
who hasn't, and they're not in MAINTAINERS that's their problem.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-24 8:54 ` Russell King - ARM Linux
@ 2013-06-24 9:10 ` Ard Biesheuvel
0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-24 9:10 UTC (permalink / raw)
To: linux-arm-kernel
On 24 June 2013 10:54, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>
> I think it's probably fine now - you have ample justification, and you
> seem to be ensuring that things are done safely.
>
Great!
> The only thing I'm left wondering about is whether the bits outside
> arch/arm should go to anyone else, but MAINTAINERS is being unhelpful
> on that - so I'm tempted to say that if someone should've been copied
> who hasn't, and they're not in MAINTAINERS that's their problem.
As far as the RAID6 patch is concerned, my plan was to:
- get the ARM groundwork (this series) accepted (including the XOR
patch, which does live strictly under arch/arm);
- get the arm64 groundwork accepted (which uses the same API);
- post the RAID6 patch to LKML and cc to hpa (I have a local version
which has been slightly tweaked so it builds correctly for both arm
and arm64).
However, the RAID 6 patch does only affect arm/arm64, so perhaps it's
justified that you take it all in, only in that case, please allow me
to re-send the pull request so it contains the version that will build
on arm64 as well.
Regards,
Ard.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-21 14:58 ` Christopher Covington
2013-06-24 8:08 ` Ard Biesheuvel
@ 2013-06-25 13:56 ` Dave Martin
2013-06-25 14:14 ` Ard Biesheuvel
1 sibling, 1 reply; 24+ messages in thread
From: Dave Martin @ 2013-06-25 13:56 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Jun 21, 2013 at 10:58:21AM -0400, Christopher Covington wrote:
> Hi Ard,
>
> On 06/21/2013 06:08 AM, Ard Biesheuvel wrote:
> > On 21 June 2013 11:33, Will Deacon <will.deacon@arm.com> wrote:
> >> On Sat, Jun 08, 2013 at 04:09:56AM +0100, Nicolas Pitre wrote:
> >>> On Fri, 7 Jun 2013, Will Deacon wrote:
> >>>> What's the earliest toolchain we claim to support nowadays? If that can't
> >>>> deal with the intrinsics then we either need to bump the requirement, or
> >>>> write this using hand-coded asm. In the case of the latter, I don't think
> >>>> the maintenance overhead of having two implementations is worth it.
> >>>
> >>> We have many different minimum toolchain version requirements attached
> >>> to different features being enabled already, ftrace being one of them if
> >>> I remember correctly. For these Neon optimizations the minimum gcc
> >>> version is v4.6.
> >>>
> >>> Given that this is going to be interesting mostly to server systems, and
> >>> given that ARM server deployments are rather new, I don't see the point
> >>> of compiling a new server environment using an older gcc version.
> >>
> >> I've mulled over this, had some discussions with our toolchain guys and
> >> have concluded the following:
> >>
> >> - The intrinsics are actually ok. I was sceptical at first, but I've been
> >> assured that they should do a reasonable job (echoing your performance
> >> figures).
> >>
> >> - The current approach is targetting servers and isn't (yet) suitable for
> >> mobile.
> >>
> >> So, given that the patches do the right thing wrt GCC version, the only
> >> remaining point is that we need to keep an eye out for people trying to
> >> re-use this stuff for mobile (likely crypto, as I mentioned earlier). When
> >> that happens, we should consider revisiting the benchmark/power figures.
> >>
> >
> > OK, so a number of points have been raised in this discussion, let me
> > address them one by one:
> >
> > Should we allow NEON to be used in the kernel?
> >
> > The consensus is not to allow floating point. However, NEON is
> > different, as the performance gains are considerable and there is no
> > dependency on support code, which makes it not as hairy as
> > conventional (pre-v3) VFP. Also, managing the vfpstates is easily
> > doable if NEON is only used outside interrupt context and with
> > preemption disabled.
> >
> >
> > Does my series implement it correctly?
> >
> > I have addressed Russell's first round of comments. Happy to take
> > another round if necessary.
> >
> >
> > Should we allow NEON intrinsics in the kernel?
> > Should we allow GCC-generated NEON in the kernel?
> >
> > Only if the implementation is clear on which minimum version of GCC it
> > requires. We could use my examples to set a precedent on what is a
> > suitable way to use NEON intrinsics or the vectorizer in kernel code
> > (which includes coding it such that it can be reused for arm64 with no
> > modifications)
> >
> >
> > Is kernel mode NEON suitable for mobile?
> >
> > To me, it is unclear why kernel and userland are so different in this
> > respect. However, kernel mode NEON is separately configurable from
> > Kconfig so it can be disabled at will.
> >
> >
> > Is there a point to doing a boot time benchmark to select the optimal
> > implementation of an algorithm?
> >
> > Perhaps not but unrelated to kernel mode NEON.
Significant benchmarks on the boot path would be unacceptable, unless they
are *fast* (and by fast, I mean fast on all platforms, not just fast on
the fast platforms). If one second gets added onto the boot path for each
optimised algorithm, that sounds like a fail. If all the benchmarks
combined take one second in total, that's no quite as bad.
Maybe benchmarks could be time-bounded (i.e., see how much data we can
chug though in X milliseconds) instead of size-bounded. This would avoid
unreasonable slowdown on slow platforms, while avoiding trivially small
benchmark payloads on faster platforms which may typically have a more
complex architecture, bigger caches etc. which would cause them to take
longer to reach saturated performance when running a particular algorithm.
Cheers
---Dave
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-25 13:56 ` Dave Martin
@ 2013-06-25 14:14 ` Ard Biesheuvel
2013-06-25 14:29 ` Dave Martin
0 siblings, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-25 14:14 UTC (permalink / raw)
To: linux-arm-kernel
On 25 June 2013 15:56, Dave Martin <dave.martin@linaro.org> wrote:
> Significant benchmarks on the boot path would be unacceptable, unless they
> are *fast* (and by fast, I mean fast on all platforms, not just fast on
> the fast platforms). If one second gets added onto the boot path for each
> optimised algorithm, that sounds like a fail. If all the benchmarks
> combined take one second in total, that's no quite as bad.
>
> Maybe benchmarks could be time-bounded (i.e., see how much data we can
> chug though in X milliseconds) instead of size-bounded. This would avoid
> unreasonable slowdown on slow platforms, while avoiding trivially small
> benchmark payloads on faster platforms which may typically have a more
> complex architecture, bigger caches etc. which would cause them to take
> longer to reach saturated performance when running a particular algorithm.
>
Benchmarks are already time bounded, at least the instances I am aware
of (xor and raid6) are. They each measure, for each available
implementation, the amount of work performed during a fixed time. For
RAID6, this is 16 jiffies, for XOR it's only 1 jiffy but each test is
repeated 5 times.
So I think this should not be a problem, especially as it is unlikely
that newly added implementations (such as NEON) will be able to
execute on older/slower platforms in the first place.
Regards,
Ard.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
2013-06-25 14:14 ` Ard Biesheuvel
@ 2013-06-25 14:29 ` Dave Martin
0 siblings, 0 replies; 24+ messages in thread
From: Dave Martin @ 2013-06-25 14:29 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Jun 25, 2013 at 04:14:13PM +0200, Ard Biesheuvel wrote:
> On 25 June 2013 15:56, Dave Martin <dave.martin@linaro.org> wrote:
> > Significant benchmarks on the boot path would be unacceptable, unless they
> > are *fast* (and by fast, I mean fast on all platforms, not just fast on
> > the fast platforms). If one second gets added onto the boot path for each
> > optimised algorithm, that sounds like a fail. If all the benchmarks
> > combined take one second in total, that's no quite as bad.
> >
> > Maybe benchmarks could be time-bounded (i.e., see how much data we can
> > chug though in X milliseconds) instead of size-bounded. This would avoid
> > unreasonable slowdown on slow platforms, while avoiding trivially small
> > benchmark payloads on faster platforms which may typically have a more
> > complex architecture, bigger caches etc. which would cause them to take
> > longer to reach saturated performance when running a particular algorithm.
> >
>
> Benchmarks are already time bounded, at least the instances I am aware
> of (xor and raid6) are. They each measure, for each available
> implementation, the amount of work performed during a fixed time. For
> RAID6, this is 16 jiffies, for XOR it's only 1 jiffy but each test is
> repeated 5 times.
>
> So I think this should not be a problem, especially as it is unlikely
> that newly added implementations (such as NEON) will be able to
> execute on older/slower platforms in the first place.
The tree I was originally looking at might be out of date ... apologies
for the trolling.
If the XOR benchmark really only takes 50 ms per implementation, I guess
that shouldn't be too bad.
Cheers
---Dave
^ permalink raw reply [flat|nested] 24+ messages in thread