linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
@ 2013-06-06 15:03 Ard Biesheuvel
  2013-06-06 15:03 ` [PATCH 1/5] ARM: add support for kernel mode NEON Ard Biesheuvel
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
  To: linux-arm-kernel

Hi all,

This is a partial repost of the patches I proposed a couple of weeks ago to add
support for VFP/NEON in kernel mode.

This time, I have included two use cases that I have been using, XOR and RAID-6
checksumming. The former gets a 60% performance boost on the NEON, the latter
over 400%.


ARM: add support for kernel mode NEON

Adds kernel_neon_begin/end (renamed from kernel_vfp_begin/end in the previous
version to de-emphasize the VFP part as VFP code that needs software assistance
is not supported currently)
Introduces <asm/neon.h> and the Kconfig symbol KERNEL_MODE_NEON. This has been
aligned with Catalin for arm64, so any NEON code that does not use assembly but
intrinsics or the GCC vectorizer (such as my examples) can potentially be shared
between arm and arm64 archs.


ARM: move VFP init to an earlier boot stage

This is needed so the NEON is enabled when the XOR and RAID-6 algo boot time
benchmarks are run.


ARM: be strict about FP exceptions in kernel mode

This adds a check to vfp_support_entry() to flag unsupported uses of the
NEON/VFP in kernel mode. FP exceptions (bounces) are flagged as a BUG(), this is
because of their potentially intermittent nature. Exceptions caused by the fact
that kernel_neon_begin has not been called are just routed through the undef
handler.


ARM: crypto: add NEON accelerated XOR implementation

This is the xor_blocks() implementation built with -ftree-vectorize, 60% faster
than optimized ARM code. It calls in_interrupt() to check whether the NEON
flavor can be used: this should really not be necessary, but due to xor_blocks's
quite generic nature, there is no telling how exactly people may be using it in
the real world.


lib/raid6: add ARM-NEON accelerated syndrome calculation

This is a port of the RAID-6 checksumming code in altivec.uc ported to use NEON
intrinsics. It is about 4x faster than the sequential code. As this code does
not live under arch/arm, I will send this patch separately to the appropriate 
list if/when the prerequisite patches from this series have been accepted.

-- 
Ard.


 arch/arm/Kconfig            |  7 ++++
 arch/arm/include/asm/neon.h | 36 ++++++++++++++++++++
 arch/arm/include/asm/xor.h  | 73 +++++++++++++++++++++++++++++++++++++++++
 arch/arm/lib/Makefile       |  6 ++++
 arch/arm/lib/xor-neon.c     | 42 ++++++++++++++++++++++++
 arch/arm/vfp/vfphw.S        |  5 +++
 arch/arm/vfp/vfpmodule.c    | 56 ++++++++++++++++++++++++++++++-
 include/linux/raid/pq.h     |  5 +++
 lib/raid6/.gitignore        |  1 +
 lib/raid6/Makefile          | 31 ++++++++++++++++++
 lib/raid6/algos.c           |  6 ++++
 lib/raid6/neon.c            | 58 ++++++++++++++++++++++++++++++++
 lib/raid6/neon.uc           | 80 +++++++++++++++++++++++++++++++++++++++++++++
 lib/raid6/test/Makefile     | 19 ++++++++++-
 14 files changed, 423 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm/include/asm/neon.h
 create mode 100644 arch/arm/lib/xor-neon.c
 create mode 100644 lib/raid6/neon.c
 create mode 100644 lib/raid6/neon.uc

-- 
1.8.1.2

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] ARM: add support for kernel mode NEON
  2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
  2013-06-06 15:03 ` [PATCH 2/5] ARM: move VFP init to an earlier boot stage Ard Biesheuvel
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
  To: linux-arm-kernel

In order to safely support the use of NEON instructions in
kernel mode, some precautions need to be taken:
- the userland context that may be present in the registers (even
  if the NEON/VFP is currently disabled) must be stored under the
  correct task (which may not be 'current' in the UP case),
- to avoid having to keep track of additional vfpstates for the
  kernel side, disallow the use of NEON in interrupt context
  and run with preemption disabled,
- after use, re-enable preemption and re-enable the lazy restore
  machinery by disabling the NEON/VFP unit.

This patch adds the functions kernel_neon_begin() and
kernel_neon_end() which take care of the above. It also adds
the Kconfig symbol KERNEL_MODE_NEON to enable it.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/Kconfig            |  7 +++++++
 arch/arm/include/asm/neon.h | 36 ++++++++++++++++++++++++++++++++++++
 arch/arm/vfp/vfpmodule.c    | 43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 86 insertions(+)
 create mode 100644 arch/arm/include/asm/neon.h

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ef30276..9e61402 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -2355,6 +2355,13 @@ config NEON
 	  Say Y to include support code for NEON, the ARMv7 Advanced SIMD
 	  Extension.
 
+config KERNEL_MODE_NEON
+	bool "Support for NEON in kernel mode"
+	default n
+	depends on NEON
+	help
+	  Say Y to include support for NEON in kernel mode.
+
 endmenu
 
 menu "Userspace binary formats"
diff --git a/arch/arm/include/asm/neon.h b/arch/arm/include/asm/neon.h
new file mode 100644
index 0000000..8f730fe
--- /dev/null
+++ b/arch/arm/include/asm/neon.h
@@ -0,0 +1,36 @@
+/*
+ * linux/arch/arm/include/asm/neon.h
+ *
+ * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <asm/hwcap.h>
+
+#define cpu_has_neon()		(!!(elf_hwcap & HWCAP_NEON))
+
+#ifdef __ARM_NEON__
+
+/*
+ * If you are affected by the BUILD_BUG below, it probably means that you are
+ * using NEON code /and/ calling the kernel_neon_begin() function from the same
+ * compilation unit. To prevent issues that may arise from GCC reordering or
+ * generating(1) NEON instructions outside of these begin/end functions, the
+ * only supported way of using NEON code in the kernel is by isolating it in a
+ * separate compilation unit, and calling it from another unit from inside a
+ * kernel_neon_begin/kernel_neon_end pair.
+ *
+ * (1) Current GCC (4.7) might generate NEON instructions at O3 level if
+ *     -mpfu=neon is set.
+ */
+
+#define kernel_neon_begin() \
+	BUILD_BUG_ON_MSG(1, "kernel_neon_begin() called from NEON code")
+
+#else
+void kernel_neon_begin(void);
+#endif
+void kernel_neon_end(void);
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 5dfbb0b..1cdc13b 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <linux/uaccess.h>
 #include <linux/user.h>
+#include <linux/export.h>
 
 #include <asm/cp15.h>
 #include <asm/cputype.h>
@@ -648,6 +649,48 @@ static int vfp_hotplug(struct notifier_block *b, unsigned long action,
 	return NOTIFY_OK;
 }
 
+#ifdef CONFIG_KERNEL_MODE_NEON
+
+/*
+ * Kernel-side NEON support functions
+ */
+void kernel_neon_begin(void)
+{
+	struct thread_info *thread = current_thread_info();
+	unsigned int cpu;
+	u32 fpexc;
+
+	/* Avoid using the NEON in interrupt context */
+	might_sleep();
+	cpu = get_cpu();
+
+	fpexc = fmrx(FPEXC) | FPEXC_EN;
+	fmxr(FPEXC, fpexc);
+
+	/*
+	 * Save the userland NEON/VFP state. Under UP,
+	 * the owner could be a task other than 'current'
+	 */
+	if (vfp_state_in_hw(cpu, thread))
+		vfp_save_state(&thread->vfpstate, fpexc);
+#ifndef CONFIG_SMP
+	else if (vfp_current_hw_state[cpu] != NULL)
+		vfp_save_state(vfp_current_hw_state[cpu], fpexc);
+#endif
+	vfp_current_hw_state[cpu] = NULL;
+}
+EXPORT_SYMBOL(kernel_neon_begin);
+
+void kernel_neon_end(void)
+{
+	/* Disable the NEON/VFP unit. */
+	fmxr(FPEXC, fmrx(FPEXC) & ~FPEXC_EN);
+	put_cpu();
+}
+EXPORT_SYMBOL(kernel_neon_end);
+
+#endif /* CONFIG_KERNEL_MODE_NEON */
+
 /*
  * VFP support code initialisation.
  */
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/5] ARM: move VFP init to an earlier boot stage
  2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
  2013-06-06 15:03 ` [PATCH 1/5] ARM: add support for kernel mode NEON Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
  2013-06-06 15:03 ` [PATCH 3/5] ARM: be strict about FP exceptions in kernel mode Ard Biesheuvel
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
  To: linux-arm-kernel

In order to use the NEON unit in the kernel, we should
initialize it a bit earlier in the boot process so NEON users
that like to do a quick benchmark at load time (like the
xor_blocks or RAID-6 code) find the NEON/VFP unit already
enabled.

Replaced late_initcall() with core_initcall().

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/vfp/vfpmodule.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 1cdc13b..4c39f91 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -774,4 +774,4 @@ static int __init vfp_init(void)
 	return 0;
 }
 
-late_initcall(vfp_init);
+core_initcall(vfp_init);
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/5] ARM: be strict about FP exceptions in kernel mode
  2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
  2013-06-06 15:03 ` [PATCH 1/5] ARM: add support for kernel mode NEON Ard Biesheuvel
  2013-06-06 15:03 ` [PATCH 2/5] ARM: move VFP init to an earlier boot stage Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
  2013-06-06 15:03 ` [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation Ard Biesheuvel
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
  To: linux-arm-kernel

The support code in vfp_support_entry does not care whether the
exception that caused it to be invoked occurred in kernel mode or
in user mode. However, neither condition that could trigger this
exception (lazy restore and VFP bounce to support code) is
currently allowable in kernel mode.

In the former case, we can just handle it as an undefined instruction.
In the latter case, we should flag it as a bug, as it implies that
the FP unit has been enabled and an attempt has been made to
execute FP instructions that are dependent on the support code, and
this is not supported in kernel mode.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
---
 arch/arm/vfp/vfphw.S     |  5 +++++
 arch/arm/vfp/vfpmodule.c | 11 +++++++++++
 2 files changed, 16 insertions(+)

diff --git a/arch/arm/vfp/vfphw.S b/arch/arm/vfp/vfphw.S
index 8d10dc8..3e5d311 100644
--- a/arch/arm/vfp/vfphw.S
+++ b/arch/arm/vfp/vfphw.S
@@ -78,6 +78,11 @@
 ENTRY(vfp_support_entry)
 	DBGSTR3	"instr %08x pc %08x state %p", r0, r2, r10
 
+	ldr	r3, [sp, #S_PSR]	@ Neither lazy restore nor FP exceptions
+	and	r3, r3, #MODE_MASK	@ are supported in kernel mode
+	teq	r3, #USR_MODE
+	bne	vfp_kmode_exception	@ Returns through lr
+
 	VFPFMRX	r1, FPEXC		@ Is the VFP enabled?
 	DBGSTR1	"fpexc %08x", r1
 	tst	r1, #FPEXC_EN
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 4c39f91..bd2f7a2 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -27,6 +27,7 @@
 #include <asm/system_info.h>
 #include <asm/thread_notify.h>
 #include <asm/vfp.h>
+#include <asm/bug.h>
 
 #include "vfpinstr.h"
 #include "vfp.h"
@@ -691,6 +692,16 @@ EXPORT_SYMBOL(kernel_neon_end);
 
 #endif /* CONFIG_KERNEL_MODE_NEON */
 
+void vfp_kmode_exception(void)
+{
+	/*
+	 * Taking an FP exception in kernel mode is always a bug, because
+	 * none of the FP instructions currently supported in kernel mode
+	 * (i.e., NEON) should ever be bounced back to the support code.
+	 */
+	BUG_ON(fmrx(FPEXC) & FPEXC_EN);
+}
+
 /*
  * VFP support code initialisation.
  */
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation
  2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2013-06-06 15:03 ` [PATCH 3/5] ARM: be strict about FP exceptions in kernel mode Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
  2013-06-06 15:45   ` Nicolas Pitre
  2013-06-06 15:03 ` [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation Ard Biesheuvel
  2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
  5 siblings, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
  To: linux-arm-kernel

Add a source file xor-neon.c (which is really just the reference
C implementation passed through the GCC vectorizer) and hook it
up to the XOR framework.

Output captured from a Cortex-A15 @ 1.7 GHz:

xor: measuring software checksum speed
   arm4regs  :  2261.600 MB/sec
   8regs     :  1771.600 MB/sec
   32regs    :  1441.600 MB/sec
   neon      :  3619.600 MB/sec
xor: using function: neon (3619.600 MB/sec)

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm/include/asm/xor.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++
 arch/arm/lib/Makefile      |  6 ++++
 arch/arm/lib/xor-neon.c    | 42 ++++++++++++++++++++++++++
 3 files changed, 121 insertions(+)
 create mode 100644 arch/arm/lib/xor-neon.c

diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
index 7604673..4ffb26d 100644
--- a/arch/arm/include/asm/xor.h
+++ b/arch/arm/include/asm/xor.h
@@ -7,7 +7,10 @@
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  */
+#include <linux/hardirq.h>
 #include <asm-generic/xor.h>
+#include <asm/hwcap.h>
+#include <asm/neon.h>
 
 #define __XOR(a1, a2) a1 ^= a2
 
@@ -138,4 +141,74 @@ static struct xor_block_template xor_block_arm4regs = {
 		xor_speed(&xor_block_arm4regs);	\
 		xor_speed(&xor_block_8regs);	\
 		xor_speed(&xor_block_32regs);	\
+		NEON_TEMPLATES;			\
 	} while (0)
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+
+extern struct xor_block_template const xor_block_neon_inner;
+
+static void
+xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
+{
+	if (in_interrupt()) {
+		xor_arm4regs_2(bytes, p1, p2);
+	} else {
+		kernel_neon_begin();
+		xor_block_neon_inner.do_2(bytes, p1, p2);
+		kernel_neon_end();
+	}
+}
+
+static void
+xor_neon_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+		unsigned long *p3)
+{
+	if (in_interrupt()) {
+		xor_arm4regs_3(bytes, p1, p2, p3);
+	} else {
+		kernel_neon_begin();
+		xor_block_neon_inner.do_3(bytes, p1, p2, p3);
+		kernel_neon_end();
+	}
+}
+
+static void
+xor_neon_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+		unsigned long *p3, unsigned long *p4)
+{
+	if (in_interrupt()) {
+		xor_arm4regs_4(bytes, p1, p2, p3, p4);
+	} else {
+		kernel_neon_begin();
+		xor_block_neon_inner.do_4(bytes, p1, p2, p3, p4);
+		kernel_neon_end();
+	}
+}
+
+static void
+xor_neon_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
+		unsigned long *p3, unsigned long *p4, unsigned long *p5)
+{
+	if (in_interrupt()) {
+		xor_arm4regs_5(bytes, p1, p2, p3, p4, p5);
+	} else {
+		kernel_neon_begin();
+		xor_block_neon_inner.do_5(bytes, p1, p2, p3, p4, p5);
+		kernel_neon_end();
+	}
+}
+
+static struct xor_block_template xor_block_neon = {
+	.name	= "neon",
+	.do_2	= xor_neon_2,
+	.do_3	= xor_neon_3,
+	.do_4	= xor_neon_4,
+	.do_5	= xor_neon_5
+};
+
+#define NEON_TEMPLATES	\
+	do { if (cpu_has_neon()) xor_speed(&xor_block_neon); } while (0)
+#else
+#define NEON_TEMPLATES
+#endif
diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
index af72969..aaf3a87 100644
--- a/arch/arm/lib/Makefile
+++ b/arch/arm/lib/Makefile
@@ -45,3 +45,9 @@ lib-$(CONFIG_ARCH_SHARK)	+= io-shark.o
 
 $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
 $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
+
+ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
+  NEON_FLAGS			:= -mfloat-abi=softfp -mfpu=neon
+  CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
+  lib-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
+endif
diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
new file mode 100644
index 0000000..f485e5a
--- /dev/null
+++ b/arch/arm/lib/xor-neon.c
@@ -0,0 +1,42 @@
+/*
+ * linux/arch/arm/lib/xor-neon.c
+ *
+ * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/raid/xor.h>
+
+#ifndef __ARM_NEON__
+#error You should compile this file with '-mfloat-abi=softfp -mfpu=neon'
+#endif
+
+/*
+ * Pull in the reference implementations while instructing GCC (through
+ * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
+ * NEON instructions.
+ */
+#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
+#pragma GCC optimize "tree-vectorize"
+#else
+/*
+ * While older versions of GCC do not generate incorrect code, they fail to
+ * recognize the parallel nature of these functions, and emit plain ARM code,
+ * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
+ */
+#warning This code requires at least version 4.6 of GCC
+#endif
+
+#pragma GCC diagnostic ignored "-Wunused-variable"
+#include <asm-generic/xor.h>
+
+struct xor_block_template const xor_block_neon_inner = {
+	.name	= "__inner_neon__",
+	.do_2	= xor_8regs_2,
+	.do_3	= xor_8regs_3,
+	.do_4	= xor_8regs_4,
+	.do_5	= xor_8regs_5,
+};
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation
  2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2013-06-06 15:03 ` [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation Ard Biesheuvel
@ 2013-06-06 15:03 ` Ard Biesheuvel
  2013-06-06 15:55   ` Nicolas Pitre
  2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
  5 siblings, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:03 UTC (permalink / raw)
  To: linux-arm-kernel

Rebased/reworked a patch contributed by Rob Herring that uses
NEON intrinsics to perform the RAID-6 syndrome calculations.
It uses the existing unroll.awk code to generate several
unrolled versions of which the best performing one is selected
at boot time.

Output captured from an ARM Cortex-A15 @ 1.7 GHz:

raid6: int32x1    200 MB/s
raid6: int32x2    304 MB/s
raid6: int32x4    388 MB/s
raid6: int32x8    423 MB/s
raid6: neonx1     799 MB/s
raid6: neonx2    1364 MB/s
raid6: neonx4    1731 MB/s
raid6: neonx8    1676 MB/s
raid6: using algorithm neonx4 (1731 MB/s)
raid6: using intx1 recovery algorithm

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 include/linux/raid/pq.h |  5 ++++
 lib/raid6/.gitignore    |  1 +
 lib/raid6/Makefile      | 31 +++++++++++++++++++
 lib/raid6/algos.c       |  6 ++++
 lib/raid6/neon.c        | 58 +++++++++++++++++++++++++++++++++++
 lib/raid6/neon.uc       | 80 +++++++++++++++++++++++++++++++++++++++++++++++++
 lib/raid6/test/Makefile | 19 +++++++++++-
 7 files changed, 199 insertions(+), 1 deletion(-)
 create mode 100644 lib/raid6/neon.c
 create mode 100644 lib/raid6/neon.uc

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 8dfaa2c..0f42469 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -114,6 +114,11 @@ extern const struct raid6_recov_calls raid6_recov_intx1;
 extern const struct raid6_recov_calls raid6_recov_ssse3;
 extern const struct raid6_recov_calls raid6_recov_avx2;
 
+extern const struct raid6_calls raid6_neonx1;
+extern const struct raid6_calls raid6_neonx2;
+extern const struct raid6_calls raid6_neonx4;
+extern const struct raid6_calls raid6_neonx8;
+
 /* Algorithm list */
 extern const struct raid6_calls * const raid6_algos[];
 extern const struct raid6_recov_calls *const raid6_recov_algos[];
diff --git a/lib/raid6/.gitignore b/lib/raid6/.gitignore
index 162beca..0a7e494 100644
--- a/lib/raid6/.gitignore
+++ b/lib/raid6/.gitignore
@@ -2,3 +2,4 @@ mktables
 altivec*.c
 int*.c
 tables.c
+neon?.c
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 9f7c184..6a51f1a 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -5,6 +5,7 @@ raid6_pq-y	+= algos.o recov.o tables.o int1.o int2.o int4.o \
 
 raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
 raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
+raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
 
 hostprogs-y	+= mktables
 
@@ -16,6 +17,12 @@ ifeq ($(CONFIG_ALTIVEC),y)
 altivec_flags := -maltivec -mabi=altivec
 endif
 
+# The GCC option -ffreestanding is required in order to compile code containing
+# ARM/NEON intrinsics in a non C99-compliant environment (such as the kernel)
+ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
+NEON_FLAGS := -mfloat-abi=softfp -mfpu=neon -ffreestanding
+endif
+
 targets += int1.c
 $(obj)/int1.c:   UNROLL := 1
 $(obj)/int1.c:   $(src)/int.uc $(src)/unroll.awk FORCE
@@ -70,6 +77,30 @@ $(obj)/altivec8.c:   UNROLL := 8
 $(obj)/altivec8.c:   $(src)/altivec.uc $(src)/unroll.awk FORCE
 	$(call if_changed,unroll)
 
+CFLAGS_neon1.o += $(NEON_FLAGS)
+targets += neon1.c
+$(obj)/neon1.c:   UNROLL := 1
+$(obj)/neon1.c:   $(src)/neon.uc $(src)/unroll.awk FORCE
+	$(call if_changed,unroll)
+
+CFLAGS_neon2.o += $(NEON_FLAGS)
+targets += neon2.c
+$(obj)/neon2.c:   UNROLL := 2
+$(obj)/neon2.c:   $(src)/neon.uc $(src)/unroll.awk FORCE
+	$(call if_changed,unroll)
+
+CFLAGS_neon4.o += $(NEON_FLAGS)
+targets += neon4.c
+$(obj)/neon4.c:   UNROLL := 4
+$(obj)/neon4.c:   $(src)/neon.uc $(src)/unroll.awk FORCE
+	$(call if_changed,unroll)
+
+CFLAGS_neon8.o += $(NEON_FLAGS)
+targets += neon8.c
+$(obj)/neon8.c:   UNROLL := 8
+$(obj)/neon8.c:   $(src)/neon.uc $(src)/unroll.awk FORCE
+	$(call if_changed,unroll)
+
 quiet_cmd_mktable = TABLE   $@
       cmd_mktable = $(obj)/mktables > $@ || ( rm -f $@ && exit 1 )
 
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 6d7316f..74e6f56 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -70,6 +70,12 @@ const struct raid6_calls * const raid6_algos[] = {
 	&raid6_intx2,
 	&raid6_intx4,
 	&raid6_intx8,
+#ifdef CONFIG_KERNEL_MODE_NEON
+	&raid6_neonx1,
+	&raid6_neonx2,
+	&raid6_neonx4,
+	&raid6_neonx8,
+#endif
 	NULL
 };
 
diff --git a/lib/raid6/neon.c b/lib/raid6/neon.c
new file mode 100644
index 0000000..dad7102
--- /dev/null
+++ b/lib/raid6/neon.c
@@ -0,0 +1,58 @@
+/*
+ * linux/lib/raid6/neon.c - RAID6 syndrome calculation using ARM NEON intrinsics
+ *
+ * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/raid/pq.h>
+
+#ifdef __KERNEL__
+#include <asm/neon.h>
+#else
+#define kernel_neon_begin()
+#define kernel_neon_end()
+#define cpu_has_neon()		(1)
+#endif
+
+/*
+ * There are 2 reasons these wrappers are kept in a separate compilation unit
+ * from the actual implementations in neonN.c (generated from neon.uc by
+ * unroll.awk):
+ * - the actual implementations use NEON intrinsics, and the GCC support header
+ *   (arm_neon.h) is not fully compatible (type wise) with the kernel;
+ * - the neonN.c files are compiled with -mfpu=neon and optimization enabled,
+ *   and we have to make sure that we never use *any* NEON/VFP instructions
+ *   outside a kernel_neon_begin()/kernel_neon_end() pair.
+ */
+
+#define RAID6_NEON_WRAPPER(_n)						\
+	static void raid6_neon ## _n ## _gen_syndrome(int disks,	\
+					size_t bytes, void **ptrs)	\
+	{								\
+		void raid6_neon ## _n  ## _gen_syndrome_real(int,	\
+						unsigned int, void**);	\
+		kernel_neon_begin();					\
+		raid6_neon ## _n ## _gen_syndrome_real(disks,		\
+					(unsigned int)bytes, ptrs);	\
+		kernel_neon_end();					\
+	}								\
+	struct raid6_calls const raid6_neonx ## _n = {			\
+		raid6_neon ## _n ## _gen_syndrome,			\
+		raid6_have_neon,					\
+		"neonx" #_n,						\
+		0							\
+	};
+
+static int raid6_have_neon(void)
+{
+	return cpu_has_neon();
+}
+
+RAID6_NEON_WRAPPER(1)
+RAID6_NEON_WRAPPER(2)
+RAID6_NEON_WRAPPER(4)
+RAID6_NEON_WRAPPER(8)
diff --git a/lib/raid6/neon.uc b/lib/raid6/neon.uc
new file mode 100644
index 0000000..f2d7ec0
--- /dev/null
+++ b/lib/raid6/neon.uc
@@ -0,0 +1,80 @@
+/* -----------------------------------------------------------------------
+ *
+ *   neon.uc - RAID-6 syndrome calculation using ARM NEON instructions
+ *
+ *   Copyright (C) 2012 Rob Herring
+ *
+ *   Based on altivec.uc:
+ *     Copyright 2002-2004 H. Peter Anvin - All Rights Reserved
+ *
+ *   This program is free software; you can redistribute it and/or modify
+ *   it under the terms of the GNU General Public License as published by
+ *   the Free Software Foundation, Inc., 53 Temple Place Ste 330,
+ *   Boston MA 02111-1307, USA; either version 2 of the License, or
+ *   (at your option) any later version; incorporated herein by reference.
+ *
+ * ----------------------------------------------------------------------- */
+
+/*
+ * neon$#.c
+ *
+ * $#-way unrolled NEON intrinsics math RAID-6 instruction set
+ *
+ * This file is postprocessed using unroll.awk
+ */
+
+#include <arm_neon.h>
+
+typedef uint8x16_t unative_t;
+
+#define NBYTES(x) ((unative_t){x,x,x,x, x,x,x,x, x,x,x,x, x,x,x,x})
+#define NSIZE	sizeof(unative_t)
+
+/*
+ * The SHLBYTE() operation shifts each byte left by 1, *not*
+ * rolling over into the next byte
+ */
+static inline unative_t SHLBYTE(unative_t v)
+{
+	return vshlq_n_u8(v, 1);
+}
+
+/*
+ * The MASK() operation returns 0xFF in any byte for which the high
+ * bit is 1, 0x00 for any byte for which the high bit is 0.
+ */
+static inline unative_t MASK(unative_t v)
+{
+	const uint8x16_t temp = NBYTES(0);
+	return (unative_t)vcltq_s8((int8x16_t)v, (int8x16_t)temp);
+}
+
+void raid6_neon$#_gen_syndrome_real(int disks, unsigned int bytes, void **ptrs)
+{
+	uint8_t **dptr = (uint8_t **)ptrs;
+	uint8_t *p, *q;
+	int d, z, z0;
+
+	register unative_t wd$$, wq$$, wp$$, w1$$, w2$$;
+	const unative_t x1d = NBYTES(0x1d);
+
+	z0 = disks - 3;		/* Highest data disk */
+	p = dptr[z0+1];		/* XOR parity */
+	q = dptr[z0+2];		/* RS syndrome */
+
+	for ( d = 0 ; d < bytes ; d += NSIZE*$# ) {
+		wq$$ = wp$$ = vld1q_u8(&dptr[z0][d+$$*NSIZE]);
+		for ( z = z0-1 ; z >= 0 ; z-- ) {
+			wd$$ = vld1q_u8(&dptr[z][d+$$*NSIZE]);
+			wp$$ = veorq_u8(wp$$, wd$$);
+			w2$$ = MASK(wq$$);
+			w1$$ = SHLBYTE(wq$$);
+
+			w2$$ = vandq_u8(w2$$, x1d);
+			w1$$ = veorq_u8(w1$$, w2$$);
+			wq$$ = veorq_u8(w1$$, wd$$);
+		}
+		vst1q_u8(&p[d+NSIZE*$$], wp$$);
+		vst1q_u8(&q[d+NSIZE*$$], wq$$);
+	}
+}
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 087332d..bbe6450 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -34,6 +34,11 @@ else
         ifeq ($(HAS_ALTIVEC),yes)
                 OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
         endif
+        ifeq ($(ARCH),arm)
+                CFLAGS += -I../../../arch/arm/include -mfpu=neon \
+                          -DCONFIG_KERNEL_MODE_NEON=1
+                OBJS += neon.o neon1.o neon2.o neon4.o neon8.o
+        endif
 endif
 
 .c.o:
@@ -55,6 +60,18 @@ raid6.a: $(OBJS)
 raid6test: test.c raid6.a
 	$(CC) $(CFLAGS) -o raid6test $^
 
+neon1.c: neon.uc ../unroll.awk
+	$(AWK) ../unroll.awk -vN=1 < neon.uc > $@
+
+neon2.c: neon.uc ../unroll.awk
+	$(AWK) ../unroll.awk -vN=2 < neon.uc > $@
+
+neon4.c: neon.uc ../unroll.awk
+	$(AWK) ../unroll.awk -vN=4 < neon.uc > $@
+
+neon8.c: neon.uc ../unroll.awk
+	$(AWK) ../unroll.awk -vN=8 < neon.uc > $@
+
 altivec1.c: altivec.uc ../unroll.awk
 	$(AWK) ../unroll.awk -vN=1 < altivec.uc > $@
 
@@ -89,7 +106,7 @@ tables.c: mktables
 	./mktables > tables.c
 
 clean:
-	rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c tables.c raid6test
+	rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c neon*.c tables.c raid6test
 
 spotless: clean
 	rm -f *~
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2013-06-06 15:03 ` [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation Ard Biesheuvel
@ 2013-06-06 15:17 ` Will Deacon
  2013-06-06 15:52   ` Ard Biesheuvel
  2013-06-06 16:17   ` Nicolas Pitre
  5 siblings, 2 replies; 24+ messages in thread
From: Will Deacon @ 2013-06-06 15:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
> Hi all,

Hi Ard,

> This is a partial repost of the patches I proposed a couple of weeks ago to add
> support for VFP/NEON in kernel mode.
> 
> This time, I have included two use cases that I have been using, XOR and RAID-6
> checksumming. The former gets a 60% performance boost on the NEON, the latter
> over 400%.

Whilst that sounds impressive, can you achieve similar results across all
NEON-capable CPUs? In particular, we need to make sure this doesn't cause
performance regressions on some cores. Furthermore, do you have any power
figures to complement your findings? The increased context-switch overhead
is also worth measuring if you can (i.e. run some userspace NEON-based
benchmarks in parallel with NEON and non-NEON implementations of the
checksumming).

> lib/raid6: add ARM-NEON accelerated syndrome calculation
> 
> This is a port of the RAID-6 checksumming code in altivec.uc ported to use NEON
> intrinsics. It is about 4x faster than the sequential code. As this code does
> not live under arch/arm, I will send this patch separately to the appropriate 
> list if/when the prerequisite patches from this series have been accepted.

We support building the kernel with older toolchains, so I don't see the
benefit of using intrinsics here. Have you tried writing an implementation
with NEON instructions directly?

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation
  2013-06-06 15:03 ` [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation Ard Biesheuvel
@ 2013-06-06 15:45   ` Nicolas Pitre
  0 siblings, 0 replies; 24+ messages in thread
From: Nicolas Pitre @ 2013-06-06 15:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 6 Jun 2013, Ard Biesheuvel wrote:

> Add a source file xor-neon.c (which is really just the reference
> C implementation passed through the GCC vectorizer) and hook it
> up to the XOR framework.
> 
> Output captured from a Cortex-A15 @ 1.7 GHz:
> 
> xor: measuring software checksum speed
>    arm4regs  :  2261.600 MB/sec
>    8regs     :  1771.600 MB/sec
>    32regs    :  1441.600 MB/sec
>    neon      :  3619.600 MB/sec
> xor: using function: neon (3619.600 MB/sec)
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
>  arch/arm/include/asm/xor.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++
>  arch/arm/lib/Makefile      |  6 ++++
>  arch/arm/lib/xor-neon.c    | 42 ++++++++++++++++++++++++++
>  3 files changed, 121 insertions(+)
>  create mode 100644 arch/arm/lib/xor-neon.c
> 
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index 7604673..4ffb26d 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -7,7 +7,10 @@
>   * it under the terms of the GNU General Public License version 2 as
>   * published by the Free Software Foundation.
>   */
> +#include <linux/hardirq.h>
>  #include <asm-generic/xor.h>
> +#include <asm/hwcap.h>
> +#include <asm/neon.h>
>  
>  #define __XOR(a1, a2) a1 ^= a2
>  
> @@ -138,4 +141,74 @@ static struct xor_block_template xor_block_arm4regs = {
>  		xor_speed(&xor_block_arm4regs);	\
>  		xor_speed(&xor_block_8regs);	\
>  		xor_speed(&xor_block_32regs);	\
> +		NEON_TEMPLATES;			\
>  	} while (0)
> +
> +#ifdef CONFIG_KERNEL_MODE_NEON
> +
> +extern struct xor_block_template const xor_block_neon_inner;
> +
> +static void
> +xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
> +{
> +	if (in_interrupt()) {
> +		xor_arm4regs_2(bytes, p1, p2);
> +	} else {
> +		kernel_neon_begin();
> +		xor_block_neon_inner.do_2(bytes, p1, p2);
> +		kernel_neon_end();
> +	}
> +}
> +
> +static void
> +xor_neon_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
> +		unsigned long *p3)
> +{
> +	if (in_interrupt()) {
> +		xor_arm4regs_3(bytes, p1, p2, p3);
> +	} else {
> +		kernel_neon_begin();
> +		xor_block_neon_inner.do_3(bytes, p1, p2, p3);
> +		kernel_neon_end();
> +	}
> +}
> +
> +static void
> +xor_neon_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
> +		unsigned long *p3, unsigned long *p4)
> +{
> +	if (in_interrupt()) {
> +		xor_arm4regs_4(bytes, p1, p2, p3, p4);
> +	} else {
> +		kernel_neon_begin();
> +		xor_block_neon_inner.do_4(bytes, p1, p2, p3, p4);
> +		kernel_neon_end();
> +	}
> +}
> +
> +static void
> +xor_neon_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
> +		unsigned long *p3, unsigned long *p4, unsigned long *p5)
> +{
> +	if (in_interrupt()) {
> +		xor_arm4regs_5(bytes, p1, p2, p3, p4, p5);
> +	} else {
> +		kernel_neon_begin();
> +		xor_block_neon_inner.do_5(bytes, p1, p2, p3, p4, p5);
> +		kernel_neon_end();
> +	}
> +}
> +
> +static struct xor_block_template xor_block_neon = {
> +	.name	= "neon",
> +	.do_2	= xor_neon_2,
> +	.do_3	= xor_neon_3,
> +	.do_4	= xor_neon_4,
> +	.do_5	= xor_neon_5
> +};
> +
> +#define NEON_TEMPLATES	\
> +	do { if (cpu_has_neon()) xor_speed(&xor_block_neon); } while (0)
> +#else
> +#define NEON_TEMPLATES
> +#endif
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index af72969..aaf3a87 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -45,3 +45,9 @@ lib-$(CONFIG_ARCH_SHARK)	+= io-shark.o
>  
>  $(obj)/csumpartialcopy.o:	$(obj)/csumpartialcopygeneric.S
>  $(obj)/csumpartialcopyuser.o:	$(obj)/csumpartialcopygeneric.S
> +
> +ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> +  NEON_FLAGS			:= -mfloat-abi=softfp -mfpu=neon
> +  CFLAGS_xor-neon.o		+= $(NEON_FLAGS)
> +  lib-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
> +endif
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> new file mode 100644
> index 0000000..f485e5a
> --- /dev/null
> +++ b/arch/arm/lib/xor-neon.c
> @@ -0,0 +1,42 @@
> +/*
> + * linux/arch/arm/lib/xor-neon.c
> + *
> + * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/raid/xor.h>
> +
> +#ifndef __ARM_NEON__
> +#error You should compile this file with '-mfloat-abi=softfp -mfpu=neon'
> +#endif
> +
> +/*
> + * Pull in the reference implementations while instructing GCC (through
> + * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> + * NEON instructions.
> + */
> +#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> +#pragma GCC optimize "tree-vectorize"
> +#else
> +/*
> + * While older versions of GCC do not generate incorrect code, they fail to
> + * recognize the parallel nature of these functions, and emit plain ARM code,
> + * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> + */
> +#warning This code requires at least version 4.6 of GCC
> +#endif
> +
> +#pragma GCC diagnostic ignored "-Wunused-variable"
> +#include <asm-generic/xor.h>
> +
> +struct xor_block_template const xor_block_neon_inner = {
> +	.name	= "__inner_neon__",
> +	.do_2	= xor_8regs_2,
> +	.do_3	= xor_8regs_3,
> +	.do_4	= xor_8regs_4,
> +	.do_5	= xor_8regs_5,
> +};
> -- 
> 1.8.1.2
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
@ 2013-06-06 15:52   ` Ard Biesheuvel
  2013-06-06 16:17   ` Nicolas Pitre
  1 sibling, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-06 15:52 UTC (permalink / raw)
  To: linux-arm-kernel

On 6 June 2013 17:17, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
>> This time, I have included two use cases that I have been using, XOR and RAID-6
>> checksumming. The former gets a 60% performance boost on the NEON, the latter
>> over 400%.
>
> Whilst that sounds impressive, can you achieve similar results across all
> NEON-capable CPUs? In particular, we need to make sure this doesn't cause
> performance regressions on some cores. Furthermore, do you have any power

I don't expect A8 or A9 to be on par. However, the two examples I have
included perform a quick benchmark at boot to decide which one to
pick, so unless the benchmark is a very poor predictor of the
performance at run time, we should be ok here.

> figures to complement your findings? The increased context-switch overhead
> is also worth measuring if you can (i.e. run some userspace NEON-based
> benchmarks in parallel with NEON and non-NEON implementations of the
> checksumming).
>

Good point. I will follow up on that later.

>> lib/raid6: add ARM-NEON accelerated syndrome calculation
>>
>> This is a port of the RAID-6 checksumming code in altivec.uc ported to use NEON
>> intrinsics. It is about 4x faster than the sequential code. As this code does
>> not live under arch/arm, I will send this patch separately to the appropriate
>> list if/when the prerequisite patches from this series have been accepted.
>
> We support building the kernel with older toolchains, so I don't see the
> benefit of using intrinsics here. Have you tried writing an implementation
> with NEON instructions directly?
>

I have tried an alternate version coded in assembly that was
contributed by Vladimir Murzin. But obviously, compiling to an .S file
should also do the trick if this is a concern.
However, there are two reasons I have chosen these particular examples:
- they can be built for both v7 and v8;
- they illustrate the need to be careful about when GCC might generate
NEON instructions.

I am also working on bit sliced AES which is in fact NEON assembly,
and is about 50% faster in CTR mode (on A15)

-- 
Ard.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation
  2013-06-06 15:03 ` [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation Ard Biesheuvel
@ 2013-06-06 15:55   ` Nicolas Pitre
  0 siblings, 0 replies; 24+ messages in thread
From: Nicolas Pitre @ 2013-06-06 15:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 6 Jun 2013, Ard Biesheuvel wrote:

> Rebased/reworked a patch contributed by Rob Herring that uses
> NEON intrinsics to perform the RAID-6 syndrome calculations.
> It uses the existing unroll.awk code to generate several
> unrolled versions of which the best performing one is selected
> at boot time.
> 
> Output captured from an ARM Cortex-A15 @ 1.7 GHz:
> 
> raid6: int32x1    200 MB/s
> raid6: int32x2    304 MB/s
> raid6: int32x4    388 MB/s
> raid6: int32x8    423 MB/s
> raid6: neonx1     799 MB/s
> raid6: neonx2    1364 MB/s
> raid6: neonx4    1731 MB/s
> raid6: neonx8    1676 MB/s
> raid6: using algorithm neonx4 (1731 MB/s)
> raid6: using intx1 recovery algorithm
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Acked-by: Nicolas Pitre <nico@linaro.org>


> ---
>  include/linux/raid/pq.h |  5 ++++
>  lib/raid6/.gitignore    |  1 +
>  lib/raid6/Makefile      | 31 +++++++++++++++++++
>  lib/raid6/algos.c       |  6 ++++
>  lib/raid6/neon.c        | 58 +++++++++++++++++++++++++++++++++++
>  lib/raid6/neon.uc       | 80 +++++++++++++++++++++++++++++++++++++++++++++++++
>  lib/raid6/test/Makefile | 19 +++++++++++-
>  7 files changed, 199 insertions(+), 1 deletion(-)
>  create mode 100644 lib/raid6/neon.c
>  create mode 100644 lib/raid6/neon.uc
> 
> diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> index 8dfaa2c..0f42469 100644
> --- a/include/linux/raid/pq.h
> +++ b/include/linux/raid/pq.h
> @@ -114,6 +114,11 @@ extern const struct raid6_recov_calls raid6_recov_intx1;
>  extern const struct raid6_recov_calls raid6_recov_ssse3;
>  extern const struct raid6_recov_calls raid6_recov_avx2;
>  
> +extern const struct raid6_calls raid6_neonx1;
> +extern const struct raid6_calls raid6_neonx2;
> +extern const struct raid6_calls raid6_neonx4;
> +extern const struct raid6_calls raid6_neonx8;
> +
>  /* Algorithm list */
>  extern const struct raid6_calls * const raid6_algos[];
>  extern const struct raid6_recov_calls *const raid6_recov_algos[];
> diff --git a/lib/raid6/.gitignore b/lib/raid6/.gitignore
> index 162beca..0a7e494 100644
> --- a/lib/raid6/.gitignore
> +++ b/lib/raid6/.gitignore
> @@ -2,3 +2,4 @@ mktables
>  altivec*.c
>  int*.c
>  tables.c
> +neon?.c
> diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> index 9f7c184..6a51f1a 100644
> --- a/lib/raid6/Makefile
> +++ b/lib/raid6/Makefile
> @@ -5,6 +5,7 @@ raid6_pq-y	+= algos.o recov.o tables.o int1.o int2.o int4.o \
>  
>  raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
>  raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
> +raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
>  
>  hostprogs-y	+= mktables
>  
> @@ -16,6 +17,12 @@ ifeq ($(CONFIG_ALTIVEC),y)
>  altivec_flags := -maltivec -mabi=altivec
>  endif
>  
> +# The GCC option -ffreestanding is required in order to compile code containing
> +# ARM/NEON intrinsics in a non C99-compliant environment (such as the kernel)
> +ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> +NEON_FLAGS := -mfloat-abi=softfp -mfpu=neon -ffreestanding
> +endif
> +
>  targets += int1.c
>  $(obj)/int1.c:   UNROLL := 1
>  $(obj)/int1.c:   $(src)/int.uc $(src)/unroll.awk FORCE
> @@ -70,6 +77,30 @@ $(obj)/altivec8.c:   UNROLL := 8
>  $(obj)/altivec8.c:   $(src)/altivec.uc $(src)/unroll.awk FORCE
>  	$(call if_changed,unroll)
>  
> +CFLAGS_neon1.o += $(NEON_FLAGS)
> +targets += neon1.c
> +$(obj)/neon1.c:   UNROLL := 1
> +$(obj)/neon1.c:   $(src)/neon.uc $(src)/unroll.awk FORCE
> +	$(call if_changed,unroll)
> +
> +CFLAGS_neon2.o += $(NEON_FLAGS)
> +targets += neon2.c
> +$(obj)/neon2.c:   UNROLL := 2
> +$(obj)/neon2.c:   $(src)/neon.uc $(src)/unroll.awk FORCE
> +	$(call if_changed,unroll)
> +
> +CFLAGS_neon4.o += $(NEON_FLAGS)
> +targets += neon4.c
> +$(obj)/neon4.c:   UNROLL := 4
> +$(obj)/neon4.c:   $(src)/neon.uc $(src)/unroll.awk FORCE
> +	$(call if_changed,unroll)
> +
> +CFLAGS_neon8.o += $(NEON_FLAGS)
> +targets += neon8.c
> +$(obj)/neon8.c:   UNROLL := 8
> +$(obj)/neon8.c:   $(src)/neon.uc $(src)/unroll.awk FORCE
> +	$(call if_changed,unroll)
> +
>  quiet_cmd_mktable = TABLE   $@
>        cmd_mktable = $(obj)/mktables > $@ || ( rm -f $@ && exit 1 )
>  
> diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> index 6d7316f..74e6f56 100644
> --- a/lib/raid6/algos.c
> +++ b/lib/raid6/algos.c
> @@ -70,6 +70,12 @@ const struct raid6_calls * const raid6_algos[] = {
>  	&raid6_intx2,
>  	&raid6_intx4,
>  	&raid6_intx8,
> +#ifdef CONFIG_KERNEL_MODE_NEON
> +	&raid6_neonx1,
> +	&raid6_neonx2,
> +	&raid6_neonx4,
> +	&raid6_neonx8,
> +#endif
>  	NULL
>  };
>  
> diff --git a/lib/raid6/neon.c b/lib/raid6/neon.c
> new file mode 100644
> index 0000000..dad7102
> --- /dev/null
> +++ b/lib/raid6/neon.c
> @@ -0,0 +1,58 @@
> +/*
> + * linux/lib/raid6/neon.c - RAID6 syndrome calculation using ARM NEON intrinsics
> + *
> + * Copyright (C) 2013 Linaro Ltd <ard.biesheuvel@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/raid/pq.h>
> +
> +#ifdef __KERNEL__
> +#include <asm/neon.h>
> +#else
> +#define kernel_neon_begin()
> +#define kernel_neon_end()
> +#define cpu_has_neon()		(1)
> +#endif
> +
> +/*
> + * There are 2 reasons these wrappers are kept in a separate compilation unit
> + * from the actual implementations in neonN.c (generated from neon.uc by
> + * unroll.awk):
> + * - the actual implementations use NEON intrinsics, and the GCC support header
> + *   (arm_neon.h) is not fully compatible (type wise) with the kernel;
> + * - the neonN.c files are compiled with -mfpu=neon and optimization enabled,
> + *   and we have to make sure that we never use *any* NEON/VFP instructions
> + *   outside a kernel_neon_begin()/kernel_neon_end() pair.
> + */
> +
> +#define RAID6_NEON_WRAPPER(_n)						\
> +	static void raid6_neon ## _n ## _gen_syndrome(int disks,	\
> +					size_t bytes, void **ptrs)	\
> +	{								\
> +		void raid6_neon ## _n  ## _gen_syndrome_real(int,	\
> +						unsigned int, void**);	\
> +		kernel_neon_begin();					\
> +		raid6_neon ## _n ## _gen_syndrome_real(disks,		\
> +					(unsigned int)bytes, ptrs);	\
> +		kernel_neon_end();					\
> +	}								\
> +	struct raid6_calls const raid6_neonx ## _n = {			\
> +		raid6_neon ## _n ## _gen_syndrome,			\
> +		raid6_have_neon,					\
> +		"neonx" #_n,						\
> +		0							\
> +	};
> +
> +static int raid6_have_neon(void)
> +{
> +	return cpu_has_neon();
> +}
> +
> +RAID6_NEON_WRAPPER(1)
> +RAID6_NEON_WRAPPER(2)
> +RAID6_NEON_WRAPPER(4)
> +RAID6_NEON_WRAPPER(8)
> diff --git a/lib/raid6/neon.uc b/lib/raid6/neon.uc
> new file mode 100644
> index 0000000..f2d7ec0
> --- /dev/null
> +++ b/lib/raid6/neon.uc
> @@ -0,0 +1,80 @@
> +/* -----------------------------------------------------------------------
> + *
> + *   neon.uc - RAID-6 syndrome calculation using ARM NEON instructions
> + *
> + *   Copyright (C) 2012 Rob Herring
> + *
> + *   Based on altivec.uc:
> + *     Copyright 2002-2004 H. Peter Anvin - All Rights Reserved
> + *
> + *   This program is free software; you can redistribute it and/or modify
> + *   it under the terms of the GNU General Public License as published by
> + *   the Free Software Foundation, Inc., 53 Temple Place Ste 330,
> + *   Boston MA 02111-1307, USA; either version 2 of the License, or
> + *   (at your option) any later version; incorporated herein by reference.
> + *
> + * ----------------------------------------------------------------------- */
> +
> +/*
> + * neon$#.c
> + *
> + * $#-way unrolled NEON intrinsics math RAID-6 instruction set
> + *
> + * This file is postprocessed using unroll.awk
> + */
> +
> +#include <arm_neon.h>
> +
> +typedef uint8x16_t unative_t;
> +
> +#define NBYTES(x) ((unative_t){x,x,x,x, x,x,x,x, x,x,x,x, x,x,x,x})
> +#define NSIZE	sizeof(unative_t)
> +
> +/*
> + * The SHLBYTE() operation shifts each byte left by 1, *not*
> + * rolling over into the next byte
> + */
> +static inline unative_t SHLBYTE(unative_t v)
> +{
> +	return vshlq_n_u8(v, 1);
> +}
> +
> +/*
> + * The MASK() operation returns 0xFF in any byte for which the high
> + * bit is 1, 0x00 for any byte for which the high bit is 0.
> + */
> +static inline unative_t MASK(unative_t v)
> +{
> +	const uint8x16_t temp = NBYTES(0);
> +	return (unative_t)vcltq_s8((int8x16_t)v, (int8x16_t)temp);
> +}
> +
> +void raid6_neon$#_gen_syndrome_real(int disks, unsigned int bytes, void **ptrs)
> +{
> +	uint8_t **dptr = (uint8_t **)ptrs;
> +	uint8_t *p, *q;
> +	int d, z, z0;
> +
> +	register unative_t wd$$, wq$$, wp$$, w1$$, w2$$;
> +	const unative_t x1d = NBYTES(0x1d);
> +
> +	z0 = disks - 3;		/* Highest data disk */
> +	p = dptr[z0+1];		/* XOR parity */
> +	q = dptr[z0+2];		/* RS syndrome */
> +
> +	for ( d = 0 ; d < bytes ; d += NSIZE*$# ) {
> +		wq$$ = wp$$ = vld1q_u8(&dptr[z0][d+$$*NSIZE]);
> +		for ( z = z0-1 ; z >= 0 ; z-- ) {
> +			wd$$ = vld1q_u8(&dptr[z][d+$$*NSIZE]);
> +			wp$$ = veorq_u8(wp$$, wd$$);
> +			w2$$ = MASK(wq$$);
> +			w1$$ = SHLBYTE(wq$$);
> +
> +			w2$$ = vandq_u8(w2$$, x1d);
> +			w1$$ = veorq_u8(w1$$, w2$$);
> +			wq$$ = veorq_u8(w1$$, wd$$);
> +		}
> +		vst1q_u8(&p[d+NSIZE*$$], wp$$);
> +		vst1q_u8(&q[d+NSIZE*$$], wq$$);
> +	}
> +}
> diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
> index 087332d..bbe6450 100644
> --- a/lib/raid6/test/Makefile
> +++ b/lib/raid6/test/Makefile
> @@ -34,6 +34,11 @@ else
>          ifeq ($(HAS_ALTIVEC),yes)
>                  OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
>          endif
> +        ifeq ($(ARCH),arm)
> +                CFLAGS += -I../../../arch/arm/include -mfpu=neon \
> +                          -DCONFIG_KERNEL_MODE_NEON=1
> +                OBJS += neon.o neon1.o neon2.o neon4.o neon8.o
> +        endif
>  endif
>  
>  .c.o:
> @@ -55,6 +60,18 @@ raid6.a: $(OBJS)
>  raid6test: test.c raid6.a
>  	$(CC) $(CFLAGS) -o raid6test $^
>  
> +neon1.c: neon.uc ../unroll.awk
> +	$(AWK) ../unroll.awk -vN=1 < neon.uc > $@
> +
> +neon2.c: neon.uc ../unroll.awk
> +	$(AWK) ../unroll.awk -vN=2 < neon.uc > $@
> +
> +neon4.c: neon.uc ../unroll.awk
> +	$(AWK) ../unroll.awk -vN=4 < neon.uc > $@
> +
> +neon8.c: neon.uc ../unroll.awk
> +	$(AWK) ../unroll.awk -vN=8 < neon.uc > $@
> +
>  altivec1.c: altivec.uc ../unroll.awk
>  	$(AWK) ../unroll.awk -vN=1 < altivec.uc > $@
>  
> @@ -89,7 +106,7 @@ tables.c: mktables
>  	./mktables > tables.c
>  
>  clean:
> -	rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c tables.c raid6test
> +	rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c neon*.c tables.c raid6test
>  
>  spotless: clean
>  	rm -f *~
> -- 
> 1.8.1.2
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
  2013-06-06 15:52   ` Ard Biesheuvel
@ 2013-06-06 16:17   ` Nicolas Pitre
  2013-06-06 23:08     ` Rob Herring
  2013-06-07 17:50     ` Will Deacon
  1 sibling, 2 replies; 24+ messages in thread
From: Nicolas Pitre @ 2013-06-06 16:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 6 Jun 2013, Will Deacon wrote:

> On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
> > Hi all,
> 
> Hi Ard,
> 
> > This is a partial repost of the patches I proposed a couple of weeks ago to add
> > support for VFP/NEON in kernel mode.
> > 
> > This time, I have included two use cases that I have been using, XOR and RAID-6
> > checksumming. The former gets a 60% performance boost on the NEON, the latter
> > over 400%.
> 
> Whilst that sounds impressive, can you achieve similar results across all
> NEON-capable CPUs? In particular, we need to make sure this doesn't cause
> performance regressions on some cores.

Note that the kernel performs runtime benchmarking of all the different 
implementations it has available at boot time and selects the best one.  
So if this would turn out to make things worse on some cores then the 
Neon code would simply not be used.

> Furthermore, do you have any power figures to complement your 
> findings?

This is going to be most useful in server type environments where a bit 
more power is not such an issue but throughput is ... unless you start 
using RAID6 arrays on your phone that is.  :-)  Otherwise this can be 
left configured out for mobile targets.

> The increased context-switch overhead
> is also worth measuring if you can (i.e. run some userspace NEON-based
> benchmarks in parallel with NEON and non-NEON implementations of the
> checksumming).

Do we know the context switch cost of normal task scheduling between 
tasks using FP operations?  The in-kernel Neon usage should bring about 
the same cost.  Measuring it would be interesting albeit probably 
difficult.

> We support building the kernel with older toolchains, so I don't see the
> benefit of using intrinsics here.

These days the compiler tends to do a better job than humans at properly 
scheduling instructions for some code.  We shouldn't deprive ourselves 
from it when a recent enough gcc is available.


Nicolas

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-06 16:17   ` Nicolas Pitre
@ 2013-06-06 23:08     ` Rob Herring
  2013-06-07 17:50     ` Will Deacon
  1 sibling, 0 replies; 24+ messages in thread
From: Rob Herring @ 2013-06-06 23:08 UTC (permalink / raw)
  To: linux-arm-kernel

On 06/06/2013 11:17 AM, Nicolas Pitre wrote:
> On Thu, 6 Jun 2013, Will Deacon wrote:
> 
>> On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
>>> Hi all,
>>
>> Hi Ard,
>>
>>> This is a partial repost of the patches I proposed a couple of weeks ago to add
>>> support for VFP/NEON in kernel mode.
>>>
>>> This time, I have included two use cases that I have been using, XOR and RAID-6
>>> checksumming. The former gets a 60% performance boost on the NEON, the latter
>>> over 400%.
>>
>> Whilst that sounds impressive, can you achieve similar results across all
>> NEON-capable CPUs? In particular, we need to make sure this doesn't cause
>> performance regressions on some cores.
> 
> Note that the kernel performs runtime benchmarking of all the different 
> implementations it has available at boot time and selects the best one.  
> So if this would turn out to make things worse on some cores then the 
> Neon code would simply not be used.
> 
>> Furthermore, do you have any power figures to complement your 
>> findings?
> 
> This is going to be most useful in server type environments where a bit 
> more power is not such an issue but throughput is ... unless you start 
> using RAID6 arrays on your phone that is.  :-)  Otherwise this can be 
> left configured out for mobile targets.

Agreed. Any power difference will be noise for a server.

Rob

>> The increased context-switch overhead
>> is also worth measuring if you can (i.e. run some userspace NEON-based
>> benchmarks in parallel with NEON and non-NEON implementations of the
>> checksumming).
> 
> Do we know the context switch cost of normal task scheduling between 
> tasks using FP operations?  The in-kernel Neon usage should bring about 
> the same cost.  Measuring it would be interesting albeit probably 
> difficult.
> 
>> We support building the kernel with older toolchains, so I don't see the
>> benefit of using intrinsics here.
> 
> These days the compiler tends to do a better job than humans at properly 
> scheduling instructions for some code.  We shouldn't deprive ourselves 
> from it when a recent enough gcc is available.
> 
> 
> Nicolas
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-06 16:17   ` Nicolas Pitre
  2013-06-06 23:08     ` Rob Herring
@ 2013-06-07 17:50     ` Will Deacon
  2013-06-07 19:49       ` Ard Biesheuvel
  2013-06-08  3:09       ` Nicolas Pitre
  1 sibling, 2 replies; 24+ messages in thread
From: Will Deacon @ 2013-06-07 17:50 UTC (permalink / raw)
  To: linux-arm-kernel

Hello Nicolas, Ard,

On Thu, Jun 06, 2013 at 05:17:39PM +0100, Nicolas Pitre wrote:
> On Thu, 6 Jun 2013, Will Deacon wrote:
> > On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
> > > This time, I have included two use cases that I have been using, XOR and RAID-6
> > > checksumming. The former gets a 60% performance boost on the NEON, the latter
> > > over 400%.
> > 
> > Whilst that sounds impressive, can you achieve similar results across all
> > NEON-capable CPUs? In particular, we need to make sure this doesn't cause
> > performance regressions on some cores.
> 
> Note that the kernel performs runtime benchmarking of all the different 
> implementations it has available at boot time and selects the best one.  
> So if this would turn out to make things worse on some cores then the 
> Neon code would simply not be used.

That will be all sorts of fun if we try to run this on big.LITTLE...
Perhaps we don't care about that either.

> > Furthermore, do you have any power figures to complement your 
> > findings?
> 
> This is going to be most useful in server type environments where a bit 
> more power is not such an issue but throughput is ... unless you start 
> using RAID6 arrays on your phone that is.  :-)  Otherwise this can be 
> left configured out for mobile targets.

Agreed, but this patch series also sets a precedent for using NEON in the
kernel. Whilst I'd love to hook up some SCSI arrays to my Nexus 4 (!), much
more likely is that people might start reworking some of the crypto algorithms
to use the NEON/SIMD register file (especially with the crypto extensions in
ARMv8) so it would be good to have *some* feel of the power impact off the
bat.

> > The increased context-switch overhead
> > is also worth measuring if you can (i.e. run some userspace NEON-based
> > benchmarks in parallel with NEON and non-NEON implementations of the
> > checksumming).
> 
> Do we know the context switch cost of normal task scheduling between 
> tasks using FP operations?  The in-kernel Neon usage should bring about 
> the same cost.  Measuring it would be interesting albeit probably 
> difficult.

Sure, this stuff is hard to measure and we don't have a feel for the normal
context-switch penalities. I just think it would be useful to try and get a
feel for the increased overhead of saving/restoring this state if userspace
is trying to use the registers in parallel with the kernel.

> > We support building the kernel with older toolchains, so I don't see the
> > benefit of using intrinsics here.
> 
> These days the compiler tends to do a better job than humans at properly 
> scheduling instructions for some code.  We shouldn't deprive ourselves 
> from it when a recent enough gcc is available.

What's the earliest toolchain we claim to support nowadays? If that can't
deal with the intrinsics then we either need to bump the requirement, or
write this using hand-coded asm. In the case of the latter, I don't think
the maintenance overhead of having two implementations is worth it.

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-07 17:50     ` Will Deacon
@ 2013-06-07 19:49       ` Ard Biesheuvel
  2013-06-08  3:09       ` Nicolas Pitre
  1 sibling, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-07 19:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

On 7 June 2013 19:50, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Jun 06, 2013 at 05:17:39PM +0100, Nicolas Pitre wrote:
>> On Thu, 6 Jun 2013, Will Deacon wrote:
>> > On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
>> > > This time, I have included two use cases that I have been using, XOR and RAID-6
>> > > checksumming. The former gets a 60% performance boost on the NEON, the latter
>> > > over 400%.
>> >
>> > Whilst that sounds impressive, can you achieve similar results across all
>> > NEON-capable CPUs? In particular, we need to make sure this doesn't cause
>> > performance regressions on some cores.
>>
>> Note that the kernel performs runtime benchmarking of all the different
>> implementations it has available at boot time and selects the best one.
>> So if this would turn out to make things worse on some cores then the
>> Neon code would simply not be used.
>
> That will be all sorts of fun if we try to run this on big.LITTLE...
> Perhaps we don't care about that either.
>

Doesn't that apply equally with and without NEON? I mean, there are
several non-NEON flavors of the RAID6 and XOR algorithms already, and
the benchmark at boot time decides which one gets used until the next
reboot.

>> > Furthermore, do you have any power figures to complement your
>> > findings?
>>
>> This is going to be most useful in server type environments where a bit
>> more power is not such an issue but throughput is ... unless you start
>> using RAID6 arrays on your phone that is.  :-)  Otherwise this can be
>> left configured out for mobile targets.
>
> Agreed, but this patch series also sets a precedent for using NEON in the
> kernel. Whilst I'd love to hook up some SCSI arrays to my Nexus 4 (!), much
> more likely is that people might start reworking some of the crypto algorithms
> to use the NEON/SIMD register file (especially with the crypto extensions in
> ARMv8) so it would be good to have *some* feel of the power impact off the
> bat.
>

Why would the kernel be any different from userland in this respect?

>> > The increased context-switch overhead
>> > is also worth measuring if you can (i.e. run some userspace NEON-based
>> > benchmarks in parallel with NEON and non-NEON implementations of the
>> > checksumming).
>>
>> Do we know the context switch cost of normal task scheduling between
>> tasks using FP operations?  The in-kernel Neon usage should bring about
>> the same cost.  Measuring it would be interesting albeit probably
>> difficult.
>
> Sure, this stuff is hard to measure and we don't have a feel for the normal
> context-switch penalities. I just think it would be useful to try and get a
> feel for the increased overhead of saving/restoring this state if userspace
> is trying to use the registers in parallel with the kernel.
>

With NEON only supported outside interrupt context, and no preemption
(as the patch proposes), I don't expect the context switch overhead to
be substantially worse than with as many userland processes competing
for the NEON(s). Perhaps the increased latency is more of a concern
here.

>> > We support building the kernel with older toolchains, so I don't see the
>> > benefit of using intrinsics here.
>>
>> These days the compiler tends to do a better job than humans at properly
>> scheduling instructions for some code.  We shouldn't deprive ourselves
>> from it when a recent enough gcc is available.
>
> What's the earliest toolchain we claim to support nowadays? If that can't
> deal with the intrinsics then we either need to bump the requirement, or
> write this using hand-coded asm. In the case of the latter, I don't think
> the maintenance overhead of having two implementations is worth it.
>

I agree that maintaining an intrinsics version side by side with an
assembly version makes no sense. The same applies to the XOR patch, it
uses -ftree-vectorize, and requires 4.6 (and issues a #warning if your
gcc is older), so if we feel that is not appropriate, I will happily
replace it with a plain assembly version.

However, the main point of this discussion is whether
a) allowing NEON in kernel mode is a good idea in the first place
b) whether the way I propose to do it is an acceptable one.

Any comments/questions on that part?

Regards,
Ard.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-07 17:50     ` Will Deacon
  2013-06-07 19:49       ` Ard Biesheuvel
@ 2013-06-08  3:09       ` Nicolas Pitre
  2013-06-21  9:33         ` Will Deacon
  1 sibling, 1 reply; 24+ messages in thread
From: Nicolas Pitre @ 2013-06-08  3:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 7 Jun 2013, Will Deacon wrote:

> Hello Nicolas, Ard,
> 
> On Thu, Jun 06, 2013 at 05:17:39PM +0100, Nicolas Pitre wrote:
> > On Thu, 6 Jun 2013, Will Deacon wrote:
> > > On Thu, Jun 06, 2013 at 04:03:00PM +0100, Ard Biesheuvel wrote:
> > > > This time, I have included two use cases that I have been using, XOR and RAID-6
> > > > checksumming. The former gets a 60% performance boost on the NEON, the latter
> > > > over 400%.
> > > 
> > > Whilst that sounds impressive, can you achieve similar results across all
> > > NEON-capable CPUs? In particular, we need to make sure this doesn't cause
> > > performance regressions on some cores.
> > 
> > Note that the kernel performs runtime benchmarking of all the different 
> > implementations it has available at boot time and selects the best one.  
> > So if this would turn out to make things worse on some cores then the 
> > Neon code would simply not be used.
> 
> That will be all sorts of fun if we try to run this on big.LITTLE...
> Perhaps we don't care about that either.

Probably not at the present time.

> > > Furthermore, do you have any power figures to complement your 
> > > findings?
> > 
> > This is going to be most useful in server type environments where a bit 
> > more power is not such an issue but throughput is ... unless you start 
> > using RAID6 arrays on your phone that is.  :-)  Otherwise this can be 
> > left configured out for mobile targets.
> 
> Agreed, but this patch series also sets a precedent for using NEON in the
> kernel. Whilst I'd love to hook up some SCSI arrays to my Nexus 4 (!), much
> more likely is that people might start reworking some of the crypto algorithms
> to use the NEON/SIMD register file (especially with the crypto extensions in
> ARMv8) so it would be good to have *some* feel of the power impact off the
> bat.

Well... Neon is there to be used, otherwise it is just a waste of gates. 
So of course it is going to use more power.  but as long as the power 
used by Neon is less than the power consumed by the same task performed 
by the main processor then we're happy.  Ard provided numbers where Neon 
performs 4 times better while it surely doesn't use 4 times the power 
(or so I hope).

That shouldn't matter much if that power is used in user or kernel 
space.  OTOH the kernel does use crypto algorithms so it does need Neon 
if we want 4x the throughput.

In the end what I want to say is that the power profile is for system 
integrator to assess and decide.  We cannot tell if the Neon power usage 
is good or bad without the overall application use case.  All we should 
do is to provide the mechanism and make it configurable.  Same argument 
applies to the context switch overhead.

> > > We support building the kernel with older toolchains, so I don't see the
> > > benefit of using intrinsics here.
> > 
> > These days the compiler tends to do a better job than humans at properly 
> > scheduling instructions for some code.  We shouldn't deprive ourselves 
> > from it when a recent enough gcc is available.
> 
> What's the earliest toolchain we claim to support nowadays? If that can't
> deal with the intrinsics then we either need to bump the requirement, or
> write this using hand-coded asm. In the case of the latter, I don't think
> the maintenance overhead of having two implementations is worth it.

We have many different minimum toolchain version requirements attached 
to different features being enabled already, ftrace being one of them if 
I remember correctly.  For these Neon optimizations the minimum gcc 
version is v4.6.

Given that this is going to be interesting mostly to server systems, and 
given that ARM server deployments are rather new, I don't see the point 
of compiling a new server environment using an older gcc version.

I don't think we have to bump the gcc requirement for anyone wishing to 
compile the kernel with their existing set of features either.  That 
would be rather silly.  It is not because Neon intrinsics are used in 
some kernel code that everyone should be forced to upgrade their 
compiler, especially if they don't intend to use that in-kernel Neon 
code. However, in order to benefit from optional new features that 
didn't exist before, I think it is perfectly reasonable to require later 
gcc versions for them if need be.

I agree that having two different implementations of the same thing is 
not the way to go.  So if the choice between a pure assembly vs a C 
version with intrinsics has to be made, then I'd vote for the C version 
unless the assembly one is much faster.  C code is always preferable to 
assembly code as it is much easier to review and modify, and 
improvements to the compiler may still increase performance of the 
unchanged code while the assembly version is static and will always be 
tuned to some particular core implementations only.

If someone eventually comes along with some pure assembly version that 
blows the current C version out of the water then we simply replace it, 
period.  We don't have to commit ourselves to a particular 
implementation either.

But for that to happen we need to merge this code and let people 
experiment with it.  That's how improvements will come about.


Nicolas

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-08  3:09       ` Nicolas Pitre
@ 2013-06-21  9:33         ` Will Deacon
  2013-06-21 10:08           ` Ard Biesheuvel
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2013-06-21  9:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi guys,

On Sat, Jun 08, 2013 at 04:09:56AM +0100, Nicolas Pitre wrote:
> On Fri, 7 Jun 2013, Will Deacon wrote:
> > > Note that the kernel performs runtime benchmarking of all the different 
> > > implementations it has available at boot time and selects the best one.  
> > > So if this would turn out to make things worse on some cores then the 
> > > Neon code would simply not be used.
> > 
> > That will be all sorts of fun if we try to run this on big.LITTLE...
> > Perhaps we don't care about that either.
> 
> Probably not at the present time.

[...]

> > What's the earliest toolchain we claim to support nowadays? If that can't
> > deal with the intrinsics then we either need to bump the requirement, or
> > write this using hand-coded asm. In the case of the latter, I don't think
> > the maintenance overhead of having two implementations is worth it.
> 
> We have many different minimum toolchain version requirements attached 
> to different features being enabled already, ftrace being one of them if 
> I remember correctly.  For these Neon optimizations the minimum gcc 
> version is v4.6.
> 
> Given that this is going to be interesting mostly to server systems, and 
> given that ARM server deployments are rather new, I don't see the point 
> of compiling a new server environment using an older gcc version.

I've mulled over this, had some discussions with our toolchain guys and
have concluded the following:

  - The intrinsics are actually ok. I was sceptical at first, but I've been
    assured that they should do a reasonable job (echoing your performance
    figures).

  - The current approach is targetting servers and isn't (yet) suitable for
    mobile.

So, given that the patches do the right thing wrt GCC version, the only
remaining point is that we need to keep an eye out for people trying to
re-use this stuff for mobile (likely crypto, as I mentioned earlier). When
that happens, we should consider revisiting the benchmark/power figures.

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-21  9:33         ` Will Deacon
@ 2013-06-21 10:08           ` Ard Biesheuvel
  2013-06-21 14:58             ` Christopher Covington
  0 siblings, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-21 10:08 UTC (permalink / raw)
  To: linux-arm-kernel

On 21 June 2013 11:33, Will Deacon <will.deacon@arm.com> wrote:
> On Sat, Jun 08, 2013 at 04:09:56AM +0100, Nicolas Pitre wrote:
>> On Fri, 7 Jun 2013, Will Deacon wrote:
>> > What's the earliest toolchain we claim to support nowadays? If that can't
>> > deal with the intrinsics then we either need to bump the requirement, or
>> > write this using hand-coded asm. In the case of the latter, I don't think
>> > the maintenance overhead of having two implementations is worth it.
>>
>> We have many different minimum toolchain version requirements attached
>> to different features being enabled already, ftrace being one of them if
>> I remember correctly.  For these Neon optimizations the minimum gcc
>> version is v4.6.
>>
>> Given that this is going to be interesting mostly to server systems, and
>> given that ARM server deployments are rather new, I don't see the point
>> of compiling a new server environment using an older gcc version.
>
> I've mulled over this, had some discussions with our toolchain guys and
> have concluded the following:
>
>   - The intrinsics are actually ok. I was sceptical at first, but I've been
>     assured that they should do a reasonable job (echoing your performance
>     figures).
>
>   - The current approach is targetting servers and isn't (yet) suitable for
>     mobile.
>
> So, given that the patches do the right thing wrt GCC version, the only
> remaining point is that we need to keep an eye out for people trying to
> re-use this stuff for mobile (likely crypto, as I mentioned earlier). When
> that happens, we should consider revisiting the benchmark/power figures.
>

OK, so a number of points have been raised in this discussion, let me
address them one by one:

Should we allow NEON to be used in the kernel?

The consensus is not to allow floating point. However, NEON is
different, as the performance gains are considerable and there is no
dependency on support code, which makes it not as hairy as
conventional (pre-v3) VFP. Also, managing the vfpstates is easily
doable if NEON is only used outside interrupt context and with
preemption disabled.


Does my series implement it correctly?

I have addressed Russell's first round of comments. Happy to take
another round if necessary.


Should we allow NEON intrinsics in the kernel?
Should we allow GCC-generated NEON in the kernel?

Only if the implementation is clear on which minimum version of GCC it
requires. We could use my examples to set a precedent on what is a
suitable way to use NEON intrinsics or the vectorizer in kernel code
(which includes coding it such that it can be reused for arm64 with no
modifications)


Is kernel mode NEON suitable for mobile?

To me, it is unclear why kernel and userland are so different in this
respect. However, kernel mode NEON is separately configurable from
Kconfig so it can be disabled at will.


Is there a point to doing a boot time benchmark to select the optimal
implementation of an algorithm?

Perhaps not but unrelated to kernel mode NEON.


Code is here
http://git.linaro.org/gitweb?p=people/ardbiesheuvel/linux-arm.git;a=shortlog;h=refs/heads/for-rmk


Regards,
Ard.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-21 10:08           ` Ard Biesheuvel
@ 2013-06-21 14:58             ` Christopher Covington
  2013-06-24  8:08               ` Ard Biesheuvel
  2013-06-25 13:56               ` Dave Martin
  0 siblings, 2 replies; 24+ messages in thread
From: Christopher Covington @ 2013-06-21 14:58 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,

On 06/21/2013 06:08 AM, Ard Biesheuvel wrote:
> On 21 June 2013 11:33, Will Deacon <will.deacon@arm.com> wrote:
>> On Sat, Jun 08, 2013 at 04:09:56AM +0100, Nicolas Pitre wrote:
>>> On Fri, 7 Jun 2013, Will Deacon wrote:
>>>> What's the earliest toolchain we claim to support nowadays? If that can't
>>>> deal with the intrinsics then we either need to bump the requirement, or
>>>> write this using hand-coded asm. In the case of the latter, I don't think
>>>> the maintenance overhead of having two implementations is worth it.
>>>
>>> We have many different minimum toolchain version requirements attached
>>> to different features being enabled already, ftrace being one of them if
>>> I remember correctly.  For these Neon optimizations the minimum gcc
>>> version is v4.6.
>>>
>>> Given that this is going to be interesting mostly to server systems, and
>>> given that ARM server deployments are rather new, I don't see the point
>>> of compiling a new server environment using an older gcc version.
>>
>> I've mulled over this, had some discussions with our toolchain guys and
>> have concluded the following:
>>
>>   - The intrinsics are actually ok. I was sceptical at first, but I've been
>>     assured that they should do a reasonable job (echoing your performance
>>     figures).
>>
>>   - The current approach is targetting servers and isn't (yet) suitable for
>>     mobile.
>>
>> So, given that the patches do the right thing wrt GCC version, the only
>> remaining point is that we need to keep an eye out for people trying to
>> re-use this stuff for mobile (likely crypto, as I mentioned earlier). When
>> that happens, we should consider revisiting the benchmark/power figures.
>>
> 
> OK, so a number of points have been raised in this discussion, let me
> address them one by one:
> 
> Should we allow NEON to be used in the kernel?
> 
> The consensus is not to allow floating point. However, NEON is
> different, as the performance gains are considerable and there is no
> dependency on support code, which makes it not as hairy as
> conventional (pre-v3) VFP. Also, managing the vfpstates is easily
> doable if NEON is only used outside interrupt context and with
> preemption disabled.
> 
> 
> Does my series implement it correctly?
> 
> I have addressed Russell's first round of comments. Happy to take
> another round if necessary.
> 
> 
> Should we allow NEON intrinsics in the kernel?
> Should we allow GCC-generated NEON in the kernel?
> 
> Only if the implementation is clear on which minimum version of GCC it
> requires. We could use my examples to set a precedent on what is a
> suitable way to use NEON intrinsics or the vectorizer in kernel code
> (which includes coding it such that it can be reused for arm64 with no
> modifications)
> 
> 
> Is kernel mode NEON suitable for mobile?
> 
> To me, it is unclear why kernel and userland are so different in this
> respect. However, kernel mode NEON is separately configurable from
> Kconfig so it can be disabled at will.
> 
> 
> Is there a point to doing a boot time benchmark to select the optimal
> implementation of an algorithm?
> 
> Perhaps not but unrelated to kernel mode NEON.

If this is indeed the consensus (I don't disagree with any of it myself),
perhaps committing the main points, guidelines, and examples to
Documentation/arm/* would be useful.

Christopher

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-21 14:58             ` Christopher Covington
@ 2013-06-24  8:08               ` Ard Biesheuvel
  2013-06-24  8:54                 ` Russell King - ARM Linux
  2013-06-25 13:56               ` Dave Martin
  1 sibling, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-24  8:08 UTC (permalink / raw)
  To: linux-arm-kernel

On 21 June 2013 16:58, Christopher Covington <cov@codeaurora.org> wrote:
> Hi Ard,
>
> If this is indeed the consensus (I don't disagree with any of it myself),
> perhaps committing the main points, guidelines, and examples to
> Documentation/arm/* would be useful.
>

Hello Chris,

I agree that it makes sense to document any guidelines we set
regarding the use of NEON in kernel mode, and I will happily do so.
However, let's first try to wrap up this discussion, haven't heard
back from Russell yet on his current position in this matter.

Regards,
Ard.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-24  8:08               ` Ard Biesheuvel
@ 2013-06-24  8:54                 ` Russell King - ARM Linux
  2013-06-24  9:10                   ` Ard Biesheuvel
  0 siblings, 1 reply; 24+ messages in thread
From: Russell King - ARM Linux @ 2013-06-24  8:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 24, 2013 at 10:08:10AM +0200, Ard Biesheuvel wrote:
> On 21 June 2013 16:58, Christopher Covington <cov@codeaurora.org> wrote:
> > Hi Ard,
> >
> > If this is indeed the consensus (I don't disagree with any of it myself),
> > perhaps committing the main points, guidelines, and examples to
> > Documentation/arm/* would be useful.
> >
> 
> Hello Chris,
> 
> I agree that it makes sense to document any guidelines we set
> regarding the use of NEON in kernel mode, and I will happily do so.
> However, let's first try to wrap up this discussion, haven't heard
> back from Russell yet on his current position in this matter.

I think it's probably fine now - you have ample justification, and you
seem to be ensuring that things are done safely.

The only thing I'm left wondering about is whether the bits outside
arch/arm should go to anyone else, but MAINTAINERS is being unhelpful
on that - so I'm tempted to say that if someone should've been copied
who hasn't, and they're not in MAINTAINERS that's their problem.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-24  8:54                 ` Russell King - ARM Linux
@ 2013-06-24  9:10                   ` Ard Biesheuvel
  0 siblings, 0 replies; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-24  9:10 UTC (permalink / raw)
  To: linux-arm-kernel

On 24 June 2013 10:54, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>
> I think it's probably fine now - you have ample justification, and you
> seem to be ensuring that things are done safely.
>

Great!

> The only thing I'm left wondering about is whether the bits outside
> arch/arm should go to anyone else, but MAINTAINERS is being unhelpful
> on that - so I'm tempted to say that if someone should've been copied
> who hasn't, and they're not in MAINTAINERS that's their problem.

As far as the RAID6 patch is concerned, my plan was to:
- get the ARM groundwork (this series) accepted (including the XOR
patch, which does live strictly under arch/arm);
- get the arm64 groundwork accepted (which uses the same API);
- post the RAID6 patch to LKML and cc to hpa (I have a local version
which has been slightly tweaked so it builds correctly for both arm
and arm64).

However, the RAID 6 patch does only affect arm/arm64, so perhaps it's
justified that you take it all in, only in that case, please allow me
to re-send the pull request so it contains the version that will build
on arm64 as well.

Regards,
Ard.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-21 14:58             ` Christopher Covington
  2013-06-24  8:08               ` Ard Biesheuvel
@ 2013-06-25 13:56               ` Dave Martin
  2013-06-25 14:14                 ` Ard Biesheuvel
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Martin @ 2013-06-25 13:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 21, 2013 at 10:58:21AM -0400, Christopher Covington wrote:
> Hi Ard,
> 
> On 06/21/2013 06:08 AM, Ard Biesheuvel wrote:
> > On 21 June 2013 11:33, Will Deacon <will.deacon@arm.com> wrote:
> >> On Sat, Jun 08, 2013 at 04:09:56AM +0100, Nicolas Pitre wrote:
> >>> On Fri, 7 Jun 2013, Will Deacon wrote:
> >>>> What's the earliest toolchain we claim to support nowadays? If that can't
> >>>> deal with the intrinsics then we either need to bump the requirement, or
> >>>> write this using hand-coded asm. In the case of the latter, I don't think
> >>>> the maintenance overhead of having two implementations is worth it.
> >>>
> >>> We have many different minimum toolchain version requirements attached
> >>> to different features being enabled already, ftrace being one of them if
> >>> I remember correctly.  For these Neon optimizations the minimum gcc
> >>> version is v4.6.
> >>>
> >>> Given that this is going to be interesting mostly to server systems, and
> >>> given that ARM server deployments are rather new, I don't see the point
> >>> of compiling a new server environment using an older gcc version.
> >>
> >> I've mulled over this, had some discussions with our toolchain guys and
> >> have concluded the following:
> >>
> >>   - The intrinsics are actually ok. I was sceptical at first, but I've been
> >>     assured that they should do a reasonable job (echoing your performance
> >>     figures).
> >>
> >>   - The current approach is targetting servers and isn't (yet) suitable for
> >>     mobile.
> >>
> >> So, given that the patches do the right thing wrt GCC version, the only
> >> remaining point is that we need to keep an eye out for people trying to
> >> re-use this stuff for mobile (likely crypto, as I mentioned earlier). When
> >> that happens, we should consider revisiting the benchmark/power figures.
> >>
> > 
> > OK, so a number of points have been raised in this discussion, let me
> > address them one by one:
> > 
> > Should we allow NEON to be used in the kernel?
> > 
> > The consensus is not to allow floating point. However, NEON is
> > different, as the performance gains are considerable and there is no
> > dependency on support code, which makes it not as hairy as
> > conventional (pre-v3) VFP. Also, managing the vfpstates is easily
> > doable if NEON is only used outside interrupt context and with
> > preemption disabled.
> > 
> > 
> > Does my series implement it correctly?
> > 
> > I have addressed Russell's first round of comments. Happy to take
> > another round if necessary.
> > 
> > 
> > Should we allow NEON intrinsics in the kernel?
> > Should we allow GCC-generated NEON in the kernel?
> > 
> > Only if the implementation is clear on which minimum version of GCC it
> > requires. We could use my examples to set a precedent on what is a
> > suitable way to use NEON intrinsics or the vectorizer in kernel code
> > (which includes coding it such that it can be reused for arm64 with no
> > modifications)
> > 
> > 
> > Is kernel mode NEON suitable for mobile?
> > 
> > To me, it is unclear why kernel and userland are so different in this
> > respect. However, kernel mode NEON is separately configurable from
> > Kconfig so it can be disabled at will.
> > 
> > 
> > Is there a point to doing a boot time benchmark to select the optimal
> > implementation of an algorithm?
> > 
> > Perhaps not but unrelated to kernel mode NEON.

Significant benchmarks on the boot path would be unacceptable, unless they
are *fast* (and by fast, I mean fast on all platforms, not just fast on
the fast platforms).  If one second gets added onto the boot path for each
optimised algorithm, that sounds like a fail.  If all the benchmarks
combined take one second in total, that's no quite as bad.

Maybe benchmarks could be time-bounded (i.e., see how much data we can
chug though in X milliseconds) instead of size-bounded.  This would avoid
unreasonable slowdown on slow platforms, while avoiding trivially small
benchmark payloads on faster platforms which may typically have a more
complex architecture, bigger caches etc. which would cause them to take
longer to reach saturated performance when running a particular algorithm.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-25 13:56               ` Dave Martin
@ 2013-06-25 14:14                 ` Ard Biesheuvel
  2013-06-25 14:29                   ` Dave Martin
  0 siblings, 1 reply; 24+ messages in thread
From: Ard Biesheuvel @ 2013-06-25 14:14 UTC (permalink / raw)
  To: linux-arm-kernel

On 25 June 2013 15:56, Dave Martin <dave.martin@linaro.org> wrote:
> Significant benchmarks on the boot path would be unacceptable, unless they
> are *fast* (and by fast, I mean fast on all platforms, not just fast on
> the fast platforms).  If one second gets added onto the boot path for each
> optimised algorithm, that sounds like a fail.  If all the benchmarks
> combined take one second in total, that's no quite as bad.
>
> Maybe benchmarks could be time-bounded (i.e., see how much data we can
> chug though in X milliseconds) instead of size-bounded.  This would avoid
> unreasonable slowdown on slow platforms, while avoiding trivially small
> benchmark payloads on faster platforms which may typically have a more
> complex architecture, bigger caches etc. which would cause them to take
> longer to reach saturated performance when running a particular algorithm.
>

Benchmarks are already time bounded, at least the instances I am aware
of (xor and raid6) are. They each measure, for each available
implementation, the amount of work performed during a fixed time. For
RAID6, this is 16 jiffies, for XOR it's only 1 jiffy but each test is
repeated 5 times.

So I think this should not be a problem, especially as it is unlikely
that newly added implementations (such as NEON) will be able to
execute on older/slower platforms in the first place.

Regards,
Ard.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/5] Kernel mode NEON for XOR and RAID6
  2013-06-25 14:14                 ` Ard Biesheuvel
@ 2013-06-25 14:29                   ` Dave Martin
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Martin @ 2013-06-25 14:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 25, 2013 at 04:14:13PM +0200, Ard Biesheuvel wrote:
> On 25 June 2013 15:56, Dave Martin <dave.martin@linaro.org> wrote:
> > Significant benchmarks on the boot path would be unacceptable, unless they
> > are *fast* (and by fast, I mean fast on all platforms, not just fast on
> > the fast platforms).  If one second gets added onto the boot path for each
> > optimised algorithm, that sounds like a fail.  If all the benchmarks
> > combined take one second in total, that's no quite as bad.
> >
> > Maybe benchmarks could be time-bounded (i.e., see how much data we can
> > chug though in X milliseconds) instead of size-bounded.  This would avoid
> > unreasonable slowdown on slow platforms, while avoiding trivially small
> > benchmark payloads on faster platforms which may typically have a more
> > complex architecture, bigger caches etc. which would cause them to take
> > longer to reach saturated performance when running a particular algorithm.
> >
> 
> Benchmarks are already time bounded, at least the instances I am aware
> of (xor and raid6) are. They each measure, for each available
> implementation, the amount of work performed during a fixed time. For
> RAID6, this is 16 jiffies, for XOR it's only 1 jiffy but each test is
> repeated 5 times.
> 
> So I think this should not be a problem, especially as it is unlikely
> that newly added implementations (such as NEON) will be able to
> execute on older/slower platforms in the first place.

The tree I was originally looking at might be out of date ... apologies
for the trolling.

If the XOR benchmark really only takes 50 ms per implementation, I guess
that shouldn't be too bad.

Cheers
---Dave

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-06-25 14:29 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-06-06 15:03 [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 1/5] ARM: add support for kernel mode NEON Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 2/5] ARM: move VFP init to an earlier boot stage Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 3/5] ARM: be strict about FP exceptions in kernel mode Ard Biesheuvel
2013-06-06 15:03 ` [PATCH 4/5] ARM: crypto: add NEON accelerated XOR implementation Ard Biesheuvel
2013-06-06 15:45   ` Nicolas Pitre
2013-06-06 15:03 ` [PATCH 5/5] lib/raid6: add ARM-NEON accelerated syndrome calculation Ard Biesheuvel
2013-06-06 15:55   ` Nicolas Pitre
2013-06-06 15:17 ` [PATCH 0/5] Kernel mode NEON for XOR and RAID6 Will Deacon
2013-06-06 15:52   ` Ard Biesheuvel
2013-06-06 16:17   ` Nicolas Pitre
2013-06-06 23:08     ` Rob Herring
2013-06-07 17:50     ` Will Deacon
2013-06-07 19:49       ` Ard Biesheuvel
2013-06-08  3:09       ` Nicolas Pitre
2013-06-21  9:33         ` Will Deacon
2013-06-21 10:08           ` Ard Biesheuvel
2013-06-21 14:58             ` Christopher Covington
2013-06-24  8:08               ` Ard Biesheuvel
2013-06-24  8:54                 ` Russell King - ARM Linux
2013-06-24  9:10                   ` Ard Biesheuvel
2013-06-25 13:56               ` Dave Martin
2013-06-25 14:14                 ` Ard Biesheuvel
2013-06-25 14:29                   ` Dave Martin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).