* [PATCH 0/2] md/raid6: improvements for ARM/arm64
@ 2017-07-13 17:15 Ard Biesheuvel
2017-07-13 17:16 ` [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome Ard Biesheuvel
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Ard Biesheuvel @ 2017-07-13 17:15 UTC (permalink / raw)
To: linux-arm-kernel
1. Use a faster algorithm for the delta syndrome
2. Implement recovery routines in NEON
As before, NEON intrinsics are used, which means the same code can be
compiled for ARM as well as arm64.
Given that there does not seem to be a maintainer for lib/raid6, could
we take this through one of the ARM trees instead?
Ard Biesheuvel (2):
md/raid6: use faster multiplication for ARM NEON delta syndrome
md/raid6: implement recovery using ARM NEON intrinsics
include/linux/raid/pq.h | 1 +
lib/raid6/Makefile | 4 +-
lib/raid6/algos.c | 3 +
lib/raid6/neon.uc | 33 +++++-
lib/raid6/recov_neon.c | 110 ++++++++++++++++++
lib/raid6/recov_neon_inner.c | 117 ++++++++++++++++++++
6 files changed, 264 insertions(+), 4 deletions(-)
create mode 100644 lib/raid6/recov_neon.c
create mode 100644 lib/raid6/recov_neon_inner.c
--
2.9.3
* [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome
2017-07-13 17:15 [PATCH 0/2] md/raid6: improvements for ARM/arm64 Ard Biesheuvel
@ 2017-07-13 17:16 ` Ard Biesheuvel
2017-07-13 17:51 ` AW: " Markus Stockhausen
2017-07-13 17:16 ` [PATCH 2/2] md/raid6: implement recovery using ARM NEON intrinsics Ard Biesheuvel
2017-08-09 17:51 ` [PATCH 0/2] md/raid6: improvements for ARM/arm64 Catalin Marinas
2 siblings, 1 reply; 7+ messages in thread
From: Ard Biesheuvel @ 2017-07-13 17:16 UTC (permalink / raw)
To: linux-arm-kernel
The P/Q left side optimization in the delta syndrome simply involves
repeatedly multiplying a value by polynomial 'x' in GF(2^8). Given
that 'x * x * x * x' equals 'x^4' even in the polynomial world, we
can accelerate this substantially by performing up to 4 such operations
at once, using the NEON instructions for polynomial multiplication.
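For reference, a minimal scalar sketch of the identity being exploited (not
part of the patch; it assumes the usual RAID-6 field polynomial 0x11d, so
x^8 reduces to 0x1d in GF(2^8)):

#include <stdint.h>

/* carry-less multiply of two byte-wide polynomials, keeping the low
 * 8 bits -- this is what vmulq_p8 computes per lane */
static uint8_t clmul8(uint8_t a, uint8_t b)
{
	uint8_t r = 0;
	int i;

	for (i = 0; i < 8; i++)
		if (b & (1 << i))
			r ^= (uint8_t)(a << i);
	return r;
}

/* one multiplication by x, i.e. one step of the original loop */
static uint8_t gf_mul_x(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/* four of those steps at once: v = h*x^4 + l, so
 * v * x^4 = h * x^8 + l * x^4 = clmul(h, 0x1d) ^ (l << 4),
 * and the clmul result has degree <= 7, so no further reduction is needed */
static uint8_t gf_mul_x4(uint8_t v)
{
	return (uint8_t)(clmul8(v >> 4, 0x1d) ^ (uint8_t)(v << 4));
}

gf_mul_x4(v) returns the same value as applying gf_mul_x() four times; the
patch performs the clmul step with vmulq_p8 on the high nibbles of 16 bytes
in parallel, which is where the speedup comes from.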
Results on a Cortex-A57 running in 64-bit mode:
Before:
-------
raid6: neonx1 xor() 1680 MB/s
raid6: neonx2 xor() 2286 MB/s
raid6: neonx4 xor() 3162 MB/s
raid6: neonx8 xor() 3389 MB/s
After:
------
raid6: neonx1 xor() 2281 MB/s
raid6: neonx2 xor() 3362 MB/s
raid6: neonx4 xor() 3787 MB/s
raid6: neonx8 xor() 4239 MB/s
While we're at it, simplify MASK() by using a signed shift rather than
a vector compare involving a temp register.
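A scalar sketch of what MASK() computes per byte (not part of the patch, and
relying on the usual arithmetic right shift of signed values):

/* 0xff if the top bit of v is set, 0x00 otherwise; shifting the sign
 * bit across the byte makes the compare against a zeroed temp register
 * unnecessary */
static uint8_t mask_byte(uint8_t v)
{
	return (uint8_t)((int8_t)v >> 7);
}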
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
lib/raid6/neon.uc | 33 ++++++++++++++++++--
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/lib/raid6/neon.uc b/lib/raid6/neon.uc
index 4fa51b761dd0..d5242f544551 100644
--- a/lib/raid6/neon.uc
+++ b/lib/raid6/neon.uc
@@ -46,8 +46,12 @@ static inline unative_t SHLBYTE(unative_t v)
*/
static inline unative_t MASK(unative_t v)
{
- const uint8x16_t temp = NBYTES(0);
- return (unative_t)vcltq_s8((int8x16_t)v, (int8x16_t)temp);
+ return (unative_t)vshrq_n_s8((int8x16_t)v, 7);
+}
+
+static inline unative_t PMUL(unative_t v, unative_t u)
+{
+ return (unative_t)vmulq_p8((poly8x16_t)v, (poly8x16_t)u);
}
void raid6_neon$#_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
@@ -110,7 +114,30 @@ void raid6_neon$#_xor_syndrome_real(int disks, int start, int stop,
wq$$ = veorq_u8(w1$$, wd$$);
}
/* P/Q left side optimization */
- for ( z = start-1 ; z >= 0 ; z-- ) {
+ for ( z = start-1 ; z >= 3 ; z -= 4 ) {
+ w2$$ = vshrq_n_u8(wq$$, 4);
+ w1$$ = vshlq_n_u8(wq$$, 4);
+
+ w2$$ = PMUL(w2$$, x1d);
+ wq$$ = veorq_u8(w1$$, w2$$);
+ }
+
+ switch (z) {
+ case 2:
+ w2$$ = vshrq_n_u8(wq$$, 5);
+ w1$$ = vshlq_n_u8(wq$$, 3);
+
+ w2$$ = PMUL(w2$$, x1d);
+ wq$$ = veorq_u8(w1$$, w2$$);
+ break;
+ case 1:
+ w2$$ = vshrq_n_u8(wq$$, 6);
+ w1$$ = vshlq_n_u8(wq$$, 2);
+
+ w2$$ = PMUL(w2$$, x1d);
+ wq$$ = veorq_u8(w1$$, w2$$);
+ break;
+ case 0:
w2$$ = MASK(wq$$);
w1$$ = SHLBYTE(wq$$);
--
2.9.3
* [PATCH 2/2] md/raid6: implement recovery using ARM NEON intrinsics
2017-07-13 17:15 [PATCH 0/2] md/raid6: improvements for ARM/arm64 Ard Biesheuvel
2017-07-13 17:16 ` [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome Ard Biesheuvel
@ 2017-07-13 17:16 ` Ard Biesheuvel
2017-08-09 17:51 ` [PATCH 0/2] md/raid6: improvements for ARM/arm64 Catalin Marinas
2 siblings, 0 replies; 7+ messages in thread
From: Ard Biesheuvel @ 2017-07-13 17:16 UTC (permalink / raw)
To: linux-arm-kernel
Provide a NEON accelerated implementation of the recovery algorithm,
which supersedes the default byte-by-byte one.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
include/linux/raid/pq.h | 1 +
lib/raid6/Makefile | 4 +-
lib/raid6/algos.c | 3 +
lib/raid6/recov_neon.c | 110 ++++++++++++++++++
lib/raid6/recov_neon_inner.c | 117 ++++++++++++++++++++
5 files changed, 234 insertions(+), 1 deletion(-)
diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bbaaa1bf..64318fa49126 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -121,6 +121,7 @@ extern const struct raid6_recov_calls raid6_recov_ssse3;
extern const struct raid6_recov_calls raid6_recov_avx2;
extern const struct raid6_recov_calls raid6_recov_avx512;
extern const struct raid6_recov_calls raid6_recov_s390xc;
+extern const struct raid6_recov_calls raid6_recov_neon;
extern const struct raid6_calls raid6_neonx1;
extern const struct raid6_calls raid6_neonx2;
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 3057011f5599..a93adf6dcfb2 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -5,7 +5,7 @@ raid6_pq-y += algos.o recov.o tables.o int1.o int2.o int4.o \
raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o recov_avx512.o
raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
-raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
+raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
@@ -26,7 +26,9 @@ NEON_FLAGS := -ffreestanding
ifeq ($(ARCH),arm)
NEON_FLAGS += -mfloat-abi=softfp -mfpu=neon
endif
+CFLAGS_recov_neon_inner.o += $(NEON_FLAGS)
ifeq ($(ARCH),arm64)
+CFLAGS_REMOVE_recov_neon_inner.o += -mgeneral-regs-only
CFLAGS_REMOVE_neon1.o += -mgeneral-regs-only
CFLAGS_REMOVE_neon2.o += -mgeneral-regs-only
CFLAGS_REMOVE_neon4.o += -mgeneral-regs-only
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 7857049fd7d3..476994723258 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -113,6 +113,9 @@ const struct raid6_recov_calls *const raid6_recov_algos[] = {
#ifdef CONFIG_S390
&raid6_recov_s390xc,
#endif
+#if defined(CONFIG_KERNEL_MODE_NEON)
+ &raid6_recov_neon,
+#endif
&raid6_recov_intx1,
NULL
};
diff --git a/lib/raid6/recov_neon.c b/lib/raid6/recov_neon.c
new file mode 100644
index 000000000000..eeb5c4065b92
--- /dev/null
+++ b/lib/raid6/recov_neon.c
@@ -0,0 +1,110 @@
+/*
+ * Copyright (C) 2012 Intel Corporation
+ * Copyright (C) 2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/raid/pq.h>
+
+#ifdef __KERNEL__
+#include <asm/neon.h>
+#else
+#define kernel_neon_begin()
+#define kernel_neon_end()
+#define cpu_has_neon() (1)
+#endif
+
+static int raid6_has_neon(void)
+{
+ return cpu_has_neon();
+}
+
+void __raid6_2data_recov_neon(int bytes, uint8_t *p, uint8_t *q, uint8_t *dp,
+ uint8_t *dq, const uint8_t *pbmul,
+ const uint8_t *qmul);
+
+void __raid6_datap_recov_neon(int bytes, uint8_t *p, uint8_t *q, uint8_t *dq,
+ const uint8_t *qmul);
+
+static void raid6_2data_recov_neon(int disks, size_t bytes, int faila,
+ int failb, void **ptrs)
+{
+ u8 *p, *q, *dp, *dq;
+ const u8 *pbmul; /* P multiplier table for B data */
+ const u8 *qmul; /* Q multiplier table (for both) */
+
+ p = (u8 *)ptrs[disks - 2];
+ q = (u8 *)ptrs[disks - 1];
+
+ /*
+ * Compute syndrome with zero for the missing data pages
+ * Use the dead data pages as temporary storage for
+ * delta p and delta q
+ */
+ dp = (u8 *)ptrs[faila];
+ ptrs[faila] = (void *)raid6_empty_zero_page;
+ ptrs[disks - 2] = dp;
+ dq = (u8 *)ptrs[failb];
+ ptrs[failb] = (void *)raid6_empty_zero_page;
+ ptrs[disks - 1] = dq;
+
+ raid6_call.gen_syndrome(disks, bytes, ptrs);
+
+ /* Restore pointer table */
+ ptrs[faila] = dp;
+ ptrs[failb] = dq;
+ ptrs[disks - 2] = p;
+ ptrs[disks - 1] = q;
+
+ /* Now, pick the proper data tables */
+ pbmul = raid6_vgfmul[raid6_gfexi[failb-faila]];
+ qmul = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila] ^
+ raid6_gfexp[failb]]];
+
+ kernel_neon_begin();
+ __raid6_2data_recov_neon(bytes, p, q, dp, dq, pbmul, qmul);
+ kernel_neon_end();
+}
+
+static void raid6_datap_recov_neon(int disks, size_t bytes, int faila,
+ void **ptrs)
+{
+ u8 *p, *q, *dq;
+ const u8 *qmul; /* Q multiplier table */
+
+ p = (u8 *)ptrs[disks - 2];
+ q = (u8 *)ptrs[disks - 1];
+
+ /*
+ * Compute syndrome with zero for the missing data page
+ * Use the dead data page as temporary storage for delta q
+ */
+ dq = (u8 *)ptrs[faila];
+ ptrs[faila] = (void *)raid6_empty_zero_page;
+ ptrs[disks - 1] = dq;
+
+ raid6_call.gen_syndrome(disks, bytes, ptrs);
+
+ /* Restore pointer table */
+ ptrs[faila] = dq;
+ ptrs[disks - 1] = q;
+
+ /* Now, pick the proper data tables */
+ qmul = raid6_vgfmul[raid6_gfinv[raid6_gfexp[faila]]];
+
+ kernel_neon_begin();
+ __raid6_datap_recov_neon(bytes, p, q, dq, qmul);
+ kernel_neon_end();
+}
+
+const struct raid6_recov_calls raid6_recov_neon = {
+ .data2 = raid6_2data_recov_neon,
+ .datap = raid6_datap_recov_neon,
+ .valid = raid6_has_neon,
+ .name = "neon",
+ .priority = 10,
+};
diff --git a/lib/raid6/recov_neon_inner.c b/lib/raid6/recov_neon_inner.c
new file mode 100644
index 000000000000..8cd20c9f834a
--- /dev/null
+++ b/lib/raid6/recov_neon_inner.c
@@ -0,0 +1,117 @@
+/*
+ * Copyright (C) 2012 Intel Corporation
+ * Copyright (C) 2017 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <arm_neon.h>
+
+static const uint8x16_t x0f = {
+ 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f,
+ 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f,
+};
+
+#ifdef CONFIG_ARM
+/*
+ * AArch32 does not provide this intrinsic natively because it does not
+ * implement the underlying instruction. AArch32 only provides a 64-bit
+ * wide vtbl.8 instruction, so use that instead.
+ */
+static uint8x16_t vqtbl1q_u8(uint8x16_t a, uint8x16_t b)
+{
+ union {
+ uint8x16_t val;
+ uint8x8x2_t pair;
+ } __a = { a };
+
+ return vcombine_u8(vtbl2_u8(__a.pair, vget_low_u8(b)),
+ vtbl2_u8(__a.pair, vget_high_u8(b)));
+}
+#endif
+
+void __raid6_2data_recov_neon(int bytes, uint8_t *p, uint8_t *q, uint8_t *dp,
+ uint8_t *dq, const uint8_t *pbmul,
+ const uint8_t *qmul)
+{
+ uint8x16_t pm0 = vld1q_u8(pbmul);
+ uint8x16_t pm1 = vld1q_u8(pbmul + 16);
+ uint8x16_t qm0 = vld1q_u8(qmul);
+ uint8x16_t qm1 = vld1q_u8(qmul + 16);
+
+ /*
+ * while ( bytes-- ) {
+ * uint8_t px, qx, db;
+ *
+ * px = *p ^ *dp;
+ * qx = qmul[*q ^ *dq];
+ * *dq++ = db = pbmul[px] ^ qx;
+ * *dp++ = db ^ px;
+ * p++; q++;
+ * }
+ */
+
+ while (bytes) {
+ uint8x16_t vx, vy, px, qx, db;
+
+ px = veorq_u8(vld1q_u8(p), vld1q_u8(dp));
+ vx = veorq_u8(vld1q_u8(q), vld1q_u8(dq));
+
+ vy = (uint8x16_t)vshrq_n_s16((int16x8_t)vx, 4);
+ vx = vqtbl1q_u8(qm0, vandq_u8(vx, x0f));
+ vy = vqtbl1q_u8(qm1, vandq_u8(vy, x0f));
+ qx = veorq_u8(vx, vy);
+
+ vy = (uint8x16_t)vshrq_n_s16((int16x8_t)px, 4);
+ vx = vqtbl1q_u8(pm0, vandq_u8(px, x0f));
+ vy = vqtbl1q_u8(pm1, vandq_u8(vy, x0f));
+ vx = veorq_u8(vx, vy);
+ db = veorq_u8(vx, qx);
+
+ vst1q_u8(dq, db);
+ vst1q_u8(dp, veorq_u8(db, px));
+
+ bytes -= 16;
+ p += 16;
+ q += 16;
+ dp += 16;
+ dq += 16;
+ }
+}
+
+void __raid6_datap_recov_neon(int bytes, uint8_t *p, uint8_t *q, uint8_t *dq,
+ const uint8_t *qmul)
+{
+ uint8x16_t qm0 = vld1q_u8(qmul);
+ uint8x16_t qm1 = vld1q_u8(qmul + 16);
+
+ /*
+ * while (bytes--) {
+ * *p++ ^= *dq = qmul[*q ^ *dq];
+ * q++; dq++;
+ * }
+ */
+
+ while (bytes) {
+ uint8x16_t vx, vy;
+
+ vx = veorq_u8(vld1q_u8(q), vld1q_u8(dq));
+
+ vy = (uint8x16_t)vshrq_n_s16((int16x8_t)vx, 4);
+ vx = vqtbl1q_u8(qm0, vandq_u8(vx, x0f));
+ vy = vqtbl1q_u8(qm1, vandq_u8(vy, x0f));
+ vx = veorq_u8(vx, vy);
+ vy = veorq_u8(vx, vld1q_u8(p));
+
+ vst1q_u8(dq, vx);
+ vst1q_u8(p, vy);
+
+ bytes -= 16;
+ p += 16;
+ q += 16;
+ dq += 16;
+ }
+}
--
2.9.3
* AW: [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome
2017-07-13 17:16 ` [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome Ard Biesheuvel
@ 2017-07-13 17:51 ` Markus Stockhausen
2017-07-13 18:00 ` Ard Biesheuvel
0 siblings, 1 reply; 7+ messages in thread
From: Markus Stockhausen @ 2017-07-13 17:51 UTC (permalink / raw)
To: linux-arm-kernel
> From: Ard Biesheuvel [ard.biesheuvel at linaro.org]
> Sent: Thursday, 13 July 2017 19:16
> To: linux-arm-kernel at lists.infradead.org; linux-raid at vger.kernel.org
> Cc: shli at kernel.org; Markus Stockhausen; linux at armlinux.org.uk; will.deacon at arm.com; catalin.marinas at arm.com; Ard Biesheuvel
> Subject: [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome
>
> The P/Q left side optimization in the delta syndrome simply involves
> repeatedly multiplying a value by polynomial 'x' in GF(2^8). Given
> that 'x * x * x * x' equals 'x^4' even in the polynomial world, we
> can accelerate this substantially by performing up to 4 such operations
> at once, using the NEON instructions for polynomial multiplication.
>
> Results on a Cortex-A57 running in 64-bit mode:
>
> Before:
> -------
> raid6: neonx1 xor() 1680 MB/s
> raid6: neonx2 xor() 2286 MB/s
> raid6: neonx4 xor() 3162 MB/s
> raid6: neonx8 xor() 3389 MB/s
>
> After:
> ------
> raid6: neonx1 xor() 2281 MB/s
> raid6: neonx2 xor() 3362 MB/s
> raid6: neonx4 xor() 3787 MB/s
> raid6: neonx8 xor() 4239 MB/s
Nice optimization. Nevertheless the test algorithm favours this implementation. See:
int start = (disks>>1)-1, stop = disks-3; /* work on the second half of the disks */
What do the before/after numbers look like if you work on the middle data disks
rather than the rightmost ones? With the 4K page size this should be start = 3,
stop = 11 instead of start = 7, stop = 13. Given the large gain you see, the
impact should be smaller, but still at least in the >10% range.
Markus
> While we're at it, simplify MASK() by using a signed shift rather than
> a vector compare involving a temp register.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ...
* [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome
2017-07-13 17:51 ` AW: " Markus Stockhausen
@ 2017-07-13 18:00 ` Ard Biesheuvel
0 siblings, 0 replies; 7+ messages in thread
From: Ard Biesheuvel @ 2017-07-13 18:00 UTC (permalink / raw)
To: linux-arm-kernel
On 13 July 2017 at 18:51, Markus Stockhausen <stockhausen@collogia.de> wrote:
>> From: Ard Biesheuvel [ard.biesheuvel at linaro.org]
>> Sent: Thursday, 13 July 2017 19:16
>> To: linux-arm-kernel at lists.infradead.org; linux-raid at vger.kernel.org
>> Cc: shli at kernel.org; Markus Stockhausen; linux at armlinux.org.uk; will.deacon at arm.com; catalin.marinas at arm.com; Ard Biesheuvel
>> Subject: [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome
>>
>> The P/Q left side optimization in the delta syndrome simply involves
>> repeatedly multiplying a value by polynomial 'x' in GF(2^8). Given
>> that 'x * x * x * x' equals 'x^4' even in the polynomial world, we
>> can accelerate this substantially by performing up to 4 such operations
>> at once, using the NEON instructions for polynomial multiplication.
>>
>> Results on a Cortex-A57 running in 64-bit mode:
>>
>> Before:
>> -------
>> raid6: neonx1 xor() 1680 MB/s
>> raid6: neonx2 xor() 2286 MB/s
>> raid6: neonx4 xor() 3162 MB/s
>> raid6: neonx8 xor() 3389 MB/s
>>
>> After:
>> ------
>> raid6: neonx1 xor() 2281 MB/s
>> raid6: neonx2 xor() 3362 MB/s
>> raid6: neonx4 xor() 3787 MB/s
>> raid6: neonx8 xor() 4239 MB/s
>
> Nice optimization. Nevertheless the test algorithm favours this implementation. See:
>
> int start = (disks>>1)-1, stop = disks-3; /* work on the second half of the disks */
>
> What do the before/after numbers look like if you work on the middle data disks
> rather than the rightmost ones? With the 4K page size this should be start = 3,
> stop = 11 instead of start = 7, stop = 13. Given the large gain you see, the
> impact should be smaller, but still at least in the >10% range.
>
Relative before and after (using raid6test rather than the kernel
module this time, so they should not be compared with the numbers
above):
before
raid6: neonx1 xor() 1773 MB/s
raid6: neonx2 xor() 2362 MB/s
raid6: neonx4 xor() 3223 MB/s
raid6: neonx8 xor() 3375 MB/s
after
raid6: neonx1 xor() 2259 MB/s
raid6: neonx2 xor() 2975 MB/s
raid6: neonx4 xor() 3404 MB/s
raid6: neonx8 xor() 3788 MB/s
So your estimate is correct: 12% speedup for neonx8 in the 'start = 7,
stop = 13' case
--
Ard.
* [PATCH 0/2] md/raid6: improvements for ARM/arm64
2017-07-13 17:15 [PATCH 0/2] md/raid6: improvements for ARM/arm64 Ard Biesheuvel
2017-07-13 17:16 ` [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome Ard Biesheuvel
2017-07-13 17:16 ` [PATCH 2/2] md/raid6: implement recovery using ARM NEON intrinsics Ard Biesheuvel
@ 2017-08-09 17:51 ` Catalin Marinas
2017-08-09 17:52 ` Ard Biesheuvel
2 siblings, 1 reply; 7+ messages in thread
From: Catalin Marinas @ 2017-08-09 17:51 UTC (permalink / raw)
To: linux-arm-kernel
Hi Ard,
On Thu, Jul 13, 2017 at 06:15:59PM +0100, Ard Biesheuvel wrote:
> 1. Use a faster algorithm for the delta syndrome
> 2. Implement recovery routines in NEON
>
> As before, NEON intrinsics are used, which means the same code can be
> compiled for ARM as well as arm64.
>
> Given that there does not seem to be a maintainer for lib/raid6, could
> we take this through one of the ARM trees instead?
>
> Ard Biesheuvel (2):
> md/raid6: use faster multiplication for ARM NEON delta syndrome
> md/raid6: implement recovery using ARM NEON intrinsics
>
> include/linux/raid/pq.h | 1 +
> lib/raid6/Makefile | 4 +-
> lib/raid6/algos.c | 3 +
> lib/raid6/neon.uc | 33 +++++-
> lib/raid6/recov_neon.c | 110 ++++++++++++++++++
> lib/raid6/recov_neon_inner.c | 117 ++++++++++++++++++++
IIRC, you wanted these patches merged via the arm64 tree? I'll apply
them to the for-next/core branch.
--
Catalin
* [PATCH 0/2] md/raid6: improvements for ARM/arm64
2017-08-09 17:51 ` [PATCH 0/2] md/raid6: improvements for ARM/arm64 Catalin Marinas
@ 2017-08-09 17:52 ` Ard Biesheuvel
0 siblings, 0 replies; 7+ messages in thread
From: Ard Biesheuvel @ 2017-08-09 17:52 UTC (permalink / raw)
To: linux-arm-kernel
On 9 August 2017 at 18:51, Catalin Marinas <catalin.marinas@arm.com> wrote:
> Hi Ard,
>
> On Thu, Jul 13, 2017 at 06:15:59PM +0100, Ard Biesheuvel wrote:
>> 1. Use a faster algorithm for the delta syndrome
>> 2. Implement recovery routines in NEON
>>
>> As before, NEON intrinsics are used, which means the same code can be
>> compiled for ARM as well as arm64.
>>
>> Given that there does not seem to be a maintainer for lib/raid6, could
>> we take this through one of the ARM trees instead?
>>
>> Ard Biesheuvel (2):
>> md/raid6: use faster multiplication for ARM NEON delta syndrome
>> md/raid6: implement recovery using ARM NEON intrinsics
>>
>> include/linux/raid/pq.h | 1 +
>> lib/raid6/Makefile | 4 +-
>> lib/raid6/algos.c | 3 +
>> lib/raid6/neon.uc | 33 +++++-
>> lib/raid6/recov_neon.c | 110 ++++++++++++++++++
>> lib/raid6/recov_neon_inner.c | 117 ++++++++++++++++++++
>
> IIRC, you wanted these patches merged via the arm64 tree? I'll apply
> them to the for-next/core branch.
>
Yes, please.