public inbox for linux-raid@vger.kernel.org
* [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
@ 2026-03-18 15:01 Demian Shulhan
  0 siblings, 0 replies; 16+ messages in thread
From: Demian Shulhan @ 2026-03-18 15:01 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, linux-raid, linux-kernel, Demian Shulhan,
	kernel test robot

Implement Scalable Vector Extension (SVE) optimized routines for RAID6
P/Q syndrome generation (gen_syndrome and xor_syndrome) on arm64.

The SVE instruction set supports variable vector lengths (128 to 2048
bits), so the same code scales automatically with the hardware. This
implementation handles arbitrary SVE vector lengths by using the `cntb`
instruction to determine the vector length at runtime.
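
As an illustration of the VL-agnostic loop structure (a portable scalar
model for this commit message only, not kernel code; `vlen` stands in
for the `cntb` result, and `bytes` is a multiple of `vlen` as for the
page-sized buffers the kernel passes in):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The asm reads the vector length once with `cntb` and then strides
 * the byte offset by it, so the result must be identical for any
 * legal VL (16..256 bytes, i.e. 128..2048 bits).  The inner loop
 * models one predicated ld1b/eor/st1b step. */
static void xor_buf_vl(uint8_t *dst, const uint8_t *src,
		       size_t bytes, size_t vlen)
{
	for (size_t off = 0; off < bytes; off += vlen)
		for (size_t i = 0; i < vlen; i++)
			dst[off + i] ^= src[off + i];
}
```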

The implementation introduces the `svex1`, `svex2`, and `svex4`
algorithms. The `svex4` variant unrolls the loop to four blocks per
iteration and uses manual software pipelining (interleaving the memory
loads with the XOR math) to minimize instruction dependency stalls and
maximize CPU pipeline utilization and memory bandwidth.
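
The per-byte math inside each unrolled "math block" (the asr/lsl/and/eor
sequence with 0x1d) is a GF(2^8) multiply-by-2 modulo the RAID6
polynomial, applied as one Horner step of the Q syndrome. A scalar C
sketch of one byte lane, for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* GF(2^8) multiply-by-2 modulo x^8+x^4+x^3+x^2+1: the conditional
 * 0x1d reduction corresponds to the asm's asr #7 / and / eor. */
static uint8_t gf_mul2(uint8_t v)
{
	uint8_t mask = (v & 0x80) ? 0x1d : 0x00;

	return (uint8_t)(v << 1) ^ mask;
}

/* One byte lane of gen_syndrome (ndata >= 1): P is plain XOR, Q is a
 * Horner evaluation from the highest data disk down, as in the asm's
 * inner loop at label 1:. */
static void gen_pq_byte(const uint8_t *d, int ndata, uint8_t *p, uint8_t *q)
{
	uint8_t wp = d[ndata - 1], wq = d[ndata - 1];

	for (int z = ndata - 2; z >= 0; z--) {
		wq = gf_mul2(wq) ^ d[z];
		wp ^= d[z];
	}
	*p = wp;
	*q = wq;
}
```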

Performance was tested on an AWS Graviton3 (Neoverse-V1) instance,
which has a 256-bit SVE vector length. The `svex4` implementation
outperforms the existing 128-bit `neonx4` baseline for syndrome
generation:

raid6: svex4    gen() 19688 MB/s
raid6: svex2    gen() 18610 MB/s
raid6: svex1    gen() 19254 MB/s
raid6: neonx8   gen() 18554 MB/s
raid6: neonx4   gen() 19612 MB/s
raid6: neonx2   gen() 16248 MB/s
raid6: neonx1   gen() 13591 MB/s
raid6: using algorithm svex4 gen() 19688 MB/s
raid6: .... xor() 11212 MB/s, rmw enabled
raid6: using neon recovery algorithm

Note that the recovery path is unchanged and still uses the NEON
implementation ("using neon recovery algorithm" above); this patch adds
only the syndrome generation routines. The `xor_syndrome` (RMW) path is
heavily memory-bound, so its throughput differs little between
implementations.
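
For reference, a scalar model (illustrative only, not kernel code) of
the `xor_syndrome` semantics the assembly implements: only disks in
[start, stop] changed, so their contribution is folded into the
existing P/Q, and the second inner loop (the asm's label 3:) scales the
partial Q by g^start to line it up with the full syndrome:

```c
#include <assert.h>
#include <stdint.h>

static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)(v << 1) ^ ((v & 0x80) ? 0x1d : 0x00);
}

/* One byte lane of xor_syndrome: accumulate the changed disks'
 * contribution into the existing *p / *q values. */
static void xor_pq_byte(const uint8_t *d, int start, int stop,
			uint8_t *p, uint8_t *q)
{
	uint8_t wp = d[stop], wq = d[stop];
	int z;

	for (z = stop - 1; z >= start; z--) {	/* asm loop at label 1: */
		wq = gf_mul2(wq) ^ d[z];
		wp ^= d[z];
	}
	for (z = start - 1; z >= 0; z--)	/* asm loop at label 3: */
		wq = gf_mul2(wq);

	*p ^= wp;
	*q ^= wq;
}
```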

Signed-off-by: Demian Shulhan <demyansh@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202603181940.cFwYmYoi-lkp@intel.com/
---
 include/linux/raid/pq.h |   3 +
 lib/raid6/Makefile      |   5 +
 lib/raid6/algos.c       |   5 +
 lib/raid6/sve.c         | 675 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 688 insertions(+)
 create mode 100644 lib/raid6/sve.c

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 2467b3be15c9..787cc57aea9d 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -140,6 +140,9 @@ extern const struct raid6_calls raid6_neonx1;
 extern const struct raid6_calls raid6_neonx2;
 extern const struct raid6_calls raid6_neonx4;
 extern const struct raid6_calls raid6_neonx8;
+extern const struct raid6_calls raid6_svex1;
+extern const struct raid6_calls raid6_svex2;
+extern const struct raid6_calls raid6_svex4;
 
 /* Algorithm list */
 extern const struct raid6_calls * const raid6_algos[];
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 5be0a4e60ab1..6cdaa6f206fb 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -8,6 +8,7 @@ raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
 raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
                               vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
+raid6_pq-$(CONFIG_ARM64_SVE) += sve.o
 raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
 raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
 raid6_pq-$(CONFIG_RISCV_ISA_V) += rvv.o recov_rvv.o
@@ -67,6 +68,10 @@ CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU)
 CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU)
 CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU)
 CFLAGS_REMOVE_recov_neon_inner.o += $(CC_FLAGS_NO_FPU)
+
+CFLAGS_sve.o += $(CC_FLAGS_FPU)
+CFLAGS_REMOVE_sve.o += $(CC_FLAGS_NO_FPU)
+
 targets += neon1.c neon2.c neon4.c neon8.c
 $(obj)/neon%.c: $(src)/neon.uc $(src)/unroll.awk FORCE
 	$(call if_changed,unroll)
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 799e0e5eac26..0ae73c3a4be3 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -66,6 +66,11 @@ const struct raid6_calls * const raid6_algos[] = {
 	&raid6_neonx2,
 	&raid6_neonx1,
 #endif
+#ifdef CONFIG_ARM64_SVE
+	&raid6_svex4,
+	&raid6_svex2,
+	&raid6_svex1,
+#endif
 #ifdef CONFIG_LOONGARCH
 #ifdef CONFIG_CPU_HAS_LASX
 	&raid6_lasx,
diff --git a/lib/raid6/sve.c b/lib/raid6/sve.c
new file mode 100644
index 000000000000..d52937f806d4
--- /dev/null
+++ b/lib/raid6/sve.c
@@ -0,0 +1,675 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RAID-6 syndrome calculation using ARM SVE instructions
+ */
+
+#include <linux/raid/pq.h>
+
+#ifdef __KERNEL__
+#include <asm/simd.h>
+#include <linux/cpufeature.h>
+#else
+#define scoped_ksimd()
+#define system_supports_sve() (1)
+#endif
+
+static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	long z0 = disks - 3;
+
+	p = dptr[z0 + 1];
+	q = dptr[z0 + 2];
+
+	asm volatile(
+		".arch armv8.2-a+sve\n"
+		"ptrue p0.b\n"
+		"cntb x3\n"
+		"mov w4, #0x1d\n"
+		"dup z4.b, w4\n"
+		"mov x5, #0\n"
+
+		"0:\n"
+		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
+		"ld1b z0.b, p0/z, [x6, x5]\n"
+		"mov z1.d, z0.d\n"
+
+		"mov w7, %w[z0]\n"
+		"sub w7, w7, #1\n"
+
+		"1:\n"
+		"cmp w7, #0\n"
+		"blt 2f\n"
+
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+
+		"sxtw x8, w7\n"
+		"ldr x6, [%[dptr], x8, lsl #3]\n"
+		"ld1b z2.b, p0/z, [x6, x5]\n"
+
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z0.d, z0.d, z2.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 1b\n"
+		"2:\n"
+
+		"st1b z0.b, p0, [%[p], x5]\n"
+		"st1b z1.b, p0, [%[q], x5]\n"
+
+		"add x5, x5, x3\n"
+		"cmp x5, %[bytes]\n"
+		"blt 0b\n"
+		:
+		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
+		  [p] "r" (p), [q] "r" (q)
+		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
+		  "z0", "z1", "z2", "z3", "z4"
+	);
+}
+
+static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
+					 unsigned long bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	long z0 = stop;
+
+	p = dptr[disks - 2];
+	q = dptr[disks - 1];
+
+	asm volatile(
+		".arch armv8.2-a+sve\n"
+		"ptrue p0.b\n"
+		"cntb x3\n"
+		"mov w4, #0x1d\n"
+		"dup z4.b, w4\n"
+		"mov x5, #0\n"
+
+		"0:\n"
+		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
+		"ld1b z1.b, p0/z, [x6, x5]\n"
+		"ld1b z0.b, p0/z, [%[p], x5]\n"
+		"eor z0.d, z0.d, z1.d\n"
+
+		"mov w7, %w[z0]\n"
+		"sub w7, w7, #1\n"
+
+		"1:\n"
+		"cmp w7, %w[start]\n"
+		"blt 2f\n"
+
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+
+		"sxtw x8, w7\n"
+		"ldr x6, [%[dptr], x8, lsl #3]\n"
+		"ld1b z2.b, p0/z, [x6, x5]\n"
+
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z0.d, z0.d, z2.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 1b\n"
+		"2:\n"
+
+		"mov w7, %w[start]\n"
+		"sub w7, w7, #1\n"
+		"3:\n"
+		"cmp w7, #0\n"
+		"blt 4f\n"
+
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 3b\n"
+		"4:\n"
+
+		"ld1b z2.b, p0/z, [%[q], x5]\n"
+		"eor z1.d, z1.d, z2.d\n"
+
+		"st1b z0.b, p0, [%[p], x5]\n"
+		"st1b z1.b, p0, [%[q], x5]\n"
+
+		"add x5, x5, x3\n"
+		"cmp x5, %[bytes]\n"
+		"blt 0b\n"
+		:
+		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
+		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
+		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
+		  "z0", "z1", "z2", "z3", "z4"
+	);
+}
+
+static void raid6_sve2_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	long z0 = disks - 3;
+
+	p = dptr[z0 + 1];
+	q = dptr[z0 + 2];
+
+	asm volatile(
+		".arch armv8.2-a+sve\n"
+		"ptrue p0.b\n"
+		"cntb x3\n"
+		"mov w4, #0x1d\n"
+		"dup z4.b, w4\n"
+		"mov x5, #0\n"
+
+		"0:\n"
+		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
+		"ld1b z0.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z5.b, p0/z, [x6, x8]\n"
+		"mov z1.d, z0.d\n"
+		"mov z6.d, z5.d\n"
+
+		"mov w7, %w[z0]\n"
+		"sub w7, w7, #1\n"
+
+		"1:\n"
+		"cmp w7, #0\n"
+		"blt 2f\n"
+
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+
+		"mov z8.d, z6.d\n"
+		"asr z8.b, p0/m, z8.b, #7\n"
+		"lsl z6.b, p0/m, z6.b, #1\n"
+		"and z8.d, z8.d, z4.d\n"
+		"eor z6.d, z6.d, z8.d\n"
+
+		"sxtw x8, w7\n"
+		"ldr x6, [%[dptr], x8, lsl #3]\n"
+		"ld1b z2.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z7.b, p0/z, [x6, x8]\n"
+
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z0.d, z0.d, z2.d\n"
+
+		"eor z6.d, z6.d, z7.d\n"
+		"eor z5.d, z5.d, z7.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 1b\n"
+		"2:\n"
+
+		"st1b z0.b, p0, [%[p], x5]\n"
+		"st1b z1.b, p0, [%[q], x5]\n"
+		"add x8, x5, x3\n"
+		"st1b z5.b, p0, [%[p], x8]\n"
+		"st1b z6.b, p0, [%[q], x8]\n"
+
+		"add x5, x5, x3\n"
+		"add x5, x5, x3\n"
+		"cmp x5, %[bytes]\n"
+		"blt 0b\n"
+		:
+		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
+		  [p] "r" (p), [q] "r" (q)
+		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
+		  "z0", "z1", "z2", "z3", "z4",
+		  "z5", "z6", "z7", "z8"
+	);
+}
+
+static void raid6_sve2_xor_syndrome_real(int disks, int start, int stop,
+					 unsigned long bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	long z0 = stop;
+
+	p = dptr[disks - 2];
+	q = dptr[disks - 1];
+
+	asm volatile(
+		".arch armv8.2-a+sve\n"
+		"ptrue p0.b\n"
+		"cntb x3\n"
+		"mov w4, #0x1d\n"
+		"dup z4.b, w4\n"
+		"mov x5, #0\n"
+
+		"0:\n"
+		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
+		"ld1b z1.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z6.b, p0/z, [x6, x8]\n"
+
+		"ld1b z0.b, p0/z, [%[p], x5]\n"
+		"ld1b z5.b, p0/z, [%[p], x8]\n"
+
+		"eor z0.d, z0.d, z1.d\n"
+		"eor z5.d, z5.d, z6.d\n"
+
+		"mov w7, %w[z0]\n"
+		"sub w7, w7, #1\n"
+
+		"1:\n"
+		"cmp w7, %w[start]\n"
+		"blt 2f\n"
+
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+
+		"mov z8.d, z6.d\n"
+		"asr z8.b, p0/m, z8.b, #7\n"
+		"lsl z6.b, p0/m, z6.b, #1\n"
+		"and z8.d, z8.d, z4.d\n"
+		"eor z6.d, z6.d, z8.d\n"
+
+		"sxtw x8, w7\n"
+		"ldr x6, [%[dptr], x8, lsl #3]\n"
+		"ld1b z2.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z7.b, p0/z, [x6, x8]\n"
+
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z0.d, z0.d, z2.d\n"
+
+		"eor z6.d, z6.d, z7.d\n"
+		"eor z5.d, z5.d, z7.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 1b\n"
+		"2:\n"
+
+		"mov w7, %w[start]\n"
+		"sub w7, w7, #1\n"
+		"3:\n"
+		"cmp w7, #0\n"
+		"blt 4f\n"
+
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+
+		"mov z8.d, z6.d\n"
+		"asr z8.b, p0/m, z8.b, #7\n"
+		"lsl z6.b, p0/m, z6.b, #1\n"
+		"and z8.d, z8.d, z4.d\n"
+		"eor z6.d, z6.d, z8.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 3b\n"
+		"4:\n"
+
+		"ld1b z2.b, p0/z, [%[q], x5]\n"
+		"eor z1.d, z1.d, z2.d\n"
+		"st1b z0.b, p0, [%[p], x5]\n"
+		"st1b z1.b, p0, [%[q], x5]\n"
+
+		"add x8, x5, x3\n"
+		"ld1b z7.b, p0/z, [%[q], x8]\n"
+		"eor z6.d, z6.d, z7.d\n"
+		"st1b z5.b, p0, [%[p], x8]\n"
+		"st1b z6.b, p0, [%[q], x8]\n"
+
+		"add x5, x5, x3\n"
+		"add x5, x5, x3\n"
+		"cmp x5, %[bytes]\n"
+		"blt 0b\n"
+		:
+		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
+		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
+		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
+		  "z0", "z1", "z2", "z3", "z4",
+		  "z5", "z6", "z7", "z8"
+	);
+}
+
+static void raid6_sve4_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	long z0 = disks - 3;
+
+	p = dptr[z0 + 1];
+	q = dptr[z0 + 2];
+
+	asm volatile(
+		".arch armv8.2-a+sve\n"
+		"ptrue p0.b\n"
+		"cntb x3\n"
+		"mov w4, #0x1d\n"
+		"dup z4.b, w4\n"
+		"mov x5, #0\n"
+
+		"0:\n"
+		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
+		"ld1b z0.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z5.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z10.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z15.b, p0/z, [x6, x8]\n"
+
+		"mov z1.d, z0.d\n"
+		"mov z6.d, z5.d\n"
+		"mov z11.d, z10.d\n"
+		"mov z16.d, z15.d\n"
+
+		"mov w7, %w[z0]\n"
+		"sub w7, w7, #1\n"
+
+		"1:\n"
+		"cmp w7, #0\n"
+		"blt 2f\n"
+
+		// software pipelining: load data early
+		"sxtw x8, w7\n"
+		"ldr x6, [%[dptr], x8, lsl #3]\n"
+		"ld1b z2.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z7.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z12.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z17.b, p0/z, [x6, x8]\n"
+
+		// math block 1
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z0.d, z0.d, z2.d\n"
+
+		// math block 2
+		"mov z8.d, z6.d\n"
+		"asr z8.b, p0/m, z8.b, #7\n"
+		"lsl z6.b, p0/m, z6.b, #1\n"
+		"and z8.d, z8.d, z4.d\n"
+		"eor z6.d, z6.d, z8.d\n"
+		"eor z6.d, z6.d, z7.d\n"
+		"eor z5.d, z5.d, z7.d\n"
+
+		// math block 3
+		"mov z13.d, z11.d\n"
+		"asr z13.b, p0/m, z13.b, #7\n"
+		"lsl z11.b, p0/m, z11.b, #1\n"
+		"and z13.d, z13.d, z4.d\n"
+		"eor z11.d, z11.d, z13.d\n"
+		"eor z11.d, z11.d, z12.d\n"
+		"eor z10.d, z10.d, z12.d\n"
+
+		// math block 4
+		"mov z18.d, z16.d\n"
+		"asr z18.b, p0/m, z18.b, #7\n"
+		"lsl z16.b, p0/m, z16.b, #1\n"
+		"and z18.d, z18.d, z4.d\n"
+		"eor z16.d, z16.d, z18.d\n"
+		"eor z16.d, z16.d, z17.d\n"
+		"eor z15.d, z15.d, z17.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 1b\n"
+		"2:\n"
+
+		"st1b z0.b, p0, [%[p], x5]\n"
+		"st1b z1.b, p0, [%[q], x5]\n"
+		"add x8, x5, x3\n"
+		"st1b z5.b, p0, [%[p], x8]\n"
+		"st1b z6.b, p0, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"st1b z10.b, p0, [%[p], x8]\n"
+		"st1b z11.b, p0, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"st1b z15.b, p0, [%[p], x8]\n"
+		"st1b z16.b, p0, [%[q], x8]\n"
+
+		"add x8, x3, x3\n"
+		"add x5, x5, x8, lsl #1\n"
+		"cmp x5, %[bytes]\n"
+		"blt 0b\n"
+		:
+		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
+		  [p] "r" (p), [q] "r" (q)
+		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
+		  "z0", "z1", "z2", "z3", "z4",
+		  "z5", "z6", "z7", "z8",
+		  "z10", "z11", "z12", "z13",
+		  "z15", "z16", "z17", "z18"
+	);
+}
+
+static void raid6_sve4_xor_syndrome_real(int disks, int start, int stop,
+					 unsigned long bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	long z0 = stop;
+
+	p = dptr[disks - 2];
+	q = dptr[disks - 1];
+
+	asm volatile(
+		".arch armv8.2-a+sve\n"
+		"ptrue p0.b\n"
+		"cntb x3\n"
+		"mov w4, #0x1d\n"
+		"dup z4.b, w4\n"
+		"mov x5, #0\n"
+
+		"0:\n"
+		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
+		"ld1b z1.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z6.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z11.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z16.b, p0/z, [x6, x8]\n"
+
+		"ld1b z0.b, p0/z, [%[p], x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z5.b, p0/z, [%[p], x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z10.b, p0/z, [%[p], x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z15.b, p0/z, [%[p], x8]\n"
+
+		"eor z0.d, z0.d, z1.d\n"
+		"eor z5.d, z5.d, z6.d\n"
+		"eor z10.d, z10.d, z11.d\n"
+		"eor z15.d, z15.d, z16.d\n"
+
+		"mov w7, %w[z0]\n"
+		"sub w7, w7, #1\n"
+
+		"1:\n"
+		"cmp w7, %w[start]\n"
+		"blt 2f\n"
+
+		// software pipelining: load data early
+		"sxtw x8, w7\n"
+		"ldr x6, [%[dptr], x8, lsl #3]\n"
+		"ld1b z2.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z7.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z12.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z17.b, p0/z, [x6, x8]\n"
+
+		// math block 1
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z0.d, z0.d, z2.d\n"
+
+		// math block 2
+		"mov z8.d, z6.d\n"
+		"asr z8.b, p0/m, z8.b, #7\n"
+		"lsl z6.b, p0/m, z6.b, #1\n"
+		"and z8.d, z8.d, z4.d\n"
+		"eor z6.d, z6.d, z8.d\n"
+		"eor z6.d, z6.d, z7.d\n"
+		"eor z5.d, z5.d, z7.d\n"
+
+		// math block 3
+		"mov z13.d, z11.d\n"
+		"asr z13.b, p0/m, z13.b, #7\n"
+		"lsl z11.b, p0/m, z11.b, #1\n"
+		"and z13.d, z13.d, z4.d\n"
+		"eor z11.d, z11.d, z13.d\n"
+		"eor z11.d, z11.d, z12.d\n"
+		"eor z10.d, z10.d, z12.d\n"
+
+		// math block 4
+		"mov z18.d, z16.d\n"
+		"asr z18.b, p0/m, z18.b, #7\n"
+		"lsl z16.b, p0/m, z16.b, #1\n"
+		"and z18.d, z18.d, z4.d\n"
+		"eor z16.d, z16.d, z18.d\n"
+		"eor z16.d, z16.d, z17.d\n"
+		"eor z15.d, z15.d, z17.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 1b\n"
+		"2:\n"
+
+		"mov w7, %w[start]\n"
+		"sub w7, w7, #1\n"
+		"3:\n"
+		"cmp w7, #0\n"
+		"blt 4f\n"
+
+		// math block 1
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+
+		// math block 2
+		"mov z8.d, z6.d\n"
+		"asr z8.b, p0/m, z8.b, #7\n"
+		"lsl z6.b, p0/m, z6.b, #1\n"
+		"and z8.d, z8.d, z4.d\n"
+		"eor z6.d, z6.d, z8.d\n"
+
+		// math block 3
+		"mov z13.d, z11.d\n"
+		"asr z13.b, p0/m, z13.b, #7\n"
+		"lsl z11.b, p0/m, z11.b, #1\n"
+		"and z13.d, z13.d, z4.d\n"
+		"eor z11.d, z11.d, z13.d\n"
+
+		// math block 4
+		"mov z18.d, z16.d\n"
+		"asr z18.b, p0/m, z18.b, #7\n"
+		"lsl z16.b, p0/m, z16.b, #1\n"
+		"and z18.d, z18.d, z4.d\n"
+		"eor z16.d, z16.d, z18.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 3b\n"
+		"4:\n"
+
+		// Load q and XOR
+		"ld1b z2.b, p0/z, [%[q], x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z7.b, p0/z, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z12.b, p0/z, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z17.b, p0/z, [%[q], x8]\n"
+
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z6.d, z6.d, z7.d\n"
+		"eor z11.d, z11.d, z12.d\n"
+		"eor z16.d, z16.d, z17.d\n"
+
+		// Store results
+		"st1b z0.b, p0, [%[p], x5]\n"
+		"st1b z1.b, p0, [%[q], x5]\n"
+		"add x8, x5, x3\n"
+		"st1b z5.b, p0, [%[p], x8]\n"
+		"st1b z6.b, p0, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"st1b z10.b, p0, [%[p], x8]\n"
+		"st1b z11.b, p0, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"st1b z15.b, p0, [%[p], x8]\n"
+		"st1b z16.b, p0, [%[q], x8]\n"
+
+		"add x8, x3, x3\n"
+		"add x5, x5, x8, lsl #1\n"
+		"cmp x5, %[bytes]\n"
+		"blt 0b\n"
+		:
+		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
+		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
+		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
+		  "z0", "z1", "z2", "z3", "z4",
+		  "z5", "z6", "z7", "z8",
+		  "z10", "z11", "z12", "z13",
+		  "z15", "z16", "z17", "z18"
+	);
+}
+
+#define RAID6_SVE_WRAPPER(_n)						\
+	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
+					size_t bytes, void **ptrs)	\
+	{								\
+		scoped_ksimd()						\
+		raid6_sve ## _n ## _gen_syndrome_real(disks,		\
+					(unsigned long)bytes, ptrs);	\
+	}								\
+	static void raid6_sve ## _n ## _xor_syndrome(int disks,		\
+					int start, int stop,		\
+					size_t bytes, void **ptrs)	\
+	{								\
+		scoped_ksimd()						\
+		raid6_sve ## _n ## _xor_syndrome_real(disks,		\
+				start, stop, (unsigned long)bytes, ptrs);\
+	}								\
+	struct raid6_calls const raid6_svex ## _n = {			\
+		raid6_sve ## _n ## _gen_syndrome,			\
+		raid6_sve ## _n ## _xor_syndrome,			\
+		raid6_have_sve,						\
+		"svex" #_n,						\
+		0							\
+	}
+
+static int raid6_have_sve(void)
+{
+	return system_supports_sve();
+}
+
+RAID6_SVE_WRAPPER(1);
+RAID6_SVE_WRAPPER(2);
+RAID6_SVE_WRAPPER(4);
-- 
2.43.0


+		  "z10", "z11", "z12", "z13",
+		  "z15", "z16", "z17", "z18"
+	);
+}
+
+static void raid6_sve4_xor_syndrome_real(int disks, int start, int stop,
+					 unsigned long bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	long z0 = stop;
+
+	p = dptr[disks - 2];
+	q = dptr[disks - 1];
+
+	asm volatile(
+		".arch armv8.2-a+sve\n"
+		"ptrue p0.b\n"
+		"cntb x3\n"
+		"mov w4, #0x1d\n"
+		"dup z4.b, w4\n"
+		"mov x5, #0\n"
+
+		"0:\n"
+		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
+		"ld1b z1.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z6.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z11.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z16.b, p0/z, [x6, x8]\n"
+
+		"ld1b z0.b, p0/z, [%[p], x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z5.b, p0/z, [%[p], x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z10.b, p0/z, [%[p], x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z15.b, p0/z, [%[p], x8]\n"
+
+		"eor z0.d, z0.d, z1.d\n"
+		"eor z5.d, z5.d, z6.d\n"
+		"eor z10.d, z10.d, z11.d\n"
+		"eor z15.d, z15.d, z16.d\n"
+
+		"mov w7, %w[z0]\n"
+		"sub w7, w7, #1\n"
+
+		"1:\n"
+		"cmp w7, %w[start]\n"
+		"blt 2f\n"
+
+		// software pipelining: load data early
+		"sxtw x8, w7\n"
+		"ldr x6, [%[dptr], x8, lsl #3]\n"
+		"ld1b z2.b, p0/z, [x6, x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z7.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z12.b, p0/z, [x6, x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z17.b, p0/z, [x6, x8]\n"
+
+		// math block 1
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z0.d, z0.d, z2.d\n"
+
+		// math block 2
+		"mov z8.d, z6.d\n"
+		"asr z8.b, p0/m, z8.b, #7\n"
+		"lsl z6.b, p0/m, z6.b, #1\n"
+		"and z8.d, z8.d, z4.d\n"
+		"eor z6.d, z6.d, z8.d\n"
+		"eor z6.d, z6.d, z7.d\n"
+		"eor z5.d, z5.d, z7.d\n"
+
+		// math block 3
+		"mov z13.d, z11.d\n"
+		"asr z13.b, p0/m, z13.b, #7\n"
+		"lsl z11.b, p0/m, z11.b, #1\n"
+		"and z13.d, z13.d, z4.d\n"
+		"eor z11.d, z11.d, z13.d\n"
+		"eor z11.d, z11.d, z12.d\n"
+		"eor z10.d, z10.d, z12.d\n"
+
+		// math block 4
+		"mov z18.d, z16.d\n"
+		"asr z18.b, p0/m, z18.b, #7\n"
+		"lsl z16.b, p0/m, z16.b, #1\n"
+		"and z18.d, z18.d, z4.d\n"
+		"eor z16.d, z16.d, z18.d\n"
+		"eor z16.d, z16.d, z17.d\n"
+		"eor z15.d, z15.d, z17.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 1b\n"
+		"2:\n"
+
+		"mov w7, %w[start]\n"
+		"sub w7, w7, #1\n"
+		"3:\n"
+		"cmp w7, #0\n"
+		"blt 4f\n"
+
+		// math block 1
+		"mov z3.d, z1.d\n"
+		"asr z3.b, p0/m, z3.b, #7\n"
+		"lsl z1.b, p0/m, z1.b, #1\n"
+		"and z3.d, z3.d, z4.d\n"
+		"eor z1.d, z1.d, z3.d\n"
+
+		// math block 2
+		"mov z8.d, z6.d\n"
+		"asr z8.b, p0/m, z8.b, #7\n"
+		"lsl z6.b, p0/m, z6.b, #1\n"
+		"and z8.d, z8.d, z4.d\n"
+		"eor z6.d, z6.d, z8.d\n"
+
+		// math block 3
+		"mov z13.d, z11.d\n"
+		"asr z13.b, p0/m, z13.b, #7\n"
+		"lsl z11.b, p0/m, z11.b, #1\n"
+		"and z13.d, z13.d, z4.d\n"
+		"eor z11.d, z11.d, z13.d\n"
+
+		// math block 4
+		"mov z18.d, z16.d\n"
+		"asr z18.b, p0/m, z18.b, #7\n"
+		"lsl z16.b, p0/m, z16.b, #1\n"
+		"and z18.d, z18.d, z4.d\n"
+		"eor z16.d, z16.d, z18.d\n"
+
+		"sub w7, w7, #1\n"
+		"b 3b\n"
+		"4:\n"
+
+		// Load q and XOR
+		"ld1b z2.b, p0/z, [%[q], x5]\n"
+		"add x8, x5, x3\n"
+		"ld1b z7.b, p0/z, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z12.b, p0/z, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"ld1b z17.b, p0/z, [%[q], x8]\n"
+
+		"eor z1.d, z1.d, z2.d\n"
+		"eor z6.d, z6.d, z7.d\n"
+		"eor z11.d, z11.d, z12.d\n"
+		"eor z16.d, z16.d, z17.d\n"
+
+		// Store results
+		"st1b z0.b, p0, [%[p], x5]\n"
+		"st1b z1.b, p0, [%[q], x5]\n"
+		"add x8, x5, x3\n"
+		"st1b z5.b, p0, [%[p], x8]\n"
+		"st1b z6.b, p0, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"st1b z10.b, p0, [%[p], x8]\n"
+		"st1b z11.b, p0, [%[q], x8]\n"
+		"add x8, x8, x3\n"
+		"st1b z15.b, p0, [%[p], x8]\n"
+		"st1b z16.b, p0, [%[q], x8]\n"
+
+		"add x8, x3, x3\n"
+		"add x5, x5, x8, lsl #1\n"
+		"cmp x5, %[bytes]\n"
+		"blt 0b\n"
+		:
+		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
+		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
+		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
+		  "z0", "z1", "z2", "z3", "z4",
+		  "z5", "z6", "z7", "z8",
+		  "z10", "z11", "z12", "z13",
+		  "z15", "z16", "z17", "z18"
+	);
+}
+
+#define RAID6_SVE_WRAPPER(_n)						\
+	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
+					size_t bytes, void **ptrs)	\
+	{								\
+		scoped_ksimd()						\
+		raid6_sve ## _n ## _gen_syndrome_real(disks,		\
+					(unsigned long)bytes, ptrs);	\
+	}								\
+	static void raid6_sve ## _n ## _xor_syndrome(int disks,		\
+					int start, int stop,		\
+					size_t bytes, void **ptrs)	\
+	{								\
+		scoped_ksimd()						\
+		raid6_sve ## _n ## _xor_syndrome_real(disks,		\
+				start, stop, (unsigned long)bytes, ptrs);\
+	}								\
+	struct raid6_calls const raid6_svex ## _n = {			\
+		raid6_sve ## _n ## _gen_syndrome,			\
+		raid6_sve ## _n ## _xor_syndrome,			\
+		raid6_have_sve,						\
+		"svex" #_n,						\
+		0							\
+	}
+
+static int raid6_have_sve(void)
+{
+	return system_supports_sve();
+}
+
+RAID6_SVE_WRAPPER(1);
+RAID6_SVE_WRAPPER(2);
+RAID6_SVE_WRAPPER(4);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-18 15:02 [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation Demian Shulhan
@ 2026-03-24  7:45 ` Christoph Hellwig
  2026-03-24  8:00 ` Ard Biesheuvel
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-03-24  7:45 UTC (permalink / raw)
  To: Demian Shulhan
  Cc: Song Liu, Yu Kuai, Li Nan, linux-raid, linux-kernel,
	kernel test robot, Catalin Marinas, Ard Biesheuvel, Will Deacon,
	linux-arm-kernel

Hi Demian,

thanks for looking into this.

I've added the arm64 maintainers and arm list as that's your best bet
for someone actually understanding the low-level assembly code.

On Wed, Mar 18, 2026 at 03:02:45PM +0000, Demian Shulhan wrote:
> Note that for the recovery path (`xor_syndrome`), NEON may still be
> selected dynamically by the algorithm benchmark, as the recovery
> workload is heavily memory-bound.

The recovery side has no benchmarking; you need to manually select
a priority.

Note that I just sent out a "cleanup the RAID6 P/Q library" series that
makes this more explicit.  It also makes it clear that by prioritizing
implementations using better instructions where available, we can
short-cut the generation-side probing path a lot, which might be worth
looking into for this.

I'm also curious whether you looked into why the 4x unroll is slower
than the lesser unrolls, and whether that is inherent in the
implementation, or just an effect of the small number of disks, in that
we don't actually have 4 disks to unroll over for every other
iteration.  I.e., what would the numbers be if RAID6_TEST_DISKS was
increased to 10 or 18?

I plan to potentially select the unrolling variant based on the
number of "disks" to calculate over as a follow-on.

We'll have to wait for review on my series, but I'd love to just rebase
this on top if possible.  I can offer to do the work, but I'd need to
run it past you for testing and final review.

> +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)

Overly long line.

> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = disks - 3;
> +
> +	p = dptr[z0 + 1];
> +	q = dptr[z0 + 2];

I know all this is derived from existing code, but since I've started
to hate that pattern, I'll add my cosmetic comments:

This would read nicer by initializing at declaration time:

	u8 **dptr = (u8 **)ptrs;
	long z0 = disks - 3;
	u8 *p = dptr[z0 + 1];
	u8 *q = dptr[z0 + 2];

Also z0 might better be named z_last or last_disk, or stop as in the
_xor variant routines.

> +	asm volatile(

But I wonder if just implementing the entire routine as assembly in a
.S file would make more sense than this anyway?

> +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> +					 unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = stop;
> +
> +	p = dptr[disks - 2];
> +	q = dptr[disks - 1];
> +
> +	asm volatile(

Same comments here, plus just dropping z0 vs using stop directly.

> +#define RAID6_SVE_WRAPPER(_n)						\
> +	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
> +					size_t bytes, void **ptrs)	\
> +	{								\
> +		scoped_ksimd()						\
> +		raid6_sve ## _n ## _gen_syndrome_real(disks,		\
> +					(unsigned long)bytes, ptrs);	\

Missing indentation after the scoped_ksimd().  A lot of other code uses
separate compilation units for the SIMD code, which seems pretty useful
to avoid mixing SIMD with non-SIMD code.  That would also combine nicely
with the suggestion above to implement the low-level routines entirely
in assembly.

I'll leave comments on the actual assembly details to people that
actually know ARM64 assembly.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-18 15:02 [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation Demian Shulhan
  2026-03-24  7:45 ` Christoph Hellwig
@ 2026-03-24  8:00 ` Ard Biesheuvel
  2026-03-24 10:04   ` Mark Rutland
  1 sibling, 1 reply; 16+ messages in thread
From: Ard Biesheuvel @ 2026-03-24  8:00 UTC (permalink / raw)
  To: Demian Shulhan, Song Liu, Yu Kuai, Will Deacon, Catalin Marinas,
	Mark Rutland, broonie, linux-arm-kernel, robin.murphy,
	Christoph Hellwig
  Cc: Li Nan, linux-raid, linux-kernel

Hi Demian,

On Wed, 18 Mar 2026, at 16:02, Demian Shulhan wrote:
> Implement Scalable Vector Extension (SVE) optimized routines for RAID6
> syndrome generation and recovery on ARM64.
>
> The SVE instruction set allows for variable vector lengths (from 128 to
> 2048 bits), scaling automatically with the hardware capabilities. This
> implementation handles arbitrary SVE vector lengths using the `cntb`
> instruction to determine the runtime vector length.
>
> The implementation introduces `svex1`, `svex2`, and `svex4` algorithms.
> The `svex4` algorithm utilizes loop unrolling by 4 blocks per iteration
> and manual software pipelining (interleaving memory loads with XORs)
> to minimize instruction dependency stalls and maximize CPU pipeline
> utilization and memory bandwidth.
>
> Performance was tested on an AWS Graviton3 (Neoverse-V1) instance which
> features 256-bit SVE vector length. The `svex4` implementation outperforms
> the existing 128-bit `neonx4` baseline for syndrome generation:
>
> raid6: svex4    gen() 19688 MB/s
...
> raid6: neonx4   gen() 19612 MB/s

You're being very generous characterising a 0.3% speedup as 'outperforms'.

But the real problem here is that the kernel-mode SIMD API only supports NEON and not SVE, and preserves/restores only the 128-bit view of the NEON/SVE register file. So across any context switch, or any softirq that also uses kernel-mode SIMD, your SVE register values will get truncated.

Once we encounter a good use case for SVE in the kernel, we might reconsider this, but as it stands, this patch should not be applied.

(leaving the reply untrimmed for the benefit of the cc'ees I added)

> raid6: neonx2   gen() 16248 MB/s
> raid6: neonx1   gen() 13591 MB/s
> raid6: using algorithm svex4 gen() 19688 MB/s
> raid6: .... xor() 11212 MB/s, rmw enabled
> raid6: using neon recovery algorithm
>
> Note that for the recovery path (`xor_syndrome`), NEON may still be
> selected dynamically by the algorithm benchmark, as the recovery
> workload is heavily memory-bound.
>
> Signed-off-by: Demian Shulhan <demyansh@gmail.com>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202603181940.cFwYmYoi-lkp@intel.com/
> ---
>  include/linux/raid/pq.h |   3 +
>  lib/raid6/Makefile      |   5 +
>  lib/raid6/algos.c       |   5 +
>  lib/raid6/sve.c         | 675 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 688 insertions(+)
>  create mode 100644 lib/raid6/sve.c
>
> diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> index 2467b3be15c9..787cc57aea9d 100644
> --- a/include/linux/raid/pq.h
> +++ b/include/linux/raid/pq.h
> @@ -140,6 +140,9 @@ extern const struct raid6_calls raid6_neonx1;
>  extern const struct raid6_calls raid6_neonx2;
>  extern const struct raid6_calls raid6_neonx4;
>  extern const struct raid6_calls raid6_neonx8;
> +extern const struct raid6_calls raid6_svex1;
> +extern const struct raid6_calls raid6_svex2;
> +extern const struct raid6_calls raid6_svex4;
> 
>  /* Algorithm list */
>  extern const struct raid6_calls * const raid6_algos[];
> diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> index 5be0a4e60ab1..6cdaa6f206fb 100644
> --- a/lib/raid6/Makefile
> +++ b/lib/raid6/Makefile
> @@ -8,6 +8,7 @@ raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
>  raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
>                                vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>  raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
> +raid6_pq-$(CONFIG_ARM64_SVE) += sve.o
>  raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
>  raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
>  raid6_pq-$(CONFIG_RISCV_ISA_V) += rvv.o recov_rvv.o
> @@ -67,6 +68,10 @@ CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU)
>  CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU)
>  CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU)
>  CFLAGS_REMOVE_recov_neon_inner.o += $(CC_FLAGS_NO_FPU)
> +
> +CFLAGS_sve.o += $(CC_FLAGS_FPU)
> +CFLAGS_REMOVE_sve.o += $(CC_FLAGS_NO_FPU)
> +
>  targets += neon1.c neon2.c neon4.c neon8.c
>  $(obj)/neon%.c: $(src)/neon.uc $(src)/unroll.awk FORCE
>  	$(call if_changed,unroll)
> diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> index 799e0e5eac26..0ae73c3a4be3 100644
> --- a/lib/raid6/algos.c
> +++ b/lib/raid6/algos.c
> @@ -66,6 +66,11 @@ const struct raid6_calls * const raid6_algos[] = {
>  	&raid6_neonx2,
>  	&raid6_neonx1,
>  #endif
> +#ifdef CONFIG_ARM64_SVE
> +	&raid6_svex4,
> +	&raid6_svex2,
> +	&raid6_svex1,
> +#endif
>  #ifdef CONFIG_LOONGARCH
>  #ifdef CONFIG_CPU_HAS_LASX
>  	&raid6_lasx,
> diff --git a/lib/raid6/sve.c b/lib/raid6/sve.c
> new file mode 100644
> index 000000000000..d52937f806d4
> --- /dev/null
> +++ b/lib/raid6/sve.c
> @@ -0,0 +1,675 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * RAID-6 syndrome calculation using ARM SVE instructions
> + */
> +
> +#include <linux/raid/pq.h>
> +
> +#ifdef __KERNEL__
> +#include <asm/simd.h>
> +#include <linux/cpufeature.h>
> +#else
> +#define scoped_ksimd()
> +#define system_supports_sve() (1)
> +#endif
> +
> +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = disks - 3;
> +
> +	p = dptr[z0 + 1];
> +	q = dptr[z0 + 2];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z0.b, p0/z, [x6, x5]\n"
> +		"mov z1.d, z0.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, #0\n"
> +		"blt 2f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +
> +		"add x5, x5, x3\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4"
> +	);
> +}
> +
> +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> +					 unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = stop;
> +
> +	p = dptr[disks - 2];
> +	q = dptr[disks - 1];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z1.b, p0/z, [x6, x5]\n"
> +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> +		"eor z0.d, z0.d, z1.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, %w[start]\n"
> +		"blt 2f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"mov w7, %w[start]\n"
> +		"sub w7, w7, #1\n"
> +		"3:\n"
> +		"cmp w7, #0\n"
> +		"blt 4f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 3b\n"
> +		"4:\n"
> +
> +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> +		"eor z1.d, z1.d, z2.d\n"
> +
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +
> +		"add x5, x5, x3\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4"
> +	);
> +}
> +
> +static void raid6_sve2_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = disks - 3;
> +
> +	p = dptr[z0 + 1];
> +	q = dptr[z0 + 2];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z0.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z5.b, p0/z, [x6, x8]\n"
> +		"mov z1.d, z0.d\n"
> +		"mov z6.d, z5.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, #0\n"
> +		"blt 2f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [x6, x8]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z5.d, z5.d, z7.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +		"add x8, x5, x3\n"
> +		"st1b z5.b, p0, [%[p], x8]\n"
> +		"st1b z6.b, p0, [%[q], x8]\n"
> +
> +		"add x5, x5, x3\n"
> +		"add x5, x5, x3\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4",
> +		  "z5", "z6", "z7", "z8"
> +	);
> +}
> +
> +static void raid6_sve2_xor_syndrome_real(int disks, int start, int stop,
> +					 unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = stop;
> +
> +	p = dptr[disks - 2];
> +	q = dptr[disks - 1];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z1.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z6.b, p0/z, [x6, x8]\n"
> +
> +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> +		"ld1b z5.b, p0/z, [%[p], x8]\n"
> +
> +		"eor z0.d, z0.d, z1.d\n"
> +		"eor z5.d, z5.d, z6.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, %w[start]\n"
> +		"blt 2f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [x6, x8]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z5.d, z5.d, z7.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"mov w7, %w[start]\n"
> +		"sub w7, w7, #1\n"
> +		"3:\n"
> +		"cmp w7, #0\n"
> +		"blt 4f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 3b\n"
> +		"4:\n"
> +
> +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> +		"eor z1.d, z1.d, z2.d\n"
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [%[q], x8]\n"
> +		"eor z6.d, z6.d, z7.d\n"
> +		"st1b z5.b, p0, [%[p], x8]\n"
> +		"st1b z6.b, p0, [%[q], x8]\n"
> +
> +		"add x5, x5, x3\n"
> +		"add x5, x5, x3\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4",
> +		  "z5", "z6", "z7", "z8"
> +	);
> +}
> +
> +static void raid6_sve4_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = disks - 3;
> +
> +	p = dptr[z0 + 1];
> +	q = dptr[z0 + 2];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z0.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z5.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z10.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z15.b, p0/z, [x6, x8]\n"
> +
> +		"mov z1.d, z0.d\n"
> +		"mov z6.d, z5.d\n"
> +		"mov z11.d, z10.d\n"
> +		"mov z16.d, z15.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, #0\n"
> +		"blt 2f\n"
> +
> +		// software pipelining: load data early
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z12.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z17.b, p0/z, [x6, x8]\n"
> +
> +		// math block 1
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		// math block 2
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z5.d, z5.d, z7.d\n"
> +
> +		// math block 3
> +		"mov z13.d, z11.d\n"
> +		"asr z13.b, p0/m, z13.b, #7\n"
> +		"lsl z11.b, p0/m, z11.b, #1\n"
> +		"and z13.d, z13.d, z4.d\n"
> +		"eor z11.d, z11.d, z13.d\n"
> +		"eor z11.d, z11.d, z12.d\n"
> +		"eor z10.d, z10.d, z12.d\n"
> +
> +		// math block 4
> +		"mov z18.d, z16.d\n"
> +		"asr z18.b, p0/m, z18.b, #7\n"
> +		"lsl z16.b, p0/m, z16.b, #1\n"
> +		"and z18.d, z18.d, z4.d\n"
> +		"eor z16.d, z16.d, z18.d\n"
> +		"eor z16.d, z16.d, z17.d\n"
> +		"eor z15.d, z15.d, z17.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +		"add x8, x5, x3\n"
> +		"st1b z5.b, p0, [%[p], x8]\n"
> +		"st1b z6.b, p0, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"st1b z10.b, p0, [%[p], x8]\n"
> +		"st1b z11.b, p0, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"st1b z15.b, p0, [%[p], x8]\n"
> +		"st1b z16.b, p0, [%[q], x8]\n"
> +
> +		"add x8, x3, x3\n"
> +		"add x5, x5, x8, lsl #1\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4",
> +		  "z5", "z6", "z7", "z8",
> +		  "z10", "z11", "z12", "z13",
> +		  "z15", "z16", "z17", "z18"
> +	);
> +}
> +
> +static void raid6_sve4_xor_syndrome_real(int disks, int start, int stop,
> +					 unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = stop;
> +
> +	p = dptr[disks - 2];
> +	q = dptr[disks - 1];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z1.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z6.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z11.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z16.b, p0/z, [x6, x8]\n"
> +
> +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z5.b, p0/z, [%[p], x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z10.b, p0/z, [%[p], x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z15.b, p0/z, [%[p], x8]\n"
> +
> +		"eor z0.d, z0.d, z1.d\n"
> +		"eor z5.d, z5.d, z6.d\n"
> +		"eor z10.d, z10.d, z11.d\n"
> +		"eor z15.d, z15.d, z16.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, %w[start]\n"
> +		"blt 2f\n"
> +
> +		// software pipelining: load data early
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z12.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z17.b, p0/z, [x6, x8]\n"
> +
> +		// math block 1
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		// math block 2
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z5.d, z5.d, z7.d\n"
> +
> +		// math block 3
> +		"mov z13.d, z11.d\n"
> +		"asr z13.b, p0/m, z13.b, #7\n"
> +		"lsl z11.b, p0/m, z11.b, #1\n"
> +		"and z13.d, z13.d, z4.d\n"
> +		"eor z11.d, z11.d, z13.d\n"
> +		"eor z11.d, z11.d, z12.d\n"
> +		"eor z10.d, z10.d, z12.d\n"
> +
> +		// math block 4
> +		"mov z18.d, z16.d\n"
> +		"asr z18.b, p0/m, z18.b, #7\n"
> +		"lsl z16.b, p0/m, z16.b, #1\n"
> +		"and z18.d, z18.d, z4.d\n"
> +		"eor z16.d, z16.d, z18.d\n"
> +		"eor z16.d, z16.d, z17.d\n"
> +		"eor z15.d, z15.d, z17.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"mov w7, %w[start]\n"
> +		"sub w7, w7, #1\n"
> +		"3:\n"
> +		"cmp w7, #0\n"
> +		"blt 4f\n"
> +
> +		// math block 1
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		// math block 2
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +
> +		// math block 3
> +		"mov z13.d, z11.d\n"
> +		"asr z13.b, p0/m, z13.b, #7\n"
> +		"lsl z11.b, p0/m, z11.b, #1\n"
> +		"and z13.d, z13.d, z4.d\n"
> +		"eor z11.d, z11.d, z13.d\n"
> +
> +		// math block 4
> +		"mov z18.d, z16.d\n"
> +		"asr z18.b, p0/m, z18.b, #7\n"
> +		"lsl z16.b, p0/m, z16.b, #1\n"
> +		"and z18.d, z18.d, z4.d\n"
> +		"eor z16.d, z16.d, z18.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 3b\n"
> +		"4:\n"
> +
> +		// Load q and XOR
> +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z12.b, p0/z, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z17.b, p0/z, [%[q], x8]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z11.d, z11.d, z12.d\n"
> +		"eor z16.d, z16.d, z17.d\n"
> +
> +		// Store results
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +		"add x8, x5, x3\n"
> +		"st1b z5.b, p0, [%[p], x8]\n"
> +		"st1b z6.b, p0, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"st1b z10.b, p0, [%[p], x8]\n"
> +		"st1b z11.b, p0, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"st1b z15.b, p0, [%[p], x8]\n"
> +		"st1b z16.b, p0, [%[q], x8]\n"
> +
> +		"add x8, x3, x3\n"
> +		"add x5, x5, x8, lsl #1\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4",
> +		  "z5", "z6", "z7", "z8",
> +		  "z10", "z11", "z12", "z13",
> +		  "z15", "z16", "z17", "z18"
> +	);
> +}
> +
> +#define RAID6_SVE_WRAPPER(_n)						\
> +	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
> +					size_t bytes, void **ptrs)	\
> +	{								\
> +		scoped_ksimd()						\
> +		raid6_sve ## _n ## _gen_syndrome_real(disks,		\
> +					(unsigned long)bytes, ptrs);	\
> +	}								\
> +	static void raid6_sve ## _n ## _xor_syndrome(int disks,		\
> +					int start, int stop,		\
> +					size_t bytes, void **ptrs)	\
> +	{								\
> +		scoped_ksimd()						\
> +		raid6_sve ## _n ## _xor_syndrome_real(disks,		\
> +				start, stop, (unsigned long)bytes, ptrs);\
> +	}								\
> +	struct raid6_calls const raid6_svex ## _n = {			\
> +		raid6_sve ## _n ## _gen_syndrome,			\
> +		raid6_sve ## _n ## _xor_syndrome,			\
> +		raid6_have_sve,						\
> +		"svex" #_n,						\
> +		0							\
> +	}
> +
> +static int raid6_have_sve(void)
> +{
> +	return system_supports_sve();
> +}
> +
> +RAID6_SVE_WRAPPER(1);
> +RAID6_SVE_WRAPPER(2);
> +RAID6_SVE_WRAPPER(4);
> -- 
> 2.43.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-24  8:00 ` Ard Biesheuvel
@ 2026-03-24 10:04   ` Mark Rutland
  2026-03-29 13:01     ` Demian Shulhan
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Rutland @ 2026-03-24 10:04 UTC (permalink / raw)
  To: Ard Biesheuvel, Christoph Hellwig, Demian Shulhan
  Cc: Song Liu, Yu Kuai, Will Deacon, Catalin Marinas, broonie,
	linux-arm-kernel, robin.murphy, Li Nan, linux-raid, linux-kernel

On Tue, Mar 24, 2026 at 09:00:16AM +0100, Ard Biesheuvel wrote:
> On Wed, 18 Mar 2026, at 16:02, Demian Shulhan wrote:
> > Implement Scalable Vector Extension (SVE) optimized routines for RAID6
> > syndrome generation and recovery on ARM64.
> >
> > The SVE instruction set allows for variable vector lengths (from 128 to
> > 2048 bits), scaling automatically with the hardware capabilities. This
> > implementation handles arbitrary SVE vector lengths using the `cntb`
> > instruction to determine the runtime vector length.
> >
> > The implementation introduces `svex1`, `svex2`, and `svex4` algorithms.
> > The `svex4` algorithm utilizes loop unrolling by 4 blocks per iteration
> > and manual software pipelining (interleaving memory loads with XORs)
> > to minimize instruction dependency stalls and maximize CPU pipeline
> > utilization and memory bandwidth.
> >
> > Performance was tested on an AWS Graviton3 (Neoverse-V1) instance which
> > features 256-bit SVE vector length. The `svex4` implementation outperforms
> > the existing 128-bit `neonx4` baseline for syndrome generation:
> >
> > raid6: svex4    gen() 19688 MB/s
> ...
> > raid6: neonx4   gen() 19612 MB/s
> 
> You're being very generous characterising a 0.3% speedup as 'outperforms'
> 
> But the real problem here is that the kernel-mode SIMD API only
> supports NEON and not SVE, and preserves/restores only the 128-bit
> view on the NEON/SVE register file. So on any context switch, or in
> any softirq that also uses kernel-mode SIMD, your SVE register values
> will get truncated.

Just to be a bit more explicit, since only the NEON register file is
saved:

* The vector registers will be truncated to 128-bit across
  preemption or softirq.

* The predicates won't be saved/restored and will change arbitrarily
  across preemption.

* The VL won't be saved/restored, and might change arbitrarily across
  preemption.

* The VL to use hasn't been programmed, so performance might vary
  arbitrarily even in the absence of preemption.

... so this isn't even safe on machines with (only) a 128-bit VL, and
there are big open design questions for the infrastructure we'd need.

> Once we encounter a good use case for SVE in the kernel, we might
> reconsider this, but as it stands, this patch should not be applied.

I agree.

Christoph, please do not pick this or any other in-kernel SVE patches.
They cannot function correctly without additional infrastructure.

Demian, for patches that use NEON/SVE/SME/etc, please Cc LAKML
(linux-arm-kernel@lists.infradead.org), so that folk familiar with ARM
see the patches.

Mark

> (leaving the reply untrimmed for the benefit of the cc'ees I added)
> 
> > raid6: neonx2   gen() 16248 MB/s
> > raid6: neonx1   gen() 13591 MB/s
> > raid6: using algorithm svex4 gen() 19688 MB/s
> > raid6: .... xor() 11212 MB/s, rmw enabled
> > raid6: using neon recovery algorithm
> >
> > Note that for the recovery path (`xor_syndrome`), NEON may still be
> > selected dynamically by the algorithm benchmark, as the recovery
> > workload is heavily memory-bound.
> >
> > Signed-off-by: Demian Shulhan <demyansh@gmail.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> > Closes: https://lore.kernel.org/oe-kbuild-all/202603181940.cFwYmYoi-lkp@intel.com/
> > ---
> >  include/linux/raid/pq.h |   3 +
> >  lib/raid6/Makefile      |   5 +
> >  lib/raid6/algos.c       |   5 +
> >  lib/raid6/sve.c         | 675 ++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 688 insertions(+)
> >  create mode 100644 lib/raid6/sve.c
> >
> > diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> > index 2467b3be15c9..787cc57aea9d 100644
> > --- a/include/linux/raid/pq.h
> > +++ b/include/linux/raid/pq.h
> > @@ -140,6 +140,9 @@ extern const struct raid6_calls raid6_neonx1;
> >  extern const struct raid6_calls raid6_neonx2;
> >  extern const struct raid6_calls raid6_neonx4;
> >  extern const struct raid6_calls raid6_neonx8;
> > +extern const struct raid6_calls raid6_svex1;
> > +extern const struct raid6_calls raid6_svex2;
> > +extern const struct raid6_calls raid6_svex4;
> > 
> >  /* Algorithm list */
> >  extern const struct raid6_calls * const raid6_algos[];
> > diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> > index 5be0a4e60ab1..6cdaa6f206fb 100644
> > --- a/lib/raid6/Makefile
> > +++ b/lib/raid6/Makefile
> > @@ -8,6 +8,7 @@ raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
> >  raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
> >                                vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
> >  raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
> > +raid6_pq-$(CONFIG_ARM64_SVE) += sve.o
> >  raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
> >  raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
> >  raid6_pq-$(CONFIG_RISCV_ISA_V) += rvv.o recov_rvv.o
> > @@ -67,6 +68,10 @@ CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU)
> >  CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU)
> >  CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU)
> >  CFLAGS_REMOVE_recov_neon_inner.o += $(CC_FLAGS_NO_FPU)
> > +
> > +CFLAGS_sve.o += $(CC_FLAGS_FPU)
> > +CFLAGS_REMOVE_sve.o += $(CC_FLAGS_NO_FPU)
> > +
> >  targets += neon1.c neon2.c neon4.c neon8.c
> >  $(obj)/neon%.c: $(src)/neon.uc $(src)/unroll.awk FORCE
> >  	$(call if_changed,unroll)
> > diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> > index 799e0e5eac26..0ae73c3a4be3 100644
> > --- a/lib/raid6/algos.c
> > +++ b/lib/raid6/algos.c
> > @@ -66,6 +66,11 @@ const struct raid6_calls * const raid6_algos[] = {
> >  	&raid6_neonx2,
> >  	&raid6_neonx1,
> >  #endif
> > +#ifdef CONFIG_ARM64_SVE
> > +	&raid6_svex4,
> > +	&raid6_svex2,
> > +	&raid6_svex1,
> > +#endif
> >  #ifdef CONFIG_LOONGARCH
> >  #ifdef CONFIG_CPU_HAS_LASX
> >  	&raid6_lasx,
> > diff --git a/lib/raid6/sve.c b/lib/raid6/sve.c
> > new file mode 100644
> > index 000000000000..d52937f806d4
> > --- /dev/null
> > +++ b/lib/raid6/sve.c
> > @@ -0,0 +1,675 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * RAID-6 syndrome calculation using ARM SVE instructions
> > + */
> > +
> > +#include <linux/raid/pq.h>
> > +
> > +#ifdef __KERNEL__
> > +#include <asm/simd.h>
> > +#include <linux/cpufeature.h>
> > +#else
> > +#define scoped_ksimd()
> > +#define system_supports_sve() (1)
> > +#endif
> > +
> > +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = disks - 3;
> > +
> > +	p = dptr[z0 + 1];
> > +	q = dptr[z0 + 2];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z0.b, p0/z, [x6, x5]\n"
> > +		"mov z1.d, z0.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 2f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +
> > +		"add x5, x5, x3\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4"
> > +	);
> > +}
> > +
> > +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> > +					 unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = stop;
> > +
> > +	p = dptr[disks - 2];
> > +	q = dptr[disks - 1];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z1.b, p0/z, [x6, x5]\n"
> > +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> > +		"eor z0.d, z0.d, z1.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, %w[start]\n"
> > +		"blt 2f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"mov w7, %w[start]\n"
> > +		"sub w7, w7, #1\n"
> > +		"3:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 4f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 3b\n"
> > +		"4:\n"
> > +
> > +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> > +		"eor z1.d, z1.d, z2.d\n"
> > +
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +
> > +		"add x5, x5, x3\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4"
> > +	);
> > +}
> > +
> > +static void raid6_sve2_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = disks - 3;
> > +
> > +	p = dptr[z0 + 1];
> > +	q = dptr[z0 + 2];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z0.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z5.b, p0/z, [x6, x8]\n"
> > +		"mov z1.d, z0.d\n"
> > +		"mov z6.d, z5.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 2f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [x6, x8]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z5.d, z5.d, z7.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"st1b z5.b, p0, [%[p], x8]\n"
> > +		"st1b z6.b, p0, [%[q], x8]\n"
> > +
> > +		"add x5, x5, x3\n"
> > +		"add x5, x5, x3\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4",
> > +		  "z5", "z6", "z7", "z8"
> > +	);
> > +}
> > +
> > +static void raid6_sve2_xor_syndrome_real(int disks, int start, int stop,
> > +					 unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = stop;
> > +
> > +	p = dptr[disks - 2];
> > +	q = dptr[disks - 1];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z1.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z6.b, p0/z, [x6, x8]\n"
> > +
> > +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> > +		"ld1b z5.b, p0/z, [%[p], x8]\n"
> > +
> > +		"eor z0.d, z0.d, z1.d\n"
> > +		"eor z5.d, z5.d, z6.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, %w[start]\n"
> > +		"blt 2f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [x6, x8]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z5.d, z5.d, z7.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"mov w7, %w[start]\n"
> > +		"sub w7, w7, #1\n"
> > +		"3:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 4f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 3b\n"
> > +		"4:\n"
> > +
> > +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [%[q], x8]\n"
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"st1b z5.b, p0, [%[p], x8]\n"
> > +		"st1b z6.b, p0, [%[q], x8]\n"
> > +
> > +		"add x5, x5, x3\n"
> > +		"add x5, x5, x3\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4",
> > +		  "z5", "z6", "z7", "z8"
> > +	);
> > +}
> > +
> > +static void raid6_sve4_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = disks - 3;
> > +
> > +	p = dptr[z0 + 1];
> > +	q = dptr[z0 + 2];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z0.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z5.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z10.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z15.b, p0/z, [x6, x8]\n"
> > +
> > +		"mov z1.d, z0.d\n"
> > +		"mov z6.d, z5.d\n"
> > +		"mov z11.d, z10.d\n"
> > +		"mov z16.d, z15.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 2f\n"
> > +
> > +		// software pipelining: load data early
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z12.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z17.b, p0/z, [x6, x8]\n"
> > +
> > +		// math block 1
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		// math block 2
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z5.d, z5.d, z7.d\n"
> > +
> > +		// math block 3
> > +		"mov z13.d, z11.d\n"
> > +		"asr z13.b, p0/m, z13.b, #7\n"
> > +		"lsl z11.b, p0/m, z11.b, #1\n"
> > +		"and z13.d, z13.d, z4.d\n"
> > +		"eor z11.d, z11.d, z13.d\n"
> > +		"eor z11.d, z11.d, z12.d\n"
> > +		"eor z10.d, z10.d, z12.d\n"
> > +
> > +		// math block 4
> > +		"mov z18.d, z16.d\n"
> > +		"asr z18.b, p0/m, z18.b, #7\n"
> > +		"lsl z16.b, p0/m, z16.b, #1\n"
> > +		"and z18.d, z18.d, z4.d\n"
> > +		"eor z16.d, z16.d, z18.d\n"
> > +		"eor z16.d, z16.d, z17.d\n"
> > +		"eor z15.d, z15.d, z17.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"st1b z5.b, p0, [%[p], x8]\n"
> > +		"st1b z6.b, p0, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"st1b z10.b, p0, [%[p], x8]\n"
> > +		"st1b z11.b, p0, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"st1b z15.b, p0, [%[p], x8]\n"
> > +		"st1b z16.b, p0, [%[q], x8]\n"
> > +
> > +		"add x8, x3, x3\n"
> > +		"add x5, x5, x8, lsl #1\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4",
> > +		  "z5", "z6", "z7", "z8",
> > +		  "z10", "z11", "z12", "z13",
> > +		  "z15", "z16", "z17", "z18"
> > +	);
> > +}
> > +
> > +static void raid6_sve4_xor_syndrome_real(int disks, int start, int stop,
> > +					 unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = stop;
> > +
> > +	p = dptr[disks - 2];
> > +	q = dptr[disks - 1];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z1.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z6.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z11.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z16.b, p0/z, [x6, x8]\n"
> > +
> > +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z5.b, p0/z, [%[p], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z10.b, p0/z, [%[p], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z15.b, p0/z, [%[p], x8]\n"
> > +
> > +		"eor z0.d, z0.d, z1.d\n"
> > +		"eor z5.d, z5.d, z6.d\n"
> > +		"eor z10.d, z10.d, z11.d\n"
> > +		"eor z15.d, z15.d, z16.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, %w[start]\n"
> > +		"blt 2f\n"
> > +
> > +		// software pipelining: load data early
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z12.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z17.b, p0/z, [x6, x8]\n"
> > +
> > +		// math block 1
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		// math block 2
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z5.d, z5.d, z7.d\n"
> > +
> > +		// math block 3
> > +		"mov z13.d, z11.d\n"
> > +		"asr z13.b, p0/m, z13.b, #7\n"
> > +		"lsl z11.b, p0/m, z11.b, #1\n"
> > +		"and z13.d, z13.d, z4.d\n"
> > +		"eor z11.d, z11.d, z13.d\n"
> > +		"eor z11.d, z11.d, z12.d\n"
> > +		"eor z10.d, z10.d, z12.d\n"
> > +
> > +		// math block 4
> > +		"mov z18.d, z16.d\n"
> > +		"asr z18.b, p0/m, z18.b, #7\n"
> > +		"lsl z16.b, p0/m, z16.b, #1\n"
> > +		"and z18.d, z18.d, z4.d\n"
> > +		"eor z16.d, z16.d, z18.d\n"
> > +		"eor z16.d, z16.d, z17.d\n"
> > +		"eor z15.d, z15.d, z17.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"mov w7, %w[start]\n"
> > +		"sub w7, w7, #1\n"
> > +		"3:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 4f\n"
> > +
> > +		// math block 1
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		// math block 2
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +
> > +		// math block 3
> > +		"mov z13.d, z11.d\n"
> > +		"asr z13.b, p0/m, z13.b, #7\n"
> > +		"lsl z11.b, p0/m, z11.b, #1\n"
> > +		"and z13.d, z13.d, z4.d\n"
> > +		"eor z11.d, z11.d, z13.d\n"
> > +
> > +		// math block 4
> > +		"mov z18.d, z16.d\n"
> > +		"asr z18.b, p0/m, z18.b, #7\n"
> > +		"lsl z16.b, p0/m, z16.b, #1\n"
> > +		"and z18.d, z18.d, z4.d\n"
> > +		"eor z16.d, z16.d, z18.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 3b\n"
> > +		"4:\n"
> > +
> > +		// Load q and XOR
> > +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z12.b, p0/z, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z17.b, p0/z, [%[q], x8]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z11.d, z11.d, z12.d\n"
> > +		"eor z16.d, z16.d, z17.d\n"
> > +
> > +		// Store results
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"st1b z5.b, p0, [%[p], x8]\n"
> > +		"st1b z6.b, p0, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"st1b z10.b, p0, [%[p], x8]\n"
> > +		"st1b z11.b, p0, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"st1b z15.b, p0, [%[p], x8]\n"
> > +		"st1b z16.b, p0, [%[q], x8]\n"
> > +
> > +		"add x8, x3, x3\n"
> > +		"add x5, x5, x8, lsl #1\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4",
> > +		  "z5", "z6", "z7", "z8",
> > +		  "z10", "z11", "z12", "z13",
> > +		  "z15", "z16", "z17", "z18"
> > +	);
> > +}
> > +
> > +#define RAID6_SVE_WRAPPER(_n)						\
> > +	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
> > +					size_t bytes, void **ptrs)	\
> > +	{								\
> > +		scoped_ksimd()						\
> > +		raid6_sve ## _n ## _gen_syndrome_real(disks,		\
> > +					(unsigned long)bytes, ptrs);	\
> > +	}								\
> > +	static void raid6_sve ## _n ## _xor_syndrome(int disks,		\
> > +					int start, int stop,		\
> > +					size_t bytes, void **ptrs)	\
> > +	{								\
> > +		scoped_ksimd()						\
> > +		raid6_sve ## _n ## _xor_syndrome_real(disks,		\
> > +				start, stop, (unsigned long)bytes, ptrs);\
> > +	}								\
> > +	struct raid6_calls const raid6_svex ## _n = {			\
> > +		raid6_sve ## _n ## _gen_syndrome,			\
> > +		raid6_sve ## _n ## _xor_syndrome,			\
> > +		raid6_have_sve,						\
> > +		"svex" #_n,						\
> > +		0							\
> > +	}
> > +
> > +static int raid6_have_sve(void)
> > +{
> > +	return system_supports_sve();
> > +}
> > +
> > +RAID6_SVE_WRAPPER(1);
> > +RAID6_SVE_WRAPPER(2);
> > +RAID6_SVE_WRAPPER(4);
> > -- 
> > 2.43.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-24 10:04   ` Mark Rutland
@ 2026-03-29 13:01     ` Demian Shulhan
  2026-03-30  5:30       ` Christoph Hellwig
  2026-03-30 16:39       ` Ard Biesheuvel
  0 siblings, 2 replies; 16+ messages in thread
From: Demian Shulhan @ 2026-03-29 13:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Ard Biesheuvel, Christoph Hellwig, Song Liu, Yu Kuai, Will Deacon,
	Catalin Marinas, broonie, linux-arm-kernel, robin.murphy, Li Nan,
	linux-raid, linux-kernel

Hi all,

Thanks for the feedback and for clarifying the current limitations of
the kernel-mode SIMD API regarding SVE context preservation. I
completely understand why this patch cannot be merged in its current
state until the fundamental SVE infrastructure is in place.

However, I want to address the comment about the marginal 0.3% speedup
on the 8-disk benchmark. While an 8-disk array is indeed bottlenecked
on pure memory bandwidth, that doesn't reveal the whole picture. I
extracted the SVE and NEON implementations into a user-space benchmark
and measured the actual hardware efficiency with perf stat, running on
the same AWS Graviton3 (Neoverse-V1) instance.

The results show a large difference in CPU efficiency. For the same
8-disk workload, the svex4 implementation requires about 35% fewer
instructions and 46% fewer CPU cycles than neonx4 (7.58 billion
instructions vs 11.62 billion). This translates directly into energy
savings and reduced pressure on the CPU frontend, leaving more compute
resources available for network and NVMe queues during an array
rebuild.

Furthermore, as Christoph suggested, I tested scalability on wider
arrays, since the default kernel benchmark is hardcoded to 8 disks,
which doesn't give the unrolled SVE loop enough data to shine. On a
16-disk array, svex4 hits 15.1 GB/s compared to 8.0 GB/s for neonx4.
On a 24-disk array, while neonx4 drops to 7.8 GB/s, svex4 maintains a
stable 15.0 GB/s, effectively doubling the throughput.

I agree this patch should be put on hold for now. My intention is to
leave these numbers here as evidence that implementing SVE context
preservation in the kernel (the "good use case") is highly justifiable
from both a power-efficiency and a wide-array throughput perspective
on modern ARM64 hardware.

Thanks again for your time and review!

---------------------------------------------------
User space test results:
==================================================
    RAID6 SVE Benchmark Results (AWS Graviton3)
==================================================
Instance Details:
Linux ip-172-31-87-234 6.8.0-1047-aws #50~22.04.1-Ubuntu SMP Thu Feb
19 20:49:25 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
--------------------------------------------------

[Test 1: Energy Efficiency / Instruction Count (8 disks)]
Running baseline (neonx4)...
algo=neonx4 ndisks=8 iterations=1000000 time=2.681s MB/s=8741.36

 Performance counter stats for './raid6_bench neonx4 8 1000000':

       11626717224      instructions                     #    1.67  insn per cycle
        6946699489      cycles
         257013219      L1-dcache-load-misses

       2.681213149 seconds time elapsed

       2.676771000 seconds user
       0.002000000 seconds sys


Running SVE (svex1)...
algo=svex1 ndisks=8 iterations=1000000 time=1.688s MB/s=13885.23

 Performance counter stats for './raid6_bench svex1 8 1000000':

       10527277490      instructions                     #    2.40  insn per cycle
        4379539835      cycles
         175695656      L1-dcache-load-misses

       1.688852006 seconds time elapsed

       1.687298000 seconds user
       0.000999000 seconds sys


Running SVE unrolled x4 (svex4)...
algo=svex4 ndisks=8 iterations=1000000 time=1.445s MB/s=16215.04

 Performance counter stats for './raid6_bench svex4 8 1000000':

        7587813392      instructions                     #    2.02  insn per cycle
        3748486131      cycles
         213816184      L1-dcache-load-misses

       1.446032415 seconds time elapsed

       1.442412000 seconds user
       0.002996000 seconds sys

==================================================
[Test 2: Scalability on Wide RAID Arrays (MB/s)]
--- 16 Disks ---
algo=neonx4 ndisks=16 iterations=1000000 time=6.783s MB/s=8062.33
algo=svex1 ndisks=16 iterations=1000000 time=4.912s MB/s=11132.90
algo=svex4 ndisks=16 iterations=1000000 time=3.601s MB/s=15188.85

--- 24 Disks ---
algo=neonx4 ndisks=24 iterations=1000000 time=11.011s MB/s=7805.02
algo=svex1 ndisks=24 iterations=1000000 time=8.843s MB/s=9718.26
algo=svex4 ndisks=24 iterations=1000000 time=5.719s MB/s=15026.92

Extra tests:
--- 48 Disks ---
algo=neonx4 ndisks=48 iterations=500000 time=11.826s MB/s=7597.25
algo=svex4 ndisks=48 iterations=500000 time=5.808s MB/s=15468.10
--- 96 Disks ---
algo=neonx4 ndisks=96 iterations=200000 time=9.783s MB/s=7507.01
algo=svex4 ndisks=96 iterations=200000 time=4.701s MB/s=15621.17
==================================================

On Tue, 24 Mar 2026 at 12:05, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Mar 24, 2026 at 09:00:16AM +0100, Ard Biesheuvel wrote:
> > On Wed, 18 Mar 2026, at 16:02, Demian Shulhan wrote:
> > > Implement Scalable Vector Extension (SVE) optimized routines for RAID6
> > > syndrome generation and recovery on ARM64.
> > >
> > > The SVE instruction set allows for variable vector lengths (from 128 to
> > > 2048 bits), scaling automatically with the hardware capabilities. This
> > > implementation handles arbitrary SVE vector lengths using the `cntb`
> > > instruction to determine the runtime vector length.
> > >
> > > The implementation introduces `svex1`, `svex2`, and `svex4` algorithms.
> > > The `svex4` algorithm utilizes loop unrolling by 4 blocks per iteration
> > > and manual software pipelining (interleaving memory loads with XORs)
> > > to minimize instruction dependency stalls and maximize CPU pipeline
> > > utilization and memory bandwidth.
> > >
> > > Performance was tested on an AWS Graviton3 (Neoverse-V1) instance which
> > > features 256-bit SVE vector length. The `svex4` implementation outperforms
> > > the existing 128-bit `neonx4` baseline for syndrome generation:
> > >
> > > raid6: svex4    gen() 19688 MB/s
> > ...
> > > raid6: neonx4   gen() 19612 MB/s
> >
> > You're being very generous characterising a 0.3% speedup as 'outperforms'
> >
> > But the real problem here is that the kernel-mode SIMD API only
> > supports NEON and not SVE, and preserves/restores only the 128-bit
> > view on the NEON/SVE register file. So on any context switch, or in
> > any softirq that also uses kernel-mode SIMD, your SVE register values
> > will get truncated.
>
> Just to be a bit more explicit, since only the NEON register file is
> saved:
>
> * The vector registers will be truncated to 128-bit across
>   preemption or softirq.
>
> * The predicates won't be saved/restored and will change arbitrarily
>   across preemption.
>
> * The VL won't be saved/restored, and might change arbitrarily across
>   preemption.
>
> * The VL to use hasn't been programmed, so performance might vary
>   arbitrarily even in the absence of preemption.
>
> ... so this isn't even safe on machines with (only) a 128-bit VL, and
> there are big open design questions for the infrastructure we'd need.
>
> > Once we encounter a good use case for SVE in the kernel, we might
> > reconsider this, but as it stands, this patch should not be applied.
>
> I agree.
>
> Christoph, please do not pick this or any other in-kernel SVE patches.
> They cannot function correctly without additional infrastructure.
>
> Demian, for patches that use NEON/SVE/SME/etc, please Cc LAKML
> (linux-arm-kernel@lists.infradead.org), so that folk familiar with ARM
> see the patches.
>
> Mark
>
> > (leaving the reply untrimmed for the benefit of the cc'ees I added)
> >
> > > raid6: neonx2   gen() 16248 MB/s
> > > raid6: neonx1   gen() 13591 MB/s
> > > raid6: using algorithm svex4 gen() 19688 MB/s
> > > raid6: .... xor() 11212 MB/s, rmw enabled
> > > raid6: using neon recovery algorithm
> > >
> > > Note that for the recovery path (`xor_syndrome`), NEON may still be
> > > selected dynamically by the algorithm benchmark, as the recovery
> > > workload is heavily memory-bound.
> > >
> > > Signed-off-by: Demian Shulhan <demyansh@gmail.com>
> > > Reported-by: kernel test robot <lkp@intel.com>
> > > Closes: https://lore.kernel.org/oe-kbuild-all/202603181940.cFwYmYoi-lkp@intel.com/
> > > ---
> > >  include/linux/raid/pq.h |   3 +
> > >  lib/raid6/Makefile      |   5 +
> > >  lib/raid6/algos.c       |   5 +
> > >  lib/raid6/sve.c         | 675 ++++++++++++++++++++++++++++++++++++++++
> > >  4 files changed, 688 insertions(+)
> > >  create mode 100644 lib/raid6/sve.c
> > >
> > > diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> > > index 2467b3be15c9..787cc57aea9d 100644
> > > --- a/include/linux/raid/pq.h
> > > +++ b/include/linux/raid/pq.h
> > > @@ -140,6 +140,9 @@ extern const struct raid6_calls raid6_neonx1;
> > >  extern const struct raid6_calls raid6_neonx2;
> > >  extern const struct raid6_calls raid6_neonx4;
> > >  extern const struct raid6_calls raid6_neonx8;
> > > +extern const struct raid6_calls raid6_svex1;
> > > +extern const struct raid6_calls raid6_svex2;
> > > +extern const struct raid6_calls raid6_svex4;
> > >
> > >  /* Algorithm list */
> > >  extern const struct raid6_calls * const raid6_algos[];
> > > diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> > > index 5be0a4e60ab1..6cdaa6f206fb 100644
> > > --- a/lib/raid6/Makefile
> > > +++ b/lib/raid6/Makefile
> > > @@ -8,6 +8,7 @@ raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
> > >  raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
> > >                                vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
> > >  raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
> > > +raid6_pq-$(CONFIG_ARM64_SVE) += sve.o
> > >  raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
> > >  raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
> > >  raid6_pq-$(CONFIG_RISCV_ISA_V) += rvv.o recov_rvv.o
> > > @@ -67,6 +68,10 @@ CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU)
> > >  CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU)
> > >  CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU)
> > >  CFLAGS_REMOVE_recov_neon_inner.o += $(CC_FLAGS_NO_FPU)
> > > +
> > > +CFLAGS_sve.o += $(CC_FLAGS_FPU)
> > > +CFLAGS_REMOVE_sve.o += $(CC_FLAGS_NO_FPU)
> > > +
> > >  targets += neon1.c neon2.c neon4.c neon8.c
> > >  $(obj)/neon%.c: $(src)/neon.uc $(src)/unroll.awk FORCE
> > >     $(call if_changed,unroll)
> > > diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> > > index 799e0e5eac26..0ae73c3a4be3 100644
> > > --- a/lib/raid6/algos.c
> > > +++ b/lib/raid6/algos.c
> > > @@ -66,6 +66,11 @@ const struct raid6_calls * const raid6_algos[] = {
> > >     &raid6_neonx2,
> > >     &raid6_neonx1,
> > >  #endif
> > > +#ifdef CONFIG_ARM64_SVE
> > > +   &raid6_svex4,
> > > +   &raid6_svex2,
> > > +   &raid6_svex1,
> > > +#endif
> > >  #ifdef CONFIG_LOONGARCH
> > >  #ifdef CONFIG_CPU_HAS_LASX
> > >     &raid6_lasx,
> > > diff --git a/lib/raid6/sve.c b/lib/raid6/sve.c
> > > new file mode 100644
> > > index 000000000000..d52937f806d4
> > > --- /dev/null
> > > +++ b/lib/raid6/sve.c
> > > @@ -0,0 +1,675 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +/*
> > > + * RAID-6 syndrome calculation using ARM SVE instructions
> > > + */
> > > +
> > > +#include <linux/raid/pq.h>
> > > +
> > > +#ifdef __KERNEL__
> > > +#include <asm/simd.h>
> > > +#include <linux/cpufeature.h>
> > > +#else
> > > +#define scoped_ksimd()
> > > +#define system_supports_sve() (1)
> > > +#endif
> > > +
> > > +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = disks - 3;
> > > +
> > > +   p = dptr[z0 + 1];
> > > +   q = dptr[z0 + 2];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z0.b, p0/z, [x6, x5]\n"
> > > +           "mov z1.d, z0.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 2f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +
> > > +           "add x5, x5, x3\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> > > +                                    unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = stop;
> > > +
> > > +   p = dptr[disks - 2];
> > > +   q = dptr[disks - 1];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z1.b, p0/z, [x6, x5]\n"
> > > +           "ld1b z0.b, p0/z, [%[p], x5]\n"
> > > +           "eor z0.d, z0.d, z1.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, %w[start]\n"
> > > +           "blt 2f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "mov w7, %w[start]\n"
> > > +           "sub w7, w7, #1\n"
> > > +           "3:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 4f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 3b\n"
> > > +           "4:\n"
> > > +
> > > +           "ld1b z2.b, p0/z, [%[q], x5]\n"
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +
> > > +           "add x5, x5, x3\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve2_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = disks - 3;
> > > +
> > > +   p = dptr[z0 + 1];
> > > +   q = dptr[z0 + 2];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z0.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z5.b, p0/z, [x6, x8]\n"
> > > +           "mov z1.d, z0.d\n"
> > > +           "mov z6.d, z5.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 2f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z5.d, z5.d, z7.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "st1b z5.b, p0, [%[p], x8]\n"
> > > +           "st1b z6.b, p0, [%[q], x8]\n"
> > > +
> > > +           "add x5, x5, x3\n"
> > > +           "add x5, x5, x3\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4",
> > > +             "z5", "z6", "z7", "z8"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve2_xor_syndrome_real(int disks, int start, int stop,
> > > +                                    unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = stop;
> > > +
> > > +   p = dptr[disks - 2];
> > > +   q = dptr[disks - 1];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z1.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z6.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "ld1b z0.b, p0/z, [%[p], x5]\n"
> > > +           "ld1b z5.b, p0/z, [%[p], x8]\n"
> > > +
> > > +           "eor z0.d, z0.d, z1.d\n"
> > > +           "eor z5.d, z5.d, z6.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, %w[start]\n"
> > > +           "blt 2f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z5.d, z5.d, z7.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "mov w7, %w[start]\n"
> > > +           "sub w7, w7, #1\n"
> > > +           "3:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 4f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 3b\n"
> > > +           "4:\n"
> > > +
> > > +           "ld1b z2.b, p0/z, [%[q], x5]\n"
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [%[q], x8]\n"
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "st1b z5.b, p0, [%[p], x8]\n"
> > > +           "st1b z6.b, p0, [%[q], x8]\n"
> > > +
> > > +           "add x5, x5, x3\n"
> > > +           "add x5, x5, x3\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4",
> > > +             "z5", "z6", "z7", "z8"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve4_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = disks - 3;
> > > +
> > > +   p = dptr[z0 + 1];
> > > +   q = dptr[z0 + 2];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z0.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z5.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z10.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z15.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "mov z1.d, z0.d\n"
> > > +           "mov z6.d, z5.d\n"
> > > +           "mov z11.d, z10.d\n"
> > > +           "mov z16.d, z15.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 2f\n"
> > > +
> > > +           // software pipelining: load data early
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z12.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z17.b, p0/z, [x6, x8]\n"
> > > +
> > > +           // math block 1
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           // math block 2
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z5.d, z5.d, z7.d\n"
> > > +
> > > +           // math block 3
> > > +           "mov z13.d, z11.d\n"
> > > +           "asr z13.b, p0/m, z13.b, #7\n"
> > > +           "lsl z11.b, p0/m, z11.b, #1\n"
> > > +           "and z13.d, z13.d, z4.d\n"
> > > +           "eor z11.d, z11.d, z13.d\n"
> > > +           "eor z11.d, z11.d, z12.d\n"
> > > +           "eor z10.d, z10.d, z12.d\n"
> > > +
> > > +           // math block 4
> > > +           "mov z18.d, z16.d\n"
> > > +           "asr z18.b, p0/m, z18.b, #7\n"
> > > +           "lsl z16.b, p0/m, z16.b, #1\n"
> > > +           "and z18.d, z18.d, z4.d\n"
> > > +           "eor z16.d, z16.d, z18.d\n"
> > > +           "eor z16.d, z16.d, z17.d\n"
> > > +           "eor z15.d, z15.d, z17.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "st1b z5.b, p0, [%[p], x8]\n"
> > > +           "st1b z6.b, p0, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "st1b z10.b, p0, [%[p], x8]\n"
> > > +           "st1b z11.b, p0, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "st1b z15.b, p0, [%[p], x8]\n"
> > > +           "st1b z16.b, p0, [%[q], x8]\n"
> > > +
> > > +           "add x8, x3, x3\n"
> > > +           "add x5, x5, x8, lsl #1\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4",
> > > +             "z5", "z6", "z7", "z8",
> > > +             "z10", "z11", "z12", "z13",
> > > +             "z15", "z16", "z17", "z18"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve4_xor_syndrome_real(int disks, int start, int stop,
> > > +                                    unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = stop;
> > > +
> > > +   p = dptr[disks - 2];
> > > +   q = dptr[disks - 1];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z1.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z6.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z11.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z16.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "ld1b z0.b, p0/z, [%[p], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z5.b, p0/z, [%[p], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z10.b, p0/z, [%[p], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z15.b, p0/z, [%[p], x8]\n"
> > > +
> > > +           "eor z0.d, z0.d, z1.d\n"
> > > +           "eor z5.d, z5.d, z6.d\n"
> > > +           "eor z10.d, z10.d, z11.d\n"
> > > +           "eor z15.d, z15.d, z16.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, %w[start]\n"
> > > +           "blt 2f\n"
> > > +
> > > +           // software pipelining: load data early
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z12.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z17.b, p0/z, [x6, x8]\n"
> > > +
> > > +           // math block 1
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           // math block 2
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z5.d, z5.d, z7.d\n"
> > > +
> > > +           // math block 3
> > > +           "mov z13.d, z11.d\n"
> > > +           "asr z13.b, p0/m, z13.b, #7\n"
> > > +           "lsl z11.b, p0/m, z11.b, #1\n"
> > > +           "and z13.d, z13.d, z4.d\n"
> > > +           "eor z11.d, z11.d, z13.d\n"
> > > +           "eor z11.d, z11.d, z12.d\n"
> > > +           "eor z10.d, z10.d, z12.d\n"
> > > +
> > > +           // math block 4
> > > +           "mov z18.d, z16.d\n"
> > > +           "asr z18.b, p0/m, z18.b, #7\n"
> > > +           "lsl z16.b, p0/m, z16.b, #1\n"
> > > +           "and z18.d, z18.d, z4.d\n"
> > > +           "eor z16.d, z16.d, z18.d\n"
> > > +           "eor z16.d, z16.d, z17.d\n"
> > > +           "eor z15.d, z15.d, z17.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "mov w7, %w[start]\n"
> > > +           "sub w7, w7, #1\n"
> > > +           "3:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 4f\n"
> > > +
> > > +           // math block 1
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           // math block 2
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +
> > > +           // math block 3
> > > +           "mov z13.d, z11.d\n"
> > > +           "asr z13.b, p0/m, z13.b, #7\n"
> > > +           "lsl z11.b, p0/m, z11.b, #1\n"
> > > +           "and z13.d, z13.d, z4.d\n"
> > > +           "eor z11.d, z11.d, z13.d\n"
> > > +
> > > +           // math block 4
> > > +           "mov z18.d, z16.d\n"
> > > +           "asr z18.b, p0/m, z18.b, #7\n"
> > > +           "lsl z16.b, p0/m, z16.b, #1\n"
> > > +           "and z18.d, z18.d, z4.d\n"
> > > +           "eor z16.d, z16.d, z18.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 3b\n"
> > > +           "4:\n"
> > > +
> > > +           // Load q and XOR
> > > +           "ld1b z2.b, p0/z, [%[q], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z12.b, p0/z, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z17.b, p0/z, [%[q], x8]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z11.d, z11.d, z12.d\n"
> > > +           "eor z16.d, z16.d, z17.d\n"
> > > +
> > > +           // Store results
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "st1b z5.b, p0, [%[p], x8]\n"
> > > +           "st1b z6.b, p0, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "st1b z10.b, p0, [%[p], x8]\n"
> > > +           "st1b z11.b, p0, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "st1b z15.b, p0, [%[p], x8]\n"
> > > +           "st1b z16.b, p0, [%[q], x8]\n"
> > > +
> > > +           "add x8, x3, x3\n"
> > > +           "add x5, x5, x8, lsl #1\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4",
> > > +             "z5", "z6", "z7", "z8",
> > > +             "z10", "z11", "z12", "z13",
> > > +             "z15", "z16", "z17", "z18"
> > > +   );
> > > +}
> > > +
> > > +#define RAID6_SVE_WRAPPER(_n)                                              \
> > > +   static void raid6_sve ## _n ## _gen_syndrome(int disks,         \
> > > +                                   size_t bytes, void **ptrs)      \
> > > +   {                                                               \
> > > +           scoped_ksimd()                                          \
> > > +           raid6_sve ## _n ## _gen_syndrome_real(disks,            \
> > > +                                   (unsigned long)bytes, ptrs);    \
> > > +   }                                                               \
> > > +   static void raid6_sve ## _n ## _xor_syndrome(int disks,         \
> > > +                                   int start, int stop,            \
> > > +                                   size_t bytes, void **ptrs)      \
> > > +   {                                                               \
> > > +           scoped_ksimd()                                          \
> > > +           raid6_sve ## _n ## _xor_syndrome_real(disks,            \
> > > +                           start, stop, (unsigned long)bytes, ptrs);\
> > > +   }                                                               \
> > > +   struct raid6_calls const raid6_svex ## _n = {                   \
> > > +           raid6_sve ## _n ## _gen_syndrome,                       \
> > > +           raid6_sve ## _n ## _xor_syndrome,                       \
> > > +           raid6_have_sve,                                         \
> > > +           "svex" #_n,                                             \
> > > +           0                                                       \
> > > +   }
> > > +
> > > +static int raid6_have_sve(void)
> > > +{
> > > +   return system_supports_sve();
> > > +}
> > > +
> > > +RAID6_SVE_WRAPPER(1);
> > > +RAID6_SVE_WRAPPER(2);
> > > +RAID6_SVE_WRAPPER(4);
> > > --
> > > 2.43.0
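For readers tracing the asm above, the `xor_syndrome` RMW path reduces to the following scalar model (a sketch with illustrative names, not kernel code). Between `stop` and `start` each step both multiplies the running Q by 2 in GF(2^8) and folds in disk data (the asm's loop `1:`); below `start` only the multiplies remain (loop `3:`), before the result is XORed into the stored Q.

```c
#include <stddef.h>
#include <stdint.h>

/* GF(2^8) x2: the asm's asr/and/lsl/eor sequence with mask 0x1d. */
static uint8_t gf2_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((uint8_t)((int8_t)v >> 7) & 0x1d));
}

/*
 * Scalar model of raid6_sve1_xor_syndrome_real(): update P/Q for a
 * rewrite of data disks start..stop (inclusive), byte by byte.
 */
static void ref_xor_syndrome(int disks, int start, int stop,
			     size_t bytes, uint8_t **dptr)
{
	uint8_t *p = dptr[disks - 2], *q = dptr[disks - 1];

	for (size_t i = 0; i < bytes; i++) {
		uint8_t wq = dptr[stop][i];	/* the asm's initial ld1b */
		uint8_t wp = p[i] ^ wq;		/* fold into stored P */
		int z;

		/* loop 1: stop-1 .. start -- shift Q and fold in data */
		for (z = stop - 1; z >= start; z--) {
			wq = gf2_mul2(wq) ^ dptr[z][i];
			wp ^= dptr[z][i];
		}

		/* loop 3: start-1 .. 0 -- shift only, no data loads */
		for (z = start - 1; z >= 0; z--)
			wq = gf2_mul2(wq);

		p[i] = wp;
		q[i] = q[i] ^ wq;		/* fold into stored Q */
	}
}
```

The same structure holds for the sve2/sve4 variants; they only process two or four vector-length blocks of `i` per loop trip.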

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-29 13:01     ` Demian Shulhan
@ 2026-03-30  5:30       ` Christoph Hellwig
  2026-03-30 16:39       ` Ard Biesheuvel
  1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-03-30  5:30 UTC (permalink / raw)
  To: Demian Shulhan
  Cc: Mark Rutland, Ard Biesheuvel, Christoph Hellwig, Song Liu,
	Yu Kuai, Will Deacon, Catalin Marinas, broonie, linux-arm-kernel,
	robin.murphy, Li Nan, linux-raid, linux-kernel

On Sun, Mar 29, 2026 at 04:01:06PM +0300, Demian Shulhan wrote:
> Furthermore, as Christoph suggested, I tested scalability on wider
> arrays since the default kernel benchmark is hardcoded to 8 disks,
> which doesn't give the unrolled SVE loop enough data to shine. On a
> 16-disk array, svex4 hits 15.1 GB/s compared to 8.0 GB/s for neonx4.
> On a 24-disk array, while neonx4 chokes and drops to 7.8 GB/s, svex4
> maintains a stable 15.0 GB/s — effectively doubling the throughput. I
> agree this patch should be put on hold for now. My intention is to
> leave these numbers here as evidence that implementing SVE context
> preservation in the kernel (the "good use case") is highly justifiable
> from both a power-efficiency and a wide-array throughput perspective
> for modern ARM64 hardware.
> 
> Thanks again for your time and review!

To me this sounds like an interesting case for an SVE kernel API.
But I'm not really knowledgeable enough to provide one to help
with testing this further.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-29 13:01     ` Demian Shulhan
  2026-03-30  5:30       ` Christoph Hellwig
@ 2026-03-30 16:39       ` Ard Biesheuvel
  2026-03-31  6:36         ` Christoph Hellwig
  1 sibling, 1 reply; 16+ messages in thread
From: Ard Biesheuvel @ 2026-03-30 16:39 UTC (permalink / raw)
  To: Demian Shulhan, Mark Rutland
  Cc: Christoph Hellwig, Song Liu, Yu Kuai, Will Deacon,
	Catalin Marinas, Mark Brown, linux-arm-kernel, robin.murphy,
	Li Nan, linux-raid, linux-kernel

Hi Demian,

On Sun, 29 Mar 2026, at 15:01, Demian Shulhan wrote:
> I want to address the comment about the marginal 0.3% speedup on the
> 8-disk benchmark. While the pure memory bandwidth on a small array is
> indeed bottlenecked, it doesn't reveal the whole picture. I extracted
> the SVE and NEON implementations into a user-space benchmark to
> measure the actual hardware efficiency using perf stat, running on the
> same AWS Graviton3 (Neoverse-V1) instance.The results show a massive
> difference in CPU efficiency. For the same 8-disk workload, the svex4
> implementation requires about 35% fewer instructions and 46% fewer CPU
> cycles compared to neonx4 (7.58 billion instructions vs 11.62
> billion). This translates directly into significant energy savings and
> reduced pressure on the CPU frontend, which would leave more compute
> resources available for network and NVMe queues during an array
> rebuild.
>

I think the results are impressive, but I'd like to better understand
their implications in a real-world scenario. Is this code only a
bottleneck when rebuilding an array? Is it really that much more power
efficient, given that the registers (and ALU paths) are twice the size?
And given the I/O load of rebuilding a 24+ disk array, how much CPU
throughput can we make use of meaningfully in such a scenario?

Supporting SVE in the kernel primarily impacts the size of the per-task
buffers that we need to preserve/restore the context. Fortunately,
these are no longer allocated for the lifetime of the task, but
dynamically (by scoped_ksimd()), and so the main impediment has been
recently removed. But as Mark pointed out, there are other things to
take into account. Nonetheless, our position has always been that a
compelling use case could convince us that the additional complexity
of in-kernel SVE is justified.

> Furthermore, as Christoph suggested, I tested scalability on wider
> arrays since the default kernel benchmark is hardcoded to 8 disks,
> which doesn't give the unrolled SVE loop enough data to shine. On a
> 16-disk array, svex4 hits 15.1 GB/s compared to 8.0 GB/s for neonx4.
> On a 24-disk array, while neonx4 chokes and drops to 7.8 GB/s, svex4
> maintains a stable 15.0 GB/s — effectively doubling the throughput.

Does this mean the kernel benchmark is no longer fit for purpose? If
it cannot distinguish between implementations that differ in performance
by a factor of 2, I don't think we can rely on it to pick the optimal one.

> I agree this patch should be put on hold for now. My intention is to
> leave these numbers here as evidence that implementing SVE context
> preservation in the kernel (the "good use case") is highly justifiable
> from both a power-efficiency and a wide-array throughput perspective
> for modern ARM64 hardware.
>

Could you please summarize the results? The output below seems to have
become mangled a bit. Please also include the command line, a link to
the test source, and the vector length of the implementation.



> Thanks again for your time and review!
>
> ---------------------------------------------------
> User space test results:
> ==================================================
>     RAID6 SVE Benchmark Results (AWS Graviton3)
> ==================================================
> Instance Details:
> Linux ip-172-31-87-234 6.8.0-1047-aws #50~22.04.1-Ubuntu SMP Thu Feb
> 19 20:49:25 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
> --------------------------------------------------
>
> [Test 1: Energy Efficiency / Instruction Count (8 disks)]
> Running baseline (neonx4)...
> algo=neonx4 ndisks=8 iterations=1000000 time=2.681s MB/s=8741.36
>
>  Performance counter stats for './raid6_bench neonx4 8 1000000':
>
>        11626717224      instructions                     #    1.67  insn per cycle
>         6946699489      cycles
>          257013219      L1-dcache-load-misses
>
>        2.681213149 seconds time elapsed
>
>        2.676771000 seconds user
>        0.002000000 seconds sys
>
>
> Running SVE (svex1)...
> algo=svex1 ndisks=8 iterations=1000000 time=1.688s MB/s=13885.23
>
>  Performance counter stats for './raid6_bench svex1 8 1000000':
>
>        10527277490      instructions                     #    2.40  insn per cycle
>         4379539835      cycles
>          175695656      L1-dcache-load-misses
>
>        1.688852006 seconds time elapsed
>
>        1.687298000 seconds user
>        0.000999000 seconds sys
>
>
> Running SVE unrolled x4 (svex4)...
> algo=svex4 ndisks=8 iterations=1000000 time=1.445s MB/s=16215.04
>
>  Performance counter stats for './raid6_bench svex4 8 1000000':
>
>         7587813392      instructions                     #    2.02  insn per cycle
>         3748486131      cycles
>          213816184      L1-dcache-load-misses
>
>        1.446032415 seconds time elapsed
>
>        1.442412000 seconds user
>        0.002996000 seconds sys
>
> ==================================================
> [Test 2: Scalability on Wide RAID Arrays (MB/s)]
> --- 16 Disks ---
> algo=neonx4 ndisks=16 iterations=1000000 time=6.783s MB/s=8062.33
> algo=svex1 ndisks=16 iterations=1000000 time=4.912s MB/s=11132.90
> algo=svex4 ndisks=16 iterations=1000000 time=3.601s MB/s=15188.85
>
> --- 24 Disks ---
> algo=neonx4 ndisks=24 iterations=1000000 time=11.011s MB/s=7805.02
> algo=svex1 ndisks=24 iterations=1000000 time=8.843s MB/s=9718.26
> algo=svex4 ndisks=24 iterations=1000000 time=5.719s MB/s=15026.92
>
> Extra tests:
> --- 48 Disks ---
> algo=neonx4 ndisks=48 iterations=500000 time=11.826s MB/s=7597.25
> algo=svex4 ndisks=48 iterations=500000 time=5.808s MB/s=15468.10
> --- 96 Disks ---
> algo=neonx4 ndisks=96 iterations=200000 time=9.783s MB/s=7507.01
> algo=svex4 ndisks=96 iterations=200000 time=4.701s MB/s=15621.17
> ==================================================
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-30 16:39       ` Ard Biesheuvel
@ 2026-03-31  6:36         ` Christoph Hellwig
  2026-03-31 13:18           ` Demian Shulhan
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2026-03-31  6:36 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Demian Shulhan, Mark Rutland, Christoph Hellwig, Song Liu,
	Yu Kuai, Will Deacon, Catalin Marinas, Mark Brown,
	linux-arm-kernel, robin.murphy, Li Nan, linux-raid, linux-kernel

On Mon, Mar 30, 2026 at 06:39:49PM +0200, Ard Biesheuvel wrote:
> I think the results are impressive, but I'd like to better understand
> their implications in a real-world scenario. Is this code only a
> bottleneck when rebuilding an array?

The syndrome generation is run every time you write data to a RAID6
array, and if you do partial stripe writes, it (or rather the XOR
variant) is run twice.  So this is the most performance-critical
path for writing to RAID6.

Rebuild usually runs totally different code, but can end up here as well
when both parity disks are lost.
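
The per-byte recurrence being discussed here is small; as a scalar
sketch (mirroring the reference implementation in lib/raid6/int.uc —
function and variable names here are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Multiply a GF(2^8) element by x, reduction polynomial 0x11d (RAID6's). */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/*
 * Scalar sketch of RAID6 syndrome generation: dptrs[0..disks-3] are the
 * data blocks, dptrs[disks-2] receives P (plain XOR parity) and
 * dptrs[disks-1] receives Q (Horner evaluation over GF(2^8)).
 */
static void raid6_gen_syndrome_ref(int disks, size_t bytes, uint8_t **dptrs)
{
	uint8_t *p = dptrs[disks - 2];
	uint8_t *q = dptrs[disks - 1];

	for (size_t i = 0; i < bytes; i++) {
		uint8_t wp = dptrs[disks - 3][i];	/* highest data disk */
		uint8_t wq = wp;

		for (int d = disks - 4; d >= 0; d--) {
			wp ^= dptrs[d][i];
			wq = (uint8_t)(gf_mul2(wq) ^ dptrs[d][i]);
		}
		p[i] = wp;
		q[i] = wq;
	}
}
```

The NEON and SVE variants compute the same wp/wq recurrence on whole
vectors at a time; svex4 simply processes four vectors' worth of bytes
per loop iteration.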

> > Furthermore, as Christoph suggested, I tested scalability on wider
> > arrays since the default kernel benchmark is hardcoded to 8 disks,
> > which doesn't give the unrolled SVE loop enough data to shine. On a
> > 16-disk array, svex4 hits 15.1 GB/s compared to 8.0 GB/s for neonx4.
> > On a 24-disk array, while neonx4 chokes and drops to 7.8 GB/s, svex4
> > maintains a stable 15.0 GB/s — effectively doubling the throughput.
> 
> Does this mean the kernel benchmark is no longer fit for purpose? If
> it cannot distinguish between implementations that differ in performance
> by a factor of 2, I don't think we can rely on it to pick the optimal one.

It is not good, and we should either fix it or run more than one.
The current setup is not really representative of a real-life array.
It also leads to wrong selections on x86, but only at the level of
which unroll factor to pick, and only for minor differences so far.
I plan to add this to the next version of the raid6 lib patches.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-31  6:36         ` Christoph Hellwig
@ 2026-03-31 13:18           ` Demian Shulhan
  2026-04-16 12:40             ` Demian Shulhan
  0 siblings, 1 reply; 16+ messages in thread
From: Demian Shulhan @ 2026-03-31 13:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ard Biesheuvel, Mark Rutland, Song Liu, Yu Kuai, Will Deacon,
	Catalin Marinas, Mark Brown, linux-arm-kernel, robin.murphy,
	Li Nan, linux-raid, linux-kernel

Hi all,

Ard, your questions regarding real-world I/O bottlenecks and SVE power
efficiency versus raw throughput are entirely valid. I agree that
introducing SVE support requires solid real-world data to justify the
added complexity.

Due to my current workload, I won't be able to run the necessary
hardware tests and prepare the benchmark code immediately. I will get
back to the list in about 1 week with the requested source code,
unmangled test results, and further analysis.

Thanks!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-31 13:18           ` Demian Shulhan
@ 2026-04-16 12:40             ` Demian Shulhan
  2026-04-16 13:39               ` Ard Biesheuvel
  0 siblings, 1 reply; 16+ messages in thread
From: Demian Shulhan @ 2026-04-16 12:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ard Biesheuvel, Mark Rutland, Song Liu, Yu Kuai, Will Deacon,
	Catalin Marinas, Mark Brown, linux-arm-kernel, robin.murphy,
	Li Nan, linux-raid, linux-kernel

Hi all,

Sorry for the delay. The tests became more complex than I initially
thought, so I needed to gather more data and properly validate the
results across different hardware configurations.

Firstly, I want to clarify the results from my March 29 tests. I found
a flaw in my initial custom benchmark. The massive 2x throughput gap on
24 disks wasn't solely due to SVE's superiority, but rather a severe L1
D-Cache thrashing issue that disproportionately penalized NEON.

My custom test lacked memset() initialization, causing all data buffers
to map to the Linux Zero Page (Virtually Indexed, Physically Tagged
cache aliasing). Furthermore, even with memset(), allocating contiguous
page-aligned buffers can cause severe Cache Address Sharing (a known
issue that Andrea Mazzoleni solved in SnapRAID 13 years ago using
RAID_MALLOC_DISPLACEMENT).
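
The displacement idea, roughly (the stride value, slot count, and helper
name below are made up for illustration, not SnapRAID's actual code):

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Sketch of displacement-based buffer allocation in the spirit of
 * SnapRAID's RAID_MALLOC_DISPLACEMENT: each disk buffer gets a different
 * starting offset within its allocation, so page-aligned buffers do not
 * all land on the same cache-set index.
 */
#define BENCH_DISPLACEMENT	1024	/* illustrative stride */
#define BENCH_MAX_SLOTS		16

static uint8_t *alloc_displaced(size_t block, int disk, void **to_free)
{
	/* Over-allocate so any slot's offset still fits in the block. */
	uint8_t *base = malloc(block + BENCH_MAX_SLOTS * BENCH_DISPLACEMENT);

	if (!base)
		return NULL;
	*to_free = base;	/* caller frees the base, not the view */
	return base + (disk % BENCH_MAX_SLOTS) * BENCH_DISPLACEMENT;
}
```

The point is only that staggering start offsets stops every per-disk
buffer from competing for the same L1 sets.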

Because SVE (svex4) uses 256-bit registers on Neoverse-V1, it performs
exactly half the number of memory load instructions compared to 128-bit
NEON. This dramatically reduced the L1 cache alias thrashing, allowing
SVE to survive the memory bottleneck while NEON choked:

Custom test without memset (4 KiB block):
 | algo=neonx4 ndisks=24 iterations=1M time=11.014s MB/s=7802.57
 | algo=svex4  ndisks=24 iterations=1M time=5.719s  MB/s=15026.92

Custom test with memset (4 KiB block):
 | algo=neonx4 ndisks=24 iterations=1M time=6.165s  MB/s=13939.08
 | algo=svex4  ndisks=24 iterations=1M time=5.839s  MB/s=14718.23

Even with the corrected memory setup, the throughput gap narrowed, but
the fundamental CPU-efficiency result remained fully intact.

To completely isolate these variables and provide accurate real-world
data, the following test campaigns were done based on the SnapRAID
project (https://github.com/amadvance/snapraid) using its
perf_bench.c tool with proper memory displacement and a 256 KiB block
size.

Test configurations:
- c7g.medium (AWS Graviton3, 1 vCPU): Neoverse-V1, 256-bit SVE
- c7g.xlarge (AWS Graviton3, 4 vCPUs): Neoverse-V1, 256-bit SVE
- c8g.xlarge (AWS Graviton4, 4 vCPUs): Neoverse-V2, 128-bit SVE


=========================================================
Section 1: SnapRAID Validation on Graviton3 / Neoverse-V1
=========================================================

These runs are the most representative userspace validation. The tests
were run with standard -O2 optimizations.

1.1 SnapRAID speedtest, O2, c7g.xlarge (Raw Throughput)

 disks  neonx4  neonx8   svex4  delta(nx4)  delta(nx8)
 -----  ------  ------  ------  ----------  ----------
     8   21394   21138   23601      +10.3%      +11.6%
    24   20368   19850   21009       +3.1%       +5.8%
    48   16727   19290   20222      +20.9%       +4.8%
    96   15562   18925   17549      +12.8%       -7.3%

1.2 perf_bench, O2, c7g.xlarge (Hardware Efficiency)

 disks  neonx4 inst  svex4 inst  reduction | neonx4 cyc  svex4 cyc | MB/s (N/S)
 -----  -----------  ----------  --------- | ----------  --------- | -----------
     8       4.02 B      2.61 B     -35.1% |     1.01 B     0.92 B | 20304/22346
    24      12.16 B      8.00 B     -34.2% |     3.20 B     3.11 B | 19354/19933
    48      24.37 B     16.08 B     -34.0% |     7.73 B     6.51 B | 16048/19047
    96      48.80 B     32.24 B     -33.9% |    16.94 B    15.11 B | 14638/16421

1.3 Main Graviton3 Conclusions
 - On 256-bit SVE hardware, svex4 consistently retires ~34% fewer
   instructions and ~10-15% fewer CPU cycles than neonx4.

=========================================================
Section 2: SnapRAID Validation on Graviton4 / Neoverse-V2
=========================================================

2.1 SnapRAID speedtest, O2, c8g.xlarge (Raw Throughput)

 disks  neonx4  neonx8   svex4  delta(nx4)  delta(nx8)
 -----  ------  ------  ------  ----------  ----------
     8   24802   25409   20451      -17.5%      -19.5%
    24   22607   24026   18577      -17.8%      -22.7%
    48   20984   22171   18019      -14.1%      -18.7%
    96   21254   21690   17108      -19.5%      -21.1%

2.2 perf_bench, O2, c8g.xlarge (Hardware Efficiency)

 disks  neonx4 inst  svex4 inst   overhead | neonx4 cyc  svex4 cyc | MB/s (N/S)
 -----  -----------  ----------  --------- | ----------  --------- | -----------
     8       4.02 B      5.22 B     +29.9% |     0.95 B     1.14 B | 23529/19512
    24      12.16 B     15.98 B     +31.4% |     3.11 B     3.79 B | 21621/17777
    48      24.37 B     32.12 B     +31.8% |     6.70 B     7.81 B | 20000/17204
    96      48.78 B     64.40 B     +32.0% |    13.24 B    16.32 B | 20253/16410

2.3 Main Graviton4 Conclusions
 - On Neoverse-V2, SVE vector length is 128-bit (same as NEON).
 - Without the 256-bit width, NEON outperforms SVE.
 - svex4 retires ~32% MORE instructions here and is consistently slower.

=========================================================
Section 3: Validation on c7g.medium (1 vCPU)
=========================================================

3.1 SnapRAID speedtest, O2, c7g.medium (Raw Throughput)

 disks  neonx4  neonx8   svex4  delta(nx4)  delta(nx8)
 -----  ------  ------  ------  ----------  ----------
     8   16768   17466   17310       +3.2%       -0.9%
    24   15843   16684   16205       +2.3%       -2.9%
    48   14032   14475   15389       +9.7%       +6.3%
    96   13404   13045   14677       +9.5%      +12.5%

3.2 perf_bench, O2, c7g.medium (Hardware Efficiency)

 disks  neonx4 inst  svex4 inst  reduction | neonx4 cyc  svex4 cyc | MB/s (N/S)
 -----  -----------  ----------  --------- | ----------  --------- | -----------
     8       3.99 B      2.61 B     -34.6% |     1.30 B     1.25 B | 16000/16666
    24      12.13 B      8.00 B     -34.0% |     4.08 B     4.02 B | 15189/15483
    48      24.34 B     16.08 B     -33.9% |     9.23 B     8.35 B | 13445/14860
    96      48.76 B     32.24 B     -33.9% |    19.34 B    17.92 B | 12834/13852

3.3 Main c7g.medium Conclusions
 - The instruction count reduction (~34%) perfectly matches the 4-vCPU
   instance.
 - The single vCPU is heavily memory-bandwidth constrained (cycle counts
   are much higher waiting for RAM).

=========================================================
Section 4: The Pitfalls of the Current Kernel Benchmark
=========================================================

As Christoph pointed out, the current in-kernel benchmark setup
(hardcoded to 8 disks and a PAGE_SIZE buffer) may not be representative
of real-life arrays.

Because 8 disks * 4 KiB = 32 KiB total data, the entire benchmark fits
into the 64 KiB L1 D-Cache of Neoverse-V1, masking memory bandwidth limits
and register spilling. This leads to objectively wrong selections.

---------------------------------------------------
Case 1: Wrong NEON unrolling selection (Graviton3)
--------------------------------------------------
The kernel benchmark tests 8 disks and locks in neonx4. However, on
real-world wide arrays (48-96 disks), neonx8 is significantly faster.

 disks     neonx4 MB/s    neonx8 MB/s    Actual Winner  Kernel's Choice
 --------  -------------  -------------  -------------  ---------------
  8 (Boot) 21,394         21,138         neonx4         neonx4 (Locked)
 48        16,727         19,290         neonx8         neonx4 (-15.3%)
 96        15,562         18,925         neonx8         neonx4 (-21.6%)

Result: Users lose up to 21% NEON throughput because of the 8-disk test.

---------------------------------------------------
Case 2: Wrong SVE vs NEON selection (Graviton3)
--------------------------------------------------
If SVE is enabled, the 8-disk benchmark strongly prefers svex4. But on
extreme wide arrays (96 disks), the heavily unrolled neonx8 actually
overtakes SVE.

 disks     neonx8 MB/s    svex4 MB/s     Actual Winner  Kernel's Choice
 --------  -------------  -------------  -------------  ---------------
  8 (Boot) 21,138         23,601         svex4          svex4 (Locked)
 96        18,925         17,549         neonx8         svex4 (-7.8%)

Result: On extreme workloads, forcing svex4 loses ~7.8% throughput.

Conclusion: The kernel benchmark requires testing with larger buffers
(exceeding L1 capacity) or simulated wide arrays to guarantee the optimal
algorithm is chosen for actual storage workloads.

---------------------------------------------------
Case 3: Buffer size distortion (Graviton3, 8 disks)
---------------------------------------------------
Even on the exact same 8-disk array, testing with a 4 KiB buffer (which
fits entirely in the L1 cache) yields a completely different winner than
testing with 256 KiB buffer (which exercises L2/L3/RAM).

 buffer      neonx4 MB/s    svex4 MB/s     Actual Winner  Kernel's Choice
 ----------  -------------  -------------  -------------  ---------------
 4 KiB       20211          19818          neonx4         neonx4 (Locked)
 256 KiB     21394          23601          svex4          neonx4 (-9.3%)

Result: By benchmarking exclusively in the L1 cache (4 KiB buffer), the
kernel incorrectly chooses neonx4, losing ~9.3% throughput for
larger I/O block sizes.


Thanks again for your time and review!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-04-16 12:40             ` Demian Shulhan
@ 2026-04-16 13:39               ` Ard Biesheuvel
  2026-04-16 14:59                 ` Demian Shulhan
  0 siblings, 1 reply; 16+ messages in thread
From: Ard Biesheuvel @ 2026-04-16 13:39 UTC (permalink / raw)
  To: Demian Shulhan, Christoph Hellwig
  Cc: Mark Rutland, Song Liu, Yu Kuai, Will Deacon, Catalin Marinas,
	Mark Brown, linux-arm-kernel, Robin Murphy, Li Nan, linux-raid,
	linux-kernel

Hi Demian,

On Thu, 16 Apr 2026, at 14:40, Demian Shulhan wrote:
> Hi all,
>
> Sorry for the delay. The tests became more complex than I initially
> thought, so I needed to gather more data and properly validate the
> results across different hardware configurations.
>
> Firstly, I want to clarify the results from my March 29 tests. I found
> a flaw in my initial custom benchmark. The massive 2x throughput gap on
> 24 disks wasn't solely due to SVE's superiority, but rather a severe L1
> D-Cache thrashing issue that disproportionately penalized NEON.
>
> My custom test lacked memset() initialization, causing all data buffers
> to map to the Linux Zero Page (Virtually Indexed, Physically Tagged
> cache aliasing).

D-caches always behave as PIPT on arm64. This is complex stuff, so please
don't present conjecture as fact.

> Furthermore, even with memset(), allocating contiguous
> page-aligned buffers can cause severe Cache Address Sharing (a known
> issue that Andrea Mazzoleni solved in SnapRAID 13 years ago using
> RAID_MALLOC_DISPLACEMENT).
>
> Because SVE (svex4) uses 256-bit registers on Neoverse-V1, it performs
> exactly half the number of memory load instructions compared to 128-bit
> NEON. This dramatically reduced the L1 cache alias thrashing, allowing
> SVE to survive the memory bottleneck while NEON choked:
>

You are drawing some conclusions here without disclosing the actual
information that you based this on. D-caches are non-aliasing on arm64.

So what exactly did you fix in your test case?

> Custom test without memset (4 KiB block):
>  | algo=neonx4 ndisks=24 iterations=1M time=11.014s MB/s=7802.57
>  | algo=svex4  ndisks=24 iterations=1M time=5.719s  MB/s=15026.92
>

This is the result where all data buffer pointers point to the same
memory, right? I.e., the zero page? So this is an unrealistic use
case that we can disregard.

> Custom test with memset (4 KiB block):
>  | algo=neonx4 ndisks=24 iterations=1M time=6.165s  MB/s=13939.08
>  | algo=svex4  ndisks=24 iterations=1M time=5.839s  MB/s=14718.23
>
> Even with the corrected memory setup, the throughput gap narrowed, but
> the fundamental CPU-efficiency result remained fully intact.
>

Sorry but your result that SVE is 2x faster does not remain fully intact,
right? Given that the speedup is now 5.5%?

Should we just disregard the above results (and explanations) and focus
on the stuff below?

> To completely isolate these variables and provide accurate real-world
> data, the following test campaigns were done based on the SnapRAID
> project (https://github.com/amadvance/snapraid) using its
> perf_bench.c tool with proper memory displacement and a 256 KiB block
> size.
>
> Test configurations:
> - c7g.medium (AWS Graviton3, 1 vCPU): Neoverse-V1, 256-bit SVE
> - c7g.xlarge (AWS Graviton3, 4 vCPUs): Neoverse-V1, 256-bit SVE
> - c8g.xlarge (AWS Graviton4, 4 vCPUs): Neoverse-V2, 128-bit SVE
>
>
> =========================================================
> Section 1: SnapRAID Validation on Graviton3 / Neoverse-V1
> =========================================================
>
...
>
> 1.3 Main Graviton3 Conclusions
> >  - On 256-bit SVE hardware, svex4 consistently retires ~34% fewer
>    instructions and ~10-15% fewer CPU cycles than neonx4.
>
> =========================================================
> Section 2: SnapRAID Validation on Graviton4 / Neoverse-V2
> =========================================================
>
...
>
> 2.3 Main Graviton4 Conclusions
>  - On Neoverse-V2, SVE vector length is 128-bit (same as NEON).
>  - Without the 256-bit width, NEON outperforms SVE.
>  - svex4 retires ~32% MORE instructions here and is consistently slower.
>
> =========================================================
> Section 3: Validation on c7g.medium (1 vCPU)
> =========================================================
>
...
> 3.3 Main c7g.medium Conclusions
>  - The instruction count reduction (~34%) perfectly matches the 4-vCPU
>    instance.
>  - The single vCPU is heavily memory-bandwidth constrained (cycle counts
>    are much higher waiting for RAM).
>

OK, so the takeaway here is that SVE is only worth the hassle if the vector
length is at least 256 bits. This is not entirely surprising, but given that
Graviton4 went back to 128-bit vectors from 256, I wonder what the future
expectation is here.

But having these numbers is definitely a good first step. Now we need to
quantify the overhead associated with having kernel mode SVE state that
needs to be preserved/restored.

However, a 10-15% speedup that can only be achieved on SVE implementations
with 256-bit vectors or more may not be that enticing in the end. (The
fact that you are retiring 34% fewer instructions does not really matter
here unless there is some meaningful SMT-like sharing of functional units
going on in the meantime, which seems unlikely on a CPU that is maxed out
on the data side.)


> =========================================================
> Section 4: The Pitfalls of the Current Kernel Benchmark
> =========================================================
>

These results seem very relevant - perhaps Christoph can give some guidance
on how we might use these to improve the built-in benchmarks to be more
accurate.


Thanks,


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-04-16 13:39               ` Ard Biesheuvel
@ 2026-04-16 14:59                 ` Demian Shulhan
  2026-04-16 16:26                   ` Robin Murphy
  0 siblings, 1 reply; 16+ messages in thread
From: Demian Shulhan @ 2026-04-16 14:59 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Christoph Hellwig, Mark Rutland, Song Liu, Yu Kuai, Will Deacon,
	Catalin Marinas, Mark Brown, linux-arm-kernel, Robin Murphy,
	Li Nan, linux-raid, linux-kernel

Hi Ard!

> So what exactly did you fix in your test case?

I just added the missing memset(). You're right, "aliasing" was the
wrong term for PIPT caches.

> This is the result where all data buffer pointers point to the same
> memory, right? I.e., the zero page? So this is an unrealistic use
> case that we can disregard.

Yes, that's right. It was a flaw in my previous test setup.

> Sorry but your result that SVE is 2x faster does not remain fully intact,
> right? Given that the speedup is now 5.5%?
> Should we just disregard the above results (and explanations) and focus
> on the stuff below?

Yes, it's better to focus on the data from SnapRAID. Those tests used
larger blocks and a wider range of disk counts, providing more realistic
metrics.

> OK, so the takeaway here is that SVE is only worth the hassle if the vector
> length is at least 256 bits. This is not entirely surprising, but given that
> Graviton4 went back to 128 bit vectors from 256, I wonder what the future
> expectation is here.

I agree. The results from the SnapRAID tests are not as impressive as
I hoped, and the fact that Neoverse-V2 went back to 128-bit is a red
flag. It suggests that wide SVE registers might not be a priority in
future architecture versions.

> These results seem very relevant - perhaps Christoph can give some guidance
> on how we might use these to improve the built-in benchmarks to be more
> accurate.

This is the most important part of this report, I think. SVE, like my
first idea, looks good only on paper; in real scenarios it brings more
problems than benefits.

I’m happy to drop the SVE implementation for now and instead focus on
modernizing the built-in benchmarks to ensure the kernel chooses the
best available NEON path for actual storage workloads.

If you give me the green light, I can start working on improving
these built-in tests.

Best regards,
Demian


On Thu, 16 Apr 2026 at 16:40, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Hi Demian,
>
> On Thu, 16 Apr 2026, at 14:40, Demian Shulhan wrote:
> > Hi all,
> >
> > Sorry for the delay. The tests became more complex than I initially
> > thought, so I needed to gather more data and properly validate the
> > results across different hardware configurations.
> >
> > Firstly, I want to clarify the results from my March 29 tests. I found
> > a flaw in my initial custom benchmark. The massive 2x throughput gap on
> > 24 disks wasn't solely due to SVE's superiority, but rather a severe L1
> > D-Cache thrashing issue that disproportionately penalized NEON.
> >
> > My custom test lacked memset() initialization, causing all data buffers
> > to map to the Linux Zero Page (Virtually Indexed, Physically Tagged
> > cache aliasing).
>
> D-caches always behave as PIPT on arm64. This is complex stuff, so please
> don't present conjecture as fact.
>
> > Furthermore, even with memset(), allocating contiguous
> > page-aligned buffers can causes severe Cache Address Sharing (a known
> > issue that Andrea Mazzoleni solved in SnapRAID 13 years ago using
> > RAID_MALLOC_DISPLACEMENT).
> >
> > Because SVE (svex4) uses 256-bit registers on Neoverse-V1, it performs
> > exactly half the number of memory load instructions compared to 128-bit
> > NEON. This dramatically reduced the L1 cache alias thrashing, allowing
> > SVE to survive the memory bottleneck while NEON choked:
> >
>
> You are drawing some conclusions here without disclosing the actual
> information that you based this on. D-caches are non-aliasing on arm64.
>
> So what exactly did you fix in your test case?
>
> > Custom test without memset (4 KiB block):
> >  | algo=neonx4 ndisks=24 iterations=1M time=11.014s MB/s=7802.57
> >  | algo=svex4  ndisks=24 iterations=1M time=5.719s  MB/s=15026.92
> >
>
> This is the result where all data buffer pointers point to the same
> memory, right? I.e., the zero page? So this is an unrealistic use
> case that we can disregard.
>
> > Custom test with memset (4 KiB block):
> >  | algo=neonx4 ndisks=24 iterations=1M time=6.165s  MB/s=13939.08
> >  | algo=svex4  ndisks=24 iterations=1M time=5.839s  MB/s=14718.23
> >
> > Even with the corrected memory setup, the throughput gap narrowed, but
> > the fundamental CPU-efficiency result remained fully intact.
> >
>
> Sorry but your result that SVE is 2x faster does not remain fully intact,
> right? Given that the speedup is now 5.5%?
>
> Should we just disregard the above results (and explanations) and focus
> on the stuff below?
>
> > To completely isolate these variables and provide accurate real-world
> > data, the following test campaigns were done based on the SnapRAID
> > project (https://github.com/amadvance/snapraid) using its
> > perf_bench.c tool with proper memory displacement and a 256 KiB block
> > size.
> >
> > Test configurations:
> > - c7g.medium (AWS Graviton3, 1 vCPU): Neoverse-V1, 256-bit SVE
> > - c7g.xlarge (AWS Graviton3, 4 vCPUs): Neoverse-V1, 256-bit SVE
> > - c8g.xlarge (AWS Graviton4, 4 vCPUs): Neoverse-V2, 128-bit SVE
> >
> >
> > =========================================================
> > Section 1: SnapRAID Validation on Graviton3 / Neoverse-V1
> > =========================================================
> >
> ...
> >
> > 1.3 Main Graviton3 Conclusions
> >  - On 256-bit SVE hardware, svex4 consistently retires about ~34% fewer
> >    instructions and ~10-15% fewer CPU cycles than neonx4.
> >
> > =========================================================
> > Section 2: SnapRAID Validation on Graviton4 / Neoverse-V2
> > =========================================================
> >
> ...
> >
> > 2.3 Main Graviton4 Conclusions
> >  - On Neoverse-V2, SVE vector length is 128-bit (same as NEON).
> >  - Without the 256-bit width, NEON outperforms SVE.
> >  - svex4 retires ~32% MORE instructions here and is consistently slower.
> >
> > =========================================================
> > Section 3: Validation on c7g.medium (1 vCPU)
> > =========================================================
> >
> ...
> > 3.3 Main c7g.medium Conclusions
> >  - The instruction count reduction (~34%) perfectly matches the 4-vCPU
> >    instance.
> >  - The single vCPU is heavily memory-bandwidth constrained (cycle counts
> >    are much higher waiting for RAM).
> >
>
> OK, so the takeaway here is that SVE is only worth the hassle if the vector
> length is at least 256 bits. This is not entirely surprising, but given that
> Graviton4 went back to 128 bit vectors from 256, I wonder what the future
> expectation is here.
>
> But having these numbers is definitely a good first step. Now we need to
> quantify the overhead associated with having kernel mode SVE state that
> needs to be preserved/restored.
>
> However, 10%-15% speedup that can only be achieved on SVE implementations
> with 256 bit vectors or more may not be that enticing in the end. (The
> fact that you are retiring 34% instructions less does not really matter
> here unless there is some meaningful SMT-like sharing of functional units
> going on in the meantime, which seems unlikely on a CPU that is maxed out
> on the data side)
>
>
> > =========================================================
> > Section 4: The Pitfalls of the Current Kernel Benchmark
> > =========================================================
> >
>
> These results seem very relevant - perhaps Christoph can give some guidance
> on how we might use these to improve the built-in benchmarks to be more
> accurate.
>
>
> Thanks,
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-04-16 14:59                 ` Demian Shulhan
@ 2026-04-16 16:26                   ` Robin Murphy
  2026-04-16 16:47                     ` Mark Brown
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Murphy @ 2026-04-16 16:26 UTC (permalink / raw)
  To: Demian Shulhan, Ard Biesheuvel
  Cc: Christoph Hellwig, Mark Rutland, Song Liu, Yu Kuai, Will Deacon,
	Catalin Marinas, Mark Brown, linux-arm-kernel, Li Nan, linux-raid,
	linux-kernel

On 16/04/2026 3:59 pm, Demian Shulhan wrote:
> Hi Ard!
> 
>> So what exactly did you fix in your test case?
> 
> I just added the missing memset. You're right, "aliasing" was the
> wrong term for PIPT.
> 
>> This is the result where all data buffer pointers point to the same
>> memory, right? I.e., the zero page? So this is an unrealistic use
>> case that we can disregard.
> 
> Yes, that's right. It was a flaw in my previous test setup.
> 
>> Sorry but your result that SVE is 2x faster does not remain fully intact,
>> right? Given that the speedup is now 5.5%?
>> Should we just disregard the above results (and explanations) and focus
>> on the stuff below?
> 
> Yes, it's better to focus on the data from SnapRAID. Those tests ran
> on larger blocks and a wider range of disks, providing more realistic
> metrics.
> 
>> OK, so the takeaway here is that SVE is only worth the hassle if the vector
>> length is at least 256 bits. This is not entirely surprising, but given that
>> Graviton4 went back to 128 bit vectors from 256, I wonder what the future
>> expectation is here.
> 
> I agree. The results from the SnapRAID tests are not as impressive as
> I hoped, and the fact that Neoverse-V2 went back to 128-bit is a red
> flag. It suggests that wide SVE registers might not be a priority in
> future architecture versions.

If you look at the Neoverse V1 software optimisation guide[1], the SVE 
instructions generally have half the throughput of their ASIMD 
equivalents (i.e. presumably the vector pipes are still only 128 bits 
wide and SVE is just using them in pairs), so indeed the total 
instruction count is largely meaningless - IPC might be somewhat more 
relevant, but I'd say the only performance number that's really 
meaningful is the end-to-end MB/s measure of how fast the function 
implementation as a whole can process data.
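
By way of illustration only — a sketch with a trivial XOR loop standing in
for the real syndrome routine — that end-to-end number is just bytes
processed over wall-clock time:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Trivial stand-in for a syndrome function: XOR n source blocks into p. */
static void xor_blocks(int n, size_t len, uint8_t *src[], uint8_t *p)
{
	memset(p, 0, len);
	for (int d = 0; d < n; d++)
		for (size_t i = 0; i < len; i++)
			p[i] ^= src[d][i];
}

/* End-to-end MB/s: total source bytes read, over wall-clock seconds. */
static double measure_mbps(int n, size_t len, uint8_t *src[], uint8_t *p,
			   int iters)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int it = 0; it < iters; it++)
		xor_blocks(n, len, src, p);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
	return ((double)n * len * iters) / secs / 1e6;
}
```

However the inner loop is vectorised, this is the figure of merit that
survives contact with real hardware.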

Unless you've got a CPU with truly big wide vector units that _can't_ be 
fully utilised by ASIMD ops, then SVE is only really offering whatever 
incidental benefits fall out of smaller code size. However, if you do 
have those wider vectors, then the cost of correctly saving/restoring 
the SVE state - of which a userspace benchmark isn't likely to be very 
representative - is also going to scale up significantly.

>> These results seem very relevant - perhaps Christoph can give some guidance
>> on how we might use these to improve the built-in benchmarks to be more
>> accurate.
> 
> This is the most important part of this report, I think. SVE looks
> good only on paper, as my first idea did; in real scenarios it
> brings more problems than benefits.
> 
> I’m happy to drop the SVE implementation for now and instead focus on
> modernizing the built-in benchmarks to ensure the kernel chooses the
> best available NEON path for actual storage workloads.

It's probably also worth checking whether the current NEON routines 
themselves are actually optimal for modern big CPUs - things have moved 
on quite a bit since Cortex-A57 (whose ASIMD performance could also be 
described as "esoteric" at the best of times...)

Thanks,
Robin.

[1] https://developer.arm.com/documentation/110659/

> 
> If you give me the green light, I can start working on improving
> these built-in tests.
> 
> Best regards,
> Demian
> 
> 
> On Thu, 16 Apr 2026 at 16:40, Ard Biesheuvel <ardb@kernel.org> wrote:
>>
>> Hi Demian,
>>
>> On Thu, 16 Apr 2026, at 14:40, Demian Shulhan wrote:
>>> Hi all,
>>>
>>> Sorry for the delay. The tests became more complex than I initially
>>> thought, so I needed to gather more data and properly validate the
>>> results across different hardware configurations.
>>>
>>> Firstly, I want to clarify the results from my March 29 tests. I found
>>> a flaw in my initial custom benchmark. The massive 2x throughput gap on
>>> 24 disks wasn't solely due to SVE's superiority, but rather a severe L1
>>> D-Cache thrashing issue that disproportionately penalized NEON.
>>>
>>> My custom test lacked memset() initialization, causing all data buffers
>>> to map to the Linux Zero Page (Virtually Indexed, Physically Tagged
>>> cache aliasing).
>>
>> D-caches always behave as PIPT on arm64. This is complex stuff, so please
>> don't present conjecture as fact.
>>
>>> Furthermore, even with memset(), allocating contiguous
>>> page-aligned buffers can cause severe Cache Address Sharing (a known
>>> issue that Andrea Mazzoleni solved in SnapRAID 13 years ago using
>>> RAID_MALLOC_DISPLACEMENT).
>>>
>>> Because SVE (svex4) uses 256-bit registers on Neoverse-V1, it performs
>>> exactly half the number of memory load instructions compared to 128-bit
>>> NEON. This dramatically reduced the L1 cache alias thrashing, allowing
>>> SVE to survive the memory bottleneck while NEON choked:
>>>
>>
>> You are drawing some conclusions here without disclosing the actual
>> information that you based this on. D-caches are non-aliasing on arm64.
>>
>> So what exactly did you fix in your test case?
>>
>>> Custom test without memset (4 KiB block):
>>>   | algo=neonx4 ndisks=24 iterations=1M time=11.014s MB/s=7802.57
>>>   | algo=svex4  ndisks=24 iterations=1M time=5.719s  MB/s=15026.92
>>>
>>
>> This is the result where all data buffer pointers point to the same
>> memory, right? I.e., the zero page? So this is an unrealistic use
>> case that we can disregard.
>>
>>> Custom test with memset (4 KiB block):
>>>   | algo=neonx4 ndisks=24 iterations=1M time=6.165s  MB/s=13939.08
>>>   | algo=svex4  ndisks=24 iterations=1M time=5.839s  MB/s=14718.23
>>>
>>> Even with the corrected memory setup, the throughput gap narrowed, but
>>> the fundamental CPU-efficiency result remained fully intact.
>>>
>>
>> Sorry but your result that SVE is 2x faster does not remain fully intact,
>> right? Given that the speedup is now 5.5%?
>>
>> Should we just disregard the above results (and explanations) and focus
>> on the stuff below?
>>
>>> To completely isolate these variables and provide accurate real-world
>>> data, the following test campaigns were done based on the SnapRAID
>>> project (https://github.com/amadvance/snapraid) using its
>>> perf_bench.c tool with proper memory displacement and a 256 KiB block
>>> size.
>>>
>>> Test configurations:
>>> - c7g.medium (AWS Graviton3, 1 vCPU): Neoverse-V1, 256-bit SVE
>>> - c7g.xlarge (AWS Graviton3, 4 vCPUs): Neoverse-V1, 256-bit SVE
>>> - c8g.xlarge (AWS Graviton4, 4 vCPUs): Neoverse-V2, 128-bit SVE
>>>
>>>
>>> =========================================================
>>> Section 1: SnapRAID Validation on Graviton3 / Neoverse-V1
>>> =========================================================
>>>
>> ...
>>>
>>> 1.3 Main Graviton3 Conclusions
>>>   - On 256-bit SVE hardware, svex4 consistently retires about ~34% fewer
>>>     instructions and ~10-15% fewer CPU cycles than neonx4.
>>>
>>> =========================================================
>>> Section 2: SnapRAID Validation on Graviton4 / Neoverse-V2
>>> =========================================================
>>>
>> ...
>>>
>>> 2.3 Main Graviton4 Conclusions
>>>   - On Neoverse-V2, SVE vector length is 128-bit (same as NEON).
>>>   - Without the 256-bit width, NEON outperforms SVE.
>>>   - svex4 retires ~32% MORE instructions here and is consistently slower.
>>>
>>> =========================================================
>>> Section 3: Validation on c7g.medium (1 vCPU)
>>> =========================================================
>>>
>> ...
>>> 3.3 Main c7g.medium Conclusions
>>>   - The instruction count reduction (~34%) perfectly matches the 4-vCPU
>>>     instance.
>>>   - The single vCPU is heavily memory-bandwidth constrained (cycle counts
>>>     are much higher waiting for RAM).
>>>
>>
>> OK, so the takeaway here is that SVE is only worth the hassle if the vector
>> length is at least 256 bits. This is not entirely surprising, but given that
>> Graviton4 went back to 128 bit vectors from 256, I wonder what the future
>> expectation is here.
>>
>> But having these numbers is definitely a good first step. Now we need to
>> quantify the overhead associated with having kernel mode SVE state that
>> needs to be preserved/restored.
>>
>> However, 10%-15% speedup that can only be achieved on SVE implementations
>> with 256 bit vectors or more may not be that enticing in the end. (The
>> fact that you are retiring 34% fewer instructions does not really matter
>> here unless there is some meaningful SMT-like sharing of functional units
>> going on in the meantime, which seems unlikely on a CPU that is maxed out
>> on the data side)
>>
>>
>>> =========================================================
>>> Section 4: The Pitfalls of the Current Kernel Benchmark
>>> =========================================================
>>>
>>
>> These results seem very relevant - perhaps Christoph can give some guidance
>> on how we might use these to improve the built-in benchmarks to be more
>> accurate.
>>
>>
>> Thanks,
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-04-16 16:26                   ` Robin Murphy
@ 2026-04-16 16:47                     ` Mark Brown
  2026-04-16 17:03                       ` Robin Murphy
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Brown @ 2026-04-16 16:47 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Demian Shulhan, Ard Biesheuvel, Christoph Hellwig, Mark Rutland,
	Song Liu, Yu Kuai, Will Deacon, Catalin Marinas, linux-arm-kernel,
	Li Nan, linux-raid, linux-kernel

On Thu, Apr 16, 2026 at 05:26:08PM +0100, Robin Murphy wrote:

> Unless you've got a CPU with truly big wide vector units that _can't_ be
> fully utilised by ASIMD ops, then SVE is only really offering whatever
> incidental benefits fall out of smaller code size. However, if you do have
> those wider vectors, then the cost of correctly saving/restoring the SVE
> state - of which a userspace benchmark isn't likely to be very
> representative - is also going to scale up significantly.

The other case will be when there's some SVE-only extension that
accelerates something that's relevant for the algorithm.  That's not
really a thing at present but I imagine that we'll run into that at some
point.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-04-16 16:47                     ` Mark Brown
@ 2026-04-16 17:03                       ` Robin Murphy
  0 siblings, 0 replies; 16+ messages in thread
From: Robin Murphy @ 2026-04-16 17:03 UTC (permalink / raw)
  To: Mark Brown
  Cc: Demian Shulhan, Ard Biesheuvel, Christoph Hellwig, Mark Rutland,
	Song Liu, Yu Kuai, Will Deacon, Catalin Marinas, linux-arm-kernel,
	Li Nan, linux-raid, linux-kernel

On 16/04/2026 5:47 pm, Mark Brown wrote:
> On Thu, Apr 16, 2026 at 05:26:08PM +0100, Robin Murphy wrote:
> 
>> Unless you've got a CPU with truly big wide vector units that _can't_ be
>> fully utilised by ASIMD ops, then SVE is only really offering whatever
>> incidental benefits fall out of smaller code size. However, if you do have
>> those wider vectors, then the cost of correctly saving/restoring the SVE
>> state - of which a userspace benchmark isn't likely to be very
>> representative - is also going to scale up significantly.
> 
> The other case will be when there's some SVE-only extension that
> accelerates something that's relevant for the algorithm.  That's not
> really a thing at present but I imagine that we'll run into that at some
> point.

Indeed - I was implicitly thinking in terms of things that _are_ just 
transliterated from NEON to SVE, where the primary gain is stuff like 
predicate loops, but even that _could_ potentially be enough to justify 
an argument for in-kernel SVE (using a 128-bit VL to keep the additional 
state/cost to a minimum).
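
For the archive, the predicate-loop shape being referred to — sketched here
in illustrative SVE assembly, not taken from the patch — is roughly a
plain XOR of two sources into a destination, with the tail handled by the
predicate rather than a scalar epilogue:

```asm
	// x0/x1 = sources, x2 = destination, x3 = length in bytes
	mov	x4, xzr			// byte offset
	whilelo	p0.b, x4, x3		// predicate for remaining bytes
1:	ld1b	{z0.b}, p0/z, [x0, x4]	// load from first source
	ld1b	{z1.b}, p0/z, [x1, x4]	// load from second source
	eor	z0.d, z0.d, z1.d	// XOR all lanes
	st1b	{z0.b}, p0, [x2, x4]	// predicated store, tail-safe
	incb	x4			// advance by the runtime VL (cntb)
	whilelo	p0.b, x4, x3		// recompute predicate for the tail
	b.first	1b			// loop while any lane is active
```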

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-04-16 17:03 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-18 15:02 [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation Demian Shulhan
2026-03-24  7:45 ` Christoph Hellwig
2026-03-24  8:00 ` Ard Biesheuvel
2026-03-24 10:04   ` Mark Rutland
2026-03-29 13:01     ` Demian Shulhan
2026-03-30  5:30       ` Christoph Hellwig
2026-03-30 16:39       ` Ard Biesheuvel
2026-03-31  6:36         ` Christoph Hellwig
2026-03-31 13:18           ` Demian Shulhan
2026-04-16 12:40             ` Demian Shulhan
2026-04-16 13:39               ` Ard Biesheuvel
2026-04-16 14:59                 ` Demian Shulhan
2026-04-16 16:26                   ` Robin Murphy
2026-04-16 16:47                     ` Mark Brown
2026-04-16 17:03                       ` Robin Murphy
  -- strict thread matches above, loose matches on Subject: below --
2026-03-18 15:01 Demian Shulhan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox