public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed
* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
       [not found] <20260318150245.3080719-1-demyansh@gmail.com>
@ 2026-03-24  7:45 ` Christoph Hellwig
  2026-03-24  8:00 ` Ard Biesheuvel
  1 sibling, 0 replies; 4+ messages in thread
From: Christoph Hellwig @ 2026-03-24  7:45 UTC (permalink / raw)
  To: Demian Shulhan
  Cc: Song Liu, Yu Kuai, Li Nan, linux-raid, linux-kernel,
	kernel test robot, Catalin Marinas, Ard Biesheuvel, Will Deacon,
	linux-arm-kernel

Hi Demian,

thanks for looking into this.

I've added the arm64 maintainers and arm list as that's your best bet
for someone actually understanding the low-level assembly code.

On Wed, Mar 18, 2026 at 03:02:45PM +0000, Demian Shulhan wrote:
> Note that for the recovery path (`xor_syndrome`), NEON may still be
> selected dynamically by the algorithm benchmark, as the recovery
> workload is heavily memory-bound.

The recovery side has no benchmarking; you need to manually select
a priority.

Note that I just sent out a "cleanup the RAID6 P/Q library" series that
makes this more explicit.  It also makes it clear that by prioritizing
implementations that use the better instructions available, we can
short-cut the generation-side probing path a lot, which might be worth
looking into for this.

I'm also curious whether you looked into why the 4x unroll is slower
than the smaller unrolls, and whether that is inherent in the
implementation or just an effect of the small number of disks, i.e.
that we don't actually have 4 disks to unroll over in every iteration.
What would the numbers be if RAID6_TEST_DISKS was increased to 10 or 18?

As a follow-on, I plan to potentially select the unrolling variant
based on the number of "disks" to calculate over.

We'll have to wait for review on my series, but I'd love to just rebase
this on top if possible.  I can offer to do the work, but I'd need to
run it past you for testing and final review.

> +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)

Overly long line.

> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = disks - 3;
> +
> +	p = dptr[z0 + 1];
> +	q = dptr[z0 + 2];

I know all this is derived from existing code, but as I've started to
hate that style, I'll add my cosmetic comments:

This would read nicer by initializing at declaration time:

	u8 **dptr = (u8 **)ptrs;
	long z0 = disks - 3;
	u8 *p = dptr[z0 + 1];
	u8 *q = dptr[z0 + 2];

Also z0 might better be named z_last or last_disk, or stop as in the
_xor variant routines.

> +	asm volatile(

But I wonder if just implementing the entire routine as assembly in a
.S file would make more sense than this anyway?

> +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> +					 unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = stop;
> +
> +	p = dptr[disks - 2];
> +	q = dptr[disks - 1];
> +
> +	asm volatile(

Same comments here, plus just dropping z0 vs using stop directly.

> +#define RAID6_SVE_WRAPPER(_n)						\
> +	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
> +					size_t bytes, void **ptrs)	\
> +	{								\
> +		scoped_ksimd()						\
> +		raid6_sve ## _n ## _gen_syndrome_real(disks,		\
> +					(unsigned long)bytes, ptrs);	\

Missing indentation after the scoped_ksimd().  A lot of other code uses
separate compilation units for the SIMD code, which seems pretty useful
to avoid mixing SIMD with non-SIMD code.  That would also combine nicely
with the suggestion above to implement the low-level routines entirely
in assembly.

I'll leave comments on the actual assembly details to people that
actually know ARM64 assembly.
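(Aside for readers following along: the asr/lsl/and/eor sequences in the
"math block" sections are the usual branch-free multiply-by-{02} in
GF(2^8) with the RAID6 reduction constant 0x1d.  A scalar C model of what
each byte lane computes -- function names here are illustrative, not from
the patch:)

```c
#include <assert.h>
#include <stdint.h>

/*
 * Branch-free GF(2^8) multiply by {02} (polynomial 0x11d), matching the
 * per-lane asr #7 / lsl #1 / and 0x1d / eor sequence in the patch.
 */
static uint8_t gf2_mul2(uint8_t x)
{
	/* asr #7: 0xff if the top bit is set, else 0x00 */
	uint8_t mask = (uint8_t)((int8_t)x >> 7);

	/* lsl #1, then conditionally xor in the reduction constant */
	return (uint8_t)(x << 1) ^ (mask & 0x1d);
}

/*
 * Scalar model of the gen_syndrome inner loop for one byte position:
 * P is a plain XOR of all data disks, Q multiplies the accumulator by
 * {02} before each further disk is folded in (highest disk first).
 */
static void gen_syndrome_byte(int data_disks, const uint8_t *d,
			      uint8_t *p, uint8_t *q)
{
	uint8_t wp = d[data_disks - 1];
	uint8_t wq = wp;

	for (int z = data_disks - 2; z >= 0; z--) {
		wq = gf2_mul2(wq) ^ d[z];
		wp ^= d[z];
	}
	*p = wp;
	*q = wq;
}
```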


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
       [not found] <20260318150245.3080719-1-demyansh@gmail.com>
  2026-03-24  7:45 ` [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation Christoph Hellwig
@ 2026-03-24  8:00 ` Ard Biesheuvel
  2026-03-24 10:04   ` Mark Rutland
  1 sibling, 1 reply; 4+ messages in thread
From: Ard Biesheuvel @ 2026-03-24  8:00 UTC (permalink / raw)
  To: Demian Shulhan, Song Liu, Yu Kuai, Will Deacon, Catalin Marinas,
	Mark Rutland, broonie, linux-arm-kernel, robin.murphy,
	Christoph Hellwig
  Cc: Li Nan, linux-raid, linux-kernel

Hi Demian,

On Wed, 18 Mar 2026, at 16:02, Demian Shulhan wrote:
> Implement Scalable Vector Extension (SVE) optimized routines for RAID6
> syndrome generation and recovery on ARM64.
>
> The SVE instruction set allows for variable vector lengths (from 128 to
> 2048 bits), scaling automatically with the hardware capabilities. This
> implementation handles arbitrary SVE vector lengths using the `cntb`
> instruction to determine the runtime vector length.
>
> The implementation introduces `svex1`, `svex2`, and `svex4` algorithms.
> The `svex4` algorithm utilizes loop unrolling by 4 blocks per iteration
> and manual software pipelining (interleaving memory loads with XORs)
> to minimize instruction dependency stalls and maximize CPU pipeline
> utilization and memory bandwidth.
>
> Performance was tested on an AWS Graviton3 (Neoverse-V1) instance which
> features 256-bit SVE vector length. The `svex4` implementation outperforms
> the existing 128-bit `neonx4` baseline for syndrome generation:
>
> raid6: svex4    gen() 19688 MB/s
...
> raid6: neonx4   gen() 19612 MB/s

You're being very generous characterising a 0.3% speedup as 'outperforms'

But the real problem here is that the kernel-mode SIMD API only supports NEON, not SVE, and preserves/restores only the 128-bit view of the NEON/SVE register file. So if any context switch or softirq uses kernel-mode SIMD too, your SVE register values will get truncated.

Once we encounter a good use case for SVE in the kernel, we might reconsider this, but as it stands, this patch should not be applied.

(leaving the reply untrimmed for the benefit of the cc'ees I added)

> raid6: neonx2   gen() 16248 MB/s
> raid6: neonx1   gen() 13591 MB/s
> raid6: using algorithm svex4 gen() 19688 MB/s
> raid6: .... xor() 11212 MB/s, rmw enabled
> raid6: using neon recovery algorithm
>
> Note that for the recovery path (`xor_syndrome`), NEON may still be
> selected dynamically by the algorithm benchmark, as the recovery
> workload is heavily memory-bound.
>
> Signed-off-by: Demian Shulhan <demyansh@gmail.com>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202603181940.cFwYmYoi-lkp@intel.com/
> ---
>  include/linux/raid/pq.h |   3 +
>  lib/raid6/Makefile      |   5 +
>  lib/raid6/algos.c       |   5 +
>  lib/raid6/sve.c         | 675 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 688 insertions(+)
>  create mode 100644 lib/raid6/sve.c
>
> diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> index 2467b3be15c9..787cc57aea9d 100644
> --- a/include/linux/raid/pq.h
> +++ b/include/linux/raid/pq.h
> @@ -140,6 +140,9 @@ extern const struct raid6_calls raid6_neonx1;
>  extern const struct raid6_calls raid6_neonx2;
>  extern const struct raid6_calls raid6_neonx4;
>  extern const struct raid6_calls raid6_neonx8;
> +extern const struct raid6_calls raid6_svex1;
> +extern const struct raid6_calls raid6_svex2;
> +extern const struct raid6_calls raid6_svex4;
> 
>  /* Algorithm list */
>  extern const struct raid6_calls * const raid6_algos[];
> diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> index 5be0a4e60ab1..6cdaa6f206fb 100644
> --- a/lib/raid6/Makefile
> +++ b/lib/raid6/Makefile
> @@ -8,6 +8,7 @@ raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
>  raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
>                                vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>  raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
> +raid6_pq-$(CONFIG_ARM64_SVE) += sve.o
>  raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
>  raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
>  raid6_pq-$(CONFIG_RISCV_ISA_V) += rvv.o recov_rvv.o
> @@ -67,6 +68,10 @@ CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU)
>  CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU)
>  CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU)
>  CFLAGS_REMOVE_recov_neon_inner.o += $(CC_FLAGS_NO_FPU)
> +
> +CFLAGS_sve.o += $(CC_FLAGS_FPU)
> +CFLAGS_REMOVE_sve.o += $(CC_FLAGS_NO_FPU)
> +
>  targets += neon1.c neon2.c neon4.c neon8.c
>  $(obj)/neon%.c: $(src)/neon.uc $(src)/unroll.awk FORCE
>  	$(call if_changed,unroll)
> diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> index 799e0e5eac26..0ae73c3a4be3 100644
> --- a/lib/raid6/algos.c
> +++ b/lib/raid6/algos.c
> @@ -66,6 +66,11 @@ const struct raid6_calls * const raid6_algos[] = {
>  	&raid6_neonx2,
>  	&raid6_neonx1,
>  #endif
> +#ifdef CONFIG_ARM64_SVE
> +	&raid6_svex4,
> +	&raid6_svex2,
> +	&raid6_svex1,
> +#endif
>  #ifdef CONFIG_LOONGARCH
>  #ifdef CONFIG_CPU_HAS_LASX
>  	&raid6_lasx,
> diff --git a/lib/raid6/sve.c b/lib/raid6/sve.c
> new file mode 100644
> index 000000000000..d52937f806d4
> --- /dev/null
> +++ b/lib/raid6/sve.c
> @@ -0,0 +1,675 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * RAID-6 syndrome calculation using ARM SVE instructions
> + */
> +
> +#include <linux/raid/pq.h>
> +
> +#ifdef __KERNEL__
> +#include <asm/simd.h>
> +#include <linux/cpufeature.h>
> +#else
> +#define scoped_ksimd()
> +#define system_supports_sve() (1)
> +#endif
> +
> +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = disks - 3;
> +
> +	p = dptr[z0 + 1];
> +	q = dptr[z0 + 2];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z0.b, p0/z, [x6, x5]\n"
> +		"mov z1.d, z0.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, #0\n"
> +		"blt 2f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +
> +		"add x5, x5, x3\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4"
> +	);
> +}
> +
> +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> +					 unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = stop;
> +
> +	p = dptr[disks - 2];
> +	q = dptr[disks - 1];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z1.b, p0/z, [x6, x5]\n"
> +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> +		"eor z0.d, z0.d, z1.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, %w[start]\n"
> +		"blt 2f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"mov w7, %w[start]\n"
> +		"sub w7, w7, #1\n"
> +		"3:\n"
> +		"cmp w7, #0\n"
> +		"blt 4f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 3b\n"
> +		"4:\n"
> +
> +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> +		"eor z1.d, z1.d, z2.d\n"
> +
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +
> +		"add x5, x5, x3\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4"
> +	);
> +}
> +
> +static void raid6_sve2_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = disks - 3;
> +
> +	p = dptr[z0 + 1];
> +	q = dptr[z0 + 2];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z0.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z5.b, p0/z, [x6, x8]\n"
> +		"mov z1.d, z0.d\n"
> +		"mov z6.d, z5.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, #0\n"
> +		"blt 2f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [x6, x8]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z5.d, z5.d, z7.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +		"add x8, x5, x3\n"
> +		"st1b z5.b, p0, [%[p], x8]\n"
> +		"st1b z6.b, p0, [%[q], x8]\n"
> +
> +		"add x5, x5, x3\n"
> +		"add x5, x5, x3\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4",
> +		  "z5", "z6", "z7", "z8"
> +	);
> +}
> +
> +static void raid6_sve2_xor_syndrome_real(int disks, int start, int stop,
> +					 unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = stop;
> +
> +	p = dptr[disks - 2];
> +	q = dptr[disks - 1];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z1.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z6.b, p0/z, [x6, x8]\n"
> +
> +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> +		"ld1b z5.b, p0/z, [%[p], x8]\n"
> +
> +		"eor z0.d, z0.d, z1.d\n"
> +		"eor z5.d, z5.d, z6.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, %w[start]\n"
> +		"blt 2f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [x6, x8]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z5.d, z5.d, z7.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"mov w7, %w[start]\n"
> +		"sub w7, w7, #1\n"
> +		"3:\n"
> +		"cmp w7, #0\n"
> +		"blt 4f\n"
> +
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 3b\n"
> +		"4:\n"
> +
> +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> +		"eor z1.d, z1.d, z2.d\n"
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [%[q], x8]\n"
> +		"eor z6.d, z6.d, z7.d\n"
> +		"st1b z5.b, p0, [%[p], x8]\n"
> +		"st1b z6.b, p0, [%[q], x8]\n"
> +
> +		"add x5, x5, x3\n"
> +		"add x5, x5, x3\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4",
> +		  "z5", "z6", "z7", "z8"
> +	);
> +}
> +
> +static void raid6_sve4_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = disks - 3;
> +
> +	p = dptr[z0 + 1];
> +	q = dptr[z0 + 2];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z0.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z5.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z10.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z15.b, p0/z, [x6, x8]\n"
> +
> +		"mov z1.d, z0.d\n"
> +		"mov z6.d, z5.d\n"
> +		"mov z11.d, z10.d\n"
> +		"mov z16.d, z15.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, #0\n"
> +		"blt 2f\n"
> +
> +		// software pipelining: load data early
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z12.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z17.b, p0/z, [x6, x8]\n"
> +
> +		// math block 1
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		// math block 2
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z5.d, z5.d, z7.d\n"
> +
> +		// math block 3
> +		"mov z13.d, z11.d\n"
> +		"asr z13.b, p0/m, z13.b, #7\n"
> +		"lsl z11.b, p0/m, z11.b, #1\n"
> +		"and z13.d, z13.d, z4.d\n"
> +		"eor z11.d, z11.d, z13.d\n"
> +		"eor z11.d, z11.d, z12.d\n"
> +		"eor z10.d, z10.d, z12.d\n"
> +
> +		// math block 4
> +		"mov z18.d, z16.d\n"
> +		"asr z18.b, p0/m, z18.b, #7\n"
> +		"lsl z16.b, p0/m, z16.b, #1\n"
> +		"and z18.d, z18.d, z4.d\n"
> +		"eor z16.d, z16.d, z18.d\n"
> +		"eor z16.d, z16.d, z17.d\n"
> +		"eor z15.d, z15.d, z17.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +		"add x8, x5, x3\n"
> +		"st1b z5.b, p0, [%[p], x8]\n"
> +		"st1b z6.b, p0, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"st1b z10.b, p0, [%[p], x8]\n"
> +		"st1b z11.b, p0, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"st1b z15.b, p0, [%[p], x8]\n"
> +		"st1b z16.b, p0, [%[q], x8]\n"
> +
> +		"add x8, x3, x3\n"
> +		"add x5, x5, x8, lsl #1\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4",
> +		  "z5", "z6", "z7", "z8",
> +		  "z10", "z11", "z12", "z13",
> +		  "z15", "z16", "z17", "z18"
> +	);
> +}
> +
> +static void raid6_sve4_xor_syndrome_real(int disks, int start, int stop,
> +					 unsigned long bytes, void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	long z0 = stop;
> +
> +	p = dptr[disks - 2];
> +	q = dptr[disks - 1];
> +
> +	asm volatile(
> +		".arch armv8.2-a+sve\n"
> +		"ptrue p0.b\n"
> +		"cntb x3\n"
> +		"mov w4, #0x1d\n"
> +		"dup z4.b, w4\n"
> +		"mov x5, #0\n"
> +
> +		"0:\n"
> +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> +		"ld1b z1.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z6.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z11.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z16.b, p0/z, [x6, x8]\n"
> +
> +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z5.b, p0/z, [%[p], x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z10.b, p0/z, [%[p], x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z15.b, p0/z, [%[p], x8]\n"
> +
> +		"eor z0.d, z0.d, z1.d\n"
> +		"eor z5.d, z5.d, z6.d\n"
> +		"eor z10.d, z10.d, z11.d\n"
> +		"eor z15.d, z15.d, z16.d\n"
> +
> +		"mov w7, %w[z0]\n"
> +		"sub w7, w7, #1\n"
> +
> +		"1:\n"
> +		"cmp w7, %w[start]\n"
> +		"blt 2f\n"
> +
> +		// software pipelining: load data early
> +		"sxtw x8, w7\n"
> +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> +		"ld1b z2.b, p0/z, [x6, x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z12.b, p0/z, [x6, x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z17.b, p0/z, [x6, x8]\n"
> +
> +		// math block 1
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z0.d, z0.d, z2.d\n"
> +
> +		// math block 2
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z5.d, z5.d, z7.d\n"
> +
> +		// math block 3
> +		"mov z13.d, z11.d\n"
> +		"asr z13.b, p0/m, z13.b, #7\n"
> +		"lsl z11.b, p0/m, z11.b, #1\n"
> +		"and z13.d, z13.d, z4.d\n"
> +		"eor z11.d, z11.d, z13.d\n"
> +		"eor z11.d, z11.d, z12.d\n"
> +		"eor z10.d, z10.d, z12.d\n"
> +
> +		// math block 4
> +		"mov z18.d, z16.d\n"
> +		"asr z18.b, p0/m, z18.b, #7\n"
> +		"lsl z16.b, p0/m, z16.b, #1\n"
> +		"and z18.d, z18.d, z4.d\n"
> +		"eor z16.d, z16.d, z18.d\n"
> +		"eor z16.d, z16.d, z17.d\n"
> +		"eor z15.d, z15.d, z17.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 1b\n"
> +		"2:\n"
> +
> +		"mov w7, %w[start]\n"
> +		"sub w7, w7, #1\n"
> +		"3:\n"
> +		"cmp w7, #0\n"
> +		"blt 4f\n"
> +
> +		// math block 1
> +		"mov z3.d, z1.d\n"
> +		"asr z3.b, p0/m, z3.b, #7\n"
> +		"lsl z1.b, p0/m, z1.b, #1\n"
> +		"and z3.d, z3.d, z4.d\n"
> +		"eor z1.d, z1.d, z3.d\n"
> +
> +		// math block 2
> +		"mov z8.d, z6.d\n"
> +		"asr z8.b, p0/m, z8.b, #7\n"
> +		"lsl z6.b, p0/m, z6.b, #1\n"
> +		"and z8.d, z8.d, z4.d\n"
> +		"eor z6.d, z6.d, z8.d\n"
> +
> +		// math block 3
> +		"mov z13.d, z11.d\n"
> +		"asr z13.b, p0/m, z13.b, #7\n"
> +		"lsl z11.b, p0/m, z11.b, #1\n"
> +		"and z13.d, z13.d, z4.d\n"
> +		"eor z11.d, z11.d, z13.d\n"
> +
> +		// math block 4
> +		"mov z18.d, z16.d\n"
> +		"asr z18.b, p0/m, z18.b, #7\n"
> +		"lsl z16.b, p0/m, z16.b, #1\n"
> +		"and z18.d, z18.d, z4.d\n"
> +		"eor z16.d, z16.d, z18.d\n"
> +
> +		"sub w7, w7, #1\n"
> +		"b 3b\n"
> +		"4:\n"
> +
> +		// Load q and XOR
> +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> +		"add x8, x5, x3\n"
> +		"ld1b z7.b, p0/z, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z12.b, p0/z, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"ld1b z17.b, p0/z, [%[q], x8]\n"
> +
> +		"eor z1.d, z1.d, z2.d\n"
> +		"eor z6.d, z6.d, z7.d\n"
> +		"eor z11.d, z11.d, z12.d\n"
> +		"eor z16.d, z16.d, z17.d\n"
> +
> +		// Store results
> +		"st1b z0.b, p0, [%[p], x5]\n"
> +		"st1b z1.b, p0, [%[q], x5]\n"
> +		"add x8, x5, x3\n"
> +		"st1b z5.b, p0, [%[p], x8]\n"
> +		"st1b z6.b, p0, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"st1b z10.b, p0, [%[p], x8]\n"
> +		"st1b z11.b, p0, [%[q], x8]\n"
> +		"add x8, x8, x3\n"
> +		"st1b z15.b, p0, [%[p], x8]\n"
> +		"st1b z16.b, p0, [%[q], x8]\n"
> +
> +		"add x8, x3, x3\n"
> +		"add x5, x5, x8, lsl #1\n"
> +		"cmp x5, %[bytes]\n"
> +		"blt 0b\n"
> +		:
> +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> +		  "z0", "z1", "z2", "z3", "z4",
> +		  "z5", "z6", "z7", "z8",
> +		  "z10", "z11", "z12", "z13",
> +		  "z15", "z16", "z17", "z18"
> +	);
> +}
> +
> +#define RAID6_SVE_WRAPPER(_n)						\
> +	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
> +					size_t bytes, void **ptrs)	\
> +	{								\
> +		scoped_ksimd()						\
> +		raid6_sve ## _n ## _gen_syndrome_real(disks,		\
> +					(unsigned long)bytes, ptrs);	\
> +	}								\
> +	static void raid6_sve ## _n ## _xor_syndrome(int disks,		\
> +					int start, int stop,		\
> +					size_t bytes, void **ptrs)	\
> +	{								\
> +		scoped_ksimd()						\
> +		raid6_sve ## _n ## _xor_syndrome_real(disks,		\
> +				start, stop, (unsigned long)bytes, ptrs);\
> +	}								\
> +	struct raid6_calls const raid6_svex ## _n = {			\
> +		raid6_sve ## _n ## _gen_syndrome,			\
> +		raid6_sve ## _n ## _xor_syndrome,			\
> +		raid6_have_sve,						\
> +		"svex" #_n,						\
> +		0							\
> +	}
> +
> +static int raid6_have_sve(void)
> +{
> +	return system_supports_sve();
> +}
> +
> +RAID6_SVE_WRAPPER(1);
> +RAID6_SVE_WRAPPER(2);
> +RAID6_SVE_WRAPPER(4);
> -- 
> 2.43.0


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-24  8:00 ` Ard Biesheuvel
@ 2026-03-24 10:04   ` Mark Rutland
  2026-03-29 13:01     ` Demian Shulhan
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Rutland @ 2026-03-24 10:04 UTC (permalink / raw)
  To: Ard Biesheuvel, Christoph Hellwig, Demian Shulhan
  Cc: Song Liu, Yu Kuai, Will Deacon, Catalin Marinas, broonie,
	linux-arm-kernel, robin.murphy, Li Nan, linux-raid, linux-kernel

On Tue, Mar 24, 2026 at 09:00:16AM +0100, Ard Biesheuvel wrote:
> On Wed, 18 Mar 2026, at 16:02, Demian Shulhan wrote:
> > Implement Scalable Vector Extension (SVE) optimized routines for RAID6
> > syndrome generation and recovery on ARM64.
> >
> > The SVE instruction set allows for variable vector lengths (from 128 to
> > 2048 bits), scaling automatically with the hardware capabilities. This
> > implementation handles arbitrary SVE vector lengths using the `cntb`
> > instruction to determine the runtime vector length.
> >
> > The implementation introduces `svex1`, `svex2`, and `svex4` algorithms.
> > The `svex4` algorithm utilizes loop unrolling by 4 blocks per iteration
> > and manual software pipelining (interleaving memory loads with XORs)
> > to minimize instruction dependency stalls and maximize CPU pipeline
> > utilization and memory bandwidth.
> >
> > Performance was tested on an AWS Graviton3 (Neoverse-V1) instance which
> > features 256-bit SVE vector length. The `svex4` implementation outperforms
> > the existing 128-bit `neonx4` baseline for syndrome generation:
> >
> > raid6: svex4    gen() 19688 MB/s
> ...
> > raid6: neonx4   gen() 19612 MB/s
> 
> You're being very generous characterising a 0.3% speedup as 'outperforms'
> 
> But the real problem here is that the kernel-mode SIMD API only
> supports NEON, not SVE, and preserves/restores only the 128-bit
> view of the NEON/SVE register file. So if any context switch or
> softirq uses kernel-mode SIMD too, your SVE register values will
> get truncated.

Just to be a bit more explicit, since only the NEON register file is
saved:

* The vector registers will be truncated to 128-bit across
  preemption or softirq.

* The predicates won't be saved/restored and will change arbitrarily
  across preemption.

* The VL won't be saved/restored, and might change arbitrarily across
  preemption.

* The VL to use hasn't been programmed, so performance might vary
  arbitrarily even in the absence of preemption.

... so this isn't even safe on machines with (only) a 128-bit VL, and
there are big open design questions for the infrastructure we'd need.

> Once we encounter a good use case for SVE in the kernel, we might
> reconsider this, but as it stands, this patch should not be applied.

I agree.

Christoph, please do not pick this or any other in-kernel SVE patches.
They cannot function correctly without additional infrastructure.

Demian, for patches that use NEON/SVE/SME/etc, please Cc LAKML
(linux-arm-kernel@lists.infradead.org), so that folk familiar with ARM
see the patches.

Mark

> (leaving the reply untrimmed for the benefit of the cc'ees I added)
> 
> > raid6: neonx2   gen() 16248 MB/s
> > raid6: neonx1   gen() 13591 MB/s
> > raid6: using algorithm svex4 gen() 19688 MB/s
> > raid6: .... xor() 11212 MB/s, rmw enabled
> > raid6: using neon recovery algorithm
> >
> > Note that for the recovery path (`xor_syndrome`), NEON may still be
> > selected dynamically by the algorithm benchmark, as the recovery
> > workload is heavily memory-bound.
> >
> > Signed-off-by: Demian Shulhan <demyansh@gmail.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> > Closes: https://lore.kernel.org/oe-kbuild-all/202603181940.cFwYmYoi-lkp@intel.com/
> > ---
> >  include/linux/raid/pq.h |   3 +
> >  lib/raid6/Makefile      |   5 +
> >  lib/raid6/algos.c       |   5 +
> >  lib/raid6/sve.c         | 675 ++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 688 insertions(+)
> >  create mode 100644 lib/raid6/sve.c
> >
> > diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> > index 2467b3be15c9..787cc57aea9d 100644
> > --- a/include/linux/raid/pq.h
> > +++ b/include/linux/raid/pq.h
> > @@ -140,6 +140,9 @@ extern const struct raid6_calls raid6_neonx1;
> >  extern const struct raid6_calls raid6_neonx2;
> >  extern const struct raid6_calls raid6_neonx4;
> >  extern const struct raid6_calls raid6_neonx8;
> > +extern const struct raid6_calls raid6_svex1;
> > +extern const struct raid6_calls raid6_svex2;
> > +extern const struct raid6_calls raid6_svex4;
> > 
> >  /* Algorithm list */
> >  extern const struct raid6_calls * const raid6_algos[];
> > diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> > index 5be0a4e60ab1..6cdaa6f206fb 100644
> > --- a/lib/raid6/Makefile
> > +++ b/lib/raid6/Makefile
> > @@ -8,6 +8,7 @@ raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
> >  raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
> >                                vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
> >  raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
> > +raid6_pq-$(CONFIG_ARM64_SVE) += sve.o
> >  raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
> >  raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
> >  raid6_pq-$(CONFIG_RISCV_ISA_V) += rvv.o recov_rvv.o
> > @@ -67,6 +68,10 @@ CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU)
> >  CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU)
> >  CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU)
> >  CFLAGS_REMOVE_recov_neon_inner.o += $(CC_FLAGS_NO_FPU)
> > +
> > +CFLAGS_sve.o += $(CC_FLAGS_FPU)
> > +CFLAGS_REMOVE_sve.o += $(CC_FLAGS_NO_FPU)
> > +
> >  targets += neon1.c neon2.c neon4.c neon8.c
> >  $(obj)/neon%.c: $(src)/neon.uc $(src)/unroll.awk FORCE
> >  	$(call if_changed,unroll)
> > diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> > index 799e0e5eac26..0ae73c3a4be3 100644
> > --- a/lib/raid6/algos.c
> > +++ b/lib/raid6/algos.c
> > @@ -66,6 +66,11 @@ const struct raid6_calls * const raid6_algos[] = {
> >  	&raid6_neonx2,
> >  	&raid6_neonx1,
> >  #endif
> > +#ifdef CONFIG_ARM64_SVE
> > +	&raid6_svex4,
> > +	&raid6_svex2,
> > +	&raid6_svex1,
> > +#endif
> >  #ifdef CONFIG_LOONGARCH
> >  #ifdef CONFIG_CPU_HAS_LASX
> >  	&raid6_lasx,
> > diff --git a/lib/raid6/sve.c b/lib/raid6/sve.c
> > new file mode 100644
> > index 000000000000..d52937f806d4
> > --- /dev/null
> > +++ b/lib/raid6/sve.c
> > @@ -0,0 +1,675 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * RAID-6 syndrome calculation using ARM SVE instructions
> > + */
> > +
> > +#include <linux/raid/pq.h>
> > +
> > +#ifdef __KERNEL__
> > +#include <asm/simd.h>
> > +#include <linux/cpufeature.h>
> > +#else
> > +#define scoped_ksimd()
> > +#define system_supports_sve() (1)
> > +#endif
> > +
> > +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = disks - 3;
> > +
> > +	p = dptr[z0 + 1];
> > +	q = dptr[z0 + 2];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z0.b, p0/z, [x6, x5]\n"
> > +		"mov z1.d, z0.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 2f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +
> > +		"add x5, x5, x3\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4"
> > +	);
> > +}
> > +
> > +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> > +					 unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = stop;
> > +
> > +	p = dptr[disks - 2];
> > +	q = dptr[disks - 1];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z1.b, p0/z, [x6, x5]\n"
> > +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> > +		"eor z0.d, z0.d, z1.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, %w[start]\n"
> > +		"blt 2f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"mov w7, %w[start]\n"
> > +		"sub w7, w7, #1\n"
> > +		"3:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 4f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 3b\n"
> > +		"4:\n"
> > +
> > +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> > +		"eor z1.d, z1.d, z2.d\n"
> > +
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +
> > +		"add x5, x5, x3\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4"
> > +	);
> > +}
> > +
> > +static void raid6_sve2_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = disks - 3;
> > +
> > +	p = dptr[z0 + 1];
> > +	q = dptr[z0 + 2];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z0.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z5.b, p0/z, [x6, x8]\n"
> > +		"mov z1.d, z0.d\n"
> > +		"mov z6.d, z5.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 2f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [x6, x8]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z5.d, z5.d, z7.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"st1b z5.b, p0, [%[p], x8]\n"
> > +		"st1b z6.b, p0, [%[q], x8]\n"
> > +
> > +		"add x5, x5, x3\n"
> > +		"add x5, x5, x3\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4",
> > +		  "z5", "z6", "z7", "z8"
> > +	);
> > +}
> > +
> > +static void raid6_sve2_xor_syndrome_real(int disks, int start, int stop,
> > +					 unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = stop;
> > +
> > +	p = dptr[disks - 2];
> > +	q = dptr[disks - 1];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z1.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z6.b, p0/z, [x6, x8]\n"
> > +
> > +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> > +		"ld1b z5.b, p0/z, [%[p], x8]\n"
> > +
> > +		"eor z0.d, z0.d, z1.d\n"
> > +		"eor z5.d, z5.d, z6.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, %w[start]\n"
> > +		"blt 2f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [x6, x8]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z5.d, z5.d, z7.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"mov w7, %w[start]\n"
> > +		"sub w7, w7, #1\n"
> > +		"3:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 4f\n"
> > +
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 3b\n"
> > +		"4:\n"
> > +
> > +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [%[q], x8]\n"
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"st1b z5.b, p0, [%[p], x8]\n"
> > +		"st1b z6.b, p0, [%[q], x8]\n"
> > +
> > +		"add x5, x5, x3\n"
> > +		"add x5, x5, x3\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4",
> > +		  "z5", "z6", "z7", "z8"
> > +	);
> > +}
> > +
> > +static void raid6_sve4_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = disks - 3;
> > +
> > +	p = dptr[z0 + 1];
> > +	q = dptr[z0 + 2];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z0.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z5.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z10.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z15.b, p0/z, [x6, x8]\n"
> > +
> > +		"mov z1.d, z0.d\n"
> > +		"mov z6.d, z5.d\n"
> > +		"mov z11.d, z10.d\n"
> > +		"mov z16.d, z15.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 2f\n"
> > +
> > +		// software pipelining: load data early
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z12.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z17.b, p0/z, [x6, x8]\n"
> > +
> > +		// math block 1
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		// math block 2
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z5.d, z5.d, z7.d\n"
> > +
> > +		// math block 3
> > +		"mov z13.d, z11.d\n"
> > +		"asr z13.b, p0/m, z13.b, #7\n"
> > +		"lsl z11.b, p0/m, z11.b, #1\n"
> > +		"and z13.d, z13.d, z4.d\n"
> > +		"eor z11.d, z11.d, z13.d\n"
> > +		"eor z11.d, z11.d, z12.d\n"
> > +		"eor z10.d, z10.d, z12.d\n"
> > +
> > +		// math block 4
> > +		"mov z18.d, z16.d\n"
> > +		"asr z18.b, p0/m, z18.b, #7\n"
> > +		"lsl z16.b, p0/m, z16.b, #1\n"
> > +		"and z18.d, z18.d, z4.d\n"
> > +		"eor z16.d, z16.d, z18.d\n"
> > +		"eor z16.d, z16.d, z17.d\n"
> > +		"eor z15.d, z15.d, z17.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"st1b z5.b, p0, [%[p], x8]\n"
> > +		"st1b z6.b, p0, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"st1b z10.b, p0, [%[p], x8]\n"
> > +		"st1b z11.b, p0, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"st1b z15.b, p0, [%[p], x8]\n"
> > +		"st1b z16.b, p0, [%[q], x8]\n"
> > +
> > +		"add x8, x3, x3\n"
> > +		"add x5, x5, x8, lsl #1\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4",
> > +		  "z5", "z6", "z7", "z8",
> > +		  "z10", "z11", "z12", "z13",
> > +		  "z15", "z16", "z17", "z18"
> > +	);
> > +}
> > +
> > +static void raid6_sve4_xor_syndrome_real(int disks, int start, int stop,
> > +					 unsigned long bytes, void **ptrs)
> > +{
> > +	u8 **dptr = (u8 **)ptrs;
> > +	u8 *p, *q;
> > +	long z0 = stop;
> > +
> > +	p = dptr[disks - 2];
> > +	q = dptr[disks - 1];
> > +
> > +	asm volatile(
> > +		".arch armv8.2-a+sve\n"
> > +		"ptrue p0.b\n"
> > +		"cntb x3\n"
> > +		"mov w4, #0x1d\n"
> > +		"dup z4.b, w4\n"
> > +		"mov x5, #0\n"
> > +
> > +		"0:\n"
> > +		"ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > +		"ld1b z1.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z6.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z11.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z16.b, p0/z, [x6, x8]\n"
> > +
> > +		"ld1b z0.b, p0/z, [%[p], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z5.b, p0/z, [%[p], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z10.b, p0/z, [%[p], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z15.b, p0/z, [%[p], x8]\n"
> > +
> > +		"eor z0.d, z0.d, z1.d\n"
> > +		"eor z5.d, z5.d, z6.d\n"
> > +		"eor z10.d, z10.d, z11.d\n"
> > +		"eor z15.d, z15.d, z16.d\n"
> > +
> > +		"mov w7, %w[z0]\n"
> > +		"sub w7, w7, #1\n"
> > +
> > +		"1:\n"
> > +		"cmp w7, %w[start]\n"
> > +		"blt 2f\n"
> > +
> > +		// software pipelining: load data early
> > +		"sxtw x8, w7\n"
> > +		"ldr x6, [%[dptr], x8, lsl #3]\n"
> > +		"ld1b z2.b, p0/z, [x6, x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z12.b, p0/z, [x6, x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z17.b, p0/z, [x6, x8]\n"
> > +
> > +		// math block 1
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z0.d, z0.d, z2.d\n"
> > +
> > +		// math block 2
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z5.d, z5.d, z7.d\n"
> > +
> > +		// math block 3
> > +		"mov z13.d, z11.d\n"
> > +		"asr z13.b, p0/m, z13.b, #7\n"
> > +		"lsl z11.b, p0/m, z11.b, #1\n"
> > +		"and z13.d, z13.d, z4.d\n"
> > +		"eor z11.d, z11.d, z13.d\n"
> > +		"eor z11.d, z11.d, z12.d\n"
> > +		"eor z10.d, z10.d, z12.d\n"
> > +
> > +		// math block 4
> > +		"mov z18.d, z16.d\n"
> > +		"asr z18.b, p0/m, z18.b, #7\n"
> > +		"lsl z16.b, p0/m, z16.b, #1\n"
> > +		"and z18.d, z18.d, z4.d\n"
> > +		"eor z16.d, z16.d, z18.d\n"
> > +		"eor z16.d, z16.d, z17.d\n"
> > +		"eor z15.d, z15.d, z17.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 1b\n"
> > +		"2:\n"
> > +
> > +		"mov w7, %w[start]\n"
> > +		"sub w7, w7, #1\n"
> > +		"3:\n"
> > +		"cmp w7, #0\n"
> > +		"blt 4f\n"
> > +
> > +		// math block 1
> > +		"mov z3.d, z1.d\n"
> > +		"asr z3.b, p0/m, z3.b, #7\n"
> > +		"lsl z1.b, p0/m, z1.b, #1\n"
> > +		"and z3.d, z3.d, z4.d\n"
> > +		"eor z1.d, z1.d, z3.d\n"
> > +
> > +		// math block 2
> > +		"mov z8.d, z6.d\n"
> > +		"asr z8.b, p0/m, z8.b, #7\n"
> > +		"lsl z6.b, p0/m, z6.b, #1\n"
> > +		"and z8.d, z8.d, z4.d\n"
> > +		"eor z6.d, z6.d, z8.d\n"
> > +
> > +		// math block 3
> > +		"mov z13.d, z11.d\n"
> > +		"asr z13.b, p0/m, z13.b, #7\n"
> > +		"lsl z11.b, p0/m, z11.b, #1\n"
> > +		"and z13.d, z13.d, z4.d\n"
> > +		"eor z11.d, z11.d, z13.d\n"
> > +
> > +		// math block 4
> > +		"mov z18.d, z16.d\n"
> > +		"asr z18.b, p0/m, z18.b, #7\n"
> > +		"lsl z16.b, p0/m, z16.b, #1\n"
> > +		"and z18.d, z18.d, z4.d\n"
> > +		"eor z16.d, z16.d, z18.d\n"
> > +
> > +		"sub w7, w7, #1\n"
> > +		"b 3b\n"
> > +		"4:\n"
> > +
> > +		// Load q and XOR
> > +		"ld1b z2.b, p0/z, [%[q], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"ld1b z7.b, p0/z, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z12.b, p0/z, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"ld1b z17.b, p0/z, [%[q], x8]\n"
> > +
> > +		"eor z1.d, z1.d, z2.d\n"
> > +		"eor z6.d, z6.d, z7.d\n"
> > +		"eor z11.d, z11.d, z12.d\n"
> > +		"eor z16.d, z16.d, z17.d\n"
> > +
> > +		// Store results
> > +		"st1b z0.b, p0, [%[p], x5]\n"
> > +		"st1b z1.b, p0, [%[q], x5]\n"
> > +		"add x8, x5, x3\n"
> > +		"st1b z5.b, p0, [%[p], x8]\n"
> > +		"st1b z6.b, p0, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"st1b z10.b, p0, [%[p], x8]\n"
> > +		"st1b z11.b, p0, [%[q], x8]\n"
> > +		"add x8, x8, x3\n"
> > +		"st1b z15.b, p0, [%[p], x8]\n"
> > +		"st1b z16.b, p0, [%[q], x8]\n"
> > +
> > +		"add x8, x3, x3\n"
> > +		"add x5, x5, x8, lsl #1\n"
> > +		"cmp x5, %[bytes]\n"
> > +		"blt 0b\n"
> > +		:
> > +		: [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > +		  [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > +		: "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > +		  "z0", "z1", "z2", "z3", "z4",
> > +		  "z5", "z6", "z7", "z8",
> > +		  "z10", "z11", "z12", "z13",
> > +		  "z15", "z16", "z17", "z18"
> > +	);
> > +}
> > +
> > +#define RAID6_SVE_WRAPPER(_n)						\
> > +	static void raid6_sve ## _n ## _gen_syndrome(int disks,		\
> > +					size_t bytes, void **ptrs)	\
> > +	{								\
> > +		scoped_ksimd()						\
> > +		raid6_sve ## _n ## _gen_syndrome_real(disks,		\
> > +					(unsigned long)bytes, ptrs);	\
> > +	}								\
> > +	static void raid6_sve ## _n ## _xor_syndrome(int disks,		\
> > +					int start, int stop,		\
> > +					size_t bytes, void **ptrs)	\
> > +	{								\
> > +		scoped_ksimd()						\
> > +		raid6_sve ## _n ## _xor_syndrome_real(disks,		\
> > +				start, stop, (unsigned long)bytes, ptrs);\
> > +	}								\
> > +	struct raid6_calls const raid6_svex ## _n = {			\
> > +		raid6_sve ## _n ## _gen_syndrome,			\
> > +		raid6_sve ## _n ## _xor_syndrome,			\
> > +		raid6_have_sve,						\
> > +		"svex" #_n,						\
> > +		0							\
> > +	}
> > +
> > +static int raid6_have_sve(void)
> > +{
> > +	return system_supports_sve();
> > +}
> > +
> > +RAID6_SVE_WRAPPER(1);
> > +RAID6_SVE_WRAPPER(2);
> > +RAID6_SVE_WRAPPER(4);
> > -- 
> > 2.43.0



* Re: [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation
  2026-03-24 10:04   ` Mark Rutland
@ 2026-03-29 13:01     ` Demian Shulhan
  0 siblings, 0 replies; 4+ messages in thread
From: Demian Shulhan @ 2026-03-29 13:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Ard Biesheuvel, Christoph Hellwig, Song Liu, Yu Kuai, Will Deacon,
	Catalin Marinas, broonie, linux-arm-kernel, robin.murphy, Li Nan,
	linux-raid, linux-kernel

Hi all,

Thanks for the feedback and for clarifying the current limitations of
the kernel-mode SIMD API regarding SVE context preservation. I
completely understand why this patch cannot be merged in its current
state until the fundamental SVE infrastructure is in place.

However, I want to address the comment about the marginal 0.3% speedup
on the 8-disk benchmark. While a small array is indeed bottlenecked on
memory bandwidth, that number doesn't tell the whole story. I extracted
the SVE and NEON implementations into a user-space benchmark and
measured the actual hardware efficiency with perf stat, on the same
AWS Graviton3 (Neoverse-V1) instance. The results show a large
difference in CPU efficiency: for the same 8-disk workload, the svex4
implementation needs about 35% fewer instructions and 46% fewer CPU
cycles than neonx4 (7.58 billion instructions vs 11.62 billion). That
translates directly into significant energy savings and reduced
pressure on the CPU frontend, leaving more compute resources available
for network and NVMe queues during an array rebuild.

Furthermore, as Christoph suggested, I tested scalability on wider
arrays, since the default kernel benchmark is hardcoded to 8 disks,
which doesn't give the unrolled SVE loop enough data to shine. On a
16-disk array, svex4 hits 15.1 GB/s compared to 8.0 GB/s for neonx4.
On a 24-disk array, while neonx4 chokes and drops to 7.8 GB/s, svex4
maintains a stable 15.0 GB/s, effectively doubling the throughput.

I agree this patch should be put on hold for now. My intention is to
leave these numbers here as evidence that implementing SVE context
preservation in the kernel (the "good use case") is highly justifiable
from both a power-efficiency and a wide-array throughput perspective
on modern ARM64 hardware.

Thanks again for your time and review!

---------------------------------------------------
User space test results:
==================================================
    RAID6 SVE Benchmark Results (AWS Graviton3)
==================================================
Instance Details:
Linux ip-172-31-87-234 6.8.0-1047-aws #50~22.04.1-Ubuntu SMP Thu Feb
19 20:49:25 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
--------------------------------------------------

[Test 1: Energy Efficiency / Instruction Count (8 disks)]
Running baseline (neonx4)...
algo=neonx4 ndisks=8 iterations=1000000 time=2.681s MB/s=8741.36

 Performance counter stats for './raid6_bench neonx4 8 1000000':

       11626717224      instructions                     #    1.67  insn per cycle
        6946699489      cycles
         257013219      L1-dcache-load-misses

       2.681213149 seconds time elapsed

       2.676771000 seconds user
       0.002000000 seconds sys


Running SVE (svex1)...
algo=svex1 ndisks=8 iterations=1000000 time=1.688s MB/s=13885.23

 Performance counter stats for './raid6_bench svex1 8 1000000':

       10527277490      instructions                     #    2.40  insn per cycle
        4379539835      cycles
         175695656      L1-dcache-load-misses

       1.688852006 seconds time elapsed

       1.687298000 seconds user
       0.000999000 seconds sys


Running SVE unrolled x4 (svex4)...
algo=svex4 ndisks=8 iterations=1000000 time=1.445s MB/s=16215.04

 Performance counter stats for './raid6_bench svex4 8 1000000':

        7587813392      instructions                     #    2.02  insn per cycle
        3748486131      cycles
         213816184      L1-dcache-load-misses

       1.446032415 seconds time elapsed

       1.442412000 seconds user
       0.002996000 seconds sys

==================================================
[Test 2: Scalability on Wide RAID Arrays (MB/s)]

--- 16 Disks ---
algo=neonx4 ndisks=16 iterations=1000000 time=6.783s MB/s=8062.33
algo=svex1 ndisks=16 iterations=1000000 time=4.912s MB/s=11132.90
algo=svex4 ndisks=16 iterations=1000000 time=3.601s MB/s=15188.85

--- 24 Disks ---
algo=neonx4 ndisks=24 iterations=1000000 time=11.011s MB/s=7805.02
algo=svex1 ndisks=24 iterations=1000000 time=8.843s MB/s=9718.26
algo=svex4 ndisks=24 iterations=1000000 time=5.719s MB/s=15026.92

Extra tests:
--- 48 Disks ---
algo=neonx4 ndisks=48 iterations=500000 time=11.826s MB/s=7597.25
algo=svex4 ndisks=48 iterations=500000 time=5.808s MB/s=15468.10
--- 96 Disks ---
algo=neonx4 ndisks=96 iterations=200000 time=9.783s MB/s=7507.01
algo=svex4 ndisks=96 iterations=200000 time=4.701s MB/s=15621.17
==================================================

On Tue, 24 Mar 2026 at 12:05, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Mar 24, 2026 at 09:00:16AM +0100, Ard Biesheuvel wrote:
> > On Wed, 18 Mar 2026, at 16:02, Demian Shulhan wrote:
> > > Implement Scalable Vector Extension (SVE) optimized routines for RAID6
> > > syndrome generation and recovery on ARM64.
> > >
> > > The SVE instruction set allows for variable vector lengths (from 128 to
> > > 2048 bits), scaling automatically with the hardware capabilities. This
> > > implementation handles arbitrary SVE vector lengths using the `cntb`
> > > instruction to determine the runtime vector length.
> > >
> > > The implementation introduces `svex1`, `svex2`, and `svex4` algorithms.
> > > The `svex4` algorithm utilizes loop unrolling by 4 blocks per iteration
> > > and manual software pipelining (interleaving memory loads with XORs)
> > > to minimize instruction dependency stalls and maximize CPU pipeline
> > > utilization and memory bandwidth.
> > >
> > > Performance was tested on an AWS Graviton3 (Neoverse-V1) instance which
> > > features 256-bit SVE vector length. The `svex4` implementation outperforms
> > > the existing 128-bit `neonx4` baseline for syndrome generation:
> > >
> > > raid6: svex4    gen() 19688 MB/s
> > ...
> > > raid6: neonx4   gen() 19612 MB/s
> >
> > You're being very generous characterising a 0.3% speedup as 'outperforms'
> >
> > But the real problem here is that the kernel-mode SIMD API only
> > supports NEON and not SVE, and preserves/restores only the 128-bit
> > view on the NEON/SVE register file. So any context switch or softirq
> > that uses kernel-mode SIMD too, and your SVE register values will get
> > truncated.
>
> Just to be a bit more explicit, since only the NEON register file is
> saved:
>
> * The vector registers will be truncated to 128-bit across
>   preemption or softirq.
>
> * The predicates won't be saved/restored and will change arbitrarily
>   across preemption.
>
> * The VL won't be saved/restored, and might change arbitrarily across
>   preemption.
>
> * The VL to use hasn't been programmed, so performance might vary
>   arbitrarily even in the absence of preemption.
>
> ... so this isn't even safe on machines with (only) a 128-bit VL, and
> there are big open design questions for the infrastructure we'd need.
>
> > Once we encounter a good use case for SVE in the kernel, we might
> > reconsider this, but as it stands, this patch should not be applied.
>
> I agree.
>
> Christoph, please do not pick this or any other in-kernel SVE patches.
> They cannot function correctly without additional infrastructure.
>
> Demian, for patches that use NEON/SVE/SME/etc, please Cc LAKML
> (linux-arm-kernel@lists.infradead.org), so that folk familiar with ARM
> see the patches.
>
> Mark
>
> > (leaving the reply untrimmed for the benefit of the cc'ees I added)
> >
> > > raid6: neonx2   gen() 16248 MB/s
> > > raid6: neonx1   gen() 13591 MB/s
> > > raid6: using algorithm svex4 gen() 19688 MB/s
> > > raid6: .... xor() 11212 MB/s, rmw enabled
> > > raid6: using neon recovery algorithm
> > >
> > > Note that for the recovery path (`xor_syndrome`), NEON may still be
> > > selected dynamically by the algorithm benchmark, as the recovery
> > > workload is heavily memory-bound.
> > >
> > > Signed-off-by: Demian Shulhan <demyansh@gmail.com>
> > > Reported-by: kernel test robot <lkp@intel.com>
> > > Closes:
> > > https://lore.kernel.org/oe-kbuild-all/202603181940.cFwYmYoi-lkp@intel.com/
> > > ---
> > >  include/linux/raid/pq.h |   3 +
> > >  lib/raid6/Makefile      |   5 +
> > >  lib/raid6/algos.c       |   5 +
> > >  lib/raid6/sve.c         | 675 ++++++++++++++++++++++++++++++++++++++++
> > >  4 files changed, 688 insertions(+)
> > >  create mode 100644 lib/raid6/sve.c
> > >
> > > diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> > > index 2467b3be15c9..787cc57aea9d 100644
> > > --- a/include/linux/raid/pq.h
> > > +++ b/include/linux/raid/pq.h
> > > @@ -140,6 +140,9 @@ extern const struct raid6_calls raid6_neonx1;
> > >  extern const struct raid6_calls raid6_neonx2;
> > >  extern const struct raid6_calls raid6_neonx4;
> > >  extern const struct raid6_calls raid6_neonx8;
> > > +extern const struct raid6_calls raid6_svex1;
> > > +extern const struct raid6_calls raid6_svex2;
> > > +extern const struct raid6_calls raid6_svex4;
> > >
> > >  /* Algorithm list */
> > >  extern const struct raid6_calls * const raid6_algos[];
> > > diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> > > index 5be0a4e60ab1..6cdaa6f206fb 100644
> > > --- a/lib/raid6/Makefile
> > > +++ b/lib/raid6/Makefile
> > > @@ -8,6 +8,7 @@ raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o
> > >  raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
> > >                                vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
> > >  raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o recov_neon.o recov_neon_inner.o
> > > +raid6_pq-$(CONFIG_ARM64_SVE) += sve.o
> > >  raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
> > >  raid6_pq-$(CONFIG_LOONGARCH) += loongarch_simd.o recov_loongarch_simd.o
> > >  raid6_pq-$(CONFIG_RISCV_ISA_V) += rvv.o recov_rvv.o
> > > @@ -67,6 +68,10 @@ CFLAGS_REMOVE_neon2.o += $(CC_FLAGS_NO_FPU)
> > >  CFLAGS_REMOVE_neon4.o += $(CC_FLAGS_NO_FPU)
> > >  CFLAGS_REMOVE_neon8.o += $(CC_FLAGS_NO_FPU)
> > >  CFLAGS_REMOVE_recov_neon_inner.o += $(CC_FLAGS_NO_FPU)
> > > +
> > > +CFLAGS_sve.o += $(CC_FLAGS_FPU)
> > > +CFLAGS_REMOVE_sve.o += $(CC_FLAGS_NO_FPU)
> > > +
> > >  targets += neon1.c neon2.c neon4.c neon8.c
> > >  $(obj)/neon%.c: $(src)/neon.uc $(src)/unroll.awk FORCE
> > >     $(call if_changed,unroll)
> > > diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> > > index 799e0e5eac26..0ae73c3a4be3 100644
> > > --- a/lib/raid6/algos.c
> > > +++ b/lib/raid6/algos.c
> > > @@ -66,6 +66,11 @@ const struct raid6_calls * const raid6_algos[] = {
> > >     &raid6_neonx2,
> > >     &raid6_neonx1,
> > >  #endif
> > > +#ifdef CONFIG_ARM64_SVE
> > > +   &raid6_svex4,
> > > +   &raid6_svex2,
> > > +   &raid6_svex1,
> > > +#endif
> > >  #ifdef CONFIG_LOONGARCH
> > >  #ifdef CONFIG_CPU_HAS_LASX
> > >     &raid6_lasx,
> > > diff --git a/lib/raid6/sve.c b/lib/raid6/sve.c
> > > new file mode 100644
> > > index 000000000000..d52937f806d4
> > > --- /dev/null
> > > +++ b/lib/raid6/sve.c
> > > @@ -0,0 +1,675 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +/*
> > > + * RAID-6 syndrome calculation using ARM SVE instructions
> > > + */
> > > +
> > > +#include <linux/raid/pq.h>
> > > +
> > > +#ifdef __KERNEL__
> > > +#include <asm/simd.h>
> > > +#include <linux/cpufeature.h>
> > > +#else
> > > +#define scoped_ksimd()
> > > +#define system_supports_sve() (1)
> > > +#endif
> > > +
> > > +static void raid6_sve1_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = disks - 3;
> > > +
> > > +   p = dptr[z0 + 1];
> > > +   q = dptr[z0 + 2];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z0.b, p0/z, [x6, x5]\n"
> > > +           "mov z1.d, z0.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 2f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +
> > > +           "add x5, x5, x3\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve1_xor_syndrome_real(int disks, int start, int stop,
> > > +                                    unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = stop;
> > > +
> > > +   p = dptr[disks - 2];
> > > +   q = dptr[disks - 1];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z1.b, p0/z, [x6, x5]\n"
> > > +           "ld1b z0.b, p0/z, [%[p], x5]\n"
> > > +           "eor z0.d, z0.d, z1.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, %w[start]\n"
> > > +           "blt 2f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "mov w7, %w[start]\n"
> > > +           "sub w7, w7, #1\n"
> > > +           "3:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 4f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 3b\n"
> > > +           "4:\n"
> > > +
> > > +           "ld1b z2.b, p0/z, [%[q], x5]\n"
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +
> > > +           "add x5, x5, x3\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve2_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = disks - 3;
> > > +
> > > +   p = dptr[z0 + 1];
> > > +   q = dptr[z0 + 2];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z0.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z5.b, p0/z, [x6, x8]\n"
> > > +           "mov z1.d, z0.d\n"
> > > +           "mov z6.d, z5.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 2f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z5.d, z5.d, z7.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "st1b z5.b, p0, [%[p], x8]\n"
> > > +           "st1b z6.b, p0, [%[q], x8]\n"
> > > +
> > > +           "add x5, x5, x3\n"
> > > +           "add x5, x5, x3\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4",
> > > +             "z5", "z6", "z7", "z8"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve2_xor_syndrome_real(int disks, int start, int stop,
> > > +                                    unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = stop;
> > > +
> > > +   p = dptr[disks - 2];
> > > +   q = dptr[disks - 1];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z1.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z6.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "ld1b z0.b, p0/z, [%[p], x5]\n"
> > > +           "ld1b z5.b, p0/z, [%[p], x8]\n"
> > > +
> > > +           "eor z0.d, z0.d, z1.d\n"
> > > +           "eor z5.d, z5.d, z6.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, %w[start]\n"
> > > +           "blt 2f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z5.d, z5.d, z7.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "mov w7, %w[start]\n"
> > > +           "sub w7, w7, #1\n"
> > > +           "3:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 4f\n"
> > > +
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 3b\n"
> > > +           "4:\n"
> > > +
> > > +           "ld1b z2.b, p0/z, [%[q], x5]\n"
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [%[q], x8]\n"
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "st1b z5.b, p0, [%[p], x8]\n"
> > > +           "st1b z6.b, p0, [%[q], x8]\n"
> > > +
> > > +           "add x5, x5, x3\n"
> > > +           "add x5, x5, x3\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4",
> > > +             "z5", "z6", "z7", "z8"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve4_gen_syndrome_real(int disks, unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = disks - 3;
> > > +
> > > +   p = dptr[z0 + 1];
> > > +   q = dptr[z0 + 2];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z0.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z5.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z10.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z15.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "mov z1.d, z0.d\n"
> > > +           "mov z6.d, z5.d\n"
> > > +           "mov z11.d, z10.d\n"
> > > +           "mov z16.d, z15.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 2f\n"
> > > +
> > > +           // software pipelining: load data early
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z12.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z17.b, p0/z, [x6, x8]\n"
> > > +
> > > +           // math block 1
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           // math block 2
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z5.d, z5.d, z7.d\n"
> > > +
> > > +           // math block 3
> > > +           "mov z13.d, z11.d\n"
> > > +           "asr z13.b, p0/m, z13.b, #7\n"
> > > +           "lsl z11.b, p0/m, z11.b, #1\n"
> > > +           "and z13.d, z13.d, z4.d\n"
> > > +           "eor z11.d, z11.d, z13.d\n"
> > > +           "eor z11.d, z11.d, z12.d\n"
> > > +           "eor z10.d, z10.d, z12.d\n"
> > > +
> > > +           // math block 4
> > > +           "mov z18.d, z16.d\n"
> > > +           "asr z18.b, p0/m, z18.b, #7\n"
> > > +           "lsl z16.b, p0/m, z16.b, #1\n"
> > > +           "and z18.d, z18.d, z4.d\n"
> > > +           "eor z16.d, z16.d, z18.d\n"
> > > +           "eor z16.d, z16.d, z17.d\n"
> > > +           "eor z15.d, z15.d, z17.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "st1b z5.b, p0, [%[p], x8]\n"
> > > +           "st1b z6.b, p0, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "st1b z10.b, p0, [%[p], x8]\n"
> > > +           "st1b z11.b, p0, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "st1b z15.b, p0, [%[p], x8]\n"
> > > +           "st1b z16.b, p0, [%[q], x8]\n"
> > > +
> > > +           "add x8, x3, x3\n"
> > > +           "add x5, x5, x8, lsl #1\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4",
> > > +             "z5", "z6", "z7", "z8",
> > > +             "z10", "z11", "z12", "z13",
> > > +             "z15", "z16", "z17", "z18"
> > > +   );
> > > +}
> > > +
> > > +static void raid6_sve4_xor_syndrome_real(int disks, int start, int stop,
> > > +                                    unsigned long bytes, void **ptrs)
> > > +{
> > > +   u8 **dptr = (u8 **)ptrs;
> > > +   u8 *p, *q;
> > > +   long z0 = stop;
> > > +
> > > +   p = dptr[disks - 2];
> > > +   q = dptr[disks - 1];
> > > +
> > > +   asm volatile(
> > > +           ".arch armv8.2-a+sve\n"
> > > +           "ptrue p0.b\n"
> > > +           "cntb x3\n"
> > > +           "mov w4, #0x1d\n"
> > > +           "dup z4.b, w4\n"
> > > +           "mov x5, #0\n"
> > > +
> > > +           "0:\n"
> > > +           "ldr x6, [%[dptr], %[z0], lsl #3]\n"
> > > +           "ld1b z1.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z6.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z11.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z16.b, p0/z, [x6, x8]\n"
> > > +
> > > +           "ld1b z0.b, p0/z, [%[p], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z5.b, p0/z, [%[p], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z10.b, p0/z, [%[p], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z15.b, p0/z, [%[p], x8]\n"
> > > +
> > > +           "eor z0.d, z0.d, z1.d\n"
> > > +           "eor z5.d, z5.d, z6.d\n"
> > > +           "eor z10.d, z10.d, z11.d\n"
> > > +           "eor z15.d, z15.d, z16.d\n"
> > > +
> > > +           "mov w7, %w[z0]\n"
> > > +           "sub w7, w7, #1\n"
> > > +
> > > +           "1:\n"
> > > +           "cmp w7, %w[start]\n"
> > > +           "blt 2f\n"
> > > +
> > > +           // software pipelining: load data early
> > > +           "sxtw x8, w7\n"
> > > +           "ldr x6, [%[dptr], x8, lsl #3]\n"
> > > +           "ld1b z2.b, p0/z, [x6, x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z12.b, p0/z, [x6, x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z17.b, p0/z, [x6, x8]\n"
> > > +
> > > +           // math block 1
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z0.d, z0.d, z2.d\n"
> > > +
> > > +           // math block 2
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z5.d, z5.d, z7.d\n"
> > > +
> > > +           // math block 3
> > > +           "mov z13.d, z11.d\n"
> > > +           "asr z13.b, p0/m, z13.b, #7\n"
> > > +           "lsl z11.b, p0/m, z11.b, #1\n"
> > > +           "and z13.d, z13.d, z4.d\n"
> > > +           "eor z11.d, z11.d, z13.d\n"
> > > +           "eor z11.d, z11.d, z12.d\n"
> > > +           "eor z10.d, z10.d, z12.d\n"
> > > +
> > > +           // math block 4
> > > +           "mov z18.d, z16.d\n"
> > > +           "asr z18.b, p0/m, z18.b, #7\n"
> > > +           "lsl z16.b, p0/m, z16.b, #1\n"
> > > +           "and z18.d, z18.d, z4.d\n"
> > > +           "eor z16.d, z16.d, z18.d\n"
> > > +           "eor z16.d, z16.d, z17.d\n"
> > > +           "eor z15.d, z15.d, z17.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 1b\n"
> > > +           "2:\n"
> > > +
> > > +           "mov w7, %w[start]\n"
> > > +           "sub w7, w7, #1\n"
> > > +           "3:\n"
> > > +           "cmp w7, #0\n"
> > > +           "blt 4f\n"
> > > +
> > > +           // math block 1
> > > +           "mov z3.d, z1.d\n"
> > > +           "asr z3.b, p0/m, z3.b, #7\n"
> > > +           "lsl z1.b, p0/m, z1.b, #1\n"
> > > +           "and z3.d, z3.d, z4.d\n"
> > > +           "eor z1.d, z1.d, z3.d\n"
> > > +
> > > +           // math block 2
> > > +           "mov z8.d, z6.d\n"
> > > +           "asr z8.b, p0/m, z8.b, #7\n"
> > > +           "lsl z6.b, p0/m, z6.b, #1\n"
> > > +           "and z8.d, z8.d, z4.d\n"
> > > +           "eor z6.d, z6.d, z8.d\n"
> > > +
> > > +           // math block 3
> > > +           "mov z13.d, z11.d\n"
> > > +           "asr z13.b, p0/m, z13.b, #7\n"
> > > +           "lsl z11.b, p0/m, z11.b, #1\n"
> > > +           "and z13.d, z13.d, z4.d\n"
> > > +           "eor z11.d, z11.d, z13.d\n"
> > > +
> > > +           // math block 4
> > > +           "mov z18.d, z16.d\n"
> > > +           "asr z18.b, p0/m, z18.b, #7\n"
> > > +           "lsl z16.b, p0/m, z16.b, #1\n"
> > > +           "and z18.d, z18.d, z4.d\n"
> > > +           "eor z16.d, z16.d, z18.d\n"
> > > +
> > > +           "sub w7, w7, #1\n"
> > > +           "b 3b\n"
> > > +           "4:\n"
> > > +
> > > +           // Load q and XOR
> > > +           "ld1b z2.b, p0/z, [%[q], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "ld1b z7.b, p0/z, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z12.b, p0/z, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "ld1b z17.b, p0/z, [%[q], x8]\n"
> > > +
> > > +           "eor z1.d, z1.d, z2.d\n"
> > > +           "eor z6.d, z6.d, z7.d\n"
> > > +           "eor z11.d, z11.d, z12.d\n"
> > > +           "eor z16.d, z16.d, z17.d\n"
> > > +
> > > +           // Store results
> > > +           "st1b z0.b, p0, [%[p], x5]\n"
> > > +           "st1b z1.b, p0, [%[q], x5]\n"
> > > +           "add x8, x5, x3\n"
> > > +           "st1b z5.b, p0, [%[p], x8]\n"
> > > +           "st1b z6.b, p0, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "st1b z10.b, p0, [%[p], x8]\n"
> > > +           "st1b z11.b, p0, [%[q], x8]\n"
> > > +           "add x8, x8, x3\n"
> > > +           "st1b z15.b, p0, [%[p], x8]\n"
> > > +           "st1b z16.b, p0, [%[q], x8]\n"
> > > +
> > > +           "add x8, x3, x3\n"
> > > +           "add x5, x5, x8, lsl #1\n"
> > > +           "cmp x5, %[bytes]\n"
> > > +           "blt 0b\n"
> > > +           :
> > > +           : [dptr] "r" (dptr), [z0] "r" (z0), [bytes] "r" (bytes),
> > > +             [p] "r" (p), [q] "r" (q), [start] "r" (start)
> > > +           : "memory", "p0", "x3", "x4", "x5", "x6", "x7", "x8",
> > > +             "z0", "z1", "z2", "z3", "z4",
> > > +             "z5", "z6", "z7", "z8",
> > > +             "z10", "z11", "z12", "z13",
> > > +             "z15", "z16", "z17", "z18"
> > > +   );
> > > +}
> > > +
> > > +#define RAID6_SVE_WRAPPER(_n)                                              \
> > > +   static void raid6_sve ## _n ## _gen_syndrome(int disks,         \
> > > +                                   size_t bytes, void **ptrs)      \
> > > +   {                                                               \
> > > +           scoped_ksimd()                                          \
> > > +           raid6_sve ## _n ## _gen_syndrome_real(disks,            \
> > > +                                   (unsigned long)bytes, ptrs);    \
> > > +   }                                                               \
> > > +   static void raid6_sve ## _n ## _xor_syndrome(int disks,         \
> > > +                                   int start, int stop,            \
> > > +                                   size_t bytes, void **ptrs)      \
> > > +   {                                                               \
> > > +           scoped_ksimd()                                          \
> > > +           raid6_sve ## _n ## _xor_syndrome_real(disks,            \
> > > +                           start, stop, (unsigned long)bytes, ptrs);\
> > > +   }                                                               \
> > > +   struct raid6_calls const raid6_svex ## _n = {                   \
> > > +           raid6_sve ## _n ## _gen_syndrome,                       \
> > > +           raid6_sve ## _n ## _xor_syndrome,                       \
> > > +           raid6_have_sve,                                         \
> > > +           "svex" #_n,                                             \
> > > +           0                                                       \
> > > +   }
> > > +
> > > +static int raid6_have_sve(void)
> > > +{
> > > +   return system_supports_sve();
> > > +}
> > > +
> > > +RAID6_SVE_WRAPPER(1);
> > > +RAID6_SVE_WRAPPER(2);
> > > +RAID6_SVE_WRAPPER(4);
> > > --
> > > 2.43.0
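Not part of the patch, just an aside for readers following the assembly: the asr/lsl/and/eor sequence repeated in each "math block" is the standard branchless GF(2^8) multiply-by-two with the RAID6 reduction constant 0x1d, applied to every byte lane of the vector. A scalar C sketch of what a single lane computes (function name is mine, not from the patch):

```c
#include <stdint.h>

/*
 * Multiply a GF(2^8) element by 2 modulo the RAID6 polynomial
 * x^8 + x^4 + x^3 + x^2 + 1.  This mirrors the per-byte SVE
 * sequence in the loops above:
 *   asr #7   -> 0xff if the top bit is set, else 0x00
 *   and 0x1d -> select the reduction constant (or zero)
 *   lsl #1   -> multiply by x
 *   eor      -> conditionally fold in the reduction
 */
static uint8_t gf256_mul2(uint8_t v)
{
	uint8_t mask = (uint8_t)((int8_t)v >> 7);

	return (uint8_t)((v << 1) ^ (mask & 0x1d));
}
```

The arithmetic-shift trick avoids a per-byte branch, which is why the vector loop can keep all lanes busy regardless of which bytes overflow.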




Thread overview: 4+ messages
     [not found] <20260318150245.3080719-1-demyansh@gmail.com>
2026-03-24  7:45 ` [PATCH v2] raid6: arm64: add SVE optimized implementation for syndrome generation Christoph Hellwig
2026-03-24  8:00 ` Ard Biesheuvel
2026-03-24 10:04   ` Mark Rutland
2026-03-29 13:01     ` Demian Shulhan
