* Re: [PATCH v2] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation
[not found] ` <20260327060211.902077-1-demyansh@gmail.com>
@ 2026-03-27 19:38 ` Eric Biggers
0 siblings, 0 replies; 5+ messages in thread
From: Eric Biggers @ 2026-03-27 19:38 UTC (permalink / raw)
To: Demian Shulhan; +Cc: linux-crypto, linux-kernel, ardb, linux-arm-kernel
[+Cc linux-arm-kernel@lists.infradead.org]
Thanks! This is almost ready. Just a few more comments:
On Fri, Mar 27, 2026 at 06:02:11AM +0000, Demian Shulhan wrote:
> - Safely falls back to the generic implementation on Big-Endian systems.
Drop the above bullet point. This patch doesn't explicitly exclude big
endian, which is correct: Linux arm64 is little-endian-only now.
> + /*
> + * Reduce the 128-bit value to 64 bits.
> + * By multiplying the high 64 bits by x^127 mod G (fold_consts_val[1])
> + * and XORing the result with the low 64 bits.
> + */
That is not what this code does. How about something like:
/* Multiply the 128-bit value by x^64 and reduce it back to 128 bits. */
Granted, that doesn't do a good job explaining it either. However, a
full explanation of this stuff, like the one in the comments in
lib/crc/x86/crc-pclmul-template.S, would be much longer.
I suggest we leave the full explanation for when a similar template is
written for arm64. For now, brief comments or even no comments are fine.
But if any comments are included, they really ought to be correct;
otherwise they are worse than no comments.
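For reference, the folding identity behind those constants is easy to
sanity-check outside the kernel. Here's a throwaway Python sketch (mine,
not from the patch; `clmul`/`polymod` are ad-hoc helpers) using the
MSB-first convention for simplicity — the kernel works in the
bit-reflected convention, so its actual constants differ, but the
algebra is the same:

```python
# Carry-less (GF(2)) polynomial helpers; bit i of an int = coefficient of x^i.
def clmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def polymod(a, g):
    # Reduce a modulo g (degree of g = g.bit_length() - 1).
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

# CRC64-NVMe generator polynomial (normal, MSB-first form), degree 64.
G = (1 << 64) | 0xAD93D23594C93659

# Split a 256-bit message M = A*x^128 + B. Folding replaces A*x^128 with
# A * (x^128 mod G), which is congruent mod G, so the CRC is unchanged.
A = 0x0123456789ABCDEF_FEDCBA9876543210
B = 0x1111222233334444_5555666677778888
fold128 = polymod(1 << 128, G)           # the fold constant: x^128 mod G

direct = polymod((A << 128) ^ B, G)
folded = polymod(clmul(A, fold128) ^ B, G)
assert direct == folded
```

The patch's fold_consts_val plays the role of fold128 here (at x^191 and
x^127 rather than x^128, to account for the reflected bit order).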
> + scoped_ksimd() crc = crc64_nvme_arm64_c(crc, p, chunk);
clang-format doesn't know about scoped_ksimd(), so I suggest overriding
the formatting in this particular case:
scoped_ksimd()
crc = crc64_nvme_arm64_c(crc, p, chunk);
- Eric
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH v3] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation
[not found] <20260317065425.2684093-1-demyansh@gmail.com>
[not found] ` <20260327060211.902077-1-demyansh@gmail.com>
@ 2026-03-29 7:43 ` Demian Shulhan
2026-03-29 20:38 ` Eric Biggers
1 sibling, 1 reply; 5+ messages in thread
From: Demian Shulhan @ 2026-03-29 7:43 UTC (permalink / raw)
To: linux-crypto, linux-kernel, linux-arm-kernel
Cc: ebiggers, ardb, Demian Shulhan
Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON
Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR
software implementation is slow, which creates a bottleneck in NVMe and
other storage subsystems.
The acceleration is implemented using C intrinsics (<arm_neon.h>) rather
than raw assembly for better readability and maintainability.
Key highlights of this implementation:
- Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency
spikes on large buffers.
- Pre-calculates and loads fold constants via vld1q_u64() to minimize
register spilling.
- Benchmarks show the break-even point against the generic implementation
is around 128 bytes. The PMULL path is enabled only for len >= 128.
Performance results (kunit crc_benchmark on Cortex-A72):
- Generic (len=4096): ~268 MB/s
- PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)
Signed-off-by: Demian Shulhan <demyansh@gmail.com>
---
v2: - Removed KERNEL_MODE_NEON check from Kconfig as it's redundant on arm64.
- Added missing prototype for crc64_nvme_arm64_c to fix sparse/W=1 warning.
- Improved readability in Makefile with extra newlines and comments.
- Removed redundant include guards in crc64.h.
- Switched to do-while loops for better optimization in hot paths.
- Added comments explaining the magic constants (fold/Barrett).
---
v3: - Removed big-endian fallback from the commit message.
- Rewrote the comment explaining the final Barrett reduction step.
- Adjusted the formatting of the scoped_ksimd() call.
---
lib/crc/Kconfig | 1 +
lib/crc/Makefile | 8 +++-
lib/crc/arm64/crc64-neon-inner.c | 78 ++++++++++++++++++++++++++++++++
lib/crc/arm64/crc64.h | 30 ++++++++++++
4 files changed, 116 insertions(+), 1 deletion(-)
create mode 100644 lib/crc/arm64/crc64-neon-inner.c
create mode 100644 lib/crc/arm64/crc64.h
diff --git a/lib/crc/Kconfig b/lib/crc/Kconfig
index 70e7a6016de3..16cb42d5e306 100644
--- a/lib/crc/Kconfig
+++ b/lib/crc/Kconfig
@@ -82,6 +82,7 @@ config CRC64
config CRC64_ARCH
bool
depends on CRC64 && CRC_OPTIMIZATIONS
+ default y if ARM64
default y if RISCV && RISCV_ISA_ZBC && 64BIT
default y if X86_64
diff --git a/lib/crc/Makefile b/lib/crc/Makefile
index 7543ad295ab6..c9c35419b39c 100644
--- a/lib/crc/Makefile
+++ b/lib/crc/Makefile
@@ -38,9 +38,15 @@ obj-$(CONFIG_CRC64) += crc64.o
crc64-y := crc64-main.o
ifeq ($(CONFIG_CRC64_ARCH),y)
CFLAGS_crc64-main.o += -I$(src)/$(SRCARCH)
+
+CFLAGS_REMOVE_arm64/crc64-neon-inner.o += -mgeneral-regs-only
+CFLAGS_arm64/crc64-neon-inner.o += -ffreestanding -march=armv8-a+crypto
+CFLAGS_arm64/crc64-neon-inner.o += -isystem $(shell $(CC) -print-file-name=include)
+crc64-$(CONFIG_ARM64) += arm64/crc64-neon-inner.o
+
crc64-$(CONFIG_RISCV) += riscv/crc64_lsb.o riscv/crc64_msb.o
crc64-$(CONFIG_X86) += x86/crc64-pclmul.o
-endif
+endif # CONFIG_CRC64_ARCH
obj-y += tests/
diff --git a/lib/crc/arm64/crc64-neon-inner.c b/lib/crc/arm64/crc64-neon-inner.c
new file mode 100644
index 000000000000..881cdafadb37
--- /dev/null
+++ b/lib/crc/arm64/crc64-neon-inner.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Accelerated CRC64 (NVMe) using ARM NEON C intrinsics
+ */
+
+#include <linux/types.h>
+#include <asm/neon-intrinsics.h>
+
+u64 crc64_nvme_arm64_c(u64 crc, const u8 *p, size_t len);
+
+#define GET_P64_0(v) ((poly64_t)vgetq_lane_u64(vreinterpretq_u64_p64(v), 0))
+#define GET_P64_1(v) ((poly64_t)vgetq_lane_u64(vreinterpretq_u64_p64(v), 1))
+
+/* x^191 mod G, x^127 mod G */
+static const u64 fold_consts_val[2] = { 0xeadc41fd2ba3d420ULL,
+ 0x21e9761e252621acULL };
+/* floor(x^127 / G), (G - x^64) / x */
+static const u64 bconsts_val[2] = { 0x27ecfa329aef9f77ULL,
+ 0x34d926535897936aULL };
+
+u64 crc64_nvme_arm64_c(u64 crc, const u8 *p, size_t len)
+{
+ uint64x2_t v0_u64 = { crc, 0 };
+ poly64x2_t v0 = vreinterpretq_p64_u64(v0_u64);
+ poly64x2_t fold_consts =
+ vreinterpretq_p64_u64(vld1q_u64(fold_consts_val));
+ poly64x2_t v1 = vreinterpretq_p64_u8(vld1q_u8(p));
+
+ v0 = vreinterpretq_p64_u8(veorq_u8(vreinterpretq_u8_p64(v0),
+ vreinterpretq_u8_p64(v1)));
+ p += 16;
+ len -= 16;
+
+ do {
+ v1 = vreinterpretq_p64_u8(vld1q_u8(p));
+
+ poly128_t v2 = vmull_high_p64(fold_consts, v0);
+ poly128_t v0_128 =
+ vmull_p64(GET_P64_0(fold_consts), GET_P64_0(v0));
+
+ uint8x16_t x0 = veorq_u8(vreinterpretq_u8_p128(v0_128),
+ vreinterpretq_u8_p128(v2));
+
+ x0 = veorq_u8(x0, vreinterpretq_u8_p64(v1));
+ v0 = vreinterpretq_p64_u8(x0);
+
+ p += 16;
+ len -= 16;
+ } while (len >= 16);
+
+ /* Multiply the 128-bit value by x^64 and reduce it back to 128 bits. */
+ poly64x2_t v7 = vreinterpretq_p64_u64((uint64x2_t){ 0, 0 });
+ poly128_t v1_128 = vmull_p64(GET_P64_1(fold_consts), GET_P64_0(v0));
+
+ uint8x16_t ext_v0 =
+ vextq_u8(vreinterpretq_u8_p64(v0), vreinterpretq_u8_p64(v7), 8);
+ uint8x16_t x0 = veorq_u8(ext_v0, vreinterpretq_u8_p128(v1_128));
+
+ v0 = vreinterpretq_p64_u8(x0);
+
+ /* Final Barrett reduction */
+ poly64x2_t bconsts = vreinterpretq_p64_u64(vld1q_u64(bconsts_val));
+
+ v1_128 = vmull_p64(GET_P64_0(bconsts), GET_P64_0(v0));
+
+ poly64x2_t v1_64 = vreinterpretq_p64_u8(vreinterpretq_u8_p128(v1_128));
+ poly128_t v3_128 = vmull_p64(GET_P64_1(bconsts), GET_P64_0(v1_64));
+
+ x0 = veorq_u8(vreinterpretq_u8_p64(v0), vreinterpretq_u8_p128(v3_128));
+
+ uint8x16_t ext_v2 = vextq_u8(vreinterpretq_u8_p64(v7),
+ vreinterpretq_u8_p128(v1_128), 8);
+
+ x0 = veorq_u8(x0, ext_v2);
+
+ v0 = vreinterpretq_p64_u8(x0);
+ return vgetq_lane_u64(vreinterpretq_u64_p64(v0), 1);
+}
diff --git a/lib/crc/arm64/crc64.h b/lib/crc/arm64/crc64.h
new file mode 100644
index 000000000000..cc65abeee24c
--- /dev/null
+++ b/lib/crc/arm64/crc64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * CRC64 using ARM64 PMULL instructions
+ */
+
+#include <linux/cpufeature.h>
+#include <asm/simd.h>
+#include <linux/minmax.h>
+#include <linux/sizes.h>
+
+u64 crc64_nvme_arm64_c(u64 crc, const u8 *p, size_t len);
+
+#define crc64_be_arch crc64_be_generic
+
+static inline u64 crc64_nvme_arch(u64 crc, const u8 *p, size_t len)
+{
+ if (len >= 128 && cpu_have_named_feature(PMULL) &&
+ likely(may_use_simd())) {
+ do {
+ size_t chunk = min_t(size_t, len & ~15, SZ_4K);
+
+ scoped_ksimd()
+ crc = crc64_nvme_arm64_c(crc, p, chunk);
+
+ p += chunk;
+ len -= chunk;
+ } while (len >= 128);
+ }
+ return crc64_nvme_generic(crc, p, len);
+}
--
2.43.0
* Re: [PATCH v3] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation
2026-03-29 7:43 ` [PATCH v3] " Demian Shulhan
@ 2026-03-29 20:38 ` Eric Biggers
2026-03-29 21:57 ` David Laight
0 siblings, 1 reply; 5+ messages in thread
From: Eric Biggers @ 2026-03-29 20:38 UTC (permalink / raw)
To: Demian Shulhan; +Cc: linux-crypto, linux-kernel, linux-arm-kernel, ardb
On Sun, Mar 29, 2026 at 07:43:38AM +0000, Demian Shulhan wrote:
> Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON
> Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR
> software implementation is slow, which creates a bottleneck in NVMe and
> other storage subsystems.
>
> The acceleration is implemented using C intrinsics (<arm_neon.h>) rather
> than raw assembly for better readability and maintainability.
>
> Key highlights of this implementation:
> - Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency
> spikes on large buffers.
> - Pre-calculates and loads fold constants via vld1q_u64() to minimize
> register spilling.
> - Benchmarks show the break-even point against the generic implementation
> is around 128 bytes. The PMULL path is enabled only for len >= 128.
>
> Performance results (kunit crc_benchmark on Cortex-A72):
> - Generic (len=4096): ~268 MB/s
> - PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)
>
> Signed-off-by: Demian Shulhan <demyansh@gmail.com>
Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next
Thanks!
- Eric
* Re: [PATCH v3] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation
2026-03-29 20:38 ` Eric Biggers
@ 2026-03-29 21:57 ` David Laight
2026-03-29 22:18 ` Eric Biggers
0 siblings, 1 reply; 5+ messages in thread
From: David Laight @ 2026-03-29 21:57 UTC (permalink / raw)
To: Eric Biggers
Cc: Demian Shulhan, linux-crypto, linux-kernel, linux-arm-kernel,
ardb
On Sun, 29 Mar 2026 13:38:29 -0700
Eric Biggers <ebiggers@kernel.org> wrote:
> On Sun, Mar 29, 2026 at 07:43:38AM +0000, Demian Shulhan wrote:
> > Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON
> > Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR
> > software implementation is slow, which creates a bottleneck in NVMe and
> > other storage subsystems.
> >
> > The acceleration is implemented using C intrinsics (<arm_neon.h>) rather
> > than raw assembly for better readability and maintainability.
> >
> > Key highlights of this implementation:
> > - Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency
> > spikes on large buffers.
> > - Pre-calculates and loads fold constants via vld1q_u64() to minimize
> > register spilling.
> > - Benchmarks show the break-even point against the generic implementation
> > is around 128 bytes. The PMULL path is enabled only for len >= 128.
Final thought:
Is that allowing for the cost of kernel_fpu_begin()? - which I think only
affects the first call.
And the cost of the data-cache misses for the lookup table reads? - again
worse for the first call.
David
> >
> > Performance results (kunit crc_benchmark on Cortex-A72):
> > - Generic (len=4096): ~268 MB/s
> > - PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)
> >
> > Signed-off-by: Demian Shulhan <demyansh@gmail.com>
>
> Applied to https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next
>
> Thanks!
>
> - Eric
>
* Re: [PATCH v3] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation
2026-03-29 21:57 ` David Laight
@ 2026-03-29 22:18 ` Eric Biggers
0 siblings, 0 replies; 5+ messages in thread
From: Eric Biggers @ 2026-03-29 22:18 UTC (permalink / raw)
To: David Laight
Cc: Demian Shulhan, linux-crypto, linux-kernel, linux-arm-kernel,
ardb
On Sun, Mar 29, 2026 at 10:57:04PM +0100, David Laight wrote:
> Final thought:
> Is that allowing for the cost of kernel_fpu_begin()? - which I think only
> affects the first call.
> And the cost of the data-cache misses for the lookup table reads? - again
> worse for the first call.
I assume you mean kernel_neon_begin(); this is an arm64 patch. (I
encourage you to actually read the code. You seem to send a lot of
speculation-heavy comments without doing so.)
Currently, the benchmark in crc_kunit just measures the throughput in a
loop (as has been discussed before). So no, it doesn't currently
capture the overhead of pulling code and data into cache. For NEON
register use it captures only the amortized overhead.
Note that using PMULL saves having to pull the table into memory, while
using the table is a bit less code and saves having to use kernel-mode
NEON. So both have their advantages and disadvantages.
This patch does fall back to the table for the final bytes (everything
left once fewer than 128 bytes remain), which means the table may be
needed anyway. That is not the optimal way to do it, and it's something
to address later when this is replaced with something similar to x86's
crc-pclmul-template.S.
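For the curious, the chunking arithmetic is easy to see outside the
kernel. Here is a quick Python sketch (mine, not from the patch) that
just mirrors the loop in crc64.h:

```python
SZ_4K = 4096

def split(length):
    """Mirror the loop in crc64.h: return (bytes handled by PMULL,
    bytes left for the generic table-based fallback)."""
    simd = 0
    while length >= 128:
        chunk = min(length & ~15, SZ_4K)   # 16-byte multiple, capped at 4 KiB
        simd += chunk
        length -= chunk
    return simd, length

# A 16-byte-aligned length never needs the table...
assert split(8192) == (8192, 0)
# ...small tails always do:
assert split(4296) == (4288, 8)
# ...and the 4 KiB cap can leave the generic code a tail of up to
# 127 bytes, not just 'len & 15':
assert split(8300) == (8192, 108)
assert split(100) == (0, 100)
```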
- Eric