Re: [PATCH] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation

From: David Laight <david.laight.linux@gmail.com>
To: Eric Biggers <ebiggers@kernel.org>
Cc: Demian Shulhan <demyansh@gmail.com>,
	ardb@kernel.org, linux-crypto@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation
Date: Fri, 20 Mar 2026 10:36:24 +0000
Message-ID: <20260320103624.0e13d26f@pumpkin>
In-Reply-To: <20260319190908.GB10208@quark>

On Thu, 19 Mar 2026 12:09:08 -0700
Eric Biggers <ebiggers@kernel.org> wrote:

> On Tue, Mar 17, 2026 at 06:54:25AM +0000, Demian Shulhan wrote:
> > Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON
> > Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR
> > software implementation is slow, which creates a bottleneck in NVMe and
> > other storage subsystems.
> > 
> > The acceleration is implemented using C intrinsics (<arm_neon.h>) rather
> > than raw assembly for better readability and maintainability.
> > 
> > Key highlights of this implementation:
> > - Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency
> >   spikes on large buffers.
> > - Pre-calculates and loads fold constants via vld1q_u64() to minimize
> >   register spilling.
> > - Benchmarks show the break-even point against the generic implementation
> >   is around 128 bytes. The PMULL path is enabled only for len >= 128.
> > - Safely falls back to the generic implementation on Big-Endian systems.
> > 
> > Performance results (kunit crc_benchmark on Cortex-A72):
> > - Generic (len=4096): ~268 MB/s
> > - PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)
> > 
> > Signed-off-by: Demian Shulhan <demyansh@gmail.com>  
> 
> Thanks!  I'm planning to accept this once the relatively minor comments
> later on in this email are addressed.
> 
> But just FYI, having separate code for each CRC variant isn't very
> sustainable.  CRC-T10DIF, CRC64-NVME, and CRC64-BE should all have
> similar PMULL based implementations.  x86 and riscv solve this using a
> "template" that supports all CRC variants.  I'd like to eventually bring
> a similar solution to arm64 (and arm) as well.
> 
> So while this code is fine for now, later I'd like to replace it with
> something more general, like x86 and riscv have now.  Then we can
> optimize CRC-T10DIF, CRC64-NVME, and CRC64-BE together.

I'm also pretty sure that the same loop will process 32-bit and 16-bit
CRCs (it just needs the high bits of the constant multipliers set to zero).
There are fewer bits to correct for at the end (I think it is always
the size of the CRC), but that may not be worth worrying about.
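
To make that concrete, here is a minimal portable-C sketch of a 128-bit
fold driven by a 32-bit polynomial. Everything in it is an assumption made
for illustration: the MSB-first polynomial 0x04C11DB7, zero init and no
final XOR, the helper names, and the software carry-less multiply standing
in for PMULL. It is not the code from the patch. The fold constants are
x^128 and x^192 reduced modulo the 32-bit polynomial, so their upper 32
bits are zero and the loop body is the same one a 64-bit CRC would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MSB-first CRC-32 polynomial; picked only for illustration. */
#define POLY32 0x04C11DB7u

/* x^n mod P for n >= 32, returned as a 64-bit multiplier (top 32 bits zero). */
static uint64_t crc32_xpow_mod(unsigned int n)
{
	uint32_t r = POLY32;	/* x^32 mod P */

	for (; n > 32; n--)
		r = (r << 1) ^ ((r >> 31) ? POLY32 : 0);
	return r;
}

/* Software carry-less 64x64 -> 128 multiply (what PMULL does in hardware). */
static void clmul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
	*hi = 0;
	*lo = 0;
	for (int i = 0; i < 64; i++) {
		if (!((b >> i) & 1))
			continue;
		*lo ^= a << i;
		if (i)
			*hi ^= a >> (64 - i);
	}
}

/* Bit-at-a-time reference CRC (zero init, no final XOR). */
static uint32_t crc32_bitwise(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= (uint32_t)*p++ << 24;
		for (int i = 0; i < 8; i++)
			crc = (crc << 1) ^ ((crc >> 31) ? POLY32 : 0);
	}
	return crc;
}

static uint64_t be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* CRC over nblocks 16-byte blocks using the 128-bit fold loop. */
static uint32_t crc32_fold128(const uint8_t *p, size_t nblocks)
{
	const uint64_t k_lo = crc32_xpow_mod(128);	/* for the low half */
	const uint64_t k_hi = crc32_xpow_mod(192);	/* for the high half */
	uint64_t hi, lo;
	uint8_t tail[16];

	assert(nblocks >= 1);
	hi = be64(p);
	lo = be64(p + 8);

	for (size_t i = 1; i < nblocks; i++) {
		uint64_t ah, al, bh, bl;

		clmul_64x64(lo, k_lo, &ah, &al);
		clmul_64x64(hi, k_hi, &bh, &bl);
		p += 16;
		hi = ah ^ bh ^ be64(p);
		lo = al ^ bl ^ be64(p + 8);
	}

	/* Final correction done the slow way: CRC the 16-byte remainder. */
	for (int i = 0; i < 8; i++) {
		tail[i] = (uint8_t)(hi >> (56 - 8 * i));
		tail[8 + i] = (uint8_t)(lo >> (56 - 8 * i));
	}
	return crc32_bitwise(0, tail, sizeof(tail));
}
```

For any buffer that is a whole number of 16-byte blocks,
crc32_fold128(buf, n) should equal crc32_bitwise(0, buf, 16 * n).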

> E.g., consider that the CRC64-NVME code added by this patch folds across at
> most 1 vector.  That's much less optimized than the existing CRC-T10DIF
> code in lib/crc/arm64/crc-t10dif-core.S, which folds across 8.  If we
> used a unified approach, we could optimize these CRC variants together.
> 
> As for intrinsics vs. assembly: the kernel usually uses assembly.
> However, I'm supportive of starting to use intrinsics more, and this is a
> good start.  But we'll need to keep an eye out for any compiler issues.

But intrinsics do make the code unreadable, probably even more so than
the assembler would be.
It might be better to write some C that requires the architecture to provide
the functions needed for doing a CRC with 128-bit registers that hold
two 64-bit values (etc.), and to give those functions sane names.
Then common C code can be used wherever the required instructions exist.
I'm pretty sure the loop is effectively:
	for (; p < limit; p++)
		p[N] ^= low(*p) * const_a ^ high(*p) * const_b;
where N is at least one, '*' is a carry-less (polynomial) multiply, and
you don't actually want to write into the buffer.
Making N > 1 should improve performance; it just needs care.
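
A minimal sketch of that loop in portable C, with N = 1 and the running
value kept in a register-like struct rather than written back to the
buffer. The assumptions are loud ones: it uses the MSB-first ECMA-182
polynomial with zero init and no final XOR (the real CRC64-NVMe is
bit-reflected with a non-zero init and a final XOR), a software
carry-less multiply in place of PMULL, and hypothetical helper names
throughout. const_a and const_b are x^128 and x^192 reduced modulo the
polynomial, which is what advances a 128-bit value by one 16-byte block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECMA-182 polynomial, MSB-first; chosen only to keep the example simple. */
#define POLY64 0x42F0E1EBA9EA3693ULL

struct v128 { uint64_t hi, lo; };	/* stand-in for a 128-bit vector register */

/* Carry-less 64x64 -> 128 multiply: the job PMULL does in hardware. */
static struct v128 clmul64(uint64_t a, uint64_t b)
{
	struct v128 r = { 0, 0 };

	for (int i = 0; i < 64; i++) {
		if (!((b >> i) & 1))
			continue;
		r.lo ^= a << i;
		if (i)
			r.hi ^= a >> (64 - i);
	}
	return r;
}

/* x^n mod P for n >= 64, by repeated multiplication by x. */
static uint64_t xpow_mod(unsigned int n)
{
	uint64_t r = POLY64;	/* x^64 mod P */

	for (; n > 64; n--)
		r = (r << 1) ^ ((r >> 63) ? POLY64 : 0);
	return r;
}

/* Bit-at-a-time reference CRC (zero init, no final XOR). */
static uint64_t crc64_bitwise(uint64_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= (uint64_t)*p++ << 56;
		for (int i = 0; i < 8; i++)
			crc = (crc << 1) ^ ((crc >> 63) ? POLY64 : 0);
	}
	return crc;
}

static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* CRC of nblocks 16-byte blocks, folding with N = 1. */
static uint64_t crc64_fold(const uint8_t *p, size_t nblocks)
{
	const uint64_t const_a = xpow_mod(128);	/* multiplies the low half */
	const uint64_t const_b = xpow_mod(192);	/* multiplies the high half */
	struct v128 acc;
	uint8_t tail[16];

	assert(nblocks >= 1);
	acc.hi = load_be64(p);
	acc.lo = load_be64(p + 8);

	for (size_t i = 1; i < nblocks; i++) {
		struct v128 a = clmul64(acc.lo, const_a);
		struct v128 b = clmul64(acc.hi, const_b);

		p += 16;
		/* "p[N] ^= ...", but accumulated in acc instead of the buffer */
		acc.hi = a.hi ^ b.hi ^ load_be64(p);
		acc.lo = a.lo ^ b.lo ^ load_be64(p + 8);
	}

	/* Final reduction of the 128-bit remainder, done the slow way. */
	for (int i = 0; i < 8; i++) {
		tail[i] = (uint8_t)(acc.hi >> (56 - 8 * i));
		tail[8 + i] = (uint8_t)(acc.lo >> (56 - 8 * i));
	}
	return crc64_bitwise(0, tail, sizeof(tail));
}
```

For a buffer of n whole 16-byte blocks, crc64_fold(buf, n) should match
crc64_bitwise(0, buf, 16 * n). Going to N > 1 would mean keeping several
independent accumulators with constants for a correspondingly larger
power of x, then combining them before the final reduction.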

That might be what you've done for x86 - I keep meaning to look at that code.

	David

Thread overview: 9+ messages
2026-03-17  6:54 [PATCH] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation Demian Shulhan
2026-03-19 19:09 ` Eric Biggers
2026-03-20 10:36   ` David Laight [this message]
2026-03-20 20:00     ` Eric Biggers
2026-03-22  9:29       ` Demian Shulhan
2026-03-22 14:13         ` Eric Biggers
2026-03-19 23:31 ` David Laight
2026-03-20 11:22 ` kernel test robot
2026-03-27  6:02 ` [PATCH v2] " Demian Shulhan
