From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 20 Mar 2026 10:36:24 +0000
From: David Laight <david.laight.linux@gmail.com>
To: Eric Biggers <ebiggers@kernel.org>
Cc: Demian Shulhan <demyansh@gmail.com>, ardb@kernel.org,
 linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] lib/crc: arm64: add NEON accelerated CRC64-NVMe
 implementation
Message-ID: <20260320103624.0e13d26f@pumpkin>
In-Reply-To: <20260319190908.GB10208@quark>
References: <20260317065425.2684093-1-demyansh@gmail.com>
	<20260319190908.GB10208@quark>
X-Mailer: Claws Mail 4.1.1 (GTK 3.24.38; arm-unknown-linux-gnueabihf)
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Thu, 19 Mar 2026 12:09:08 -0700
Eric Biggers <ebiggers@kernel.org> wrote:

> On Tue, Mar 17, 2026 at 06:54:25AM +0000, Demian Shulhan wrote:
> > Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON
> > Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR
> > software implementation is slow, which creates a bottleneck in NVMe and
> > other storage subsystems.
> > 
> > The acceleration is implemented using C intrinsics (<arm_neon.h>) rather
> > than raw assembly for better readability and maintainability.
> > 
> > Key highlights of this implementation:
> > - Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency
> >   spikes on large buffers.
> > - Pre-calculates and loads fold constants via vld1q_u64() to minimize
> >   register spilling.
> > - Benchmarks show the break-even point against the generic implementation
> >   is around 128 bytes. The PMULL path is enabled only for len >= 128.
> > - Safely falls back to the generic implementation on Big-Endian systems.
> > 
> > Performance results (kunit crc_benchmark on Cortex-A72):
> > - Generic (len=4096): ~268 MB/s
> > - PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)
> > 
> > Signed-off-by: Demian Shulhan <demyansh@gmail.com>  
> 
> Thanks!  I'm planning to accept this once the relatively minor comments
> later on in this email are addressed.
> 
> But just FYI, having separate code for each CRC variant isn't very
> sustainable.  CRC-T10DIF, CRC64-NVME, and CRC64-BE should all have
> similar PMULL based implementations.  x86 and riscv solve this using a
> "template" that supports all CRC variants.  I'd like to eventually bring
> a similar solution to arm64 (and arm) as well.
> 
> So while this code is fine for now, later I'd like to replace it with
> something more general, like x86 and riscv have now.  Then we can
> optimize CRC-T10DIF, CRC64-NVME, and CRC64-BE together.

I'm also pretty sure that the same loop will process 32-bit and 16-bit CRCs
(it just needs the high bits of the constant multipliers set to zero).
There are fewer bits to correct for at the end (I think it is always
the size of the CRC), but that may not be worth worrying about.
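
A rough sketch of the "high bits set to zero" point, assuming the fold
constants are residues of the form x^k mod G. This is not the kernel code:
xpow_mod_width() is a hypothetical helper, the polynomials are the usual
MSB-first CRC-16/T10-DIF and CRC-32 generators, and a real (especially a
bit-reflected) implementation may position the constants differently. For a
generator of degree 16 or 32 the residue has degree below 16 or 32, so the
64-bit multiplier operand is just the zero-extended constant:

```c
#include <stdint.h>

/*
 * x^k mod G for a generator of the given width (the x^width term is
 * implicit in 'poly'). Hypothetical helper, MSB-first bit order.
 */
static uint64_t xpow_mod_width(uint64_t poly, unsigned int width,
			       unsigned int k)
{
	const uint64_t top = 1ULL << (width - 1);
	const uint64_t mask = top | (top - 1);
	uint64_t r = 1;		/* x^0 */

	/* Multiply by x once per step, reducing when x^width appears. */
	while (k--)
		r = ((r << 1) ^ ((r & top) ? poly : 0)) & mask;
	return r;
}
```

For example, xpow_mod_width(0x8BB7, 16, 128) and
xpow_mod_width(0x04C11DB7, 32, 192) both fit in the low 16 and 32 bits
respectively, whatever the fold distance.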

> E.g., consider that the CRC64-NVME code added by this patch folds across at
> most 1 vector.  That's much less optimized than the existing CRC-T10DIF
> code in lib/crc/arm64/crc-t10dif-core.S, which folds across 8.  If we
> used a unified approach, we could optimize these CRC variants together.
> 
> As for intrinsics vs. assembly: the kernel usually uses assembly.
> However, I'm supportive of starting to use intrinsics more, and this is a
> good start.  But we'll need to keep an eye out for any compiler issues.

But they do make the code unreadable - probably even more so than the
assembler would be.
It might be better to write some C that requires the architecture to provide
the functions needed for doing a CRC with 128-bit registers that hold
two 64-bit values (etc.) and to give them sane names.
Then common C code can be used provided the required instructions exist.
I'm pretty sure the loop is effectively:
	for (; p < limit; p++)
		p[N] ^= low(*p) * const_a ^ high(*p) * const_b;
where '*' is a carry-less multiply, N is at least one, and you don't actually
want to write into the buffer.
Making N > 1 should improve performance - it just needs care.
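
As a rough sketch of that loop (not the patch's code and not the x86
template): the version below assumes a plain MSB-first CRC-64 with the
ECMA-182 polynomial, zero initial value and no final XOR, which keeps the
polynomial arithmetic simple. CRC64-NVMe is bit-reflected with non-zero
init/xorout, so its constants and bit order would differ. struct vec128,
clmul64(), xpow_mod() and crc64_folded() are hypothetical names, and
clmul64() is a slow portable stand-in for PMULL. The N accumulators replace
the writes to p[N], and the last N blocks are reduced with the bitwise
reference code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-64/ECMA-182 generator, MSB-first, x^64 term implicit. */
#define POLY 0x42F0E1EBA9EA3693ULL
#define MAX_FOLD 8

/* Hypothetical 128-bit type: hi holds the earlier (higher-degree) word. */
struct vec128 { uint64_t hi, lo; };

/* Portable stand-in for PMULL: 64x64 -> 128 carry-less multiply. */
static struct vec128 clmul64(uint64_t a, uint64_t b)
{
	struct vec128 r = { 0, 0 };

	for (int i = 0; i < 64; i++) {
		if (!((b >> i) & 1))
			continue;
		r.lo ^= a << i;
		if (i)
			r.hi ^= a >> (64 - i);
	}
	return r;
}

/* x^k mod G, used to derive the fold constants. */
static uint64_t xpow_mod(unsigned int k)
{
	uint64_t r = 1;

	while (k--)
		r = (r << 1) ^ ((r >> 63) ? POLY : 0);
	return r;
}

/* Reference bit-at-a-time CRC over 64-bit words (init 0, no final XOR). */
static uint64_t crc64_bitwise(const uint64_t *w, size_t nwords)
{
	uint64_t crc = 0;

	for (size_t i = 0; i < nwords; i++) {
		crc ^= w[i];
		for (int b = 0; b < 64; b++)
			crc = (crc << 1) ^ ((crc >> 63) ? POLY : 0);
	}
	return crc;
}

/* Fold each block into the one N blocks later, then reduce the last N. */
static uint64_t crc64_folded(const struct vec128 *p, size_t nblocks,
			     unsigned int N)
{
	struct vec128 acc[MAX_FOLD];
	uint64_t tail[2 * MAX_FOLD];
	uint64_t const_a, const_b;

	assert(N >= 1 && N <= MAX_FOLD && nblocks >= N);
	const_a = xpow_mod(128 * N);		/* multiplies low(*p)  */
	const_b = xpow_mod(128 * N + 64);	/* multiplies high(*p) */

	for (size_t i = 0; i < N; i++)
		acc[i] = p[i];
	for (size_t i = N; i < nblocks; i++) {
		struct vec128 *a = &acc[i % N];
		struct vec128 lo = clmul64(a->lo, const_a);
		struct vec128 hi = clmul64(a->hi, const_b);

		a->hi = p[i].hi ^ lo.hi ^ hi.hi;
		a->lo = p[i].lo ^ lo.lo ^ hi.lo;
	}
	/* acc[i % N] now holds the folded block for position i. */
	for (size_t i = nblocks - N, j = 0; i < nblocks; i++) {
		tail[j++] = acc[i % N].hi;
		tail[j++] = acc[i % N].lo;
	}
	return crc64_bitwise(tail, 2 * N);
}
```

For any 1 <= N <= MAX_FOLD, crc64_folded(blocks, nblocks, N) should equal
crc64_bitwise() over the same data viewed as 2 * nblocks words. The N
accumulators are independent of each other, which is where the extra
throughput for N > 1 would come from.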

That might be what you've done for x86 - I keep meaning to look at that code.

	David