* [PATCH 0/4] RISC-V CRC optimizations
@ 2025-02-16 22:55 Eric Biggers
  2025-02-16 22:55 ` [PATCH 1/4] riscv/crc: add "template" for Zbc optimized CRC functions Eric Biggers
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Eric Biggers @ 2025-02-16 22:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-crypto, linux-riscv, Zhihang Shao, Ard Biesheuvel,
	Xiao Wang, Charlie Jenkins

This patchset is a replacement for
"[PATCH v4] riscv: Optimize crct10dif with Zbc extension"
(https://lore.kernel.org/r/20250211071101.181652-1-zhihang.shao.iscas@gmail.com/).
It adopts the approach that I'm taking for x86 where code is shared
among CRC variants.  It replaces the existing Zbc optimized CRC32
functions, then adds Zbc optimized CRC-T10DIF and CRC64 functions.

This new code should be significantly faster than the current Zbc
optimized CRC32 code and the previously proposed CRC-T10DIF code.  It
uses "folding" instead of just Barrett reduction, and it also implements
Barrett reduction more efficiently.
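
(To sketch the folding idea, in the notation used by the comments in
patch 1, with L standing for BITS_PER_LONG and G for the generator
polynomial: each main-loop iteration rewrites the pending message using
the congruence

	m0*x^(3L) + m1*x^(2L) + m2*x^L + m3
	   == m0*(x^(3L) mod G) + m1*(x^(2L) mod G) + m2*x^L + m3   (mod G)

where x^(3L) mod G and x^(2L) mod G are precomputed per-polynomial
constants.  Each product fits in two longs, the two carryless
multiplications are independent of each other, and Barrett reduction is
deferred until the end, when the residue is mapped down to n bits.)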

This applies to crc-next at
https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next.
It depends on other patches that are queued there for 6.15, so I plan to
take it through there if there are no objections.

Tested with crc_kunit in QEMU (set CONFIG_CRC_KUNIT_TEST=y and
CONFIG_CRC_BENCHMARK=y), both 32-bit and 64-bit.  I don't have real Zbc
capable hardware to benchmark this on, but the new code should work very
well; similar optimizations work very well on other architectures.
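
(For reference, a kunit.py invocation along these lines should
reproduce that test setup, assuming the riscv qemu_config in
tools/testing/kunit works for you; treat the exact flags as a sketch:

	./tools/testing/kunit/kunit.py run --arch=riscv \
		--kconfig_add CONFIG_CRC_KUNIT_TEST=y \
		--kconfig_add CONFIG_CRC_BENCHMARK=y 'crc*'

Benchmark numbers obtained under QEMU are of course not meaningful;
only the correctness results are.)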

Eric Biggers (4):
  riscv/crc: add "template" for Zbc optimized CRC functions
  riscv/crc32: reimplement the CRC32 functions using new template
  riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
  riscv/crc64: add Zbc optimized CRC64 functions

 arch/riscv/Kconfig                  |   2 +
 arch/riscv/lib/Makefile             |   5 +
 arch/riscv/lib/crc-clmul-consts.h   | 122 +++++++++++
 arch/riscv/lib/crc-clmul-template.h | 265 ++++++++++++++++++++++++
 arch/riscv/lib/crc-clmul.h          |  23 +++
 arch/riscv/lib/crc-t10dif.c         |  24 +++
 arch/riscv/lib/crc16_msb.c          |  18 ++
 arch/riscv/lib/crc32-riscv.c        | 310 ----------------------------
 arch/riscv/lib/crc32.c              |  53 +++++
 arch/riscv/lib/crc32_lsb.c          |  18 ++
 arch/riscv/lib/crc32_msb.c          |  18 ++
 arch/riscv/lib/crc64.c              |  34 +++
 arch/riscv/lib/crc64_lsb.c          |  18 ++
 arch/riscv/lib/crc64_msb.c          |  18 ++
 scripts/gen-crc-consts.py           |  55 ++++-
 15 files changed, 672 insertions(+), 311 deletions(-)
 create mode 100644 arch/riscv/lib/crc-clmul-consts.h
 create mode 100644 arch/riscv/lib/crc-clmul-template.h
 create mode 100644 arch/riscv/lib/crc-clmul.h
 create mode 100644 arch/riscv/lib/crc-t10dif.c
 create mode 100644 arch/riscv/lib/crc16_msb.c
 delete mode 100644 arch/riscv/lib/crc32-riscv.c
 create mode 100644 arch/riscv/lib/crc32.c
 create mode 100644 arch/riscv/lib/crc32_lsb.c
 create mode 100644 arch/riscv/lib/crc32_msb.c
 create mode 100644 arch/riscv/lib/crc64.c
 create mode 100644 arch/riscv/lib/crc64_lsb.c
 create mode 100644 arch/riscv/lib/crc64_msb.c


base-commit: cf1ea3a7c1f63cba7d1dd313ee3accde0c0c8988
-- 
2.48.1



* [PATCH 1/4] riscv/crc: add "template" for Zbc optimized CRC functions
  2025-02-16 22:55 [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
@ 2025-02-16 22:55 ` Eric Biggers
  2025-02-16 22:55 ` [PATCH 2/4] riscv/crc32: reimplement the CRC32 functions using new template Eric Biggers
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Eric Biggers @ 2025-02-16 22:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-crypto, linux-riscv, Zhihang Shao, Ard Biesheuvel,
	Xiao Wang, Charlie Jenkins

From: Eric Biggers <ebiggers@google.com>

Add a "template" crc-clmul-template.h that can generate RISC-V Zbc
optimized CRC functions.  Each generated CRC function is parameterized
by CRC length and bit order, and it accepts a pointer to the constants
struct required for the specific CRC polynomial desired.  Update
gen-crc-consts.py to support generating the needed constants structs.

This makes it possible to easily wire up a Zbc optimized implementation
of almost any CRC.
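
As a concrete example, instantiating the lsb-first CRC32 variant takes
only the following (this is, in essence, arch/riscv/lib/crc32_lsb.c
from the next patch):

	#include "crc-clmul.h"

	typedef u32 crc_t;
	#define LSB_CRC 1
	#include "crc-clmul-template.h"

	u32 crc32_lsb_clmul(u32 crc, const void *p, size_t len,
			    const struct crc_clmul_consts *consts)
	{
		return crc_clmul(crc, p, len, consts);
	}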

The design generally follows what I did for x86, but it is simplified by
using RISC-V's scalar carryless multiplication Zbc, which has no
equivalent on x86.  RISC-V's clmulr instruction is also helpful.  A
potential switch to Zvbc (or support for Zvbc alongside Zbc) is left for
future work.  For long messages Zvbc should be fastest, but it would
need to be shown to be worthwhile over just using Zbc, which is
significantly more convenient to use, especially in the kernel context.

Compared to the existing Zbc-optimized CRC32 code and the earlier
proposed Zbc-optimized CRC-T10DIF code
(https://lore.kernel.org/r/20250211071101.181652-1-zhihang.shao.iscas@gmail.com),
this submission deduplicates the code among CRC variants and is
significantly more optimized.  It uses "folding" to take better
advantage of instruction-level parallelism (to a more limited extent
than x86 for now, but it could be extended to more), it reworks the
Barrett reduction to eliminate unnecessary instructions, and it
documents all the math used and makes all the constants reproducible.
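
For example, the constants for the three CRC32 variants wired up in the
next patch are regenerated with:

	./scripts/gen-crc-consts.py riscv_clmul crc32_msb_0x04c11db7,crc32_lsb_0xedb88320,crc32_lsb_0x82f63b78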

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/riscv/lib/crc-clmul-template.h | 265 ++++++++++++++++++++++++++++
 scripts/gen-crc-consts.py           |  55 +++++-
 2 files changed, 319 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/lib/crc-clmul-template.h

diff --git a/arch/riscv/lib/crc-clmul-template.h b/arch/riscv/lib/crc-clmul-template.h
new file mode 100644
index 000000000000..77187e7f1762
--- /dev/null
+++ b/arch/riscv/lib/crc-clmul-template.h
@@ -0,0 +1,265 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Copyright 2025 Google LLC */
+
+/*
+ * This file is a "template" that generates a CRC function optimized using the
+ * RISC-V Zbc (scalar carryless multiplication) extension.  The includer of this
+ * file must define the following parameters to specify the type of CRC:
+ *
+ *	crc_t: the data type of the CRC, e.g. u32 for a 32-bit CRC
+ *	LSB_CRC: 0 for a msb (most-significant-bit) first CRC, i.e. natural
+ *		 mapping between bits and polynomial coefficients
+ *	         1 for a lsb (least-significant-bit) first CRC, i.e. reflected
+ *	         mapping between bits and polynomial coefficients
+ */
+
+#include <asm/byteorder.h>
+#include <linux/minmax.h>
+
+#define CRC_BITS	(8 * sizeof(crc_t))	/* a.k.a. 'n' */
+
+static inline unsigned long clmul(unsigned long a, unsigned long b)
+{
+	unsigned long res;
+
+	asm(".option push\n"
+	    ".option arch,+zbc\n"
+	    "clmul %0, %1, %2\n"
+	    ".option pop\n"
+	    : "=r" (res) : "r" (a), "r" (b));
+	return res;
+}
+
+static inline unsigned long clmulh(unsigned long a, unsigned long b)
+{
+	unsigned long res;
+
+	asm(".option push\n"
+	    ".option arch,+zbc\n"
+	    "clmulh %0, %1, %2\n"
+	    ".option pop\n"
+	    : "=r" (res) : "r" (a), "r" (b));
+	return res;
+}
+
+static inline unsigned long clmulr(unsigned long a, unsigned long b)
+{
+	unsigned long res;
+
+	asm(".option push\n"
+	    ".option arch,+zbc\n"
+	    "clmulr %0, %1, %2\n"
+	    ".option pop\n"
+	    : "=r" (res) : "r" (a), "r" (b));
+	return res;
+}
+
+/*
+ * crc_load_long() loads one "unsigned long" of aligned data bytes, producing a
+ * polynomial whose bit order matches the CRC's bit order.
+ */
+#ifdef CONFIG_64BIT
+#  if LSB_CRC
+#    define crc_load_long(x)	le64_to_cpup(x)
+#  else
+#    define crc_load_long(x)	be64_to_cpup(x)
+#  endif
+#else
+#  if LSB_CRC
+#    define crc_load_long(x)	le32_to_cpup(x)
+#  else
+#    define crc_load_long(x)	be32_to_cpup(x)
+#  endif
+#endif
+
+/* XOR @crc into the end of @msgpoly that represents the high-order terms. */
+static inline unsigned long
+crc_clmul_prep(crc_t crc, unsigned long msgpoly)
+{
+#if LSB_CRC
+	return msgpoly ^ crc;
+#else
+	return msgpoly ^ ((unsigned long)crc << (BITS_PER_LONG - CRC_BITS));
+#endif
+}
+
+/*
+ * Multiply the long-sized @msgpoly by x^n (a.k.a. x^CRC_BITS) and reduce it
+ * modulo the generator polynomial G.  This gives the CRC of @msgpoly.
+ */
+static inline crc_t
+crc_clmul_long(unsigned long msgpoly, const struct crc_clmul_consts *consts)
+{
+	unsigned long tmp;
+
+	/*
+	 * First step of Barrett reduction with integrated multiplication by
+	 * x^n: calculate floor((msgpoly * x^n) / G).  This is the value by
+	 * which G needs to be multiplied to cancel out the x^n and higher terms
+	 * of msgpoly * x^n.  Do it using the following formula:
+	 *
+	 * msb-first:
+	 *    floor((msgpoly * floor(x^(BITS_PER_LONG-1+n) / G)) / x^(BITS_PER_LONG-1))
+	 * lsb-first:
+	 *    floor((msgpoly * floor(x^(BITS_PER_LONG-1+n) / G) * x) / x^BITS_PER_LONG)
+	 *
+	 * barrett_reduction_const_1 contains floor(x^(BITS_PER_LONG-1+n) / G),
+	 * which fits a long exactly.  Using any lower power of x there would
+	 * not carry enough precision through the calculation, while using any
+	 * higher power of x would require extra instructions to handle a wider
+	 * multiplication.  In the msb-first case, using this power of x results
+	 * in needing a floored division by x^(BITS_PER_LONG-1), which matches
+	 * what clmulr produces.  In the lsb-first case, a factor of x gets
+	 * implicitly introduced by each carryless multiplication (shown as
+	 * '* x' above), and the floored division instead needs to be by
+	 * x^BITS_PER_LONG which matches what clmul produces.
+	 */
+#if LSB_CRC
+	tmp = clmul(msgpoly, consts->barrett_reduction_const_1);
+#else
+	tmp = clmulr(msgpoly, consts->barrett_reduction_const_1);
+#endif
+
+	/*
+	 * Second step of Barrett reduction:
+	 *
+	 *    crc := (msgpoly * x^n) + (G * floor((msgpoly * x^n) / G))
+	 *
+	 * This reduces (msgpoly * x^n) modulo G by adding the appropriate
+	 * multiple of G to it.  The result uses only the x^0..x^(n-1) terms.
+	 * HOWEVER, since the unreduced value (msgpoly * x^n) is zero in those
+	 * terms in the first place, it is more efficient to do the equivalent:
+	 *
+	 *    crc := ((G - x^n) * floor((msgpoly * x^n) / G)) mod x^n
+	 *
+	 * In the lsb-first case further modify it to the following which avoids
+	 * a shift, as the crc ends up in the physically low n bits from clmulr:
+	 *
+	 *    product := ((G - x^n) * x^(BITS_PER_LONG - n)) * floor((msgpoly * x^n) / G) * x
+	 *    crc := floor(product / x^(BITS_PER_LONG + 1 - n)) mod x^n
+	 *
+	 * barrett_reduction_const_2 contains the constant multiplier (G - x^n)
+	 * or (G - x^n) * x^(BITS_PER_LONG - n) from the formulas above.  The
+	 * cast of the result to crc_t is essential, as it applies the mod x^n!
+	 */
+#if LSB_CRC
+	return clmulr(tmp, consts->barrett_reduction_const_2);
+#else
+	return clmul(tmp, consts->barrett_reduction_const_2);
+#endif
+}
+
+/* Update @crc with the data from @msgpoly. */
+static inline crc_t
+crc_clmul_update_long(crc_t crc, unsigned long msgpoly,
+		      const struct crc_clmul_consts *consts)
+{
+	return crc_clmul_long(crc_clmul_prep(crc, msgpoly), consts);
+}
+
+/* Update @crc with 1 <= @len < sizeof(unsigned long) bytes of data. */
+static inline crc_t
+crc_clmul_update_partial(crc_t crc, const u8 *p, size_t len,
+			 const struct crc_clmul_consts *consts)
+{
+	unsigned long msgpoly;
+	size_t i;
+
+#if LSB_CRC
+	msgpoly = (unsigned long)p[0] << (BITS_PER_LONG - 8);
+	for (i = 1; i < len; i++)
+		msgpoly = (msgpoly >> 8) ^ ((unsigned long)p[i] << (BITS_PER_LONG - 8));
+#else
+	msgpoly = p[0];
+	for (i = 1; i < len; i++)
+		msgpoly = (msgpoly << 8) ^ p[i];
+#endif
+
+	if (len >= sizeof(crc_t)) {
+	#if LSB_CRC
+		msgpoly ^= (unsigned long)crc << (BITS_PER_LONG - 8*len);
+	#else
+		msgpoly ^= (unsigned long)crc << (8*len - CRC_BITS);
+	#endif
+		return crc_clmul_long(msgpoly, consts);
+	}
+#if LSB_CRC
+	msgpoly ^= (unsigned long)crc << (BITS_PER_LONG - 8*len);
+	return crc_clmul_long(msgpoly, consts) ^ (crc >> (8*len));
+#else
+	msgpoly ^= crc >> (CRC_BITS - 8*len);
+	return crc_clmul_long(msgpoly, consts) ^ (crc << (8*len));
+#endif
+}
+
+static inline crc_t
+crc_clmul(crc_t crc, const void *p, size_t len,
+	  const struct crc_clmul_consts *consts)
+{
+	size_t align;
+
+	/* This implementation assumes that the CRC fits in an unsigned long. */
+	BUILD_BUG_ON(sizeof(crc_t) > sizeof(unsigned long));
+
+	/* If the buffer is not long-aligned, align it. */
+	align = (unsigned long)p % sizeof(unsigned long);
+	if (align && len) {
+		align = min(sizeof(unsigned long) - align, len);
+		crc = crc_clmul_update_partial(crc, p, align, consts);
+		p += align;
+		len -= align;
+	}
+
+	if (len >= 4 * sizeof(unsigned long)) {
+		unsigned long m0, m1;
+
+		m0 = crc_clmul_prep(crc, crc_load_long(p));
+		m1 = crc_load_long(p + sizeof(unsigned long));
+		p += 2 * sizeof(unsigned long);
+		len -= 2 * sizeof(unsigned long);
+		/*
+		 * Main loop.  Each iteration starts with a message polynomial
+		 * (x^BITS_PER_LONG)*m0 + m1, then logically extends it by two
+		 * more longs of data to form x^(3*BITS_PER_LONG)*m0 +
+		 * x^(2*BITS_PER_LONG)*m1 + x^BITS_PER_LONG*m2 + m3, then
+		 * "folds" that back into a congruent (modulo G) value that uses
+		 * just m0 and m1 again.  This is done by multiplying m0 by the
+		 * precomputed constant (x^(3*BITS_PER_LONG) mod G) and m1 by
+		 * the precomputed constant (x^(2*BITS_PER_LONG) mod G), then
+		 * adding the results to m2 and m3 as appropriate.  Each such
+		 * multiplication produces a result twice the length of a long,
+		 * which in RISC-V is two instructions clmul and clmulh.
+		 *
+		 * This could be changed to fold across more than 2 longs at a
+		 * time if there is a CPU that can take advantage of it.
+		 */
+		do {
+			unsigned long p0, p1, p2, p3;
+
+			p0 = clmulh(m0, consts->fold_across_2_longs_const_hi);
+			p1 = clmul(m0, consts->fold_across_2_longs_const_hi);
+			p2 = clmulh(m1, consts->fold_across_2_longs_const_lo);
+			p3 = clmul(m1, consts->fold_across_2_longs_const_lo);
+			m0 = (LSB_CRC ? p1 ^ p3 : p0 ^ p2) ^ crc_load_long(p);
+			m1 = (LSB_CRC ? p0 ^ p2 : p1 ^ p3) ^
+			     crc_load_long(p + sizeof(unsigned long));
+
+			p += 2 * sizeof(unsigned long);
+			len -= 2 * sizeof(unsigned long);
+		} while (len >= 2 * sizeof(unsigned long));
+
+		crc = crc_clmul_long(m0, consts);
+		crc = crc_clmul_update_long(crc, m1, consts);
+	}
+
+	while (len >= sizeof(unsigned long)) {
+		crc = crc_clmul_update_long(crc, crc_load_long(p), consts);
+		p += sizeof(unsigned long);
+		len -= sizeof(unsigned long);
+	}
+
+	if (len)
+		crc = crc_clmul_update_partial(crc, p, len, consts);
+
+	return crc;
+}
diff --git a/scripts/gen-crc-consts.py b/scripts/gen-crc-consts.py
index aa678a50897d..f9b44fc3a03f 100755
--- a/scripts/gen-crc-consts.py
+++ b/scripts/gen-crc-consts.py
@@ -103,10 +103,61 @@ def gen_slicebyN_tables(variants, n):
             s += (' ' if s else '') + next_entry
         if s:
             print(f'\t{s}')
         print('};')
 
+def print_riscv_const(v, bits_per_long, name, val, desc):
+    print(f'\t.{name} = {fmt_poly(v, val, bits_per_long)}, /* {desc} */')
+
+def do_gen_riscv_clmul_consts(v, bits_per_long):
+    (G, n, lsb) = (v.G, v.bits, v.lsb)
+
+    pow_of_x = 3 * bits_per_long - (1 if lsb else 0)
+    print_riscv_const(v, bits_per_long, 'fold_across_2_longs_const_hi',
+                      reduce(1 << pow_of_x, G), f'x^{pow_of_x} mod G')
+    pow_of_x = 2 * bits_per_long - (1 if lsb else 0)
+    print_riscv_const(v, bits_per_long, 'fold_across_2_longs_const_lo',
+                      reduce(1 << pow_of_x, G), f'x^{pow_of_x} mod G')
+
+    pow_of_x = bits_per_long - 1 + n
+    print_riscv_const(v, bits_per_long, 'barrett_reduction_const_1',
+                      div(1 << pow_of_x, G), f'floor(x^{pow_of_x} / G)')
+
+    val = G - (1 << n)
+    desc = f'G - x^{n}'
+    if lsb:
+        val <<= bits_per_long - n
+        desc = f'({desc}) * x^{bits_per_long - n}'
+    print_riscv_const(v, bits_per_long, 'barrett_reduction_const_2', val, desc)
+
+def gen_riscv_clmul_consts(variants):
+    print('')
+    print('struct crc_clmul_consts {')
+    print('\tunsigned long fold_across_2_longs_const_hi;')
+    print('\tunsigned long fold_across_2_longs_const_lo;')
+    print('\tunsigned long barrett_reduction_const_1;')
+    print('\tunsigned long barrett_reduction_const_2;')
+    print('};')
+    for v in variants:
+        print('')
+        if v.bits > 32:
+            print_header(v, 'Constants')
+            print('#ifdef CONFIG_64BIT')
+            print(f'static const struct crc_clmul_consts {v.name}_consts __maybe_unused = {{')
+            do_gen_riscv_clmul_consts(v, 64)
+            print('};')
+            print('#endif')
+        else:
+            print_header(v, 'Constants')
+            print(f'static const struct crc_clmul_consts {v.name}_consts __maybe_unused = {{')
+            print('#ifdef CONFIG_64BIT')
+            do_gen_riscv_clmul_consts(v, 64)
+            print('#else')
+            do_gen_riscv_clmul_consts(v, 32)
+            print('#endif')
+            print('};')
+
 # Generate constants for carryless multiplication based CRC computation.
 def gen_x86_pclmul_consts(variants):
     # These are the distances, in bits, to generate folding constants for.
     FOLD_DISTANCES = [2048, 1024, 512, 256, 128]
 
@@ -211,11 +262,11 @@ def parse_crc_variants(vars_string):
         variants.append(CrcVariant(bits, generator_poly, bit_order))
     return variants
 
 if len(sys.argv) != 3:
     sys.stderr.write(f'Usage: {sys.argv[0]} CONSTS_TYPE[,CONSTS_TYPE]... CRC_VARIANT[,CRC_VARIANT]...\n')
-    sys.stderr.write('  CONSTS_TYPE can be sliceby[1-8] or x86_pclmul\n')
+    sys.stderr.write('  CONSTS_TYPE can be sliceby[1-8], riscv_clmul, or x86_pclmul\n')
     sys.stderr.write('  CRC_VARIANT is crc${num_bits}_${bit_order}_${generator_poly_as_hex}\n')
     sys.stderr.write('     E.g. crc16_msb_0x8bb7 or crc32_lsb_0xedb88320\n')
     sys.stderr.write('     Polynomial must use the given bit_order and exclude x^{num_bits}\n')
     sys.exit(1)
 
@@ -230,9 +281,11 @@ print(' */')
 consts_types = sys.argv[1].split(',')
 variants = parse_crc_variants(sys.argv[2])
 for consts_type in consts_types:
     if consts_type.startswith('sliceby'):
         gen_slicebyN_tables(variants, int(consts_type.removeprefix('sliceby')))
+    elif consts_type == 'riscv_clmul':
+        gen_riscv_clmul_consts(variants)
     elif consts_type == 'x86_pclmul':
         gen_x86_pclmul_consts(variants)
     else:
         raise ValueError(f'Unknown consts_type: {consts_type}')
-- 
2.48.1



* [PATCH 2/4] riscv/crc32: reimplement the CRC32 functions using new template
  2025-02-16 22:55 [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
  2025-02-16 22:55 ` [PATCH 1/4] riscv/crc: add "template" for Zbc optimized CRC functions Eric Biggers
@ 2025-02-16 22:55 ` Eric Biggers
  2025-02-16 22:55 ` [PATCH 3/4] riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function Eric Biggers
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Eric Biggers @ 2025-02-16 22:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-crypto, linux-riscv, Zhihang Shao, Ard Biesheuvel,
	Xiao Wang, Charlie Jenkins

From: Eric Biggers <ebiggers@google.com>

Delete the previous Zbc optimized CRC32 code, and re-implement it using
the new template.  The new implementation is more optimized and shares
more code among CRC variants.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/riscv/lib/Makefile           |   1 +
 arch/riscv/lib/crc-clmul-consts.h |  72 +++++++
 arch/riscv/lib/crc-clmul.h        |  15 ++
 arch/riscv/lib/crc32-riscv.c      | 310 ------------------------------
 arch/riscv/lib/crc32.c            |  53 +++++
 arch/riscv/lib/crc32_lsb.c        |  18 ++
 arch/riscv/lib/crc32_msb.c        |  18 ++
 7 files changed, 177 insertions(+), 310 deletions(-)
 create mode 100644 arch/riscv/lib/crc-clmul-consts.h
 create mode 100644 arch/riscv/lib/crc-clmul.h
 delete mode 100644 arch/riscv/lib/crc32-riscv.c
 create mode 100644 arch/riscv/lib/crc32.c
 create mode 100644 arch/riscv/lib/crc32_lsb.c
 create mode 100644 arch/riscv/lib/crc32_msb.c

diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 79368a895fee..7b32d3e88337 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -14,8 +14,9 @@ lib-$(CONFIG_RISCV_ISA_V)	+= uaccess_vector.o
 endif
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_RISCV_ISA_ZICBOZ)	+= clear_page.o
 obj-$(CONFIG_CRC32_ARCH)	+= crc32-riscv.o
+crc32-riscv-y := crc32.o crc32_msb.o crc32_lsb.o
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
 lib-$(CONFIG_RISCV_ISA_V)	+= xor.o
 lib-$(CONFIG_RISCV_ISA_V)	+= riscv_v_helpers.o
diff --git a/arch/riscv/lib/crc-clmul-consts.h b/arch/riscv/lib/crc-clmul-consts.h
new file mode 100644
index 000000000000..6fdf10648a20
--- /dev/null
+++ b/arch/riscv/lib/crc-clmul-consts.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * CRC constants generated by:
+ *
+ *	./scripts/gen-crc-consts.py riscv_clmul crc32_msb_0x04c11db7,crc32_lsb_0xedb88320,crc32_lsb_0x82f63b78
+ *
+ * Do not edit manually.
+ */
+
+struct crc_clmul_consts {
+	unsigned long fold_across_2_longs_const_hi;
+	unsigned long fold_across_2_longs_const_lo;
+	unsigned long barrett_reduction_const_1;
+	unsigned long barrett_reduction_const_2;
+};
+
+/*
+ * Constants generated for most-significant-bit-first CRC-32 using
+ * G(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 +
+ *        x^5 + x^4 + x^2 + x^1 + x^0
+ */
+static const struct crc_clmul_consts crc32_msb_0x04c11db7_consts __maybe_unused = {
+#ifdef CONFIG_64BIT
+	.fold_across_2_longs_const_hi = 0x00000000c5b9cd4c, /* x^192 mod G */
+	.fold_across_2_longs_const_lo = 0x00000000e8a45605, /* x^128 mod G */
+	.barrett_reduction_const_1 = 0x826880efa40da72d, /* floor(x^95 / G) */
+	.barrett_reduction_const_2 = 0x0000000004c11db7, /* G - x^32 */
+#else
+	.fold_across_2_longs_const_hi = 0xf200aa66, /* x^96 mod G */
+	.fold_across_2_longs_const_lo = 0x490d678d, /* x^64 mod G */
+	.barrett_reduction_const_1 = 0x826880ef, /* floor(x^63 / G) */
+	.barrett_reduction_const_2 = 0x04c11db7, /* G - x^32 */
+#endif
+};
+
+/*
+ * Constants generated for least-significant-bit-first CRC-32 using
+ * G(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 +
+ *        x^5 + x^4 + x^2 + x^1 + x^0
+ */
+static const struct crc_clmul_consts crc32_lsb_0xedb88320_consts __maybe_unused = {
+#ifdef CONFIG_64BIT
+	.fold_across_2_longs_const_hi = 0x65673b4600000000, /* x^191 mod G */
+	.fold_across_2_longs_const_lo = 0x9ba54c6f00000000, /* x^127 mod G */
+	.barrett_reduction_const_1 = 0xb4e5b025f7011641, /* floor(x^95 / G) */
+	.barrett_reduction_const_2 = 0x00000000edb88320, /* (G - x^32) * x^32 */
+#else
+	.fold_across_2_longs_const_hi = 0xccaa009e, /* x^95 mod G */
+	.fold_across_2_longs_const_lo = 0xb8bc6765, /* x^63 mod G */
+	.barrett_reduction_const_1 = 0xf7011641, /* floor(x^63 / G) */
+	.barrett_reduction_const_2 = 0xedb88320, /* (G - x^32) * x^0 */
+#endif
+};
+
+/*
+ * Constants generated for least-significant-bit-first CRC-32 using
+ * G(x) = x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 + x^18 +
+ *        x^14 + x^13 + x^11 + x^10 + x^9 + x^8 + x^6 + x^0
+ */
+static const struct crc_clmul_consts crc32_lsb_0x82f63b78_consts __maybe_unused = {
+#ifdef CONFIG_64BIT
+	.fold_across_2_longs_const_hi = 0x3743f7bd00000000, /* x^191 mod G */
+	.fold_across_2_longs_const_lo = 0x3171d43000000000, /* x^127 mod G */
+	.barrett_reduction_const_1 = 0x4869ec38dea713f1, /* floor(x^95 / G) */
+	.barrett_reduction_const_2 = 0x0000000082f63b78, /* (G - x^32) * x^32 */
+#else
+	.fold_across_2_longs_const_hi = 0x493c7d27, /* x^95 mod G */
+	.fold_across_2_longs_const_lo = 0xdd45aab8, /* x^63 mod G */
+	.barrett_reduction_const_1 = 0xdea713f1, /* floor(x^63 / G) */
+	.barrett_reduction_const_2 = 0x82f63b78, /* (G - x^32) * x^0 */
+#endif
+};
diff --git a/arch/riscv/lib/crc-clmul.h b/arch/riscv/lib/crc-clmul.h
new file mode 100644
index 000000000000..62bd41080785
--- /dev/null
+++ b/arch/riscv/lib/crc-clmul.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Copyright 2025 Google LLC */
+
+#ifndef _RISCV_CRC_CLMUL_H
+#define _RISCV_CRC_CLMUL_H
+
+#include <linux/types.h>
+#include "crc-clmul-consts.h"
+
+u32 crc32_msb_clmul(u32 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts);
+u32 crc32_lsb_clmul(u32 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts);
+
+#endif /* _RISCV_CRC_CLMUL_H */
diff --git a/arch/riscv/lib/crc32-riscv.c b/arch/riscv/lib/crc32-riscv.c
deleted file mode 100644
index b5cb752847c4..000000000000
--- a/arch/riscv/lib/crc32-riscv.c
+++ /dev/null
@@ -1,310 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Accelerated CRC32 implementation with Zbc extension.
- *
- * Copyright (C) 2024 Intel Corporation
- */
-
-#include <asm/hwcap.h>
-#include <asm/alternative-macros.h>
-#include <asm/byteorder.h>
-
-#include <linux/types.h>
-#include <linux/minmax.h>
-#include <linux/crc32poly.h>
-#include <linux/crc32.h>
-#include <linux/byteorder/generic.h>
-#include <linux/module.h>
-
-/*
- * Refer to https://www.corsix.org/content/barrett-reduction-polynomials for
- * better understanding of how this math works.
- *
- * let "+" denotes polynomial add (XOR)
- * let "-" denotes polynomial sub (XOR)
- * let "*" denotes polynomial multiplication
- * let "/" denotes polynomial floor division
- * let "S" denotes source data, XLEN bit wide
- * let "P" denotes CRC32 polynomial
- * let "T" denotes 2^(XLEN+32)
- * let "QT" denotes quotient of T/P, with the bit for 2^XLEN being implicit
- *
- * crc32(S, P)
- * => S * (2^32) - S * (2^32) / P * P
- * => lowest 32 bits of: S * (2^32) / P * P
- * => lowest 32 bits of: S * (2^32) * (T / P) / T * P
- * => lowest 32 bits of: S * (2^32) * quotient / T * P
- * => lowest 32 bits of: S * quotient / 2^XLEN * P
- * => lowest 32 bits of: (clmul_high_part(S, QT) + S) * P
- * => clmul_low_part(clmul_high_part(S, QT) + S, P)
- *
- * In terms of below implementations, the BE case is more intuitive, since the
- * higher order bit sits at more significant position.
- */
-
-#if __riscv_xlen == 64
-/* Slide by XLEN bits per iteration */
-# define STEP_ORDER 3
-
-/* Each below polynomial quotient has an implicit bit for 2^XLEN */
-
-/* Polynomial quotient of (2^(XLEN+32))/CRC32_POLY, in LE format */
-# define CRC32_POLY_QT_LE	0x5a72d812fb808b20
-
-/* Polynomial quotient of (2^(XLEN+32))/CRC32C_POLY, in LE format */
-# define CRC32C_POLY_QT_LE	0xa434f61c6f5389f8
-
-/* Polynomial quotient of (2^(XLEN+32))/CRC32_POLY, in BE format, it should be
- * the same as the bit-reversed version of CRC32_POLY_QT_LE
- */
-# define CRC32_POLY_QT_BE	0x04d101df481b4e5a
-
-static inline u64 crc32_le_prep(u32 crc, unsigned long const *ptr)
-{
-	return (u64)crc ^ (__force u64)__cpu_to_le64(*ptr);
-}
-
-static inline u32 crc32_le_zbc(unsigned long s, u32 poly, unsigned long poly_qt)
-{
-	u32 crc;
-
-	/* We don't have a "clmulrh" insn, so use clmul + slli instead. */
-	asm volatile (".option push\n"
-		      ".option arch,+zbc\n"
-		      "clmul	%0, %1, %2\n"
-		      "slli	%0, %0, 1\n"
-		      "xor	%0, %0, %1\n"
-		      "clmulr	%0, %0, %3\n"
-		      "srli	%0, %0, 32\n"
-		      ".option pop\n"
-		      : "=&r" (crc)
-		      : "r" (s),
-			"r" (poly_qt),
-			"r" ((u64)poly << 32)
-		      :);
-	return crc;
-}
-
-static inline u64 crc32_be_prep(u32 crc, unsigned long const *ptr)
-{
-	return ((u64)crc << 32) ^ (__force u64)__cpu_to_be64(*ptr);
-}
-
-#elif __riscv_xlen == 32
-# define STEP_ORDER 2
-/* Each quotient should match the upper half of its analog in RV64 */
-# define CRC32_POLY_QT_LE	0xfb808b20
-# define CRC32C_POLY_QT_LE	0x6f5389f8
-# define CRC32_POLY_QT_BE	0x04d101df
-
-static inline u32 crc32_le_prep(u32 crc, unsigned long const *ptr)
-{
-	return crc ^ (__force u32)__cpu_to_le32(*ptr);
-}
-
-static inline u32 crc32_le_zbc(unsigned long s, u32 poly, unsigned long poly_qt)
-{
-	u32 crc;
-
-	/* We don't have a "clmulrh" insn, so use clmul + slli instead. */
-	asm volatile (".option push\n"
-		      ".option arch,+zbc\n"
-		      "clmul	%0, %1, %2\n"
-		      "slli	%0, %0, 1\n"
-		      "xor	%0, %0, %1\n"
-		      "clmulr	%0, %0, %3\n"
-		      ".option pop\n"
-		      : "=&r" (crc)
-		      : "r" (s),
-			"r" (poly_qt),
-			"r" (poly)
-		      :);
-	return crc;
-}
-
-static inline u32 crc32_be_prep(u32 crc, unsigned long const *ptr)
-{
-	return crc ^ (__force u32)__cpu_to_be32(*ptr);
-}
-
-#else
-# error "Unexpected __riscv_xlen"
-#endif
-
-static inline u32 crc32_be_zbc(unsigned long s)
-{
-	u32 crc;
-
-	asm volatile (".option push\n"
-		      ".option arch,+zbc\n"
-		      "clmulh	%0, %1, %2\n"
-		      "xor	%0, %0, %1\n"
-		      "clmul	%0, %0, %3\n"
-		      ".option pop\n"
-		      : "=&r" (crc)
-		      : "r" (s),
-			"r" (CRC32_POLY_QT_BE),
-			"r" (CRC32_POLY_BE)
-		      :);
-	return crc;
-}
-
-#define STEP		(1 << STEP_ORDER)
-#define OFFSET_MASK	(STEP - 1)
-
-typedef u32 (*fallback)(u32 crc, unsigned char const *p, size_t len);
-
-static inline u32 crc32_le_unaligned(u32 crc, unsigned char const *p,
-				     size_t len, u32 poly,
-				     unsigned long poly_qt)
-{
-	size_t bits = len * 8;
-	unsigned long s = 0;
-	u32 crc_low = 0;
-
-	for (int i = 0; i < len; i++)
-		s = ((unsigned long)*p++ << (__riscv_xlen - 8)) | (s >> 8);
-
-	s ^= (unsigned long)crc << (__riscv_xlen - bits);
-	if (__riscv_xlen == 32 || len < sizeof(u32))
-		crc_low = crc >> bits;
-
-	crc = crc32_le_zbc(s, poly, poly_qt);
-	crc ^= crc_low;
-
-	return crc;
-}
-
-static inline u32 crc32_le_generic(u32 crc, unsigned char const *p, size_t len,
-				   u32 poly, unsigned long poly_qt,
-				   fallback crc_fb)
-{
-	size_t offset, head_len, tail_len;
-	unsigned long const *p_ul;
-	unsigned long s;
-
-	asm goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
-			     RISCV_ISA_EXT_ZBC, 1)
-		 : : : : legacy);
-
-	/* Handle the unaligned head. */
-	offset = (unsigned long)p & OFFSET_MASK;
-	if (offset && len) {
-		head_len = min(STEP - offset, len);
-		crc = crc32_le_unaligned(crc, p, head_len, poly, poly_qt);
-		p += head_len;
-		len -= head_len;
-	}
-
-	tail_len = len & OFFSET_MASK;
-	len = len >> STEP_ORDER;
-	p_ul = (unsigned long const *)p;
-
-	for (int i = 0; i < len; i++) {
-		s = crc32_le_prep(crc, p_ul);
-		crc = crc32_le_zbc(s, poly, poly_qt);
-		p_ul++;
-	}
-
-	/* Handle the tail bytes. */
-	p = (unsigned char const *)p_ul;
-	if (tail_len)
-		crc = crc32_le_unaligned(crc, p, tail_len, poly, poly_qt);
-
-	return crc;
-
-legacy:
-	return crc_fb(crc, p, len);
-}
-
-u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
-{
-	return crc32_le_generic(crc, p, len, CRC32_POLY_LE, CRC32_POLY_QT_LE,
-				crc32_le_base);
-}
-EXPORT_SYMBOL(crc32_le_arch);
-
-u32 crc32c_arch(u32 crc, const u8 *p, size_t len)
-{
-	return crc32_le_generic(crc, p, len, CRC32C_POLY_LE,
-				CRC32C_POLY_QT_LE, crc32c_base);
-}
-EXPORT_SYMBOL(crc32c_arch);
-
-static inline u32 crc32_be_unaligned(u32 crc, unsigned char const *p,
-				     size_t len)
-{
-	size_t bits = len * 8;
-	unsigned long s = 0;
-	u32 crc_low = 0;
-
-	s = 0;
-	for (int i = 0; i < len; i++)
-		s = *p++ | (s << 8);
-
-	if (__riscv_xlen == 32 || len < sizeof(u32)) {
-		s ^= crc >> (32 - bits);
-		crc_low = crc << bits;
-	} else {
-		s ^= (unsigned long)crc << (bits - 32);
-	}
-
-	crc = crc32_be_zbc(s);
-	crc ^= crc_low;
-
-	return crc;
-}
-
-u32 crc32_be_arch(u32 crc, const u8 *p, size_t len)
-{
-	size_t offset, head_len, tail_len;
-	unsigned long const *p_ul;
-	unsigned long s;
-
-	asm goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
-			     RISCV_ISA_EXT_ZBC, 1)
-		 : : : : legacy);
-
-	/* Handle the unaligned head. */
-	offset = (unsigned long)p & OFFSET_MASK;
-	if (offset && len) {
-		head_len = min(STEP - offset, len);
-		crc = crc32_be_unaligned(crc, p, head_len);
-		p += head_len;
-		len -= head_len;
-	}
-
-	tail_len = len & OFFSET_MASK;
-	len = len >> STEP_ORDER;
-	p_ul = (unsigned long const *)p;
-
-	for (int i = 0; i < len; i++) {
-		s = crc32_be_prep(crc, p_ul);
-		crc = crc32_be_zbc(s);
-		p_ul++;
-	}
-
-	/* Handle the tail bytes. */
-	p = (unsigned char const *)p_ul;
-	if (tail_len)
-		crc = crc32_be_unaligned(crc, p, tail_len);
-
-	return crc;
-
-legacy:
-	return crc32_be_base(crc, p, len);
-}
-EXPORT_SYMBOL(crc32_be_arch);
-
-u32 crc32_optimizations(void)
-{
-	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBC))
-		return CRC32_LE_OPTIMIZATION |
-		       CRC32_BE_OPTIMIZATION |
-		       CRC32C_OPTIMIZATION;
-	return 0;
-}
-EXPORT_SYMBOL(crc32_optimizations);
-
-MODULE_LICENSE("GPL");
-MODULE_DESCRIPTION("Accelerated CRC32 implementation with Zbc extension");
diff --git a/arch/riscv/lib/crc32.c b/arch/riscv/lib/crc32.c
new file mode 100644
index 000000000000..a3188b7d9c40
--- /dev/null
+++ b/arch/riscv/lib/crc32.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RISC-V optimized CRC32 functions
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include <asm/hwcap.h>
+#include <asm/alternative-macros.h>
+#include <linux/crc32.h>
+#include <linux/module.h>
+
+#include "crc-clmul.h"
+
+u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
+{
+	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBC))
+		return crc32_lsb_clmul(crc, p, len,
+				       &crc32_lsb_0xedb88320_consts);
+	return crc32_le_base(crc, p, len);
+}
+EXPORT_SYMBOL(crc32_le_arch);
+
+u32 crc32_be_arch(u32 crc, const u8 *p, size_t len)
+{
+	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBC))
+		return crc32_msb_clmul(crc, p, len,
+				       &crc32_msb_0x04c11db7_consts);
+	return crc32_be_base(crc, p, len);
+}
+EXPORT_SYMBOL(crc32_be_arch);
+
+u32 crc32c_arch(u32 crc, const u8 *p, size_t len)
+{
+	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBC))
+		return crc32_lsb_clmul(crc, p, len,
+				       &crc32_lsb_0x82f63b78_consts);
+	return crc32c_base(crc, p, len);
+}
+EXPORT_SYMBOL(crc32c_arch);
+
+u32 crc32_optimizations(void)
+{
+	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBC))
+		return CRC32_LE_OPTIMIZATION |
+		       CRC32_BE_OPTIMIZATION |
+		       CRC32C_OPTIMIZATION;
+	return 0;
+}
+EXPORT_SYMBOL(crc32_optimizations);
+
+MODULE_DESCRIPTION("RISC-V optimized CRC32 functions");
+MODULE_LICENSE("GPL");
diff --git a/arch/riscv/lib/crc32_lsb.c b/arch/riscv/lib/crc32_lsb.c
new file mode 100644
index 000000000000..72fd67e7470c
--- /dev/null
+++ b/arch/riscv/lib/crc32_lsb.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RISC-V optimized least-significant-bit-first CRC32
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include "crc-clmul.h"
+
+typedef u32 crc_t;
+#define LSB_CRC 1
+#include "crc-clmul-template.h"
+
+u32 crc32_lsb_clmul(u32 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts)
+{
+	return crc_clmul(crc, p, len, consts);
+}
diff --git a/arch/riscv/lib/crc32_msb.c b/arch/riscv/lib/crc32_msb.c
new file mode 100644
index 000000000000..fdbeaccc369f
--- /dev/null
+++ b/arch/riscv/lib/crc32_msb.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RISC-V optimized most-significant-bit-first CRC32
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include "crc-clmul.h"
+
+typedef u32 crc_t;
+#define LSB_CRC 0
+#include "crc-clmul-template.h"
+
+u32 crc32_msb_clmul(u32 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts)
+{
+	return crc_clmul(crc, p, len, consts);
+}
-- 
2.48.1



* [PATCH 3/4] riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
  2025-02-16 22:55 [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
  2025-02-16 22:55 ` [PATCH 1/4] riscv/crc: add "template" for Zbc optimized CRC functions Eric Biggers
  2025-02-16 22:55 ` [PATCH 2/4] riscv/crc32: reimplement the CRC32 functions using new template Eric Biggers
@ 2025-02-16 22:55 ` Eric Biggers
  2025-02-16 22:55 ` [PATCH 4/4] riscv/crc64: add Zbc optimized CRC64 functions Eric Biggers
  2025-02-24 18:06 ` [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
  4 siblings, 0 replies; 12+ messages in thread
From: Eric Biggers @ 2025-02-16 22:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-crypto, linux-riscv, Zhihang Shao, Ard Biesheuvel,
	Xiao Wang, Charlie Jenkins

From: Eric Biggers <ebiggers@google.com>

Wire up crc_t10dif_arch() for RISC-V using crc-clmul-template.h.  This
greatly improves CRC-T10DIF performance on Zbc-capable CPUs.
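
Note that kernel code does not call crc_t10dif_arch() directly; it goes
through the wrappers in <linux/crc-t10dif.h>, which dispatch to the
arch implementation when one is built in.  A sketch of typical caller
usage:

	#include <linux/crc-t10dif.h>

	/* One-shot CRC of a buffer: */
	u16 crc = crc_t10dif(buf, len);

	/* Equivalently, incrementally from the initial value 0: */
	u16 crc2 = crc_t10dif_update(0, buf, len);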

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/riscv/Kconfig                |  1 +
 arch/riscv/lib/Makefile           |  2 ++
 arch/riscv/lib/crc-clmul-consts.h | 20 +++++++++++++++++++-
 arch/riscv/lib/crc-clmul.h        |  2 ++
 arch/riscv/lib/crc-t10dif.c       | 24 ++++++++++++++++++++++++
 arch/riscv/lib/crc16_msb.c        | 18 ++++++++++++++++++
 6 files changed, 66 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/lib/crc-t10dif.c
 create mode 100644 arch/riscv/lib/crc16_msb.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 7612c52e9b1e..db1cf9666dfd 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -23,10 +23,11 @@ config RISCV
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
 	select ARCH_HAS_BINFMT_FLAT
 	select ARCH_HAS_CRC32 if RISCV_ISA_ZBC
+	select ARCH_HAS_CRC_T10DIF if RISCV_ISA_ZBC
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DEBUG_WX
 	select ARCH_HAS_FAST_MULTIPLIER
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 7b32d3e88337..06d9552b9c8b 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -15,8 +15,10 @@ endif
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_RISCV_ISA_ZICBOZ)	+= clear_page.o
 obj-$(CONFIG_CRC32_ARCH)	+= crc32-riscv.o
 crc32-riscv-y := crc32.o crc32_msb.o crc32_lsb.o
+obj-$(CONFIG_CRC_T10DIF_ARCH)	+= crc-t10dif-riscv.o
+crc-t10dif-riscv-y := crc-t10dif.o crc16_msb.o
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
 lib-$(CONFIG_RISCV_ISA_V)	+= xor.o
 lib-$(CONFIG_RISCV_ISA_V)	+= riscv_v_helpers.o
diff --git a/arch/riscv/lib/crc-clmul-consts.h b/arch/riscv/lib/crc-clmul-consts.h
index 6fdf10648a20..b3a02b9096cd 100644
--- a/arch/riscv/lib/crc-clmul-consts.h
+++ b/arch/riscv/lib/crc-clmul-consts.h
@@ -1,10 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
 /*
  * CRC constants generated by:
  *
- *	./scripts/gen-crc-consts.py riscv_clmul crc32_msb_0x04c11db7,crc32_lsb_0xedb88320,crc32_lsb_0x82f63b78
+ *	./scripts/gen-crc-consts.py riscv_clmul crc16_msb_0x8bb7,crc32_msb_0x04c11db7,crc32_lsb_0xedb88320,crc32_lsb_0x82f63b78
  *
  * Do not edit manually.
  */
 
 struct crc_clmul_consts {
@@ -12,10 +12,28 @@ struct crc_clmul_consts {
 	unsigned long fold_across_2_longs_const_lo;
 	unsigned long barrett_reduction_const_1;
 	unsigned long barrett_reduction_const_2;
 };
 
+/*
+ * Constants generated for most-significant-bit-first CRC-16 using
+ * G(x) = x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 + x^2 + x^1 + x^0
+ */
+static const struct crc_clmul_consts crc16_msb_0x8bb7_consts __maybe_unused = {
+#ifdef CONFIG_64BIT
+	.fold_across_2_longs_const_hi = 0x0000000000001faa, /* x^192 mod G */
+	.fold_across_2_longs_const_lo = 0x000000000000a010, /* x^128 mod G */
+	.barrett_reduction_const_1 = 0xfb2d2bfc0e99d245, /* floor(x^79 / G) */
+	.barrett_reduction_const_2 = 0x0000000000008bb7, /* G - x^16 */
+#else
+	.fold_across_2_longs_const_hi = 0x00005890, /* x^96 mod G */
+	.fold_across_2_longs_const_lo = 0x0000f249, /* x^64 mod G */
+	.barrett_reduction_const_1 = 0xfb2d2bfc, /* floor(x^47 / G) */
+	.barrett_reduction_const_2 = 0x00008bb7, /* G - x^16 */
+#endif
+};
+
 /*
  * Constants generated for most-significant-bit-first CRC-32 using
  * G(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 +
  *        x^5 + x^4 + x^2 + x^1 + x^0
  */
diff --git a/arch/riscv/lib/crc-clmul.h b/arch/riscv/lib/crc-clmul.h
index 62bd41080785..162c1b12b219 100644
--- a/arch/riscv/lib/crc-clmul.h
+++ b/arch/riscv/lib/crc-clmul.h
@@ -5,10 +5,12 @@
 #define _RISCV_CRC_CLMUL_H
 
 #include <linux/types.h>
 #include "crc-clmul-consts.h"
 
+u16 crc16_msb_clmul(u16 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts);
 u32 crc32_msb_clmul(u32 crc, const void *p, size_t len,
 		    const struct crc_clmul_consts *consts);
 u32 crc32_lsb_clmul(u32 crc, const void *p, size_t len,
 		    const struct crc_clmul_consts *consts);
 
diff --git a/arch/riscv/lib/crc-t10dif.c b/arch/riscv/lib/crc-t10dif.c
new file mode 100644
index 000000000000..e6b0051ccd86
--- /dev/null
+++ b/arch/riscv/lib/crc-t10dif.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RISC-V optimized CRC-T10DIF function
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include <asm/hwcap.h>
+#include <asm/alternative-macros.h>
+#include <linux/crc-t10dif.h>
+#include <linux/module.h>
+
+#include "crc-clmul.h"
+
+u16 crc_t10dif_arch(u16 crc, const u8 *p, size_t len)
+{
+	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBC))
+		return crc16_msb_clmul(crc, p, len, &crc16_msb_0x8bb7_consts);
+	return crc_t10dif_generic(crc, p, len);
+}
+EXPORT_SYMBOL(crc_t10dif_arch);
+
+MODULE_DESCRIPTION("RISC-V optimized CRC-T10DIF function");
+MODULE_LICENSE("GPL");
diff --git a/arch/riscv/lib/crc16_msb.c b/arch/riscv/lib/crc16_msb.c
new file mode 100644
index 000000000000..554d295e95f5
--- /dev/null
+++ b/arch/riscv/lib/crc16_msb.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RISC-V optimized most-significant-bit-first CRC16
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include "crc-clmul.h"
+
+typedef u16 crc_t;
+#define LSB_CRC 0
+#include "crc-clmul-template.h"
+
+u16 crc16_msb_clmul(u16 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts)
+{
+	return crc_clmul(crc, p, len, consts);
+}
-- 
2.48.1



* [PATCH 4/4] riscv/crc64: add Zbc optimized CRC64 functions
  2025-02-16 22:55 [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
                   ` (2 preceding siblings ...)
  2025-02-16 22:55 ` [PATCH 3/4] riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function Eric Biggers
@ 2025-02-16 22:55 ` Eric Biggers
  2025-02-24 18:06 ` [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
  4 siblings, 0 replies; 12+ messages in thread
From: Eric Biggers @ 2025-02-16 22:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-crypto, linux-riscv, Zhihang Shao, Ard Biesheuvel,
	Xiao Wang, Charlie Jenkins

From: Eric Biggers <ebiggers@google.com>

Wire up crc64_be_arch() and crc64_nvme_arch() for 64-bit RISC-V using
crc-clmul-template.h.  This greatly improves the performance of these
CRCs on Zbc-capable CPUs in 64-bit kernels.

These optimized CRC64 functions are not yet supported in 32-bit kernels,
since crc-clmul-template.h assumes that the CRC fits in an unsigned
long.  That implementation limitation could be addressed, but it would
add a fair bit of complexity, so it has been omitted for now.
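
(The template already makes that assumption explicit with a
compile-time check:

	/* This implementation assumes that the CRC fits in an unsigned long. */
	BUILD_BUG_ON(sizeof(crc_t) > sizeof(unsigned long));

so a 32-bit kernel that tried to instantiate it with a u64 crc_t would
fail to build rather than silently miscompute.)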

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/riscv/Kconfig                |  1 +
 arch/riscv/lib/Makefile           |  2 ++
 arch/riscv/lib/crc-clmul-consts.h | 34 ++++++++++++++++++++++++++++++-
 arch/riscv/lib/crc-clmul.h        |  6 ++++++
 arch/riscv/lib/crc64.c            | 34 +++++++++++++++++++++++++++++++
 arch/riscv/lib/crc64_lsb.c        | 18 ++++++++++++++++
 arch/riscv/lib/crc64_msb.c        | 18 ++++++++++++++++
 7 files changed, 112 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/lib/crc64.c
 create mode 100644 arch/riscv/lib/crc64_lsb.c
 create mode 100644 arch/riscv/lib/crc64_msb.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index db1cf9666dfd..e10dda2d0bfe 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -23,10 +23,11 @@ config RISCV
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
 	select ARCH_HAS_BINFMT_FLAT
 	select ARCH_HAS_CRC32 if RISCV_ISA_ZBC
+	select ARCH_HAS_CRC64 if 64BIT && RISCV_ISA_ZBC
 	select ARCH_HAS_CRC_T10DIF if RISCV_ISA_ZBC
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DEBUG_WX
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 06d9552b9c8b..b1c46153606a 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -15,10 +15,12 @@ endif
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_RISCV_ISA_ZICBOZ)	+= clear_page.o
 obj-$(CONFIG_CRC32_ARCH)	+= crc32-riscv.o
 crc32-riscv-y := crc32.o crc32_msb.o crc32_lsb.o
+obj-$(CONFIG_CRC64_ARCH) += crc64-riscv.o
+crc64-riscv-y := crc64.o crc64_msb.o crc64_lsb.o
 obj-$(CONFIG_CRC_T10DIF_ARCH)	+= crc-t10dif-riscv.o
 crc-t10dif-riscv-y := crc-t10dif.o crc16_msb.o
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
 lib-$(CONFIG_RISCV_ISA_V)	+= xor.o
 lib-$(CONFIG_RISCV_ISA_V)	+= riscv_v_helpers.o
diff --git a/arch/riscv/lib/crc-clmul-consts.h b/arch/riscv/lib/crc-clmul-consts.h
index b3a02b9096cd..8d73449235ef 100644
--- a/arch/riscv/lib/crc-clmul-consts.h
+++ b/arch/riscv/lib/crc-clmul-consts.h
@@ -1,10 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
 /*
  * CRC constants generated by:
  *
- *	./scripts/gen-crc-consts.py riscv_clmul crc16_msb_0x8bb7,crc32_msb_0x04c11db7,crc32_lsb_0xedb88320,crc32_lsb_0x82f63b78
+ *	./scripts/gen-crc-consts.py riscv_clmul crc16_msb_0x8bb7,crc32_msb_0x04c11db7,crc32_lsb_0xedb88320,crc32_lsb_0x82f63b78,crc64_msb_0x42f0e1eba9ea3693,crc64_lsb_0x9a6c9329ac4bc9b5
  *
  * Do not edit manually.
  */
 
 struct crc_clmul_consts {
@@ -86,5 +86,37 @@ static const struct crc_clmul_consts crc32_lsb_0x82f63b78_consts __maybe_unused
 	.fold_across_2_longs_const_lo = 0xdd45aab8, /* x^63 mod G */
 	.barrett_reduction_const_1 = 0xdea713f1, /* floor(x^63 / G) */
 	.barrett_reduction_const_2 = 0x82f63b78, /* (G - x^32) * x^0 */
 #endif
 };
+
+/*
+ * Constants generated for most-significant-bit-first CRC-64 using
+ * G(x) = x^64 + x^62 + x^57 + x^55 + x^54 + x^53 + x^52 + x^47 + x^46 + x^45 +
+ *        x^40 + x^39 + x^38 + x^37 + x^35 + x^33 + x^32 + x^31 + x^29 + x^27 +
+ *        x^24 + x^23 + x^22 + x^21 + x^19 + x^17 + x^13 + x^12 + x^10 + x^9 +
+ *        x^7 + x^4 + x^1 + x^0
+ */
+#ifdef CONFIG_64BIT
+static const struct crc_clmul_consts crc64_msb_0x42f0e1eba9ea3693_consts __maybe_unused = {
+	.fold_across_2_longs_const_hi = 0x4eb938a7d257740e, /* x^192 mod G */
+	.fold_across_2_longs_const_lo = 0x05f5c3c7eb52fab6, /* x^128 mod G */
+	.barrett_reduction_const_1 = 0xabc694e836627c39, /* floor(x^127 / G) */
+	.barrett_reduction_const_2 = 0x42f0e1eba9ea3693, /* G - x^64 */
+};
+#endif
+
+/*
+ * Constants generated for least-significant-bit-first CRC-64 using
+ * G(x) = x^64 + x^63 + x^61 + x^59 + x^58 + x^56 + x^55 + x^52 + x^49 + x^48 +
+ *        x^47 + x^46 + x^44 + x^41 + x^37 + x^36 + x^34 + x^32 + x^31 + x^28 +
+ *        x^26 + x^23 + x^22 + x^19 + x^16 + x^13 + x^12 + x^10 + x^9 + x^6 +
+ *        x^4 + x^3 + x^0
+ */
+#ifdef CONFIG_64BIT
+static const struct crc_clmul_consts crc64_lsb_0x9a6c9329ac4bc9b5_consts __maybe_unused = {
+	.fold_across_2_longs_const_hi = 0xeadc41fd2ba3d420, /* x^191 mod G */
+	.fold_across_2_longs_const_lo = 0x21e9761e252621ac, /* x^127 mod G */
+	.barrett_reduction_const_1 = 0x27ecfa329aef9f77, /* floor(x^127 / G) */
+	.barrett_reduction_const_2 = 0x9a6c9329ac4bc9b5, /* (G - x^64) * x^0 */
+};
+#endif
diff --git a/arch/riscv/lib/crc-clmul.h b/arch/riscv/lib/crc-clmul.h
index 162c1b12b219..dd1736245815 100644
--- a/arch/riscv/lib/crc-clmul.h
+++ b/arch/riscv/lib/crc-clmul.h
@@ -11,7 +11,13 @@ u16 crc16_msb_clmul(u16 crc, const void *p, size_t len,
 		    const struct crc_clmul_consts *consts);
 u32 crc32_msb_clmul(u32 crc, const void *p, size_t len,
 		    const struct crc_clmul_consts *consts);
 u32 crc32_lsb_clmul(u32 crc, const void *p, size_t len,
 		    const struct crc_clmul_consts *consts);
+#ifdef CONFIG_64BIT
+u64 crc64_msb_clmul(u64 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts);
+u64 crc64_lsb_clmul(u64 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts);
+#endif
 
 #endif /* _RISCV_CRC_CLMUL_H */
diff --git a/arch/riscv/lib/crc64.c b/arch/riscv/lib/crc64.c
new file mode 100644
index 000000000000..f0015a27836a
--- /dev/null
+++ b/arch/riscv/lib/crc64.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RISC-V optimized CRC64 functions
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include <asm/hwcap.h>
+#include <asm/alternative-macros.h>
+#include <linux/crc64.h>
+#include <linux/module.h>
+
+#include "crc-clmul.h"
+
+u64 crc64_be_arch(u64 crc, const u8 *p, size_t len)
+{
+	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBC))
+		return crc64_msb_clmul(crc, p, len,
+				       &crc64_msb_0x42f0e1eba9ea3693_consts);
+	return crc64_be_generic(crc, p, len);
+}
+EXPORT_SYMBOL(crc64_be_arch);
+
+u64 crc64_nvme_arch(u64 crc, const u8 *p, size_t len)
+{
+	if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBC))
+		return crc64_lsb_clmul(crc, p, len,
+				       &crc64_lsb_0x9a6c9329ac4bc9b5_consts);
+	return crc64_nvme_generic(crc, p, len);
+}
+EXPORT_SYMBOL(crc64_nvme_arch);
+
+MODULE_DESCRIPTION("RISC-V optimized CRC64 functions");
+MODULE_LICENSE("GPL");
diff --git a/arch/riscv/lib/crc64_lsb.c b/arch/riscv/lib/crc64_lsb.c
new file mode 100644
index 000000000000..c5371bb85d90
--- /dev/null
+++ b/arch/riscv/lib/crc64_lsb.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RISC-V optimized least-significant-bit-first CRC64
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include "crc-clmul.h"
+
+typedef u64 crc_t;
+#define LSB_CRC 1
+#include "crc-clmul-template.h"
+
+u64 crc64_lsb_clmul(u64 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts)
+{
+	return crc_clmul(crc, p, len, consts);
+}
diff --git a/arch/riscv/lib/crc64_msb.c b/arch/riscv/lib/crc64_msb.c
new file mode 100644
index 000000000000..1925d1dbe225
--- /dev/null
+++ b/arch/riscv/lib/crc64_msb.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * RISC-V optimized most-significant-bit-first CRC64
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include "crc-clmul.h"
+
+typedef u64 crc_t;
+#define LSB_CRC 0
+#include "crc-clmul-template.h"
+
+u64 crc64_msb_clmul(u64 crc, const void *p, size_t len,
+		    const struct crc_clmul_consts *consts)
+{
+	return crc_clmul(crc, p, len, consts);
+}
-- 
2.48.1



* Re: [PATCH 0/4] RISC-V CRC optimizations
  2025-02-16 22:55 [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
                   ` (3 preceding siblings ...)
  2025-02-16 22:55 ` [PATCH 4/4] riscv/crc64: add Zbc optimized CRC64 functions Eric Biggers
@ 2025-02-24 18:06 ` Eric Biggers
  2025-03-02 18:56   ` Björn Töpel
  4 siblings, 1 reply; 12+ messages in thread
From: Eric Biggers @ 2025-02-24 18:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-crypto, linux-riscv, Zhihang Shao, Ard Biesheuvel,
	Xiao Wang, Charlie Jenkins

On Sun, Feb 16, 2025 at 02:55:26PM -0800, Eric Biggers wrote:
> This patchset is a replacement for
> "[PATCH v4] riscv: Optimize crct10dif with Zbc extension"
> (https://lore.kernel.org/r/20250211071101.181652-1-zhihang.shao.iscas@gmail.com/).
> It adopts the approach that I'm taking for x86 where code is shared
> among CRC variants.  It replaces the existing Zbc optimized CRC32
> functions, then adds Zbc optimized CRC-T10DIF and CRC64 functions.
> 
> This new code should be significantly faster than the current Zbc
> optimized CRC32 code and the previously proposed CRC-T10DIF code.  It
> uses "folding" instead of just Barrett reduction, and it also implements
> Barrett reduction more efficiently.
> 
> This applies to crc-next at
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next.
> It depends on other patches that are queued there for 6.15, so I plan to
> take it through there if there are no objections.
> 
> Tested with crc_kunit in QEMU (set CONFIG_CRC_KUNIT_TEST=y and
> CONFIG_CRC_BENCHMARK=y), both 32-bit and 64-bit.  I don't have real Zbc
> capable hardware to benchmark this on, but the new code should work very
> well; similar optimizations work very well on other architectures.

Any feedback on this series from the RISC-V side?

- Eric


* Re: [PATCH 0/4] RISC-V CRC optimizations
  2025-02-24 18:06 ` [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
@ 2025-03-02 18:56   ` Björn Töpel
  2025-03-02 22:04     ` Eric Biggers
  2025-03-03  6:53     ` Zhihang Shao
  0 siblings, 2 replies; 12+ messages in thread
From: Björn Töpel @ 2025-03-02 18:56 UTC (permalink / raw)
  To: Eric Biggers, linux-kernel
  Cc: linux-crypto, linux-riscv, Zhihang Shao, Ard Biesheuvel,
	Xiao Wang, Charlie Jenkins, Alexandre Ghiti

Eric!

Eric Biggers <ebiggers@kernel.org> writes:

> On Sun, Feb 16, 2025 at 02:55:26PM -0800, Eric Biggers wrote:
>> This patchset is a replacement for
>> "[PATCH v4] riscv: Optimize crct10dif with Zbc extension"
>> (https://lore.kernel.org/r/20250211071101.181652-1-zhihang.shao.iscas@gmail.com/).
>> It adopts the approach that I'm taking for x86 where code is shared
>> among CRC variants.  It replaces the existing Zbc optimized CRC32
>> functions, then adds Zbc optimized CRC-T10DIF and CRC64 functions.
>> 
>> This new code should be significantly faster than the current Zbc
>> optimized CRC32 code and the previously proposed CRC-T10DIF code.  It
>> uses "folding" instead of just Barrett reduction, and it also implements
>> Barrett reduction more efficiently.
>> 
>> This applies to crc-next at
>> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next.
>> It depends on other patches that are queued there for 6.15, so I plan to
>> take it through there if there are no objections.
>> 
>> Tested with crc_kunit in QEMU (set CONFIG_CRC_KUNIT_TEST=y and
>> CONFIG_CRC_BENCHMARK=y), both 32-bit and 64-bit.  I don't have real Zbc
>> capable hardware to benchmark this on, but the new code should work very
>> well; similar optimizations work very well on other architectures.
>
> Any feedback on this series from the RISC-V side?

I have not reviewed your series, but I did a test run on the Milk-V
Jupiter, which sports a Spacemit K1 that has Zbc.

I based the run on commit 1973160c90d7 ("Merge tag
'gpio-fixes-for-v6.14-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux"), plus your
crc-next branch (commit a0bd462f3a13 ("x86/crc: add ANNOTATE_NOENDBR to
suppress objtool warnings")) merged:

  | --- base1.txt	2025-03-02 18:31:16.169438876 +0000
  | +++ eric.txt	2025-03-02 18:35:58.683017223 +0000
  | @@ -11,7 +11,7 @@
  |      # crc16_benchmark: len=127: 153 MB/s
  |      # crc16_benchmark: len=128: 153 MB/s
  |      # crc16_benchmark: len=200: 153 MB/s
  | -    # crc16_benchmark: len=256: 153 MB/s
  | +    # crc16_benchmark: len=256: 154 MB/s
  |      # crc16_benchmark: len=511: 154 MB/s
  |      # crc16_benchmark: len=512: 154 MB/s
  |      # crc16_benchmark: len=1024: 155 MB/s
  | @@ -20,94 +20,94 @@
  |      # crc16_benchmark: len=16384: 155 MB/s
  |      ok 2 crc16_benchmark
  |      ok 3 crc_t10dif_test
  | -    # crc_t10dif_benchmark: len=1: 48 MB/s
  | -    # crc_t10dif_benchmark: len=16: 125 MB/s
  | -    # crc_t10dif_benchmark: len=64: 136 MB/s
  | -    # crc_t10dif_benchmark: len=127: 138 MB/s
  | -    # crc_t10dif_benchmark: len=128: 138 MB/s
  | -    # crc_t10dif_benchmark: len=200: 138 MB/s
  | -    # crc_t10dif_benchmark: len=256: 138 MB/s
  | -    # crc_t10dif_benchmark: len=511: 139 MB/s
  | -    # crc_t10dif_benchmark: len=512: 139 MB/s
  | -    # crc_t10dif_benchmark: len=1024: 139 MB/s
  | -    # crc_t10dif_benchmark: len=3173: 140 MB/s
  | -    # crc_t10dif_benchmark: len=4096: 140 MB/s
  | -    # crc_t10dif_benchmark: len=16384: 140 MB/s
  | +    # crc_t10dif_benchmark: len=1: 28 MB/s
  | +    # crc_t10dif_benchmark: len=16: 236 MB/s
  | +    # crc_t10dif_benchmark: len=64: 450 MB/s
  | +    # crc_t10dif_benchmark: len=127: 480 MB/s
  | +    # crc_t10dif_benchmark: len=128: 540 MB/s
  | +    # crc_t10dif_benchmark: len=200: 559 MB/s
  | +    # crc_t10dif_benchmark: len=256: 600 MB/s
  | +    # crc_t10dif_benchmark: len=511: 613 MB/s
  | +    # crc_t10dif_benchmark: len=512: 635 MB/s
  | +    # crc_t10dif_benchmark: len=1024: 654 MB/s
  | +    # crc_t10dif_benchmark: len=3173: 665 MB/s
  | +    # crc_t10dif_benchmark: len=4096: 669 MB/s
  | +    # crc_t10dif_benchmark: len=16384: 673 MB/s
  |      ok 4 crc_t10dif_benchmark
  |      ok 5 crc32_le_test
  |      # crc32_le_benchmark: len=1: 31 MB/s
  | -    # crc32_le_benchmark: len=16: 456 MB/s
  | -    # crc32_le_benchmark: len=64: 682 MB/s
  | -    # crc32_le_benchmark: len=127: 620 MB/s
  | -    # crc32_le_benchmark: len=128: 744 MB/s
  | -    # crc32_le_benchmark: len=200: 768 MB/s
  | -    # crc32_le_benchmark: len=256: 777 MB/s
  | -    # crc32_le_benchmark: len=511: 758 MB/s
  | -    # crc32_le_benchmark: len=512: 798 MB/s
  | -    # crc32_le_benchmark: len=1024: 807 MB/s
  | -    # crc32_le_benchmark: len=3173: 807 MB/s
  | -    # crc32_le_benchmark: len=4096: 814 MB/s
  | -    # crc32_le_benchmark: len=16384: 816 MB/s
  | +    # crc32_le_benchmark: len=16: 439 MB/s
  | +    # crc32_le_benchmark: len=64: 1209 MB/s
  | +    # crc32_le_benchmark: len=127: 1067 MB/s
  | +    # crc32_le_benchmark: len=128: 1616 MB/s
  | +    # crc32_le_benchmark: len=200: 1739 MB/s
  | +    # crc32_le_benchmark: len=256: 1951 MB/s
  | +    # crc32_le_benchmark: len=511: 1855 MB/s
  | +    # crc32_le_benchmark: len=512: 2174 MB/s
  | +    # crc32_le_benchmark: len=1024: 2301 MB/s
  | +    # crc32_le_benchmark: len=3173: 2347 MB/s
  | +    # crc32_le_benchmark: len=4096: 2407 MB/s
  | +    # crc32_le_benchmark: len=16384: 2440 MB/s
  |      ok 6 crc32_le_benchmark
  |      ok 7 crc32_be_test
  | -    # crc32_be_benchmark: len=1: 27 MB/s
  | -    # crc32_be_benchmark: len=16: 258 MB/s
  | -    # crc32_be_benchmark: len=64: 388 MB/s
  | -    # crc32_be_benchmark: len=127: 402 MB/s
  | -    # crc32_be_benchmark: len=128: 424 MB/s
  | -    # crc32_be_benchmark: len=200: 438 MB/s
  | -    # crc32_be_benchmark: len=256: 444 MB/s
  | -    # crc32_be_benchmark: len=511: 449 MB/s
  | -    # crc32_be_benchmark: len=512: 455 MB/s
  | -    # crc32_be_benchmark: len=1024: 461 MB/s
  | -    # crc32_be_benchmark: len=3173: 463 MB/s
  | -    # crc32_be_benchmark: len=4096: 465 MB/s
  | -    # crc32_be_benchmark: len=16384: 466 MB/s
  | +    # crc32_be_benchmark: len=1: 25 MB/s
  | +    # crc32_be_benchmark: len=16: 251 MB/s
  | +    # crc32_be_benchmark: len=64: 458 MB/s
  | +    # crc32_be_benchmark: len=127: 496 MB/s
  | +    # crc32_be_benchmark: len=128: 547 MB/s
  | +    # crc32_be_benchmark: len=200: 569 MB/s
  | +    # crc32_be_benchmark: len=256: 605 MB/s
  | +    # crc32_be_benchmark: len=511: 621 MB/s
  | +    # crc32_be_benchmark: len=512: 637 MB/s
  | +    # crc32_be_benchmark: len=1024: 657 MB/s
  | +    # crc32_be_benchmark: len=3173: 668 MB/s
  | +    # crc32_be_benchmark: len=4096: 671 MB/s
  | +    # crc32_be_benchmark: len=16384: 674 MB/s
  |      ok 8 crc32_be_benchmark
  |      ok 9 crc32c_test
  |      # crc32c_benchmark: len=1: 31 MB/s
  | -    # crc32c_benchmark: len=16: 457 MB/s
  | -    # crc32c_benchmark: len=64: 682 MB/s
  | -    # crc32c_benchmark: len=127: 620 MB/s
  | -    # crc32c_benchmark: len=128: 744 MB/s
  | -    # crc32c_benchmark: len=200: 769 MB/s
  | -    # crc32c_benchmark: len=256: 779 MB/s
  | -    # crc32c_benchmark: len=511: 758 MB/s
  | -    # crc32c_benchmark: len=512: 797 MB/s
  | -    # crc32c_benchmark: len=1024: 807 MB/s
  | -    # crc32c_benchmark: len=3173: 806 MB/s
  | -    # crc32c_benchmark: len=4096: 813 MB/s
  | -    # crc32c_benchmark: len=16384: 816 MB/s
  | +    # crc32c_benchmark: len=16: 446 MB/s
  | +    # crc32c_benchmark: len=64: 1188 MB/s
  | +    # crc32c_benchmark: len=127: 1066 MB/s
  | +    # crc32c_benchmark: len=128: 1600 MB/s
  | +    # crc32c_benchmark: len=200: 1727 MB/s
  | +    # crc32c_benchmark: len=256: 1941 MB/s
  | +    # crc32c_benchmark: len=511: 1854 MB/s
  | +    # crc32c_benchmark: len=512: 2164 MB/s
  | +    # crc32c_benchmark: len=1024: 2300 MB/s
  | +    # crc32c_benchmark: len=3173: 2345 MB/s
  | +    # crc32c_benchmark: len=4096: 2402 MB/s
  | +    # crc32c_benchmark: len=16384: 2437 MB/s
  |      ok 10 crc32c_benchmark
  |      ok 11 crc64_be_test
  | -    # crc64_be_benchmark: len=1: 64 MB/s
  | -    # crc64_be_benchmark: len=16: 144 MB/s
  | -    # crc64_be_benchmark: len=64: 154 MB/s
  | -    # crc64_be_benchmark: len=127: 156 MB/s
  | -    # crc64_be_benchmark: len=128: 156 MB/s
  | -    # crc64_be_benchmark: len=200: 156 MB/s
  | -    # crc64_be_benchmark: len=256: 156 MB/s
  | -    # crc64_be_benchmark: len=511: 157 MB/s
  | -    # crc64_be_benchmark: len=512: 157 MB/s
  | -    # crc64_be_benchmark: len=1024: 157 MB/s
  | -    # crc64_be_benchmark: len=3173: 158 MB/s
  | -    # crc64_be_benchmark: len=4096: 158 MB/s
  | -    # crc64_be_benchmark: len=16384: 158 MB/s
  | +    # crc64_be_benchmark: len=1: 29 MB/s
  | +    # crc64_be_benchmark: len=16: 264 MB/s
  | +    # crc64_be_benchmark: len=64: 476 MB/s
  | +    # crc64_be_benchmark: len=127: 499 MB/s
  | +    # crc64_be_benchmark: len=128: 558 MB/s
  | +    # crc64_be_benchmark: len=200: 576 MB/s
  | +    # crc64_be_benchmark: len=256: 611 MB/s
  | +    # crc64_be_benchmark: len=511: 621 MB/s
  | +    # crc64_be_benchmark: len=512: 638 MB/s
  | +    # crc64_be_benchmark: len=1024: 659 MB/s
  | +    # crc64_be_benchmark: len=3173: 667 MB/s
  | +    # crc64_be_benchmark: len=4096: 671 MB/s
  | +    # crc64_be_benchmark: len=16384: 674 MB/s
  |      ok 12 crc64_be_benchmark
  |      ok 13 crc64_nvme_test
  | -    # crc64_nvme_benchmark: len=1: 64 MB/s
  | -    # crc64_nvme_benchmark: len=16: 144 MB/s
  | -    # crc64_nvme_benchmark: len=64: 154 MB/s
  | -    # crc64_nvme_benchmark: len=127: 156 MB/s
  | -    # crc64_nvme_benchmark: len=128: 156 MB/s
  | -    # crc64_nvme_benchmark: len=200: 156 MB/s
  | -    # crc64_nvme_benchmark: len=256: 156 MB/s
  | -    # crc64_nvme_benchmark: len=511: 157 MB/s
  | -    # crc64_nvme_benchmark: len=512: 157 MB/s
  | -    # crc64_nvme_benchmark: len=1024: 157 MB/s
  | -    # crc64_nvme_benchmark: len=3173: 158 MB/s
  | -    # crc64_nvme_benchmark: len=4096: 158 MB/s
  | -    # crc64_nvme_benchmark: len=16384: 158 MB/s
  | +    # crc64_nvme_benchmark: len=1: 36 MB/s
  | +    # crc64_nvme_benchmark: len=16: 479 MB/s
  | +    # crc64_nvme_benchmark: len=64: 1340 MB/s
  | +    # crc64_nvme_benchmark: len=127: 1179 MB/s
  | +    # crc64_nvme_benchmark: len=128: 1766 MB/s
  | +    # crc64_nvme_benchmark: len=200: 1965 MB/s
  | +    # crc64_nvme_benchmark: len=256: 2201 MB/s
  | +    # crc64_nvme_benchmark: len=511: 2087 MB/s
  | +    # crc64_nvme_benchmark: len=512: 2464 MB/s
  | +    # crc64_nvme_benchmark: len=1024: 2331 MB/s
  | +    # crc64_nvme_benchmark: len=3173: 2673 MB/s
  | +    # crc64_nvme_benchmark: len=4096: 2745 MB/s
  | +    # crc64_nvme_benchmark: len=16384: 2782 MB/s
  |      ok 14 crc64_nvme_benchmark
  |  # crc: pass:14 fail:0 skip:0 total:14
  |  # Totals: pass:14 fail:0 skip:0 total:14

That's a significant speedup for this popular SoC, and it would be
great to get this series in for the next merge window! Thank you!

Tested-by: Björn Töpel <bjorn@rivosinc.com>


Björn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] RISC-V CRC optimizations
  2025-03-02 18:56   ` Björn Töpel
@ 2025-03-02 22:04     ` Eric Biggers
  2025-03-08 12:58       ` Ignacio Encinas Rubio
  2025-03-10 12:44       ` Alexandre Ghiti
  2025-03-03  6:53     ` Zhihang Shao
  1 sibling, 2 replies; 12+ messages in thread
From: Eric Biggers @ 2025-03-02 22:04 UTC (permalink / raw)
  To: Björn Töpel, Palmer Dabbelt
  Cc: linux-kernel, linux-crypto, linux-riscv, Zhihang Shao,
	Ard Biesheuvel, Xiao Wang, Charlie Jenkins, Alexandre Ghiti

On Sun, Mar 02, 2025 at 07:56:46PM +0100, Björn Töpel wrote:
> Eric!
> 
> Eric Biggers <ebiggers@kernel.org> writes:
> 
> > On Sun, Feb 16, 2025 at 02:55:26PM -0800, Eric Biggers wrote:
> >> This patchset is a replacement for
> >> "[PATCH v4] riscv: Optimize crct10dif with Zbc extension"
> >> (https://lore.kernel.org/r/20250211071101.181652-1-zhihang.shao.iscas@gmail.com/).
> >> It adopts the approach that I'm taking for x86 where code is shared
> >> among CRC variants.  It replaces the existing Zbc optimized CRC32
> >> functions, then adds Zbc optimized CRC-T10DIF and CRC64 functions.
> >> 
> >> This new code should be significantly faster than the current Zbc
> >> optimized CRC32 code and the previously proposed CRC-T10DIF code.  It
> >> uses "folding" instead of just Barrett reduction, and it also implements
> >> Barrett reduction more efficiently.
> >> 
> >> This applies to crc-next at
> >> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next.
> >> It depends on other patches that are queued there for 6.15, so I plan to
> >> take it through there if there are no objections.
> >> 
> >> Tested with crc_kunit in QEMU (set CONFIG_CRC_KUNIT_TEST=y and
> >> CONFIG_CRC_BENCHMARK=y), both 32-bit and 64-bit.  I don't have real Zbc
> >> capable hardware to benchmark this on, but the new code should work very
> >> well; similar optimizations work very well on other architectures.
> >
> > Any feedback on this series from the RISC-V side?
> 
> I have not reviewed your series, but I did a test run on the Milk-V
> Jupiter, which sports a Spacemit K1 that has Zbc.
> 
> I based the run on commit 1973160c90d7 ("Merge tag
> 'gpio-fixes-for-v6.14-rc5' of
> git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux"), plus your
> crc-next branch (commit a0bd462f3a13 ("x86/crc: add ANNOTATE_NOENDBR to
> suppress objtool warnings")) merged:
> 
>   | [... benchmark diff snipped; identical to Björn's message above ...]
> 
> That's a significant speedup for this popular SoC, and it would be
> great to get this series in for the next merge window! Thank you!
> 
> Tested-by: Björn Töpel <bjorn@rivosinc.com>

Thanks for testing this patchset!  So to summarize, on long messages the results
were roughly:

    lsb-first CRCs (crc32_le, crc32c, crc64_nvme):
        Generic table-based code:             158 MB/s
        Old Zbc-optimized code (crc32* only): 816 MB/s
        New Zbc-optimized code:               2440 MB/s

    msb-first CRCs (crc_t10dif, crc32_be, crc64_be):
        Generic table-based code:             158 MB/s
        Old Zbc-optimized code (crc32* only): 466 MB/s
        New Zbc-optimized code:               674 MB/s

So, quite positive results.  Though, the fact the msb-first CRCs are (still) so
much slower than lsb-first ones indicates that be64_to_cpu() is super slow on
RISC-V.  That seems to be caused by the rev8 instruction from Zbb not being
used.  I wonder if there are any plans to make the endianness swap macros use
rev8, or if I'm going to have to roll my own endianness swap in the CRC code.
(I assume it would be fine for the CRC code to depend on both Zbb and Zbc.)
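
For reference, rolling my own in the CRC code would look something like
the following.  Note this is just an untested sketch, assuming RV64 and
a runtime Zbb check via riscv_has_extension_likely(); crc_load_be64()
is a made-up helper name, not anything that exists yet:

    #include <linux/swab.h>
    #include <linux/unaligned.h>
    #include <asm/cpufeature.h>

    /* Load a 64-bit big-endian word, byte-swapping with rev8 when the
     * CPU implements Zbb; otherwise fall back to the generic swab. */
    static inline u64 crc_load_be64(const void *p)
    {
            u64 v = get_unaligned((const u64 *)p);

            if (riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
                    asm(".option push\n"
                        ".option arch,+zbb\n"
                        "rev8 %0, %0\n"
                        ".option pop\n"
                        : "+r" (v));
                    return v;
            }
            return __swab64(v);
    }

But it would of course be nicer if be64_to_cpu() itself just did the
right thing, so that every user benefits.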

Anyway, I've applied this series to the crc tree
(https://web.git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next).

Palmer, I'd appreciate your ack though!

- Eric

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] RISC-V CRC optimizations
  2025-03-02 18:56   ` Björn Töpel
  2025-03-02 22:04     ` Eric Biggers
@ 2025-03-03  6:53     ` Zhihang Shao
  1 sibling, 0 replies; 12+ messages in thread
From: Zhihang Shao @ 2025-03-03  6:53 UTC (permalink / raw)
  To: Björn Töpel, Eric Biggers, linux-kernel
  Cc: linux-crypto, linux-riscv, Ard Biesheuvel, Xiao Wang,
	Charlie Jenkins, Alexandre Ghiti

On 2025/3/3 2:56, Björn Töpel wrote:

> Eric!
>
> Eric Biggers <ebiggers@kernel.org> writes:
>
>> On Sun, Feb 16, 2025 at 02:55:26PM -0800, Eric Biggers wrote:
>>> This patchset is a replacement for
>>> "[PATCH v4] riscv: Optimize crct10dif with Zbc extension"
>>> (https://lore.kernel.org/r/20250211071101.181652-1-zhihang.shao.iscas@gmail.com/).
>>> It adopts the approach that I'm taking for x86 where code is shared
>>> among CRC variants.  It replaces the existing Zbc optimized CRC32
>>> functions, then adds Zbc optimized CRC-T10DIF and CRC64 functions.
>>>
>>> This new code should be significantly faster than the current Zbc
>>> optimized CRC32 code and the previously proposed CRC-T10DIF code.  It
>>> uses "folding" instead of just Barrett reduction, and it also implements
>>> Barrett reduction more efficiently.
>>>
>>> This applies to crc-next at
>>> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next.
>>> It depends on other patches that are queued there for 6.15, so I plan to
>>> take it through there if there are no objections.
>>>
>>> Tested with crc_kunit in QEMU (set CONFIG_CRC_KUNIT_TEST=y and
>>> CONFIG_CRC_BENCHMARK=y), both 32-bit and 64-bit.  I don't have real Zbc
>>> capable hardware to benchmark this on, but the new code should work very
>>> well; similar optimizations work very well on other architectures.
>> Any feedback on this series from the RISC-V side?
> I have not reviewed your series, but I did a test run on the Milk-V
> Jupiter, which sports a Spacemit K1 that has Zbc.
>
> I based the run on commit 1973160c90d7 ("Merge tag
> 'gpio-fixes-for-v6.14-rc5' of
> git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux"), plus your
> crc-next branch (commit a0bd462f3a13 ("x86/crc: add ANNOTATE_NOENDBR to
> suppress objtool warnings")) merged:
>
>    | [... benchmark diff snipped; identical to Björn's message above ...]
>
> That's a significant speedup for this popular SoC, and it would be
> great to get this series in for the next merge window! Thank you!
>
> Tested-by: Björn Töpel <bjorn@rivosinc.com>
>
>
> Björn

I am happy to see that crc-t10dif achieves remarkable performance
improvements with the Zbc extension optimizations. It has been a great
honor to contribute to such impactful work.

Zhihang


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] RISC-V CRC optimizations
  2025-03-02 22:04     ` Eric Biggers
@ 2025-03-08 12:58       ` Ignacio Encinas Rubio
  2025-03-10 12:34         ` Alexandre Ghiti
  2025-03-10 12:44       ` Alexandre Ghiti
  1 sibling, 1 reply; 12+ messages in thread
From: Ignacio Encinas Rubio @ 2025-03-08 12:58 UTC (permalink / raw)
  To: Eric Biggers, Björn Töpel, Palmer Dabbelt
  Cc: linux-kernel-mentees, linux-kernel, linux-crypto, linux-riscv,
	Zhihang Shao, Ard Biesheuvel, Xiao Wang, Charlie Jenkins,
	Alexandre Ghiti, skhan

Hello!

On 2/3/25 23:04, Eric Biggers wrote:
> So, quite positive results.  Though, the fact the msb-first CRCs are (still) so
> much slower than lsb-first ones indicates that be64_to_cpu() is super slow on
> RISC-V.  That seems to be caused by the rev8 instruction from Zbb not being
> used.  I wonder if there are any plans to make the endianness swap macros use
> rev8, or if I'm going to have to roll my own endianness swap in the CRC code.
> (I assume it would be fine for the CRC code to depend on both Zbb and Zbc.)

I saw this message the other day and started working on a patch, but I
would like to double-check I'm on the right track:

- be64_to_cpu ends up being __swab64 (include/uapi/linux/swab.h) 

If Zbb were part of the base ISA, turning on CONFIG_ARCH_USE_BUILTIN_BSWAP
would take care of the problem, but that is not the case.

Therefore, we have to define __arch_swab<X> as some architectures do in
arch/<ARCH>/include/uapi/asm/swab.h.

For those functions to be correct in generic kernels, we would need to
use ALTERNATIVE() macros as in arch/riscv/include/asm/bitops.h.
Would this be OK? I'm not sure whether the overhead of the
ALTERNATIVEs could be a problem here.
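
To make it concrete, here is roughly what I'm picturing, modeled on the
asm goto pattern from arch/riscv/include/asm/bitops.h (plus the usual
asm/alternative-macros.h and asm/hwcap.h includes).  This is an
untested sketch only, and it sidesteps the question of whether a uapi
header can use ALTERNATIVE() at all:

    static __always_inline __u64 __arch_swab64(__u64 value)
    {
            asm goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
                                 RISCV_ISA_EXT_ZBB, 1)
                     : : : : legacy);

            /* The jump above is patched to a nop when Zbb is detected
             * at boot, so we fall through to rev8 here. */
            asm(".option push\n"
                ".option arch,+zbb\n"
                "rev8 %0, %0\n"
                ".option pop\n"
                : "+r" (value));
            return value;

    legacy:
            return ___constant_swab64(value);
    }
    #define __arch_swab64 __arch_swab64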

Thanks in advance :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] RISC-V CRC optimizations
  2025-03-08 12:58       ` Ignacio Encinas Rubio
@ 2025-03-10 12:34         ` Alexandre Ghiti
  0 siblings, 0 replies; 12+ messages in thread
From: Alexandre Ghiti @ 2025-03-10 12:34 UTC (permalink / raw)
  To: Ignacio Encinas Rubio, Eric Biggers, Björn Töpel,
	Palmer Dabbelt
  Cc: linux-kernel-mentees, linux-kernel, linux-crypto, linux-riscv,
	Zhihang Shao, Ard Biesheuvel, Xiao Wang, Charlie Jenkins,
	Alexandre Ghiti, skhan

Hi Ignacio,

On 08/03/2025 13:58, Ignacio Encinas Rubio wrote:
> Hello!
>
> On 2/3/25 23:04, Eric Biggers wrote:
>> So, quite positive results.  Though, the fact the msb-first CRCs are (still) so
>> much slower than lsb-first ones indicates that be64_to_cpu() is super slow on
>> RISC-V.  That seems to be caused by the rev8 instruction from Zbb not being
>> used.  I wonder if there are any plans to make the endianness swap macros use
>> rev8, or if I'm going to have to roll my own endianness swap in the CRC code.
>> (I assume it would be fine for the CRC code to depend on both Zbb and Zbc.)
> I saw this message the other day and started working on a patch, but I
> would like to double-check I'm on the right track:
>
> - be64_to_cpu ends up being __swab64 (include/uapi/linux/swab.h)
>
> If Zbb were part of the base ISA, turning on CONFIG_ARCH_USE_BUILTIN_BSWAP
> would take care of the problem, but that is not the case.
>
> Therefore, we have to define __arch_swab<X> as some architectures do in
> arch/<ARCH>/include/uapi/asm/swab.h.
>
> For those functions to be correct in generic kernels, we would need to
> use ALTERNATIVE() macros as in arch/riscv/include/asm/bitops.h.
> Would this be OK? I'm not sure whether the overhead of the
> ALTERNATIVEs could be a problem here.


Yes, using alternatives here is the right way to go. And the only 
overhead when Zbb is available would be a nop (take a look at lib/csum.c).

Thanks for working on this, looking forward to your patch,

Alex


> Thanks in advance :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] RISC-V CRC optimizations
  2025-03-02 22:04     ` Eric Biggers
  2025-03-08 12:58       ` Ignacio Encinas Rubio
@ 2025-03-10 12:44       ` Alexandre Ghiti
  1 sibling, 0 replies; 12+ messages in thread
From: Alexandre Ghiti @ 2025-03-10 12:44 UTC (permalink / raw)
  To: Eric Biggers, Björn Töpel, Palmer Dabbelt
  Cc: linux-kernel, linux-crypto, linux-riscv, Zhihang Shao,
	Ard Biesheuvel, Xiao Wang, Charlie Jenkins, Alexandre Ghiti

Hi Eric,

On 02/03/2025 23:04, Eric Biggers wrote:
> On Sun, Mar 02, 2025 at 07:56:46PM +0100, Björn Töpel wrote:
>> Eric!
>>
>> Eric Biggers <ebiggers@kernel.org> writes:
>>
>>> On Sun, Feb 16, 2025 at 02:55:26PM -0800, Eric Biggers wrote:
>>>> This patchset is a replacement for
>>>> "[PATCH v4] riscv: Optimize crct10dif with Zbc extension"
>>>> (https://lore.kernel.org/r/20250211071101.181652-1-zhihang.shao.iscas@gmail.com/).
>>>> It adopts the approach that I'm taking for x86 where code is shared
>>>> among CRC variants.  It replaces the existing Zbc optimized CRC32
>>>> functions, then adds Zbc optimized CRC-T10DIF and CRC64 functions.
>>>>
>>>> This new code should be significantly faster than the current Zbc
>>>> optimized CRC32 code and the previously proposed CRC-T10DIF code.  It
>>>> uses "folding" instead of just Barrett reduction, and it also implements
>>>> Barrett reduction more efficiently.
>>>>
>>>> This applies to crc-next at
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next.
>>>> It depends on other patches that are queued there for 6.15, so I plan to
>>>> take it through there if there are no objections.
>>>>
>>>> Tested with crc_kunit in QEMU (set CONFIG_CRC_KUNIT_TEST=y and
>>>> CONFIG_CRC_BENCHMARK=y), both 32-bit and 64-bit.  I don't have real Zbc
>>>> capable hardware to benchmark this on, but the new code should work very
>>>> well; similar optimizations work very well on other architectures.
>>> Any feedback on this series from the RISC-V side?
>> I have not reviewed your series, but I did a test run on the Milk-V
>> Jupiter, which sports a Spacemit K1 that has Zbc.
>>
>> I based the run on commit 1973160c90d7 ("Merge tag
>> 'gpio-fixes-for-v6.14-rc5' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux"), plus your
>> crc-next branch (commit a0bd462f3a13 ("x86/crc: add ANNOTATE_NOENDBR to
>> suppress objtool warnings")) merged:
>>
>>    | [... benchmark diff snipped; identical to Björn's message above ...]
>>
>> That's a significant speedup for this popular SoC, and it would be
>> great to get this series in for the next merge window! Thank you!
>>
>> Tested-by: Björn Töpel <bjorn@rivosinc.com>
> Thanks for testing this patchset!  So to summarize, on long messages the results
> were roughly:
>
>      lsb-first CRCs (crc32_le, crc32c, crc64_nvme):
>          Generic table-based code:             158 MB/s
>          Old Zbc-optimized code (crc32* only): 816 MB/s
>          New Zbc-optimized code:               2440 MB/s
>
>      msb-first CRCs (crc_t10dif, crc32_be, crc64_be):
>          Generic table-based code:             158 MB/s
>          Old Zbc-optimized code (crc32* only): 466 MB/s
>          New Zbc-optimized code:               674 MB/s
>
> So, quite positive results.  Though, the fact the msb-first CRCs are (still) so
> much slower than lsb-first ones indicates that be64_to_cpu() is super slow on
> RISC-V.  That seems to be caused by the rev8 instruction from Zbb not being
> used.  I wonder if there are any plans to make the endianness swap macros use
> rev8, or if I'm going to have to roll my own endianness swap in the CRC code.
> (I assume it would be fine for the CRC code to depend on both Zbb and Zbc.)
>
> Anyway, I've applied this series to the crc tree
> (https://web.git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-next).
>
> Palmer, I'd appreciate your ack though!


Very hard to review, but given Björn's tests on a Zbc platform, you can add:

Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com>

Thank you very much for working on this and for uncovering the issue with the 
beXX_to_cpu() macros!

Alex


>
> - Eric

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-03-10 12:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-16 22:55 [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
2025-02-16 22:55 ` [PATCH 1/4] riscv/crc: add "template" for Zbc optimized CRC functions Eric Biggers
2025-02-16 22:55 ` [PATCH 2/4] riscv/crc32: reimplement the CRC32 functions using new template Eric Biggers
2025-02-16 22:55 ` [PATCH 3/4] riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function Eric Biggers
2025-02-16 22:55 ` [PATCH 4/4] riscv/crc64: add Zbc optimized CRC64 functions Eric Biggers
2025-02-24 18:06 ` [PATCH 0/4] RISC-V CRC optimizations Eric Biggers
2025-03-02 18:56   ` Björn Töpel
2025-03-02 22:04     ` Eric Biggers
2025-03-08 12:58       ` Ignacio Encinas Rubio
2025-03-10 12:34         ` Alexandre Ghiti
2025-03-10 12:44       ` Alexandre Ghiti
2025-03-03  6:53     ` Zhihang Shao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).