From: Baokun Li <libaokun@linux.alibaba.com>
To: linux-ext4@vger.kernel.org
Cc: linux-crypto@vger.kernel.org, ebiggers@kernel.org,
	ardb@kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca,
	jack@suse.cz, yi.zhang@huawei.com, ojaswin@linux.ibm.com,
	ritesh.list@gmail.com, Baokun Li <libaokun@linux.alibaba.com>
Subject: [PATCH RFC 01/17] lib/crc: add crc32c_flip_range() for incremental CRC update
Date: Fri,  8 May 2026 20:15:23 +0800	[thread overview]
Message-ID: <20260508121539.4174601-2-libaokun@linux.alibaba.com> (raw)
In-Reply-To: <20260508121539.4174601-1-libaokun@linux.alibaba.com>

When a contiguous range of bits in a buffer is flipped, the CRC32c
checksum can be updated incrementally without re-scanning the entire
buffer, by exploiting the linearity of CRCs over GF(2):

  New_CRC = Old_CRC ^ CRC(flip_mask << trailing_bits)

Introduce crc32c_flip_range() which computes this delta using
precomputed GF(2) shift matrices and nibble-indexed lookup tables.
The implementation decomposes nbits and trailing_bits into
power-of-2 components and combines them via the CRC concatenation
property:

  CRC(A || B) = shift(CRC(A), len(B)) ^ CRC(B)

This gives O(log N) complexity with only ~9.8KB of static tables,
which fit comfortably in the L1 cache.  The maximum supported buffer
size is currently 64KB (INCR_MAX_ORDER = 19, i.e. 2^19 = 524288 bits).

This is useful for filesystems like ext4, where bitmap updates
involve flipping a contiguous range of bits, and recalculating
the full CRC after every update is wasteful.

Benchmark results on Intel Xeon (Ice Lake) with CRC32c hardware
acceleration:

  bitmap (bytes): 1024  2048  4096  8192  16384  32768  65536
  flip (ns):        48    53    57    63     68     73     78
  full (ns):        45    88   182   357    709   1421   2853
  speedup:        0.9x  1.6x  3.1x  5.6x  10.3x  19.3x  36.3x

Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
 include/linux/crc32.h           |  25 ++++++
 lib/crc/.gitignore              |   2 +
 lib/crc/Makefile                |  13 ++-
 lib/crc/crc32c-incr.c           | 140 +++++++++++++++++++++++++++++++
 lib/crc/gen_crc32c_incr_table.c | 141 ++++++++++++++++++++++++++++++++
 5 files changed, 318 insertions(+), 3 deletions(-)
 create mode 100644 lib/crc/crc32c-incr.c
 create mode 100644 lib/crc/gen_crc32c_incr_table.c

diff --git a/include/linux/crc32.h b/include/linux/crc32.h
index da78b215ff2e..034f73f0f5dc 100644
--- a/include/linux/crc32.h
+++ b/include/linux/crc32.h
@@ -81,6 +81,31 @@ u32 crc32_be(u32 crc, const void *p, size_t len);
  */
 u32 crc32c(u32 crc, const void *p, size_t len);
 
+/**
+ * crc32c_flip_range - Update CRC32c after flipping a range of bits
+ * @old_crc:    Existing CRC32c value of the buffer (pre-flip).
+ * @total_bits: Total size of the buffer in bits (e.g., 524288 for 64KB).
+ * @bit_off:    Starting bit offset of the modified range.
+ * @nbits:      Length of the flipped bit sequence.
+ *
+ * This function calculates the new CRC32c value when a contiguous range of
+ * bits is flipped (XORed with 1s) without re-scanning the entire buffer.
+ * It leverages the linearity of CRCs in Galois Field GF(2):
+ *
+ * New_CRC = Old_CRC ^ CRC(Mask_of_Ones << Trailing_Bits)
+ *
+ * The complexity is O(log nbits + log trailing_bits), making it
+ * significantly faster than recomputing the CRC for large buffers.
+ *
+ * Note: @total_bits must not exceed 524288 (2^19 bits = 64KB).  Callers
+ * must ensure that @bit_off + @nbits <= @total_bits.  Behavior is
+ * undefined if these constraints are violated.
+ *
+ * Return: The updated CRC32c value.
+ */
+u32 crc32c_flip_range(u32 old_crc, u32 total_bits,
+		      u32 bit_off, u32 nbits);
+
 /*
  * crc32_optimizations() returns flags that indicate which CRC32 library
  * functions are using architecture-specific optimizations.  Unlike
diff --git a/lib/crc/.gitignore b/lib/crc/.gitignore
index a9e48103c9fb..4e2b9524426d 100644
--- a/lib/crc/.gitignore
+++ b/lib/crc/.gitignore
@@ -1,5 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 /crc32table.h
+/crc32c-incr-table.h
 /crc64table.h
 /gen_crc32table
+/gen_crc32c_incr_table
 /gen_crc64table
diff --git a/lib/crc/Makefile b/lib/crc/Makefile
index ff213590e4e3..2c255ac029d0 100644
--- a/lib/crc/Makefile
+++ b/lib/crc/Makefile
@@ -21,7 +21,7 @@ crc-t10dif-$(CONFIG_X86) += x86/crc16-msb-pclmul.o
 endif
 
 obj-$(CONFIG_CRC32) += crc32.o
-crc32-y := crc32-main.o
+crc32-y := crc32-main.o crc32c-incr.o
 ifeq ($(CONFIG_CRC32_ARCH),y)
 CFLAGS_crc32-main.o += -I$(src)/$(SRCARCH)
 crc32-$(CONFIG_ARM) += arm/crc32-core.o
@@ -49,20 +49,27 @@ endif # CONFIG_CRC64_ARCH
 
 obj-y += tests/
 
-hostprogs := gen_crc32table gen_crc64table
-clean-files := crc32table.h crc64table.h
+hostprogs := gen_crc32table gen_crc32c_incr_table gen_crc64table
+clean-files := crc32table.h crc32c-incr-table.h crc64table.h
 
 $(obj)/crc32-main.o: $(obj)/crc32table.h
+$(obj)/crc32c-incr.o: $(obj)/crc32c-incr-table.h
 $(obj)/crc64-main.o: $(obj)/crc64table.h
 
 quiet_cmd_crc32 = GEN     $@
       cmd_crc32 = $< > $@
 
+quiet_cmd_crc32c_incr = GEN     $@
+      cmd_crc32c_incr = $< > $@
+
 quiet_cmd_crc64 = GEN     $@
       cmd_crc64 = $< > $@
 
 $(obj)/crc32table.h: $(obj)/gen_crc32table
 	$(call cmd,crc32)
 
+$(obj)/crc32c-incr-table.h: $(obj)/gen_crc32c_incr_table
+	$(call cmd,crc32c_incr)
+
 $(obj)/crc64table.h: $(obj)/gen_crc64table
 	$(call cmd,crc64)
diff --git a/lib/crc/crc32c-incr.c b/lib/crc/crc32c-incr.c
new file mode 100644
index 000000000000..b6258231cc0d
--- /dev/null
+++ b/lib/crc/crc32c-incr.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * GF(2) matrix-based CRC32c incremental update.
+ *
+ * When a contiguous range of bits is flipped, the new CRC can be
+ * derived from the old one without re-scanning the buffer:
+ *   New_CRC = Old_CRC ^ CRC(flip_mask << trailing_bits)
+ *
+ * The delta CRC is computed by decomposing num_bits and trailing_bits
+ * into power-of-2 components and combining them via the CRC
+ * concatenation property, giving O(log N) complexity.
+ *
+ * Memory usage: ~9.8KB
+ * - crc32c_incr_nibble_table: 19 * 8 * 16 * 4 = 9728 bytes
+ * - crc32c_incr_ones_lookup:  20 * 4          = 80   bytes
+ *
+ * Tables are generated at compile time by gen_crc32c_incr_table.
+ * INCR_MAX_ORDER 19 supports up to 64KB buffers (2^19 bits).
+ *
+ * Copyright (C) 2026 Alibaba Inc.
+ */
+
+#include <linux/bitops.h>
+#include <linux/bug.h>
+#include <linux/export.h>
+#include <linux/crc32.h>
+
+#include "crc32c-incr-table.h"
+
+#define INCR_MAX_ORDER		19
+
+/**
+ * gf2_xform - Multiply a CRC state vector by a GF(2) shift matrix
+ * @order: Selects the precomputed matrix M^(2^order).
+ * @v: The 32-bit CRC state vector.
+ *
+ * Computes v * M^(2^order) using nibble (4-bit) indexed tables,
+ * reducing the operation from 32 bit-level iterations to 8 lookups.
+ */
+static inline u32 gf2_xform(int order, u32 v)
+{
+	const u32 (*tab)[16] = crc32c_incr_nibble_table[order];
+
+	return tab[0][v & 0xf] ^
+	       tab[1][(v >> 4) & 0xf] ^
+	       tab[2][(v >> 8) & 0xf] ^
+	       tab[3][(v >> 12) & 0xf] ^
+	       tab[4][(v >> 16) & 0xf] ^
+	       tab[5][(v >> 20) & 0xf] ^
+	       tab[6][(v >> 24) & 0xf] ^
+	       tab[7][(v >> 28) & 0xf];
+}
+
+/**
+ * crc32c_incr_get_ones_delta - Compute CRC of an all-ones bit sequence
+ * @num_bits: Length of the all-ones sequence.
+ *
+ * Returns CRC(0, [111...1] of length num_bits).  Decomposes num_bits
+ * into powers of 2 (MSB-first) and combines using:
+ *   CRC(A || B) = shift(CRC(A), len(B)) ^ CRC(B)
+ *
+ * This requires only (popcount - 1) gf2_xform calls, each doing
+ * 8 table lookups.
+ *
+ * Caller must ensure num_bits <= (1UL << INCR_MAX_ORDER).
+ */
+static u32 crc32c_incr_get_ones_delta(size_t num_bits)
+{
+	u32 delta;
+	int n;
+
+	if (!num_bits)
+		return 0;
+
+	/* Initialize with the highest power-of-2 block */
+	n = __fls(num_bits);
+	delta = crc32c_incr_ones_lookup[n];
+	num_bits ^= (1UL << n);
+
+	/* Concatenate remaining blocks from high to low */
+	while (num_bits) {
+		n = __fls(num_bits);
+		delta = gf2_xform(n, delta);
+		delta ^= crc32c_incr_ones_lookup[n];
+		num_bits ^= (1UL << n);
+	}
+	return delta;
+}
+
+/**
+ * gf2_shift_crc - Shift a CRC state by @trailing_bits zero-bit positions
+ * @crc: The CRC state vector.
+ * @trailing_bits: Number of zero bits to shift through.
+ *
+ * Equivalent to appending @trailing_bits zero bits to the data stream
+ * and continuing the CRC computation.  Decomposes trailing_bits into
+ * powers of 2 and applies the corresponding precomputed matrices.
+ */
+static u32 gf2_shift_crc(u32 crc, size_t trailing_bits)
+{
+	int n;
+
+	for (n = 0; trailing_bits > 0 && n < INCR_MAX_ORDER; n++) {
+		if (trailing_bits & 1)
+			crc = gf2_xform(n, crc);
+		trailing_bits >>= 1;
+	}
+	return crc;
+}
+
+/* See full kernel-doc in include/linux/crc32.h */
+u32 crc32c_flip_range(u32 old_crc, u32 total_bits,
+		      u32 bit_off, u32 nbits)
+{
+	u32 delta, trailing_bits;
+
+	if (!nbits)
+		return old_crc;
+
+	/*
+	 * total_bits must not exceed 2^INCR_MAX_ORDER bits (64KB).
+	 * bit_off + nbits must not exceed total_bits.
+	 */
+	if (WARN_ON_ONCE(total_bits > (1UL << INCR_MAX_ORDER)))
+		return old_crc;
+	if (WARN_ON_ONCE(bit_off + nbits > total_bits))
+		return old_crc;
+
+	trailing_bits = total_bits - (bit_off + nbits);
+
+	/* 1. Calculate CRC of the flip-mask (all 1s of length nbits) */
+	delta = crc32c_incr_get_ones_delta(nbits);
+
+	/* 2. Shift the mask-CRC to the correct bit position */
+	delta = gf2_shift_crc(delta, trailing_bits);
+
+	/* 3. Apply the delta to the existing CRC */
+	return old_crc ^ delta;
+}
+EXPORT_SYMBOL(crc32c_flip_range);
diff --git a/lib/crc/gen_crc32c_incr_table.c b/lib/crc/gen_crc32c_incr_table.c
new file mode 100644
index 000000000000..f906506282cc
--- /dev/null
+++ b/lib/crc/gen_crc32c_incr_table.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generate GF(2) nibble-based lookup tables for incremental CRC32c updates.
+ * MAX_ORDER 19 supports up to 64KB buffers (2^19 bits = 524288 bits).
+ *
+ * Instead of storing raw 32x32 bit matrices (32 rows per order),
+ * we precompute nibble (4-bit) indexed tables.  This reduces gf2_xform
+ * to 8 table lookups instead of 32 branchless mask-and-XOR iterations.
+ *
+ * Memory layout:
+ * - crc32c_incr_nibble_table[19][8][16]: 19 * 8 * 16 * 4 = 9728 bytes
+ * - crc32c_incr_ones_lookup[20]:         20 * 4          = 80   bytes
+ * Total: ~9.8KB (fits comfortably in L1 cache)
+ *
+ * Copyright (C) 2026 Alibaba Inc.
+ */
+
+#include <stdio.h>
+#include <inttypes.h>
+
+#include "../../include/linux/crc32poly.h"
+
+#define CRC32C_INCR_MAX_ORDER	19
+#define NIBBLES_PER_U32		8
+
+static uint32_t bit_matrix[CRC32C_INCR_MAX_ORDER][32];
+static uint32_t nibble_table[CRC32C_INCR_MAX_ORDER][NIBBLES_PER_U32][16];
+static uint32_t ones_lookup[CRC32C_INCR_MAX_ORDER + 1];
+
+static void crc32c_incr_init(void)
+{
+	int n, i, k, v;
+
+	/*
+	 * Step 1: Build the order-0 matrix M, where M[i] is the CRC
+	 * state after shifting basis vector e_i by one bit position.
+	 */
+	for (i = 0; i < 32; i++) {
+		uint32_t r = 1U << i;
+
+		bit_matrix[0][i] = (r & 1) ?
+			(r >> 1) ^ CRC32C_POLY_LE : (r >> 1);
+	}
+
+	/* Step 2: M^(2^n) = (M^(2^(n-1)))^2 via matrix squaring */
+	for (n = 1; n < CRC32C_INCR_MAX_ORDER; n++) {
+		for (i = 0; i < 32; i++) {
+			uint32_t r = bit_matrix[n - 1][i];
+			uint32_t res = 0;
+
+			for (k = 0; k < 32; k++) {
+				if (r & (1U << k))
+					res ^= bit_matrix[n - 1][k];
+			}
+			bit_matrix[n][i] = res;
+		}
+	}
+
+	/* Step 3: Convert bit matrices to nibble-indexed lookup tables */
+	for (n = 0; n < CRC32C_INCR_MAX_ORDER; n++) {
+		for (i = 0; i < NIBBLES_PER_U32; i++) {
+			nibble_table[n][i][0] = 0;
+			for (v = 1; v < 16; v++) {
+				uint32_t res = 0;
+
+				for (k = 0; k < 4; k++) {
+					if (v & (1 << k))
+						res ^= bit_matrix[n][i * 4 + k];
+				}
+				nibble_table[n][i][v] = res;
+			}
+		}
+	}
+
+	/*
+	 * Step 4: ones_lookup[n] = CRC(0, all-ones of 2^n bits).
+	 * Uses CRC(A||B) = shift(CRC(A), len(B)) ^ CRC(B) to double
+	 * the length at each step.  ones_lookup[0] = CRC of a single
+	 * 1-bit, which equals the generator polynomial.
+	 */
+	ones_lookup[0] = CRC32C_POLY_LE;
+
+	for (n = 1; n <= CRC32C_INCR_MAX_ORDER; n++) {
+		uint32_t low = ones_lookup[n - 1];
+		uint32_t high = 0;
+
+		for (k = 0; k < 32; k++) {
+			if (low & (1U << k))
+				high ^= bit_matrix[n - 1][k];
+		}
+		ones_lookup[n] = low ^ high;
+	}
+}
+
+int main(int argc, char **argv)
+{
+	int n, i, v;
+
+	crc32c_incr_init();
+
+	printf("/* this file is generated - do not edit */\n\n");
+
+	printf("static const u32 crc32c_incr_nibble_table[%d][%d][16] = {\n",
+	       CRC32C_INCR_MAX_ORDER, NIBBLES_PER_U32);
+	for (n = 0; n < CRC32C_INCR_MAX_ORDER; n++) {
+		printf("\t{\n");
+		for (i = 0; i < NIBBLES_PER_U32; i++) {
+			printf("\t\t{\n");
+			for (v = 0; v < 16; v += 4) {
+				printf("\t\t\t0x%08x, 0x%08x, 0x%08x, 0x%08x,\n",
+				       nibble_table[n][i][v],
+				       nibble_table[n][i][v + 1],
+				       nibble_table[n][i][v + 2],
+				       nibble_table[n][i][v + 3]);
+			}
+			printf("\t\t},\n");
+		}
+		printf("\t},\n");
+	}
+	printf("};\n\n");
+
+	printf("static const u32 crc32c_incr_ones_lookup[%d] = {\n",
+	       CRC32C_INCR_MAX_ORDER + 1);
+	for (n = 0; n <= CRC32C_INCR_MAX_ORDER; n += 4) {
+		int remaining = CRC32C_INCR_MAX_ORDER + 1 - n;
+
+		if (remaining >= 4) {
+			printf("\t0x%08x, 0x%08x, 0x%08x, 0x%08x,\n",
+			       ones_lookup[n], ones_lookup[n + 1],
+			       ones_lookup[n + 2], ones_lookup[n + 3]);
+		} else {
+			printf("\t");
+			for (i = 0; i < remaining; i++)
+				printf("0x%08x, ", ones_lookup[n + i]);
+			printf("\n");
+		}
+	}
+	printf("};\n");
+
+	return 0;
+}
-- 
2.43.7


Thread overview: 26+ messages
2026-05-08 12:15 [PATCH RFC 00/17] ext4/lib-crc: LBS performance part 1 - incremental CRC32c for bitmap checksums Baokun Li
2026-05-08 12:15 ` Baokun Li [this message]
     [not found]   ` <20260508204019.9E5A5C2BCB0@smtp.kernel.org>
2026-05-10  9:44     ` [PATCH RFC 01/17] lib/crc: add crc32c_flip_range() for incremental CRC update Baokun Li
2026-05-14  3:52   ` Eric Biggers
2026-05-08 12:15 ` [PATCH RFC 02/17] lib/crc: crc_kunit: add kunit test for crc32c_flip_range() Baokun Li
2026-05-08 12:15 ` [PATCH RFC 03/17] lib/crc: crc_kunit: add benchmark " Baokun Li
     [not found]   ` <20260508205415.8B843C2BCB0@smtp.kernel.org>
2026-05-10 10:03     ` Baokun Li
2026-05-08 12:15 ` [PATCH RFC 04/17] ext4: fix incorrect block bitmap free clusters update on metadata overlap Baokun Li
     [not found]   ` <20260508211732.E50B4C2BCB0@smtp.kernel.org>
2026-05-11  6:17     ` Baokun Li
2026-05-08 12:15 ` [PATCH RFC 05/17] ext4: extract block bitmap checksum get and store helpers Baokun Li
2026-05-08 12:15 ` [PATCH RFC 06/17] ext4: add ext4_block_bitmap_csum_set_range() for incremental checksum update Baokun Li
     [not found]   ` <20260508214640.B3A74C2BCB0@smtp.kernel.org>
2026-05-11  8:09     ` Baokun Li
2026-05-11  8:31     ` Baokun Li
2026-05-08 12:15 ` [PATCH RFC 07/17] ext4: use fast incremental CRC update in ext4_mb_mark_context() Baokun Li
     [not found]   ` <20260508223130.20E7AC2BCB0@smtp.kernel.org>
2026-05-11  8:15     ` Baokun Li
2026-05-08 12:15 ` [PATCH RFC 08/17] ext4: extract inode bitmap checksum get and store helpers Baokun Li
2026-05-08 12:15 ` [PATCH RFC 09/17] ext4: add ext4_inode_bitmap_csum_set_fast() for incremental checksum update Baokun Li
     [not found]   ` <20260508225807.71D9FC2BCB0@smtp.kernel.org>
2026-05-11  8:35     ` Baokun Li
2026-05-08 12:15 ` [PATCH RFC 10/17] ext4: use fast incremental CRC update in ext4_free_inode() Baokun Li
2026-05-08 12:15 ` [PATCH RFC 11/17] ext4: fix missing bg_used_dirs_count update in fast commit replay Baokun Li
2026-05-08 12:15 ` [PATCH RFC 12/17] ext4: factor out ext4_might_init_block_bitmap() helper Baokun Li
2026-05-08 12:15 ` [PATCH RFC 13/17] ext4: use fast incremental CRC update in ext4_mark_inode_used() Baokun Li
2026-05-08 12:15 ` [PATCH RFC 14/17] ext4: rename ino to bit in __ext4_new_inode() Baokun Li
2026-05-08 12:15 ` [PATCH RFC 15/17] ext4: use fast incremental CRC update " Baokun Li
2026-05-08 12:15 ` [PATCH RFC 16/17] ext4: extract ext4_update_inode_group_desc() to reduce duplication Baokun Li
2026-05-08 12:15 ` [PATCH RFC 17/17] ext4: add ext4_get_flex_group() helper to simplify flex group lookups Baokun Li
