Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()

From: Eric Biggers <ebiggers@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
	x86@kernel.org, Andrea Mazzoleni <amadvance@gmail.com>
Subject: Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
Date: Thu, 11 Jun 2026 22:59:33 -0700
Message-ID: <20260612055933.GA6675@sol>
In-Reply-To: <20260612052247.GA8848@lst.de>

On Fri, Jun 12, 2026 at 07:22:47AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > Add an implementation of xor_gen() using AVX-512.
> 
> > Benchmark on AMD Ryzen 9 9950X (Zen 5):
> 
> Can you share the benchmark?

For now I had just hacked up do_xor_speed() as follows and changed
xor_force() to xor_register().  There should be a benchmark added to the
KUnit test similar to the one in the crypto and CRC tests, though.

diff --git a/lib/raid/xor/xor-core.c b/lib/raid/xor/xor-core.c
index bd4e6e434418..8c5814af03d5 100644
--- a/lib/raid/xor/xor-core.c
+++ b/lib/raid/xor/xor-core.c
@@ -76,15 +76,24 @@ void __init xor_force(struct xor_block_template *tmpl)
 #define REPS		800U
 
 static void __init
-do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
+do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2,
+	     void *b3, void *b4, void *b5)
 {
+	for (int src_cnt = 1; src_cnt <= 4; src_cnt++) {
 	int speed;
 	unsigned long reps;
 	ktime_t min, start, t0;
-	void *srcs[1] = { b2 };
+	void *srcs[4] = { b2, b3, b4, b5 };
 
 	preempt_disable();
 
+	/* warm-up */
+	for (int i = 0; i < 8000; i++) {
+		mb(); /* prevent loop optimization */
+		tmpl->xor_gen(b1, srcs, src_cnt, BENCH_SIZE);
+		mb();
+	}
+
 	reps = 0;
 	t0 = ktime_get();
 	/* delay start until time has advanced */
@@ -92,7 +101,7 @@ do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 		cpu_relax();
 	do {
 		mb(); /* prevent loop optimization */
-		tmpl->xor_gen(b1, srcs, 1, BENCH_SIZE);
+		tmpl->xor_gen(b1, srcs, src_cnt, BENCH_SIZE);
 		mb();
 	} while (reps++ < REPS || (t0 = ktime_get()) == start);
 	min = ktime_sub(t0, start);
@@ -105,26 +114,30 @@ do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 
 	pr_info("   %-16s: %5d MB/sec\n", tmpl->name, speed);
 }
+}
 
 static int __init calibrate_xor_blocks(void)
 {
-	void *b1, *b2;
+	void *b1, *b2, *b3, *b4, *b5;
 	struct xor_block_template *f, *fastest;
 
 	if (forced_template)
 		return 0;
 
-	b1 = (void *) __get_free_pages(GFP_KERNEL, 2);
+	b1 = (void *) __get_free_pages(GFP_KERNEL, 4);
 	if (!b1) {
 		pr_warn("xor: Yikes!  No memory available.\n");
 		return -ENOMEM;
 	}
 	b2 = b1 + 2*PAGE_SIZE + BENCH_SIZE;
+	b3 = b2 + 2*PAGE_SIZE + BENCH_SIZE;
+	b4 = b3 + 2*PAGE_SIZE + BENCH_SIZE;
+	b5 = b4 + 2*PAGE_SIZE + BENCH_SIZE;
 
 	pr_info("xor: measuring software checksum speed\n");
 	fastest = template_list;
 	for (f = template_list; f; f = f->next) {
-		do_xor_speed(f, b1, b2);
+		do_xor_speed(f, b1, b2, b3, b4, b5);
 		if (f->speed > fastest->speed)
 			fastest = f;
 	}
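
A rough userspace analogue of the measurement loop above, as a minimal
sketch only: it is not the kernel code and not the KUnit benchmark
mentioned earlier.  The names xor_gen_fn, xor_gen_scalar(),
bench_xor_gen() and COMPILER_BARRIER() are hypothetical, the barrier
assumes GCC or Clang, and the MB/s figure counts destination bytes per
call, which is an assumption about how to report throughput.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_SRCS 4
/* Assumption: GCC/Clang inline asm; keeps calls from being optimized away. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Hypothetical signature, modeled on the tmpl->xor_gen() call in the diff. */
typedef void (*xor_gen_fn)(void *dest, void **srcs, unsigned int src_cnt,
			   size_t bytes);

/* Plain scalar reference: dest ^= srcs[0] ^ ... ^ srcs[src_cnt - 1]. */
static void xor_gen_scalar(void *dest, void **srcs, unsigned int src_cnt,
			   size_t bytes)
{
	unsigned char *d = dest;

	for (unsigned int s = 0; s < src_cnt; s++) {
		const unsigned char *p = srcs[s];

		for (size_t i = 0; i < bytes; i++)
			d[i] ^= p[i];
	}
}

/* Returns MB/s of destination data, or -1.0 on bad arguments or failure. */
static double bench_xor_gen(xor_gen_fn fn, unsigned int src_cnt,
			    size_t bytes, unsigned int reps)
{
	void *bufs[MAX_SRCS + 1] = { NULL };	/* bufs[0] is the destination */
	struct timespec t0, t1;
	double elapsed, mbps = -1.0;

	assert(fn != NULL);
	if (src_cnt < 1 || src_cnt > MAX_SRCS || bytes == 0 || reps == 0)
		return -1.0;

	for (unsigned int i = 0; i <= MAX_SRCS; i++) {
		bufs[i] = malloc(bytes);
		if (!bufs[i])
			goto out;
		memset(bufs[i], 0x11 * (int)(i + 1), bytes);
	}

	/* Warm-up pass before timing, as in the diff above. */
	for (unsigned int i = 0; i < reps; i++) {
		fn(bufs[0], &bufs[1], src_cnt, bytes);
		COMPILER_BARRIER();
	}

	timespec_get(&t0, TIME_UTC);
	for (unsigned int i = 0; i < reps; i++) {
		fn(bufs[0], &bufs[1], src_cnt, bytes);
		COMPILER_BARRIER();
	}
	timespec_get(&t1, TIME_UTC);

	elapsed = (double)(t1.tv_sec - t0.tv_sec) +
		  (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
	if (elapsed > 0.0)
		mbps = (double)bytes * reps / elapsed / 1e6;
out:
	for (unsigned int i = 0; i <= MAX_SRCS; i++)
		free(bufs[i]);
	return mbps;
}
```

Calling bench_xor_gen(xor_gen_scalar, n, 4096, 2000) for n = 1..4 gives
one number per source count, the same shape as the src_cnt loop in the
hack.  Userspace wall-clock timing is much noisier than a
preempt-disabled kernel loop, so the absolute numbers are not comparable.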

> In my local tree I have ports of the AVX2 and AVX512 implementations
> from snapraid (https://github.com/amadvance/snapraid), which in userspace
> give really good performance.  On my laptop with an AMD Ryzen AI 7 PRO 350
> (which is a Zen5 with the slower double pumped AVX512 unit), both of
> them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> holding them back as I don't have a good kernel benchmarking harness,
> and it's missing the quirks for old AVX512 or the newer AMD special
> cases.
> 
> Attached for reference.
> 
> Note that either way I'd prefer if we could get away from the strange
> old code organization with the DO{1-4} helpers which don't really
> help.

Well, doing the same on your avx512bw version and adding a column to my
table for it (by the way, I think it really just needs avx512f), I get:

        src_cnt    avx          avx512       avx512bw
        =======    ==========   ==========   ==========
        1          68423 MB/s   81940 MB/s   12067 MB/s
        2          56035 MB/s   74112 MB/s   10958 MB/s
        3          49396 MB/s   67011 MB/s    8608 MB/s
        4          43056 MB/s   60823 MB/s    8069 MB/s

So, your version isn't great, I'm afraid.  Making the inner loop be over
src_cnt does simplify the code a lot, but it destroys performance, since
it turns into 9 instructions for each 64 bytes of every 3 buffers:

      5b:   89 c1                   mov    %eax,%ecx
      5d:   8d 70 01                lea    0x1(%rax),%esi
      60:   48 8b 0c cb             mov    (%rbx,%rcx,8),%rcx
      64:   48 8b 34 f3             mov    (%rbx,%rsi,8),%rsi
      68:   62 f1 fd 48 6f 0c 11    vmovdqa64 (%rcx,%rdx,1),%zmm1
      6f:   62 f3 f5 48 25 04 16    vpternlogq $0x96,(%rsi,%rdx,1),%zmm1,%zmm0
      76:   96 
      77:   83 c0 02                add    $0x2,%eax
      7a:   39 f8                   cmp    %edi,%eax
      7c:   72 dd                   jb     5b <xor_gen_avx512bw+0x4b>

You could try unrolling by 512 bytes, which should help.
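
To illustrate the loop shapes being contrasted, here is a minimal
portable C sketch.  It is not taken from either patch: struct vec512,
vec_xor(), xor_gen_per_vector() and xor_gen_unrolled512() are
hypothetical names, and plain 64-bit words stand in for ZMM registers,
so it shows only the loop structure, not the vpternlogq folding or real
AVX-512 code generation.  The first function walks the source array for
every 64 bytes; the second keeps eight 64-byte accumulators per 512-byte
block, so each source pointer is fetched once per block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES	64	/* one 512-bit register's worth */
#define BLOCK_BYTES	512	/* unroll granularity */
#define VECS_PER_BLOCK	(BLOCK_BYTES / VEC_BYTES)

/* Hypothetical stand-in for a 512-bit vector register. */
struct vec512 {
	uint64_t w[VEC_BYTES / sizeof(uint64_t)];
};

static inline void vec_xor(struct vec512 *acc, const struct vec512 *v)
{
	for (size_t i = 0; i < VEC_BYTES / sizeof(uint64_t); i++)
		acc->w[i] ^= v->w[i];
}

/* Source loop innermost: srcs[] is re-read for every 64 bytes. */
static void xor_gen_per_vector(void *dest, void **srcs, unsigned int src_cnt,
			       size_t bytes)
{
	assert(bytes % VEC_BYTES == 0);
	for (size_t off = 0; off < bytes; off += VEC_BYTES) {
		struct vec512 acc, v;

		memcpy(&acc, (char *)dest + off, VEC_BYTES);
		for (unsigned int s = 0; s < src_cnt; s++) {
			memcpy(&v, (const char *)srcs[s] + off, VEC_BYTES);
			vec_xor(&acc, &v);
		}
		memcpy((char *)dest + off, &acc, VEC_BYTES);
	}
}

/* Unrolled by 512 bytes: srcs[s] is read once per block, not per vector. */
static void xor_gen_unrolled512(void *dest, void **srcs, unsigned int src_cnt,
				size_t bytes)
{
	assert(bytes % BLOCK_BYTES == 0);
	for (size_t off = 0; off < bytes; off += BLOCK_BYTES) {
		struct vec512 acc[VECS_PER_BLOCK], v;

		memcpy(acc, (char *)dest + off, BLOCK_BYTES);
		for (unsigned int s = 0; s < src_cnt; s++) {
			const char *p = (const char *)srcs[s] + off;

			for (int i = 0; i < VECS_PER_BLOCK; i++) {
				memcpy(&v, p + i * VEC_BYTES, VEC_BYTES);
				vec_xor(&acc[i], &v);
			}
		}
		memcpy((char *)dest + off, acc, BLOCK_BYTES);
	}
}
```

Both functions compute the same result for any length that is a multiple
of 512 bytes.  Whether the second shape actually recovers the lost
throughput would have to be measured with real AVX-512 code.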

- Eric

Thread overview: 6+ messages
2026-06-12  4:40 [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen() Eric Biggers
2026-06-12  5:22 ` Christoph Hellwig
2026-06-12  5:59   ` Eric Biggers [this message]
2026-06-12  9:04   ` David Laight
2026-06-13  0:27     ` Eric Biggers
2026-06-13  8:48       ` David Laight
