Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()

Linux cryptographic layer development
 help / color / mirror / Atom feed

From: Eric Biggers <ebiggers@kernel.org>
To: David Laight <david.laight.linux@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
	x86@kernel.org, Andrea Mazzoleni <amadvance@gmail.com>
Subject: Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
Date: Fri, 12 Jun 2026 17:27:04 -0700	[thread overview]
Message-ID: <20260613002704.GA11922@sol> (raw)
In-Reply-To: <20260612100432.1f1c8c7a@pumpkin>

On Fri, Jun 12, 2026 at 10:04:32AM +0100, David Laight wrote:
> On Fri, 12 Jun 2026 07:22:47 +0200
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > > Add an implementation of xor_gen() using AVX-512.  
> > 
> > > Benchmark on AMD Ryzen 9 9950X (Zen 5):  
> > 
> > Can you share the benchmark?
> > 
> > In my local tree I have ports of the AVX2 and AVX512 implementations
> > from snapraid (https://github.com/amadvance/snapraid), which in userspace
> > give really good performance.  On my Laptop with a AMD Ryzen AI 7 PRO 350
> > (which is a Zen5 with the slower double pumped AVX512 unit), both of
> > them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> > holding them back as I don't have a good kernel benchmarking harness,
> > and it's missing the quirks for old AVX512 or the newer AMD special
> > cases.
> 
> From my experiments on Intel cpu (and I don't remember the zen-5 being
> that different - but I've done less testing on it) you don't need to
> unroll loops very much at all.
> 
> A reasonable model seems to be that the uops generated by the instruction
> decoder get executed when all the prerequisite registers and the required
> execution unit are available.
> So for a memory copy (and the xor is basically a copy) the control loop
> can run way ahead of the read/write instructions.
> This means you can get the control loop 'for free' and unrolling further
> makes no/little difference.
> 
> Each xor is two memory reads and one memory write.
> The cpu I was using could only do one write/clock - so you can only do one
> xor each clock. I think some of the newer ones can to two writes/clock but
> I'm not sure how many reads/clock they can do - might still be 2, don't
> think it s 4.
> So you should be able to get one xor per clock, but I doubt you'll get two
> (and possibly not even 1.3 - which would require 4 memory accesses per clock).
> 
> The best loop construct is the one that uses negative offsets from the
> end of the buffers, basically:
> 	buf += len;
> 	offset = -len;
> 	do
> 		f(buf[offset]);
> 	while (offset += size);
> that reduces the loop control to just an 'add' and 'jnz' (which can
> get merged into a single u-op).
> 
> The cpu have enough execution units to execute two memory reads,
> a memory write, an xor the add and jnz every clock.
> So even the 'rolled up' loop might run at one xor per clock.
> While I think I got a 'one clock loop' on my zen-5 (testing
> word-at-a-time strlen) I only managed a two clock loop on the newest
> Intel cpu I've got (which isn't that new).
> So put two xor in the loop and it shouldn't be limited by the loop
> control, but will be limited by the memory accesses instead.
> 
> Further unrolling shouldn't help and may make things worse.
> The Intel cpu have logic to directly forward the result of an
> ALU instruction into the next few instructions, but after that you can
> get a stall because of the 'round trip' via the register file.
> So part way down an unrolled nn(%reg) sequence you can get a stall.
> An extra 'add $0,%reg' in the middle of the unrolled loop will
> 'refresh' the register and speed things up.
> (I hit that with a loop that needed a rather more complicated control
> structure.)
> 
> You definitely need to use the pmc clock counter and data dependencies
> against the rdpmc instruction to get sensible performance figures.
> The can reasonably reliably measure down to less than 20 clocks.

The version at the end of this email is what you're suggesting, I think.
On Sapphire Rapids and Ryzen 9 9950X it's about the same speed as mine,
just a few percent slower on Sapphire with src_cnt == 1.

So we could use it.  It's just a bit fragile since it assumes the loop
overhead and indexed addressing will never be a bottleneck on any
current or future CPU.  Unrolling by more gives something more robust
that "just works", without having to analyze whether the loops are okay
on each CPU model individually based on microarchitectural details.

// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * AVX-512 optimized implementation of xor_gen()
 *
 * Copyright 2026 Google LLC
 */

#include <linux/compiler.h>
#include <linux/types.h>
#include <linux/unroll.h>
#include <asm/fpu/api.h>
#include "xor_impl.h"
#include "xor_arch.h"

static void xor_avx512_2(long bytes, u8 *p0, const u8 *p1)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	unrolled_none
	do {
		/* unroll by 2x to reduce loop overhead */
		asm volatile("vmovdqa64 (%2,%0), %%zmm0\n"
			     "vmovdqa64 64(%2,%0), %%zmm1\n"
			     "vpxorq (%2,%1), %%zmm0, %%zmm0\n"
			     "vpxorq 64(%2,%1), %%zmm1, %%zmm1\n"
			     "vmovdqa64 %%zmm0, (%2,%0)\n"
			     "vmovdqa64 %%zmm1, 64(%2,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(i)
			     : "memory");
	} while ((i += 128) != 0);
}

static void xor_avx512_3(long bytes, u8 *p0, const u8 *p1, const u8 *p2)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	unrolled_none
	do {
		/* unroll by 2x to reduce loop overhead */
		asm volatile("vmovdqa64 (%3,%0), %%zmm0\n"
			     "vmovdqa64 64(%3,%0), %%zmm1\n"
			     "vmovdqa64 (%3,%1), %%zmm2\n"
			     "vmovdqa64 64(%3,%1), %%zmm3\n"
			     "vpternlogq $0x96, (%3,%2), %%zmm2, %%zmm0\n"
			     "vpternlogq $0x96, 64(%3,%2), %%zmm3, %%zmm1\n"
			     "vmovdqa64 %%zmm0, (%3,%0)\n"
			     "vmovdqa64 %%zmm1, 64(%3,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(i)
			     : "memory");
	} while ((i += 128) != 0);
}

static void xor_avx512_4(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
			 const u8 *p3)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	p3 += bytes;
	unrolled_none
	do {
		asm volatile("vmovdqa64 (%4,%0), %%zmm0\n"
			     "vmovdqa64 (%4,%1), %%zmm1\n"
			     "vpxorq (%4,%2), %%zmm0, %%zmm0\n"
			     "vpternlogq $0x96, (%4,%3), %%zmm1, %%zmm0\n"
			     "vmovdqa64 %%zmm0, (%4,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(i)
			     : "memory");
	} while ((i += 64) != 0);
}

static void xor_avx512_5(long bytes, u8 *p0, const u8 *p1, const u8 *p2,
			 const u8 *p3, const u8 *p4)
{
	long i = -bytes; /* Use negative indexing to minimize loop overhead. */

	p0 += bytes;
	p1 += bytes;
	p2 += bytes;
	p3 += bytes;
	p4 += bytes;
	unrolled_none
	do {
		asm volatile("vmovdqa64 (%5,%0), %%zmm0\n"
			     "vmovdqa64 (%5,%1), %%zmm1\n"
			     "vpternlogq $0x96, (%5,%2), %%zmm1, %%zmm0\n"
			     "vmovdqa64 (%5,%3), %%zmm1\n"
			     "vpternlogq $0x96, (%5,%4), %%zmm1, %%zmm0\n"
			     "vmovdqa64 %%zmm0, (%5,%0)\n"
			     :
			     : "r"(p0), "r"(p1), "r"(p2), "r"(p3), "r"(p4),
			       "r"(i)
			     : "memory");
	} while ((i += 64) != 0);
}

DO_XOR_BLOCKS(avx512_inner, xor_avx512_2, xor_avx512_3, xor_avx512_4,
	      xor_avx512_5);

/*
 * Preconditions: bytes is a nonzero multiple of 512, and all buffers are
 * 64-byte aligned.
 */
static void xor_gen_avx512(void *dest, void **srcs, unsigned int src_cnt,
			   unsigned int bytes)
{
	kernel_fpu_begin();
	xor_gen_avx512_inner(dest, srcs, src_cnt, bytes);
	kernel_fpu_end();
}

struct xor_block_template xor_block_avx512 = {
	.name = "avx512",
	.xor_gen = xor_gen_avx512,
};

next prev parent reply	other threads:[~2026-06-13  0:28 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12  4:40 [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen() Eric Biggers
2026-06-12  5:22 ` Christoph Hellwig
2026-06-12  5:59   ` Eric Biggers
2026-06-12  9:04   ` David Laight
2026-06-13  0:27     ` Eric Biggers [this message]
2026-06-13  8:48       ` David Laight

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260613002704.GA11922@sol \
    --to=ebiggers@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=amadvance@gmail.com \
    --cc=david.laight.linux@gmail.com \
    --cc=hch@lst.de \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox