Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()

From: David Laight <david.laight.linux@gmail.com>
To: Christoph Hellwig <hch@lst.de>
Cc: Eric Biggers <ebiggers@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
	x86@kernel.org, Andrea Mazzoleni <amadvance@gmail.com>
Subject: Re: [PATCH] lib/raid/xor: x86: Add AVX-512 optimized xor_gen()
Date: Fri, 12 Jun 2026 10:04:32 +0100
Message-ID: <20260612100432.1f1c8c7a@pumpkin>
In-Reply-To: <20260612052247.GA8848@lst.de>

On Fri, 12 Jun 2026 07:22:47 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Jun 11, 2026 at 09:40:34PM -0700, Eric Biggers wrote:
> > Add an implementation of xor_gen() using AVX-512.  
> 
> > Benchmark on AMD Ryzen 9 9950X (Zen 5):  
> 
> Can you share the benchmark?
> 
> In my local tree I have ports of the AVX2 and AVX512 implementations
> from snapraid (https://github.com/amadvance/snapraid), which in userspace
> give really good performance.  On my Laptop with a AMD Ryzen AI 7 PRO 350
> (which is a Zen5 with the slower double pumped AVX512 unit), both of
> them get over 1GB/s throughput on the snapraid benchmarks.  I've been
> holding them back as I don't have a good kernel benchmarking harness,
> and it's missing the quirks for old AVX512 or the newer AMD special
> cases.

From my experiments on Intel cpus (and I don't remember the zen-5 being
that different, but I've done less testing on it) you don't need to
unroll loops very much at all.

A reasonable model seems to be that the uops generated by the instruction
decoder get executed when all the prerequisite registers and the required
execution unit are available.
So for a memory copy (and the xor is basically a copy) the control loop
can run way ahead of the read/write instructions.
This means you can get the control loop 'for free' and unrolling further
makes no/little difference.

Each xor is two memory reads and one memory write.
The cpu I was using could only do one write/clock - so you can only do one
xor each clock. I think some of the newer ones can do two writes/clock but
I'm not sure how many reads/clock they can do - might still be 2, don't
think it's 4.
So you should be able to get one xor per clock, but I doubt you'll get two
(and possibly not even 1.3 - which would require 4 memory accesses per clock).
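
To put numbers on that, here is a back-of-envelope bound written as C. It is
only a sketch: max_xor_per_clock() is a made-up name, and the per-clock
load/store limits passed to it are assumptions about a cpu, not measured
values.

```c
#include <assert.h>

/*
 * Upper bound on xors per clock from memory-port limits alone.
 * Each xor costs two loads and one store.
 * All three limits are assumed inputs, not figures for any real cpu.
 */
static double max_xor_per_clock(double loads_per_clk, double stores_per_clk,
				double accesses_per_clk)
{
	double bound = loads_per_clk / 2.0;	/* two loads per xor */

	assert(loads_per_clk >= 0 && stores_per_clk >= 0 &&
	       accesses_per_clk >= 0);

	if (stores_per_clk < bound)		/* one store per xor */
		bound = stores_per_clk;
	if (accesses_per_clk / 3.0 < bound)	/* three accesses per xor */
		bound = accesses_per_clk / 3.0;
	return bound;
}
```

With 2 loads and 1 store per clock the bound is one xor per clock. Even with
the load and store limits out of the way, 4 accesses per clock only gets it
to about 1.33.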

The best loop construct is the one that uses negative offsets from the
end of the buffers, basically:
	buf += len;
	offset = -len;
	do
		f(buf[offset]);
	while (offset += size);
that reduces the loop control to just an 'add' and 'jnz' (which can
get merged into a single u-op).
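
Spelled out as compilable C for a simple dst ^= src, that might look like
the sketch below. xor_neg_offset() is a made-up name (it is not anything in
lib/raid/xor), and it assumes len is a non-zero multiple of 8 bytes.
Whether the compiler really emits just 'add' and 'jnz' for the loop control
needs checking in the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * xor len bytes of src into dst, indexing backwards from the end of
 * both buffers.  Hypothetical helper for illustration only.
 */
static void xor_neg_offset(uint64_t *dst, const uint64_t *src, size_t len)
{
	size_t n = len / sizeof(uint64_t);
	ptrdiff_t i;

	assert(n != 0 && len % sizeof(uint64_t) == 0);

	/* Point at the end of the buffers; the index runs from -n up to 0. */
	dst += n;
	src += n;
	i = -(ptrdiff_t)n;

	do {
		dst[i] ^= src[i];
	} while (++i);	/* the loop ends when the index reaches zero */
}
```

Because the index counts up to zero, the loop test is just "is it zero
yet", so no separate compare against an end pointer is needed.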

The cpus have enough execution units to execute two memory reads,
a memory write, an xor, and the add and jnz every clock.
So even the 'rolled up' loop might run at one xor per clock.
While I think I got a 'one clock loop' on my zen-5 (testing
word-at-a-time strlen) I only managed a two clock loop on the newest
Intel cpu I've got (which isn't that new).
So put two xor in the loop and it shouldn't be limited by the loop
control, but will be limited by the memory accesses instead.
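
A sketch of that 'two xor in the loop' shape in C, using the negative
offset form. xor_neg_offset_x2() is a made-up name, and it assumes len is a
non-zero multiple of 16 bytes. A compiler is free to unroll or vectorise
this differently, so the generated code still needs checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * xor len bytes of src into dst, two 64-bit words per iteration.
 * Hypothetical helper for illustration only.
 */
static void xor_neg_offset_x2(uint64_t *dst, const uint64_t *src, size_t len)
{
	size_t n = len / sizeof(uint64_t);
	ptrdiff_t i;

	assert(n != 0 && n % 2 == 0 && len % sizeof(uint64_t) == 0);

	/* Index backwards from the end so the loop test is "reached zero". */
	dst += n;
	src += n;
	i = -(ptrdiff_t)n;

	do {
		dst[i] ^= src[i];
		dst[i + 1] ^= src[i + 1];
	} while (i += 2);
}
```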

Further unrolling shouldn't help and may make things worse.
The Intel cpus have logic to directly forward the result of an
ALU instruction into the next few instructions, but after that you can
get a stall because of the 'round trip' via the register file.
So part way down an unrolled nn(%reg) sequence you can get a stall.
An extra 'add $0,%reg' in the middle of the unrolled loop will
'refresh' the register and speed things up.
(I hit that with a loop that needed a rather more complicated control
structure.)

You definitely need to use the pmc clock counter and data dependencies
against the rdpmc instruction to get sensible performance figures.
That can reasonably reliably measure down to less than 20 clocks.

	David

