Re: [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and excessive unrolling

Linux cryptographic layer development

From: Eric Biggers <ebiggers@kernel.org>
To: David Laight <David.Laight@aculab.com>
Cc: "linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ard Biesheuvel <ardb@kernel.org>,
	Josh Poimboeuf <jpoimboe@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and excessive unrolling
Date: Mon, 14 Oct 2024 16:59:17 -0700
Message-ID: <20241014235917.GA1114@sol.localdomain>
In-Reply-To: <00c9c7c84e9043689942fc1f36e28591@AcuMS.aculab.com>

On Mon, Oct 14, 2024 at 10:32:48PM +0000, David Laight wrote:
> ...
> > > Do you need to unroll it at all?
> 
> > It looks like on most CPUs, no.  On Haswell, Emerald Rapids, Zen 2 it does not
> > make a significant difference.  However, it helps on Zen 5.
> 
> I wonder if one of the loop instructions is using the ALU
> unit you really want to be processing a crc32?
> If the cpu has fused arithmetic+jump u-ops then trying to get the
> decoder to use one of those may help.
> 
> Is Zen 5 actually slower than the other systems?
> I've managed to get clock cycle counts using the performance counters
> that more or less match the predicted values.
> You can't use 'rdtsc' because the cpu frequency isn't stable.

No, Zen 5 is faster than the other CPUs.  I looked more into what was happening,
and it turns out it's actually executing more than 3 crc32q in parallel on
average, by overlapping the execution of different calls to crc_pcl().  If I
chain the CRC values, that goes away and the 4x unrolling no longer helps.

Of course, whether users are chaining the CRC values or not is up to the user.
A user might be checksumming lots of small messages, or they might be
checksumming a large message in smaller pieces.
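
To make the two usage patterns concrete, here is a minimal C sketch. It is not
the kernel code: crc32c_update() is a hypothetical bitwise stand-in for
crc_pcl(), and crc_chained(), crc_independent(), and the 'piece' parameter are
names invented for illustration.  In the chained case each call consumes the
previous call's result, so calls cannot overlap; in the independent case every
call starts from a fresh seed, so an out-of-order CPU is free to overlap them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C update (reflected polynomial 0x82F63B78), no inversion.
 * Hypothetical stand-in for the crc32q-based crc_pcl(). */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	}
	return crc;
}

/* One large message checksummed in pieces: each call depends on the
 * CRC returned by the previous call. */
static uint32_t crc_chained(const uint8_t *buf, size_t len, size_t piece)
{
	uint32_t crc = ~0u;

	assert(piece > 0);
	for (size_t off = 0; off < len; off += piece) {
		size_t n = len - off < piece ? len - off : piece;

		crc = crc32c_update(crc, buf + off, n);
	}
	return ~crc;
}

/* Many small messages: every call starts from the same seed, so there is
 * no data dependency between calls. */
static void crc_independent(const uint8_t *buf, size_t nmsgs, size_t msglen,
			    uint32_t *out)
{
	for (size_t i = 0; i < nmsgs; i++)
		out[i] = ~crc32c_update(~0u, buf + i * msglen, msglen);
}
```

Both patterns produce correct checksums; they differ only in whether
consecutive calls form one long dependency chain.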

I do think the 4x unrolling is probably worth keeping as future-proofing, since
it reduces dependence on microarchitectural details.  It's quite modest
compared to the 128x unrolling that was used before...

> ...
> > > If you are really lucky you'll get two memory reads/clock.
> > > So you won't ever do more than two crc32/clock.
> > > Looking at Agner's instruction latency tables I don't think
> > > any cpu can do more than one per clock, or pipeline them.
> > > I think that means you don't even need two (never mind 3)
> > > buffers.
> > 
> > On most Intel and AMD CPUs (I tested Haswell for old Intel, Emerald Rapids for
> > new Intel, and Zen 2 for slightly-old AMD), crc32q has 3 cycle latency and 1 per
> > cycle throughput.  So you do need at least 3 streams.
> 
> Bah, I missed the latency column :-)
> 
> > AMD Zen 5 has much higher crc32q throughput and seems to want up to 7 streams.
> > This is not implemented yet.
> 
> The copy of the tables I have is old - doesn't contain Zen-5.
> Does that mean that 2 (or more) of its alu 'units' can do crc32
> so you can do more than 1/clock (along with the memory reads).

That's correct.  It seems that 3 ALUs on Zen 5 can do crc32.
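
As a rough illustration of why the stream count follows from latency times
throughput, here is a portable C sketch of the 3-stream structure.  It is an
assumption-laden model, not the actual assembly: it works a byte at a time
instead of using crc32q on 8 bytes, and crc32c_shift() does by brute force what
the real code does with PCLMULQDQ and precomputed constants.  All function
names are hypothetical.  The point is only that the three accumulators in the
loop are independent of each other, so three 3-cycle-latency operations can be
in flight at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u	/* reflected Castagnoli polynomial */

/* Advance the raw CRC state by one byte (no init/final inversion). */
static uint32_t crc32c_byte(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (CRC32C_POLY & -(crc & 1));
	return crc;
}

/* Advance the state over n zero bytes (slow model of the carryless-multiply
 * "shift" used to combine streams). */
static uint32_t crc32c_shift(uint32_t crc, size_t n)
{
	while (n--)
		crc = crc32c_byte(crc, 0);
	return crc;
}

/* Single-stream reference: one long dependency chain. */
static uint32_t crc32c_serial(uint32_t crc, const uint8_t *p, size_t len)
{
	for (size_t i = 0; i < len; i++)
		crc = crc32c_byte(crc, p[i]);
	return crc;
}

/* Three independent dependency chains over three equal-sized chunks. */
static uint32_t crc32c_3way(uint32_t crc, const uint8_t *p, size_t len)
{
	size_t n = len / 3;
	uint32_t c0 = crc, c1 = 0, c2 = 0;

	assert(len % 3 == 0);
	for (size_t i = 0; i < n; i++) {
		c0 = crc32c_byte(c0, p[i]);
		c1 = crc32c_byte(c1, p[n + i]);
		c2 = crc32c_byte(c2, p[2 * n + i]);
	}
	/* The update is linear over GF(2), so the chunk CRCs combine as
	 * shift(c0, 2n) ^ shift(c1, n) ^ c2. */
	return crc32c_shift(c0, 2 * n) ^ crc32c_shift(c1, n) ^ c2;
}
```

With more ALUs able to execute crc32, the same structure would simply use more
accumulators, at the cost of more combining work at the end.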

> One thought is how much of it is actually worthwhile!
> If the data isn't already in the L1 data cache then the cache
> loads almost certainly dominate - especially if you have to
> go out to 'real memory'.
> You can benchmark the loops by repeatedly accessing the same
> data - but that isn't what will actually happen.
> 

Well, data is rarely checksummed on its own but rather immediately before using
it or right after generating it.  In those cases it needs to be pulled into L1
cache, or has already been pulled into L1 cache, anyway.

> > > Most modern x86 can do 4 or 5 (or even more) ALU operations
> > > per clock - depending on the combination of instructions.
> > >
> > > Replace the loop termination with a comparison of 'bufp'
> > > against a pre-calculated limit and you get two instructions
> > > (that might get merged into one u-op) for the loop overhead.
> > > They'll run in parallel with the crc32q instructions.
> > 
> > That's actually still three instructions: add, cmp, and jne.
> 
> I was really thinking of the loop I quoted later.
> The one that uses negative offsets from the end of the buffer.
> That has an 'add' and a 'jnz' - which might even fuse into a
> single u-op.
> Maybe even constrained to p6 - so won't go near p1.
> (I don't have a recent AMD cpu)
>
> It may not actually matter.
> The add/subtract/cmp are only dependent on themselves.
> Similarly the jne is only dependent on the result of the sub/cmp.
> In principle they can all run in the same clock (for different
> loop cycles) since the rest of the loop only needs one of the
> ALU blocks (on Intel only P1 can do crc).
> But I failed to get a 1 clock loop (using ADC - which doesn't
> have a latency issue).
> It might be impossible because a predicted-taken conditional jmp
> has a latency of 2.

Yes, it's an interesting idea.  There would need to be a separate bufend pointer
for each chunk set up.
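
A minimal C sketch of that loop shape, under the same caveats as before: this
is a byte-at-a-time model with hypothetical names (crc32c_byte(),
crc32c_3chunks_negidx()), not the proposed assembly.  Each chunk is addressed
through its own end pointer plus one shared negative index that counts up to
zero, so the only loop overhead is the index increment and the branch on its
result.  Whether a compiler actually emits a fused add+jnz for this is not
guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Advance the raw CRC32C state by one byte (reflected poly 0x82F63B78). */
static uint32_t crc32c_byte(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	return crc;
}

/* Update three per-chunk CRC states over three adjacent n-byte chunks.
 * Combining the three results is left out of this sketch. */
static void crc32c_3chunks_negidx(const uint8_t *p, size_t n, uint32_t crc[3])
{
	/* One end pointer per chunk, set up once before the loop. */
	const uint8_t *end0 = p + n;
	const uint8_t *end1 = p + 2 * n;
	const uint8_t *end2 = p + 3 * n;

	assert(n <= PTRDIFF_MAX);
	/* The index runs from -n up to 0; the loop exits when it hits zero. */
	for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++) {
		crc[0] = crc32c_byte(crc[0], end0[i]);
		crc[1] = crc32c_byte(crc[1], end1[i]);
		crc[2] = crc32c_byte(crc[2], end2[i]);
	}
}
```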

- Eric
