public inbox for linux-crypto@vger.kernel.org
 help / color / mirror / Atom feed
From: Eric Biggers <ebiggers@kernel.org>
To: David Laight <David.Laight@aculab.com>
Cc: "linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ard Biesheuvel <ardb@kernel.org>,
	Josh Poimboeuf <jpoimboe@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and excessive unrolling
Date: Mon, 14 Oct 2024 12:01:42 -0700	[thread overview]
Message-ID: <20241014190142.GA1137@sol.localdomain> (raw)
In-Reply-To: <a6c0c04a0486404ca4db3fd57a809d5b@AcuMS.aculab.com>

On Mon, Oct 14, 2024 at 04:30:05PM +0000, David Laight wrote:
> From: Eric Biggers
> > Sent: 14 October 2024 05:25
> > 
> > crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully
> > unrolled and uses a jump table to jump into the correct location.  This
> > optimization is misguided, as it bloats the binary code size and
> > introduces an indirect call.  x86_64 CPUs can predict loops well, so it
> > is fine to just use a loop instead.  Loop bookkeeping instructions can
> > compete with the crc instructions for the ALUs, but this is easily
> > mitigated by unrolling the loop by a smaller amount, such as 4 times.
> 
> Do you need to unroll it at all?

It looks like on most CPUs, no.  On Haswell, Emerald Rapids, Zen 2 it does not
make a significant difference.  However, it helps on Zen 5.

> > +	# Unroll the loop by a factor of 4 to reduce the overhead of the loop
> > +	# bookkeeping instructions, which can compete with crc32q for the ALUs.
> > +.Lcrc_3lanes_4x_loop:
> > +	crc32q	(bufp), crc_init_q
> > +	crc32q	(bufp,chunk_bytes_q), crc1
> > +	crc32q	(bufp,chunk_bytes_q,2), crc2
> > +	crc32q	8(bufp), crc_init_q
> > +	crc32q	8(bufp,chunk_bytes_q), crc1
> > +	crc32q	8(bufp,chunk_bytes_q,2), crc2
> > +	crc32q	16(bufp), crc_init_q
> > +	crc32q	16(bufp,chunk_bytes_q), crc1
> > +	crc32q	16(bufp,chunk_bytes_q,2), crc2
> > +	crc32q	24(bufp), crc_init_q
> > +	crc32q	24(bufp,chunk_bytes_q), crc1
> > +	crc32q	24(bufp,chunk_bytes_q,2), crc2
> > +	add	$32, bufp
> > +	sub	$4, %eax
> > +	jge	.Lcrc_3lanes_4x_loop
> 
> If you are really lucky you'll get two memory reads/clock.
> So you won't ever to do than two crc32/clock.
> Looking at Agner's instruction latency tables I don't think
> any cpu can do more that one per clock, or pipeline them.
> I think that means you don't even need two (never mind 3)
> buffers.

On most Intel and AMD CPUs (I tested Haswell for old Intel, Emerald Rapids for
new Intel, and Zen 2 for slightly-old AMD), crc32q has 3 cycle latency and 1 per
cycle throughput.  So you do need at least 3 streams.

AMD Zen 5 has much higher crc32q throughput and seems to want up to 7 streams.
This is not implemented yet.

> Most modern x86 can do 4 or 5 (or even more) ALU operations
> per clock - depending on the combination of instructions.
> 
> Replace the loop termination with a comparison of 'bufp'
> against a pre-calculated limit and you get two instructions
> (that might get merged into one u-op) for the loop overhead.
> They'll run in parallel with the crc32q instructions.

That's actually still three instructions: add, cmp, and jne.

I tried it on both Intel and AMD, and it did not help.

> I've never managed to get a 1-clock loop, but two is easy.
> You might find that just:
> 10:
> 	crc32q	(bufp), crc
> 	crc32q	8(bufp), crc
> 	add		$16, bufp
> 	cmp		bufp, buf_lim
> 	jne		10b
> will run at 8 bytes/clock on modern intel cpu.

No, the latency of crc32q is still three cycles.  You need three streams.

> You can write that in C with a simple asm function for the crc32
> instruction itself.

Well, the single-stream CRC32C implementation already does that; see
arch/x86/crypto/crc32c-intel_glue.c.  Things are not as simple for this
multi-stream implementation, which uses pclmulqdq to combine the CRCs.

- Eric

  reply	other threads:[~2024-10-14 19:01 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-14  4:24 [PATCH 0/3] crypto: x86/crc32c - jump table elimination and other cleanups Eric Biggers
2024-10-14  4:24 ` [PATCH 1/3] crypto: x86/crc32c - simplify code for handling fewer than 200 bytes Eric Biggers
2024-10-14  4:24 ` [PATCH 2/3] crypto: x86/crc32c - access 32-bit arguments as 32-bit Eric Biggers
2024-10-14  4:24 ` [PATCH 3/3] crypto: x86/crc32c - eliminate jump table and excessive unrolling Eric Biggers
2024-10-14 16:30   ` David Laight
2024-10-14 19:01     ` Eric Biggers [this message]
2024-10-14 22:32       ` David Laight
2024-10-14 23:59         ` Eric Biggers
2024-10-15 10:55 ` [PATCH 0/3] crypto: x86/crc32c - jump table elimination and other cleanups Ard Biesheuvel
2024-10-26  6:53 ` Herbert Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241014190142.GA1137@sol.localdomain \
    --to=ebiggers@kernel.org \
    --cc=David.Laight@aculab.com \
    --cc=ardb@kernel.org \
    --cc=jpoimboe@kernel.org \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox